date:20100312

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844450#action_12844450
]

Michael McCandless commented on LUCENE-2309:

bq. The IndexWriter or rather DocInverterPerField are simply an attribute
consumer. None of them needs to know about Analyzer or TokenStream at all.
Neither needs the analyzer to iterate over tokens.

[Carrying over discussions on IRC with Chris Male & Uwe...]

Actually, TokenStream is already AttrSource + incrementing, so it
seems like the right start...

However, the .reset() method is redundant from the indexer's standpoint --
ie when the indexer calls Field.getTokenStream (say), whatever init'ing /
reset'ing is needed should already have been done by that method by the
time it returns the TokenStream.

Also, .close and .end are redundant -- seems like we should only have
.end (few token streams do anything in .close...). But coalescing
those two would be a good chunk of work at this point :) Or maybe we
make a .finish that simply calls both by default ;)

Finally, the indexer doesn't really need a Document; it just needs
something abstract that provides an iterator over all fields that
need indexing (and, separately, storing).
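[Editor's sketch] One way to picture the last point in Java. Everything here is hypothetical: FieldSource, FieldToIndex, FieldToStore, SimpleFieldSource and ToyIndexer are invented names, not Lucene types, and a plain Iterator<String> stands in for an already-initialized token source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical names throughout; none of these types exist in Lucene.
interface FieldToIndex {
    String name();
    Iterator<String> tokens(); // stand-in for an already-initialized token source
}

interface FieldToStore {
    String name();
    String value();
}

/** What the indexer would consume instead of a concrete Document. */
interface FieldSource {
    Iterable<FieldToIndex> fieldsToIndex();
    Iterable<FieldToStore> fieldsToStore();
}

final class SimpleFieldSource implements FieldSource {
    private final List<FieldToIndex> indexed = new ArrayList<>();
    private final List<FieldToStore> stored = new ArrayList<>();

    SimpleFieldSource index(String name, String... tokens) {
        List<String> copy = Arrays.asList(tokens.clone());
        indexed.add(new FieldToIndex() {
            public String name() { return name; }
            public Iterator<String> tokens() { return copy.iterator(); }
        });
        return this;
    }

    SimpleFieldSource store(String name, String value) {
        stored.add(new FieldToStore() {
            public String name() { return name; }
            public String value() { return value; }
        });
        return this;
    }

    public Iterable<FieldToIndex> fieldsToIndex() { return indexed; }
    public Iterable<FieldToStore> fieldsToStore() { return stored; }
}

final class ToyIndexer {
    /** Counts the tokens it is handed; a real indexer would invert them. */
    static int countIndexedTokens(FieldSource source) {
        int count = 0;
        for (FieldToIndex field : source.fieldsToIndex()) {
            for (Iterator<String> it = field.tokens(); it.hasNext(); it.next()) {
                count++;
            }
        }
        return count;
    }
}
```

The toy indexer only ever sees the two iterables, so any container that can produce them would do; whether real stored fields and indexed fields should be split this way is left open by the comment above.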

Fully decouple IndexWriter from analyzers
-

Key: LUCENE-2309
URL: https://issues.apache.org/jira/browse/LUCENE-2309
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless

IndexWriter only needs an AttributeSource to do indexing.
Yet, today, it interacts with Field instances, holds a private
analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
wide variety of cases (it's not analyzed; it is analyzed but it's a Reader
or a String; it's pre-analyzed).
I'd like to have IW only interact with attr sources that already
arrived with the fields. This would be a powerful decoupling -- it
means others are free to make their own attr sources.
They need not even use any of Lucene's analysis impls; eg they can
integrate with other things like [OpenPipeline|http://www.openpipeline.org].
Or make something completely custom.
LUCENE-2302 is already a big step towards this: it makes IW agnostic
about which attr is the term, and only requires that it provide a
BytesRef (for flex).
Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
FieldType knows the analyzer to use, then we could simply create a
getAttrSource() method (say) on it and move all the logic IW has today
onto there. (We'd still need existing IW code for back-compat).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

[
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844455#action_12844455
]

Michael McCandless commented on LUCENE-2294:

Thanks Shai, I'll look...

bq. Note, check.py still alerts on some changes, though I don't see any
relevant change in the patch file. Should I ignore them?

Yes if they are indeed false positives...

Create IndexWriterConfiguration and store all of IW configuration there
---

Key: LUCENE-2294
URL: https://issues.apache.org/jira/browse/LUCENE-2294
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 3.1

Attachments: check.py, LUCENE-2294.patch, LUCENE-2294.patch,
LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch

I would like to factor all IW configuration parameters out into a single
configuration class, which I propose to name IndexWriterConfiguration (or
IndexWriterConfig). I want to store there almost everything besides the
Directory, and to reduce all the ctors down to one: IndexWriter(Directory,
IndexWriterConfiguration). What I was thinking of storing there are the
following parameters:
* All of the ctors' parameters, except for Directory.
* The different setters where it makes sense. For example I still think
infoStream should be set on IW directly.
I'm thinking that IWC should expose everything via setter/getter methods,
and default to whatever IW defaults to today. Except for Analyzer, which will
need to be defined in the ctor of IWC and won't have a setter.
I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares
a DEFAULT (which is an int and not MaxFieldLength). Do we still think that
1 should be the default? Why not default to UNLIMITED and otherwise let
the application decide what LIMITED means for it? I would like to make MFL
optional on IWC and default to something, and I hope that default will be
UNLIMITED. We can document that on IWC, so that if anyone chooses to move to
the new API, he should be aware of that ...
I plan to deprecate all the ctors and getters/setters and replace them by:
* One ctor as described above
* getIndexWriterConfiguration, or simply getConfig, which can then be queried
for the setting of interest.
* About the setters, I think maybe we can just introduce a setConfig method
which will override everything that is overridable today, except for
Analyzer. So someone could do iw.getConfig().setSomething();
iw.setConfig(newConfig);
** The setters on IWC can return an IWC to allow chaining set calls ... so
the above will turn into
iw.setConfig(iw.getConfig().setSomething1().setSomething2());
BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it
will greatly simplify IW's API.
I'll start to work on a patch.
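[Editor's sketch] The proposal above, reduced to a few lines of Java. This is an illustration under assumptions, not the API that was committed: the class names carry a "Sketch" suffix, a String stands in for Analyzer and Directory, only two settings are modelled, and the 16.0 default is illustrative.

```java
// Hypothetical sketch of the proposal; not the API that was eventually committed.
final class IndexWriterConfigSketch {
    static final int UNLIMITED = Integer.MAX_VALUE;

    private final String analyzer;           // stand-in for an Analyzer: ctor-only, no setter
    private int maxFieldLength = UNLIMITED;  // the default the description hopes for
    private double ramBufferSizeMB = 16.0;   // illustrative value only

    IndexWriterConfigSketch(String analyzer) {
        this.analyzer = analyzer;
    }

    String getAnalyzer() { return analyzer; }
    int getMaxFieldLength() { return maxFieldLength; }
    double getRamBufferSizeMB() { return ramBufferSizeMB; }

    // Setters return this so that calls can be chained.
    IndexWriterConfigSketch setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    IndexWriterConfigSketch setRamBufferSizeMB(double ramBufferSizeMB) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }
}

final class IndexWriterSketch {
    private final String directory;          // stand-in for a Directory
    private final IndexWriterConfigSketch config;

    // The single remaining ctor: (Directory, config).
    IndexWriterSketch(String directory, IndexWriterConfigSketch config) {
        this.directory = directory;
        this.config = config;
    }

    String getDirectory() { return directory; }
    IndexWriterConfigSketch getConfig() { return config; }

    // Copies the overridable settings; the analyzer fixed at construction is kept.
    void setConfig(IndexWriterConfigSketch newConfig) {
        int mfl = newConfig.getMaxFieldLength();
        double ram = newConfig.getRamBufferSizeMB();
        config.setMaxFieldLength(mfl).setRamBufferSizeMB(ram);
    }
}
```

With this shape, iw.setConfig(iw.getConfig().setMaxFieldLength(10000).setRamBufferSizeMB(32.0)) reads as a single chained statement, and passing a config built with a different analyzer leaves the writer's original analyzer untouched.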


[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

[
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844461#action_12844461
]

Michael McCandless commented on LUCENE-2294:

bq. Note, check.py still alerts on some changes, though I don't see any
relevant change in the patch file. Should I ignore them?

Hmm, some of these (at least TestAtomicUpdate, which was changed from
Simple -> Whitespace) were in fact real changes. I'll fix and post a new patch.


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844464#action_12844464
]

Simon Willnauer commented on LUCENE-2309:
-

bq. [Carrying over discussions on IRC with Chris Male & Uwe...]

That makes it very hard to participate. I cannot afford to read through all the
IRC stuff and I don't get the chance to participate directly unless I watch IRC
constantly. We should really move back to JIRA / devlist for such discussions.
There is too much going on in IRC.

{quote}
Actually, TokenStream is already AttrSource + incrementing, so it
seems like the right start...
{quote}

But that binds the Indexer to a TokenStream, which is unnecessary IMO. What if I
want to implement something aside from the TokenStream delegator API?


[jira] Updated: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

[
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Michael McCandless updated LUCENE-2294:
---

Attachment: LUCENE-2294.patch

Attached new patch, just fixing a couple tests where analyzer had changed.

I think it's ready to commit (take 2)! I'll wait a day or two...


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844467#action_12844467
]

Robert Muir commented on LUCENE-2309:
-

Hello, I commented yesterday but did not receive much feedback, so
I want to elaborate some more:

I suppose what I was trying to mention in my earlier comment here:
https://issues.apache.org/jira/browse/LUCENE-2309?focusedCommentId=12844189page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12844189

is that while I really like the new TokenStream API, I would prefer
it if we thought about making this flexible enough to support
different paradigms, including perhaps something that looks a lot
like the old TokenStream API.

The reason is, I notice a lot of existing code still under this old API,
and I think that in some cases, perhaps it's easier to work with, even
if you aren't a new user. I definitely think for newer users the old API
might have some advantages.

I think it's useful to consider supporting such an API, perhaps as an extension
in contrib/analyzers. Even if it's not as fast or flexible as the new API,
perhaps the tradeoff of speed and flexibility would be worth the ease
for newer users.


[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity


[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844469#action_12844469
 ] 

Chris Male commented on LUCENE-2310:


The challenge presented in this work is the pervasiveness of the Fieldable
class.  It's used in several hundred places throughout the source, but the
majority are in tests, and in Document itself.  Therefore part of this work
will also be to move as many of the tests as possible over to using Field,
and to work on the Document API as well.

 Reduce Fieldable, AbstractField and Field complexity
 

 Key: LUCENE-2310
 URL: https://issues.apache.org/jira/browse/LUCENE-2310
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Index
Reporter: Chris Male

 In order to move field type like functionality into its own class, we really
 need to try to tackle the hierarchy of Fieldable, AbstractField and Field.
 Currently AbstractField depends on Field, and does not provide much more
 functionality than storing fields, most of which are being moved over to
 FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and
 possibly Fieldable), moving much of the functionality into Field and
 FieldType.


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844489#action_12844489
]

Uwe Schindler commented on LUCENE-2309:
---

bq. I could imagine a really simple interface like

During lunch an idea evolved:

If you look at current DocInverter code, it does not use a consumer-like API.
The code just has an add/accept-method that accepts tokens. The idea is to, as
Simon proposed, let the docinverter implement something like AttributeAcceptor.
But still we must have the attribute api and the acceptor (DocInverter) must
always see the same attribute instances (else much time would be spent calling
getAttribute(...) for each token, if the accept method were to take an
AttributeSource).

The current TokenStream api could get a method taking AttributeAcceptor and
simply do a while incrementToken() loop, calling accept() on DocInverter (the
AttributeAcceptor). Another approach for users would be to not use the
TokenStream API at all and simply call the accept() method for each token.
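[Editor's sketch] A rough Java rendering of the acceptor idea, with invented names: TermAttr, AttributeAcceptorSketch, PullStream, ArrayStream and CollectingInverter are stand-ins, not Lucene classes. The start() hook is an assumption added here so that the acceptor can cache the shared attribute instance once instead of looking it up per token.

```java
import java.util.ArrayList;
import java.util.List;

// All names are hypothetical stand-ins, not Lucene classes.
final class TermAttr {
    String term; // shared, mutable per-token state
}

interface AttributeAcceptorSketch {
    /** Called once, so the acceptor can cache the shared attribute instance. */
    void start(TermAttr shared);

    /** Called once per token; the token state is read from the cached instance. */
    void accept();
}

abstract class PullStream {
    final TermAttr termAttr = new TermAttr();

    abstract boolean incrementToken();

    /** Adapter: runs the usual pull loop and pushes each token to the acceptor. */
    final void pushTo(AttributeAcceptorSketch acceptor) {
        acceptor.start(termAttr);
        while (incrementToken()) {
            acceptor.accept();
        }
    }
}

final class ArrayStream extends PullStream {
    private final String[] terms;
    private int pos = 0;

    ArrayStream(String... terms) {
        this.terms = terms;
    }

    boolean incrementToken() {
        if (pos >= terms.length) {
            return false;
        }
        termAttr.term = terms[pos++];
        return true;
    }
}

/** Plays the DocInverter role: it only accepts tokens, it never iterates. */
final class CollectingInverter implements AttributeAcceptorSketch {
    final List<String> seen = new ArrayList<>();
    private TermAttr attr;

    public void start(TermAttr shared) { this.attr = shared; }
    public void accept() { seen.add(attr.term); }
}
```

A caller that does not want the stream API at all can create its own TermAttr, call start() once, then set the term and call accept() for each token.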


[jira] Issue Comment Edited: (LUCENE-2309) Fully decouple IndexWriter from analyzers

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844489#action_12844489
]

Uwe Schindler edited comment on LUCENE-2309 at 3/12/10 1:25 PM:

bq. I could imagine a really simple interface like

During lunch an idea evolved:

If you look at current DocInverter code, it does not use a consumer-like API.
The code just has an add/accept-method that accepts tokens. The idea is to, as
Simon proposed, let the docinverter implement something like AttributeAcceptor.
But still we must have the attribute api and the acceptor (DocInverter) must
always see the same attribute instances (else much time would be spent calling
getAttribute(...) for each token, if the accept method were to take an
AttributeSource).

But both approaches still have the problem with the shared attributes. If you
want to record tokens you have to implement something like my Proxy
attributes. Else (as mentioned above), most time would be spent in
getAttribute() calls.

was (Author: thetaphi):
bq. I could imagine a really simple interface like

During lunch an idea evolved:


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844498#action_12844498
]

Michael McCandless commented on LUCENE-2309:

bq. The idea is to, as Simon proposed, let the docinverter implement something
like AttributeAcceptor.

This is interesting! It inverts the stack/control flow, but, would continue to
use shared attrs.

So then somehow the indexer would pass its AttrAcceptor to the field? And the
field would have whatever control logic it wants to feed the tokens...


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844500#action_12844500
]

Michael McCandless commented on LUCENE-2309:

{quote}
bq. Actually, TokenStream is already AttrSource + incrementing, so it seems
like the right start...

But that binds the Indexer to a tokenstream which is unnecessary IMO. What if I
want to implement something aside the TokenStream delegator API?
{quote}

True, but we need at least some way to increment? AttrSource doesn't have that.

But I don't think we need reset or close from TokenStream.

Maybe we could factor out an abstract class / interface that TokenStream
implements, minus the reset & close methods?

Then people could freely use Lucene to index off a foreign analysis chain...
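[Editor's sketch] What such a factored-out interface might look like, as a guess in Java. TokenIteration, ForeignChain and MinimalIndexer are made-up names, and currentTerm() is a simplification standing in for reading shared attributes.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical interface: the part of TokenStream the indexer would rely on.
interface TokenIteration {
    boolean incrementToken();   // advance to the next token, false when exhausted
    void end();                 // end-of-stream bookkeeping (e.g. final offset)
    String currentTerm();       // stand-in for reading shared attributes
}

// A "foreign" analysis chain that never extends a Lucene-style TokenStream.
final class ForeignChain implements TokenIteration {
    private final Iterator<String> source;
    private String current;
    boolean ended = false;

    ForeignChain(String... terms) {
        this.source = Arrays.asList(terms).iterator();
    }

    public boolean incrementToken() {
        if (!source.hasNext()) {
            return false;
        }
        current = source.next();
        return true;
    }

    public void end() { ended = true; }

    public String currentTerm() { return current; }
}

final class MinimalIndexer {
    // Consumes tokens without ever calling reset() or close().
    static String indexField(TokenIteration tokens) {
        StringBuilder inverted = new StringBuilder();
        while (tokens.incrementToken()) {
            if (inverted.length() > 0) {
                inverted.append(' ');
            }
            inverted.append(tokens.currentTerm());
        }
        tokens.end();
        return inverted.toString();
    }
}
```

A TokenStream-like class could implement the same interface and keep reset() and close() for its own callers; the indexer side would never see them.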


Welcome Chris Male as Contrib committer!

2010-03-12 Thread Mark Miller


I am happy to announce the Lucene PMC has accepted Chris Male as a
contrib committer!

Chris has been making a lot of headway in cleaning up the spatial contrib
lately, and hopefully now we can get more of those improvements into svn!

Congrats Chris, and welcome!


--
- Mark

http://www.lucidimagination.com

[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844509#action_12844509
]

Shai Erera commented on LUCENE-2309:

bq. We should really move back to JIRA / devlist for such discussions

+1 !! I also find it very hard to track so many sources of discussions (JIRA,
java-dev, java-user, general, and now IRC). Also IRC is not logged/archived and
searchable (I think?) which makes it impossible to trace back a discussion,
and/or randomly stumble upon it in Google.

I'd like to donate my two cents here - we've just recently changed the
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only
now the API has changed slightly. The proposals here, w/ the
AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field
will call back to IW seems too much complicated to me. Users that write
Analyzers/TokenStreams/AttributeSources, should not care how they are
indexed/stored etc. Forcing them to implement this push logic to IW seems to me
like a real unnecessary overhead and complexity.

And having the Field control the flow of indexing seems also dangerous ...
might expose Lucene to lots of bugs by users. Today when IW controls it, it's
one place to look for, but tomorrow when Field will control it, where do we
look? In the app's custom Field code? In IW's atts consuming methods?

Will the Field also control how stored fields are added? Or only
AttributeSourced ones?

Maybe I need to get used to this change, but currently it looks wrong to
reverse the control flow. Maybe in principle the DocInverter now accepts tokens
from IW, and so it looks as if we can pass it to the Field (as IW's
AttAcceptor), but still the concept is different. We (IW) control the indexing
flow, and not the user.

I also may not understand what that will give users. Shouldn't users get
enough flexibility w/ the current API and the Flex (once out) stuff? Do they
really need to be bothered w/ pushing tokens to IW?


[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-12 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844511#action_12844511
]

Shai Erera commented on LUCENE-2294:

Thanks Mike. I ran the tool once and fixed everything it complained about. Then
the 2nd time it found some more (probably some I missed in the 1st pass), only
this time really few. So I fixed them as well. But I didn't run it a 3rd time :) ...

I can't wait for this to be in ... an exhausting issue ;).


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844515#action_12844515
]

Uwe Schindler commented on LUCENE-2309:
---

bq. I'd like to donate my two cents here - we've just recently changed the
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only
now the API has changed slightly. The proposals here, w/ the
AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field
will call back to IW seems too much complicated to me. Users that write
Analyzers/TokenStreams/AttributeSources, should not care how they are
indexed/stored etc. Forcing them to implement this push logic to IW seems to me
like a real unnecessary overhead and complexity.

The idea was not to change this behaviour, but to also give the user the
possibility to reverse it. For some tokenstreams it would simplify things
a lot. The current IndexWriter code works exactly like that:
# DocInverter gets TokenStream
# DocInverter calls reset() -- to be left out and moved to field/analyzer
# DocInverter does while-loop on incrementToken - it iterates. On each Token it
calls add() on the field consumer
# DocInverter calls end() and updates end offset
# DocInverter calls close() -- to be left out and moved to field/analyzer

The change is simply that step (3) is removed from DocInverter, which then only
provides the add() method for accepting Tokens. The current while loop simply
moves into the TokenStream/Field code, so nobody needs to change his
code. But somebody who actively wants to push tokens can now do this. If he
wants to do this currently, he has no chance without heavy buffering.

So the push API will be very expert, and the current TokenStream API is just a
user of this API.
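[Editor's sketch] The reshuffled call order, recorded in a log so it can be checked. All names (LoggedStream, PushInverter, FieldDriver) are invented. One assumption beyond the comment: here the field-side driver also calls end(), whereas the list above leaves end() with DocInverter; either arrangement fits the description of moving the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; only the call order is the point.
final class LoggedStream {
    private final String[] terms;
    private final List<String> log;
    private int pos;
    String term;

    LoggedStream(List<String> log, String... terms) {
        this.log = log;
        this.terms = terms;
    }

    void reset() {
        pos = 0;
        log.add("reset");
    }

    boolean incrementToken() {
        if (pos >= terms.length) {
            return false;
        }
        term = terms[pos++];
        return true;
    }

    void end() { log.add("end"); }

    void close() { log.add("close"); }
}

// The inverter no longer loops; it only accepts what is pushed to it.
final class PushInverter {
    private final List<String> log;

    PushInverter(List<String> log) {
        this.log = log;
    }

    void add(String term) { log.add("add:" + term); }
}

final class FieldDriver {
    // The field/analyzer side owns reset(), the while loop, end() and close().
    static void feed(LoggedStream stream, PushInverter inverter) {
        stream.reset();
        try {
            while (stream.incrementToken()) {
                inverter.add(stream.term);
            }
            stream.end();
        } finally {
            stream.close();
        }
    }
}
```

Code that already has its tokens in hand can skip the stream entirely and call inverter.add(...) once per token, which is the no-buffering push case described above.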


Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Robert Muir

Congratulations!


-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Mark Miller (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844516#action_12844516
]

Mark Miller commented on LUCENE-2309:
-

bq. Also IRC is not logged/archived and searchable (I think?) which makes it
impossible to trace back a discussion, and/or randomly stumble upon it in
Google.

Apache's rule is: if it didn't happen on the lists, it didn't happen. IRC is a
great way for people to communicate and hash stuff out, but it's not necessary
that you follow it. If you have questions or want further elaboration, just ask.
No one can expect you to follow IRC, nor is it a valid reference for where
something was decided. IRC is great - I think having devs discuss there has
really been a benefit - but the official position is, if it didn't happen on the
list, it didn't actually happen.


[jira] Resolved: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter


 [ 
https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-2015.
-

Resolution: Fixed

Committed revision 922277.

Thanks Cédrik!

 ASCIIFoldingFilter: expose folding logic + small improvements to 
 ISOLatin1AccentFilter
 --

 Key: LUCENE-2015
 URL: https://issues.apache.org/jira/browse/LUCENE-2015
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Cédrik LIME
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: ASCIIFoldingFilter-no_formatting.patch, 
 ASCIIFoldingFilter-no_formatting.patch, Filters.patch, 
 ISOLatin1AccentFilter.patch, LUCENE-2015.patch, LUCENE-2015.patch


 This patch adds a couple of non-ascii chars to ISOLatin1AccentFilter (namely:
 left & right single quotation marks, en dash, em dash) which we very
 frequently encounter in our projects. I know that this class is now
 deprecated; this improvement is for legacy code that hasn't migrated yet.
 It also enables easy access to the ascii folding technique used in
 ASCIIFoldingFilter for potential re-use in non-Lucene-related code.


RE: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Uwe Schindler

Congrats Mark. I wish you heavy committing!

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de/ http://www.thetaphi.de

eMail: u...@thetaphi.de

 


RE: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Uwe Schindler

 

Congrats Chris. I wish you heavy committing!

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de/ http://www.thetaphi.de

eMail: u...@thetaphi.de

 


RE: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Uwe Schindler

I wish you heavy committing, too. But I meant Chris, sorry :)

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de/ http://www.thetaphi.de

eMail: u...@thetaphi.de

 


Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Chris Male

Hi,

Thanks Mark!

All is forgiven Uwe :)

Cheers
Chris

-- 
Chris Male | Software Developer | JTeam BV.| www.jteam.nl

[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844523#action_12844523
]

Simon Willnauer commented on LUCENE-2309:
-

bq. Then people could freely use Lucene to index off a foreign analysis chain...
That is what I was talking about!

{quote}
I'd like to donate my two cents here - we've just recently changed the
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only
now the API has changed slightly. The proposals here, w/ the
AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field
will call back to IW seems too much complicated to me. Users that write
Analyzers/TokenStreams/AttributeSources, should not care how they are
indexed/stored etc. Forcing them to implement this push logic to IW seems to me
like a real unnecessary overhead and complexity.
{quote}

We can surely hide this implementation completely from Field. I consider this
similar to Collector, where you pass it explicitly to the search method if
you want different behavior. Maybe something like an
AttributeProducer. I don't think adding this to Field makes a lot of sense at
all, and it is not worth the complexity.

bq. Will the Field also control how stored fields are added? Or only
AttributeSourced ones?
IMO this is only about inverted fields.

bq. We (IW) control the indexing flow, and not the user.
The user only gets the possibility to exchange the analysis chain, but not the
control flow. The user can already mess around with stuff in incrementToken();
the only thing we change / invert is that the indexer does not know about
TokenStreams anymore. It does not change the control flow, though.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Grant Ingersoll

Congrats!  

Tradition has it, Chris, that you provide a brief intro on yourself upon 
becoming a new committer, so let's hear it!

-Grant


Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Simon Willnauer

Congrats Chris :)


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844528#action_12844528
]

Uwe Schindler commented on LUCENE-2309:
---

There is one problem that cannot be easily solved (for all proposals here) if we
want to provide an old-style API that does not require reuse of tokens:
The problem with AttributeProvider is that if we want to support something
(like rmuir proposed before) that looks like the old Token next(), we need an
AttributeProvider that passes the AttributeSource to the indexer on each Token!
And that would lead to lots of getAttribute() calls, which would slow down
indexing! So with the current APIs we cannot get around the requirement to
reuse the same Attribute instances during the whole indexing without a major
speed impact. This can only be solved with my nice BCEL proxy Attributes, so
you can exchange the inner attribute impl. Or do it like TokenWrapper in 2.9
(yes, we can reactivate that API somehow as an easy-to-use addendum).
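
To make the cost argument above concrete, here is a toy model of the two consumption styles. It is only a sketch: TermAttr, ToySource and ToyIndexer are made-up names for illustration and are not Lucene's AttributeSource or attribute classes. The reusing consumer looks its attribute up once and reads the same instance for every token; the per-token variant stands in for an old next()-style API where each token arrives with its own source, which forces one lookup per token.

```java
// Toy model only; these are NOT Lucene's real classes.
final class TermAttr {
    String term = "";
}

final class ToySource {
    private final java.util.Map<Class<?>, Object> attrs = new java.util.HashMap<>();
    int lookups = 0; // counts getAttribute() calls

    <T> void add(Class<T> type, T instance) {
        attrs.put(type, instance);
    }

    <T> T getAttribute(Class<T> type) {
        lookups++;
        return type.cast(attrs.get(type));
    }
}

final class ToyIndexer {
    // Reuse style: one lookup up front, then the same instance is read per token.
    static java.util.List<String> consumeReusing(ToySource source, String[] tokens) {
        java.util.List<String> seen = new java.util.ArrayList<>();
        TermAttr term = source.getAttribute(TermAttr.class); // looked up once
        for (String t : tokens) {
            term.term = t;       // stands in for the producer advancing one token
            seen.add(term.term); // the consumer reads the shared instance
        }
        return seen;
    }

    // Per-token style: every token brings a fresh source, so every token costs a lookup.
    static int perTokenLookups(String[] tokens, java.util.List<String> seen) {
        int lookups = 0;
        for (String t : tokens) {
            ToySource fresh = new ToySource();
            TermAttr attr = new TermAttr();
            attr.term = t;
            fresh.add(TermAttr.class, attr);
            seen.add(fresh.getAttribute(TermAttr.class).term);
            lookups += fresh.lookups;
        }
        return lookups;
    }
}
```

With three tokens, the reusing path performs a single lookup while the per-token path performs three; the gap grows linearly with the number of tokens.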


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

[
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844533#action_12844533
]

Robert Muir commented on LUCENE-2309:
-

{quote}
So with the current APIs we cannot get around the requirement to reuse the same
Attribute instances during the whole indexing without a major speed impact.
{quote}

I agree. I guess I'll try to simplify my concern: maybe we don't necessarily
need something that looks like the old TokenStream API, but I feel it would
be worth our time to think about supporting 'some alternative API' that makes
it easier to work with lots of context across different Tokens.

I personally do not mind how this is done with the capture/restore state API,
but I feel that it's pretty unnatural for many developers, and in the future
folks might want to do more complex analysis (maybe even light POS tagging, etc.)
that requires said context, and we should plan for this.

I feel this wasn't such an issue with the old TokenStream API, but maybe there
is another way to address this potential problem.


Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Grant Ingersoll


On Mar 12, 2010, at 10:00 AM, Chris Male wrote:

 Although I live in Amsterdam, I am actually from New Zealand so it feels good 
 to finally have kiwi representation.

+1.  I've always wanted to go there!  I'll have to pick your brain on it next 
time I'm in Amsterdam over a pint.

-Grant

Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Michael McCandless

Welcome aboard Chris!

Mike


Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Shalin Shekhar Mangar

Welcome Chris!


-- 
Regards,
Shalin Shekhar Mangar.

[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844578#action_12844578
]

Robert Muir commented on LUCENE-2308:
-

{quote}
details like omitTfAP, omitNorms
{quote}

Personal pet peeve: I wonder if we could consider improving on 'omit' here.
I think things like omit(false), disable(false) are a little awkward.

Separately specify a field's type
-

Key: LUCENE-2308
URL: https://issues.apache.org/jira/browse/LUCENE-2308
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless

This came up from discussions on IRC. I'm summarizing here...
Today when you make a Field to add to a document you can set things like
indexed or not, stored or not, analyzed or not, details like omitTfAP,
omitNorms, index term vectors (separately controlling
offsets/positions), etc.
I think we should factor these out into a new class (FieldType?).
Then you could re-use this FieldType instance across multiple fields.
The Field instance would still hold the actual value.
We could then do per-field analyzers by adding a setAnalyzer on the
FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
for per-field codecs (with flex), where we now have
PerFieldCodecWrapper).
This would NOT be a schema! It's just refactoring what we already
specify today. EG it's not serialized into the index.
This has been discussed before, and I know Michael Busch opened a more
ambitious (I think?) issue. I think this is a good first baby step. We could
consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
off on that for starters...


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844579#action_12844579
]

Chris Male commented on LUCENE-2308:

So you are thinking more along the lines of indexNorms(true|false)?


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844585#action_12844585
]

Robert Muir commented on LUCENE-2308:
-

bq. So you are thinking more along the lines indexNorms(true|false)?

Or whatever you come up with that doesn't create double negatives!
But yeah, I think something like that is a little easier... no big deal,
just figured I would bring it up if this stuff was getting refactored anyway.
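
A small sketch of the naming point, using a made-up NormsSetting class (neither setter is a settled Lucene API; both names are assumptions for illustration). The two setters control the same flag, but only the negative form produces a double negative at the call site.

```java
// Hypothetical settings holder; the names are illustrative, not a real Lucene API.
final class NormsSetting {
    private boolean indexNorms = true;

    // Negative-form setter: setOmitNorms(false) means "do not omit", i.e. index norms.
    void setOmitNorms(boolean omit) {
        this.indexNorms = !omit;
    }

    // Positive-form setter: the argument matches the outcome directly.
    void setIndexNorms(boolean index) {
        this.indexNorms = index;
    }

    boolean indexesNorms() {
        return indexNorms;
    }
}
```

Here setOmitNorms(false) and setIndexNorms(true) leave the object in the same state; the second simply reads without mental negation.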


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844587#action_12844587
]

Chris Male commented on LUCENE-2308:

I agree entirely. This is definitely the moment to remove any ambiguity or
confusion in this API. I'll make sure to incorporate this idea.


Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Erick Erickson

Congrats Chris!

I vote for thinkAboutNotIncludingNormsMaybe(true|false) <G>.

Seriously, double negatives are ugly IMO; +1 for changing.

Erick


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844626#action_12844626
]

Marvin Humphrey commented on LUCENE-2308:
-

I think we might consider matchOnly() instead of omitNorms(). If a field is
match only, we don't need boost bytes a.k.a. norms because they are only
used as a scoring multiplier.

Haven't got a good synonym for omitTFAP, but I'd sure like one.


[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844629#action_12844629
]

Shai Erera commented on LUCENE-2308:

How about enable(TYPE/FEATURE) and corresponding disable? So Type/Feature will
have NORMS, TF, POSITIONS and calls would look like:
f.enable(Type.NORMS), f.disable(Type.TF)?
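
One way the enable/disable idea could look, as a rough sketch. The enum is called Feature here (the comment above leaves the name open between Type and Feature), FeatureField is a made-up stand-in for a field, and the "everything enabled by default" starting state is an assumption.

```java
// Illustrative only: Feature and FeatureField are hypothetical names.
enum Feature { NORMS, TF, POSITIONS }

final class FeatureField {
    // Assumption for this sketch: every feature starts out enabled.
    private final java.util.EnumSet<Feature> enabled = java.util.EnumSet.allOf(Feature.class);

    FeatureField enable(Feature f) {
        enabled.add(f);
        return this;
    }

    FeatureField disable(Feature f) {
        enabled.remove(f);
        return this;
    }

    boolean isEnabled(Feature f) {
        return enabled.contains(f);
    }
}
```

A call such as new FeatureField().disable(Feature.TF) then reads positively in both directions. Note that this sketch does not enforce any dependency between features; it treats them as fully independent switches.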


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844630#action_12844630
]

Robert Muir commented on LUCENE-2308:
-

Just also to mention (probably too much for this one issue)!

I think it would be nice if OmitTF was separately selectable
from OmitPositions, as Shai implied. We would have to
actually implement this, though, I think!


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844637#action_12844637
]

Marvin Humphrey commented on LUCENE-2308:
-

If you disable term freq, you also have to disable positions. The freq
tells you how many positions there are.

I think it's asking an awful lot of our users to require that they understand
all the implications of posting format modifications when committers
have difficulty mastering all the subtleties.


Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Mark Miller

Committers are competent in different areas of the code. Even Mike
wasn't big into the search side until per segment. Committers are
trusted to mess with the pieces they know.

I don't see anyone even remotely suggesting that users should have to
understand all of the implications of posting format modifications.

Just sounds like a nasty jab to me.

- Mark

http://www.lucidimagination.com


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844653#action_12844653
]

Robert Muir commented on LUCENE-2308:
-

{quote}
If you disable term freq, you also have to disable positions. The freq
tells you how many positions there are.
{quote}

Marvin: as stated, we would have to actually implement this.
There's an issue open for it too: LUCENE-2048.
I was just discussing this with someone the other day.

{quote}
I think it's asking an awful lot of our users to require that they understand
all the implications of posting format modifications when committers
have difficulty mastering all the subtleties.
{quote}

I don't know what I did to piss you off, but I just thought it would be nice,
for completeness, to mention that this feature is still open and it's
something we should think about.


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844659#action_12844659
]

Marvin Humphrey commented on LUCENE-2308:
-

I'm simply suggesting that the proposed API is too hard to understand.

Most users know whether their fields can be match-only but have no idea what
TFAP is. And even advanced users will have difficulty understanding all the
implications for matching and scoring when they selectively disable portions
of the posting format.

I'm not a fan of omitTFAP, omitTF, omitNorms, omitPositions, or omit(flags).
Something that ordinary users can grok would be used more often and more
effectively.


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844661#action_12844661
]

Chris Male commented on LUCENE-2308:

What I covered with Mike earlier was whether FieldType instances would be
immutable or not.

If they are, which seems a good idea, then everything will be enabled/disabled
in the construction of the FieldType, so we would only need to support property
getter methods.


Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Marvin Humphrey

On Fri, Mar 12, 2010 at 03:01:27PM -0500, Mark Miller wrote:
 Committers are competent in different areas of the code.  Even Mike
 wasn't big into the search side until per segment.  Committers are
 trusted to mess with the pieces they know.

Absolutely.  I wouldn't expect every committer to understand the gory details
of posting formats, and I've been a little caught off guard by the blowback
from what I thought was an innocuous observation.

But by the same token, I wouldn't expect our users to have sufficient
expertise to understand all the variants of omit*() either.  This stuff
oughtta be implementation details.

 I don't see anyone even remotely suggesting that users should have to  
 understand all of the implications of posting format modifications.

That's what omitTFAP() and omitNorms() do, though.  And as Mike pointed out in
the baby steps thread, omitTFAP() is often misunderstood.

Marvin Humphrey



[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844684#action_12844684
]

Michael McCandless commented on LUCENE-2308:

Hmm one challenge with making FieldType immutable is we don't want
a zillion ctors over time. Also creating a FieldType with args like
new FieldType(true, false, false) isn't really readable.

It would be nice if we could do something similar to IndexWriterConfig
(LUCENE-2294), where you use incremental ctor/setters to set up the
configuration but then once it's used (bound to a Field), it's
immutable.

I'm torn on naming: yes, search-oriented names like matchOnly are
tempting, but then we really should tease apart termFreq and positions
(they are stuck together now with omitTFAP). And the two are not
fully independent as Marvin noted -- so maybe we use a cryptic enum
(DOCS, DOCS_TERM_FREQ, DOCS_TERM_FREQ_POSITIONS)? If we can only find
better names...

I'm not sure we can/should find better index-time names. What is
stored in the index is relatively independent from how/whether
searches make use of it. EG if you store termFreq (but not positions)
you can still do match only searching, or, you can do full scoring of
the query. You can't use positional queries.
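
The three-level enum floated above could be sketched roughly as follows. PostingsDetail and its helper methods are made-up names for illustration; only the three constant names come from the comment. Because the levels are ordered, "term freq without positions" is expressible while "positions without term freq" is not, which captures the dependency noted earlier in the thread.

```java
// Illustrative sketch; the enum name and helper methods are assumptions.
enum PostingsDetail {
    DOCS,
    DOCS_TERM_FREQ,
    DOCS_TERM_FREQ_POSITIONS;

    // Term frequencies are present at the second level and above.
    boolean hasTermFreq() {
        return compareTo(DOCS_TERM_FREQ) >= 0;
    }

    // Positions are present only at the highest level.
    boolean hasPositions() {
        return this == DOCS_TERM_FREQ_POSITIONS;
    }

    // Positional queries (e.g. phrases) need positions in the index.
    boolean supportsPositionalQueries() {
        return hasPositions();
    }
}
```

For example, DOCS_TERM_FREQ reports hasTermFreq() as true but supportsPositionalQueries() as false, matching the "you can still do full scoring, but not positional queries" case described above.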


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844688#action_12844688
]

Marvin Humphrey commented on LUCENE-2308:
-

bq. Also creating a FieldType with args like new FieldType(true, false, false) isn't really readable.

Agreed. Another option would be a flags integer and bitwise constants:

{code}
FieldType type = new FieldType(analyzer, FieldType.INDEXED | FieldType.STORED);
{code}

I bet that'll be more popular than flags, but I thought it was worth
bringing it up anyway. :)
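
For reference, a minimal sketch of what the flags-integer variant might look like on the receiving side. FlagFieldType is a hypothetical class, the ANALYZED constant is an added assumption, and the analyzer argument from the snippet above is left out to keep the example self-contained.

```java
// Hypothetical sketch of a flags-based field type; not an actual Lucene class.
final class FlagFieldType {
    static final int INDEXED  = 1;       // bit 0
    static final int STORED   = 1 << 1;  // bit 1
    static final int ANALYZED = 1 << 2;  // bit 2 (assumed extra flag)

    private final int flags;

    FlagFieldType(int flags) {
        this.flags = flags;
    }

    boolean isIndexed() {
        return (flags & INDEXED) != 0;
    }

    boolean isStored() {
        return (flags & STORED) != 0;
    }

    boolean isAnalyzed() {
        return (flags & ANALYZED) != 0;
    }
}
```

Constructed as new FlagFieldType(FlagFieldType.INDEXED | FlagFieldType.STORED), the instance is immutable and the call site names each enabled option, at the cost of type safety compared with an enum-based approach.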


[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Earwin Burrfoot (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844690#action_12844690
]

Earwin Burrfoot commented on LUCENE-2308:
-

I'm strongly against names like 'matchOnly'. They are perfectly fine in some
'schema' layer over Lucene, but here, in the low-level guts, I'd prefer names that
clearly state what the hell they do, without forcing me to consult
javadocs/code.


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844700#action_12844700
]

Yonik Seeley commented on LUCENE-2308:
--

For the non-expert user, it's just a label and won't have much meaning
regardless of what it's called, and they will need to consult the docs. Of
course, if one starts to dig deeper, norms actually does have a physical
meaning in the index, so preferring a label with norms in it seems completely
reasonable.

There's also history to consider - when you change the name of something, you
cut the link to the past in search engines, and in the memories of many
developers.

As it relates to Solr - I don't care so much since it makes sense for the Solr
schema to isolate these changes and stick with omitNorms regardless.


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844702#action_12844702
]

Chris Male commented on LUCENE-2308:

{quote}
It would be nice if we could do something similar to IndexWriterConfig
(LUCENE-2294), where you use incremental ctor/setters to set up the
configuration but then once it's used (bound to a Field), it's
immutable.
{quote}

Yeah we could use something like a FieldTypeBuilder which could provide a fluent
interface for specifying each property, which then gets built into an immutable
FieldType at the end.
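
A minimal sketch of the builder idea above, under loud assumptions: SketchFieldType, its Flag enum and its Builder are hypothetical names made up for illustration and are not part of Lucene. The analyzer is left out to keep the sketch self-contained. The point shown is only that the builder stays mutable while the built type takes a private copy of the settings.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: none of these types exist in Lucene.
final class SketchFieldType {
    enum Flag { INDEXED, STORED, OMIT_NORMS }

    private final Set<Flag> flags;

    private SketchFieldType(EnumSet<Flag> flags) {
        // Copy first, then wrap, so later builder changes cannot leak in.
        this.flags = Collections.unmodifiableSet(EnumSet.copyOf(flags));
    }

    boolean has(Flag flag) { return flags.contains(flag); }

    static Builder builder() { return new Builder(); }

    // Fluent, mutable builder; each setter returns this for chaining.
    static final class Builder {
        private final EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);

        Builder indexed()   { flags.add(Flag.INDEXED);    return this; }
        Builder stored()    { flags.add(Flag.STORED);     return this; }
        Builder omitNorms() { flags.add(Flag.OMIT_NORMS); return this; }

        SketchFieldType build() { return new SketchFieldType(flags); }
    }

    public static void main(String[] args) {
        SketchFieldType type = SketchFieldType.builder().indexed().stored().build();
        System.out.println("indexed=" + type.has(Flag.INDEXED)
            + " omitNorms=" + type.has(Flag.OMIT_NORMS));
    }
}
```

Because build() copies the flags, one builder can be reused to stamp out several slightly different types without the earlier ones changing.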


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844707#action_12844707
]

Yonik Seeley commented on LUCENE-2308:
--

I'm not sure if strict immutability is necessary - there's everything in
between too.
One can simply say that all changes should be made before first use, and after
that point it's undefined.

Unrelated question: I assume that this would retain the same flexibility as we
have today... the ability to change FieldType for field foo from one document
to the next?


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844710#action_12844710
]

Chris Male commented on LUCENE-2308:

{quote}
I'm not sure if strict immutability is necessary - there's everything in
between too.
One can simply say that all changes should be made before first use, and after
that point it's undefined.
{quote}

I'm really unsure about this if people are going to be using a FieldType
instance with multiple Fields. Perhaps this really is just an edge case.

{quote}
Unrelated question: I assume that this would retain the same flexibility as we
have today... the ability to change FieldType for field foo from one document
to the next?
{quote}

Are you wanting to be able to reuse the same Field instance in both documents
while defining separate FieldTypes? Or is creating new Field instances okay?


[jira] Issue Comment Edited: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844710#action_12844710
]

Chris Male edited comment on LUCENE-2308 at 3/12/10 10:01 PM:
--

I'm really unsure about this if people are going to be using a FieldType
instance with multiple Fields. Perhaps this really is just an edge case though.

{quote}
Unrelated question: I assume that this would retain the same flexibility as we
have today... the ability to change FieldType for field foo from one document
to the next?
{quote}

Are you wanting to be able to reuse the same Field instance in both documents
while defining separate FieldTypes? Or is creating new Field instances okay?


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844716#action_12844716
]

Yonik Seeley commented on LUCENE-2308:
--

bq. I'm really unsure about this if people are going to be using a FieldType
instance with multiple Fields.

I will, if I can (provided the FieldType does not contain the field name).
That shouldn't have anything to do with immutability though.

bq. Are you wanting to be able to reuse the same Field instance in both
documents while defining separate FieldTypes? Or is creating new Field
instances okay?

new Field instances should be fine - it's not really my use case anyway. But
we're designing for the 1000's of use cases that are out there and we should be
careful about adding new constraints.


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844720#action_12844720
]

Chris Male commented on LUCENE-2308:

{quote}
I will, if I can (provided the FieldType does not contain the field name). That
shouldn't have anything to do with immutability though.
{quote}

Yeah the field name will stay inside the Field. To me the reuse issue relates to
immutability in that a change to a property in one FieldType after construction
means the change affects all the Fields that use that type.

But as you say, if we document that it's best to set everything at instantiation
and that whatever happens after that is undefined, then I imagine it'll be fine.
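
To make the sharing concern concrete, here is a rough sketch of one possible middle ground: a type that accepts changes until it is first bound to a Field and then rejects them, so "undefined after first use" becomes a loud failure. FreezableFieldType and SketchField are hypothetical names for illustration only, not Lucene classes.

```java
// Hypothetical sketch; these are not Lucene classes.
final class FreezableFieldType {
    private boolean omitNorms;
    private boolean frozen;

    FreezableFieldType setOmitNorms(boolean value) {
        if (frozen) {
            throw new IllegalStateException("FieldType is already bound to a Field");
        }
        omitNorms = value;
        return this;
    }

    boolean omitNorms() { return omitNorms; }

    // Called on first use; after this, setters throw.
    void freeze() { frozen = true; }
}

final class SketchField {
    final String name;   // the name stays on the Field, not on the type
    final String value;
    final FreezableFieldType type;

    SketchField(String name, String value, FreezableFieldType type) {
        this.name = name;
        this.value = value;
        this.type = type;
        type.freeze();   // binding counts as first use
    }

    public static void main(String[] args) {
        FreezableFieldType shared = new FreezableFieldType().setOmitNorms(true);
        SketchField title = new SketchField("title", "Lucene", shared);
        SketchField body = new SketchField("body", "flexible indexing", shared);
        System.out.println(title.type == body.type);  // both fields share one type
    }
}
```

Without the freeze, a later setOmitNorms call on the shared instance would silently change every Field that references it, which is the hazard discussed above.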

{quote}
new Field instances should be fine - it's not really my use case anyway. But
we're designing for the 1000's of use cases that are out there and we should be
careful about adding new constraints.
{quote}

Yeah I appreciate that this API will be used in lots of different ways. Baby
steps as Mike said :) But to answer your question, yes the flexibility will
remain.


[jira] Commented: (LUCENE-2308) Separately specify a field's type

[
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844722#action_12844722
]

Yonik Seeley commented on LUCENE-2308:
--

Of course... given that Fieldable is an interface, one could create an
implementation that just delegated all the calls like omitNorms to a shared
instance, except for the value part. Add a getAnalyzer() method to Fieldable,
and it's the same thing in the end?
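
A sketch of that delegation idea, with loud assumptions: MiniFieldable is a trimmed, hypothetical stand-in for Lucene's much larger Fieldable interface, and MiniAnalyzer, SharedFieldSettings and DelegatingField are likewise invented for illustration. getAnalyzer() is the proposed addition, not an existing method.

```java
// Hypothetical stand-ins; the real Fieldable interface has many more methods.
interface MiniAnalyzer {
    String[] tokenize(String text);
}

interface MiniFieldable {
    String name();
    String stringValue();
    boolean isIndexed();
    boolean getOmitNorms();
    MiniAnalyzer getAnalyzer();   // the proposed extra method
}

// Settings shared by many fields; plays the role of the "type".
final class SharedFieldSettings {
    final boolean indexed;
    final boolean omitNorms;
    final MiniAnalyzer analyzer;

    SharedFieldSettings(boolean indexed, boolean omitNorms, MiniAnalyzer analyzer) {
        this.indexed = indexed;
        this.omitNorms = omitNorms;
        this.analyzer = analyzer;
    }
}

// Holds only the name and value; everything else is delegated.
final class DelegatingField implements MiniFieldable {
    private final String name;
    private final String value;
    private final SharedFieldSettings settings;

    DelegatingField(String name, String value, SharedFieldSettings settings) {
        this.name = name;
        this.value = value;
        this.settings = settings;
    }

    public String name() { return name; }
    public String stringValue() { return value; }
    public boolean isIndexed() { return settings.indexed; }
    public boolean getOmitNorms() { return settings.omitNorms; }
    public MiniAnalyzer getAnalyzer() { return settings.analyzer; }

    public static void main(String[] args) {
        MiniAnalyzer whitespace = text -> text.split("\\s+");
        SharedFieldSettings settings = new SharedFieldSettings(true, true, whitespace);
        MiniFieldable tags = new DelegatingField("tags", "lucene search", settings);
        System.out.println(tags.getAnalyzer().tokenize(tags.stringValue()).length);
    }
}
```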


[jira] Created: (LUCENE-2312) Search on IndexWriter's RAM Buffer

Search on IndexWriter's RAM Buffer
--

Key: LUCENE-2312
URL: https://issues.apache.org/jira/browse/LUCENE-2312
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Fix For: 3.0.2

Today's Lucene-based NRT systems must incur the cost of merging
segments, which can slow indexing.

Michael Busch has good suggestions regarding how to handle deletes using max
doc ids.
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary,
which needs to be sorted prior to queries executing. Currently
IW implements a specialized hash table. Michael B has a
suggestion here:
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915


[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844749#action_12844749
]

Jason Rutherglen commented on LUCENE-2312:
--

In regards to the terms dictionary, keeping it sorted or not, I think it's best
to sort it on demand because otherwise there will be yet another parameter to
pass into IW (i.e. sortRAMBufTerms or something like that).
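
A rough sketch of what sorting on demand could look like, assuming terms live in an unsorted hash structure while indexing. LazySortedTerms is a hypothetical class, not IndexWriter's actual hash table. The sorted view is built only when a reader asks for it and is cached until the next new term arrives.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not how IndexWriter stores terms internally.
final class LazySortedTerms {
    private final Set<String> terms = new HashSet<>();
    private String[] sorted;   // cached sorted view; null when stale

    // Indexing path: cheap hash insert, no sorting.
    synchronized void add(String term) {
        if (terms.add(term)) {
            sorted = null;     // a new term invalidates the cache
        }
    }

    // Search path: sort only if something changed since the last call.
    synchronized String[] sortedTerms() {
        if (sorted == null) {
            sorted = terms.toArray(new String[0]);
            Arrays.sort(sorted);
        }
        return sorted.clone(); // callers get their own copy
    }

    public static void main(String[] args) {
        LazySortedTerms dict = new LazySortedTerms();
        dict.add("lucene");
        dict.add("buffer");
        dict.add("index");
        System.out.println(Arrays.toString(dict.sortedTerms()));
    }
}
```

The trade-off is that the first query after a burst of indexing pays the full sort cost, while indexing itself needs no extra IW parameter.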

Search on IndexWriter's RAM Buffer
--

In order to offer users near realtime search, without incurring
an indexing performance penalty, we can implement search on
IndexWriter's RAM buffer. This is the buffer that is filled in
RAM as documents are indexed. Currently the RAM buffer is
flushed to the underlying directory (usually disk) before being
made searchable.

Different behavior of Directory.fileLength()

2010-03-12 Thread Marcelo Ochoa

Hi:
  During some tests of Lucene Domain Index
(http://docs.google.com/View?id=ddgw7sjp_54fgj9kg) with big data
sources we hit an exception caused by calling the
Directory.fileLength() method on a nonexistent file.
  FSDirectory implements this method as:
  /** Returns the length in bytes of a file in the directory. */
  public long fileLength(String name) {
    ensureOpen();
    File file = new File(directory, name);
    return file.length();
  }

  According to the JDK 1.5 documentation, constructing a File for a path
that does not exist neither creates the file nor throws an exception, and
File.length() then returns 0L:
http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html#File(java.lang.String,
java.lang.String)
  But neither RAMDirectory nor OJVMDirectory does this:
RAMDirectory:
  /** Returns the length in bytes of a file in the directory.
   * @throws IOException if the file does not exist
   */
  public final long fileLength(String name) throws IOException {
    ensureOpen();
    RAMFile file;
    synchronized (this) {
      file = (RAMFile)fileMap.get(name);
    }
    if (file==null)
      throw new FileNotFoundException(name);
    return file.getLength();
  }
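
The difference between the two implementations above can be reproduced with plain JDK classes. This is only a sketch: FileLengthDemo, lenientLength and strictLength are made-up names, and the Map stands in for RAMDirectory's file map. java.io.File.length() is documented to return 0L for a file that does not exist, which is why the FSDirectory-style code never throws.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; only java.io.File behavior is real API here.
final class FileLengthDemo {
    // Mirrors the FSDirectory style: no existence check, so a missing file yields 0.
    static long lenientLength(File dir, String name) {
        return new File(dir, name).length();
    }

    // Mirrors the RAMDirectory style: a missing entry is reported as an error.
    static long strictLength(Map<String, Long> files, String name)
            throws FileNotFoundException {
        Long length = files.get(name);
        if (length == null) {
            throw new FileNotFoundException(name);
        }
        return length;
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        String missing = "no-such-file-" + System.nanoTime() + ".fdx";
        System.out.println("lenient: " + lenientLength(tmp, missing));
        try {
            strictLength(new HashMap<String, Long>(), missing);
        } catch (FileNotFoundException e) {
            System.out.println("strict: FileNotFoundException for " + e.getMessage());
        }
    }
}
```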

  When OJVMDirectory throws an exception for a file that doesn't exist, the
IndexWriter fails to do its job. Here is the stack trace:
IW 3 [Root Thread]: DW:   RAM: now flush @ usedMB=15.001
allocMB=15.001 deletesMB=0 triggerMB=15
IW 3 [Root Thread]:   flush: segment=_0 docStoreSegment=_0
docStoreOffset=0 flushDocs=true flushDeletes=false
flushDocStores=false numDocs=109169 numBufDelTerms=0
IW 3 [Root Thread]:   index before flush
IW 3 [Root Thread]: DW: flush postings as segment _0 numDocs=109169
*** 2010-03-11 17:27:15.696
IW 3 [Root Thread]: DW: docWriter: now abort
IW 3 [Root Thread]: hit exception flushing segment _0
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file _0.tii
IFD [Root Thread]: delete _0.tii
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file _0.fnm
IFD [Root Thread]: delete _0.fnm
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file _0.fdx
IFD [Root Thread]: delete _0.fdx
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file _0.fdt
IFD [Root Thread]: delete _0.fdt
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file _0.prx
IFD [Root Thread]: delete _0.prx
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file _0.nrm
IFD [Root Thread]: delete _0.nrm
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file _0.frq
IFD [Root Thread]: delete _0.frq
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file _0.tis
IFD [Root Thread]: delete _0.tis
Mar 11, 2010 5:27:15 PM org.apache.lucene.indexer.LuceneDomainIndex
ODCIIndexCreate
SEVERE: failed to create index: cannot verify file: _0.fdx. Reason:
Exhausted Resultset
Mar 11, 2010 5:27:15 PM org.apache.lucene.indexer.LuceneDomainIndex
ODCIIndexCreate
FINER: THROW
java.io.IOException: cannot verify file: _0.fdx. Reason: Exhausted Resultset
    at org.apache.lucene.store.OJVMDirectory.fileLength(OJVMDirectory.java:633)
    at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:271)
    at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:593)
    at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4311)
    at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4209)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4200)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2497)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2451)
    at org.apache.lucene.indexer.TableIndexer.index(TableIndexer.java:374)
    at org.apache.lucene.indexer.LuceneDomainIndex.ODCIIndexCreate(LuceneDomainIndex.java:568)
IW 3 [Root Thread]: now flush at close
IW 3 [Root Thread]:   flush: segment=null docStoreSegment=null
docStoreOffset=0 flushDocs=false flushDeletes=true
flushDocStores=false numDocs=0 numBufDelTerms=0
IW 3 [Root Thread]:   index before flush
IW 3 [Root Thread]: CMS: now merge
IW 3 [Root Thread]: CMS:   index:
IW 3 [Root Thread]: CMS:   no more merges pending; now return
IW 3 [Root Thread]: now call final commit()
IW 3 [Root Thread]: startCommit(): start sizeInBytes=0
IW 3 [Root Thread]: startCommit index= changeCount=1
IW 3 [Root Thread]: done all syncs
IW 3 [Root Thread]: commit: pendingCommit != null
IW 3 [Root Thread]: commit: wrote segments file segments_2
IFD [Root Thread]: now checkpoint segments_2 [0 segments ; isCommit = true]
IFD [Root Thread]: deleteCommits: now decRef commit segments_1
IFD [Root Thread]: delete segments_1
IW 3 [Root Thread]: commit: done
IW 3 [Root Thread]: at close:

   Which is the correct behavior for this method?
   We changed the OJVMDirectory.fileLength() method to return 0 if no
file exists instead of throwing an exception, and IndexWriter now works
properly,

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-12 Thread Marvin Humphrey

On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote:
  So there would be polymorphism in the decoding phase while we're supplying
  information the Similarity object needs to make its similarity judgments.
  However, that polymorphism would be handled internally -- it wouldn't be the
  responsibility of the user to determine whether a codec supported a
  particular scoring model.
 
 Is that yes (a user can do MatchOnlySim at search time if the field
 were indexed with BM25Sim)?

In essence, yes.  Technically, no.  

Under the covers, doc-id-only postings iteration probably wouldn't be
implemented by spawning a doc-id-only Similarity object.  It would probably be
something more like, ask the Similarity for a PostingDecoder with no extra
attributes.  And then docID-freq-boost postings iteration might be achieved by
asking the Similarity for a PostingDecoder with TermFreq and DocBoost
attributes. 

 How will Lucy know which switchups (Sim at indexing vs Sim at
 searching) are OK...

I think the theme is that each Similarity class will have a whitelist of
supported posting iteration configurations.  So long as the requested config
is in the whitelist, you get an iterator back -- otherwise, you get NULL.
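
As a rough illustration of the whitelist idea, here is a sketch in Java (Lucy itself is not written in Java). Every name in it is hypothetical: PostingAttr, SketchPostingDecoder, SketchSimilarity and SketchScoringSimilarity are invented for this example, and the form of the request (a set of attributes) is just one guess, since the thread leaves that open.

```java
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical attribute names, loosely following the thread.
enum PostingAttr { TERM_FREQ, DOC_BOOST, POSITIONS }

// Hypothetical decoder; a real one would read postings from the index.
final class SketchPostingDecoder {
    final EnumSet<PostingAttr> attrs;

    SketchPostingDecoder(EnumSet<PostingAttr> attrs) {
        this.attrs = attrs;
    }
}

class SketchSimilarity {
    // Each entry is one supported posting iteration configuration.
    private final Set<EnumSet<PostingAttr>> whitelist = new HashSet<>();

    protected void allow(EnumSet<PostingAttr> config) {
        whitelist.add(EnumSet.copyOf(config));   // copy so entries never mutate
    }

    // Returns a decoder if the requested config is whitelisted, otherwise null.
    SketchPostingDecoder makeDecoder(EnumSet<PostingAttr> requested) {
        if (!whitelist.contains(requested)) {
            return null;
        }
        return new SketchPostingDecoder(EnumSet.copyOf(requested));
    }
}

final class SketchScoringSimilarity extends SketchSimilarity {
    SketchScoringSimilarity() {
        allow(EnumSet.noneOf(PostingAttr.class));                         // doc-id-only
        allow(EnumSet.of(PostingAttr.TERM_FREQ, PostingAttr.DOC_BOOST));  // docID-freq-boost
    }

    public static void main(String[] args) {
        SketchSimilarity sim = new SketchScoringSimilarity();
        System.out.println(sim.makeDecoder(EnumSet.of(PostingAttr.POSITIONS)) == null);
    }
}
```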

Exactly what form the request specification would take, that's up in the air.
But it would be an implementation detail for now.  So long as the file format
supports the data, we can build an iterator that reads it, regardless of
encoding.

  Yeah so, I don't like that in Lucene you call Field.setOmitTFAP
  instead of saying Field.matchOnly (or something).  So I do agree
  that it'd be better if the API made it clear what the *search* time
  impact is of using this advanced Field API.
 
  In my opinion, it makes sense to communicate match only by way of the
  Similarity object as opposed to a boolean.  I think it's a good way to
  introduce the Similarity class and get people comfortable with it, and I
  also think that it's good to keep stuff out of the FieldType API when we can.
 
 But say we want to also allow storing tf but not positions, because
 really the two choices should not be coupled (as they are today with
 Lucene's omitTFAP).
 
 So I have omitTF and omitP (only 3 combos are allowed -- must omitP if
 you omitTF).
 
 What Sim do you call that at indexing time?

Well, those are pretty esoteric posting formats.  It's common to not need
scores and therefore not need boost bytes (the Lucene omitNorms case).  It's
also common to not need any matching info beyond doc id (the Lucene omitTFAP
case).  But omitTF and omitP aren't common needs, or Lucene would have them by
now, right?

And since they are infrequently used, Huffman-driven naming philosophy
suggests that they should have long, low-value names: OmitPositionsSimilarity,
OmitTFandPositionsSimilarity (or OmitTFAPSimilarity, which would actually be
an accurate abbreviation in this scenario as opposed to the current Lucene
omitTFAP).

In other words, I don't much care what those are named because they aren't
likely to be used except by people who A) have very, very specific use cases
and B) really know what they're doing.

In contrast, I think it's important that we come up with good names for the
doc-id-tf-positions-but-no-boost-bytes (aka omitNorms) and doc-id-only cases.

  We get users who are baffled that their phrase queries no longer work
  after setting omitTFAP.
 
  This is still a weakness of MatchSimilarity.
 
 Well MatchSimilarity arguably should mean match all queries
 correctly, just don't score them.  Ie, positional queries should in
 fact work... just not receive a score.

Right.  However, now that I've thought about it, if a user indicates that a
field is match-only by supplying a MatchSimilarity, we know that we can
omit boost bytes.  

So we can re-conceive MatchSimilarity as being analogous to omitNorms.
Huzzah!

One down, one to go.  :)

  On the other hand, typical candidates for MatchSimilarity...
 
   * unique_id
   * category
   * tags
 
  ... either won't contain multiple tokens, or won't generally return sensible
  results for phrase queries.
 
 Maybe we need to splinter MatchSim into the two cases.  Whether
 positions are stored, and whether scoring is done, is really
 orthogonal.

Maybe MinimalSimilarity as the analogue for Lucene omitTFAP?  I dunno,
that might be kind of generic, but maybe it makes sense in context.

The idea is to get the user to describe how the field will be scored.  Based on
that info, we can customize the posting format, possibly making optimizations
and omitting certain posting data.  

When people ask on the user list...

How can I make my index smaller?
   
... we can reply like so:

Make some fields match-only by specifying MatchSimilarity in the
FieldType, or even better if you don't need phrase queries, by specifying
MinimalSimilarity.  You'll be throwing away data Lucy needs for
sophisticated queries, but your index will get smaller.
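
One way to picture the trade-off in that reply is a table of which posting data each choice keeps. The sketch below is an assumption-laden reading of this thread, not an existing Lucy or Lucene API: SketchPostingData is a hypothetical enum, MATCH is taken to drop only boost bytes (the omitNorms analogue), and MINIMAL is taken to keep doc ids only (the omitTFAP analogue). It also encodes the rule quoted earlier that positions cannot be kept once term frequencies are omitted.

```java
// Hypothetical sketch: names follow the thread, not any released API.
enum SketchPostingData {
    // Assumed: MatchSimilarity keeps tf and positions, drops boost bytes.
    MATCH(true, true, false),
    // Assumed: MinimalSimilarity keeps doc ids only.
    MINIMAL(false, false, false);

    final boolean termFreq;
    final boolean positions;
    final boolean boostBytes;

    SketchPostingData(boolean termFreq, boolean positions, boolean boostBytes) {
        this.termFreq = termFreq;
        this.positions = positions;
        this.boostBytes = boostBytes;
    }

    // Phrase queries need positions, so only some formats can answer them.
    boolean supportsPhraseQueries() {
        return positions;
    }

    // Rule from the quoted question: if TF is omitted, positions must be too.
    static boolean isLegalCombo(boolean omitTF, boolean omitP) {
        return !omitTF || omitP;
    }

    public static void main(String[] args) {
        for (SketchPostingData data : values()) {
            System.out.println(data + " phraseQueries=" + data.supportsPhraseQueries());
        }
    }
}
```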

I think that

[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

[
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844826#action_12844826
]

Jason Rutherglen commented on LUCENE-2312:
--

I set out implementing a simple method DocumentsWriter.getTerms
which should return a sorted array of terms over the current RAM
buffer. While I think this can be implemented, there's a lot of
code in the index package to handle multiple threads, which is
fine, except I'm concerned the interleaving of postings won't
perform well. So I think we'd want to implement what's been
discussed in LUCENE-2293, per thread ram buffers. With that
change, it seems implementing this issue could be
straightforward.
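
To illustrate why per-thread buffers could make this simpler, here is a toy sketch. PerThreadTermBuffers is a hypothetical class and has nothing to do with the real DocumentsWriter internals: each indexing thread writes to its own term set, and getTerms merges the sets into one sorted array only when asked.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not the actual DocumentsWriter design.
final class PerThreadTermBuffers {
    // All per-thread sets, so a reader can find them later.
    private final List<Set<String>> buffers = new ArrayList<>();

    // Each thread lazily registers its own private term set.
    private final ThreadLocal<Set<String>> local = ThreadLocal.withInitial(() -> {
        Set<String> mine = new HashSet<>();
        synchronized (buffers) {
            buffers.add(mine);
        }
        return mine;
    });

    // Indexing path: only the calling thread's set is touched.
    void addTerm(String term) {
        Set<String> mine = local.get();
        synchronized (mine) {
            mine.add(term);
        }
    }

    // Search path: merge every thread's terms into one sorted array.
    String[] getTerms() {
        TreeSet<String> merged = new TreeSet<>();
        synchronized (buffers) {
            for (Set<String> set : buffers) {
                synchronized (set) {
                    merged.addAll(set);
                }
            }
        }
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) throws InterruptedException {
        PerThreadTermBuffers ram = new PerThreadTermBuffers();
        Thread first = new Thread(() -> ram.addTerm("segment"));
        Thread second = new Thread(() -> ram.addTerm("buffer"));
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println(String.join(",", ram.getTerms()));
    }
}
```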


[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency