[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844420#action_12844420
 ] 

Simon Willnauer commented on LUCENE-2309:
-

The IndexWriter, or rather DocInverterPerField, is simply an attribute consumer. 
Neither needs to know about Analyzer or TokenStream at all, and neither needs 
the analyzer to iterate over tokens. The IndexWriter should instead implement 
an interface, or use a class, that is called for each successful 
incrementToken(), no matter how this increment is implemented.

I could imagine a really simple interface like
{code}

interface AttributeConsumer {
  
  void setAttributeSource(AttributeSource src);

  void next();

  void end();

}
{code}

IW would then pass itself, or an instance it uses (DocInverterPerField), to an API 
expecting such a consumer, like:

{code}
field.consume(this);
{code}

or something similar. That way we have no dependency on whatever attribute 
producer is used. The default implementation would of course be based on an 
analyzer / tokenstream, and alternatives can be exposed via an expert API. Even 
backwards compatibility could be solved easily that way.
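
A minimal sketch of the push model described above, assuming simplified stand-ins rather than Lucene's real classes: SimpleAttributeSource, WhitespaceField and RecordingConsumer are hypothetical names invented for illustration, and only the three-method consumer interface comes from the comment itself. The field drives the loop and the consumer never sees an Analyzer or TokenStream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for AttributeSource: shared, mutable per-token state.
// It carries only a term here; the real class holds arbitrary attributes.
class SimpleAttributeSource {
    String term;
}

// The consumer interface as proposed in the comment (parameter type simplified).
interface AttributeConsumer {
    void setAttributeSource(SimpleAttributeSource src);
    void next(); // called once for each successfully produced token
    void end();  // called once after the last token
}

// Hypothetical producer: a field that splits its value on whitespace.
// Any other producer could feed the same consumer.
class WhitespaceField {
    private final String value;

    WhitespaceField(String value) {
        this.value = value;
    }

    void consume(AttributeConsumer consumer) {
        SimpleAttributeSource src = new SimpleAttributeSource();
        consumer.setAttributeSource(src);
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty value yields no tokens
            }
            src.term = token; // update the shared state, then notify
            consumer.next();
        }
        consumer.end();
    }
}

// Stand-in for the indexer side (DocInverterPerField in the discussion).
class RecordingConsumer implements AttributeConsumer {
    final List<String> seen = new ArrayList<>();
    private SimpleAttributeSource src;

    public void setAttributeSource(SimpleAttributeSource src) {
        this.src = src;
    }

    public void next() {
        seen.add(src.term);
    }

    public void end() {
        seen.add("<end>");
    }
}

public class ConsumerSketch {
    public static void main(String[] args) {
        RecordingConsumer consumer = new RecordingConsumer();
        new WhitespaceField("decouple the indexer").consume(consumer);
        System.out.println(consumer.seen);
    }
}
```

The consumer reads its state from the source it was handed once, so the producer is free to be an analyzer chain, a pre-tokenized list, or something else entirely.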

bq. Only tests would rely on the analyzers module. I think that's OK? core 
itself would have no dependence.

+1 test dependencies should not block modularization, it's just about 
configuring the classpath though!



 Fully decouple IndexWriter from analyzers
 -

 Key: LUCENE-2309
 URL: https://issues.apache.org/jira/browse/LUCENE-2309
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless

 IndexWriter only needs an AttributeSource to do indexing.
 Yet, today, it interacts with Field instances, holds a private
 analyzer, invokes analyzer.reusableTokenStream, and has to deal with a
 wide variety of cases (it's not analyzed; it is analyzed but it's a Reader
 or String; it's pre-analyzed).
 I'd like to have IW only interact with attr sources that already
 arrived with the fields.  This would be a powerful decoupling -- it
 means others are free to make their own attr sources.
 They need not even use any of Lucene's analysis impls; eg they can
 integrate to other things like [OpenPipeline|http://www.openpipeline.org].
 Or make something completely custom.
 LUCENE-2302 is already a big step towards this: it makes IW agnostic
 about which attr is the term, and only requires that it provide a
 BytesRef (for flex).
 Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the
 FieldType knows the analyzer to use, then we could simply create a
 getAttrSource() method (say) on it and move all the logic IW has today
 onto there.  (We'd still need existing IW code for back-compat).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844450#action_12844450
 ] 

Michael McCandless commented on LUCENE-2309:


bq. The IndexWriter or rather DocInverterPerField are simply an attribute 
consumer. None of them needs to know about Analyzer or TokenStream at all. 
Neither needs the analyzer to iterate over tokens.

[Carrying over discussions on IRC with Chris Male & Uwe...]

Actually, TokenStream is already AttrSource + incrementing, so it
seems like the right start...

However, the .reset() method is redundant from the indexer's standpoint --
ie when the indexer calls Field.getTokenStream (say), whatever init'ing /
reset'ing is needed should already have been done by that method by the
time it returns the TokenStream.

Also, .close and .end are redundant -- seems like we should only have
.end (few token streams do anything in .close...).  But coalescing
those two would be a good chunk of work at this point :) Or maybe we
make a .finish that simply does both by default ;)

Finally, the indexer doesn't really need a Document; it just needs
something abstract that provides an iterator over all fields that
need indexing (and separately, storing).
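
The last point can be illustrated with a small sketch. Everything named here (FieldView, FieldProvider, ListBackedRecord) is hypothetical and not a Lucene type; the only thing taken from the comment is the idea that the indexer needs two iterations, one over fields to index and one over fields to store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, minimal view of one field: a name and a text value.
class FieldView {
    final String name;
    final String value;

    FieldView(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical abstraction (not a Lucene interface): all the indexer asks of
// a "document" is one iteration over fields to index and one over fields to store.
interface FieldProvider {
    Iterable<FieldView> indexedFields();
    Iterable<FieldView> storedFields();
}

// One possible implementation backed by plain lists; a database row or a
// parsed log line could implement FieldProvider just as well.
class ListBackedRecord implements FieldProvider {
    private final List<FieldView> indexed;
    private final List<FieldView> stored;

    ListBackedRecord(List<FieldView> indexed, List<FieldView> stored) {
        this.indexed = indexed;
        this.stored = stored;
    }

    public Iterable<FieldView> indexedFields() {
        return indexed;
    }

    public Iterable<FieldView> storedFields() {
        return stored;
    }
}

public class FieldProviderSketch {
    // Stand-in for the indexer: it sees only the FieldProvider abstraction.
    static List<String> indexedNames(FieldProvider doc) {
        List<String> names = new ArrayList<>();
        for (FieldView field : doc.indexedFields()) {
            names.add(field.name);
        }
        return names;
    }

    public static void main(String[] args) {
        FieldView title = new FieldView("title", "Fully decouple IndexWriter");
        FieldView id = new FieldView("id", "LUCENE-2309");
        FieldProvider doc =
            new ListBackedRecord(Arrays.asList(title), Arrays.asList(title, id));
        System.out.println(indexedNames(doc));
    }
}
```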





[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844455#action_12844455
 ] 

Michael McCandless commented on LUCENE-2294:


Thanks Shai, I'll look...

bq. Note, check.py still alerts on some changes, though I don't see any 
relevant change in the patch file. Should I ignore them?

Yes if they are indeed false positives...

 Create IndexWriterConfiguration and store all of IW configuration there
 ---

 Key: LUCENE-2294
 URL: https://issues.apache.org/jira/browse/LUCENE-2294
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: check.py, LUCENE-2294.patch, LUCENE-2294.patch, 
 LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch


 I would like to factor all IW configuration parameters out into a single 
 configuration class, which I propose to name IndexWriterConfiguration (or 
 IndexWriterConfig). I want to store there almost everything besides the 
 Directory, and to reduce all the ctors down to one: IndexWriter(Directory, 
 IndexWriterConfiguration). What I was thinking of storing there are the 
 following parameters:
 * All of ctors parameters, except for Directory.
 * The different setters where it makes sense. For example I still think 
 infoStream should be set on IW directly.
 I'm thinking that IWC should expose everything via setter/getter methods, 
 and default to whatever IW defaults to today. Except for Analyzer which will 
 need to be defined in the ctor of IWC and won't have a setter.
 I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares 
 a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 
 1 should be the default? Why not default to UNLIMITED and otherwise let 
 the application decide what LIMITED means for it? I would like to make MFL 
 optional on IWC and default to something, and I hope that default will be 
 UNLIMITED. We can document that on IWC, so that if anyone chooses to move to 
 the new API, he should be aware of that ...
 I plan to deprecate all the ctors and getters/setters and replace them by:
 * One ctor as described above
 * getIndexWriterConfiguration, or simply getConfig, which can then be queried 
 for the setting of interest.
 * About the setters, I think maybe we can just introduce a setConfig method 
 which will override everything that is overridable today, except for 
 Analyzer. So someone could do iw.getConfig().setSomething(); 
 iw.setConfig(newConfig);
 ** The setters on IWC can return an IWC to allow chaining set calls ... so 
 the above will turn into 
 iw.setConfig(iw.getConfig().setSomething1().setSomething2()); 
 BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it 
 will greatly simplify IW's API.
 I'll start to work on a patch.
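
The chained-setter idea in the description above can be sketched as follows. This is an illustrative stand-in and not the class from the patch: WriterConfigSketch, its two settings, the String analyzer placeholder and the default values are all assumptions. It only shows the shape being proposed: the analyzer is fixed in the constructor, every setter returns the config, and MaxFieldLength defaults to UNLIMITED.

```java
// Hypothetical stand-in for the proposed configuration class.
final class WriterConfigSketch {
    // Assumed sentinel for "no limit"; the real constant may differ.
    static final int UNLIMITED = Integer.MAX_VALUE;

    private final String analyzerName;      // placeholder for an Analyzer; set once, no setter
    private int maxFieldLength = UNLIMITED; // the default the description argues for
    private double ramBufferSizeMB = 16.0;  // illustrative default only

    WriterConfigSketch(String analyzerName) {
        this.analyzerName = analyzerName;
    }

    // Each setter returns this so that calls can be chained.
    WriterConfigSketch setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    WriterConfigSketch setRamBufferSizeMB(double ramBufferSizeMB) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }

    String getAnalyzerName() {
        return analyzerName;
    }

    int getMaxFieldLength() {
        return maxFieldLength;
    }

    double getRamBufferSizeMB() {
        return ramBufferSizeMB;
    }
}

public class ConfigSketch {
    public static void main(String[] args) {
        WriterConfigSketch config = new WriterConfigSketch("whitespace")
            .setMaxFieldLength(10000)
            .setRamBufferSizeMB(32.0);
        System.out.println(config.getAnalyzerName() + " " + config.getMaxFieldLength()
            + " " + config.getRamBufferSizeMB());
    }
}
```

With this shape a writer needs only one constructor taking a Directory and the config, which is the reduction the description is after.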




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844461#action_12844461
 ] 

Michael McCandless commented on LUCENE-2294:


bq. Note, check.py still alerts on some changes, though I don't see any 
relevant change in the patch file. Should I ignore them?

Hmm, some of these (at least TestAtomicUpdate, which was changed from Simple -> 
Whitespace) were in fact real changes. I'll fix & post a new patch.




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844464#action_12844464
 ] 

Simon Willnauer commented on LUCENE-2309:
-

bq. [Carrying over discussions on IRC with Chris Male & Uwe...]

That makes it very hard to participate. I cannot afford to read through all the 
IRC stuff, and I don't get the chance to participate directly unless I watch IRC 
constantly. We should really move back to JIRA / devlist for such discussions. 
There is too much going on in IRC.

{quote}
Actually, TokenStream is already AttrSource + incrementing, so it
seems like the right start...
{quote}

But that binds the indexer to a TokenStream, which is unnecessary IMO. What if I 
want to implement something aside from the TokenStream delegator API?






[jira] Updated: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2294:
---

Attachment: LUCENE-2294.patch

Attached new patch, just fixing a couple of tests where the analyzer had changed.

I think it's ready to commit (take 2)!  I'll wait a day or two...




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844467#action_12844467
 ] 

Robert Muir commented on LUCENE-2309:
-

Hello, I commented yesterday but did not receive much feedback, so
I want to elaborate some more.

I suppose what I was trying to say in my earlier comment here:
https://issues.apache.org/jira/browse/LUCENE-2309?focusedCommentId=12844189&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12844189

is that while I really like the new TokenStream API, I would prefer
it if we thought about making this flexible enough to support
different paradigms, including perhaps something that looks a lot
like the old TokenStream API. 

The reason is, I notice a lot of existing code is still written against the
old API, and I think that in some cases it is perhaps easier to work with,
even if you aren't a new user. I definitely think the old API might have
some advantages for newer users.

I think it's useful to consider supporting such an API, perhaps as an extension
in contrib/analyzers. Even if it's not as fast or flexible as the new API,
the loss of speed and flexibility might be worth the ease of use
for newer users.





[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity

2010-03-12 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844469#action_12844469
 ] 

Chris Male commented on LUCENE-2310:


The challenge presented in this work is the pervasiveness of the Fieldable 
class.  It's used in several hundred places throughout the source, but the 
majority are in tests and in Document itself.  Therefore part of this work will 
also be to move as many of the tests as possible over to using Field, and to 
work on the Document API as well.

 Reduce Fieldable, AbstractField and Field complexity
 

 Key: LUCENE-2310
 URL: https://issues.apache.org/jira/browse/LUCENE-2310
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Index
Reporter: Chris Male

 In order to move field-type-like functionality into its own class, we really 
 need to try to tackle the hierarchy of Fieldable, AbstractField and Field.  
 Currently AbstractField depends on Field, and does not provide much more 
 functionality than storing fields, most of which are being moved over to 
 FieldType.  Therefore it seems ideal to try to deprecate AbstractField (and 
 possibly Fieldable), moving much of the functionality into Field and 
 FieldType.




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844489#action_12844489
 ] 

Uwe Schindler commented on LUCENE-2309:
---

bq. I could imagine a really simple interface like

During lunch an idea evolved:

If you look at the current DocInverter code, it does not use a consumer-like API. 
The code just has an add/accept method that accepts tokens. The idea is, as 
Simon proposed, to let the DocInverter implement something like AttributeAcceptor. 
But we must still have the attribute API, and the acceptor (DocInverter) must 
always see the same attribute instances (otherwise, if the accept method took 
an AttributeSource, much time would be spent calling getAttribute(...) for 
each token).

The current TokenStream API could get a method taking an AttributeAcceptor that 
simply does a while-incrementToken() loop, calling accept() on the DocInverter 
(the AttributeAcceptor). Another approach for users would be to not use the 
TokenStream API at all and simply call the accept() method for each token.




[jira] Issue Comment Edited: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844489#action_12844489
 ] 

Uwe Schindler edited comment on LUCENE-2309 at 3/12/10 1:25 PM:


bq. I could imagine a really simple interface like

During lunch an idea evolved:

If you look at the current DocInverter code, it does not use a consumer-like API. 
The code just has an add/accept method that accepts tokens. The idea is, as 
Simon proposed, to let the DocInverter implement something like AttributeAcceptor. 
But we must still have the attribute API, and the acceptor (DocInverter) must 
always see the same attribute instances (otherwise, if the accept method took 
an AttributeSource, much time would be spent calling getAttribute(...) for 
each token).

The current TokenStream API could get a method taking an AttributeAcceptor that 
simply does a while-incrementToken() loop, calling accept() on the DocInverter 
(the AttributeAcceptor). Another approach for users would be to not use the 
TokenStream API at all and simply call the accept() method for each token on 
the acceptor.

But both approaches still have the problem with the shared attributes. If you 
want to record tokens you have to implement something like my Proxy 
attributes. Otherwise (as mentioned above) most of the time would be spent in 
getAttribute() calls.
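
A rough sketch of why shared attribute instances matter, under stated assumptions: AttrRegistry is a hypothetical stand-in for AttributeSource, TermAttr for a term attribute, and CountingAcceptor for the DocInverter. The acceptor resolves its attribute once at construction, so accept() takes no arguments and performs no per-token lookup; the producer only mutates the shared instance.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical attribute holding the current term text.
class TermAttr {
    String text;
}

// Hypothetical minimal attribute registry standing in for AttributeSource.
class AttrRegistry {
    private final Map<Class<?>, Object> attrs = new HashMap<>();
    int lookups = 0; // counts getAttribute calls to make their cost visible

    <T> void add(Class<T> type, T instance) {
        attrs.put(type, instance);
    }

    <T> T getAttribute(Class<T> type) {
        lookups++;
        return type.cast(attrs.get(type));
    }
}

// The acceptor idea from the comment: called once per token, no arguments.
interface AttributeAcceptor {
    void accept();
}

// Stand-in for DocInverter: it grabs the shared attribute instance once.
class CountingAcceptor implements AttributeAcceptor {
    private final TermAttr term; // looked up once, reused for every token
    int totalChars = 0;

    CountingAcceptor(AttrRegistry source) {
        this.term = source.getAttribute(TermAttr.class);
    }

    public void accept() {
        totalChars += term.text.length(); // reads the shared state, no lookup
    }
}

public class AcceptorSketch {
    public static void main(String[] args) {
        AttrRegistry source = new AttrRegistry();
        TermAttr term = new TermAttr();
        source.add(TermAttr.class, term);

        CountingAcceptor acceptor = new CountingAcceptor(source);
        for (String token : new String[] {"shared", "attribute", "instances"}) {
            term.text = token; // the producer mutates the shared instance
            acceptor.accept();
        }
        System.out.println(acceptor.totalChars + " chars, " + source.lookups + " lookup(s)");
    }
}
```

If accept() instead received a fresh source for every token, the lookup would move inside the loop, which is the cost the comment warns about.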




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844498#action_12844498
 ] 

Michael McCandless commented on LUCENE-2309:


bq. The idea is to, as Simon proposed, let the docinverter implement something 
like AttributeAcceptor.

This is interesting!  It inverts the stack/control flow, but, would continue to 
use shared attrs.

So then somehow the indexer would pass its AttrAcceptor to the field?  And the 
field would have whatever control logic it wants to feed the tokens...




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844500#action_12844500
 ] 

Michael McCandless commented on LUCENE-2309:


{quote}
bq. Actually, TokenStream is already AttrSource + incrementing, so it seems 
like the right start...

But that binds the Indexer to a tokenstream which is unnecessary IMO. What if I 
want to implement something aside the TokenStream delegator API?
{quote}

True, but we need at least some way to increment?  AttrSource doesn't have that.

But I don't think we need reset nor close from TokenStream.

Maybe we could factor out an abstract class / interface that TokenStream implements,
minus the reset and close methods?

Then people could freely use Lucene to index off a foreign analysis chain...
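
A rough sketch of what such a factored-out contract could look like, assuming it keeps only "advance" and "finish". TokenIncrementer, IteratorIncrementer and MinimalConsumer are hypothetical names, and currentTerm() is a simplification standing in for reading a shared term attribute; this is not the actual Lucene class hierarchy.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical minimal contract: no reset(), no close(). */
interface TokenIncrementer {
  boolean incrementToken(); // true if a new token is now current
  void end();               // end-of-stream bookkeeping
  String currentTerm();     // stands in for a shared term attribute
}

/** A "foreign" analysis chain (here: any Iterator) adapted to the contract. */
final class IteratorIncrementer implements TokenIncrementer {
  private final Iterator<String> source;
  private String current;
  boolean ended = false;

  IteratorIncrementer(Iterator<String> source) { this.source = source; }

  @Override public boolean incrementToken() {
    if (!source.hasNext()) {
      return false;
    }
    current = source.next();
    return true;
  }

  @Override public void end() { ended = true; }

  @Override public String currentTerm() { return current; }
}

/** Stand-in for the indexer: it only needs the minimal contract. */
final class MinimalConsumer {
  static List<String> consume(TokenIncrementer in) {
    List<String> terms = new ArrayList<>();
    while (in.incrementToken()) {
      terms.add(in.currentTerm());
    }
    in.end();
    return terms;
  }
}
```

Resource handling (what reset and close do today) would then stay with whoever created the foreign chain.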




Welcome Chris Male as Contrib committer!

2010-03-12 Thread Mark Miller

I am happy to announce the Lucene PMC has accepted Chris Male as a
contrib committer!

Chris has been making a lot of headway in cleaning up the spatial contrib
lately,
and hopefully now we can get more of those improvements into svn!

Congrats Chris, and welcome!


--
- Mark

http://www.lucidimagination.com





[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844509#action_12844509
 ] 

Shai Erera commented on LUCENE-2309:


bq. We should really move back to JIRA / devlist for such discussions

+1 !! I also find it very hard to track so many sources of discussions (JIRA, 
java-dev, java-user, general, and now IRC). Also IRC is not logged/archived and 
searchable (I think?) which makes it impossible to trace back a discussion, 
and/or randomly stumble upon it in Google.

I'd like to donate my two cents here - we've just recently changed the 
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only 
now the API has changed slightly. The proposal here, w/ the 
AttConsumer/Acceptor, is that IW will pass itself to a Field, and the Field 
will call back into IW; that seems too complicated to me. Users who write 
Analyzers/TokenStreams/AttributeSources should not care how they are 
indexed/stored etc. Forcing them to implement this push logic into IW seems to 
me like real, unnecessary overhead and complexity.

And having the Field control the flow of indexing also seems dangerous ... 
it might expose Lucene to lots of bugs introduced by users. Today, when IW 
controls it, there is one place to look, but tomorrow, when Field controls it, 
where do we look? In the app's custom Field code? In IW's atts consuming methods?

Will the Field also control how stored fields are added? Or only 
AttributeSourced ones?

Maybe I need to get used to this change, but currently it looks wrong to 
reverse the control flow. Maybe in principle the DocInverter now accepts tokens 
from IW, and so it looks as if we can pass it to the Field (as IW's 
AttAcceptor), but still the concept is different. We (IW) control the indexing 
flow, and not the user.

I also may not understand what that will give to users. Don't users get 
enough flexibility w/ the current API and the Flex (once out) stuff? Do they 
really need to be bothered w/ pushing tokens to IW?




[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there

2010-03-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844511#action_12844511
 ] 

Shai Erera commented on LUCENE-2294:


Thanks Mike. I ran the tool once and fixed everything it complained about. The 
2nd time it found some more (probably ones I missed in the 1st pass), but only 
a few this time, so I fixed them as well. But I didn't run it a 3rd time :) ...

I can't wait for this to be in ... an exhausting issue ;).

 Create IndexWriterConfiguration and store all of IW configuration there
 ---

 Key: LUCENE-2294
 URL: https://issues.apache.org/jira/browse/LUCENE-2294
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: check.py, LUCENE-2294.patch, LUCENE-2294.patch, 
 LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch


 I would like to factor out of all IW configuration parameters into a single 
 configuration class, which I propose to name IndexWriterConfiguration (or 
 IndexWriterConfig). I want to store there almost everything besides the 
 Directory, and to reduce all the ctors down to one: IndexWriter(Directory, 
 IndexWriterConfiguration). What I was thinking of storing there are the 
 following parameters:
 * All of ctors parameters, except for Directory.
 * The different setters where it makes sense. For example I still think 
 infoStream should be set on IW directly.
 I'm thinking that IWC should expose everything in a setter/getter methods, 
 and defaults to whatever IW defaults today. Except for Analyzer which will 
 need to be defined in the ctor of IWC and won't have a setter.
 I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares 
 a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 
 1 should be the default? Why not default to UNLIMITED and otherwise let 
 the application decide what LIMITED means for it? I would like to make MFL 
 optional on IWC and default to something, and I hope that default will be 
 UNLIMITED. We can document that on IWC, so that if anyone chooses to move to 
 the new API, he should be aware of that ...
 I plan to deprecate all the ctors and getters/setters and replace them by:
 * One ctor as described above
 * getIndexWriterConfiguration, or simply getConfig, which can then be queried 
 for the setting of interest.
 * About the setters, I think maybe we can just introduce a setConfig method 
 which will override everything that is overridable today, except for 
 Analyzer. So someone could do iw.getConfig().setSomething(); 
 iw.setConfig(newConfig);
 ** The setters on IWC can return an IWC to allow chaining set calls ... so 
 the above will turn into 
 iw.setConfig(iw.getConfig().setSomething1().setSomething2()); 
 BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it 
 will greatly simplify IW's API.
 I'll start to work on a patch.
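
A minimal sketch of the chained-setter idea described above, in plain Java. SketchWriterConfig and SketchWriter are hypothetical stand-ins, and the setting names and default values are placeholders invented for illustration; they are not taken from the patch.

```java
/** Hypothetical config holder; settings and defaults are placeholders. */
final class SketchWriterConfig {
  private final String analyzerName; // fixed in the ctor, deliberately no setter
  private int maxBufferedDocs = 10;
  private double ramBufferSizeMB = 16.0;

  SketchWriterConfig(String analyzerName) { this.analyzerName = analyzerName; }

  // Each setter returns this so calls can be chained.
  SketchWriterConfig setMaxBufferedDocs(int maxBufferedDocs) {
    this.maxBufferedDocs = maxBufferedDocs;
    return this;
  }

  SketchWriterConfig setRamBufferSizeMB(double ramBufferSizeMB) {
    this.ramBufferSizeMB = ramBufferSizeMB;
    return this;
  }

  String getAnalyzerName() { return analyzerName; }
  int getMaxBufferedDocs() { return maxBufferedDocs; }
  double getRamBufferSizeMB() { return ramBufferSizeMB; }
}

/** Hypothetical writer with a single ctor: (directory, config). */
final class SketchWriter {
  private final String directory;
  private SketchWriterConfig config;

  SketchWriter(String directory, SketchWriterConfig config) {
    this.directory = directory;
    this.config = config;
  }

  String getDirectory() { return directory; }
  SketchWriterConfig getConfig() { return config; }
  void setConfig(SketchWriterConfig config) { this.config = config; }
}
```

With this shape, the chained call from the description reads `w.setConfig(w.getConfig().setMaxBufferedDocs(100).setRamBufferSizeMB(32.0))`, and the analyzer cannot be changed after construction.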




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844515#action_12844515
 ] 

Uwe Schindler commented on LUCENE-2309:
---

bq. I'd like to donate my two cents here - we've just recently changed the 
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only 
now the API has changed slightly. The proposals here, w/ the 
AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field 
will call back to IW seems too much complicated to me. Users that write 
Analyzers/TokenStreams/AttributeSources, should not care how they are 
indexed/stored etc. Forcing them to implement this push logic to IW seems to me 
like a real unnecessary overhead and complexity.

The idea was not to change this behaviour, but to also give the user the 
possibility to reverse it. For some tokenstreams it would simplify things 
a lot. The current IndexWriter code works exactly like that:
# DocInverter gets TokenStream
# DocInverter calls reset() -- to be left out and moved to field/analyzer
# DocInverter does while-loop on incrementToken - it iterates. On each Token it 
calls add() on the field consumer
# DocInverter calls end() and updates end offset
# DocInverter calls close() -- to be left out and moved to field/analyzer

The change is simply that step (3) is removed from DocInverter, which then only 
provides the add() method for accepting Tokens. The existing while loop simply 
moves into the TokenStream/Field code, so nobody needs to change their 
code. But somebody who actively wants to push tokens can now do so. Today, 
that is not possible without heavy buffering.

So the push API will be very expert, and the current TokenStream is just one 
user of this API.
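
The steps above can be sketched roughly as follows, using simplified stand-ins rather than the real Lucene classes. SketchTokenStream, ArrayTokenStream, SketchDocInverter and PushDriver are hypothetical names, and finish() is an assumed hook for the "end() and update end offset" step; the exact split of responsibilities is one reading of the proposal.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a TokenStream; not the real Lucene class. */
abstract class SketchTokenStream {
  String term; // stands in for a shared term attribute
  void reset() {}
  abstract boolean incrementToken();
  void end() {}
  void close() {}
}

final class ArrayTokenStream extends SketchTokenStream {
  private final String[] terms;
  private int next;
  final List<String> lifecycle = new ArrayList<>();

  ArrayTokenStream(String... terms) { this.terms = terms; }

  @Override void reset() { next = 0; lifecycle.add("reset"); }

  @Override boolean incrementToken() {
    if (next == terms.length) {
      return false;
    }
    term = terms[next++];
    return true;
  }

  @Override void end() { lifecycle.add("end"); }
  @Override void close() { lifecycle.add("close"); }
}

/** Stand-in for DocInverter. */
final class SketchDocInverter {
  final List<String> added = new ArrayList<>();
  boolean finished = false;

  void add(String term) { added.add(term); } // per-token consumer call
  void finish() { finished = true; }         // assumed end-of-field hook

  /** Pull model: the inverter drives reset, the loop, end and close. */
  void invertPull(SketchTokenStream ts) {
    ts.reset();
    while (ts.incrementToken()) {
      add(ts.term);
    }
    ts.end();
    finish();
    ts.close();
  }
}

/** Push model: the loop lives on the producer side; the inverter only offers add(). */
final class PushDriver {
  static void push(SketchTokenStream ts, SketchDocInverter inverter) {
    ts.reset();
    while (ts.incrementToken()) {
      inverter.add(ts.term);
    }
    ts.end();
    inverter.finish();
    ts.close();
  }
}
```

Both paths hand the inverter the same tokens; the only difference is which side owns the while loop, which is why existing TokenStream users would not need to change anything.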




Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Robert Muir
Congratulations!

On Fri, Mar 12, 2010 at 9:17 AM, Mark Miller markrmil...@gmail.com wrote:
 I am happy to announce the Lucene PMC has accepted Chris Male as a
 contrib committer!

 Chris has been making a lot of headway in cleaning up the spacial contrib
 lately,
 and hopefully now we can get more of those improvements into svn!

 Congrats Chris, and welcome!

 --
 - Mark

 http://www.lucidimagination.com






-- 
Robert Muir
rcm...@gmail.com




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844516#action_12844516
 ] 

Mark Miller commented on LUCENE-2309:
-

bq.  Also IRC is not logged/archived and searchable (I think?) which makes it 
impossible to trace back a discussion, and/or randomly stumble upon it in 
Google.

Apache's rule is: if it didn't happen on the lists, it didn't happen. #IRC is a 
great way for people to communicate and hash stuff out, but it's not necessary 
that you follow it. If you have questions or want further elaboration, just ask. No 
one can expect you to follow IRC, nor is it a valid reference for where 
something was decided. IRC is great - I think having devs discuss there has 
really been a benefit - but the official position is: if it didn't happen on the list, 
it didn't actually happen.




[jira] Resolved: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter

2010-03-12 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-2015.
-

Resolution: Fixed

Committed revision 922277.

Thanks Cédrik!

 ASCIIFoldingFilter: expose folding logic + small improvements to 
 ISOLatin1AccentFilter
 --

 Key: LUCENE-2015
 URL: https://issues.apache.org/jira/browse/LUCENE-2015
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Analysis
Reporter: Cédrik LIME
Assignee: Robert Muir
Priority: Minor
 Fix For: 3.1

 Attachments: ASCIIFoldingFilter-no_formatting.patch, 
 ASCIIFoldingFilter-no_formatting.patch, Filters.patch, 
 ISOLatin1AccentFilter.patch, LUCENE-2015.patch, LUCENE-2015.patch


 This patch adds a couple of non-ascii chars to ISOLatin1AccentFilter (namely: 
 left and right single quotation marks, en dash, em dash) which we very 
 frequently encounter in our projects. I know that this class is now 
 deprecated; this improvement is for legacy code that hasn't migrated yet.
 It also enables easy access to the ascii folding technique used in 
 ASCIIFoldingFilter for potential re-use in non-Lucene-related code.




RE: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Uwe Schindler
Congrats Mark. I wish you heavy committing!

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de/ http://www.thetaphi.de

eMail: u...@thetaphi.de

 

From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, March 12, 2010 3:17 PM
To: java-dev@lucene.apache.org
Subject: Welcome Chris Male as Contrib committer!

 

I am happy to announce the Lucene PMC has accepted Chris Male as a
contrib committer!
 
Chris has been making a lot of headway in cleaning up the spacial contrib 
lately, 
and hopefully now we can get more of those improvements into svn!
 
Congrats Chris, and welcome!





-- 
- Mark
 
http://www.lucidimagination.com
 
 


RE: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Uwe Schindler
 

Congrats Chris. I wish you heavy committing!

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de/ http://www.thetaphi.de

eMail: u...@thetaphi.de

 

From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, March 12, 2010 3:17 PM
To: java-dev@lucene.apache.org
Subject: Welcome Chris Male as Contrib committer!

 

I am happy to announce the Lucene PMC has accepted Chris Male as a
contrib committer!
 
Chris has been making a lot of headway in cleaning up the spacial contrib 
lately, 
and hopefully now we can get more of those improvements into svn!
 
Congrats Chris, and welcome!





-- 
- Mark
 
http://www.lucidimagination.com
 
 


RE: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Uwe Schindler
I wish you heavy committing, too. But I meant Chris, sorry :)

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de/ http://www.thetaphi.de

eMail: u...@thetaphi.de

 

From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: Friday, March 12, 2010 3:36 PM
To: java-dev@lucene.apache.org
Subject: RE: Welcome Chris Male as Contrib committer!

 

Congrats Mark. I wish you heavy committing!

 

-

Uwe Schindler

H.-H.-Meier-Allee 63, D-28213 Bremen

http://www.thetaphi.de http://www.thetaphi.de/ 

eMail: u...@thetaphi.de

 

From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, March 12, 2010 3:17 PM
To: java-dev@lucene.apache.org
Subject: Welcome Chris Male as Contrib committer!

 

I am happy to announce the Lucene PMC has accepted Chris Male as a
contrib committer!
 
Chris has been making a lot of headway in cleaning up the spacial contrib 
lately, 
and hopefully now we can get more of those improvements into svn!
 
Congrats Chris, and welcome!

 

-- 
- Mark
 
http://www.lucidimagination.com
 
 


Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Chris Male
Hi,

Thanks Mark!

All is forgiven Uwe :)

Cheers
Chris

On Fri, Mar 12, 2010 at 3:38 PM, Uwe Schindler u...@thetaphi.de wrote:

  I wish you heavy committing, too. But I meant Chris, sorry :)



 -

 Uwe Schindler

 H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de

 eMail: u...@thetaphi.de



 *From:* Uwe Schindler [mailto:u...@thetaphi.de]
 *Sent:* Friday, March 12, 2010 3:36 PM

 *To:* java-dev@lucene.apache.org
 *Subject:* RE: Welcome Chris Male as Contrib committer!



 Congrats Mark. I wish you heavy committing!



 -

 Uwe Schindler

 H.-H.-Meier-Allee 63, D-28213 Bremen

 http://www.thetaphi.de

 eMail: u...@thetaphi.de



 *From:* Mark Miller [mailto:markrmil...@gmail.com]
 *Sent:* Friday, March 12, 2010 3:17 PM
 *To:* java-dev@lucene.apache.org
 *Subject:* Welcome Chris Male as Contrib committer!



 I am happy to announce the Lucene PMC has accepted Chris Male as a

 contrib committer!



 Chris has been making a lot of headway in cleaning up the spacial contrib 
 lately,

 and hopefully now we can get more of those improvements into svn!



 Congrats Chris, and welcome!



 --

 - Mark



 http://www.lucidimagination.com








-- 
Chris Male | Software Developer | JTeam BV.| www.jteam.nl


[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844523#action_12844523
 ] 

Simon Willnauer commented on LUCENE-2309:
-

bq. Then people could freely use Lucene to index off a foreign analysis chain...
That is what I was talking about!

{quote}
I'd like to donate my two cents here - we've just recently changed the 
TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only 
now the API has changed slightly. The proposals here, w/ the 
AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field 
will call back to IW seems too much complicated to me. Users that write 
Analyzers/TokenStreams/AttributeSources, should not care how they are 
indexed/stored etc. Forcing them to implement this push logic to IW seems to me 
like a real unnecessary overhead and complexity.
{quote}

We can surely hide this implementation completely from field. I consider this 
similar to Collector, which you pass explicitly to the search method if 
you want different behavior. Maybe something like an 
AttributeProducer. I don't think adding this to field makes much sense, 
and it is not worth the complexity.

bq. Will the Field also control how stored fields are added? Or only 
AttributeSourced ones?
IMO this is only about inverted fields.

bq. We (IW) control the indexing flow, and not the user.
The user only gets the ability to exchange the analysis chain, not the 
control flow. The user can already mess around with stuff in incrementToken(); 
the only thing we change / invert is that the indexer no longer knows about 
TokenStreams. It does not change the control flow, though.
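
A small sketch of the Collector-like shape suggested above: the default path needs no extra argument, and an expert caller passes a producer explicitly. SketchAttributeProducer and SketchIndexer are hypothetical names, and whitespace splitting merely stands in for the analyzer / TokenStream based default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical: anything that can push terms into a sink. */
interface SketchAttributeProducer {
  void produce(String fieldValue, Consumer<String> sink);
}

/** Stand-in for the indexer; it never sees how the terms were made. */
final class SketchIndexer {
  final List<String> indexed = new ArrayList<>();

  // Default producer: stands in for the analyzer / TokenStream based path.
  private static final SketchAttributeProducer DEFAULT = (value, sink) -> {
    for (String t : value.split("\\s+")) {
      if (!t.isEmpty()) {
        sink.accept(t);
      }
    }
  };

  /** Normal usage: no producer given, the default is used. */
  void addField(String value) {
    addField(value, DEFAULT);
  }

  /** Expert usage: the caller swaps the producer; the indexer still makes the call. */
  void addField(String value, SketchAttributeProducer producer) {
    producer.produce(value, indexed::add);
  }
}
```

The indexer still decides when a field is inverted; only the source of the terms is exchangeable.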






Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Grant Ingersoll
Congrats!  

Tradition has it, Chris, that you provide a brief intro on yourself upon 
becoming a new committer, so let's hear it!

-Grant

On Mar 12, 2010, at 9:17 AM, Mark Miller wrote:

  I am happy to announce the Lucene PMC has accepted Chris Male as a
 contrib committer!
 
 Chris has been making a lot of headway in cleaning up the spacial contrib 
 lately, 
 and hopefully now we can get more of those improvements into svn!
 
 Congrats Chris, and welcome!
 
 -- 
 - Mark
 
 http://www.lucidimagination.com
 
 



Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Simon Willnauer
Congrats Chris :)

On Fri, Mar 12, 2010 at 3:51 PM, Grant Ingersoll gsing...@apache.org wrote:
 Congrats!
 Tradition has it, Chris, that you provide a brief intro on yourself upon
 becoming a new committer, so let's hear it!
 -Grant
 On Mar 12, 2010, at 9:17 AM, Mark Miller wrote:

 I am happy to announce the Lucene PMC has accepted Chris Male as a
 contrib committer!

 Chris has been making a lot of headway in cleaning up the spacial contrib
 lately,
 and hopefully now we can get more of those improvements into svn!

 Congrats Chris, and welcome!

 --
 - Mark

 http://www.lucidimagination.com








[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844528#action_12844528
 ] 

Uwe Schindler commented on LUCENE-2309:
---

There is one problem that cannot be easily solved (for all proposals here) if we 
want to provide an old-style API that does not require reuse of tokens:
The problem with AttributeProvider is that if we want to support something 
(like rmuir proposed before) that looks like the old Token next(), we need an 
AttributeProvider that passes the AttributeSource to the indexer on each Token! 
And that would lead to lots of getAttribute() calls, which would slow down 
indexing! So with the current APIs we cannot get around the requirement to 
reuse the same Attribute instances during the whole indexing without a major 
speed impact. This can only be solved with my nice BCEL proxy Attributes, so 
you can exchange the inner attribute impl. Or do it like TokenWrapper in 2.9 
(yes, we can reactivate that API somehow as an easy-to-use addendum).
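
To make the cost argument concrete, here is a toy comparison that counts lookups instead of measuring time. CountingAttributeSource and SketchTermAttr are hypothetical stand-ins for an attribute source keyed by class; the sketch only illustrates why a fresh source per token forces one lookup per token, and says nothing about real Lucene timings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SketchTermAttr {
  String term;
}

/** Stand-in for an attribute source: a class-keyed map that counts lookups. */
final class CountingAttributeSource {
  private final Map<Class<?>, Object> attrs = new HashMap<>();
  int lookups = 0;

  <T> void addAttribute(Class<T> type, T instance) { attrs.put(type, instance); }

  <T> T getAttribute(Class<T> type) {
    lookups++;
    return type.cast(attrs.get(type));
  }
}

final class AttributeLookupDemo {
  /** Reused instances: one lookup up front, then only field reads per token. */
  static int lookupsWithReuse(String[] terms, List<String> sink) {
    CountingAttributeSource source = new CountingAttributeSource();
    SketchTermAttr producerSide = new SketchTermAttr();
    source.addAttribute(SketchTermAttr.class, producerSide);
    SketchTermAttr consumerSide = source.getAttribute(SketchTermAttr.class);
    for (String t : terms) {
      producerSide.term = t;       // producer overwrites the shared instance
      sink.add(consumerSide.term); // consumer reads it without a new lookup
    }
    return source.lookups;
  }

  /** Fresh source per token (old next()-style): one lookup for every token. */
  static int lookupsPerToken(String[] terms, List<String> sink) {
    int lookups = 0;
    for (String t : terms) {
      CountingAttributeSource source = new CountingAttributeSource();
      SketchTermAttr attr = new SketchTermAttr();
      attr.term = t;
      source.addAttribute(SketchTermAttr.class, attr);
      sink.add(source.getAttribute(SketchTermAttr.class).term);
      lookups += source.lookups;
    }
    return lookups;
  }
}
```

With n tokens the first variant performs one lookup and the second performs n, which is the overhead the comment above wants to avoid.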




[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers

2010-03-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844533#action_12844533
 ] 

Robert Muir commented on LUCENE-2309:
-

{quote}
So with the current APIs we cannot get around the requirement to reuse the same 
Attribute instances during the whole indexing without a major speed impact.
{quote}

I agree. I guess I'll try to simplify my concern: maybe we don't necessarily 
need something that looks like the old TokenStream API, but I feel it would
be worth our time to think about supporting 'some alternative API' that makes
it easier to work with lots of context across different Tokens.

I personally do not mind how this is done with the capture/restore state API,
but I feel that it's pretty unnatural for many developers, and in the future 
folks might want to do more complex analysis (maybe even light pos-tagging, etc)
that requires said context, and we should plan for this.

I feel this wasn't such an issue with the old TokenStream API, but maybe there
is another way to address this potential problem.
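
A toy illustration of why cross-token context needs an explicit snapshot when attribute instances are reused. SketchState and ContextDemo are hypothetical; capture() only mimics the idea behind the capture/restore state API and is not the Lucene implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the shared, mutable per-stream attribute state. */
final class SketchState {
  String term;
  int startOffset;

  /** Copies the current values so they survive the next token. */
  SketchState capture() {
    SketchState copy = new SketchState();
    copy.term = term;
    copy.startOffset = startOffset;
    return copy;
  }
}

final class ContextDemo {
  /** Emits "previous_current" pairs, so it needs the previous token's values. */
  static List<String> pairs(String[] terms) {
    SketchState shared = new SketchState();
    SketchState previous = null;
    List<String> out = new ArrayList<>();
    int offset = 0;
    for (String t : terms) {
      // The producer overwrites the shared instance in place on every token.
      shared.term = t;
      shared.startOffset = offset;
      offset += t.length() + 1;
      if (previous != null) {
        out.add(previous.term + "_" + shared.term);
      }
      // Without this copy, "previous" would alias "shared" and be overwritten.
      previous = shared.capture();
    }
    return out;
  }
}
```

With the old Token-returning API each token was a separate object, so keeping a reference to the previous one was enough; with reused attributes the copy has to be explicit.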




Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Grant Ingersoll

On Mar 12, 2010, at 10:00 AM, Chris Male wrote:

 Although I live in Amsterdam, I am actually from New Zealand so it feels good 
 to finally have kiwi representation.

+1.  I've always wanted to go there!  I'll have to pick your brain on it next 
time I'm in Amsterdam over a pint.

-Grant



Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Michael McCandless
Welcome aboard Chris!

Mike

On Fri, Mar 12, 2010 at 9:17 AM, Mark Miller markrmil...@gmail.com wrote:
 I am happy to announce the Lucene PMC has accepted Chris Male as a
 contrib committer!

 Chris has been making a lot of headway in cleaning up the spatial contrib
 lately, and hopefully now we can get more of those improvements into svn!

 Congrats Chris, and welcome!

 --
 - Mark

 http://www.lucidimagination.com







Re: Welcome Chris Male as Contrib committer!

2010-03-12 Thread Shalin Shekhar Mangar
Welcome Chris!

On Fri, Mar 12, 2010 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote:

  I am happy to announce the Lucene PMC has accepted Chris Male as a
 contrib committer!

 Chris has been making a lot of headway in cleaning up the spatial contrib
 lately, and hopefully now we can get more of those improvements into svn!

 Congrats Chris, and welcome!


 --
 - Mark
 http://www.lucidimagination.com




-- 
Regards,
Shalin Shekhar Mangar.


[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844578#action_12844578
 ] 

Robert Muir commented on LUCENE-2308:
-

{quote}
details like omitTfAP, omitNorms
{quote}

Personal pet peeve: I wonder if we could consider improving on 'omit' here.
I think things like omit(false), disable(false) are a little awkward.


 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless

 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things
 index or not, stored or not, analyzed or not, details like omitTfAP,
 omitNorms, index term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  EG it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844579#action_12844579
 ] 

Chris Male commented on LUCENE-2308:


So you are thinking more along the lines of indexNorms(true|false)?




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844585#action_12844585
 ] 

Robert Muir commented on LUCENE-2308:
-

bq. So you are thinking more along the lines indexNorms(true|false)? 

Or whatever you come up with that doesn't create double negatives!
But yeah, I think something like that is a little easier... no big deal,
I just figured I would bring it up if this stuff was getting refactored anyway.
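
As a small illustration of the readability point, the sketch below puts the
two naming styles side by side. Both option classes are hypothetical and exist
only for this comparison; neither is an actual or proposed Lucene class.

```java
// Hypothetical option holders, for illustration only.
class NegativeStyleOptions {
  private boolean omitNorms = false;

  void setOmitNorms(boolean omitNorms) { this.omitNorms = omitNorms; }
  boolean getOmitNorms() { return omitNorms; }
}

class PositiveStyleOptions {
  private boolean indexNorms = true;

  void setIndexNorms(boolean indexNorms) { this.indexNorms = indexNorms; }
  boolean indexNorms() { return indexNorms; }
}

class NamingDemo {
  // Returns whether norms end up indexed under each style: { negative style, positive style }.
  static boolean[] normsIndexed() {
    NegativeStyleOptions neg = new NegativeStyleOptions();
    neg.setOmitNorms(false);  // "do not omit norms": the reader has to undo a double negative

    PositiveStyleOptions pos = new PositiveStyleOptions();
    pos.setIndexNorms(true);  // "index norms": says what happens directly

    return new boolean[] { !neg.getOmitNorms(), pos.indexNorms() };
  }
}
```

Both calls configure the same behavior; only the second can be read without
mentally negating the argument.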




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844587#action_12844587
 ] 

Chris Male commented on LUCENE-2308:


I agree entirely.  This is definitely the moment to remove any ambiguity or 
confusion in this API.  I'll make sure to incorporate this idea.




Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Erick Erickson
Congrats Chris!

I vote for thinkAboutNotIncludingNormsMaybe(true|false) <G>.

Seriously, double negatives are ugly IMO; +1 for changing.

Erick

On Fri, Mar 12, 2010 at 12:56 PM, Chris Male (JIRA) j...@apache.org wrote:


[
 https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844587#action_12844587]

 Chris Male commented on LUCENE-2308:
 

 I agree entirely.  This is definitely the moment to remove any ambiguity or
 confusion in this API.  I'll make sure to incorporate this idea.





[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844626#action_12844626
 ] 

Marvin Humphrey commented on LUCENE-2308:
-

I think we might consider matchOnly() instead of omitNorms().  If a field is
match only, we don't need boost bytes a.k.a. norms because they are only
used as a scoring multiplier.

Haven't got a good synonym for omitTFAP, but I'd sure like one.




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844629#action_12844629
 ] 

Shai Erera commented on LUCENE-2308:


How about enable(TYPE/FEATURE) and corresponding disable? So Type/Feature will 
have NORMS, TF, POSITIONS and calls would look like:
f.enable(Type.NORMS), f.disable(Type.TF)?
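
A minimal sketch of what such an API could look like, assuming an enum named
Feature (the comment above leaves the name open as Type/Feature) and a
hypothetical ToyField class; the "everything enabled by default" starting
point is also an assumption made for this example.

```java
import java.util.EnumSet;

// Hypothetical names following the suggestion above; not an existing Lucene API.
enum Feature { NORMS, TF, POSITIONS }

class ToyField {
  // Assumption for this sketch: every feature starts out enabled.
  private final EnumSet<Feature> enabled = EnumSet.allOf(Feature.class);

  ToyField enable(Feature f) {
    enabled.add(f);
    return this;
  }

  ToyField disable(Feature f) {
    enabled.remove(f);
    return this;
  }

  boolean isEnabled(Feature f) {
    return enabled.contains(f);
  }
}
```

A call then reads as new ToyField().disable(Feature.TF).enable(Feature.NORMS).
Note that this sketch does nothing to stop a caller from disabling TF while
leaving POSITIONS enabled.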




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844630#action_12844630
 ] 

Robert Muir commented on LUCENE-2308:
-

Just also to mention (probably too much for this one issue)!

I think it would be nice if OmitTF was separately selectable 
from OmitPositions, as Shai implied. We would have to
actually implement this though, I think!





[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844637#action_12844637
 ] 

Marvin Humphrey commented on LUCENE-2308:
-

If you disable term freq, you also have to disable positions.  The freq 
tells you how many positions there are.

I think it's asking an awful lot of our users to require that they understand
all the implications of posting format modifications when committers 
have difficulty mastering all the subtleties.
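
The dependency between freq and positions can be illustrated with a toy
postings decoder. The flat int layout used here is an assumption made up for
this sketch and is not Lucene's actual file format; it only shows why a reader
cannot find the end of a position list without the stored freq.

```java
import java.util.ArrayList;
import java.util.List;

// Toy postings layout, assumed for illustration only:
//   docId, freq, pos_1 ... pos_freq, docId, freq, pos_1 ... pos_freq, ...
class ToyPostings {
  static final class Posting {
    final int docId;
    final int freq;
    final int[] positions;

    Posting(int docId, int freq, int[] positions) {
      this.docId = docId;
      this.freq = freq;
      this.positions = positions;
    }
  }

  static List<Posting> decode(int[] stream) {
    List<Posting> out = new ArrayList<>();
    int i = 0;
    while (i < stream.length) {
      int docId = stream[i++];
      int freq = stream[i++];  // freq is the only record of how many positions follow
      int[] positions = new int[freq];
      for (int p = 0; p < freq; p++) {
        positions[p] = stream[i++];
      }
      out.add(new Posting(docId, freq, positions));
    }
    return out;
  }
}
```

If the freq entries were dropped from this layout, the decoder would have no
way to tell where one document's positions end and the next docId begins.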




Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Mark Miller
Committers are competent in different areas of the code.  Even Mike
wasn't big into the search side until per segment.  Committers are
trusted to mess with the pieces they know.


I don't see anyone even remotely suggesting that users should have to  
understand all of the implications of posting format modifications.


Just sounds like a nasty jab to me.

- Mark

http://www.lucidimagination.com

On Mar 12, 2010, at 2:43 PM, Marvin Humphrey (JIRA)  
j...@apache.org wrote:




   [ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844637#action_12844637 
 ]


Marvin Humphrey commented on LUCENE-2308:
-

If you disable term freq, you also have to disable positions.  The freq
tells you how many positions there are.

I think it's asking an awful lot of our users to require that they understand
all the implications of posting format modifications when committers
have difficulty mastering all the subtleties.





[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844653#action_12844653
 ] 

Robert Muir commented on LUCENE-2308:
-

{quote}
If you disable term freq, you also have to disable positions. The freq
tells you how many positions there are. 
{quote}

Marvin: as stated, we would have to actually implement this.
There's an issue open for it too: LUCENE-2048.
I was just discussing this with someone the other day.

{quote}
I think it's asking an awful lot of our users to require that they understand
all the implications of posting format modifications when committers
have difficulty mastering all the subtleties.
{quote}

I don't know what I did to piss you off, but I just thought it would be nice,
for completeness, to mention that this feature is still open and it's
something we should think about.





[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844659#action_12844659
 ] 

Marvin Humphrey commented on LUCENE-2308:
-

I'm simply suggesting that the proposed API is too hard to understand.  

Most users know whether their fields can be match-only but have no idea what
TFAP is.  And even advanced users will have difficulty understanding all the
implications for matching and scoring when they selectively disable portions
of the posting format.

I'm not a fan of omitTFAP, omitTF, omitNorms, omitPositions, or omit(flags).
Something that ordinary users can grok would be used more often and more
effectively.




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844661#action_12844661
 ] 

Chris Male commented on LUCENE-2308:


What I covered with Mike earlier was whether FieldType instances would be 
immutable or not.  

If they are, which seems a good idea, then everything will be enabled/disabled 
in the construction of the FieldType, so we would only need to support property 
getter methods.




Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Marvin Humphrey
On Fri, Mar 12, 2010 at 03:01:27PM -0500, Mark Miller wrote:
 Committers are competent in different areas of the code.  Even Mike
 wasn't big into the search side until per segment.  Committers are
 trusted to mess with the pieces they know.

Absolutely.  I wouldn't expect every committer to understand the gory details
of posting formats, and I've been a little caught off guard by the blowback
from what I thought was an innocuous observation.

But by the same token, I wouldn't expect our users to have sufficient
expertise to understand all the variants of omit*() either.  This stuff
oughtta be implementation details.

 I don't see anyone even remotely suggesting that users should have to  
 understand all of the implications of posting format modifications.

That's what omitTFAP() and omitNorms() do, though.  And as Mike pointed out in
the baby steps thread, omitTFAP() is often misunderstood.

Marvin Humphrey





[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844684#action_12844684
 ] 

Michael McCandless commented on LUCENE-2308:


Hmm one challenge with making FieldType immutable is we don't want
a zillion ctors over time.  Also creating a FieldType with args like
new FieldType(true, false, false) isn't really readable.

It would be nice if we could do something similar to IndexWriterConfig
(LUCENE-2294), where you use incremental ctor/setters to set up the
configuration but then once it's used (bound to a Field), it's
immutable.
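
A minimal sketch of the "set it up incrementally, then freeze once bound"
idea, under stated assumptions: SketchFieldType, SketchField and freeze()
are hypothetical names made up for illustration and are not Lucene API.
Setters work until the type is first bound to a field; after that they
throw.

```java
// Hypothetical sketch: SketchFieldType and SketchField are NOT Lucene classes.
final class SketchFieldType {
  private boolean indexed;
  private boolean stored;
  private boolean omitNorms;
  private boolean frozen;

  // Chainable setters; each one refuses to run once the type is frozen.
  SketchFieldType setIndexed(boolean v)   { checkMutable(); indexed = v;   return this; }
  SketchFieldType setStored(boolean v)    { checkMutable(); stored = v;    return this; }
  SketchFieldType setOmitNorms(boolean v) { checkMutable(); omitNorms = v; return this; }

  boolean isIndexed()    { return indexed; }
  boolean isStored()     { return stored; }
  boolean getOmitNorms() { return omitNorms; }
  boolean isFrozen()     { return frozen; }

  // Assumed hook: invoked the first time the type is bound to a field.
  void freeze() { frozen = true; }

  private void checkMutable() {
    if (frozen) {
      throw new IllegalStateException("FieldType is already bound to a Field");
    }
  }
}

final class SketchField {
  final String name;
  final String value;          // the field still holds the actual value
  final SketchFieldType type;  // shared configuration, read-only from here on

  SketchField(String name, String value, SketchFieldType type) {
    type.freeze();             // binding makes the type immutable
    this.name = name;
    this.value = value;
    this.type = type;
  }
}
```

With this shape, new SketchFieldType().setIndexed(true).setStored(true)
reads better than a ctor taking (true, true, false), and a later
setStored(false) on a bound type fails loudly instead of silently changing
every field that shares it.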

I'm torn on naming: yes, search-oriented names like matchOnly are
tempting, but then we really should tease apart termFreq and positions
(they are stuck together now with omitTFAP).  And the two are not
fully independent as Marvin noted -- so maybe we use a cryptic enum
(DOCS, DOCS_TERM_FREQ, DOCS_TERM_FREQ_POSITIONS)?  If we can only find
better names...

I'm not sure we can/should find better index-time names.  What is
stored in the index is relatively independent from how/whether
searches make use of it.  EG if you store termFreq (but not positions)
you can still do match only searching, or, you can do full scoring of
the query.  You can't use positional queries.
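
A rough sketch of how the three-level enum could express this, assuming a
hypothetical PostingsDetail type (the name and the helper methods are
invented here, not proposed API). It only encodes the relationships stated
above: termFreq without positions still allows match-only search and
scoring, but not positional queries.

```java
// Hypothetical sketch of the three-level enum floated above; not Lucene API.
enum PostingsDetail {
  DOCS,                      // doc IDs only
  DOCS_TERM_FREQ,            // doc IDs + term frequencies
  DOCS_TERM_FREQ_POSITIONS;  // doc IDs + term frequencies + positions

  // What is stored in the index.
  boolean hasTermFreq()  { return this != DOCS; }
  boolean hasPositions() { return this == DOCS_TERM_FREQ_POSITIONS; }

  // What searches can do with it.
  boolean supportsMatchOnly()         { return true; }           // doc IDs suffice
  boolean supportsTermFreqScoring()   { return hasTermFreq(); }  // needs termFreq
  boolean supportsPositionalQueries() { return hasPositions(); } // needs positions
}
```

Because the levels are nested, a single enum value cannot describe
"positions without termFreq", which matches the point that the two are
not fully independent.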


 Separately specify a field's type
 -

 Key: LUCENE-2308
 URL: https://issues.apache.org/jira/browse/LUCENE-2308
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless

 This came up from discussions on IRC.  I'm summarizing here...
 Today when you make a Field to add to a document you can set things
 index or not, stored or not, analyzed or not, details like omitTfAP,
 omitNorms, index term vectors (separately controlling
 offsets/positions), etc.
 I think we should factor these out into a new class (FieldType?).
 Then you could re-use this FieldType instance across multiple fields.
 The Field instance would still hold the actual value.
 We could then do per-field analyzers by adding a setAnalyzer on the
 FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise
 for per-field codecs (with flex), where we now have
 PerFieldCodecWrapper).
 This would NOT be a schema!  It's just refactoring what we already
 specify today.  EG it's not serialized into the index.
 This has been discussed before, and I know Michael Busch opened a more
 ambitious (I think?) issue.  I think this is a good first baby step.  We could
 consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold
 off on that for starters...




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844688#action_12844688
 ] 

Marvin Humphrey commented on LUCENE-2308:
-

 Also creating a FieldType with args like
 new FieldType(true, false, false) isn't really readable. 

Agreed.  Another option would be a flags integer and bitwise constants:

{code}
FieldType type = new FieldType(analyzer, FieldType.INDEXED | FieldType.STORED);
{code}

 It would be nice if we could do something similar to IndexWriterConfig
 (LUCENE-2294), where you use incremental ctor/setters to set up the
 configuration but then once it's used (bound to a Field), it's
 immutable.

I bet that'll be more popular than flags, but I thought it was worth
bringing it up anyway. :)
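
For illustration only, here is how the flags variant might look on the
receiving side. FlagFieldType and its constants are hypothetical, and the
analyzer argument from the snippet above is left out to keep the sketch
self-contained.

```java
// Hypothetical sketch of the flags variant; class and constants are illustrative only.
final class FlagFieldType {
  // One bit per property, so they can be combined with bitwise OR.
  static final int INDEXED    = 1;
  static final int STORED     = 1 << 1;
  static final int ANALYZED   = 1 << 2;
  static final int OMIT_NORMS = 1 << 3;

  private final int flags;  // never changes after construction

  FlagFieldType(int flags) {
    this.flags = flags;
  }

  // True if every bit in 'flag' is set on this type.
  boolean has(int flag) {
    return (flags & flag) == flag;
  }
}
```

A caller would write new FlagFieldType(FlagFieldType.INDEXED |
FlagFieldType.STORED), which stays readable without growing the number of
ctors, and the type is immutable for free.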




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844690#action_12844690
 ] 

Earwin Burrfoot commented on LUCENE-2308:
-

I'm strongly against names like 'matchOnly'. They are perfectly fine in some 
'schema' layer over Lucene, but here, in the low-level guts, I'd prefer names 
that clearly state what the hell they do, without forcing me to consult 
javadocs/code.




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844700#action_12844700
 ] 

Yonik Seeley commented on LUCENE-2308:
--

For the non-expert user, it's just a label and won't have much meaning 
regardless of what it's called, and they will need to consult the docs.  Of 
course, if one starts to dig deeper, norms actually does have a physical 
meaning in the index, so preferring a label with norms in it seems completely 
reasonable.

There's also history to consider - when you change the name of something, you 
cut the link to the past in search engines, and in the memories of many 
developers.

As it relates to Solr - I don't care so much since it makes sense for the Solr 
schema to isolate these changes and stick with omitNorms regardless.





[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844702#action_12844702
 ] 

Chris Male commented on LUCENE-2308:


{quote}
It would be nice if we could do something similar to IndexWriterConfig
(LUCENE-2294), where you use incremental ctor/setters to set up the
configuration but then once it's used (bound to a Field), it's
immutable.
{quote}

Yeah, we could use something like a FieldTypeBuilder which could provide a 
fluent interface for specifying each property, which then gets built into an 
immutable FieldType at the end.
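
A small sketch of what such a builder might look like. Only the name
FieldTypeBuilder comes from the comment above; ImmutableFieldType, the
method names and the three properties are assumptions for illustration.

```java
// Hypothetical sketch; neither class exists in Lucene.
final class ImmutableFieldType {
  final boolean indexed;
  final boolean stored;
  final boolean omitNorms;

  ImmutableFieldType(boolean indexed, boolean stored, boolean omitNorms) {
    this.indexed = indexed;
    this.stored = stored;
    this.omitNorms = omitNorms;
  }
}

final class FieldTypeBuilder {
  private boolean indexed;
  private boolean stored;
  private boolean omitNorms;

  // Each method names the property it sets and returns the builder for chaining.
  FieldTypeBuilder indexed(boolean v)   { indexed = v;   return this; }
  FieldTypeBuilder stored(boolean v)    { stored = v;    return this; }
  FieldTypeBuilder omitNorms(boolean v) { omitNorms = v; return this; }

  // Copies the current settings into a new immutable instance.
  ImmutableFieldType build() {
    return new ImmutableFieldType(indexed, stored, omitNorms);
  }
}
```

Since build() copies the values, changing the builder afterwards cannot
affect a FieldType that is already shared by several fields.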




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844707#action_12844707
 ] 

Yonik Seeley commented on LUCENE-2308:
--

I'm not sure if strict immutability is necessary - there's everything in 
between too.
One can simply say that all changes should be made before first use, and after 
that point it's undefined.

Unrelated question: I assume that this would retain the same flexibility as we 
have today... the ability to change FieldType for field foo from one document 
to the next?




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844710#action_12844710
 ] 

Chris Male commented on LUCENE-2308:


{quote}
I'm not sure if strict immutability is necessary - there's everything in 
between too.
One can simply say that all changes should be made before first use, and after 
that point it's undefined.
{quote}

I'm really unsure about this if people are going to be using a FieldType 
instance with multiple Fields.  Perhaps this really is just an edge case.

{quote}
Unrelated question: I assume that this would retain the same flexibility as we 
have today... the ability to change FieldType for field foo from one document 
to the next?
{quote}

Are you wanting to be able to reuse the same Field instance in both documents 
while defining separate FieldTypes? Or is creating new Field instances okay?




[jira] Issue Comment Edited: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844710#action_12844710
 ] 

Chris Male edited comment on LUCENE-2308 at 3/12/10 10:01 PM:
--

{quote}
I'm not sure if strict immutability is necessary - there's everything in 
between too.
One can simply say that all changes should be made before first use, and after 
that point it's undefined.
{quote}

I'm really unsure about this if people are going to be using a FieldType 
instance with multiple Fields.  Perhaps this really is just an edge case though.

{quote}
Unrelated question: I assume that this would retain the same flexibility as we 
have today... the ability to change FieldType for field foo from one document 
to the next?
{quote}

Are you wanting to be able to reuse the same Field instance in both documents 
while defining separate FieldTypes? Or is creating new Field instances okay?




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844716#action_12844716
 ] 

Yonik Seeley commented on LUCENE-2308:
--

bq. I'm really unsure about this if people are going to be using a FieldType 
instance with multiple Fields.

I will, if I can (provided the FieldType does not contain the field name).  
That shouldn't have anything to do with immutability though.

bq. Are you wanting to be able to reuse the same Field instance in both 
documents while defining separate FieldTypes? Or is creating new Field 
instances okay?

new Field instances should be fine - it's not really my use case anyway.  But 
we're designing for the 1000's of use cases that are out there and we should be 
careful about adding new constraints.




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844720#action_12844720
 ] 

Chris Male commented on LUCENE-2308:


{quote}
I will, if I can (provided the FieldType does not contain the field name). That 
shouldn't have anything to do with immutability though.
{quote}

Yeah, the field name will stay inside the Field.  To me the reuse issue relates 
to immutability in that a change to a property in one FieldType after 
construction affects all the Fields that use that type.  

But as you say, if we document that it's best to set everything at instantiation 
and that whatever happens after that is undefined, then I imagine it'll be fine.

{quote}
new Field instances should be fine - it's not really my use case anyway. But 
we're designing for the 1000's of use cases that are out there and we should be 
careful about adding new constraints.
{quote}

Yeah I appreciate that this API will be used in lots of different ways.  Baby 
steps as Mike said :)  But to answer your question, yes the flexibility will 
remain.




[jira] Commented: (LUCENE-2308) Separately specify a field's type

2010-03-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844722#action_12844722
 ] 

Yonik Seeley commented on LUCENE-2308:
--

Of course... given that Fieldable is an interface, one could create an 
implementation that just delegated all the calls like omitNorms to a shared 
instance, except for the value part.  Add a getAnalyzer() method to Fieldable, 
and it's the same thing in the end?
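
To make the delegation idea concrete, a sketch under heavy assumptions:
MiniFieldable is a made-up, cut-down stand-in for Fieldable (the real
interface has many more methods), and SharedSettings and DelegatingField
are hypothetical names. The getAnalyzer() method is omitted so the sketch
has no Lucene dependency.

```java
// Hypothetical, cut-down stand-in for Fieldable; not the real interface.
interface MiniFieldable {
  String name();
  String stringValue();
  boolean isIndexed();
  boolean isStored();
  boolean getOmitNorms();
}

// Settings shared by many fields (plays the role of the proposed FieldType).
final class SharedSettings {
  final boolean indexed;
  final boolean stored;
  final boolean omitNorms;

  SharedSettings(boolean indexed, boolean stored, boolean omitNorms) {
    this.indexed = indexed;
    this.stored = stored;
    this.omitNorms = omitNorms;
  }
}

// Holds only the name and value itself; everything else is delegated.
final class DelegatingField implements MiniFieldable {
  private final String name;
  private final String value;
  private final SharedSettings shared;

  DelegatingField(String name, String value, SharedSettings shared) {
    this.name = name;
    this.value = value;
    this.shared = shared;
  }

  public String name()          { return name; }
  public String stringValue()   { return value; }
  public boolean isIndexed()    { return shared.indexed; }
  public boolean isStored()     { return shared.stored; }
  public boolean getOmitNorms() { return shared.omitNorms; }
}
```

Two DelegatingField instances built from the same SharedSettings report
identical flags while carrying different values, which is the same effect
a separate FieldType would give.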




[jira] Created: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-12 Thread Jason Rutherglen (JIRA)
Search on IndexWriter's RAM Buffer
--

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
 Fix For: 3.0.2


In order to offer users near-realtime search, without incurring
an indexing performance penalty, we can implement search on
IndexWriter's RAM buffer. This is the buffer that is filled in
RAM as documents are indexed. Currently the RAM buffer is
flushed to the underlying directory (usually disk) before being
made searchable. 

Today's Lucene-based NRT systems must incur the cost of merging
segments, which can slow indexing. 

Michael Busch has good suggestions regarding how to handle deletes using max 
doc ids.  
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary,
which needs to be sorted prior to queries executing. Currently
IW implements a specialized hash table. Michael B has a
suggestion here: 
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915







[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-12 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844749#action_12844749
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

In regards to the terms dictionary, keeping it sorted or not, I think it's best 
to sort it on demand because otherwise there will be yet another parameter to 
pass into IW (i.e. sortRAMBufTerms or something like that).  
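
A toy sketch of "sort on demand", not how IW works internally:
RamTermBuffer is a hypothetical class and java.util.HashMap stands in for
IW's specialized hash table. Indexing only touches the hash map; the sorted
view is built lazily when a search asks for it and is dropped on the next
add.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; HashMap stands in for IW's specialized hash table.
final class RamTermBuffer {
  private final Map<String, Integer> termCounts = new HashMap<>();
  private List<String> sortedCache;  // null whenever the buffer has changed

  // Indexing path: a hash update only, no sorting cost.
  void add(String term) {
    termCounts.merge(term, 1, Integer::sum);
    sortedCache = null;  // invalidate the sorted view on every add
  }

  int count(String term) {
    return termCounts.getOrDefault(term, 0);
  }

  // Search path: sort lazily and reuse the result until the next add.
  List<String> sortedTerms() {
    if (sortedCache == null) {
      List<String> terms = new ArrayList<>(termCounts.keySet());
      Collections.sort(terms);
      sortedCache = Collections.unmodifiableList(terms);
    }
    return sortedCache;
  }
}
```

The sort cost is paid only by callers that actually search the buffer, so
no extra IW parameter is needed to turn it on or off.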




Different behavior of Directory.fieldLength()

2010-03-12 Thread Marcelo Ochoa
Hi:
  During some tests of Lucene Domain Index
(http://docs.google.com/View?id=ddgw7sjp_54fgj9kg) with big data
sources, we hit an exception caused by calling the
Directory.fileLength() method on a non-existent file.
  FSDirectory implements this method as:
  /** Returns the length in bytes of a file in the directory. */
  public long fileLength(String name) {
ensureOpen();
File file = new File(directory, name);
return file.length();
  }

  According to JDK1.5 calling to File constructor causes a file
creation without throwing an exception:
http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html#File(java.lang.String,
java.lang.String)
  But either RAMDirectory nor OJVMDirectory do this:
RAMDirectory:
  /** Returns the length in bytes of a file in the directory.
   * @throws IOException if the file does not exist
   */
  public final long fileLength(String name) throws IOException {
    ensureOpen();
    RAMFile file;
    synchronized (this) {
      file = (RAMFile)fileMap.get(name);
    }
    if (file==null)
      throw new FileNotFoundException(name);
    return file.getLength();
  }
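
  To make the difference between the two contracts concrete, here is a
minimal sketch that uses only the JDK. java.io.File.length() is documented
to return 0L for a file that does not exist, while the RAMDirectory code
above throws. The class FileLengthContracts and both method names are
hypothetical, and the in-memory map merely stands in for RAMDirectory's
fileMap; none of this is Lucene code.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; not part of Lucene.
class FileLengthContracts {

  // Lenient contract: java.io.File.length() returns 0L when the file
  // does not exist, so no exception reaches the caller.
  static long lenientLength(File dir, String name) {
    return new File(dir, name).length();
  }

  // Strict contract, modeled on the RAMDirectory snippet above: a missing
  // entry raises FileNotFoundException instead of reporting length 0.
  static long strictLength(Map<String, byte[]> files, String name) throws IOException {
    byte[] data = files.get(name);
    if (data == null) {
      throw new FileNotFoundException(name);
    }
    return data.length;
  }

  public static void main(String[] args) throws IOException {
    File tmp = new File(System.getProperty("java.io.tmpdir"));
    System.out.println(lenientLength(tmp, "definitely-missing-lucene-demo-file.xyz"));

    Map<String, byte[]> files = new HashMap<>();
    files.put("_0.fdx", new byte[8]);
    System.out.println(strictLength(files, "_0.fdx"));
    try {
      strictLength(files, "_0.tii");
    } catch (FileNotFoundException e) {
      System.out.println("missing: " + e.getMessage());
    }
  }
}
```

  A caller that asks for the size of a file that may not exist yet works
with the first contract and fails with the second.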

  When OJVMDirectory throws an exception because a file doesn't exist,
IndexWriter fails to do its job. Here is the log with the stack trace:
IW 3 [Root Thread]: DW:   RAM: now flush @ usedMB=15.001 allocMB=15.001 deletesMB=0 triggerMB=15
IW 3 [Root Thread]:   flush: segment=_0 docStoreSegment=_0 docStoreOffset=0 flushDocs=true flushDeletes=false flushDocStores=false numDocs=109169 numBufDelTerms=0
IW 3 [Root Thread]:   index before flush
IW 3 [Root Thread]: DW: flush postings as segment _0 numDocs=109169
*** 2010-03-11 17:27:15.696
IW 3 [Root Thread]: DW: docWriter: now abort
IW 3 [Root Thread]: hit exception flushing segment _0
IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.tii
IFD [Root Thread]: delete _0.tii
IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.fnm
IFD [Root Thread]: delete _0.fnm
IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.fdx
IFD [Root Thread]: delete _0.fdx
IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.fdt
IFD [Root Thread]: delete _0.fdt
IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.prx
IFD [Root Thread]: delete _0.prx
IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.nrm
IFD [Root Thread]: delete _0.nrm
IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.frq
IFD [Root Thread]: delete _0.frq
IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.tis
IFD [Root Thread]: delete _0.tis
Mar 11, 2010 5:27:15 PM org.apache.lucene.indexer.LuceneDomainIndex ODCIIndexCreate
SEVERE: failed to create index: cannot verify file: _0.fdx. Reason: Exhausted Resultset
Mar 11, 2010 5:27:15 PM org.apache.lucene.indexer.LuceneDomainIndex ODCIIndexCreate
FINER: THROW
java.io.IOException: cannot verify file: _0.fdx. Reason: Exhausted Resultset
    at org.apache.lucene.store.OJVMDirectory.fileLength(OJVMDirectory.java:633)
    at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:271)
    at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:593)
    at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4311)
    at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4209)
    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4200)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2497)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2451)
    at org.apache.lucene.indexer.TableIndexer.index(TableIndexer.java:374)
    at org.apache.lucene.indexer.LuceneDomainIndex.ODCIIndexCreate(LuceneDomainIndex.java:568)
IW 3 [Root Thread]: now flush at close
IW 3 [Root Thread]:   flush: segment=null docStoreSegment=null docStoreOffset=0 flushDocs=false flushDeletes=true flushDocStores=false numDocs=0 numBufDelTerms=0
IW 3 [Root Thread]:   index before flush
IW 3 [Root Thread]: CMS: now merge
IW 3 [Root Thread]: CMS:   index:
IW 3 [Root Thread]: CMS:   no more merges pending; now return
IW 3 [Root Thread]: now call final commit()
IW 3 [Root Thread]: startCommit(): start sizeInBytes=0
IW 3 [Root Thread]: startCommit index= changeCount=1
IW 3 [Root Thread]: done all syncs
IW 3 [Root Thread]: commit: pendingCommit != null
IW 3 [Root Thread]: commit: wrote segments file segments_2
IFD [Root Thread]: now checkpoint segments_2 [0 segments ; isCommit = true]
IFD [Root Thread]: deleteCommits: now decRef commit segments_1
IFD [Root Thread]: delete segments_1
IW 3 [Root Thread]: commit: done
IW 3 [Root Thread]: at close:

   Which is the correct behavior for this method?
   We changed the OJVMDirectory.fileLength() method to return 0 if the
file does not exist instead of throwing an exception, and IndexWriter
works properly,

Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-12 Thread Marvin Humphrey
On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote:
  So there would be polymorphism in the decoding phase while we're supplying
  information the Similarity object needs to make its similarity judgments.
  However, that polymorphism would be handled internally -- it wouldn't be the
  responsibility of the user to determine whether a codec supported a
  particular scoring model.
 
 Is that yes (a user can do MatchOnlySim at search time if the field
 were indexed with B25Sim)?

In essence, yes.  Technically, no.  

Under the covers, doc-id-only postings iteration probably wouldn't be
implemented by spawning a doc-id-only Similarity object.  It would probably be
something more like, ask the Similarity for a PostingDecoder with no extra
attributes.  And then docID-freq-boost postings iteration might be achieved by
asking the Similarity for a PostingDecoder with TermFreq and DocBoost
attributes. 

 How will Lucy know which switchups (Sim at indexing vs Sim at
 searching) are OK...

I think the theme is that each Similarity class will have a whitelist of
supported posting iteration configurations.  So long as the requested config
is in the whitelist, you get an iterator back -- otherwise, you get NULL.

Exactly what form the request specification would take, that's up in the air.
But it would be an implementation detail for now.  So long as the file format
supports the data, we can build an iterator that reads it, regardless of
encoding.
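
To illustrate the whitelist idea, here is a rough sketch in Java (chosen
only for illustration). Every name in it (PostingAttr, PostingDecoder,
WhitelistSimilarity, makeDecoder) is hypothetical; neither Lucy nor Lucene
defines these types, and the real request specification is still undecided.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// All names below are hypothetical; neither Lucy nor Lucene defines them.
enum PostingAttr { TERM_FREQ, DOC_BOOST, POSITIONS }

// Stand-in for an iterator that decodes doc ids plus the requested attributes.
class PostingDecoder {
  final Set<PostingAttr> attrs;

  PostingDecoder(Set<PostingAttr> attrs) {
    this.attrs = attrs;
  }
}

class WhitelistSimilarity {
  // Each entry is one supported posting iteration configuration.
  private final Set<Set<PostingAttr>> supported = new HashSet<>();

  WhitelistSimilarity(Collection<? extends Set<PostingAttr>> configs) {
    supported.addAll(configs);
  }

  // Returns a decoder if the requested configuration is whitelisted, else null.
  PostingDecoder makeDecoder(Set<PostingAttr> requested) {
    return supported.contains(requested) ? new PostingDecoder(requested) : null;
  }
}
```

With a whitelist holding the empty set and {TERM_FREQ, DOC_BOOST}, a
doc-id-only request and a docID-freq-boost request both get a decoder back,
while a request for POSITIONS gets null.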

  Yeah so, I don't like that in Lucene you call Field.setOmitTFAP
  instead of saying Field.matchOnly (or something).  So I do agree
  that it'd be better if the API made it clear what the *search* time
  impact is of using this advanced Field API.
 
  In my opinion, it makes sense to communicate match only by way of the
  Similarity object as opposed to a boolean.  I think it's a good way to
  introduce the Similarity class and get people comfortable with it, and I
  also think that it's good to keep stuff out of the FieldType API when we can.
 
 But say we want to also allow storing tf but not positions, because
 really the two choices should not be coupled (as they are today with
 Lucene's omitTFAP).
 
 So I have omitTF and omitP (only 3 combos are allowed -- must omitP if
 you omitTF).
 
 What Sim do you call that at indexing time?

Well, those are pretty esoteric posting formats.  It's common to not need
scores and therefore not need boost bytes (the Lucene omitNorms case).  It's
also common to not need any matching info beyond doc id (the Lucene omitTFAP
case).  But omitTF and omitP aren't common needs, or Lucene would have them by
now, right?

And since they are infrequently used, Huffman-driven naming philosophy
suggests that they should have long, low-value names: OmitPositionsSimilarity,
OmitTFandPositionsSimilarity (or OmitTFAPSimilarity, which would actually be
an accurate abbreviation in this scenario as opposed to the current Lucene
omitTFAP).

In other words, I don't much care what those are named because they aren't
likely to be used except by people who A) have very, very specific use cases
and B) really know what they're doing.

In contrast, I think it's important that we come up with good names for the
doc-id-tf-positions-but-no-boost-bytes (aka omitNorms) and doc-id-only cases.

  We get users who are baffled that their phrase queries no longer work
  after setting omitTFAP.
 
  This is still a weakness of MatchSimilarity.
 
 Well MatchSimilarity arguably should mean match all queries
 correctly, just don't score them.  Ie, positional queries should in
 fact work... just not receive a score.

Right.  However, now that I've thought about it, if a user indicates that a
field is match-only by supplying a MatchSimilarity, we know that we can
omit boost bytes.  

So we can re-conceive MatchSimilarity as being analogous to omitNorms.
Huzzah!

One down, one to go.  :)

  On the other hand, typical candidates for MatchSimilarity...
 
   * unique_id
   * category
   * tags
 
  ... either won't contain multiple tokens, or won't generally return sensible
  results for phrase queries.
 
 Maybe we need to splinter MatchSim into the two cases.  Whether
 positions are stored, and whether scoring is done, is really
 orthogonal.

Maybe MinimalSimilarity as the analogue for Lucene omitTFAP?  I dunno,
that might be kind of generic, but maybe it makes sense in context.

The idea is to get the user to describe how the field will be scored.  Based on
that info, we can customize the posting format, possibly making optimizations
and omitting certain posting data.  
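
As a rough sketch of how a scoring description could map to posting data,
the options discussed so far can be tabulated like this. The enum and its
field names are hypothetical (Java is used only for illustration); the
mapping follows the thread: MatchSimilarity as the analogue of omitNorms,
MinimalSimilarity as the analogue of omitTFAP.

```java
// Hypothetical names throughout; this only tabulates the options discussed in the thread.
enum SimilarityChoice {
  // Full scoring: boost bytes, term frequencies and positions are all kept.
  DEFAULT(true, true, true),
  // Match-only, analogous to Lucene omitNorms: boost bytes are dropped,
  // but term frequencies and positions stay, so phrase queries still work.
  MATCH(false, true, true),
  // Doc-id-only, analogous to Lucene omitTFAP: nothing but doc ids is stored.
  MINIMAL(false, false, false);

  final boolean storesBoost;
  final boolean storesTermFreq;
  final boolean storesPositions;

  SimilarityChoice(boolean storesBoost, boolean storesTermFreq, boolean storesPositions) {
    this.storesBoost = storesBoost;
    this.storesTermFreq = storesTermFreq;
    this.storesPositions = storesPositions;
  }

  // Phrase queries need positions in the postings.
  boolean supportsPhraseQueries() {
    return storesPositions;
  }
}
```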

When people ask on the user list...

How can I make my index smaller?
   
... we can reply like so:

Make some fields match-only by specifying MatchSimilarity in the
FieldType, or even better if you don't need phrase queries, by specifying
MinimalSimilarity.  You'll be throwing away data Lucy needs for
sophisticated queries, but your index will get smaller.

I think that 

[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-03-12 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844826#action_12844826
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

I set out to implement a simple method, DocumentsWriter.getTerms,
which should return a sorted array of terms over the current RAM
buffer. While I think this can be implemented, there's a lot of
code in the index package to handle multiple threads, which is
fine, except I'm concerned that the interleaving of postings won't
perform well. So I think we'd want to implement what's been
discussed in LUCENE-2293: per-thread RAM buffers. With that
change, implementing this issue could be straightforward.
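
A minimal single-threaded sketch of the sort-on-demand idea follows. The
class RamBufferTerms and its HashMap are hypothetical stand-ins, not
DocumentsWriter's actual term hash, and the sketch ignores the
multi-threaded interleaving that is the real concern here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the RAM buffer's term hash; not Lucene's real data structure.
class RamBufferTerms {
  // term -> number of buffered postings (a placeholder for the real postings data)
  private final Map<String, Integer> termHash = new HashMap<>();

  void add(String term) {
    termHash.merge(term, 1, Integer::sum);
  }

  // Sorts on demand: copies the unordered hash keys and sorts the copy,
  // so indexing itself never pays for keeping terms ordered.
  String[] getTerms() {
    String[] terms = termHash.keySet().toArray(new String[0]);
    Arrays.sort(terms);
    return terms;
  }
}
```

The cost of the sort is paid by whoever asks for the terms, which is the
trade-off against keeping the terms sorted while indexing.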

[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency

2010-03-12 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844828#action_12844828
 ] 

Jason Rutherglen commented on LUCENE-2293:
--

{quote}but does anyone out there wanna work out the private RAM
segments?{quote}

I didn't see this before. I figured private RAM segments were on
the roadmap for this issue, but it sounds like they'll be a
different one?

Mike, can you outline what would need to change? It seems like
large amounts of code could be removed (e.g.
FreqProxFieldMergeState)? The *PerThread classes too? If so, I think
it would go over my head (because I don't have a mental mapping
of how all the classes tie together).

 IndexWriter has hard limit on max concurrency
 -

 Key: LUCENE-2293
 URL: https://issues.apache.org/jira/browse/LUCENE-2293
 Project: Lucene - Java
 Issue Type: Bug
 Components: Index
 Reporter: Michael McCandless
 Assignee: Michael McCandless
 Fix For: 3.1


 DocumentsWriter has this nasty hardwired constant:
 {code}
 private final static int MAX_THREAD_STATE = 5;
 {code}
 which probably I should have attached a //nocommit to the moment I
 wrote it ;)
 That constant sets the max number of thread states to 5.  This means,
 if more than 5 threads enter IndexWriter at once, they will share
 only 5 thread states, meaning we gate CPU concurrency to 5 running
 threads inside IW (each thread must first wait for the last thread to
 finish using the thread state before grabbing it).
 This is bad because modern hardware can make use of more than 5
 threads.  So I think an immediate fix is to make this settable
 (expert), and increase the default (8?).
 It's tricky, though, because the more thread states, the less RAM
 efficiency you have, meaning the worse indexing throughput.  So you
 shouldn't up and set this to 50: you'll be flushing too often.
 But... I think a better fix is to re-think how threads write state
 into DocumentsWriter.  Today, a single docID stream is assigned across
 threads (eg one thread gets docID=0, next one docID=1, etc.), and each
 thread writes to a private RAM buffer (living in the thread state),
 and then on flush we do a merge sort.  The merge sort is inefficient
 (does not currently use a PQ)... and, wasteful because we must
 re-decode every posting byte.
 I think we could change this, so that threads write to private RAM
 buffers, with a private docID stream, but then instead of merging on
 flush, we directly flush each thread as its own segment (and, allocate
 private docIDs to each thread).  We can then leave merging to CMS
 which can already run merges in the BG without blocking ongoing
 indexing (unlike the merge we do in flush, today).
 This would also allow us to separately flush thread states.  Ie, we
 need not flush all thread states at once -- we can flush one when it
 gets too big, and then let the others keep running.  This should be a
 good concurrency gain since it uses IO & CPU resources throughout
 indexing instead of a big burst of CPU only, then a big burst of IO
 only, that we have today (flush today stops the world).
 One downside I can think of is... docIDs would now be less
 monotonic. Today, if N threads are indexing, you'll roughly get
 in-time-order assignment of docIDs.  But with this change, all of one
 thread state would get 0..N docIDs, the next thread state'd get
 N+1...M docIDs, etc.  However, a single thread would still get
 monotonic assignment of docIDs.
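
 For reference, this is roughly what a PQ-based merge of per-thread sorted
 term lists looks like. It is only an illustration of the technique the
 description says the flush-time merge does not use; PerThreadMerge and
 Cursor are hypothetical names, plain strings stand in for postings, and
 none of this is DocumentsWriter code. The proposal above avoids the merge
 on flush entirely by flushing each thread state as its own segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a PQ-based k-way merge; not DocumentsWriter code.
class PerThreadMerge {

  // Cursor over one thread state's sorted term list.
  private static final class Cursor {
    final List<String> terms;
    int pos = 0;

    Cursor(List<String> terms) {
      this.terms = terms;
    }

    String current() {
      return terms.get(pos);
    }
  }

  // Merges k sorted lists in O(n log k) by always popping the cursor
  // whose current term is smallest. Duplicate terms are kept.
  static List<String> merge(List<List<String>> perThreadTerms) {
    PriorityQueue<Cursor> pq =
        new PriorityQueue<>((a, b) -> a.current().compareTo(b.current()));
    for (List<String> terms : perThreadTerms) {
      if (!terms.isEmpty()) {
        pq.add(new Cursor(terms));
      }
    }
    List<String> merged = new ArrayList<>();
    while (!pq.isEmpty()) {
      Cursor c = pq.poll();
      merged.add(c.current());
      c.pos++;
      if (c.pos < c.terms.size()) {
        pq.add(c); // re-insert with its next term as the new key
      }
    }
    return merged;
  }
}
```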
