[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844420#action_12844420 ]

Simon Willnauer commented on LUCENE-2309:

The IndexWriter, or rather DocInverterPerField, is simply an attribute consumer. Neither needs to know about Analyzer or TokenStream at all, and neither needs the analyzer to iterate over tokens. The IndexWriter should instead implement an interface, or use a class, that is called for each successful incrementToken(), no matter how this increment is implemented. I could imagine a really simple interface like

{code}
interface AttributeConsumer {
  void setAttributeSource(AttributeSource src);
  void next();
  void end();
}
{code}

IW would then pass itself or an instance it uses (DocInverterPerField) to an API expecting such a consumer, like:

{code}
field.consume(this);
{code}

or something similar. That way we have no dependency on whatever attribute producer is used. The default implementation is for sure based on an analyzer / tokenstream, and alternatives can be exposed via an expert API. Even backwards compatibility could be solved easily that way.

bq. Only tests would rely on the analyzers module. I think that's OK? core itself would have no dependence.

+1. Test dependencies should not block modularization; it's just about configuring the classpath, though!

Fully decouple IndexWriter from analyzers

Key: LUCENE-2309
URL: https://issues.apache.org/jira/browse/LUCENE-2309
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless

IndexWriter only needs an AttributeSource to do indexing. Yet, today, it interacts with Field instances, holds a private analyzer, invokes analyzer.reusableTokenStream, and has to deal with a wide variety of cases (it's not analyzed; it is analyzed but it's a Reader, String; it's pre-analyzed). I'd like to have IW only interact with attr sources that already arrived with the fields.
This would be a powerful decoupling -- it means others are free to make their own attr sources. They need not even use any of Lucene's analysis impls; eg they can integrate to other things like [OpenPipeline|http://www.openpipeline.org]. Or make something completely custom. LUCENE-2302 is already a big step towards this: it makes IW agnostic about which attr is the term, and only requires that it provide a BytesRef (for flex). Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the FieldType knows the analyzer to use, then we could simply create a getAttrSource() method (say) on it and move all the logic IW has today onto there. (We'd still need existing IW code for back-compat). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
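To make Simon's AttributeConsumer proposal above concrete, here is a minimal sketch of the push-style flow. It deliberately avoids Lucene classes so it runs on its own: TermSource is a hypothetical stand-in for AttributeSource, and WhitespaceField and RecordingConsumer are invented names for illustration only (RecordingConsumer plays the role DocInverterPerField has in the discussion). Treat it as one possible reading of the idea, not as the API that was or will be committed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's AttributeSource: one shared, mutable term slot.
class TermSource {
    String term; // overwritten in place for each token
}

// The consumer interface from the comment, with the stand-in type as parameter.
interface AttributeConsumer {
    void setAttributeSource(TermSource src);
    void next(); // called once per token, after the producer updated the shared state
    void end();  // called once after the last token
}

// Hypothetical producer: pushes tokens without any Analyzer or TokenStream.
class WhitespaceField {
    private final String text;

    WhitespaceField(String text) {
        this.text = text;
    }

    void consume(AttributeConsumer consumer) {
        TermSource src = new TermSource();
        consumer.setAttributeSource(src);
        for (String t : text.trim().split("\\s+")) {
            if (t.isEmpty()) {
                continue; // empty input yields a single empty string
            }
            src.term = t;
            consumer.next();
        }
        consumer.end();
    }
}

// Stand-in for the indexer side: it only records the terms it is handed.
class RecordingConsumer implements AttributeConsumer {
    final List<String> terms = new ArrayList<>();
    boolean ended = false;
    private TermSource src;

    public void setAttributeSource(TermSource src) { this.src = src; }
    public void next() { terms.add(src.term); }
    public void end() { ended = true; }
}

class ConsumerDemo {
    public static void main(String[] args) {
        RecordingConsumer consumer = new RecordingConsumer();
        new WhitespaceField("fully decouple IndexWriter").consume(consumer);
        System.out.println(consumer.terms);
    }
}
```

The consumer never learns how the tokens were produced, which is the decoupling the comment argues for; the cost is that the producer now owns the loop.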
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844450#action_12844450 ]

Michael McCandless commented on LUCENE-2309:

bq. The IndexWriter or rather DocInverterPerField are simply an attribute consumer. None of them needs to know about Analyzer or TokenStream at all. Neither needs the analyzer to iterate over tokens.

[Carrying over discussions on IRC with Chris Male & Uwe...]

Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start...

However, the .reset() method is redundant from the indexer's standpoint -- ie when the indexer calls Field.getTokenStream (say), whatever init'ing / reset'ing is needed should already have been done by that method by the time it returns the TokenStream.

Also, .close and .end are redundant -- seems like we should only have .end (few token streams do anything in .close...). But coalescing those two would be a good chunk of work at this point :) Or maybe we make a .finish that simply does both by default ;)

Finally, the indexer doesn't really need a Document; it just needs something abstract that provides an iterator over all fields that need indexing (and, separately, storing).
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844455#action_12844455 ]

Michael McCandless commented on LUCENE-2294:

Thanks Shai, I'll look...

bq. Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them?

Yes, if they are indeed false positives...

Create IndexWriterConfiguration and store all of IW configuration there

Key: LUCENE-2294
URL: https://issues.apache.org/jira/browse/LUCENE-2294
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 3.1
Attachments: check.py, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch

I would like to factor all of IW's configuration parameters out into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration).

What I was thinking of storing there are the following parameters:
* All of the ctors' parameters, except for Directory.
* The different setters where it makes sense. For example I still think infoStream should be set on IW directly.

I'm thinking that IWC should expose everything via setter/getter methods, and default to whatever IW defaults to today. Except for Analyzer, which will need to be defined in the ctor of IWC and won't have a setter.

I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 10000 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it? I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ...

I plan to deprecate all the ctors and getters/setters and replace them by:
* One ctor as described above
* getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest.
* About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig);
** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2());

BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API.

I'll start to work on a patch.
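The chained-setter idea from the LUCENE-2294 description above could look roughly like the following. This is a self-contained sketch with a hypothetical class name (WriterConfigSketch); it is not the class from the attached patch. The analyzer is modelled as a plain String, and the defaults shown (UNLIMITED field length, 16.0 MB buffer) are illustrative assumptions following the wishes stated in the issue.

```java
// Hypothetical stand-in, not the class from the LUCENE-2294 patch.
final class WriterConfigSketch {
    static final int UNLIMITED = Integer.MAX_VALUE;

    private final String analyzerName;      // fixed in the ctor, no setter (as proposed)
    private int maxFieldLength = UNLIMITED; // the default the issue hopes for
    private double ramBufferSizeMB = 16.0;  // illustrative default only

    WriterConfigSketch(String analyzerName) {
        if (analyzerName == null) {
            throw new IllegalArgumentException("analyzer is required");
        }
        this.analyzerName = analyzerName;
    }

    // Each setter returns this so that calls can be chained.
    WriterConfigSketch setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    WriterConfigSketch setRamBufferSizeMB(double ramBufferSizeMB) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }

    String getAnalyzerName() { return analyzerName; }
    int getMaxFieldLength() { return maxFieldLength; }
    double getRamBufferSizeMB() { return ramBufferSizeMB; }
}

class ConfigDemo {
    public static void main(String[] args) {
        WriterConfigSketch cfg = new WriterConfigSketch("whitespace")
                .setMaxFieldLength(10000)
                .setRamBufferSizeMB(32.0);
        System.out.println(cfg.getAnalyzerName() + " " + cfg.getMaxFieldLength()
                + " " + cfg.getRamBufferSizeMB());
    }
}
```

Making the analyzer a final ctor argument is what enforces the "no setter for Analyzer" rule in this sketch; everything else stays optional with a documented default.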
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844461#action_12844461 ]

Michael McCandless commented on LUCENE-2294:

bq. Note, check.py still alerts on some changes, though I don't see any relevant change in the patch file. Should I ignore them?

Hmm, some of these (at least TestAtomicUpdate, which was changed from Simple -> Whitespace) were in fact real changes. I'll fix & post a new patch.
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844464#action_12844464 ]

Simon Willnauer commented on LUCENE-2309:

bq. [Carrying over discussions on IRC with Chris Male & Uwe...]

That makes it very hard to participate. I cannot afford to read through all the IRC stuff, and I don't get the chance to participate directly unless I watch IRC constantly. We should really move back to JIRA / devlist for such discussions. There is too much going on in IRC.

{quote}
Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start...
{quote}

But that binds the indexer to a TokenStream, which is unnecessary IMO. What if I want to implement something aside from the TokenStream delegator API?
[jira] Updated: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2294:

Attachment: LUCENE-2294.patch

Attached new patch, just fixing a couple of tests where the analyzer had changed. I think it's ready to commit (take 2)! I'll wait a day or two...
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844467#action_12844467 ]

Robert Muir commented on LUCENE-2309:

Hello, I commented yesterday but did not receive much feedback, so I want to elaborate some more.

What I was trying to say in my earlier comment here: https://issues.apache.org/jira/browse/LUCENE-2309?focusedCommentId=12844189&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12844189 is that while I really like the new TokenStream API, I would prefer it if we thought about making this flexible enough to support different paradigms, including perhaps something that looks a lot like the old TokenStream API.

The reason is, I notice a lot of existing code is still written against the old API, and I think that in some cases it is perhaps easier to work with, even if you aren't a new user. I definitely think the old API might have some advantages for newer users.

I think it's useful to consider supporting such an API, perhaps as an extension in contrib/analyzers. Even if it's not as fast or flexible as the new API, the tradeoff in speed and flexibility might be worth the ease of use for newer users.
[jira] Commented: (LUCENE-2310) Reduce Fieldable, AbstractField and Field complexity
[ https://issues.apache.org/jira/browse/LUCENE-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844469#action_12844469 ]

Chris Male commented on LUCENE-2310:

The challenge presented in this work is the pervasiveness of the Fieldable class. It's used in several hundred places throughout the source, but the majority are in tests and in Document itself. Therefore part of this work will also be to move as many of the tests as possible over to using Field, and to work on the Document API as well.

Reduce Fieldable, AbstractField and Field complexity

Key: LUCENE-2310
URL: https://issues.apache.org/jira/browse/LUCENE-2310
Project: Lucene - Java
Issue Type: Sub-task
Components: Index
Reporter: Chris Male

In order to move field-type-like functionality into its own class, we really need to try to tackle the hierarchy of Fieldable, AbstractField and Field. Currently AbstractField depends on Field, and does not provide much more functionality than storing fields, most of which are being moved over to FieldType. Therefore it seems ideal to try to deprecate AbstractField (and possibly Fieldable), moving much of the functionality into Field and FieldType.
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844489#action_12844489 ]

Uwe Schindler commented on LUCENE-2309:

bq. I could imagine a really simple interface like

During lunch an idea evolved: if you look at the current DocInverter code, it does not use a consumer-like API. The code just has an add/accept method that accepts tokens. The idea is, as Simon proposed, to let the DocInverter implement something like AttributeAcceptor.

But we must still have the attribute API, and the acceptor (DocInverter) must always see the same attribute instances (otherwise, if the accept method took an AttributeSource, much time would be spent calling getAttribute(...) for each token).

The current TokenStream API could get a method taking an AttributeAcceptor and simply run a while (incrementToken()) loop, calling accept() on the DocInverter (the AttributeAcceptor). Another approach for users would be to not use the TokenStream API at all and simply call the accept() method for each token.
[jira] Issue Comment Edited: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844489#action_12844489 ]

Uwe Schindler edited comment on LUCENE-2309 at 3/12/10 1:25 PM:

bq. I could imagine a really simple interface like

During lunch an idea evolved: if you look at the current DocInverter code, it does not use a consumer-like API. The code just has an add/accept method that accepts tokens. The idea is, as Simon proposed, to let the DocInverter implement something like AttributeAcceptor.

But we must still have the attribute API, and the acceptor (DocInverter) must always see the same attribute instances (otherwise, if the accept method took an AttributeSource, much time would be spent calling getAttribute(...) for each token).

The current TokenStream API could get a method taking an AttributeAcceptor and simply run a while (incrementToken()) loop, calling accept() on the DocInverter (the AttributeAcceptor). Another approach for users would be to not use the TokenStream API at all and simply call the accept() method for each token on the acceptor.

But both approaches still have the problem with the shared attributes. If you want to record tokens, you have to implement something like my Proxy attributes. Otherwise (as mentioned above), most time would be spent in getAttribute() calls.
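Uwe's point about shared attribute instances can be illustrated with a small sketch: the acceptor resolves its attribute once, and every later accept() call reads the same instance, so no per-token lookup is needed. All names here (TermAttr, AttrRegistry, CountingInverter) are hypothetical stand-ins and not Lucene classes; AttrRegistry only imitates the lookup role of AttributeSource and counts how often it is queried.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical shared attribute: the producer overwrites it for every token.
final class TermAttr {
    String term;
}

// Hypothetical stand-in for AttributeSource that counts attribute lookups.
final class AttrRegistry {
    private final Map<Class<?>, Object> attrs = new HashMap<>();
    int lookups = 0;

    <T> void add(Class<T> type, T instance) {
        attrs.put(type, instance);
    }

    <T> T getAttribute(Class<T> type) {
        lookups++;
        return type.cast(attrs.get(type));
    }
}

// The acceptor idea from the comment: accept() takes no arguments.
interface AttributeAcceptor {
    void accept();
}

// Stand-in for DocInverter: resolves the attribute once, then reads it per token.
final class CountingInverter implements AttributeAcceptor {
    private final TermAttr term;
    final Map<String, Integer> freqs = new TreeMap<>();

    CountingInverter(AttrRegistry registry) {
        this.term = registry.getAttribute(TermAttr.class); // the only lookup
    }

    public void accept() {
        freqs.merge(term.term, 1, Integer::sum);
    }
}

class AcceptorDemo {
    public static void main(String[] args) {
        AttrRegistry registry = new AttrRegistry();
        TermAttr shared = new TermAttr();
        registry.add(TermAttr.class, shared);
        CountingInverter inverter = new CountingInverter(registry);
        for (String t : new String[] {"a", "b", "a"}) {
            shared.term = t;   // producer updates the shared instance
            inverter.accept(); // acceptor reads it without another lookup
        }
        System.out.println(inverter.freqs + " lookups=" + registry.lookups);
    }
}
```

The same sharing is what makes recording tokens awkward: anything that wants to keep a token must copy the state out of the shared instance before the next one overwrites it.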
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844498#action_12844498 ]

Michael McCandless commented on LUCENE-2309:

bq. The idea is to, as Simon proposed, let the docinverter implement something like AttributeAcceptor.

This is interesting! It inverts the stack/control flow, but would continue to use shared attrs.

So then somehow the indexer would pass its AttrAcceptor to the field? And the field would have whatever control logic it wants to feed the tokens...
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844500#action_12844500 ]

Michael McCandless commented on LUCENE-2309:

{quote}
bq. Actually, TokenStream is already AttrSource + incrementing, so it seems like the right start...

But that binds the Indexer to a tokenstream which is unnecessary IMO. What if I want to implement something aside the TokenStream delegator API?
{quote}

True, but we need at least some way to increment? AttrSource doesn't have that. But I don't think we need reset or close from TokenStream. Maybe we could factor out an abstract class / interface that TokenStream implements, minus the reset and close methods? Then people could freely use Lucene to index off a foreign analysis chain...

Fully decouple IndexWriter from analyzers
Key: LUCENE-2309
URL: https://issues.apache.org/jira/browse/LUCENE-2309
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless

IndexWriter only needs an AttributeSource to do indexing. Yet, today, it interacts with Field instances, holds a private analyzer, invokes analyzer.reusableTokenStream, and has to deal with a wide variety of cases (it's not analyzed; it is analyzed but it's a Reader, String; it's pre-analyzed). I'd like to have IW only interact with attr sources that already arrived with the fields.

This would be a powerful decoupling -- it means others are free to make their own attr sources. They need not even use any of Lucene's analysis impls; eg they can integrate to other things like [OpenPipeline|http://www.openpipeline.org]. Or make something completely custom.

LUCENE-2302 is already a big step towards this: it makes IW agnostic about which attr is the term, and only requires that it provide a BytesRef (for flex). Then I think LUCENE-2308 would get us most of the remaining way -- ie, if the FieldType knows the analyzer to use, then we could simply create a getAttrSource() method (say) on it and move all the logic IW has today onto there. (We'd still need existing IW code for back-compat).

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
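The idea in the comment above (a contract that TokenStream implements, minus reset and close) might look roughly like the sketch below. Everything in it is hypothetical and self-contained: `TermSlot` merely stands in for an AttributeSource, and `IncrementingSource`, `WhitespaceSource` and `MiniIndexer` are illustrative names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for an AttributeSource; it holds only a term. */
final class TermSlot {
    String term;
}

/** Hypothetical minimal contract: attributes plus increment, no reset() or close(). */
interface IncrementingSource {
    TermSlot attributes();
    boolean incrementToken();
    void end();
}

/** A "foreign" analysis chain: plain whitespace splitting, no Analyzer involved. */
final class WhitespaceSource implements IncrementingSource {
    private final TermSlot slot = new TermSlot();
    private final Iterator<String> parts;

    WhitespaceSource(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        this.parts = tokens.iterator();
    }

    @Override public TermSlot attributes() { return slot; }

    @Override public boolean incrementToken() {
        if (!parts.hasNext()) {
            return false;
        }
        slot.term = parts.next();   // overwrite the shared instance in place
        return true;
    }

    @Override public void end() {
        slot.term = null;           // nothing else to finalize in this sketch
    }
}

/** Stand-in for the indexer: it depends only on IncrementingSource. */
final class MiniIndexer {
    static List<String> invert(IncrementingSource source) {
        TermSlot slot = source.attributes();   // fetched once, reused per token
        List<String> terms = new ArrayList<>();
        while (source.incrementToken()) {
            terms.add(slot.term);
        }
        source.end();
        return terms;
    }
}
```

The indexer side never calls reset() or close(); under this assumption those would stay the responsibility of whoever created the source.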
Welcome Chris Male as Contrib committer!
I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spatial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844509#action_12844509 ]

Shai Erera commented on LUCENE-2309:

bq. We should really move back to JIRA / devlist for such discussions

+1 !! I also find it very hard to track so many sources of discussions (JIRA, java-dev, java-user, general, and now IRC). Also IRC is not logged/archived and searchable (I think?), which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google.

I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field will call back to IW seems too complicated to me. Users that write Analyzers/TokenStreams/AttributeSources should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like a real unnecessary overhead and complexity.

And having the Field control the flow of indexing also seems dangerous ... it might expose Lucene to lots of bugs by users. Today, when IW controls it, there is one place to look, but tomorrow, when Field controls it, where do we look? In the app's custom Field code? In IW's atts consuming methods? Will the Field also control how stored fields are added? Or only AttributeSourced ones?

Maybe I need to get used to this change, but currently it looks wrong to reverse the control flow. Maybe in principle the DocInverter now accepts tokens from IW, and so it looks as if we can pass it to the Field (as IW's AttAcceptor), but still the concept is different. We (IW) control the indexing flow, and not the user.

I also may not understand what that will give to users. Shouldn't users get enough flexibility w/ the current API and the Flex (once out) stuff? Do they really need to be bothered w/ pushing tokens to IW?
[jira] Commented: (LUCENE-2294) Create IndexWriterConfiguration and store all of IW configuration there
[ https://issues.apache.org/jira/browse/LUCENE-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844511#action_12844511 ]

Shai Erera commented on LUCENE-2294:

Thanks Mike. I ran the tool once and fixed everything it complained about. Then the 2nd time it found some more (probably some I missed in the 1st pass), only this time really few. So I fixed them as well. But I didn't run it a 3rd time :) ... I can't wait for this to be in ... an exhausting issue ;).

Create IndexWriterConfiguration and store all of IW configuration there
Key: LUCENE-2294
URL: https://issues.apache.org/jira/browse/LUCENE-2294
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 3.1
Attachments: check.py, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch, LUCENE-2294.patch

I would like to factor all IW configuration parameters out into a single configuration class, which I propose to name IndexWriterConfiguration (or IndexWriterConfig). I want to store there almost everything besides the Directory, and to reduce all the ctors down to one: IndexWriter(Directory, IndexWriterConfiguration). What I was thinking of storing there are the following parameters:
* All of the ctor parameters, except for Directory.
* The different setters where it makes sense. For example I still think infoStream should be set on IW directly.

I'm thinking that IWC should expose everything via setter/getter methods, and default to whatever IW defaults to today. Except for Analyzer, which will need to be defined in the ctor of IWC and won't have a setter.

I am not sure why MaxFieldLength is required in all IW ctors, yet IW declares a DEFAULT (which is an int and not MaxFieldLength). Do we still think that 1 should be the default? Why not default to UNLIMITED and otherwise let the application decide what LIMITED means for it?
I would like to make MFL optional on IWC and default to something, and I hope that default will be UNLIMITED. We can document that on IWC, so that if anyone chooses to move to the new API, he should be aware of that ... I plan to deprecate all the ctors and getters/setters and replace them by: * One ctor as described above * getIndexWriterConfiguration, or simply getConfig, which can then be queried for the setting of interest. * About the setters, I think maybe we can just introduce a setConfig method which will override everything that is overridable today, except for Analyzer. So someone could do iw.getConfig().setSomething(); iw.setConfig(newConfig); ** The setters on IWC can return an IWC to allow chaining set calls ... so the above will turn into iw.setConfig(iw.getConfig().setSomething1().setSomething2()); BTW, this is needed for Parallel Indexing (see LUCENE-1879), but I think it will greatly simplify IW's API. I'll start to work on a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
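The single-ctor and chained-setter proposal in the description above could be sketched as follows. This is only an illustration under stated assumptions: `SketchWriterConfig` and `SketchWriter` are hypothetical stand-ins, the analyzer and directory are plain strings, and the RAM-buffer setting and its default value are made up for the example rather than taken from any patch.

```java
/** Hypothetical config object; the names follow the proposal, not a released API. */
final class SketchWriterConfig {
    static final int UNLIMITED = Integer.MAX_VALUE;

    private final String analyzerName;        // stand-in for an Analyzer: ctor only, no setter
    private int maxFieldLength = UNLIMITED;   // the default the description argues for
    private double ramBufferSizeMB = 16.0;    // illustrative default only

    SketchWriterConfig(String analyzerName) {
        this.analyzerName = analyzerName;
    }

    /** Each setter returns this so that calls can be chained. */
    SketchWriterConfig setMaxFieldLength(int maxFieldLength) {
        this.maxFieldLength = maxFieldLength;
        return this;
    }

    SketchWriterConfig setRamBufferSizeMB(double ramBufferSizeMB) {
        this.ramBufferSizeMB = ramBufferSizeMB;
        return this;
    }

    String getAnalyzerName() { return analyzerName; }
    int getMaxFieldLength() { return maxFieldLength; }
    double getRamBufferSizeMB() { return ramBufferSizeMB; }
}

/** Hypothetical writer with the single proposed ctor shape: (directory, config). */
final class SketchWriter {
    private final String directory;           // stand-in for a Directory
    private SketchWriterConfig config;

    SketchWriter(String directory, SketchWriterConfig config) {
        this.directory = directory;
        this.config = config;
    }

    String getDirectory() { return directory; }
    SketchWriterConfig getConfig() { return config; }
    void setConfig(SketchWriterConfig config) { this.config = config; }
}
```

With this shape, the chained form from the description reads as `writer.setConfig(writer.getConfig().setMaxFieldLength(1000).setRamBufferSizeMB(32.0))`.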
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844515#action_12844515 ]

Uwe Schindler commented on LUCENE-2309:

bq. I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field will call back to IW seems too complicated to me. Users that write Analyzers/TokenStreams/AttributeSources should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like a real unnecessary overhead and complexity.

The idea was not to change this behaviour, but to also give the user the possibility to reverse it. For some tokenstreams it would simplify things a lot. The current IndexWriter code works exactly like this:
# DocInverter gets TokenStream
# DocInverter calls reset() -- to be left out and moved to field/analyzer
# DocInverter does a while-loop on incrementToken - it iterates. On each Token it calls add() on the field consumer
# DocInverter calls end() and updates end offset
# DocInverter calls close() -- to be left out and moved to field/analyzer

The change is simply that step (3) is removed from DocInverter, which then only provides the add() method for accepting Tokens. The current while loop would simply be done in the TokenStream/Field code, so nobody needs to change their code. But somebody who actively wants to push tokens can now do this. If they want to do this currently, they have no chance without heavy buffering. So the push API will be very expert, and the current TokenStream is just one user of this API.
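A rough, self-contained sketch of the inversion described above: the acceptor only receives tokens, a default adapter keeps the familiar while loop outside the indexer, and an expert producer pushes tokens directly as they arrive. All names here (`PushTermSlot`, `TokenAcceptor`, `CollectingInverter`, `PullToPushAdapter`, `EventDrivenProducer`) are hypothetical and are not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for an AttributeSource; it holds only a term. */
final class PushTermSlot {
    String term;
}

/** Hypothetical acceptor, modeled on the consumer idea discussed in this thread. */
interface TokenAcceptor {
    void setAttributeSource(PushTermSlot source);
    void add();   // called once per token, after the slot has been updated
    void end();
}

/** Stand-in for DocInverter: it no longer owns the loop, it only accepts tokens. */
final class CollectingInverter implements TokenAcceptor {
    final List<String> terms = new ArrayList<>();
    boolean ended;
    private PushTermSlot source;

    @Override public void setAttributeSource(PushTermSlot source) { this.source = source; }
    @Override public void add() { terms.add(source.term); }
    @Override public void end() { ended = true; }
}

/** Default producer: the old while loop, moved out of the indexer. */
final class PullToPushAdapter {
    static void consume(Iterator<String> tokens, TokenAcceptor acceptor) {
        PushTermSlot slot = new PushTermSlot();
        acceptor.setAttributeSource(slot);
        while (tokens.hasNext()) {
            slot.term = tokens.next();
            acceptor.add();
        }
        acceptor.end();
    }
}

/** Expert producer: pushes each token as an event arrives, with no buffering. */
final class EventDrivenProducer {
    private final PushTermSlot slot = new PushTermSlot();
    private final TokenAcceptor acceptor;

    EventDrivenProducer(TokenAcceptor acceptor) {
        this.acceptor = acceptor;
        acceptor.setAttributeSource(slot);
    }

    void onWord(String word) {
        slot.term = word;
        acceptor.add();
    }

    void finish() {
        acceptor.end();
    }
}
```

Existing pull-style code would go through something like `PullToPushAdapter` unchanged, while a callback-driven source (for example a streaming parser) could use the push path without collecting its tokens first.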
Re: Welcome Chris Male as Contrib committer!
Congratulations! On Fri, Mar 12, 2010 at 9:17 AM, Mark Miller markrmil...@gmail.com wrote: I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spacial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com -- Robert Muir rcm...@gmail.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844516#action_12844516 ]

Mark Miller commented on LUCENE-2309:

bq. Also IRC is not logged/archived and searchable (I think?) which makes it impossible to trace back a discussion, and/or randomly stumble upon it in Google.

Apache's rule is, if it didn't happen on the lists, it didn't happen. IRC is a great way for people to communicate and hash stuff out, but it's not necessary that you follow it. If you have questions or want further elaboration, just ask. No one can expect you to follow IRC, nor is it a valid reference for where something was decided. IRC is great - I think it's really benefited having devs discuss there - but the official position is, if it didn't happen on the list, it didn't actually happen.
[jira] Resolved: (LUCENE-2015) ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2015.
Resolution: Fixed

Committed revision 922277. Thanks Cédrik!

ASCIIFoldingFilter: expose folding logic + small improvements to ISOLatin1AccentFilter
Key: LUCENE-2015
URL: https://issues.apache.org/jira/browse/LUCENE-2015
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Reporter: Cédrik LIME
Assignee: Robert Muir
Priority: Minor
Fix For: 3.1
Attachments: ASCIIFoldingFilter-no_formatting.patch, ASCIIFoldingFilter-no_formatting.patch, Filters.patch, ISOLatin1AccentFilter.patch, LUCENE-2015.patch, LUCENE-2015.patch

This patch adds a couple of non-ascii chars to ISOLatin1AccentFilter (namely: left and right single quotation marks, en dash, em dash) which we very frequently encounter in our projects. I know that this class is now deprecated; this improvement is for legacy code that hasn't migrated yet. It also enables easy access to the ascii folding technique used in ASCIIFoldingFilter for potential re-use in non-Lucene-related code.
RE: Welcome Chris Male as Contrib committer!
Congrats Mark. I wish you heavy committing! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de/ http://www.thetaphi.de eMail: u...@thetaphi.de From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Friday, March 12, 2010 3:17 PM To: java-dev@lucene.apache.org Subject: Welcome Chris Male as Contrib committer! I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spacial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com
RE: Welcome Chris Male as Contrib committer!
Congrats Chris. I wish you heavy committing! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de/ http://www.thetaphi.de eMail: u...@thetaphi.de From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Friday, March 12, 2010 3:17 PM To: java-dev@lucene.apache.org Subject: Welcome Chris Male as Contrib committer! I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spacial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com
RE: Welcome Chris Male as Contrib committer!
I wish you heavy committing, too. But I meant Chris, sorry :-) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Friday, March 12, 2010 3:36 PM To: java-dev@lucene.apache.org Subject: RE: Welcome Chris Male as Contrib committer! Congrats Mark. I wish you heavy committing! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Friday, March 12, 2010 3:17 PM To: java-dev@lucene.apache.org Subject: Welcome Chris Male as Contrib committer! I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spatial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com
Re: Welcome Chris Male as Contrib committer!
Hi, Thanks Mark! All is forgiven Uwe :) Cheers Chris On Fri, Mar 12, 2010 at 3:38 PM, Uwe Schindler u...@thetaphi.de wrote: I wish you heavy committing, too. But I meant Chris, sorry J - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de *From:* Uwe Schindler [mailto:u...@thetaphi.de] *Sent:* Friday, March 12, 2010 3:36 PM *To:* java-dev@lucene.apache.org *Subject:* RE: Welcome Chris Male as Contrib committer! Congrats Mark. I wish you heavy committing! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de *From:* Mark Miller [mailto:markrmil...@gmail.com] *Sent:* Friday, March 12, 2010 3:17 PM *To:* java-dev@lucene.apache.org *Subject:* Welcome Chris Male as Contrib committer! I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spacial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com -- Chris Male | Software Developer | JTeam BV.| www.jteam.nl
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844523#action_12844523 ]

Simon Willnauer commented on LUCENE-2309:

bq. Then people could freely use Lucene to index off a foreign analysis chain...

That is what I was talking about!

{quote}
I'd like to donate my two cents here - we've just recently changed the TokenStream API, but we still kept its concept - i.e. IW consumes tokens, only now the API has changed slightly. The proposals here, w/ the AttConsumer/Acceptor, that IW will delegate itself to a Field, so the Field will call back to IW seems too complicated to me. Users that write Analyzers/TokenStreams/AttributeSources should not care how they are indexed/stored etc. Forcing them to implement this push logic to IW seems to me like a real unnecessary overhead and complexity.
{quote}

We can surely hide this implementation completely from field. I consider this similar to Collector, where you pass it explicitly to the search method if you want different behavior. Maybe something like an AttributeProducer. I don't think adding this to field makes a lot of sense at all, and it is not worth the complexity.

bq. Will the Field also control how stored fields are added? Or only AttributeSourced ones?

IMO this is only about inverted fields.

bq. We (IW) control the indexing flow, and not the user.

The user only gets the possibility to exchange the analysis chain, but not the control flow. The user can already mess around with stuff in incrementToken(); the only thing we change / invert is that the indexer does not know about TokenStreams anymore. It does not change the control flow though.
Re: Welcome Chris Male as Contrib committer!
Congrats! Tradition has it, Chris, that you provide a brief intro on yourself upon becoming a new committer, so let's hear it! -Grant On Mar 12, 2010, at 9:17 AM, Mark Miller wrote: I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spacial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com
Re: Welcome Chris Male as Contrib committer!
Congrats Chris :) On Fri, Mar 12, 2010 at 3:51 PM, Grant Ingersoll gsing...@apache.org wrote: Congrats! Tradition has it, Chris, that you provide a brief intro on yourself upon becoming a new committer, so let's hear it! -Grant On Mar 12, 2010, at 9:17 AM, Mark Miller wrote: I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spacial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844528#action_12844528 ]

Uwe Schindler commented on LUCENE-2309:

There is one problem that cannot be easily solved (for all proposals here) if we want to provide an old-style API that does not require reuse of tokens: the problem with AttributeProvider is that if we want to support something (like rmuir proposed before) that looks like the old Token next(), we need an AttributeProvider that passes the AttributeSource to the indexer on each Token! And that would lead to lots of getAttribute() calls, which would slow down indexing!

So with the current APIs we cannot get around the requirement to reuse the same Attribute instances during the whole indexing without a major speed impact. This can only be solved with my nice BCEL proxy Attributes, so you can exchange the inner attribute impl. Or do it like TokenWrapper in 2.9 (yes, we can reactivate that API somehow as an easy-use addendum).
[jira] Commented: (LUCENE-2309) Fully decouple IndexWriter from analyzers
[ https://issues.apache.org/jira/browse/LUCENE-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844533#action_12844533 ]

Robert Muir commented on LUCENE-2309:

{quote}
So with the current APIs we cannot get around the requirement to reuse the same Attribute instances during the whole indexing without a major speed impact.
{quote}

I agree. I guess I'll try to simplify my concern: maybe we don't necessarily need something that looks like the old TokenStream API, but I feel it would be worth our time to think about supporting 'some alternative API' that makes it easier to work with lots of context across different Tokens.

I personally do not mind how this is done with the capture/restore state API, but I feel that it's pretty unnatural for many developers, and in the future folks might want to do more complex analysis (maybe even light pos-tagging, etc) that requires said context, and we should plan for this. I feel this wasn't such an issue with the old TokenStream API, but maybe there is another way to address this potential problem.
Re: Welcome Chris Male as Contrib committer!
On Mar 12, 2010, at 10:00 AM, Chris Male wrote: Although I live in Amsterdam, I am actually from New Zealand so it feels good to finally have kiwi representation. +1. I've always wanted to go there! I'll have to pick your brain on it next time I'm in Amsterdam over a pint. -Grant - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Welcome Chris Male as Contrib committer!
Welcome aboard Chris! Mike On Fri, Mar 12, 2010 at 9:17 AM, Mark Miller markrmil...@gmail.com wrote: I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spacial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Welcome Chris Male as Contrib committer!
Welcome Chris! On Fri, Mar 12, 2010 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote: I am happy to announce the Lucene PMC has accepted Chris Male as a contrib committer! Chris has been making a lot of headway in cleaning up the spacial contrib lately, and hopefully now we can get more of those improvements into svn! Congrats Chris, and welcome! -- - Mark http://www.lucidimagination.com -- Regards, Shalin Shekhar Mangar.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844578#action_12844578 ]

Robert Muir commented on LUCENE-2308:

{quote}
details like omitTfAP, omitNorms
{quote}

personal pet peeve, i wonder if we could consider improving on 'omit' here, I think things like omit(false), disable(false) are a little awkward.

Separately specify a field's type
Key: LUCENE-2308
URL: https://issues.apache.org/jira/browse/LUCENE-2308
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless

This came up from discussions on IRC. I'm summarizing here...

Today when you make a Field to add to a document you can set things like index or not, stored or not, analyzed or not, details like omitTfAP, omitNorms, index term vectors (separately controlling offsets/positions), etc. I think we should factor these out into a new class (FieldType?). Then you could re-use this FieldType instance across multiple fields. The Field instance would still hold the actual value.

We could then do per-field analyzers by adding a setAnalyzer on the FieldType, instead of the separate PerFieldAnalyzerWrapper (likewise for per-field codecs (with flex), where we now have PerFieldCodecWrapper).

This would NOT be a schema! It's just refactoring what we already specify today. EG it's not serialized into the index. This has been discussed before, and I know Michael Busch opened a more ambitious (I think?) issue. I think this is a good first baby step. We could consider a hierarchy of FieldType (NumericFieldType, etc.) but maybe hold off on that for starters...
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844579#action_12844579 ]
Chris Male commented on LUCENE-2308:
So you are thinking more along the lines indexNorms(true|false)?
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844585#action_12844585 ]
Robert Muir commented on LUCENE-2308:
bq. So you are thinking more along the lines indexNorms(true|false)?
or whatever you come up with, that doesn't create double-negatives! but yeah, i think something like that is a little easier... no big deal just figured I would bring it up if this stuff was getting refactored anyway
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844587#action_12844587 ]
Chris Male commented on LUCENE-2308:
I agree entirely. This is definitely the moment to remove any ambiguity or confusion in this API. I'll make sure to incorporate this idea.
Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type
Congrats Chris! I vote for thinkAboutNotIncludingNormsMaybe(true|false) G. Seriously, double negatives are ugly IMO, +1 for changing.
Erick
On Fri, Mar 12, 2010 at 12:56 PM, Chris Male (JIRA) j...@apache.org wrote:
> [ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844587#action_12844587]
> Chris Male commented on LUCENE-2308:
> I agree entirely. This is definitely the moment to remove any ambiguity or confusion in this API. I'll make sure to incorporate this idea.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844626#action_12844626 ]
Marvin Humphrey commented on LUCENE-2308:
I think we might consider matchOnly() instead of omitNorms(). If a field is match only, we don't need boost bytes a.k.a. norms because they are only used as a scoring multiplier. Haven't got a good synonym for omitTFAP, but I'd sure like one.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844629#action_12844629 ]
Shai Erera commented on LUCENE-2308:
How about enable(TYPE/FEATURE) and corresponding disable? So Type/Feature will have NORMS, TF, POSITIONS and calls would look like: f.enable(Type.NORMS), f.disable(Type.TF)?
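One way the enable/disable suggestion could look, as a minimal sketch. The enum name `Feature`, the class `FeatureField`, the `isEnabled` getter, and the choice to start with everything enabled are all assumptions made for illustration; only the constants NORMS, TF, POSITIONS and the enable/disable call shape come from the comment above.

```java
import java.util.EnumSet;

// Hypothetical sketch: names other than NORMS, TF, POSITIONS, enable and disable are assumptions.
enum Feature { NORMS, TF, POSITIONS }

final class FeatureField {
    // Assumption for this sketch: every feature starts out enabled.
    private final EnumSet<Feature> enabled = EnumSet.allOf(Feature.class);

    /** Turns a feature on and returns this so calls can be chained. */
    FeatureField enable(Feature feature) {
        enabled.add(feature);
        return this;
    }

    /** Turns a feature off and returns this so calls can be chained. */
    FeatureField disable(Feature feature) {
        enabled.remove(feature);
        return this;
    }

    boolean isEnabled(Feature feature) {
        return enabled.contains(feature);
    }
}

final class FeatureFieldDemo {
    public static void main(String[] args) {
        FeatureField f = new FeatureField().disable(Feature.TF);
        System.out.println("TF enabled: " + f.isEnabled(Feature.TF));
        System.out.println("NORMS enabled: " + f.isEnabled(Feature.NORMS));
    }
}
```

Positive names with an explicit enable/disable verb avoid the double negative of calls such as omit(false). This sketch does not enforce any dependency between TF and POSITIONS.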
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844630#action_12844630 ]
Robert Muir commented on LUCENE-2308:
Just also to mention (probably too much for this one issue)! I think it would be nice if OmitTF were separately selectable from OmitPositions, as Shai implied. We would have to actually implement this though, I think!
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844637#action_12844637 ]
Marvin Humphrey commented on LUCENE-2308:
If you disable term freq, you also have to disable positions. The freq tells you how many positions there are. I think it's asking an awful lot of our users to require that they understand all the implications of posting format modifications when committers have difficulty mastering all the subtleties.
Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type
Committers are competent in different areas of the code. Even Mike wasn't big into the search side until per segment. Committers are trusted to mess with the pieces they know. I don't see anyone even remotely suggesting that users should have to understand all of the implications of posting format modifications. Just sounds like a nasty jab to me.
- Mark http://www.lucidimagination.com
On Mar 12, 2010, at 2:43 PM, Marvin Humphrey (JIRA) j...@apache.org wrote:
> [ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844637#action_12844637 ]
> Marvin Humphrey commented on LUCENE-2308:
> If you disable term freq, you also have to disable positions. The freq tells you how many positions there are. I think it's asking an awful lot of our users to require that they understand all the implications of posting format modifications when committers have difficulty mastering all the subtleties.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844653#action_12844653 ]
Robert Muir commented on LUCENE-2308:
{quote} If you disable term freq, you also have to disable positions. The freq tells you how many positions there are. {quote}
Marvin: as stated, we would have to actually implement this. There's an issue open for it too: LUCENE-2048. I was just discussing this with someone the other day.
{quote} I think it's asking an awful lot of our users to require that they understand all the implications of posting format modifications when committers have difficulty mastering all the subtleties. {quote}
I don't know what I did to piss you off, but I just thought it would be nice, for completeness, to mention that this feature is still open and it's something we should think about.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844659#action_12844659 ]
Marvin Humphrey commented on LUCENE-2308:
I'm simply suggesting that the proposed API is too hard to understand. Most users know whether their fields can be match-only but have no idea what TFAP is. And even advanced users will have difficulty understanding all the implications for matching and scoring when they selectively disable portions of the posting format. I'm not a fan of omitTFAP, omitTF, omitNorms, omitPositions, or omit(flags). Something that ordinary users can grok would be used more often and more effectively.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844661#action_12844661 ]
Chris Male commented on LUCENE-2308:
What I covered with Mike earlier was whether FieldType would be immutable or not. If it is, which seems a good idea, then everything will be enabled/disabled in the construction of the FieldType, so we would only need to support property getter methods.
Re: [jira] Commented: (LUCENE-2308) Separately specify a field's type
On Fri, Mar 12, 2010 at 03:01:27PM -0500, Mark Miller wrote:
> Committers are competent in different areas of the code. Even Mike wasn't big into the search side until per segment. Committers are trusted to mess with the pieces they know.
Absolutely. I wouldn't expect every committer to understand the gory details of posting formats, and I've been a little caught off guard by the blowback from what I thought was an innocuous observation. But by the same token, I wouldn't expect our users to have sufficient expertise to understand all the variants of omit*() either. This stuff oughtta be implementation details.
> I don't see anyone even remotely suggesting that users should have to understand all of the implications of posting format modifications.
That's what omitTFAP() and omitNorms() do, though. And as Mike pointed out in the baby steps thread, omitTFAP() is often misunderstood.
Marvin Humphrey
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844684#action_12844684 ]
Michael McCandless commented on LUCENE-2308:
Hmm, one challenge with making FieldType immutable is we don't want a zillion ctors over time. Also creating a FieldType with args like new FieldType(true, false, false) isn't really readable.
It would be nice if we could do something similar to IndexWriterConfig (LUCENE-2294), where you use incremental ctor/setters to set up the configuration but then once it's used (bound to a Field), it's immutable.
I'm torn on naming: yes, search-oriented names like matchOnly are tempting, but then we really should tease apart termFreq and positions (they are stuck together now with omitTFAP). And the two are not fully independent as Marvin noted -- so maybe we use a cryptic enum (DOCS, DOCS_TERM_FREQ, DOCS_TERM_FREQ_POSITIONS)? If we can only find better names...
I'm not sure we can/should find better index-time names. What is stored in the index is relatively independent from how/whether searches make use of it. EG if you store termFreq (but not positions) you can still do match only searching, or, you can do full scoring of the query. You can't use positional queries.
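The "configure with setters, then immutable once bound" idea, together with the three-level enum, might be sketched as below. Everything except the enum constants DOCS, DOCS_TERM_FREQ and DOCS_TERM_FREQ_POSITIONS is hypothetical: the class names, the setter names, the defaults, and the choice to freeze in the field's constructor are assumptions, not the design that was adopted.

```java
// Hypothetical sketch: only the three enum constants come from the discussion.
enum PostingsDetail { DOCS, DOCS_TERM_FREQ, DOCS_TERM_FREQ_POSITIONS }

final class FreezableFieldType {
    private boolean stored;
    private boolean indexNorms = true;
    private PostingsDetail postings = PostingsDetail.DOCS_TERM_FREQ_POSITIONS;
    private boolean frozen;

    FreezableFieldType setStored(boolean value) {
        checkMutable();
        stored = value;
        return this;
    }

    FreezableFieldType setIndexNorms(boolean value) {
        checkMutable();
        indexNorms = value;
        return this;
    }

    FreezableFieldType setPostings(PostingsDetail value) {
        checkMutable();
        postings = value;
        return this;
    }

    /** Called when the type is bound to a field; later setter calls fail. */
    void freeze() { frozen = true; }

    boolean isFrozen() { return frozen; }
    boolean stored() { return stored; }
    boolean indexNorms() { return indexNorms; }
    PostingsDetail postings() { return postings; }

    private void checkMutable() {
        if (frozen) {
            throw new IllegalStateException("FieldType is already bound to a field");
        }
    }
}

final class BoundField {
    private final String name;
    private final String value;
    private final FreezableFieldType type;

    BoundField(String name, String value, FreezableFieldType type) {
        type.freeze(); // binding makes the configuration read-only
        this.name = name;
        this.value = value;
        this.type = type;
    }

    String name() { return name; }
    String value() { return value; }
    FreezableFieldType type() { return type; }
}
```

Because the enum has no "positions without term freq" constant, the dependency Marvin pointed out cannot be violated by construction, and chained setters avoid both a growing list of constructors and unreadable boolean argument lists.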
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844688#action_12844688 ]
Marvin Humphrey commented on LUCENE-2308:
bq. Also creating a FieldType with args like new FieldType(true, false, false) isn't really readable.
Agreed. Another option would be a flags integer and bitwise constants:
{code}
FieldType type = new FieldType(analyzer, FieldType.INDEXED | FieldType.STORED);
{code}
bq. It would be nice if we could do something similar to IndexWriterConfig (LUCENE-2294), where you use incremental ctor/setters to set up the configuration but then once it's used (bound to a Field), it's immutable.
I bet that'll be more popular than flags, but I thought it was worth bringing it up anyway. :)
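For comparison, the flags variant could be filled out roughly as follows. This is a sketch under assumptions: the class name `FlagFieldType`, the `ANALYZED` constant, the bit values and the `has` method are invented for illustration, and the analyzer argument from the snippet above is left out to keep the example self-contained.

```java
// Hypothetical sketch of the bit-flags idea; only INDEXED and STORED appear in the comment above.
final class FlagFieldType {
    static final int INDEXED = 1;        // bit 0
    static final int STORED = 1 << 1;    // bit 1
    static final int ANALYZED = 1 << 2;  // bit 2

    private final int flags;

    FlagFieldType(int flags) {
        this.flags = flags;
    }

    /** Returns true if the given single-bit flag is set. */
    boolean has(int flag) {
        return (flags & flag) != 0;
    }
}

final class FlagFieldTypeDemo {
    public static void main(String[] args) {
        FlagFieldType type = new FlagFieldType(FlagFieldType.INDEXED | FlagFieldType.STORED);
        System.out.println("indexed: " + type.has(FlagFieldType.INDEXED));
        System.out.println("analyzed: " + type.has(FlagFieldType.ANALYZED));
    }
}
```

The object is immutable with a single constructor, at the cost of type safety: any int is accepted, which is one reason an enum-based or setter-based design may be preferred.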
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844690#action_12844690 ]
Earwin Burrfoot commented on LUCENE-2308:
I'm strongly against names like 'matchOnly'. They are perfectly fine in some 'schema' layer over Lucene, but here, in the low-level guts, I'd prefer names that clearly state what the hell they do, without forcing me to consult javadocs/code.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844700#action_12844700 ] Yonik Seeley commented on LUCENE-2308: -- For the non-expert user, it's just a label and won't have much meaning regardless of what it's called, and they will need to consult the docs. Of course, if one starts to dig deeper, norms actually does have a physical meaning in the index, so preferring a label with norms in it seems completely reasonable. There's also history to consider - when you change the name of something, you cut the link to the past in search engines, and in the memories of many developers. As it relates to Solr - I don't care so much since it makes sense for the Solr schema to isolate these changes and stick with omitNorms regardless.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844702#action_12844702 ] Chris Male commented on LUCENE-2308: {quote} It would be nice if we could do something similar to IndexWriterConfig (LUCENE-2294), where you use incremental ctor/setters to set up the configuration but then once it's used (bound to a Field), it's immutable. {quote} Yeah we could use something like a FieldTypeBuilder which could provide a fluid interface for specifying each property, which then get built into an immutable FieldType at the end.
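The builder idea in the comment above could look roughly like the following. This is only a sketch: FieldType and FieldTypeBuilder do not exist in Lucene at this point in the thread, and the property set (indexed, stored, analyzed, omitNorms) and all method names are assumptions picked from the issue description for illustration.

```java
// Hypothetical sketch only: these classes are NOT part of Lucene.
// An immutable value object holding per-field indexing options.
final class FieldType {
    private final boolean indexed;
    private final boolean stored;
    private final boolean analyzed;
    private final boolean omitNorms;

    FieldType(boolean indexed, boolean stored, boolean analyzed, boolean omitNorms) {
        this.indexed = indexed;
        this.stored = stored;
        this.analyzed = analyzed;
        this.omitNorms = omitNorms;
    }

    boolean isIndexed() { return indexed; }
    boolean isStored() { return stored; }
    boolean isAnalyzed() { return analyzed; }
    boolean getOmitNorms() { return omitNorms; }
}

// Mutable builder with chained setters; build() copies the current state
// into an immutable FieldType, so later builder changes cannot leak into it.
final class FieldTypeBuilder {
    private boolean indexed = true;   // assumed defaults, for illustration
    private boolean stored = false;
    private boolean analyzed = true;
    private boolean omitNorms = false;

    FieldTypeBuilder indexed(boolean v) { indexed = v; return this; }
    FieldTypeBuilder stored(boolean v) { stored = v; return this; }
    FieldTypeBuilder analyzed(boolean v) { analyzed = v; return this; }
    FieldTypeBuilder omitNorms(boolean v) { omitNorms = v; return this; }

    FieldType build() {
        return new FieldType(indexed, stored, analyzed, omitNorms);
    }
}

class FieldTypeBuilderDemo {
    public static void main(String[] args) {
        // One type instance that could be shared by many fields.
        FieldType idType = new FieldTypeBuilder()
                .stored(true)
                .analyzed(false)
                .omitNorms(true)
                .build();
        System.out.println("stored=" + idType.isStored()
                + " analyzed=" + idType.isAnalyzed()
                + " omitNorms=" + idType.getOmitNorms());
    }
}
```

Because build() copies the values, a FieldType shared across many Fields cannot change underneath them, which is the concern raised about reuse later in the thread.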
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844707#action_12844707 ] Yonik Seeley commented on LUCENE-2308: -- I'm not sure if strict immutability is necessary - there's everything in between too. One can simply say that all changes should be made before first use, and after that point it's undefined. Unrelated question: I assume that this would retain the same flexibility as we have today... the ability to change FieldType for field foo from one document to the next?
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844710#action_12844710 ] Chris Male commented on LUCENE-2308: {quote} I'm not sure if strict immutability is necessary - there's everything in between too. One can simply say that all changes should be made before first use, and after that point it's undefined. {quote} I'm really unsure about this if people are going to be using a FieldType instance with multiple Fields. Perhaps this really is just an edge case. {quote} Unrelated question: I assume that this would retain the same flexibility as we have today... the ability to change FieldType for field foo from one document to the next? {quote} Are you wanting to be able to reuse the same Field instance in both documents while defining separate FieldTypes? Or is creating new Field instances okay?
[jira] Issue Comment Edited: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844710#action_12844710 ] Chris Male edited comment on LUCENE-2308 at 3/12/10 10:01 PM: -- {quote} I'm not sure if strict immutability is necessary - there's everything in between too. One can simply say that all changes should be made before first use, and after that point it's undefined. {quote} I'm really unsure about this if people are going to be using a FieldType instance with multiple Fields. Perhaps this really is just an edge case though. {quote} Unrelated question: I assume that this would retain the same flexibility as we have today... the ability to change FieldType for field foo from one document to the next? {quote} Are you wanting to be able to reuse the same Field instance in both documents while defining separate FieldTypes? Or is creating new Field instances okay?
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844716#action_12844716 ] Yonik Seeley commented on LUCENE-2308: -- bq. I'm really unsure about this if people are going to be using a FieldType instance with multiple Fields. I will, if I can (provided the FieldType does not contain the field name). That shouldn't have anything to do with immutability though. bq. Are you wanting to be able to reuse the same Field instance in both documents while defining separate FieldTypes? Or is creating new Field instances okay? new Field instances should be fine - it's not really my use case anyway. But we're designing for the 1000's of use cases that are out there and we should be careful about adding new constraints.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844720#action_12844720 ] Chris Male commented on LUCENE-2308: {quote} I will, if I can (provided the FieldType does not contain the field name). That shouldn't have anything to do with immutability though. {quote} Yeah the field name will stay inside the Field. To me the reuse issue relates to immutability in that a change to a property in one FieldType after construction means the change affects all the Fields that use that type. But as you say, if we document that it's best to set everything at instantiation and that whatever happens after that is undefined, then I imagine it'll be fine. {quote} new Field instances should be fine - it's not really my use case anyway. But we're designing for the 1000's of use cases that are out there and we should be careful about adding new constraints. {quote} Yeah I appreciate that this API will be used in lots of different ways. Baby steps as Mike said :) But to answer your question, yes the flexibility will remain.
[jira] Commented: (LUCENE-2308) Separately specify a field's type
[ https://issues.apache.org/jira/browse/LUCENE-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844722#action_12844722 ] Yonik Seeley commented on LUCENE-2308: -- Of course... given that Fieldable is an interface, one could create an implementation that just delegated all the calls like omitNorms to a shared instance, except for the value part. Add a getAnalyzer() method to Fieldable, and it's the same thing in the end?
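The delegation idea in the comment above can be sketched as follows. To keep the example self-contained it uses a cut-down, hypothetical SimpleFieldable interface in place of Lucene's real Fieldable (which has many more methods); SharedSettings and DelegatingField are likewise invented names.

```java
// Simplified, hypothetical stand-in for Lucene's Fieldable interface.
interface SimpleFieldable {
    String name();
    String stringValue();
    boolean isIndexed();
    boolean getOmitNorms();
}

// Settings shared by many fields (the role a FieldType would play).
final class SharedSettings {
    final boolean indexed;
    final boolean omitNorms;

    SharedSettings(boolean indexed, boolean omitNorms) {
        this.indexed = indexed;
        this.omitNorms = omitNorms;
    }
}

// Holds only the per-field name and value; every option call is
// forwarded to the shared settings instance.
final class DelegatingField implements SimpleFieldable {
    private final String name;
    private final String value;
    private final SharedSettings settings;

    DelegatingField(String name, String value, SharedSettings settings) {
        this.name = name;
        this.value = value;
        this.settings = settings;
    }

    public String name() { return name; }
    public String stringValue() { return value; }
    public boolean isIndexed() { return settings.indexed; }
    public boolean getOmitNorms() { return settings.omitNorms; }
}

class DelegatingFieldDemo {
    public static void main(String[] args) {
        SharedSettings keyword = new SharedSettings(true, true);
        SimpleFieldable id = new DelegatingField("id", "42", keyword);
        SimpleFieldable tag = new DelegatingField("tag", "lucene", keyword);
        System.out.println(id.name() + " omitNorms=" + id.getOmitNorms());
        System.out.println(tag.name() + " omitNorms=" + tag.getOmitNorms());
    }
}
```

In this shape the shared instance plays the same role a FieldType would, which is the equivalence the comment is pointing at.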
[jira] Created: (LUCENE-2312) Search on IndexWriter's RAM Buffer
Search on IndexWriter's RAM Buffer -- Key: LUCENE-2312 URL: https://issues.apache.org/jira/browse/LUCENE-2312 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 3.0.1 Reporter: Jason Rutherglen Fix For: 3.0.2 In order to offer users near-realtime search without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene-based NRT systems must incur the cost of merging segments, which can slow indexing. Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12844749#action_12844749 ] Jason Rutherglen commented on LUCENE-2312: -- In regards to the terms dictionary, keeping it sorted or not, I think it's best to sort it on demand because otherwise there will be yet another parameter to pass into IW (i.e. sortRAMBufTerms or something like that).
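As a rough illustration of "sort on demand", the sketch below keeps terms in a plain hash map while indexing and only builds a sorted array when a reader asks for one. It is a toy model under stated assumptions: RamTermBuffer and its methods are invented names, and it ignores everything that makes the real IndexWriter buffer hard (byte-level postings, multiple indexing threads, memory accounting).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of an in-RAM terms dictionary; not Lucene code.
class RamTermBuffer {
    private final Map<String, Integer> termFreqs = new HashMap<>();
    private String[] sortedCache; // null whenever a new term has arrived

    // Indexing path: O(1) hash update, no sorting cost.
    void add(String term) {
        int newFreq = termFreqs.merge(term, 1, Integer::sum);
        if (newFreq == 1) {
            sortedCache = null; // a new term invalidates the sorted view
        }
    }

    int freq(String term) {
        return termFreqs.getOrDefault(term, 0);
    }

    // Search path: sort lazily, and reuse the result until a new term shows up.
    String[] sortedTerms() {
        if (sortedCache == null) {
            sortedCache = termFreqs.keySet().toArray(new String[0]);
            Arrays.sort(sortedCache);
        }
        return sortedCache.clone();
    }
}

class RamTermBufferDemo {
    public static void main(String[] args) {
        RamTermBuffer buf = new RamTermBuffer();
        buf.add("lucene");
        buf.add("index");
        buf.add("lucene");
        System.out.println(Arrays.toString(buf.sortedTerms()));
    }
}
```

The trade-off is that the first search after new terms arrive pays the full sort, while indexing pays nothing, and no extra IndexWriter parameter is needed.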
Different behavior of Directory.fileLength()
Hi: During some tests of Lucene Domain Index (http://docs.google.com/View?id=ddgw7sjp_54fgj9kg) with big data sources we found an exception caused by calling the Directory.fileLength() method on a non-existent file. FSDirectory implements this method as:

    /** Returns the length in bytes of a file in the directory. */
    public long fileLength(String name) {
      ensureOpen();
      File file = new File(directory, name);
      return file.length();
    }

According to the JDK 1.5 documentation, the File constructor only builds an abstract pathname; it neither creates a file nor throws if none exists, and File.length() then returns 0 for a missing file: http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html#File(java.lang.String, java.lang.String)

But neither RAMDirectory nor OJVMDirectory does this. RAMDirectory:

    /** Returns the length in bytes of a file in the directory.
     * @throws IOException if the file does not exist */
    public final long fileLength(String name) throws IOException {
      ensureOpen();
      RAMFile file;
      synchronized (this) {
        file = (RAMFile)fileMap.get(name);
      }
      if (file==null)
        throw new FileNotFoundException(name);
      return file.getLength();
    }

If OJVMDirectory throws an exception when a file doesn't exist, IndexWriter fails to do its job. Here is the stack trace:

    IW 3 [Root Thread]: DW: RAM: now flush @ usedMB=15.001 allocMB=15.001 deletesMB=0 triggerMB=15
    IW 3 [Root Thread]: flush: segment=_0 docStoreSegment=_0 docStoreOffset=0 flushDocs=true flushDeletes=false flushDocStores=false numDocs=109169 numBufDelTerms=0
    IW 3 [Root Thread]: index before flush
    IW 3 [Root Thread]: DW: flush postings as segment _0 numDocs=109169
    *** 2010-03-11 17:27:15.696
    IW 3 [Root Thread]: DW: docWriter: now abort
    IW 3 [Root Thread]: hit exception flushing segment _0
    IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.tii
    IFD [Root Thread]: delete _0.tii
    IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.fnm
    IFD [Root Thread]: delete _0.fnm
    IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.fdx
    IFD [Root Thread]: delete _0.fdx
    IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.fdt
    IFD [Root Thread]: delete _0.fdt
    IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.prx
    IFD [Root Thread]: delete _0.prx
    IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.nrm
    IFD [Root Thread]: delete _0.nrm
    IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.frq
    IFD [Root Thread]: delete _0.frq
    IFD [Root Thread]: refresh [prefix=_0]: removing newly created unreferenced file _0.tis
    IFD [Root Thread]: delete _0.tis
    Mar 11, 2010 5:27:15 PM org.apache.lucene.indexer.LuceneDomainIndex ODCIIndexCreate
    SEVERE: failed to create index: cannot verify file: _0.fdx. Reason: Exhausted Resultset
    Mar 11, 2010 5:27:15 PM org.apache.lucene.indexer.LuceneDomainIndex ODCIIndexCreate
    FINER: THROW
    java.io.IOException: cannot verify file: _0.fdx. Reason: Exhausted Resultset
      at org.apache.lucene.store.OJVMDirectory.fileLength(OJVMDirectory.java:633)
      at org.apache.lucene.index.SegmentInfo.sizeInBytes(SegmentInfo.java:271)
      at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:593)
      at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4311)
      at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4209)
      at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4200)
      at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2497)
      at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2451)
      at org.apache.lucene.indexer.TableIndexer.index(TableIndexer.java:374)
      at org.apache.lucene.indexer.LuceneDomainIndex.ODCIIndexCreate(LuceneDomainIndex.java:568)
    IW 3 [Root Thread]: now flush at close
    IW 3 [Root Thread]: flush: segment=null docStoreSegment=null docStoreOffset=0 flushDocs=false flushDeletes=true flushDocStores=false numDocs=0 numBufDelTerms=0
    IW 3 [Root Thread]: index before flush
    IW 3 [Root Thread]: CMS: now merge
    IW 3 [Root Thread]: CMS: index:
    IW 3 [Root Thread]: CMS: no more merges pending; now return
    IW 3 [Root Thread]: now call final commit()
    IW 3 [Root Thread]: startCommit(): start sizeInBytes=0
    IW 3 [Root Thread]: startCommit index= changeCount=1
    IW 3 [Root Thread]: done all syncs
    IW 3 [Root Thread]: commit: pendingCommit != null
    IW 3 [Root Thread]: commit: wrote segments file segments_2
    IFD [Root Thread]: now checkpoint segments_2 [0 segments ; isCommit = true]
    IFD [Root Thread]: deleteCommits: now decRef commit segments_1
    IFD [Root Thread]: delete segments_1
    IW 3 [Root Thread]: commit: done
    IW 3 [Root Thread]: at close:

Which is the correct behavior for this method? We changed the OJVMDirectory.fileLength() method to return 0 if no file exists instead of throwing an exception, and IndexWriter works properly,
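For reference, the difference between the two contracts discussed in this message can be reproduced with plain java.io.File, independent of Lucene. In the sketch below, FileLengthDemo, lenientLength and strictLength are made-up names: the first mirrors the FSDirectory snippet quoted above (java.io.File.length() is documented to return 0L when the file does not exist), and the second adds an explicit existence check to match the RAMDirectory behavior.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;

// Illustration only; not Lucene code.
class FileLengthDemo {

    // Same shape as the quoted FSDirectory.fileLength(): no existence check,
    // so a missing file silently reports a length of 0.
    static long lenientLength(File directory, String name) {
        File file = new File(directory, name); // only builds a path, creates nothing
        return file.length();
    }

    // RAMDirectory-style contract: a missing file is an error.
    static long strictLength(File directory, String name) throws IOException {
        File file = new File(directory, name);
        if (!file.exists()) {
            throw new FileNotFoundException(name);
        }
        return file.length();
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        String missing = "no-such-segment-" + System.nanoTime() + ".fdx";

        System.out.println("lenient length: " + lenientLength(dir, missing));
        System.out.println("file created? " + new File(dir, missing).exists());
        try {
            strictLength(dir, missing);
        } catch (IOException e) {
            System.out.println("strict variant threw: " + e);
        }
    }
}
```

Which of the two contracts Directory.fileLength() is meant to guarantee is exactly the open question of this message; the sketch only shows that the implementations quoted above behave differently for a missing file.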
Re: Baby steps towards making Lucene's scoring more flexible...
On Thu, Mar 11, 2010 at 05:59:03AM -0500, Michael McCandless wrote: So there would be polymorphism in the decoding phase while we're supplying information the Similarity object needs to make its similarity judgments. However, that polymorphism would be handled internally -- it wouldn't be the responsibility of the user to determine whether a codec supported a particular scoring model. Is that yes (a user can do MatchOnlySim at search time if the field were indexed with B25Sim)? In essence, yes. Technically, no. Under the covers, doc-id-only postings iteration probably wouldn't be implemented by spawning a doc-id-only Similarity object. It would probably be something more like, ask the Similarity for a PostingDecoder with no extra attributes. And then docID-freq-boost postings iteration might be achieved by asking the Similarity for a PostingDecoder with TermFreq and DocBoost attributes. How will Lucy know which switchups (Sim at indexing vs Sim at searching) are OK... I think the theme is that each Similarity class will have a whitelist of supported posting iteration configurations. So long as the requested config is in the whitelist, you get an iterator back -- otherwise, you get NULL. Exactly what form the request specification would take, that's up in the air. But it would be an implementation detail for now. So long as the file format supports the data, we can build an iterator that reads it, regardless of encoding. Yeah so, I don't like that in Lucene you call Field.setOmitTFAP instead of saying Field.matchOnly (or something). So I do agree that it'd be better if the API made it clear what the *search* time impact is of using this advanced Field API. In my opinion, it makes sense to communicate match only by way of the Similarity object as opposed to a boolean. I think it's a good way to introduce the Similarity class and get people comfortable with it, and I also think that it's good to keep stuff out of the FieldType API when we can. 
But say we want to also allow storing tf but not positions, because really the two choices should not be coupled (as they are today with Lucene's omitTFAP). So I have omitTF and omitP (only 3 combos are allowed -- must omitP if you omitTF). What Sim do you call that at indexing time? Well, those are pretty esoteric posting formats. It's common to not need scores and therefore not need boost bytes (the Lucene omitNorms case). It's also common to not need any matching info beyond doc id (the Lucene omitTFAP case). But omitTF and omitP aren't common needs, or Lucene would have them by now, right? And since they are infrequently used, Huffman-driven naming philosophy suggests that they should have long, low-value names: OmitPositionsSimilarity, OmitTFandPositionsSimilarity (or OmitTFAPSimilarity, which would actually be an accurate abbreviation in this scenario as opposed to the current Lucene omitTFAP). In other words, I don't much care what those are named because they aren't likely to be used except by people who A) have very, very specific use cases and B) really know what they're doing. In contrast, I think it's important that we come up with good names for the doc-id-tf-positions-but-no-boost-bytes (aka omitNorms) and doc-id-only cases. We get users who are baffled that their phrase queries no longer work after setting omitTFAP. This is still a weakness of MatchSimilarity. Well MatchSimilarity arguably should mean match all queries correctly, just don't score them. Ie, positional queries should in fact work... just not receive a score. Right. However, now that I've thought about it, if a user indicates that a field is match-only by supplying a MatchSimilarity, we know that we can omit boost bytes. So we can re-conceive MatchSimilarity as being analogous to omitNorms. Huzzah! One down, one to go. :) On the other hand, typical candidates for MatchSimilarity... * unique_id * category * tags ... 
either won't contain multiple tokens, or won't generally return sensible results for phrase queries. Maybe we need to splinter MatchSim into the two cases. Whether positions are stored, and whether scoring is done, is really orthogonal. Maybe MinimalSimilarity as the analogue for Lucene omitTFAP? I dunno, that might be kind of generic, but maybe it makes sense in context. The idea is to get the user to describe how the field will be scored. Based on that info, we can customize the posting format, possibly making optimizations and omitting certain posting data. When people ask on the user list... How can I make my index smaller? ... we can reply like so: Make some fields match-only by specifying MatchSimilarity in the FieldType, or even better if you don't need phrase queries, by specifying MinimalSimilarity. You'll be throwing away data Lucy needs for sophisticated queries, but your index will get smaller. I think that
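The whitelist idea described earlier in this message (ask the Similarity for a decoder with a given set of attributes, and get NULL back if that configuration is not supported) could be modeled roughly as below. This is a Java sketch for illustration only: Lucy itself is not written in Java, and PostingAttr, PostingDecoderSketch, WhitelistSimilarity and their methods are all invented names, since the message says the form of the request is still up in the air.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical attribute names for what a posting iterator can decode.
enum PostingAttr { TERM_FREQ, DOC_BOOST, POSITIONS }

// Placeholder for a real decoder; it only records what was requested.
final class PostingDecoderSketch {
    final Set<PostingAttr> attrs;

    PostingDecoderSketch(Set<PostingAttr> attrs) {
        this.attrs = attrs;
    }
}

// A Similarity that keeps a whitelist of supported iteration configurations.
class WhitelistSimilarity {
    private final Set<Set<PostingAttr>> supported = new HashSet<>();

    // Register one supported configuration; no arguments means doc-id-only.
    WhitelistSimilarity allow(PostingAttr... attrs) {
        supported.add(toSet(attrs));
        return this;
    }

    // Returns a decoder if the requested configuration is whitelisted, else null.
    PostingDecoderSketch decoderFor(PostingAttr... attrs) {
        Set<PostingAttr> requested = toSet(attrs);
        return supported.contains(requested) ? new PostingDecoderSketch(requested) : null;
    }

    private static Set<PostingAttr> toSet(PostingAttr... attrs) {
        Set<PostingAttr> set = EnumSet.noneOf(PostingAttr.class);
        set.addAll(Arrays.asList(attrs));
        return set;
    }
}

class WhitelistSimilarityDemo {
    public static void main(String[] args) {
        WhitelistSimilarity sim = new WhitelistSimilarity()
                .allow()                                              // doc ids only
                .allow(PostingAttr.TERM_FREQ, PostingAttr.DOC_BOOST); // docID-freq-boost

        System.out.println(sim.decoderFor() != null);
        System.out.println(sim.decoderFor(PostingAttr.POSITIONS) != null);
    }
}
```

Here the configuration is compared as a set, so the order in which attributes are requested does not matter.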
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844826#action_12844826 ]

Jason Rutherglen commented on LUCENE-2312:
------------------------------------------

I set out to implement a simple method, DocumentsWriter.getTerms, which should return a sorted array of terms over the current RAM buffer. While I think this can be implemented, there's a lot of code in the index package to handle multiple threads, which is fine, except I'm concerned the interleaving of postings won't perform well. So I think we'd want to implement what's been discussed in LUCENE-2293, per-thread RAM buffers. With that change, it seems implementing this issue could be straightforward.

Search on IndexWriter's RAM Buffer
----------------------------------

Key: LUCENE-2312
URL: https://issues.apache.org/jira/browse/LUCENE-2312
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Fix For: 3.0.2

In order to offer users near-realtime search without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable.

Today's Lucene-based NRT systems must incur the cost of merging segments, which can slow indexing.

Michael Busch has good suggestions regarding how to handle deletes using max doc ids.
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here:
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
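
To make the terms-dictionary problem concrete, here is a minimal sketch of producing a sorted term array from an unsorted, hash-based buffer. It is not Lucene code: RamTermBuffer, add, docs and getTerms are hypothetical names, a plain HashMap stands in for IW's specialized hash table, terms are sorted in Java String order (an assumption, not necessarily Lucene's term order), and it is single-threaded, so it sidesteps the interleaving concern raised in the comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an in-RAM postings buffer; this is not Lucene's DocumentsWriter.
final class RamTermBuffer {
    // Unsorted term -> docIDs map, playing the role of the indexing-time hash table.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Record that the given term occurs in the given document.
    void add(String term, int docID) {
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docID);
    }

    // Snapshot the terms currently buffered and sort them, so a searcher
    // can enumerate or binary-search them in order.
    String[] getTerms() {
        String[] terms = postings.keySet().toArray(new String[0]);
        Arrays.sort(terms);
        return terms;
    }

    // DocIDs buffered for a term, in the order they were added.
    List<Integer> docs(String term) {
        List<Integer> docs = postings.get(term);
        return docs == null ? Collections.<Integer>emptyList() : docs;
    }

    public static void main(String[] args) {
        RamTermBuffer buffer = new RamTermBuffer();
        buffer.add("lucene", 0);
        buffer.add("buffer", 0);
        buffer.add("ram", 1);
        buffer.add("lucene", 1);
        System.out.println(Arrays.toString(buffer.getTerms()));
        System.out.println(buffer.docs("lucene"));
    }
}
```

Sorting on every call costs O(n log n) in the number of buffered terms, which is presumably why the issue treats the terms dictionary as the open design question.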
[jira] Commented: (LUCENE-2293) IndexWriter has hard limit on max concurrency
[ https://issues.apache.org/jira/browse/LUCENE-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12844828#action_12844828 ]

Jason Rutherglen commented on LUCENE-2293:
------------------------------------------

{quote}but does anyone out there wanna work out the private RAM segments?{quote}

I didn't see this before. I figured private RAM segments were on the roadmap for this issue, but it sounds like it'll be a different one?

Mike, can you outline what would need to change? It seems like large amounts of code could be removed (i.e. FreqProxFieldMergeState)? The *PerThread classes? If so, I think it would go over my head (because I don't have a mental mapping of how all the classes tie together).

IndexWriter has hard limit on max concurrency
---------------------------------------------

Key: LUCENE-2293
URL: https://issues.apache.org/jira/browse/LUCENE-2293
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 3.1

DocumentsWriter has this nasty hardwired constant:

{code}
private final static int MAX_THREAD_STATE = 5;
{code}

which probably I should have attached a //nocommit to the moment I wrote it ;)

That constant sets the max number of thread states to 5. This means, if more than 5 threads enter IndexWriter at once, they will share only 5 thread states, meaning we gate CPU concurrency to 5 running threads inside IW (each thread must first wait for the last thread to finish using the thread state before grabbing it).

This is bad because modern hardware can make use of more than 5 threads. So I think an immediate fix is to make this settable (expert), and increase the default (8?).

It's tricky, though, because the more thread states, the less RAM efficiency you have, meaning the worse indexing throughput. So you shouldn't up and set this to 50: you'll be flushing too often.

But... I think a better fix is to re-think how threads write state into DocumentsWriter. Today, a single docID stream is assigned across threads (eg one thread gets docID=0, next one docID=1, etc.), and each thread writes to a private RAM buffer (living in the thread state), and then on flush we do a merge sort. The merge sort is inefficient (does not currently use a PQ)... and wasteful, because we must re-decode every posting byte.

I think we could change this, so that threads write to private RAM buffers, with a private docID stream, but then instead of merging on flush, we directly flush each thread as its own segment (and allocate private docIDs to each thread). We can then leave merging to CMS, which can already run merges in the BG without blocking ongoing indexing (unlike the merge we do in flush, today).

This would also allow us to separately flush thread states. Ie, we need not flush all thread states at once -- we can flush one when it gets too big, and then let the others keep running. This should be a good concurrency gain since it uses IO and CPU resources throughout indexing, instead of the big burst of CPU only, then big burst of IO only, that we have today (flush today stops the world).

One downside I can think of is... docIDs would now be less monotonic. Today, if N threads are indexing, you roughly get in-time-order assignment of docIDs. But with this change, all of one thread state would get 0..N docIDs, the next thread state'd get N+1...M docIDs, etc. However, a single thread would still get monotonic assignment of docIDs.
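
The proposal above can be sketched roughly as follows. This is an illustration only, under stated assumptions: PrivateBufferSketch, ThreadState, Segment and SegmentSink are hypothetical names, not Lucene classes; documents are plain strings; "too big" is modeled as a document count rather than RAM usage; and merging (left to CMS in the proposal) is omitted. Each thread state hands out its own docIDs starting at 0 and flushes its buffer as its own segment without touching any other thread state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of per-thread private buffers; none of these classes exist in Lucene.
final class PrivateBufferSketch {

    // A flushed batch of documents; docIDs are segment-local (0..docs.size()-1).
    static final class Segment {
        final List<String> docs;
        Segment(List<String> docs) { this.docs = docs; }
    }

    // Collects flushed segments. Only this hand-off is synchronized;
    // buffering itself needs no lock because each buffer is thread-private.
    static final class SegmentSink {
        private final List<Segment> segments = new ArrayList<>();
        synchronized void add(Segment segment) { segments.add(segment); }
        synchronized List<Segment> snapshot() { return new ArrayList<>(segments); }
    }

    // One instance per indexing thread: private buffer, private docID stream.
    static final class ThreadState {
        private final SegmentSink sink;
        private final int maxBufferedDocs; // stand-in for a RAM threshold (assumption)
        private List<String> buffer = new ArrayList<>();

        ThreadState(SegmentSink sink, int maxBufferedDocs) {
            this.sink = sink;
            this.maxBufferedDocs = maxBufferedDocs;
        }

        // Buffers the document and returns its private docID.
        // Flushes only this thread state once the buffer is full.
        int addDocument(String doc) {
            int docID = buffer.size();
            buffer.add(doc);
            if (buffer.size() >= maxBufferedDocs) {
                flush();
            }
            return docID;
        }

        // Flushes this thread state as its own segment; other states keep running.
        void flush() {
            if (buffer.isEmpty()) {
                return;
            }
            sink.add(new Segment(buffer));
            buffer = new ArrayList<>(); // the next segment's docIDs restart at 0
        }
    }

    public static void main(String[] args) {
        SegmentSink sink = new SegmentSink();
        ThreadState a = new ThreadState(sink, 2);
        ThreadState b = new ThreadState(sink, 2);
        a.addDocument("a0");
        b.addDocument("b0");
        a.addDocument("a1"); // fills a's buffer, so only a is flushed
        System.out.println("segments after a fills up: " + sink.snapshot().size());
    }
}
```

In this sketch a docID is only meaningful within its own segment, which mirrors the downside noted above: ordering across thread states no longer follows arrival time, while each single thread still sees monotonic docIDs.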