[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734015#action_12734015 ] Luis Alves commented on LUCENE-1486:

I share the same opinion as Michael: the implementation has a lot of undefined/undocumented behaviors, simply because it reuses the query parser to parse the text inside a phrase. All of the Lucene syntax needs to be accounted for in this design, but that does not seem to be the case; there are problems like the ones Adriano described, phrases inside phrases, and position reporting for errors. I also have a lot of concerns about allowing the full Lucene syntax inside phrases; trying to restrict it by throwing exceptions for particular cases does not seem like the best design. Here is an example with OR, AND and parentheses combined with a proximity search:

(( jakarta OR green) AND (blue AND orange) AND black~2) apache~10

What should a user expect from this query without looking at the code? I'm not sure. Does it even make sense to support this complex syntax? In my opinion, no. I think we should define the subset of the language we want to support inside phrases, with well-defined behavior. If Mark describes all the syntax he wants to support inside phrases, I actually don't mind implementing a new parser for this. My view is that contrib is probably a better place for this code, until we figure out an implementation that does not impose as many restrictions on changes to the original query parser and has a well-defined syntax for what is allowed inside phrases.

Wildcards, ORs etc inside Phrase queries
Key: LUCENE-1486
URL: https://issues.apache.org/jira/browse/LUCENE-1486
Project: Lucene - Java
Issue Type: Improvement
Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Assignee: Mark Harwood
Priority: Minor
Fix For: 2.9
Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java

An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax, e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the JUnit test include:

checkMatches(\j* smyth~\, 1,2); // wildcards and fuzzies are OK in phrases
checkMatches(\(jo* -john) smith\, 2); // boolean logic works
checkMatches(\jo* smith\~2, 1,2,3); // position logic works
checkBadQuery(\jo* id:1 smith\); // mixing fields in a phrase is bad
checkBadQuery(\jo* \smith\ \); // phrases inside phrases is bad
checkBadQuery(\jo* [sma TO smZ]\ \); // range queries inside phrases are not supported

Code plus JUnit test to follow...
[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API
[ https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734016#action_12734016 ] Simon Willnauer commented on LUCENE-1460:

bq. It seems like 1728 is ready to commit? Simon said on java-dev he will try to finish it by the end of this week?

That is correct. I can commit it today, I think. Will make this issue dependent on 1728 and finish it by the end of today. simon

Change all contrib TokenStreams/Filters to use the new TokenStream API
Key: LUCENE-1460
URL: https://issues.apache.org/jira/browse/LUCENE-1460
Project: Lucene - Java
Issue Type: Task
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1460_contrib_partial.txt, LUCENE-1460_core.txt, LUCENE-1460_partial.txt

Now that we have the new TokenStream API (LUCENE-1422) we should change all contrib modules to use it.
[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734022#action_12734022 ] Michael Busch commented on LUCENE-1448:

OK, I think I have this basically working with the old and new API (including the 1693 changes). The approach I took is fairly simple; it doesn't require adding a new Attribute. I added the following method to TokenStream:

{code:java}
/**
 * This method is called by the consumer after the last token has been consumed,
 * i.e. after {@link #incrementToken()} returned <code>false</code> (using the new TokenStream API)
 * or after {@link #next(Token)} or {@link #next()} returned <code>null</code> (old TokenStream API).
 * <p/>
 * This method can be used to perform any end-of-stream operations, such as setting the final
 * offset of a stream. The final offset of a stream might differ from the offset of the last token,
 * e.g. in case one or more whitespaces followed after the last token, but a {@link WhitespaceTokenizer}
 * was used.
 *
 * @throws IOException
 */
public void end() throws IOException {
  // do nothing by default
}
{code}

Then I took Mike's patch and implemented end() in all classes where his patch added getFinalOffset(). E.g. in CharTokenizer the implementation looks like this:

{code:java}
public void end() {
  // set final offset
  int finalOffset = input.correctOffset(offset);
  offsetAtt.setOffset(finalOffset, finalOffset);
}
{code}

I changed DocInverterPerField to call end() after the stream is fully consumed and to use what offsetAttribute.endOffset() returns as the final offset. I also added all new tests from Mike's latest patch. All unit tests, including the new ones, pass; so does test-tag. I'm not posting a patch yet, because this depends on 1693. Mike, Uwe, others: could you please review whether this approach makes sense?

add getFinalOffset() to TokenStream
Key: LUCENE-1448
URL: https://issues.apache.org/jira/browse/LUCENE-1448
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Reporter: Michael McCandless
Assignee: Michael Busch
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch

If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong. This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field. But this logic is overly simplistic. For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and the next field's offsets are then all 3 too small. Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token. To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm thinking by default it returns -1, which means "I don't know, so you figure it out", meaning we fall back to the faulty logic we have today. This has come up several times on the user's list.
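For illustration, here is a minimal consumer-side sketch of the contract proposed in the comment above (not taken from the patch; the analyzer, field name and text are made up, and the cast is only needed because addAttribute() is not generified in this API revision): exhaust the stream, call end(), then read the final offset from the OffsetAttribute.

{code:java}
Analyzer analyzer = new WhitespaceAnalyzer();
String text = "some example text   ";   // note trailing whitespace

TokenStream stream = analyzer.tokenStream("body", new StringReader(text));
OffsetAttribute offsetAtt = (OffsetAttribute) stream.addAttribute(OffsetAttribute.class);
while (stream.incrementToken()) {
  // consume the token, e.g. invert it into the index
}
stream.end();                             // proposed end-of-stream callback
int finalOffset = offsetAtt.endOffset();  // may be larger than the last token's endOffset
stream.close();
{code}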
[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API
[ https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734024#action_12734024 ] Michael Busch commented on LUCENE-1460:

Cool! Thanks, Simon.
[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734023#action_12734023 ] Michael Busch commented on LUCENE-1448:

Hmm, one thing I haven't done yet is changing Tee/Sink and CachingTokenFilter. But it should be simple: CachingTokenFilter.end() should call input.end() when it is called for the first time and store the captured state locally as finalState. Then whenever CachingTokenFilter.end() is called again, it just restores the finalState. For Tee/Sink it should work similarly: the tee just puts a finalState into the sink(s) the first time end() is called, and when end() of a sink is called it restores that finalState. This should work?
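A rough sketch of that idea for CachingTokenFilter (illustrative only, not committed code; it assumes the captureState()/restoreState() methods of the new AttributeSource API and the end() method proposed in this issue):

{code:java}
private AttributeSource.State finalState;

public void end() throws IOException {
  if (finalState == null) {
    input.end();                  // first call: let the wrapped stream set its end state
    finalState = captureState();  // remember it for later replays
  } else {
    restoreState(finalState);     // later calls just restore the cached end state
  }
}
{code}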
[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734025#action_12734025 ] Michael Busch commented on LUCENE-1448:

Hmm, another reason why I don't like two Tees feeding one Sink: What is the finalOffset and finalState then?
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693:

Description: This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.
- new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-to-instance mappings to the attribute map for each of the found interfaces.
- removes the set/getUseNewAPI() methods (including the standard ones). Instead it is now enough to only implement the new API; if an old TokenStream still implements only the old API (next()/next(Token)), it is wrapped automatically. The delegation path is determined via reflection (the patch determines which of the three methods was overridden).
- Token is no longer deprecated; instead it implements all 6 standard token interfaces (see above). The wrapper for next() and next(Token) uses this to automatically map all attribute interfaces to one TokenWrapper instance (implementing all 6 interfaces) that contains a Token instance. next() and next(Token) exchange the inner Token instance as needed. For the new incrementToken(), only one TokenWrapper instance is visible, delegating to the current reusable Token. This API also preserves custom Token subclasses that may be created by very special token streams (see example in the backwards-compatibility test).
- AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.
- Cloning is now done much more efficiently in captureState. The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a single linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attribute that the source stream (e.g. SinkTokenizer) uses.

The cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can e.g. simply add a Token instance to the stream instead of the individual attributes. Or the user could implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think this should be considered an expert API (addAttributeImpl), as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then. Note also that when we add serialization to the Attributes, e.g. for supporting storing serialized TokenStreams in the index, the serialization should benefit even more significantly from the new API than cloning does.

This issue contains one backwards-compatibility break: TokenStreams/Filters/Tokenizers should normally be final (see LUCENE-1753 for the explanation). Some of these core classes are not final, and so one could override the next() or next(Token) methods. In this case, the backwards wrapper would automatically use incrementToken(), because it is implemented, so the overridden method is never called. To prevent users from errors not visible during compilation or testing (the streams just behave wrong), this patch makes all implementation methods final (next(), next(Token), incrementToken()) whenever the class itself is not final. This is a BW break, but users will clearly see that they have done something unsupported and should instead create a custom TokenFilter with their additional implementation (instead of extending a core implementation). For further changing contrib token streams the following procedure should be used:
* rewrite and replace next(Token)/next() implementations by the new API
* if the
[jira] Updated: (LUCENE-1693) AttributeSource/TokenStream API improvements
[ https://issues.apache.org/jira/browse/LUCENE-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1693:

Description: This patch makes the following improvements to AttributeSource and TokenStream/Filter:

- introduces interfaces for all Attributes. The corresponding implementations have the postfix 'Impl', e.g. TermAttribute and TermAttributeImpl. AttributeSource now has a factory for creating the Attribute instances; the default implementation looks for implementing classes with the postfix 'Impl'. Token now implements all 6 TokenAttribute interfaces.
- new method added to AttributeSource: addAttributeImpl(AttributeImpl). Using reflection it walks up the class hierarchy of the passed-in object and finds all interfaces that the class or its superclasses implement and that extend the Attribute interface. It then adds the interface-to-instance mappings to the attribute map for each of the found interfaces.
- removes the set/getUseNewAPI() methods (including the standard ones). Instead it is now enough to only implement the new API; if an old TokenStream still implements only the old API (next()/next(Token)), it is wrapped automatically. The delegation path is determined via reflection (the patch determines which of the three methods was overridden).
- Token is no longer deprecated; instead it implements all 6 standard token interfaces (see above). The wrapper for next() and next(Token) uses this to automatically map all attribute interfaces to one TokenWrapper instance (implementing all 6 interfaces) that contains a Token instance. next() and next(Token) exchange the inner Token instance as needed. For the new incrementToken(), only one TokenWrapper instance is visible, delegating to the current reusable Token. This API also preserves custom Token subclasses that may be created by very special token streams (see example in the backwards-compatibility test).
- AttributeImpl now has a default implementation of toString() that uses reflection to print out the values of the attributes in a default formatting. This makes it a bit easier to implement AttributeImpl, because toString() was declared abstract before.
- Cloning is now done much more efficiently in captureState. The method figures out which unique AttributeImpl instances are contained as values in the attributes map, because those are the ones that need to be cloned. It creates a single linked list that supports deep cloning (in the inner class AttributeSource.State). AttributeSource keeps track of when this state changes, i.e. whenever new attributes are added to the AttributeSource. Only in that case will captureState recompute the state; otherwise it will simply clone the precomputed state and return the clone. restoreState(AttributeSource.State) walks the linked list and uses the copyTo() method of AttributeImpl to copy all values over into the attribute that the source stream (e.g. SinkTokenizer) uses.
- Tee- and SinkTokenizer were deprecated, because they use Token instances for caching. This is not compatible with the new API using AttributeSource.State objects. You can still use the old deprecated ones, but new features provided by new Attribute types may get lost in the chain. A replacement is a new TeeSinkTokenFilter, which has a factory to create new Sink instances that have compatible attributes (a usage sketch follows below). Sink instances created by one Tee can also be added to another Tee, as long as the attribute implementations are compatible (it is not possible to add a sink from a tee using one Token instance to a tee using the six separate attribute impls). In that case an UnsupportedOperationException is thrown.

The cloning performance can be greatly improved if multiple AttributeImpl instances are not used in one TokenStream. A user can e.g. simply add a Token instance to the stream instead of the individual attributes. Or the user could implement a subclass of AttributeImpl that implements exactly the Attribute interfaces needed. I think this should be considered an expert API (addAttributeImpl), as this manual optimization is only needed if cloning performance is crucial. I ran some quick performance tests using Tee/Sink tokenizers (which do cloning) and the performance was roughly 20% faster with the new API. I'll run some more performance tests and post more numbers then. Note also that when we add serialization to the Attributes, e.g. for supporting storing serialized TokenStreams in the index, the serialization should benefit even more significantly from the new API than cloning does.

This issue contains one backwards-compatibility break: TokenStreams/Filters/Tokenizers should normally be final (see LUCENE-1753 for the explanation). Some of these core classes are not final, and so one could override the next() or next(Token) methods. In this case, the backwards wrapper would automatically use incrementToken(), because it
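A usage illustration for the TeeSinkTokenFilter replacement described in this updated description. The factory-method name below is only an assumption based on "a factory to create new Sink instances"; check the actual patch for the exact signature.

{code:java}
// One tee feeds several sinks; each sink later replays the captured states.
TeeSinkTokenFilter tee = new TeeSinkTokenFilter(new WhitespaceTokenizer(reader));
TokenStream sink1 = tee.newSinkTokenStream();   // assumed factory method
TokenStream sink2 = tee.newSinkTokenStream();
// Consume "tee" as the token stream for the first field; afterwards sink1 and
// sink2 can be consumed as the token streams for two further fields without
// re-analyzing the text.
{code}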
[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734063#action_12734063 ] Uwe Schindler commented on LUCENE-1448:

This is not the only problem with multiple Tees: the offsets are also completely mixed together, especially if the two tees feed into the sink at the same time (not after each other). In my opinion, the last call to end() should be cached by the sink as the end state (so if two tees add an end state to the tee, the second one overwrites the first one).
[jira] Issue Comment Edited: (LUCENE-1448) add getFinalOffset() to TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734063#action_12734063 ] Uwe Schindler edited comment on LUCENE-1448 at 7/22/09 3:25 AM:

This is not the only problem with multiple Tees: the offsets are also completely mixed together, especially if the two tees feed into the sink at the same time (not after each other). In my opinion, the last call to end() should be cached by the sink as the end state (so if two tees add an end state to the sink, the second one overwrites the first one).

was (Author: thetaphi): This is not the only problem with multiple Tees: The offsets are also completely mixed together, especially if the two tees feed into the sink at the same time (not after each other). In my opinion, the last call to end should be cached by the sink as end state (so if two tees add a end state to the tee, the second one overwrites the first one).
[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood
[ https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734067#action_12734067 ] Uwe Schindler commented on LUCENE-1644:

Sorry that I came back too late to this issue, I am on holiday at the moment. In my opinion, the Parameter instead of boolean is a good idea. The latest patch is also a good idea; I only have some small problems with it:

- Why did you make so many internal things public? The additional ctor of MultiTermQueryWrapperFilter should be package-private or protected (the class is not abstract, but should be used like an abstract class, so it must have only protected ctors). Only the concrete subclasses like TermRangeFilter should have public ctors.
- getFilter()/getEnum() should stay protected.
- I do not like the weird caching of Terms. A cleaner API would be a new class CachingFilteredTermEnum that can turn on caching for, e.g., the first 20 terms and then be reset. In this case the API would stay clean and the filter code would not need to be changed at all (it just consumes the TermEnum, whether it is cached or not). I would propose something like: new CachingFilteredTermEnum(originalEnum), use it normally, then termEnum.reset() to consume it again, and termEnum.purgeCache() if caching is no longer needed and should be switched off (after the first 25 terms or so).

The problem with MultiTermQueryWrapperFilter is that the filter is normally stateless (no reader or TermEnum), so normally the method getDocIdSet() would have to get the TermEnum or wrapper in addition to the IndexReader. This is not very good (it took me some time to understand what you are doing).

Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood
Key: LUCENE-1644
URL: https://issues.apache.org/jira/browse/LUCENE-1644
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1644.patch, LUCENE-1644.patch

When MultiTermQuery is used (via one of its subclasses, eg WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use constant score mode, which pre-builds a filter and then wraps that filter as a ConstantScoreQuery. If you don't set that, it instead builds a [potentially massive] BooleanQuery with one SHOULD clause per term. There are some limitations of this approach:
* The scores returned by the BooleanQuery are often quite meaningless to the app, so one should be able to use a BooleanQuery yet get constant scores back. (Though I vaguely remember at least one example someone raised where the scores were useful...)
* The resulting BooleanQuery can easily have too many clauses, throwing an extremely confusing exception to newish users.
* It'd be better to have the freedom to pick "build filter up front" vs "build massive BooleanQuery" when constant scoring is enabled, because they have different performance tradeoffs.
* In constant score mode, an OpenBitSet is always used, yet for sparse bit sets this does not give good performance.

I think we could address these issues by giving BooleanQuery a constant score mode, then empower MultiTermQuery (when in constant score mode) to choose whether to use BooleanQuery vs an up-front filter, and finally empower MultiTermQuery to pick the best (sparse vs dense) bit set impl.
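To make that proposal concrete, here is a hypothetical sketch of such a CachingFilteredTermEnum. This class does not exist in Lucene; the name and the reset()/purgeCache() methods are taken from the comment above, and only the abstract TermEnum methods next(), term(), docFreq() and close() are assumed from the current API.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class CachingFilteredTermEnum extends TermEnum {
  private final TermEnum input;
  private final int maxCachedTerms;
  private List<Term> cache = new ArrayList<Term>();
  private int replayPos = -1;   // >= 0 while replaying cached terms after reset()
  private Term current;

  public CachingFilteredTermEnum(TermEnum input, int maxCachedTerms) {
    this.input = input;
    this.maxCachedTerms = maxCachedTerms;
  }

  public boolean next() throws IOException {
    if (cache != null && replayPos >= 0) {
      if (replayPos < cache.size()) {
        current = cache.get(replayPos++);    // serve the next cached term
        return true;
      }
      replayPos = -1;                        // cache replayed, continue with the input
    }
    if (!input.next()) return false;
    current = input.term();
    if (cache != null && cache.size() < maxCachedTerms) cache.add(current);
    return true;
  }

  /** Replay the cached terms from the beginning. */
  public void reset() { replayPos = 0; }

  /** Switch caching off, e.g. once the first N terms show it is not worth it. */
  public void purgeCache() { cache = null; replayPos = -1; }

  public Term term() { return current; }
  public int docFreq() { return input.docFreq(); }  // note: stale while replaying from the cache
  public void close() throws IOException { input.close(); }
}
{code}

A rewrite could consume the enum once to decide between a BooleanQuery and a filter, call reset() to replay the first terms, and call purgeCache() as soon as it is clear there are too many terms to be worth caching.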
[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood
[ https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734070#action_12734070 ] Uwe Schindler commented on LUCENE-1644:

The biggest problem is that this caching goes completely wrong with multi-segment indexes: the rewriting is done on the top-level reader. In this case the boolean query would be built and the terms cached. If there are too many terms, it creates a filter instance with the cached terms. The rewritten query is then executed against all sub-readers using the cached terms and a fixed term enum. Normally this would create a DocIdSet for the current index reader, but the rewrite did it for the top-level index reader, so the wrong doc ids are returned, and so on. So you cannot reuse the terms collected during the rewrite operation in the getDocIdSet() calls. Please turn off this caching entirely! As noted before, the important thing is that the filter returned by rewrite is stateless and should not know anything about index readers. The index reader is passed in getDocIdSet() and is different for non-optimized indexes. You have seen no tests fail because all RangeQuery tests use optimized indexes.
[ApacheCon US] Travel Assistance
The Travel Assistance Committee is now accepting applications from those wanting to attend ApacheCon US 2009 (Oakland), which takes place between the 2nd and 6th November 2009. The Travel Assistance Committee is looking for people who would like to be able to attend ApacheCon US 2009 but who may need some financial support in order to get there. There are limited places available, and all applications will be scored on their individual merit. Applications are open to all open source developers who feel that their attendance would benefit themselves, their project(s), the ASF and open source in general. Financial assistance is available for flights, accommodation, subsistence and conference fees, either in full or in part, depending on circumstances. It is intended that all our ApacheCon events are covered, so it may be prudent for those in Europe and/or Asia to wait until an event closer to them comes up - you are all welcome to apply for ApacheCon US of course, but there should be compelling reasons for you to attend an event further away from your home location for your application to be considered above those closer to the event location. More information can be found on the main Apache website at http://www.apache.org/travel/index.html - where you will also find a link to the online application and details for submitting. Applications for travel assistance will open on 27th July 2009 and close on 17th August 2009. Good luck to all those who apply. Regards, The Travel Assistance Committee
[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood
[ https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734096#action_12734096 ] Michael McCandless commented on LUCENE-1644:

bq. The biggest problem is that this caching goes completely wrong with multi-segment indexes

Right, I caught this as well (there is one test that fails when I forcefully swap in constant-boolean-query as the constant score method), and I'm now turning off the caching. I've fixed it locally -- will post a new rev soon.
[jira] Commented: (LUCENE-1460) Change all contrib TokenStreams/Filters to use the new TokenStream API
[ https://issues.apache.org/jira/browse/LUCENE-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734097#action_12734097 ] Robert Muir commented on LUCENE-1460:

Michael, after 1728 I can take another look at this. The reason is that I added some tests to these analyzers and found a bug in the Thai offsets. When I submitted this, I only duplicated the existing behavior, but I don't want to reintroduce the bug into incrementToken().
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Alves updated LUCENE-1486:

Attachment: junit_complex_phrase_qp_07_22_2009.patch

I added 2 test cases that return doc 3 but do not make much sense, just to prove the point that we need more documentation describing the use cases for the complex phrase QP, and to define the subset of the syntax we want to support.

checkMatches(\(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\~3,3); // proximity with fuzzy, OR, AND, NOT
checkMatches(\(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\~3,3); // proximity with fuzzy, OR, AND
[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734141#action_12734141 ] Luis Alves edited comment on LUCENE-1486 at 7/22/09 7:55 AM:

I added 2 test cases that return doc 3. These queries do not make much sense; I added them just to prove the point that we need more information describing the use cases for the complex phrase QP. We also should define the subset of the syntax we want to support inside phrases, with well-defined behavior.

checkMatches(\(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\~3,3); // proximity with fuzzy, OR, AND, NOT
checkMatches(\(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\~3,3); // proximity with fuzzy, OR, AND

was (Author: lafa): I added 2 testcases that return doc 3, but do not make much sense just to prove the point that we need more docs describing the use case for complex phrase qp, and define what is the subset of the supported syntax we want to support. checkMatches(\(goos~0.5 AND (mike OR smith) AND NOT ( percival AND john) ) vacation\~3,3); // proximity with fuzzy, OR, AND, NOT checkMatches(\(goos~0.5 AND (mike OR smith) AND ( percival AND john) ) vacation\~3,3); // proximity with fuzzy, OR, AND
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734148#action_12734148 ] Mark Harwood commented on LUCENE-1486:

I'll try and catch up with some of the issues raised here:

bq. What do you mean on the last check by phrase inside phrase, I don't see any phrase inside a phrase

Correct, the "inner phrase" example was a term, not a phrase. This is perhaps a better example: checkBadQuery(\jo* \percival smith\ \); // phrases inside phrases is bad

bq. I'm trying now to figure out what is supported

The JUnit test is currently the main form of documentation - unlike the XMLQueryParser (which has a DTD), there is no syntax to formally capture the logic. Here is a basic summary of the syntax supported and how it differs from normal non-phrase use of the same operators:
* Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply single terms)
* Brackets are used to group/define the acceptable variations for a given phrase element, e.g. (john OR jonathon) smith
* AND is irrelevant - there is effectively an implied AND_NEXT_TO binding all phrase elements

To move this forward I would suggest we consider following one of these options:
1) Keep in core and improve error reporting and documentation
2) Move into contrib as experimental
3) Retain in core but simplify it to support only the simplest syntax (as in my Britney~ example)
4) Re-engineer QueryParser.jj to support a formally defined syntax for the operators acceptable within phrases, e.g. *, ~, ( )

I think 1) is achievable if we carefully define where the existing parser breaks (e.g. ANDs and nested brackets); 2) is unnecessary if we can achieve 1); 3) would be a shame, as we would lose useful features over some very convoluted edge cases; and 4) is beyond my JavaCC skills.
[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2
[ https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734154#action_12734154 ] Tim Smith commented on LUCENE-1754:

Keeping null should be fine, as long as this is documented, all core query implementations handle this behavior, and all searcher code handles the null return properly. At that point, NonMatchingScorer could be removed and null returned in its place (being package-private, no one writing applications can make any assumptions about a NonMatchingScorer being returned). However, this should also be documented for the rewrite() method (which currently looks to always expect a non-null return value); also, the searcher will throw NullPointerExceptions if a null query is passed to it.

Get rid of NonMatchingScorer from BooleanScorer2
Key: LUCENE-1754
URL: https://issues.apache.org/jira/browse/LUCENE-1754
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1754.patch

Over in LUCENE-1614 Mike made a comment about removing NonMatchingScorer from BS2 and returning null in BooleanWeight.scorer(). I've checked and this can be easily done, so I'm going to post a patch shortly. For reference: https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064. I've marked the issue as 2.9 just because it's small, and kind of related to all the search enhancements done for 2.9.
Re: Java caching of low-level index data?
That's an interesting idea. I always wonder, however, how much exactly we would gain vs. the effort spent to develop, debug and maintain it. Just some thoughts that we should consider regarding this:

* For very large indices, which is where we think this would generally help, I believe it's reasonable to assume that the search index will sit on its own machine, with its own set of CPUs, RAM and HD. Therefore, given that very little other than the search index will run on that OS, I assume the OS cache will be enough (if not better)?
* In other cases, where the search app runs together w/ other apps, I'm not sure how much we'll gain. I assume such apps will use a smaller index, or will not need to support a high query load. If so, will they really care whether we cache their data, vs. the OS?

Like I said, these are just thoughts. I don't mean to dismiss the idea w/ them, just to think about how much it will improve performance (vs. maybe even hurt it?). Often I find that optimizations like this mainly benefit very large indices. But those usually get their decent share of resources, and the JVM itself is run w/ a larger heap etc., so such optimizations turn out not to affect those indices much after all. And for smaller indices, performance is usually not a problem (well ... they might just fit entirely in RAM).

Shai

On Wed, Jul 22, 2009 at 6:21 PM, Nigel nigelspl...@gmail.com wrote: In discussions of Lucene search performance, the importance of OS caching of index data is frequently mentioned. The typical recommendation is to keep plenty of unallocated RAM available (e.g. don't gobble it all up with your JVM heap) and try to avoid large I/O operations that would purge the OS cache. I'm curious if anyone has thought about (or even tried) caching the low-level index data in Java, rather than in the OS. For example, at the IndexInput level there could be an LRU cache of byte[] blocks, similar to how an RDBMS caches index pages. (Conveniently, BufferedIndexInput already reads in 1k chunks.) You would reverse the advice above and instead make your JVM heap as large as possible (or at least large enough to achieve a desired speed/space tradeoff). This approach seems like it would have some advantages:
- Explicit control over how much you want cached (adjust your JVM heap and cache settings as desired)
- Cached index data won't be purged by the OS doing other things
- Index warming might be faster, or at least more predictable

The obvious disadvantage for some situations is that more RAM would now be tied up by the JVM, rather than managed dynamically by the OS. Any thoughts? It seems like this would be pretty easy to implement (subclass FSDirectory, return a subclass of FSIndexInput that checks the cache before reading, with the cache keyed on filename + position), but maybe I'm oversimplifying, and for that matter a similar implementation may already exist somewhere for all I know. Thanks, Chris
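To make the idea in this thread concrete, here is a very rough sketch. Instead of subclassing FSIndexInput directly (as suggested above), it wraps any IndexInput, which is a slight variation on the suggestion. CachedIndexInput and BlockCache are invented names for illustration only; the sketch targets the 2.9-era abstract IndexInput methods, and cloning, bounds checks, error handling and thread safety are all ignored.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.lucene.store.IndexInput;

// Hypothetical LRU cache of 1 KB blocks, keyed on "fileName:blockIndex".
class BlockCache extends LinkedHashMap<String, byte[]> {
  private final int maxBlocks;
  BlockCache(int maxBlocks) { super(16, 0.75f, true); this.maxBlocks = maxBlocks; }
  protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
    return size() > maxBlocks;                 // evict least-recently-used block
  }
}

// Hypothetical IndexInput that serves reads from the block cache and only
// touches the underlying (e.g. FSIndexInput) delegate on a cache miss.
class CachedIndexInput extends IndexInput {
  private static final int BLOCK = 1024;
  private final String name;
  private final IndexInput delegate;
  private final BlockCache cache;
  private long pos;

  CachedIndexInput(String name, IndexInput delegate, BlockCache cache) {
    this.name = name; this.delegate = delegate; this.cache = cache;
  }

  public byte readByte() throws IOException {
    byte[] block = block(pos / BLOCK);
    return block[(int) (pos++ % BLOCK)];
  }

  public void readBytes(byte[] b, int offset, int len) throws IOException {
    for (int i = 0; i < len; i++) b[offset + i] = readByte();   // simplistic
  }

  private byte[] block(long blockIndex) throws IOException {
    String key = name + ":" + blockIndex;
    byte[] block = cache.get(key);
    if (block == null) {                       // miss: read the block from disk
      long start = blockIndex * BLOCK;
      block = new byte[(int) Math.min(BLOCK, delegate.length() - start)];
      delegate.seek(start);
      delegate.readBytes(block, 0, block.length);
      cache.put(key, block);
    }
    return block;
  }

  public long getFilePointer() { return pos; }
  public void seek(long p) { pos = p; }
  public long length() { return delegate.length(); }
  public void close() throws IOException { delegate.close(); }
}

A real implementation would more likely plug in at the BufferedIndexInput.readInternal() level (as hinted at by the "1k chunks" remark above), so the existing buffering logic is reused rather than re-implemented.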
[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2
[ https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734164#action_12734164 ] Michael McCandless commented on LUCENE-1754:

I think we continue to allow scorer() and getDocIdSet() to return null to mean "no matches", though they are not required to (i.e., one cannot assume that a non-null return means there are matches). And we should make this clear in the javadocs. And remove NonMatchingScorer.
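To illustrate the consumer-side contract this implies, a small sketch against the 2.9-era Weight/Scorer API (the actual IndexSearcher change may look different):

{code:java}
Scorer scorer = weight.scorer(reader, true, false);
if (scorer != null) {          // null now simply means "no matches for this reader"
  scorer.score(collector);
}
{code}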
[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2
[ https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734168#action_12734168 ] Shai Erera commented on LUCENE-1754:

OK, then I'll add a test case to the patch which uses QWF w/ a query whose scorer returns null, and then fix IndexSearcher accordingly. And update the javadocs as needed.
Re: Java caching of low-level index data?
imo, it is too low level to do it better than OSs. I agree, cache unloading effect would be prevented with it, but I am not sure if it brings net-net benefit, you would get this problem fixed, but probably OS would kill you anyhow (you took valuable memory from OS) on queries that miss your internal cache... We could try to do better if we put more focus on higher levels and do the caching there... maybe even cache somhow some CPU work, e.g. keep dense Postings in faster, less compressed format, load TermDictionary into RAMDirectory and keep the rest on disk.. Ideas in that direction have better chance to bring us forward. Take for example FuzzyQuery, there you can do some LRU caching at Term level and and save huge amounts of IO and CPU... From: Shai Erera ser...@gmail.com To: java-dev@lucene.apache.org Sent: Wednesday, 22 July, 2009 17:32:34 Subject: Re: Java caching of low-level index data? That's an interesting idea. I always wonder however how much exactly would we gain, vs. the effort spent to develop, debug and maintain it. Just some thoughts that we should consider regarding this: * For very large indices, where we think this will generally be good for, I believe it's reasonable to assume that the search index will sit on its own machine, or set of CPUs, RAM and HD. Therefore given that very few will run on the OS other than the search index, I assume the OS cache will be enough (if not better)? * In other cases, where the search app runs together w/ other apps, I'm not sure how much we'll gain. I can assume such apps will use a smaller index, or will not need to support high query load? If so, will they really care if we cache their data, vs. the OS? Like I said, these are just thoughts. I don't mean to cancel the idea w/ them, just to think how much will it improve performance (vs. maybe even hurt it?). Often I find it that some optimizations that are done will benefit very large indices. But these usually get their decent share of resources, and the JVM itself is run w/ larger heap etc. So these optimizations turn out to not affect such indices much after all. And for smaller indices, performance is usually not a problem (well ... they might just fit entirely in RAM). Shai On Wed, Jul 22, 2009 at 6:21 PM, Nigel nigelspl...@gmail.com wrote: In discussions of Lucene search performance, the importance of OS caching of index data is frequently mentioned. The typical recommendation is to keep plenty of unallocated RAM available (e.g. don't gobble it all up with your JVM heap) and try to avoid large I/O operations that would purge the OS cache. I'm curious if anyone has thought about (or even tried) caching the low-level index data in Java, rather than in the OS. For example, at the IndexInput level there could be an LRU cache of byte[] blocks, similar to how a RDBMS caches index pages. (Conveniently, BufferedIndexInput already reads in 1k chunks.) You would reverse the advice above and instead make your JVM heap as large as possible (or at least large enough to achieve a desired speed/space tradeoff). This approach seems like it would have some advantages: - Explicit control over how much you want cached (adjust your JVM heap and cache settings as desired) - Cached index data won't be purged by the OS doing other things - Index warming might be faster, or at least more predictable The obvious disadvantage for some situations is that more RAM would now be tied up by the JVM, rather than managed dynamically by the OS. Any thoughts? 
It seems like this would be pretty easy to implement (subclass FSDirectory, return subclass of FSIndexInput that checks the cache before reading, cache keyed on filename + position), but maybe I'm oversimplifying, and for that matter a similar implementation may already exist somewhere for all I know. Thanks, Chris
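One concrete (if blunt) form of the "cache at a higher level" direction suggested above is simply copying the index, or parts of it, into a RAMDirectory and leaving the rest on disk. The sketch below only shows the wholesale copy using RAMDirectory's copy constructor; splitting out just the term dictionary, as eks dev suggests, would need a switching Directory and is not shown. RamLoadedIndex is a made-up name.
{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

class RamLoadedIndex {
  /** Copies an on-disk index wholesale into heap memory. */
  static Directory openInRam(File indexDir) throws IOException {
    Directory onDisk = FSDirectory.open(indexDir);  // the existing on-disk index
    return new RAMDirectory(onDisk);                // everything in RAM; no OS cache involved
  }
}
{code}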
[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges
[ https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734169#action_12734169 ] Michael McCandless commented on LUCENE-1076: maxDoc() does reflect the number of docs in the index. It's simply the sum of docCount for all segments. Shuffling the order of the segments, or allowing non-contiguous segments to be merged, won't change how maxDoc() is computed. New docIDs are allocated by incrementing an integer (starting with 0) for the buffered docs. When a segment gets flushed, we reset that to 0. I.e., docIDs are scoped to a single segment; they have no context from prior segments. Allow MergePolicy to select non-contiguous merges - Key: LUCENE-1076 URL: https://issues.apache.org/jira/browse/LUCENE-1076 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.3 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1076.patch I started work on this but with LUCENE-1044 I won't make much progress on it for a while, so I want to checkpoint my current state/patch. For backwards compatibility we must leave the default MergePolicy as selecting contiguous merges. This is necessary because some applications rely on temporal monotonicity of doc IDs, which means even though merges can re-number documents, the renumbering will always reflect the order in which the documents were added to the index. Still, for those apps that do not rely on this, we should offer a MergePolicy that is free to select the best merges regardless of whether they are continuguous. This requires fixing IndexWriter to accept such a merge, and, fixing LogMergePolicy to optionally allow it the freedom to do so. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
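A toy model (not Lucene code) of the two points Mike makes above: doc IDs are handed out per segment starting from 0, and maxDoc() is just the sum of per-segment doc counts plus whatever is still buffered, regardless of segment order.
{code:java}
import java.util.ArrayList;
import java.util.List;

class DocIdAllocationSketch {
  static class Segment { int docCount; }

  private int bufferedDocId = 0;                     // next doc ID within the in-memory buffer
  private final List<Segment> segments = new ArrayList<Segment>();

  int addDocument() { return bufferedDocId++; }      // IDs 0, 1, 2, ... within the buffer

  void flush() {                                     // the buffer becomes a new segment
    Segment s = new Segment();
    s.docCount = bufferedDocId;
    segments.add(s);
    bufferedDocId = 0;                               // reset: the next segment starts at 0 again
  }

  int maxDoc() {                                     // sum of docCount over all segments
    int sum = bufferedDocId;                         // plus docs still buffered in memory
    for (Segment s : segments) sum += s.docCount;
    return sum;
  }
}
{code}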
[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges
[ https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734174#action_12734174 ] Shai Erera commented on LUCENE-1076: Oh. Thanks for correcting me. In that case, I take what I said back. I think this together w/ LUCENE-1750 can really help speed up segment merges in certain scenarios. Will wait for you to come back to it :) Allow MergePolicy to select non-contiguous merges - Key: LUCENE-1076 URL: https://issues.apache.org/jira/browse/LUCENE-1076 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.3 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1076.patch I started work on this but with LUCENE-1044 I won't make much progress on it for a while, so I want to checkpoint my current state/patch. For backwards compatibility we must leave the default MergePolicy as selecting contiguous merges. This is necessary because some applications rely on temporal monotonicity of doc IDs, which means even though merges can re-number documents, the renumbering will always reflect the order in which the documents were added to the index. Still, for those apps that do not rely on this, we should offer a MergePolicy that is free to select the best merges regardless of whether they are continuguous. This requires fixing IndexWriter to accept such a merge, and, fixing LogMergePolicy to optionally allow it the freedom to do so. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1720) TimeLimitedIndexReader and associated utility class
[ https://issues.apache.org/jira/browse/LUCENE-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734176#action_12734176 ] Mark Harwood commented on LUCENE-1720: -- bq. Hey Mark. Have you made any progress with that? Apologies, recently the lure of developing apps for the new iPhone has put paid to that :) I'm still happy that the pseudo-code we outlined in the last couple of comments is what is needed to finish this. bq.We can tag team if you want Yep, happy to do that. Let me know if you start work to avoid me duplicating effort and I'll do the same. Cheers Mark TimeLimitedIndexReader and associated utility class --- Key: LUCENE-1720 URL: https://issues.apache.org/jira/browse/LUCENE-1720 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Attachments: ActivityTimedOutException.java, ActivityTimeMonitor.java, TestTimeLimitedIndexReader.java, TimeLimitedIndexReader.java An alternative to TimeLimitedCollector that has the following advantages: 1) Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase. 2) Times out faster (i.e. runaway queries such as fuzzies detected quickly before last collect stage of query processing) Uses new utility timeout class that is independent of IndexReader. Initial contribution includes a performance test class but not had time as yet to work up a formal Junit test. TimeLimitedIndexReader is coded as JDK1.5 but can easily be undone. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
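For readers following along, the shape of the pseudo-code referred to above is roughly the following (a hedged sketch, not the attached ActivityTimeMonitor): record a per-thread deadline before starting reader activity, and have low-level calls poll it so runaway work is cut off quickly.
{code:java}
class TimeoutSketch {
  private static final ThreadLocal<Long> DEADLINE = new ThreadLocal<Long>();

  /** Called before starting a time-limited activity (search, doc retrieval, ...). */
  static void start(long timeoutMillis) {
    DEADLINE.set(System.currentTimeMillis() + timeoutMillis);
  }

  /** Called from hot spots (e.g. a wrapped TermDocs.next()) to abort overdue work. */
  static void check() {
    Long deadline = DEADLINE.get();
    if (deadline != null && System.currentTimeMillis() > deadline) {
      throw new RuntimeException("activity timed out");  // stand-in for ActivityTimedOutException
    }
  }

  /** Called when the activity completes normally. */
  static void stop() { DEADLINE.remove(); }
}
{code}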
Re: Java caching of low-level index data?
I think it's a neat idea! But you are in fact fighting the OS so I'm not sure how well this'll work in practice. EG the OS will happily swap out pages from your process if it thinks you're not using them, so it'd easily swap out your cache in favor of its own IO cache (this is the swappiness configuration on Linux), which would then kill performance (take a page hit when you finally did need to use your cache). In C (possibly requiring root) you could wire the pages, but we can't do that from javaland, so it's already not a fair fight. Mike On Wed, Jul 22, 2009 at 11:56 AM, eks deveks...@yahoo.co.uk wrote: imo, it is too low level to do it better than OSs. I agree, cache unloading effect would be prevented with it, but I am not sure if it brings net-net benefit, you would get this problem fixed, but probably OS would kill you anyhow (you took valuable memory from OS) on queries that miss your internal cache... We could try to do better if we put more focus on higher levels and do the caching there... maybe even cache somhow some CPU work, e.g. keep dense Postings in faster, less compressed format, load TermDictionary into RAMDirectory and keep the rest on disk.. Ideas in that direction have better chance to bring us forward. Take for example FuzzyQuery, there you can do some LRU caching at Term level and and save huge amounts of IO and CPU... From: Shai Erera ser...@gmail.com To: java-dev@lucene.apache.org Sent: Wednesday, 22 July, 2009 17:32:34 Subject: Re: Java caching of low-level index data? That's an interesting idea. I always wonder however how much exactly would we gain, vs. the effort spent to develop, debug and maintain it. Just some thoughts that we should consider regarding this: * For very large indices, where we think this will generally be good for, I believe it's reasonable to assume that the search index will sit on its own machine, or set of CPUs, RAM and HD. Therefore given that very few will run on the OS other than the search index, I assume the OS cache will be enough (if not better)? * In other cases, where the search app runs together w/ other apps, I'm not sure how much we'll gain. I can assume such apps will use a smaller index, or will not need to support high query load? If so, will they really care if we cache their data, vs. the OS? Like I said, these are just thoughts. I don't mean to cancel the idea w/ them, just to think how much will it improve performance (vs. maybe even hurt it?). Often I find it that some optimizations that are done will benefit very large indices. But these usually get their decent share of resources, and the JVM itself is run w/ larger heap etc. So these optimizations turn out to not affect such indices much after all. And for smaller indices, performance is usually not a problem (well ... they might just fit entirely in RAM). Shai On Wed, Jul 22, 2009 at 6:21 PM, Nigel nigelspl...@gmail.com wrote: In discussions of Lucene search performance, the importance of OS caching of index data is frequently mentioned. The typical recommendation is to keep plenty of unallocated RAM available (e.g. don't gobble it all up with your JVM heap) and try to avoid large I/O operations that would purge the OS cache. I'm curious if anyone has thought about (or even tried) caching the low-level index data in Java, rather than in the OS. For example, at the IndexInput level there could be an LRU cache of byte[] blocks, similar to how a RDBMS caches index pages. (Conveniently, BufferedIndexInput already reads in 1k chunks.) 
You would reverse the advice above and instead make your JVM heap as large as possible (or at least large enough to achieve a desired speed/space tradeoff). This approach seems like it would have some advantages: - Explicit control over how much you want cached (adjust your JVM heap and cache settings as desired) - Cached index data won't be purged by the OS doing other things - Index warming might be faster, or at least more predictable The obvious disadvantage for some situations is that more RAM would now be tied up by the JVM, rather than managed dynamically by the OS. Any thoughts? It seems like this would be pretty easy to implement (subclass FSDirectory, return subclass of FSIndexInput that checks the cache before reading, cache keyed on filename + position), but maybe I'm oversimplifying, and for that matter a similar implementation may already exist somewhere for all I know. Thanks, Chris - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood
[ https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1644: --- Attachment: LUCENE-1644.patch Attached patch: fixed some bugs in the last rev, updated test cases, javadocs, CHANGES. I also optimized MultiTermQueryWrapperFilter to use the bulk-read API from termDocs. I confirmed all tests pass if I temporarily switch CONSTANT_SCORE_FILTER_REWRITE to CONSTANT_SCORE_AUTO_REWRITE_DEFAULT. I changed QueryParser to use CONSTANT_SCORE_AUTO for rewrite (it was previously CONSTANT_FILTER). I still need to run some perf tests to get a rough sense of decent defaults for CONSTANT_SCORE_AUTO cutover thresholds. bq. getFilter()/getEnum should stay protected. OK I made getEnum protected again. I had tentatively made it public so that one could create their own [external] rewrite methods. But I think (if we leave it protected), one could still make an inner/nested class that can access getEnum(). Do we even need getFilter()? I removed it in the patch. Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood --- Key: LUCENE-1644 URL: https://issues.apache.org/jira/browse/LUCENE-1644 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch When MultiTermQuery is used (via one of its subclasses, eg WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use constant score mode, which pre-builds a filter and then wraps that filter as a ConstantScoreQuery. If you don't set that, it instead builds a [potentially massive] BooleanQuery with one SHOULD clause per term. There are some limitations of this approach: * The scores returned by the BooleanQuery are often quite meaningless to the app, so, one should be able to use a BooleanQuery yet get constant scores back. (Though I vaguely remember at least one example someone raised where the scores were useful...). * The resulting BooleanQuery can easily have too many clauses, throwing an extremely confusing exception to newish users. * It'd be better to have the freedom to pick build filter up front vs build massive BooleanQuery, when constant scoring is enabled, because they have different performance tradeoffs. * In constant score mode, an OpenBitSet is always used, yet for sparse bit sets this does not give good performance. I think we could address these issues by giving BooleanQuery a constant score mode, then empower MultiTermQuery (when in constant score mode) to pick choose whether to use BooleanQuery vs up-front filter, and finally empower MultiTermQuery to pick the best (sparse vs dense) bit set impl. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
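From the application side, the knob this patch is shaping looks roughly like the sketch below. The constant names are taken from the patch under discussion, so treat them as provisional: pick a rewrite method on the MultiTermQuery and let it decide between an up-front filter and a constant-score BooleanQuery.
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.WildcardQuery;

class RewriteModeExample {
  static WildcardQuery newAutoRewriteWildcard() {
    WildcardQuery q = new WildcardQuery(new Term("body", "jo*"));
    // let Lucene choose between a filter and a (constant-score) BooleanQuery,
    // based on the cutover thresholds being tuned in this issue
    q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT);
    return q;
  }
}
{code}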
[jira] Assigned: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges
[ https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-1076: -- Assignee: (was: Michael McCandless) Unassigning myself. Allow MergePolicy to select non-contiguous merges - Key: LUCENE-1076 URL: https://issues.apache.org/jira/browse/LUCENE-1076 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.3 Reporter: Michael McCandless Priority: Minor Attachments: LUCENE-1076.patch I started work on this but with LUCENE-1044 I won't make much progress on it for a while, so I want to checkpoint my current state/patch. For backwards compatibility we must leave the default MergePolicy as selecting contiguous merges. This is necessary because some applications rely on temporal monotonicity of doc IDs, which means even though merges can re-number documents, the renumbering will always reflect the order in which the documents were added to the index. Still, for those apps that do not rely on this, we should offer a MergePolicy that is free to select the best merges regardless of whether they are continuguous. This requires fixing IndexWriter to accept such a merge, and, fixing LogMergePolicy to optionally allow it the freedom to do so. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1076) Allow MergePolicy to select non-contiguous merges
[ https://issues.apache.org/jira/browse/LUCENE-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734190#action_12734190 ] Michael McCandless commented on LUCENE-1076: bq. Will wait for you to come back to it Feel free to take it, too :) I think LUCENE-1737 is also very important for speeding up merging, especially because it's so unexpected that just by adding different fields to your docs, or the same fields in different orders, can so severely impact merge performance. Allow MergePolicy to select non-contiguous merges - Key: LUCENE-1076 URL: https://issues.apache.org/jira/browse/LUCENE-1076 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.3 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Attachments: LUCENE-1076.patch I started work on this but with LUCENE-1044 I won't make much progress on it for a while, so I want to checkpoint my current state/patch. For backwards compatibility we must leave the default MergePolicy as selecting contiguous merges. This is necessary because some applications rely on temporal monotonicity of doc IDs, which means even though merges can re-number documents, the renumbering will always reflect the order in which the documents were added to the index. Still, for those apps that do not rely on this, we should offer a MergePolicy that is free to select the best merges regardless of whether they are continuguous. This requires fixing IndexWriter to accept such a merge, and, fixing LogMergePolicy to optionally allow it the freedom to do so. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2
[ https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1754: --- Attachment: LUCENE-1754.patch * Added a test case to TestDocIdSet which returns a null DocIdSet and indeed IndexSearcher failed. * Fixed IndexSearcher and all other places in the code which called scorer() or getDocIdSet() and could potentially hit NPE. * Added EmptyDocIdSetIterator for use by classes (such as ChainFilter) that need a DISI, but got a null DocIdSet. * Updated CHANGES. All search tests pass. Get rid of NonMatchingScorer from BooleanScorer2 Key: LUCENE-1754 URL: https://issues.apache.org/jira/browse/LUCENE-1754 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1754.patch, LUCENE-1754.patch Over in LUCENE-1614 Mike has made a comment about removing NonMatchinScorer from BS2, and return null in BooleanWeight.scorer(). I've checked and this can be easily done, so I'm going to post a patch shortly. For reference: https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064. I've marked the issue as 2.9 just because it's small, and kind of related to all the search enhancements done for 2.9. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
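The ChainFilter-style situation mentioned above (code that needs a DocIdSetIterator but may now receive a null DocIdSet) can be handled with a small fallback. The sketch below substitutes DocIdSet.EMPTY_DOCIDSET as a stand-in for the EmptyDocIdSetIterator added in the patch; NullTolerantChaining is a made-up name.
{code:java}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Filter;

class NullTolerantChaining {
  /** Returns an iterator for the filter, never null, even when there are no matches. */
  static DocIdSetIterator iteratorFor(Filter f, IndexReader reader) throws IOException {
    DocIdSet set = f.getDocIdSet(reader);
    if (set == null) {
      set = DocIdSet.EMPTY_DOCIDSET;   // no matches: use an always-exhausted set instead of null
    }
    return set.iterator();
  }
}
{code}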
Re: Java caching of low-level index data?
this should not be all that difficult to try. I accept it makes sense in some cases ... but which ones? Background: all my attempts to fight OS went bed :( Let us think again what does it mean what Mike gave as an example? You are explicitly deciding that Lucene should get bigger share of RAM. OS will unload these pages if OS needs Lucene RAM for something else and you are not using them. Right? If something else should get less resources, we are on target, but this is end result. For any shared setup where you have many things that run, this decision has its consequences, something else is going to be starved. The other case, where only lucene runs, well what is the difference if we evict unused pages or OS does it (better control is just what we get on benefit)? This is the case where you are anyhow in not really comfortable for real caching situation, otherwise even greedy OSs wouldn't swap (at least my experience with reasonably configured OSs)... after thinking about it again, I would say, yes, there are for sure some cases where it helps, but not many cases and even in these cases benefit will be small. I guess :) - Original Message From: Michael McCandless luc...@mikemccandless.com To: java-dev@lucene.apache.org Sent: Wednesday, 22 July, 2009 18:37:19 Subject: Re: Java caching of low-level index data? I think it's a neat idea! But you are in fact fighting the OS so I'm not sure how well this'll work in practice. EG the OS will happily swap out pages from your process if it thinks you're not using them, so it'd easily swap out your cache in favor of its own IO cache (this is the swappiness configuration on Linux), which would then kill performance (take a page hit when you finally did need to use your cache). In C (possibly requiring root) you could wire the pages, but we can't do that from javaland, so it's already not a fair fight. Mike On Wed, Jul 22, 2009 at 11:56 AM, eks devwrote: imo, it is too low level to do it better than OSs. I agree, cache unloading effect would be prevented with it, but I am not sure if it brings net-net benefit, you would get this problem fixed, but probably OS would kill you anyhow (you took valuable memory from OS) on queries that miss your internal cache... We could try to do better if we put more focus on higher levels and do the caching there... maybe even cache somhow some CPU work, e.g. keep dense Postings in faster, less compressed format, load TermDictionary into RAMDirectory and keep the rest on disk.. Ideas in that direction have better chance to bring us forward. Take for example FuzzyQuery, there you can do some LRU caching at Term level and and save huge amounts of IO and CPU... From: Shai Erera To: java-dev@lucene.apache.org Sent: Wednesday, 22 July, 2009 17:32:34 Subject: Re: Java caching of low-level index data? That's an interesting idea. I always wonder however how much exactly would we gain, vs. the effort spent to develop, debug and maintain it. Just some thoughts that we should consider regarding this: * For very large indices, where we think this will generally be good for, I believe it's reasonable to assume that the search index will sit on its own machine, or set of CPUs, RAM and HD. Therefore given that very few will run on the OS other than the search index, I assume the OS cache will be enough (if not better)? * In other cases, where the search app runs together w/ other apps, I'm not sure how much we'll gain. I can assume such apps will use a smaller index, or will not need to support high query load? 
If so, will they really care if we cache their data, vs. the OS? Like I said, these are just thoughts. I don't mean to cancel the idea w/ them, just to think how much will it improve performance (vs. maybe even hurt it?). Often I find it that some optimizations that are done will benefit very large indices. But these usually get their decent share of resources, and the JVM itself is run w/ larger heap etc. So these optimizations turn out to not affect such indices much after all. And for smaller indices, performance is usually not a problem (well ... they might just fit entirely in RAM). Shai On Wed, Jul 22, 2009 at 6:21 PM, Nigel wrote: In discussions of Lucene search performance, the importance of OS caching of index data is frequently mentioned. The typical recommendation is to keep plenty of unallocated RAM available (e.g. don't gobble it all up with your JVM heap) and try to avoid large I/O operations that would purge the OS cache. I'm curious if anyone has thought about (or even tried) caching the low-level index data in Java, rather than in the OS. For example, at the IndexInput level there could be an LRU cache of byte[] blocks, similar to how a RDBMS caches index pages. (Conveniently, BufferedIndexInput already reads in 1k chunks.) You would
[jira] Commented: (LUCENE-1754) Get rid of NonMatchingScorer from BooleanScorer2
[ https://issues.apache.org/jira/browse/LUCENE-1754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734202#action_12734202 ] Michael McCandless commented on LUCENE-1754: For some reason I can't apply the patch -- I get this:
{code}
$ patch -p0 < /x/tmp/LUCENE-1754.patch.txt
patching file CHANGES.txt
patch: malformed patch at line 20: @@ -629,6 +638,11 @@
{code}
Get rid of NonMatchingScorer from BooleanScorer2 Key: LUCENE-1754 URL: https://issues.apache.org/jira/browse/LUCENE-1754 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1754.patch, LUCENE-1754.patch Over in LUCENE-1614 Mike has made a comment about removing NonMatchinScorer from BS2, and return null in BooleanWeight.scorer(). I've checked and this can be easily done, so I'm going to post a patch shortly. For reference: https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12715064page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12715064. I've marked the issue as 2.9 just because it's small, and kind of related to all the search enhancements done for 2.9. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734292#action_12734292 ] Michael Busch commented on LUCENE-1448: --- Cool, I will take this approach and submit a patch as soon as LUCENE-1693 is committed. add getFinalOffset() to TokenStream --- Key: LUCENE-1448 URL: https://issues.apache.org/jira/browse/LUCENE-1448 Project: Lucene - Java Issue Type: Bug Components: Analysis Reporter: Michael McCandless Assignee: Michael Busch Priority: Minor Fix For: 2.9 Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong. This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field. But this logic is overly simplistic. For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small. Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token. To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm thinking by default it returns -1, which means I don't know so you figure it out, meaning we fallback to the faulty logic we have today. This has come up several times on the user's list. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
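Since getFinalOffset() is only being proposed in this issue (it is not an existing TokenStream method), here is a very small sketch of the intent: a stream remembers how many characters it actually consumed, including trailing whitespace or trailing stopwords, and -1 keeps today's fallback behavior of using 1 + the last token's endOffset.
{code:java}
// Illustrative only: mirrors the proposal in this issue, not shipped Lucene API.
class FinalOffsetSketch {
  private int finalOffset = -1;            // -1 == unknown; caller falls back to 1 + last endOffset

  /** A tokenizer would call this once it has consumed all input characters. */
  void onTokenizationFinished(int charsConsumed) {
    finalOffset = charsConsumed;           // includes trailing whitespace / stopwords
  }

  /** The method proposed for TokenStream in this issue. */
  int getFinalOffset() {
    return finalOffset;
  }
}
{code}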
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734296#action_12734296 ] Michael Busch commented on LUCENE-1486: --- I think the best thing to do here is do exactly define what syntax is supposed to be supported (which Mark H. did in his latest comment), and then implement the new syntax with the new queryparser. It will enforce correct syntax and give meaningful exceptions if a query was entered that is not supported. I think we can still reuse big portions of Mark's patch: we should be able to write a new QueryBuilder that produces the new ComplexPhraseQuery. Adriano/Luis: how long would it take to implement? Can we contain it for 2.9? This would mean that these new features would go into contrib in 2.9 as part of the new query parser framework, and then be moved to core in 3.0. Also from 3.0 these new features would then be part of Lucene's main query syntax. Would this makes sense? Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood
[ https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734301#action_12734301 ] Uwe Schindler edited comment on LUCENE-1644 at 7/22/09 1:50 PM: Hi Mike, patch looks good. I was a little bit confused about the high term number cut off, but it is using Math.max to limit it to the current BooleanQuery max clause count. Some small things: bq. OK I made getEnum protected again. ...but only in MultiTermQuery itsself. Everywhere else (even in the backwards compatibility override test [JustCompile] it is public). And the same should be for the incNumberOfTerms (also protected). I think the rewrite method is internal to MultiTermQuery and always implemented ina subclass of MTQ as inner class. Also the current singletons are not really singletons, because queries that are unserialized will contain instances that are not the singleton instances :) - and will therefore fail to produce correct hashcode/equals tests. The problem behind: The singletons are serializable but do not return itsself in readResolve() (not implemented). All singletons that are serializable must implement readResolve and return the singleton instance (see Parameter base class or the parser singletons in FieldCache). The instance in the default Auto RewriteMethod is still modifiable. Is this correct? So one could modify the defaults by setting properties in this instance. Is this correct? was (Author: thetaphi): Hi Mike, patch looks good. I was a little bit confused about the high term number cut off, but it is using Math.max to limit it to the current BooleanQuery max clause count. Some small things: bq. OK I made getEnum protected again. ...but only in MultiTermQuery itsself. Everywhere else (even in the backwards compatibility override test [JustCompile] it is public). Also the current singletons are not really singletons, because queries that are unserialized will contain instances that are not the singleton instances :) - and will therefore fail to produce correct hashcode/equals tests. The problem behind: The singletons are serializable but do not return itsself in readResolve() (not implemented). All singletons that are serializable must implement readResolve and return the singleton instance (see Parameter base class or the parser singletons in FieldCache). The instance in the default Auto RewriteMethod is still modifiable. Is this correct? So one could modify the defaults by setting properties in this instance. Is this correct? Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood --- Key: LUCENE-1644 URL: https://issues.apache.org/jira/browse/LUCENE-1644 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch When MultiTermQuery is used (via one of its subclasses, eg WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use constant score mode, which pre-builds a filter and then wraps that filter as a ConstantScoreQuery. If you don't set that, it instead builds a [potentially massive] BooleanQuery with one SHOULD clause per term. There are some limitations of this approach: * The scores returned by the BooleanQuery are often quite meaningless to the app, so, one should be able to use a BooleanQuery yet get constant scores back. (Though I vaguely remember at least one example someone raised where the scores were useful...). 
* The resulting BooleanQuery can easily have too many clauses, throwing an extremely confusing exception to newish users. * It'd be better to have the freedom to pick build filter up front vs build massive BooleanQuery, when constant scoring is enabled, because they have different performance tradeoffs. * In constant score mode, an OpenBitSet is always used, yet for sparse bit sets this does not give good performance. I think we could address these issues by giving BooleanQuery a constant score mode, then empower MultiTermQuery (when in constant score mode) to pick choose whether to use BooleanQuery vs up-front filter, and finally empower MultiTermQuery to pick the best (sparse vs dense) bit set impl. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
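The readResolve() point raised above is the standard Java serialization idiom for singletons; a minimal sketch (class name invented for the example):
{code:java}
import java.io.ObjectStreamException;
import java.io.Serializable;

class RewriteMethodSingleton implements Serializable {
  static final RewriteMethodSingleton INSTANCE = new RewriteMethodSingleton();

  private RewriteMethodSingleton() {}

  /** Deserialization hook: hand back the canonical instance instead of a fresh copy,
      so equals/hashCode checks against the singleton still hold after a round trip. */
  private Object readResolve() throws ObjectStreamException {
    return INSTANCE;
  }
}
{code}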
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734323#action_12734323 ] Luis Alves commented on LUCENE-1486: Mark H - Question 1) I also have a question about position. I added a doc 5 and 6 DocData docsContent[] = { new DocData(john smith, 1), new DocData(johathon smith, 2), new DocData(john percival smith goes on a b c vacation, 3), new DocData(jackson waits tom, 4), new DocData(johathon smith john, 5), new DocData(johathon mary gomes smith, 6), }; for test checkMatches(\(jo* -john) smyth\, 2); // boolean logic with would document 5 be return by or just doc 2 should be returned, I'm assuming position is always important and doc 5 is supposed to be returned, correct? Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches(\john -percival\, 1); // not logic doesn't work // checkMatches(\john (-percival)\, 1); // not logic doesn't work Question 3) checkMatches(\jo* smith\~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this does not seem to be working Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches(\(jo* AND mary) smith\, 1,2,5); // boolean logic with returns 1,2,5 and not 6, but I was only expecting 6 to be returned, can you describe what is the behavior here. Look like the and is convert into a OR, that the case. What is the behavior you want to implement. Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734323#action_12734323 ] Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:19 PM: - Mark H - Question 1) I also have a question about position. I added a doc 5 and 6 {monospaced} DocData docsContent[] = { new DocData(john smith, 1), new DocData(johathon smith, 2), new DocData(john percival smith goes on a b c vacation, 3), new DocData(jackson waits tom, 4), new DocData(johathon smith john, 5), new DocData(johathon mary gomes smith, 6), }; {monospaced} for test checkMatches(\(jo* -john) smyth\, 2); // boolean logic with would document 5 be returned or just doc 2 should be returned, I'm assuming position is always important and doc 5 is supposed to be returned, correct? Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches(\john -percival\, 1); // not logic doesn't work // checkMatches(\john (-percival)\, 1); // not logic doesn't work Question 3) checkMatches(\jo* smith\~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this feature does not seem to be working. Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches(\(jo* AND mary) smith\, 1,2,5); // boolean logic with returns 1,2,5 and not 6, but I was only expecting 6 to be returned, Can you describe what is the behavior here. Looks like the and is converted into a OR. What is the behavior you want to implement? was (Author: lafa): Mark H - Question 1) I also have a question about position. I added a doc 5 and 6 {{monospaced}} DocData docsContent[] = { new DocData(john smith, 1), new DocData(johathon smith, 2), new DocData(john percival smith goes on a b c vacation, 3), new DocData(jackson waits tom, 4), new DocData(johathon smith john, 5), new DocData(johathon mary gomes smith, 6), }; {{monospaced}} for test checkMatches(\(jo* -john) smyth\, 2); // boolean logic with would document 5 be returned or just doc 2 should be returned, I'm assuming position is always important and doc 5 is supposed to be returned, correct? Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches(\john -percival\, 1); // not logic doesn't work // checkMatches(\john (-percival)\, 1); // not logic doesn't work Question 3) checkMatches(\jo* smith\~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this feature does not seem to be working. Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches(\(jo* AND mary) smith\, 1,2,5); // boolean logic with returns 1,2,5 and not 6, but I was only expecting 6 to be returned, Can you describe what is the behavior here. Looks like the and is converted into a OR. What is the behavior you want to implement? Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. 
The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734323#action_12734323 ] Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:21 PM: - Mark H - Question 1) I also have a question about position. I added a doc 5 and 6 {code:title=TestComplexPhraseQuery.java|borderStyle=solid} ... DocData docsContent[] = { new DocData(john smith, 1), new DocData(johathon smith, 2), new DocData(john percival smith goes on a b c vacation, 3), new DocData(jackson waits tom, 4), new DocData(johathon smith john, 5), new DocData(johathon mary gomes smith, 6), }; ... {code} for test checkMatches(\(jo* -john) smyth\, 2); // boolean logic with would document 5 be returned or just doc 2 should be returned, I'm assuming position is always important and doc 5 is supposed to be returned, correct? Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches(\john -percival\, 1); // not logic doesn't work // checkMatches(\john (-percival)\, 1); // not logic doesn't work Question 3) checkMatches(\jo* smith\~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this feature does not seem to be working. Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches(\(jo* AND mary) smith\, 1,2,5); // boolean logic with returns 1,2,5 and not 6, but I was only expecting 6 to be returned, Can you describe what is the behavior here. Looks like the and is converted into a OR. What is the behavior you want to implement? was (Author: lafa): Mark H - Question 1) I also have a question about position. I added a doc 5 and 6 {monospaced} DocData docsContent[] = { new DocData(john smith, 1), new DocData(johathon smith, 2), new DocData(john percival smith goes on a b c vacation, 3), new DocData(jackson waits tom, 4), new DocData(johathon smith john, 5), new DocData(johathon mary gomes smith, 6), }; {monospaced} for test checkMatches(\(jo* -john) smyth\, 2); // boolean logic with would document 5 be returned or just doc 2 should be returned, I'm assuming position is always important and doc 5 is supposed to be returned, correct? Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches(\john -percival\, 1); // not logic doesn't work // checkMatches(\john (-percival)\, 1); // not logic doesn't work Question 3) checkMatches(\jo* smith\~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this feature does not seem to be working. Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches(\(jo* AND mary) smith\, 1,2,5); // boolean logic with returns 1,2,5 and not 6, but I was only expecting 6 to be returned, Can you describe what is the behavior here. Looks like the and is converted into a OR. What is the behavior you want to implement? Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. 
The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands,
[jira] Issue Comment Edited: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734323#action_12734323 ] Luis Alves edited comment on LUCENE-1486 at 7/22/09 2:24 PM: - Mark H - Question 1) I added a doc 5 and 6 {code:title=TestComplexPhraseQuery.java|borderStyle=solid} ... DocData docsContent[] = { new DocData(john smith, 1), new DocData(johathon smith, 2), new DocData(john percival smith goes on a b c vacation, 3), new DocData(jackson waits tom, 4), new DocData(johathon smith john, 5), new DocData(johathon mary gomes smith, 6), }; ... {code} for test checkMatches(\(jo* -john) smyth\, 2); // boolean logic with would document 5 be returned or just doc 2 should be returned, I'm assuming position is always important and doc 5 is supposed to be returned. Is this the correct behavior? Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches(\john -percival\, 1); // not logic doesn't work // checkMatches(\john (-percival)\, 1); // not logic doesn't work Question 3) for query: checkMatches(\jo* smith\~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this feature does not seem to be working. Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches(\(jo* AND mary) smith\, 1,2,5); // boolean logic with returns 1,2,5 and not 6, but I was only expecting 6 to be returned, seems that like the AND is converted into a OR. What is the behavior you want to implement? was (Author: lafa): Mark H - Question 1) I added a doc 5 and 6 {code:title=TestComplexPhraseQuery.java|borderStyle=solid} ... DocData docsContent[] = { new DocData(john smith, 1), new DocData(johathon smith, 2), new DocData(john percival smith goes on a b c vacation, 3), new DocData(jackson waits tom, 4), new DocData(johathon smith john, 5), new DocData(johathon mary gomes smith, 6), }; ... {code} for test checkMatches(\(jo* -john) smyth\, 2); // boolean logic with would document 5 be returned or just doc 2 should be returned, I'm assuming position is always important and doc 5 is supposed to be returned, correct? Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches(\john -percival\, 1); // not logic doesn't work // checkMatches(\john (-percival)\, 1); // not logic doesn't work Question 3) checkMatches(\jo* smith\~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this feature does not seem to be working. Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches(\(jo* AND mary) smith\, 1,2,5); // boolean logic with returns 1,2,5 and not 6, but I was only expecting 6 to be returned, Can you describe what is the behavior here. Looks like the and is converted into a OR. What is the behavior you want to implement? Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. 
The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734333#action_12734333 ] Luis Alves commented on LUCENE-1486: Sorry for all the emails, I'm still new to JIRA and only now I realized that for every edit I do,a email is sent. But now that I found the preview button, it won't happen again. :) Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734337#action_12734337 ] Mark Harwood commented on LUCENE-1486: -- bq. I think it's not a big deal, but I'm just trying to understand and raise a probable wrong test. Granted, the test fails for a reason other than the one for which I wanted it to fail. We can probably strike the test and leave a note saying phrase-within-a-phrase just does not make sense and is not supported. bq. Is the operator between 'query' and 'parser' the implicit AND_NEXT_TO or the default boolean operator (usually OR)? In brackets it's an OR - the brackets are used to suggest that the current phrase element at position X is composed of some choices that are evaluated as a subclause in the same way that in normal query logic sub-clauses are defined in brackets e.g. +a +(b OR c). There seems to be a reasonable logic to this. Ideally the ComplexPhraseQueryParser should explicitly turn this setting on while evaluating the bracketed innards of phrases just in case the base class has AND as the default. bq. Mark H, can you please elaborate more on the these other operators + - ^ AND || NOT ! : [ ] { }. OK I'll try and deal with them one by one but these are not necessarily definitive answers or guarantees of correctly implemented support OR,||,+, AND, . ignored. The implicit operator is AND_NEXT_TO apart from in bracketed sections where all elements at this level are ORed ^ .boosts are carried through from TermQuerys to SpanTermQuerys NOT, ! Creates SpanNotQueries []{} range queries are supported as are wildcards *, fuzzies ~, ? bq. query: '(john OR jonathon) smith~0.3 order*' order:sell stock market I'll post the XML query syntax equivalent of what should be parsed here shortly (just seen your next comment come in) Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Assignee: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches(\j* smyth~\, 1,2); //wildcards and fuzzies are OK in phrases checkMatches(\(jo* -john) smith\, 2); // boolean logic works checkMatches(\jo* smith\~2, 1,2,3); // position logic works. checkBadQuery(\jo* id:1 smith\); //mixing fields in a phrase is bad checkBadQuery(\jo* \smith\ \); //phrases inside phrases is bad checkBadQuery(\jo* [sma TO smZ]\ \); //range queries inside phrases not supported Code plus Junit test to follow... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
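To make the operator mapping described above concrete, here is a hedged sketch of the kind of SpanQuery tree a phrase like "(jo* -john) smith" conceptually rewrites to. The terms johathon/john stand in for whatever jo* actually expands to in a given index, and the field name is made up.
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

class ComplexPhraseRewriteSketch {
  static SpanQuery joStarMinusJohnThenSmith() {
    // what the wildcard jo* might expand to (bracketed terms are ORed)
    SpanQuery joExpansion = new SpanOrQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("name", "johathon")),
        new SpanTermQuery(new Term("name", "john")) });
    // -john inside the brackets becomes a SpanNotQuery
    SpanQuery notJohn = new SpanNotQuery(joExpansion, new SpanTermQuery(new Term("name", "john")));
    // the implicit AND_NEXT_TO between phrase elements: slop 0, in order
    return new SpanNearQuery(
        new SpanQuery[] { notJohn, new SpanTermQuery(new Term("name", "smith")) },
        0, true);
  }
}
{code}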
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734349#action_12734349 ] Mark Harwood commented on LUCENE-1486: -- {quote}for test checkMatches("(jo* -john) smyth", 2); would document 5 be returned or just doc 2 should be returned, {quote} I presume you mean smith, not smyth, here - otherwise nothing would match? If so, doc 5 should match and position is relevant (subject to slop factors). {quote} Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches("john -percival", 1); // not logic doesn't work // checkMatches("john (-percival)", 1); // not logic doesn't work {quote} I suppose there's an open question as to whether the second example is legal (the brackets are unnecessary). {quote} Question 3) checkMatches("jo* smith"~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this feature does not seem to be working. {quote} That looks like a bug related to the slop factor? {quote} Question 4) The usage of AND and AND_NEXT_TO is confusing to me the query checkMatches("(jo* AND mary) smith", 1,2,5); // boolean logic with {quote} ANDs are ignored and turned into ORs (see earlier comments), but maybe a query parse error should be thrown to emphasise this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
getTermInfosIndexDivisor deprecated?
It's a get method, but the UnsupportedOperationException says "Please pass termInfosIndexDivisor up-front when opening IndexReader"? I did pass it in. I'm writing a test case for Solr that checks it. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734355#action_12734355 ] Mark Harwood commented on LUCENE-1486: -- {quote} query: '(john OR jonathon) smith~0.3 order*' order:sell stock market {quote} Would be parsed as follows (shown as the equivalent XMLQueryParser syntax):
{code:xml}
<BooleanQuery>
  <Clause occurs="should">
    <SpanNear>
      <SpanOr>
        <SpanOrTerms>john jonathon</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>smith smyth</SpanOrTerms>
      </SpanOr>
      <SpanOr>
        <SpanOrTerms>order orders</SpanOrTerms>
      </SpanOr>
    </SpanNear>
  </Clause>
  <Clause occurs="should">
    <TermQuery fieldName="order">sell</TermQuery>
  </Clause>
  <Clause occurs="should">
    <UserQuery>stock market</UserQuery>
  </Clause>
</BooleanQuery>
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
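For comparison, roughly the same structure built programmatically with the span API might look like the sketch below. This is a hand-written approximation, not parser output: the field name "name", the slop of 0, and breaking "stock market" into two SHOULD term clauses (the XML above routes it through a UserQuery) are all assumptions.
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class ComplexPhraseSketch {
  public static void main(String[] args) {
    String f = "name"; // assumed default field
    SpanQuery pos1 = new SpanOrQuery(new SpanQuery[] {
        new SpanTermQuery(new Term(f, "john")),
        new SpanTermQuery(new Term(f, "jonathon")) });
    SpanQuery pos2 = new SpanOrQuery(new SpanQuery[] {
        new SpanTermQuery(new Term(f, "smith")),
        new SpanTermQuery(new Term(f, "smyth")) });
    SpanQuery pos3 = new SpanOrQuery(new SpanQuery[] {
        new SpanTermQuery(new Term(f, "order")),
        new SpanTermQuery(new Term(f, "orders")) });
    // slop 0, in order: the three positions behave like a phrase
    SpanQuery phrase = new SpanNearQuery(new SpanQuery[] { pos1, pos2, pos3 }, 0, true);

    BooleanQuery top = new BooleanQuery();
    top.add(phrase, BooleanClause.Occur.SHOULD);
    top.add(new TermQuery(new Term("order", "sell")), BooleanClause.Occur.SHOULD);
    top.add(new TermQuery(new Term(f, "stock")), BooleanClause.Occur.SHOULD);
    top.add(new TermQuery(new Term(f, "market")), BooleanClause.Occur.SHOULD);
    System.out.println(top);
  }
}
{code}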
Re: getTermInfosIndexDivisor deprecated?
Yeah this was deprecated in LUCENE-1609; I guess we could keep the getter alive? I'll reopen it. Mike On Wed, Jul 22, 2009 at 6:07 PM, Jason Rutherglenjason.rutherg...@gmail.com wrote: It's a get method but the UnsupportedOperationException says Please pass termInfosIndexDivisor up-front when opening IndexReader? I did pass it in. Writing a test case for Solr that checks it. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
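For anyone hitting the same exception: assuming the Lucene 2.9 IndexReader.open overload that takes a deletion policy, a readOnly flag, and the divisor (and with the getter kept alive as discussed above), passing the divisor up front looks roughly like this; the index path is purely illustrative.
{code:java}
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexDivisorExample {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/path/to/index")); // hypothetical path
    // A divisor of 2 loads every 2nd indexed term into RAM, roughly halving
    // the term index memory at the cost of slightly slower term lookups.
    IndexReader reader = IndexReader.open(dir, null, true, 2);
    System.out.println("divisor = " + reader.getTermInfosIndexDivisor());
    reader.close();
  }
}
{code}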
[jira] Reopened: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened LUCENE-1609: Reopening to un-deprecate getTermInfosIndexDivisor. Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead --- Key: LUCENE-1609 URL: https://issues.apache.org/jira/browse/LUCENE-1609 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.9 Environment: Solr Tomcat 5.5 Ubuntu 2.6.20-17-generic Intel(R) Pentium(R) 4 CPU 2.80GHz, 2Gb RAM Reporter: Dan Rosher Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1609.patch, LUCENE-1609.patch, LUCENE-1609.patch, LUCENE-1609.patch The synchronized method ensureIndexIsRead in TermInfosReader causes contention under heavy load. Simple to reproduce: under Solr, with all caches turned off, run a simple range search, e.g. id:[0 TO 99], on even a small index (in my case 28K docs) under a load/stress test application; examining the thread dump later (kill -3), many threads are blocked on 'waiting for monitor entry' for this method. Rather than using Double-Checked Locking, which is known to have issues, this implementation uses a state pattern, where only one thread can move the object from the IndexNotRead state to IndexRead, and in doing so alters the object's behavior, i.e. once the index is loaded, the index no longer needs a synchronized method. In my particular test, this increased throughput at least 30 times. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
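The description above contrasts double-checked locking with a state pattern. The sketch below is not the LUCENE-1609 patch, just a minimal illustration of that idea with made-up class names: the one-time load is the only synchronized path, and once the loaded state is installed, reads need no locking.
{code:java}
import java.util.concurrent.atomic.AtomicReference;

public class LazyTermIndex {

  private interface State {
    String[] terms();
  }

  private final class NotLoaded implements State {
    public String[] terms() {
      synchronized (LazyTermIndex.this) {
        // another thread may have completed the load while we waited
        if (state.get() == this) {
          state.set(new Loaded(loadTermsFromDisk()));
        }
      }
      return state.get().terms();
    }
  }

  private static final class Loaded implements State {
    private final String[] terms;
    Loaded(String[] terms) { this.terms = terms; }
    public String[] terms() { return terms; } // no synchronization once loaded
  }

  private final AtomicReference<State> state =
      new AtomicReference<State>(new NotLoaded());

  private String[] loadTermsFromDisk() {
    // stand-in for the expensive one-time read of the term index
    return new String[] { "apache", "lucene" };
  }

  public String[] terms() {
    return state.get().terms();
  }
}
{code}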
Re: [ApacheCon US] Travel Assistance
: Is the assistance restricted to people presenting and committers? nope... http://www.apache.org/travel/index.html -Hoss - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: getTermInfosIndexDivisor deprecated?
OK done. Mike On Wed, Jul 22, 2009 at 7:37 PM, Michael McCandlessluc...@mikemccandless.com wrote: Yeah this was deprecated in LUCENE-1609; I guess we could keep the getter alive? I'll reopen it. Mike On Wed, Jul 22, 2009 at 6:07 PM, Jason Rutherglenjason.rutherg...@gmail.com wrote: It's a get method but the UnsupportedOperationException says Please pass termInfosIndexDivisor up-front when opening IndexReader? I did pass it in. Writing a test case for Solr that checks it. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1609) Eliminate synchronization contention on initial index reading in TermInfosReader ensureIndexIsRead
[ https://issues.apache.org/jira/browse/LUCENE-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1609. Resolution: Fixed -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734398#action_12734398 ] Adriano Crestani commented on LUCENE-1486: -- {quote} I propose doing this using the new QP implementation. (I can write the new javacc QP for this) (this implies that the code will be in contrib in 2.9 and be part of core on 3.0) {quote} That would be good! {quote} Granted, the test fails for a reason other than the one for which I wanted it to fail. We can probably strike the test and leave a note saying phrase-within-a-phrase just does not make sense and is not supported. {quote} Cool, I agree to remove it. But I still don't see how a user can type a phrase inside a phrase with the current syntax definition - can you give me an example? {quote} In brackets it's an OR - the brackets are used to suggest that the current phrase element at position X is composed of some choices that are evaluated as a subclause, in the same way that sub-clauses are defined in brackets in normal query logic, e.g. +a +(b OR c). There seems to be a reasonable logic to this. Ideally, ComplexPhraseQueryParser should explicitly turn the OR default on while evaluating the bracketed innards of phrases, in case the base class has AND as the default. {quote} If we use the JavaCC code Luis suggested, we would already have a query parser that throws a ParseException whenever the user types an AND inside a phrase. {quote} OR, ||, +, AND - ignored {quote} So we should throw an exception if any of these is found inside a phrase. It could confuse the user if we just ignore it. {quote} Question 2) Should these 2 queries behave the same when we fix the problem // checkMatches("john -percival", 1); // not logic doesn't work // checkMatches("john (-percival)", 1); // not logic doesn't work I suppose there's an open question as to whether the second example is legal (the brackets are unnecessary) {quote} Yes, the second is unnecessary, but I don't think it's illegal. The user could type (smith) outside the phrase, so it makes sense to support it inside as well. {quote} Question 3) checkMatches("jo* smith"~2, 1,2,3,5); // position logic works. doc 6 is also returned, so this feature does not seem to be working. That looks like a bug related to the slop factor? {quote} I have not checked yet, but I think it's working fine. The slop means how many position moves of the terms inside the phrase are allowed for a document to still match the query. It matches doc 6 because the term smith moves twice to the right and matches 'jonathon mary gomes smith'. Two moves = slop 2 :) {quote} ANDs are ignored and turned into ORs (see earlier comments), but maybe a query parse error should be thrown to emphasise this. {quote} I think we could support AND also. I agree there are few cases where the user would use it. It would work as I explained before: {quote} What happens if I type (query AND parser) lucene. In my point of view it is: (query AND parser) AND_NEXT_TO lucene. Which means for me: find any document that contains the term 'query' and the term 'parser' in position x, and the term 'lucene' in position x+1. Is this the expected behaviour? 
{quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
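To make the slop arithmetic above concrete, here is a small, self-contained sketch against an in-memory index. The document text stands in for the test's doc 6, and the wildcard jo* is replaced by the already-expanded term jonathon (an assumption for illustration); with 'mary' and 'gomes' sitting between the two terms, a slop of 1 misses and a slop of 2 matches.
{code:java}
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;

public class SlopExample {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter w = new IndexWriter(dir, new WhitespaceAnalyzer(), true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    // stand-in for "doc 6" in the test data
    doc.add(new Field("name", "jonathon mary gomes smith",
        Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
    w.close();

    IndexSearcher s = new IndexSearcher(dir, true);
    SpanQuery[] clauses = new SpanQuery[] {
        new SpanTermQuery(new Term("name", "jonathon")),
        new SpanTermQuery(new Term("name", "smith")) };
    // two terms ("mary", "gomes") sit between the clauses, so slop 2 is needed
    System.out.println("slop=1: " + s.search(new SpanNearQuery(clauses, 1, true), 10).totalHits);
    System.out.println("slop=2: " + s.search(new SpanNearQuery(clauses, 2, true), 10).totalHits);
    s.close();
  }
}
{code}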
Re: Lucene 2.9 Again
: LUCENE-1749 FieldCache introspection API Unassigned 16/Jul/09 : : You have time to work on this Hoss? i'd have more time if there weren't so many darn solr-user questions that no one else answers. The meat of the patch (adding an API to inspect the cache) could be committed as is today -- i just don't know if the API makes sense (needs more eyeballs), and the real value add will be getting the sanity testing utilities in place ... those are only about half done. i'll try to work on it more this week(end), but if there isn't any progress from me, someone else (ahem: Miller?) should probably prune it down to the core function, add whatever javadocs are missing, and commit. (better to have a release with a simple inspection API than to delay releasing while a fancier inspection API gets hashed out) -Hoss - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1644) Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood
[ https://issues.apache.org/jira/browse/LUCENE-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734411#action_12734411 ] Michael McCandless commented on LUCENE-1644: bq. I was a little bit confused about the high term number cut off, Sorry, I still need to do some perf testing to pick an appropriate default here. bq. Everywhere else (even in the backwards compatibility override test [JustCompile] it is public). And the same should be for the incNumberOfTerms (also protected). Whoops -- I'll fix. Thanks for catching it even though you're on vacation ;) bq. Also the current singletons are not really singletons, because queries that are unserialized will contain instances that are not the singleton instances Sigh. I'll do what FieldCache's parser singletons do. bq. The instance in the default Auto RewriteMethod is still modifiable. Is this correct? I was thinking this was OK, i.e. you could set the default cutoffs for anything that used the AUTO_DEFAULT. But it is static (global), so that's not great. I guess I'll make it an anonymous subclass of ConstantScoreAutoRewrite that disallows changes. Enable MultiTermQuery's constant score mode to also use BooleanQuery under the hood --- Key: LUCENE-1644 URL: https://issues.apache.org/jira/browse/LUCENE-1644 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1644.patch, LUCENE-1644.patch, LUCENE-1644.patch When MultiTermQuery is used (via one of its subclasses, e.g. WildcardQuery, PrefixQuery, FuzzyQuery, etc.), you can ask it to use constant score mode, which pre-builds a filter and then wraps that filter as a ConstantScoreQuery. If you don't set that, it instead builds a [potentially massive] BooleanQuery with one SHOULD clause per term. There are some limitations of this approach: * The scores returned by the BooleanQuery are often quite meaningless to the app, so one should be able to use a BooleanQuery yet get constant scores back. (Though I vaguely remember at least one example someone raised where the scores were useful...). * The resulting BooleanQuery can easily have too many clauses, throwing an extremely confusing exception to newish users. * It'd be better to have the freedom to choose between building a filter up front vs building a massive BooleanQuery, when constant scoring is enabled, because they have different performance tradeoffs. * In constant score mode, an OpenBitSet is always used, yet for sparse bit sets this does not give good performance. I think we could address these issues by giving BooleanQuery a constant score mode, then empower MultiTermQuery (when in constant score mode) to choose whether to use a BooleanQuery vs an up-front filter, and finally empower MultiTermQuery to pick the best (sparse vs dense) bit set impl. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
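As a usage sketch of the rewrite modes this issue is shaping (the constants below assume the MultiTermQuery rewrite-method API as it stands in the 2.9 patches; field and prefix are made up), an application would pick the behaviour per query rather than globally:
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;

public class RewriteModeSketch {
  public static void main(String[] args) {
    PrefixQuery q = new PrefixQuery(new Term("name", "jo"));

    // Let Lucene pick filter vs. constant-score BooleanQuery based on how
    // many terms/docs the prefix expands to (the "auto" cutoffs discussed above).
    q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT);

    // Or force one behaviour explicitly:
    q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);  // up-front filter, constant scores
    q.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);  // expand to BooleanQuery, real scores

    System.out.println(q);
  }
}
{code}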
[jira] Created: (LUCENE-1756) contrib/memory: PatternAnalyzerTest is a very, very, VERY, bad unit test
contrib/memory: PatternAnalyzerTest is a very, very, VERY, bad unit test Key: LUCENE-1756 URL: https://issues.apache.org/jira/browse/LUCENE-1756 Project: Lucene - Java Issue Type: Bug Components: contrib/* Reporter: Hoss Man Priority: Minor while working on something else i started getting consistent IllegalStateExceptions from PatternAnalyzerTest -- but only when running the test from the top level. Digging into the test, i've found numerous things that are very scary... * instead of using assertions to test that token streams match, it throws an IllegalStateException when they don't, and then logs a bunch of info about the token streams to System.out -- having assertion messages that tell you *exactly* what doesn't match would make a lot more sense. * it builds up a list of files to analyze using paths that it evaluates relative to the current working directory -- which means you get different files depending on whether you run the tests from the contrib level or from the top level build file * the list of files it looks for includes: ../../*.txt, ../../*.html, ../../*.xml ... so not only do you get different results when you run the tests in the contrib vs at the top level, but different people running the tests via the top level build file will get different results depending on what types of text, html, and xml files they happen to have two directories above where they checked out lucene. * the test comments indicate that its purpose is to show that PatternAnalyzer produces the same tokens as other analyzers - but point out this will fail for WhitespaceAnalyzer because of the 255 character token limit WhitespaceTokenizer imposes -- the test then proceeds to compare PatternAnalyzer to WhitespaceTokenizer, guaranteeing a test failure for anyone who happens to have a text file containing more than 255 characters of non-whitespace in a row somewhere in ../../ (in my case: my bookmarks.html file, and the hex encoded favicon.gif images) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
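As a sketch of the assertion-based comparison suggested above (not the actual test rewrite; it assumes the 2.9 attribute-based TokenStream API and JUnit 3 assertions), a failing run would then report exactly which token differs instead of dumping both streams to System.out:
{code:java}
import java.io.IOException;
import junit.framework.Assert;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class TokenStreamAssert {
  // Compare two token streams term by term, failing with a message that names
  // the first mismatching position rather than throwing IllegalStateException.
  public static void assertSameTokens(TokenStream expected, TokenStream actual) throws IOException {
    TermAttribute expTerm = (TermAttribute) expected.addAttribute(TermAttribute.class);
    TermAttribute actTerm = (TermAttribute) actual.addAttribute(TermAttribute.class);
    int pos = 0;
    while (expected.incrementToken()) {
      Assert.assertTrue("too few tokens; expected '" + expTerm.term() + "' at position " + pos,
          actual.incrementToken());
      Assert.assertEquals("token mismatch at position " + pos, expTerm.term(), actTerm.term());
      pos++;
    }
    Assert.assertFalse("too many tokens after position " + pos, actual.incrementToken());
  }
}
{code}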