[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835961#action_12835961 ]

Mark Harwood commented on LUCENE-1486:
--------------------------------------

Double Ugh. Applying the patch for the "non-default field" bug doesn't work any more, because the latest ComplexPhraseQueryParser source sitting in contrib is now in a different package from the QueryParser base class. This means the subclass no longer has the write access it needs to the package-protected "field" variable. That access is needed to temporarily set the context of the parser while it processes the inner contents of the phrase.

Fixing this would require either changing the package name of ComplexPhraseQueryParser or changing the visibility of "field" in the QueryParser base class to "protected". Does anyone have strong feelings about which of these is more acceptable?

> Wildcards, ORs etc inside Phrase queries
> ----------------------------------------
>
>                 Key: LUCENE-1486
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1486
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>    Affects Versions: 2.4
>            Reporter: Mark Harwood
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch, junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax.
> Examples from the Junit test include:
>   checkMatches("\"j* smyth~\"", "1,2"); // wildcards and fuzzies are OK in phrases
>   checkMatches("\"(jo* -john) smith\"", "2"); // boolean logic works
>   checkMatches("\"jo* smith\"~2", "1,2,3"); // position logic works.
>
>   checkBadQuery("\"jo* id:1 smith\""); // mixing fields in a phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported
> Code plus Junit test to follow...

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
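The visibility problem raised in the comment above can be pictured with a minimal, self-contained sketch. `BaseParser` and `PhraseAwareParser` are hypothetical stand-ins, not the real QueryParser or ComplexPhraseQueryParser, and the sketch assumes the "make the field protected" option is the one chosen. It only shows the pattern of temporarily switching the field while the inside of a phrase is parsed, then restoring it.

```java
// Hypothetical stand-ins for illustration only; these are not Lucene classes.
class BaseParser {
    // "protected" lets subclasses in other packages read and write the field;
    // with package-private visibility a subclass in another package could not.
    protected String field;

    BaseParser(String defaultField) {
        this.field = defaultField;
    }

    // Toy "parse": prefixes the text with the field currently in effect.
    String parseTerm(String text) {
        return field + ":" + text;
    }
}

class PhraseAwareParser extends BaseParser {
    PhraseAwareParser(String defaultField) {
        super(defaultField);
    }

    // Parses each word of a phrase against phraseField, then restores the
    // previous field even if parsing throws.
    String parsePhraseContents(String phraseField, String contents) {
        String saved = field;
        field = phraseField;
        try {
            StringBuilder sb = new StringBuilder();
            for (String word : contents.trim().split("\\s+")) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(parseTerm(word));
            }
            return sb.toString();
        } finally {
            field = saved;
        }
    }
}
```

The try/finally is the important part of the sketch: whichever visibility fix is picked, the parser's default field should be back in place once the phrase has been processed.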
[jira] Closed: (LUCENE-1513) fastss fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir closed LUCENE-1513.
-------------------------------

    Resolution: Not A Problem

For Lucene, LUCENE-2089 will always be faster than even FastSS, as our FuzzyQuery is really a top-N query, and we can exploit properties of the priority queue to make it even faster.
LUCENE-2089 also works without any auxiliary index or data structures, relying solely on Lucene's terms dict, so it works great for updates/NRT/whatever, with no back compat problems.

I'm cancelling this issue as the alternative is superior in every aspect.

> fastss fuzzyquery
> -----------------
>
>                 Key: LUCENE-1513
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1513
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Robert Muir
>            Priority: Minor
>         Attachments: fastSSfuzzy.zip
>
>
> code for doing fuzzyqueries with fastssWC algorithm.
> FuzzyIndexer: given a lucene field, it enumerates all terms and creates an auxiliary offline index for fuzzy queries.
> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index to retrieve a candidate list. this list is then verified with the Levenshtein algorithm.
> sorry but the code is a bit messy... what I'm actually using is very different from this so it's pretty much untested. but at least you can see what's going on or fix it up.
Re: ComplexPhraseQuery problems with simple phrases
This is because phrases are expected to contain >1 clause and the ComplexPhraseQueryParser was expecting a BooleanQuery from the base class which is used to hold the elements in the phrase.
In this single-clause scenario I guess we could silently hide the error and return whatever single query clause was inappropriately found between the quotes.

On 19 Feb 2010, at 19:53, David Kaelbling wrote:

> Hi,
>
> ComplexPhraseQueryParser doesn't appear to handle some simple wildcard
> phrases correctly. In TestComplexPhraseQuery.testComplexPhrases() on
> trunk I tried these two tests:
>
>   checkMatches("\"j*n sm*h\"", "1,2");
>   checkMatches("\"j*n\"", "1,2,3,4");
>
> The first check succeeds. The second throws an IllegalArgumentException
> trying to rewrite the query, complaining that WildcardQuery is an
> unknown query type. If this is bad syntax I would have expected the
> first query to have failed too.
>
> Does anyone have a fix?
>
>    Thanks,
>    David
>
> --
> David Kaelbling
> Senior Software Engineer
> Black Duck Software, Inc.
>
> dkaelbl...@blackducksoftware.com
> T +1.781.810.2041
> F +1.781.891.5145
>
> http://www.blackducksoftware.com
[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2089:
--------------------------------

    Description:
We can optimize FuzzyQuery by using AutomatonTermsEnum. The idea is to speed up the core FuzzyQuery in a similar fashion to the Wildcard and Regex speedups, maintaining all backwards compatibility.

The advantages are:
* we can seek to terms that are useful, instead of brute-forcing the entire terms dict
* we can determine matches faster, as true/false from a DFA is an array lookup; we don't even need to run Levenshtein.

We build Levenshtein DFAs in linear time with respect to the length of the word: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

To implement support for 'prefix' length, we simply concatenate two DFAs, which doesn't require us to do NFA->DFA conversion, as the prefix portion is a singleton. The concatenation is also constant time with respect to the size of the fuzzy DFA; it only needs to examine its start state.

With this algorithm, parametric tables are precomputed so that DFAs can be constructed very quickly.
If the required number of edits is too large (we don't have a table for it), we use "dumb mode" at first (no seeking, no DFA, just brute force like now).

As the priority queue fills up during enumeration, the similarity score required to be a competitive term increases, so the enum gets faster and faster as this happens. This is because terms in core FuzzyQuery are sorted by boost value, then by term (in lexicographic order).

For a large term dictionary with a low minimal similarity, you will fill the pq very quickly since you will match many terms.
This not only provides a mechanism to switch to more efficient DFAs (edit distance of 2 -> edit distance of 1 -> edit distance of 0) during enumeration, but also to switch from "dumb mode" to "smart mode".

With this design, we can add more DFAs at any time by adding additional tables. The tradeoff is that the tables get rather large, so for very high K, we would start to increase the size of Lucene's jar file. The idea is that we don't have to include large tables for very high K, because we can use the 'competitive boost' attribute of the priority queue.

For more information, see http://en.wikipedia.org/wiki/Levenshtein_automaton

  was:
Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is itching to write that nasty algorithm)

we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
* up front, calculate the maximum required K edits needed to match the users supplied float threshold.
* for at least small common E up to some max K (1,2,3, etc) we should create a DFA for each E.

if the required E is above our supported max, we use "dumb mode" at first (no seeking, no DFA, just brute force like now).
As the pq fills, we swap progressively lower DFAs into the enum, based upon the lowest score in the pq.
This should work well on avg, at high E, you will typically fill the pq very quickly since you will match many terms.
This not only provides a mechanism to switch to more efficient DFAs during enumeration, but also to switch from "dumb mode" to "smart mode".

i modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNN, etc.
* AvgMS_DFA: this is the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

as you can see, this prototype is no good yet, because it creates the DFA in a slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA conversion.
So, for a very long string, it just gets worse and worse. This has nothing to do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), there is no problem there.
instead we should just build a DFA to begin with, maybe with this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 we can precompute the tables with that algorithm up to some reasonable K, and then I think we are ok. the paper references using http://portal.acm.org/citation.cfm?id=135907 for linear minimization, if someone wants to implement this they should not worry about minimization. in fact, we need to at some point determine if AutomatonQuery should even minimize FSM's at all, or if it is simply enough for them to be deterministic with no transitions to dead states. (The only code that actually assumes minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a summation easily). we need to benchmark really complex DFAs (i.e. write a regex benchmark) to figure out if minimization is even helping right now. edit the description, to hopefully be simpler. > explore using automaton for fuzzyquery > -- > >
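The first step in the older description, turning the user's float threshold into a maximum number of edits K, can be sketched as follows. This is an illustrative sketch, not code from the patch: `FuzzyThreshold` and `maxEdits` are hypothetical names, and it assumes similarity is defined as 1 - edits / termLength for terms of equal length with no prefix.

```java
// Hypothetical helper for illustration; not part of the LUCENE-2089 patch.
final class FuzzyThreshold {
    private FuzzyThreshold() {}

    // Returns the largest number of edits k whose similarity is still at least
    // minSimilarity, ASSUMING similarity = 1 - k / termLength.
    static int maxEdits(double minSimilarity, int termLength) {
        if (minSimilarity < 0.0 || minSimilarity >= 1.0) {
            throw new IllegalArgumentException("minSimilarity must be in [0, 1)");
        }
        if (termLength <= 0) {
            throw new IllegalArgumentException("termLength must be positive");
        }
        int k = 0;
        // Try one more edit as long as the resulting similarity stays acceptable.
        while (k < termLength
                && (double) (termLength - (k + 1)) / termLength >= minSimilarity) {
            k++;
        }
        return k;
    }
}
```

Under that assumption, a 7-character term with a minimum similarity of 0.73 allows one edit, and lowering the threshold to 0.58 allows two. If K comes out above the largest precomputed table, the description's "dumb mode" would apply.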
ComplexPhraseQuery problems with simple phrases
Hi,

ComplexPhraseQueryParser doesn't appear to handle some simple wildcard phrases correctly. In TestComplexPhraseQuery.testComplexPhrases() on trunk I tried these two tests:

  checkMatches("\"j*n sm*h\"", "1,2");
  checkMatches("\"j*n\"", "1,2,3,4");

The first check succeeds. The second throws an IllegalArgumentException trying to rewrite the query, complaining that WildcardQuery is an unknown query type. If this is bad syntax I would have expected the first query to have failed too.

Does anyone have a fix?

   Thanks,
   David

--
David Kaelbling
Senior Software Engineer
Black Duck Software, Inc.

dkaelbl...@blackducksoftware.com
T +1.781.810.2041
F +1.781.891.5145

http://www.blackducksoftware.com
Question on highlighting of nested SpanQuery instances
Hello,

I initially posted a version of this question to java-user, but think it's more of a java-dev question. I haven't yet been able to resolve why I'm seeing spurious highlighting in nested SpanQuery instances. To illustrate this, I added the code below to the HighlighterTest class in lucene_2_9_1:

/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {

  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay

  String fieldName = "SOME_FIELD_NAME";

  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);

  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);

  String expected = "The Lucene was made by Doug Cutting and Lucene great Hadoop was";
  //String expected = "The Lucene was made by Doug Cutting and the great Hadoop was";

  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\n" + "Observed: \"" + observed);

  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
}

Is this an issue that's arisen before? I've been reading through the source to QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and NearSpansOrdered, but haven't found the solution yet. Initially, I thought that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me too far.

Any suggestions are welcome.

Thanks.
Mike
[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2089:
--------------------------------

    Attachment: ContrivedFuzzyBenchmark.java

attached is a 'contrived fuzzy benchmark' derived from my wildcard benchmark (randomly generated 7-digit terms)

for the benchmark, i ran results for various combinations of minimum similarity, prefix length, and pq size for the test index of 10million terms.

Avg MS old is the current flex branch.
Avg MS new is with the patch.

Notes:
* only the table for distance n=1 is implemented yet!
* n=1 is fast.
* Use of the PQ boost attribute speeds up fuzzy queries for higher n slightly, too.
* adding a table for n=2 should be extremely helpful, and maybe even enough for the default PQ size of 1024 (BQ.maxClauseCount), to make all fuzzy queries reasonable.

{{Minimum Sim = 0.73f (edit distance of 1)}}
||Prefix Length||PQ Size||Avg MS (old)||Avg MS (new)||
|0|1024|3286.0|10.6|
|0|64|3320.4|7.2|
|1|1024|316.8|5.3|
|1|64|314.3|5.3|
|2|1024|31.8|4.0|
|2|64|31.9|4.2|

{{Minimum Sim = 0.58f (edit distance of 2)}}
||Prefix Length||PQ Size||Avg MS (old)||Avg MS (new)||
|0|1024|4223.3|1341.6|
|0|64|4199.7|501.9|
|1|1024|430.1|304.1|
|1|64|392.8|44.7|
|2|1024|82.5|70.0|
|2|64|38.4|7.7|

{{Minimum Sim = 0.43f (edit distance of 3)}}
||Prefix Length||PQ Size||Avg MS (old)||Avg MS (new)||
|0|1024|5299.9|2617.0|
|0|64|5231.8|476.4|
|1|1024|522.9|318.9|
|1|64|480.9|73.9|
|2|1024|89.0|83.9|
|2|64|46.3|8.6|

{{Minimum Sim = 0.29f (edit distance of 4)}}
||Prefix Length||PQ Size||Avg MS (old)||Avg MS (new)||
|0|1024|6258.1|3114.0|
|0|64|6247.6|684.6|
|1|1024|609.9|380.0|
|1|64|567.1|69.3|
|2|1024|98.6|93.8|
|2|64|55.6|11.4|

> explore using automaton for fuzzyquery
> --------------------------------------
>
>                 Key: LUCENE-2089
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2089
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Assignee: Mark Miller
>            Priority: Minor
> Fix For: Flex Branch > > Attachments: ContrivedFuzzyBenchmark.java, LUCENE-2089.patch, > LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, > LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, TestFuzzy.java > > > Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is > itching to write that nasty algorithm) > we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea > * up front, calculate the maximum required K edits needed to match the users > supplied float threshold. > * for at least small common E up to some max K (1,2,3, etc) we should create > a DFA for each E. > if the required E is above our supported max, we use "dumb mode" at first (no > seeking, no DFA, just brute force like now). > As the pq fills, we swap progressively lower DFAs into the enum, based upon > the lowest score in the pq. > This should work well on avg, at high E, you will typically fill the pq very > quickly since you will match many terms. > This not only provides a mechanism to switch to more efficient DFAs during > enumeration, but also to switch from "dumb mode" to "smart mode". > i modified my wildcard benchmark to generate random fuzzy queries. > * Pattern: 7N stands for NNN, etc. > * AvgMS_DFA: this is the time spent creating the automaton (constructor) > ||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA|| > |7N|10|64.0|4155.9|38.6|20.3| > |14N|10|0.0|2511.6|46.0|37.9| > |28N|10|0.0|2506.3|93.0|86.6| > |56N|10|0.0|2524.5|304.4|298.5| > as you can see, this prototype is no good yet, because it creates the DFA in > a slow way. right now it creates an NFA, and all this wasted time is in > NFA->DFA conversion. > So, for a very long string, it just gets worse and worse. This has nothing to > do with lucene, and here you can see, the TermEnum is fast (AvgMS - > AvgMS_DFA), there is no problem there. 
> instead we should just build a DFA to begin with, maybe with this paper: > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 > we can precompute the tables with that algorithm up to some reasonable K, and > then I think we are ok. > the paper references using http://portal.acm.org/citation.cfm?id=135907 for > linear minimization, if someone wants to implement this they should not worry > about minimization. > in fact, we need to at some point determine if AutomatonQuery should even > minimize FSM's at all, or if it is simply enough for them to be deterministic > with no transitions to dead states. (The only code that actually assumes > minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a > summation easily). we need to benchmark really complex DFAs (i.e. write a > regex benchmark) to figure out if minimization
[jira] Commented: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835903#action_12835903 ]

Uwe Schindler commented on LUCENE-2190:
---------------------------------------

During refactoring I found out that CustomScoreQuery is broken in more ways: the rewrite() method is wrong. For now it's not really a problem, but when we change to per-segment rewrite (as Mike plans) it will be broken. It's even broken if you rewrite against one IndexReader and want to reuse the query later on another IndexReader. It rewrites all its subqueries and returns itself, which is wrong: if one of the subqueries was rewritten, it must return a new clone instance (like BooleanQuery). Also, hashCode and equals ignore strict.

Will provide patch later. Now everything seems to work correctly.

> CustomScoreQuery (function query) is broken (due to per-segment searching)
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-2190
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2190
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9, 2.9.1, 3.0, 3.0.1, 3.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.9.2, 3.0.1, 3.1
>
>         Attachments: LUCENE-2190.patch
>
>
> Spinoff from here:
>   http://lucene.markmail.org/message/psw2m3adzibaixbq
> With the cutover to per-segment searching, CustomScoreQuery is not really usable anymore, because the per-doc custom scoring method (customScore) receives a per-segment docID, yet there is no way to figure out which segment you are currently searching.
> I think to fix this we must also notify the subclass whenever a new segment is switched to. I think if we copy Collector.setNextReader, that would be sufficient. It would by default do nothing in CustomScoreQuery, but a subclass could override.
[jira] Commented: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835837#action_12835837 ]

Uwe Schindler commented on LUCENE-2190:
---------------------------------------

We can preserve backwards compatibility if the default impl that takes the new reader simply passes through to the deprecated old customScore function. I will provide a patch tomorrow.
[jira] Reopened: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler reopened LUCENE-2190:
-----------------------------------

The fix is invalid: adding setNextReader to CustomScoreQuery makes the Query itself stateful. This breaks when it is used together with e.g. ParallelMultiSearcher.

As the package is experimental, I see no problem in changing the method signature of customScore to pass in the affected IndexReader (CustomScorer can do this).
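The argument for passing the reader into customScore, rather than setting it on the query, can be pictured with a small self-contained sketch. `SegmentView` and `StatelessCustomScore` are hypothetical stand-ins, not Lucene classes and not the eventual patch; the sketch only shows a scoring rule with no mutable fields, so that one instance could be shared by searches running over different segments at the same time.

```java
// Hypothetical stand-in for a per-segment reader; not a Lucene class.
final class SegmentView {
    final int docBase;      // global doc id of this segment's first document
    final float[] boosts;   // one illustrative value per segment-local doc id

    SegmentView(int docBase, float[] boosts) {
        this.docBase = docBase;
        this.boosts = boosts;
    }
}

// Scoring rule with no mutable fields: the segment arrives as an argument,
// so nothing has to be remembered between calls.
final class StatelessCustomScore {
    float customScore(SegmentView segment, int localDoc, float subQueryScore) {
        return subQueryScore * segment.boosts[localDoc];
    }

    // The segment also gives enough context to recover a global doc id.
    int globalDoc(SegmentView segment, int localDoc) {
        return segment.docBase + localDoc;
    }
}
```

A setter-based variant would need a "current segment" field on the shared object, which is exactly the state that concurrent searchers could overwrite for each other.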
[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery
[ https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2089:
--------------------------------

    Attachment: LUCENE-2089.patch

* implement the pq algorithm: when the value at the bottom of the pq changes (BoostAttribute maxCompetitiveBoost), the enum adjusts itself by decreasing the edit distance and swapping in more efficient code.
* remove the wasted prefix checks in automatonfuzzytermsenum, as Uwe noticed, because they are not necessary and are handled as part of the DFA itself (it will never seek to such terms).

here is a patch, which is complete... it needs code beautification/tests/docs but it has all functionality.

we should also add a table for n=2, maybe n=3 also, but these can be separate issues.

> explore using automaton for fuzzyquery
> --------------------------------------
>
>                 Key: LUCENE-2089
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2089
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, TestFuzzy.java
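The "pq algorithm" mentioned in this update can be pictured with a toy model. This is a sketch under stated assumptions, not the patch's code: `CompetitiveTopN` is a hypothetical class, scores stand in for boosts, ties are treated as non-competitive, and similarity is assumed to be 1 - edits / termLength.

```java
import java.util.PriorityQueue;

// Toy model of a bounded top-N queue whose weakest entry sets the bar for
// new candidates. Hypothetical; not the LUCENE-2089 implementation.
final class CompetitiveTopN {
    private final int capacity;
    // Min-heap: the weakest score currently kept sits at the head.
    private final PriorityQueue<Double> scores = new PriorityQueue<>();

    CompetitiveTopN(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Offers a score and returns true if it was kept.
    boolean offer(double score) {
        if (scores.size() < capacity) {
            scores.add(score);
            return true;
        }
        if (score <= scores.peek()) {
            return false; // cannot beat the weakest kept entry
        }
        scores.poll();
        scores.add(score);
        return true;
    }

    // Score a new candidate must exceed; negative infinity until the queue is full.
    double minCompetitiveScore() {
        return scores.size() < capacity ? Double.NEGATIVE_INFINITY : scores.peek();
    }

    // Largest edit distance still worth enumerating for a term of the given
    // length, ASSUMING similarity = 1 - edits / termLength.
    int editBudget(int termLength, int initialMaxEdits) {
        double bar = minCompetitiveScore();
        int k = initialMaxEdits;
        while (k > 0 && 1.0 - (double) k / termLength <= bar) {
            k--;
        }
        return k;
    }
}
```

As the queue fills with strong matches, `editBudget` shrinks, which corresponds to the point at which the enum could swap in a cheaper automaton for a smaller edit distance.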
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2271:
----------------------------------

    Attachment: LUCENE-2271.patch

Improved test that also checks for increasing doc ids when scores are identical.

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2271
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2271
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Priority: Minor
>             Fix For: 2.9.2, 3.0.1, 3.1
>
>         Attachments: LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
>
>
> This is a followup to LUCENE-2270, where a part of this problem was fixed (boost = 0 leading to NaN scores, which is also unintuitive), but in general, function queries in Solr can create these invalid scores easily. In previous versions of Lucene these scores were ordered correctly (except NaN, which mixes up results), and invalid document ids (like Integer.MAX_VALUE) were never returned.
> The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ to work, this sentinel must be smaller than all possible values, which is not the case:
> - -inf is equal to the sentinel's score, so the document is rejected as not competitive and is not inserted into the HQ. But the HQ is not yet full, so the sentinel values stay in the HQ and the result contains the Integer.MAX_VALUE docs. This problem is solvable (and only affects the Ordered collector) by changing the exit condition to:
> {code}
> if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) {
>     // Since docs are returned in-order (i.e., increasing doc Id), a document
>     // with equal score to pqTop.score cannot compete since HitQueue favors
>     // documents with lower doc Ids. Therefore reject those docs too.
>     return;
> }
> {code}
> - The NaN case can be fixed in the same way, but then has another problem: all comparisons with NaN result in false (none of these is true): x < NaN, x > NaN, NaN == NaN. As a result, HQ's lessThan always returns false, which leads to unexpected ordering in the PQ, and sometimes the sentinel values do not stay at the top of the queue. A later hit then overrides the top of the queue but leaves the incorrect sentinels unchanged -> invalid results. This can be fixed in two ways in HQ:
> Force all sentinels to the top:
> {code}
> protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
>     if (hitA.doc == Integer.MAX_VALUE)
>       return true;
>     if (hitB.doc == Integer.MAX_VALUE)
>       return false;
>     if (hitA.score == hitB.score)
>       return hitA.doc > hitB.doc;
>     else
>       return hitA.score < hitB.score;
> }
> {code}
> or alternatively have a defined order for NaN (Float.compare sorts them after +inf):
> {code}
> protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
>     if (hitA.score == hitB.score)
>       return hitA.doc > hitB.doc;
>     else
>       return Float.compare(hitA.score, hitB.score) < 0;
> }
> {code}
> The problem with both solutions is that we now have more comparisons per hit, and the use of sentinels is questionable. I would like to remove the sentinels and use the old pre-2.9 code for comparing, and use PQ.add() when a competitive hit arrives. The order of NaN would be unspecified.
> To fix the order of NaN, it would be better to replace all score comparisons by Float.compare() [also in FieldComparator].
> I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and solved.
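The NaN behaviour that the issue description relies on can be checked in isolation. The sketch below is not Lucene code; `NanOrdering` is a hypothetical class that contrasts a comparison written with the `<` operator against one written with `Float.compare`, which is a standard Java method.

```java
// Hypothetical demo class; it only exercises standard Java float semantics.
final class NanOrdering {
    private NanOrdering() {}

    // Operator-based comparison: false whenever either argument is NaN.
    static boolean lessThanOperator(float a, float b) {
        return a < b;
    }

    // Total order from Float.compare: NaN is treated as greater than every
    // other value, including positive infinity, and equal to itself.
    static boolean lessThanTotal(float a, float b) {
        return Float.compare(a, b) < 0;
    }
}
```

With the operator version, neither `lessThanOperator(1f, Float.NaN)` nor `lessThanOperator(Float.NaN, 1f)` is true, so a heap built on it has no consistent place for NaN. One side effect of `Float.compare` worth keeping in mind is that it also orders -0.0f before 0.0f, which `<` does not.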
[jira] Updated: (LUCENE-2272) PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
[ https://issues.apache.org/jira/browse/LUCENE-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Keegan updated LUCENE-2272: - Attachment: payloadfunctin-patch.txt This patch adds the 'explain' method to the 'PayloadFunction' interface, where the Scorer can call it. Added unit tests for 'explain' and for {Min,Max}PayloadFunction. > PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction' > --- > > Key: LUCENE-2272 > URL: https://issues.apache.org/jira/browse/LUCENE-2272 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Reporter: Peter Keegan > Attachments: payloadfunctin-patch.txt > > > The 'explain' method in PayloadNearSpanScorer assumes the > AveragePayloadFunction was used. This patch adds the 'explain' method to the > 'PayloadFunction' interface, where the Scorer can call it. Added unit tests > for 'explain' and for {Min,Max}PayloadFunction.
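The idea in this patch description, moving 'explain' onto the payload function so the scorer does not have to assume the averaging function, might look roughly like the sketch below. Everything here is an assumption for illustration: the class names, the two-argument signatures and the String return type are hypothetical and simplified, and are not the signatures used by Lucene's PayloadFunction or by the attached patch.

```java
final class PayloadExplainSketch {

    /** Hypothetical, simplified payload function; not Lucene's PayloadFunction API. */
    abstract static class FunctionSketch {
        /** Final payload score of a document from the accumulated payload score. */
        abstract float docScore(int numPayloadsSeen, float payloadScore);

        /** Each function describes itself, so the scorer needs no special cases. */
        abstract String explain(int numPayloadsSeen, float payloadScore);
    }

    static final class AverageSketch extends FunctionSketch {
        @Override float docScore(int n, float payloadScore) {
            // payloadScore is assumed to be the running sum of payload values
            return n > 0 ? payloadScore / n : 1.0f;
        }
        @Override String explain(int n, float payloadScore) {
            return "AverageSketch.docScore() = " + docScore(n, payloadScore);
        }
    }

    static final class MaxSketch extends FunctionSketch {
        @Override float docScore(int n, float payloadScore) {
            // payloadScore is assumed to be the running maximum already
            return n > 0 ? payloadScore : 1.0f;
        }
        @Override String explain(int n, float payloadScore) {
            return "MaxSketch.docScore() = " + docScore(n, payloadScore);
        }
    }

    /** Scorer side: delegates instead of hardwiring the average explanation. */
    static String explainPayloads(FunctionSketch function, int n, float payloadScore) {
        return function.explain(n, payloadScore);
    }

    public static void main(String[] args) {
        System.out.println(explainPayloads(new AverageSketch(), 2, 3.0f));
        System.out.println(explainPayloads(new MaxSketch(), 2, 3.0f));
    }
}
```

The point of the delegation is that adding another function later only requires implementing explain on that function; the scorer-side method stays unchanged.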
[jira] Created: (LUCENE-2272) PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction' --- Key: LUCENE-2272 URL: https://issues.apache.org/jira/browse/LUCENE-2272 Project: Lucene - Java Issue Type: Bug Components: Search Reporter: Peter Keegan Attachments: payloadfunctin-patch.txt The 'explain' method in PayloadNearSpanScorer assumes the AveragePayloadFunction was used. This patch adds the 'explain' method to the 'PayloadFunction' interface, where the Scorer can call it. Added unit tests for 'explain' and for {Min,Max}PayloadFunction.
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835763#action_12835763 ] Uwe Schindler commented on LUCENE-2271: --- The cost to handle NaN is the modified lessThan() in HitQueue. > Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect > results with TopScoreDocCollector > -- > > Key: LUCENE-2271 > URL: https://issues.apache.org/jira/browse/LUCENE-2271 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Priority: Minor > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
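As a small illustration of what the modified lessThan() changes, the sketch below compares an operator-only comparison with the Float.compare variant from the issue description when a NaN score meets a -inf sentinel. NaNOrderingSketch and ScoredHit are hypothetical stand-ins, not Lucene's HitQueue or ScoreDoc.

```java
final class NaNOrderingSketch {

    /** Hypothetical stand-in for a scored hit; not Lucene's ScoreDoc. */
    static final class ScoredHit {
        final int doc;
        final float score;
        ScoredHit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Operator-only comparison: every test involving NaN is false. */
    static boolean lessThanOperators(ScoredHit a, ScoredHit b) {
        if (a.score == b.score) return a.doc > b.doc;
        return a.score < b.score;
    }

    /** Variant from the issue text: Float.compare orders NaN after +inf. */
    static boolean lessThanCompare(ScoredHit a, ScoredHit b) {
        if (a.score == b.score) return a.doc > b.doc;
        return Float.compare(a.score, b.score) < 0;
    }

    public static void main(String[] args) {
        ScoredHit sentinel = new ScoredHit(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY);
        ScoredHit nanHit = new ScoredHit(5, Float.NaN);
        System.out.println("operators: sentinel < NaN hit? " + lessThanOperators(sentinel, nanHit));
        System.out.println("operators: NaN hit < sentinel? " + lessThanOperators(nanHit, sentinel));
        System.out.println("compare:   sentinel < NaN hit? " + lessThanCompare(sentinel, nanHit));
        System.out.println("compare:   NaN hit < sentinel? " + lessThanCompare(nanHit, sentinel));
    }
}
```

With operators only, neither hit is less than the other, so the heap has no consistent place for the sentinel. With Float.compare the sentinel is always less than a NaN hit. One caveat of the quoted variant: two NaN hits are still never equal under ==, so the doc id tie-break is skipped for them and neither is less than the other.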
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835762#action_12835762 ] Yonik Seeley commented on LUCENE-2271: -- bq. A design bug that function queries score docs with an invalid score (NaN) instead of throwing an exception? No, a design bug that -Inf scores were disallowed, esp since they were handled just fine in the past. NaN is different - it's more of a judgement call depending on the cost to handle it. > Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect > results with TopScoreDocCollector > -- > > Key: LUCENE-2271 > URL: https://issues.apache.org/jira/browse/LUCENE-2271 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Priority: Minor > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
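For readers wondering how a function query ends up with such scores in the first place, the sketch below shows plain Java float arithmetic producing -inf and NaN. The reciprocal helper is a hypothetical example of a 1/x style function; whether these are the exact code paths behind LUCENE-2270 or LUCENE-2271 is an assumption, not something stated in the thread.

```java
final class InvalidScoreSketch {

    /** Hypothetical 1/x style function score in float arithmetic. */
    static float reciprocal(float x) {
        return 1.0f / x;   // float division by zero does not throw
    }

    public static void main(String[] args) {
        float posInf = reciprocal(0.0f);     // +Infinity
        float negInf = reciprocal(-0.0f);    // -Infinity, same score as the sentinel
        float nan = 0.0f * posInf;           // a zero factor times infinity is NaN
        System.out.println(posInf + " " + negInf + " " + nan);

        // -inf is still an ordinary, ordered value ...
        System.out.println(negInf < 0.0f);
        // ... while NaN fails every ordered comparison, including equality with itself.
        System.out.println(nan == nan);
        System.out.println(Float.isNaN(nan));
    }
}
```

This matches the distinction drawn in the comment: -inf sorts like any other float and only collides with the sentinel value, while NaN breaks comparison-based ordering itself.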
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835758#action_12835758 ] Robert Muir commented on LUCENE-2271: - bq. OK, so it was a design bug too. A design bug that function queries score docs with an invalid score (NaN) instead of throwing an exception? > Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect > results with TopScoreDocCollector > -- > > Key: LUCENE-2271 > URL: https://issues.apache.org/jira/browse/LUCENE-2271 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Priority: Minor > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835750#action_12835750 ] Yonik Seeley commented on LUCENE-2271: -- bq. its not a bug, as its doc'ed to work this way. OK, so it was a design bug too. > Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect > results with TopScoreDocCollector > -- > > Key: LUCENE-2271 > URL: https://issues.apache.org/jira/browse/LUCENE-2271 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Priority: Minor > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: LUCENE-2271.patch Patch with testcases for trunk, but should work on branches, too (after removing @Override). Without the fixes in HitQueue or TSDC the tests fail. > Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect > results with TopScoreDocCollector > -- > > Key: LUCENE-2271 > URL: https://issues.apache.org/jira/browse/LUCENE-2271 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Priority: Minor > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835717#action_12835717 ] Robert Muir commented on LUCENE-2271: - bq.The cost of the additional checks in HitQueue.lessThan are neglectible, as they only occur when a competitive hit is really inserted into the queue. This should be benchmarked for MultiSearcher and ParallelMultiSearcher, too, as they also use HitQueue. > Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect > results with TopScoreDocCollector > -- > > Key: LUCENE-2271 > URL: https://issues.apache.org/jira/browse/LUCENE-2271 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Priority: Minor > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2271.patch, TSDC.patch
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2271: -- Attachment: (was: LUCENE-2271.patch) > Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect > results with TopScoreDocCollector > -- > > Key: LUCENE-2271 > URL: https://issues.apache.org/jira/browse/LUCENE-2271 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.9 >Reporter: Uwe Schindler >Priority: Minor > Fix For: 2.9.2, 3.0.1, 3.1 > > Attachments: LUCENE-2271.patch, TSDC.patch
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2271:
Attachment: TSDC.patch

Attached is a patch, written by Uwe. As a "bugfix" I prefer this patch; I think the more complicated, performance-intrusive NaN fixes should be something we do in 3.1. E.g., "fixing" NaN to work will likely slow down people getting large numbers of results, and I don't think we should do that in bugfix releases. But in 3.1 we could change it, include some large-results-oriented collectors for these people, and the whole thing would make sense.
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated LUCENE-2271:
Attachment: LUCENE-2271.patch

Sorry, I reverted the removal of a comment.
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uwe Schindler updated LUCENE-2271:
Attachment: LUCENE-2271.patch

This is a patch that supports NaN and -inf. The cost of the additional checks in HitQueue.lessThan is negligible, as they only occur when a competitive hit is actually inserted into the queue. The check forces all sentinels to the top of the queue, regardless of what their score is (because NaN != NaN always holds).
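As a rough illustration of the sentinel check described in this message, the sketch below compares a plain score-based lessThan with a variant that tests for the sentinel doc id first. It is not the attached patch: the class name, the method names and the flattened (doc, score) parameters are hypothetical, chosen only to keep the example self-contained without Lucene on the classpath.

```java
public class SentinelFirstLessThanDemo {
    // Doc id used for sentinel entries, as stated in the issue description.
    static final int SENTINEL_DOC = Integer.MAX_VALUE;

    // Plain comparison: relies only on primitive float operators,
    // so any NaN operand makes both branches evaluate to false.
    public static boolean plainLessThan(int docA, float scoreA, int docB, float scoreB) {
        if (scoreA == scoreB) {
            return docA > docB;
        }
        return scoreA < scoreB;
    }

    // Sentinel-first comparison (hypothetical helper): a sentinel is always
    // "less than" anything else, whatever the scores are.
    public static boolean sentinelFirstLessThan(int docA, float scoreA, int docB, float scoreB) {
        if (docA == SENTINEL_DOC) {
            return true;
        }
        if (docB == SENTINEL_DOC) {
            return false;
        }
        return plainLessThan(docA, scoreA, docB, scoreB);
    }

    public static void main(String[] args) {
        float sentinelScore = Float.NEGATIVE_INFINITY;
        // A real hit (doc 5) whose score is NaN, compared against a sentinel.
        System.out.println(plainLessThan(SENTINEL_DOC, sentinelScore, 5, Float.NaN));
        System.out.println(plainLessThan(5, Float.NaN, SENTINEL_DOC, sentinelScore));
        System.out.println(sentinelFirstLessThan(SENTINEL_DOC, sentinelScore, 5, Float.NaN));
        System.out.println(sentinelFirstLessThan(5, Float.NaN, SENTINEL_DOC, sentinelScore));
    }
}
```

With the plain comparison neither entry is less than the other when a NaN score is involved, so the heap has no basis for keeping sentinels on top; the sentinel-first variant gives a definite answer in both directions.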
[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835683#action_12835683 ]
Robert Muir commented on LUCENE-2271:

I don't think we should do anything to fix NaN, such as using more expensive comparisons (Float.compareTo) and so on. As it is not a number, it's an invalid score. I think function queries should throw an exception instead of producing NaN; this problem is limited to function queries.

I think fixing scores of negative infinity makes more sense, as these are unpreventable (again, only a problem with function queries!) and at least negative infinity is actually a number.

I think "fixing" a top-N collector, or "fixing" anything that sorts NaN, is wrong. NaN doesn't have a properly defined sort order. NaN has an order hacked into Float.compareTo, but this is different. Sorting the primitive type makes no sense, and the documentation should continue to say that it doesn't work with TSDC.
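For readers unfamiliar with the distinction drawn in this comment, here is a minimal sketch of how the primitive float operators and Float.compare treat NaN differently. The class and method names are made up for the illustration; only the standard java.lang.Float API is used.

```java
public class NaNComparisonDemo {
    // Primitive operator: any comparison involving NaN is false.
    public static boolean primitiveLess(float a, float b) {
        return a < b;
    }

    // Float.compare imposes a total order in which NaN is greater than
    // every other value (including +inf) and equal to itself.
    public static boolean compareLess(float a, float b) {
        return Float.compare(a, b) < 0;
    }

    public static void main(String[] args) {
        float nan = Float.NaN;
        System.out.println(primitiveLess(1f, nan));   // false
        System.out.println(primitiveLess(nan, 1f));   // false
        System.out.println(nan == nan);               // false
        System.out.println(compareLess(1f, nan));     // true
        System.out.println(compareLess(Float.POSITIVE_INFINITY, nan)); // true
        System.out.println(Float.compare(nan, nan));  // 0
    }
}
```

Float.compareTo, the instance method mentioned above, uses the same ordering as the static Float.compare.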
[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
[ https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2271:
Priority: Minor (was: Major)
Issue Type: Improvement (was: Bug)

It's not a bug, as it's doc'ed to work this way.
{code}
 * NOTE: The values Float.Nan,
 * Float.NEGATIVE_INFINITY and Float.POSITIVE_INFINITY are
 * not valid scores. This collector will not properly
 * collect hits with such scores.
{code}
[jira] Created: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector
--
Key: LUCENE-2271
URL: https://issues.apache.org/jira/browse/LUCENE-2271
Project: Lucene - Java
Issue Type: Bug
Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Fix For: 2.9.2, 3.0.1, 3.1

This is a follow-up to LUCENE-2270, where a part of this problem was fixed (boost = 0 leading to NaN scores, which is also unintuitive), but in general, function queries in Solr can create these invalid scores easily. In previous versions of Lucene these scores were ordered correctly (except NaN, which mixes up results), but invalid document ids (like Integer.MAX_VALUE) were never returned.

The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ to work, this sentinel must be smaller than all possible values, which is not the case:

- A score of -inf is equal to the sentinel's score, so the document is treated as not competitive and is not inserted into the HQ. But the HQ is not yet full, so the sentinel values remain in the HQ and the result contains the Integer.MAX_VALUE docs. This problem is solvable (and only affects the Ordered collector) by changing the exit condition to:
{code}
if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) {
  // Since docs are returned in-order (i.e., increasing doc Id), a document
  // with equal score to pqTop.score cannot compete since HitQueue favors
  // documents with lower doc Ids. Therefore reject those docs too.
  return;
}
{code}
- The NaN case can be fixed in the same way, but then has another problem: all comparisons with NaN result in false (none of these is true): x < NaN, x > NaN, NaN == NaN. As a result, HQ's lessThan always returns false, leading to unexpected ordering in the PQ, and sometimes the sentinel values do not stay at the top of the queue. A later hit then overwrites the top of the queue but leaves the incorrect sentinels unchanged -> invalid results. This can be fixed in two ways in HQ:

Force all sentinels to the top:
{code}
protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
  if (hitA.doc == Integer.MAX_VALUE)
    return true;
  if (hitB.doc == Integer.MAX_VALUE)
    return false;
  if (hitA.score == hitB.score)
    return hitA.doc > hitB.doc;
  else
    return hitA.score < hitB.score;
}
{code}
or alternatively have a defined order for NaN (Float.compare sorts them after +inf):
{code}
protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
  if (hitA.score == hitB.score)
    return hitA.doc > hitB.doc;
  else
    return Float.compare(hitA.score, hitB.score) < 0;
}
{code}
The problem with both solutions is that we now have more comparisons per hit, and the use of sentinels is questionable. I would like to remove the sentinels and use the old pre-2.9 code for comparing, calling PQ.add() when a competitive hit arrives. The order of NaN would be unspecified.

To fix the order of NaN, it would be better to replace all score comparisons by Float.compare() [also in FieldComparator].

I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and solved.
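The -inf case from the description can be reproduced outside Lucene with a small simulation. This is only a sketch under stated assumptions: it uses java.util.PriorityQueue in place of Lucene's own queue, the Hit class, the collect method and its checkSentinel flag are hypothetical, and the collector loop is reduced to the exit condition quoted in the issue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SentinelExitConditionDemo {
    static final int SENTINEL_DOC = Integer.MAX_VALUE;

    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Min-heap order: lower score first; on equal scores the higher doc id
    // counts as smaller. NaN is not handled; the demo only uses -inf and
    // finite scores.
    static final Comparator<Hit> ORDER = (a, b) -> {
        if (a.score == b.score) return Integer.compare(b.doc, a.doc);
        return a.score < b.score ? -1 : 1;
    };

    // scores[i] is the score of doc i; docs arrive in increasing doc id order.
    // Returns the collected doc ids, best hit first.
    public static List<Integer> collect(float[] scores, int n, boolean checkSentinel) {
        PriorityQueue<Hit> pq = new PriorityQueue<>(n, ORDER);
        for (int i = 0; i < n; i++) {
            pq.add(new Hit(SENTINEL_DOC, Float.NEGATIVE_INFINITY)); // pre-fill
        }
        for (int doc = 0; doc < scores.length; doc++) {
            Hit top = pq.peek();
            boolean reject = scores[doc] <= top.score;
            if (checkSentinel) {
                // Changed exit condition: never reject while a sentinel is on top.
                reject = reject && top.doc != SENTINEL_DOC;
            }
            if (reject) continue;
            pq.poll();
            pq.add(new Hit(doc, scores[doc]));
        }
        List<Integer> best = new ArrayList<>();
        while (!pq.isEmpty()) best.add(0, pq.poll().doc);
        return best;
    }

    public static void main(String[] args) {
        float[] scores = {Float.NEGATIVE_INFINITY, 1.0f, Float.NEGATIVE_INFINITY};
        System.out.println(collect(scores, 3, false)); // sentinels remain in the result
        System.out.println(collect(scores, 3, true));  // only real doc ids
    }
}
```

With three docs and a queue of size three, the unchanged condition rejects both -inf hits and leaves two Integer.MAX_VALUE entries in the result, while the changed condition collects all three real documents.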