[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2010-02-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835961#action_12835961
 ] 

Mark Harwood commented on LUCENE-1486:
--

Double ugh. Applying the patch for the "non-default field" bug no longer works 
because the latest ComplexPhraseQueryParser source sitting in contrib is now in 
a different package from the QueryParser base class. This means that the 
subclass doesn't have the write access it needs to the package-protected 
"field" variable, which it uses to temporarily set the context of the parser 
when processing the inner contents of the phrase.

Fixing this would require either changing the package name of 
ComplexPhraseQueryParser or changing the visibility of "field" in the 
QueryParser base class to "protected".
Does anyone have strong feelings about which of these is more acceptable?
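
To make the trade-off concrete, here is a minimal sketch of the save-and-restore 
pattern the parser relies on. BaseParserSketch and PhraseParserSketch are 
hypothetical stand-ins, not the real QueryParser or ComplexPhraseQueryParser, 
and the sketch assumes the "protected" option is the one chosen.

```java
// Hypothetical stand-ins for illustration only; not the real Lucene classes.
class BaseParserSketch {
    // "protected" lets subclasses in other packages write to the field.
    // With package-private (default) visibility, a subclass living in a
    // different package would not compile when it assigns to it.
    protected String field;

    BaseParserSketch(String defaultField) {
        this.field = defaultField;
    }

    // Resolves a bare term against whatever field is current.
    String parseClause(String text) {
        return field + ":" + text;
    }
}

class PhraseParserSketch extends BaseParserSketch {
    PhraseParserSketch(String defaultField) {
        super(defaultField);
    }

    // Parses the inner contents of a phrase in the context of phraseField,
    // then restores the previous field even if parsing throws.
    String parsePhraseContents(String phraseField, String contents) {
        String saved = field;
        field = phraseField;
        try {
            StringBuilder sb = new StringBuilder();
            for (String term : contents.trim().split("\\s+")) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(parseClause(term));
            }
            return sb.toString();
        } finally {
            field = saved;
        }
    }
}
```

Moving the subclass into the base class's package would make the same code 
compile without touching the base class, at the cost of a contrib class living 
in a core package name.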

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.1
>
> Attachments: ComplexPhraseQueryParser.java, 
> junit_complex_phrase_qp_07_21_2009.patch, 
> junit_complex_phrase_qp_07_22_2009.patch, Lucene-1486 non default 
> field.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch, 
> LUCENE-1486.patch, LUCENE-1486.patch, TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-1513) fastss fuzzyquery

2010-02-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir closed LUCENE-1513.
---

Resolution: Not A Problem

For Lucene, LUCENE-2089 will always be faster than even FastSS, as our 
FuzzyQuery is really a top-N query, and we can exploit properties of the 
priority queue to make it even faster.

LUCENE-2089 also works without any auxiliary index or data structures, just 
solely on lucene's terms dict, so it works great for updates/NRT/whatever, no 
back compat problems.

I'm cancelling this issue as the alternative is superior in every aspect.

> fastss fuzzyquery
> -
>
> Key: LUCENE-1513
> URL: https://issues.apache.org/jira/browse/LUCENE-1513
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Robert Muir
>Priority: Minor
> Attachments: fastSSfuzzy.zip
>
>
> code for doing fuzzyqueries with fastssWC algorithm.
> FuzzyIndexer: given a lucene field, it enumerates all terms and creates an 
> auxiliary offline index for fuzzy queries.
> FastFuzzyQuery: similar to fuzzy query except it queries the auxiliary index 
> to retrieve a candidate list. this list is then verified with the Levenshtein 
> algorithm.
> sorry but the code is a bit messy... what I'm actually using is very 
> different from this so it's pretty much untested. but at least you can see 
> what's going on or fix it up.




Re: ComplexPhraseQuery problems with simple phrases

2010-02-19 Thread Mark Harwood
This is because phrases are expected to contain more than one clause, and 
ComplexPhraseQueryParser was expecting the base class to return a BooleanQuery, 
which is used to hold the elements in the phrase.

In this single-clause scenario I guess we could silently hide the error and 
return whatever single query clause was inappropriately found between the 
quotes. 
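
A rough sketch of that single-clause fallback follows. PcQuery, PcTerm, 
PcBoolean and PhraseClauses are hypothetical stand-ins invented for this 
illustration; the real parser works with Lucene's Query and BooleanQuery 
classes and may structure the check differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the real Lucene classes.
interface PcQuery {}

final class PcTerm implements PcQuery {
    final String text;

    PcTerm(String text) {
        this.text = text;
    }
}

final class PcBoolean implements PcQuery {
    final List<PcQuery> clauses;

    PcBoolean(List<? extends PcQuery> clauses) {
        this.clauses = new ArrayList<PcQuery>(clauses);
    }
}

final class PhraseClauses {
    // Returns the elements of the phrase. When the base parser hands back a
    // single clause instead of a boolean container, treat it as a
    // one-element phrase rather than failing on an unknown query type.
    static List<PcQuery> clausesOf(PcQuery parsed) {
        if (parsed instanceof PcBoolean) {
            return ((PcBoolean) parsed).clauses;
        }
        return Collections.singletonList(parsed);
    }
}
```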


On 19 Feb 2010, at 19:53, David Kaelbling wrote:

> Hi,
> 
> ComplexPhraseQueryParser doesn't appear to handle some simple wildcard
> phrases correctly.  In TestComplexPhraseQuery.testComplexPhrases() on
> trunk I tried these two tests:
> 
>   checkMatches("\"j*n sm*h\"", "1,2");
>   checkMatches("\"j*n\"", "1,2,3,4");
> 
> The first check succeeds.  The second throws an IllegalArgumentException
> trying to rewrite the query, complaining that WildcardQuery is an
> unknown query type.  If this is bad syntax I would have expected the
> first query to have failed too.
> 
> Does anyone have a fix?
> 
>Thanks,
>David
> 
> -- 
> David Kaelbling
> Senior Software Engineer
> Black Duck Software, Inc.
> 
> dkaelbl...@blackducksoftware.com
> T +1.781.810.2041
> F +1.781.891.5145
> 
> http://www.blackducksoftware.com
> 



[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2089:


Description: 
we can optimize fuzzyquery by using AutomatonTermsEnum. The idea is to speed up 
the core FuzzyQuery in similar fashion to Wildcard and Regex speedups, 
maintaining all backwards compatibility.

The advantages are:
* we can seek to terms that are useful, instead of brute-forcing the entire 
terms dict
* we can determine matches faster, as true/false from a DFA is array lookup, 
don't even need to run levenshtein.

We build Levenshtein DFAs in linear time with respect to the length of the 
word: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

To implement support for 'prefix' length, we simply concatenate two DFAs, which 
doesn't require us to do NFA->DFA conversion, as the prefix portion is a 
singleton. The concatenation is also constant time with respect to the size of 
the fuzzy DFA, since it only needs to examine its start state.

With this algorithm, parametric tables are precomputed so that DFAs can be 
constructed very quickly.
If the required number of edits is too large (we don't have a table for it), we 
use "dumb mode" at first (no seeking, no DFA, just brute force like now).

As the priority queue fills up during enumeration, the similarity score 
required to be a competitive term increases, so, the enum gets faster and 
faster as this happens. This is because terms in core FuzzyQuery are sorted by 
boost value, then by term (in lexicographic order).

For a large term dictionary with a low minimal similarity, you will fill the pq 
very quickly since you will match many terms. 
This not only provides a mechanism to switch to more efficient DFAs (edit 
distance of 2 -> edit distance of 1 -> edit distance of 0) during enumeration, 
but also to switch from "dumb mode" to "smart mode".
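
The effect of the filling priority queue can be sketched as follows. This is a 
simplified illustration, not the patch: it ranks candidates purely by edit 
distance instead of by boost, it computes distances by brute force rather than 
with a DFA, and the class and method names are made up for the example.

```java
import java.util.Collections;
import java.util.PriorityQueue;

// Simplified illustration only; names and ranking rule are assumptions.
final class TopNFuzzySketch {

    // Classic dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Enumerates terms while keeping the best queueSize distances. Once the
    // queue is full, its worst entry caps the edit distance worth searching
    // for, so the enumeration can switch to a cheaper automaton.
    static int maxUsefulEdits(String query, String[] terms, int queueSize, int startEdits) {
        // Max-heap: the head is the worst (largest) distance currently kept.
        PriorityQueue<Integer> pq = new PriorityQueue<Integer>(Collections.reverseOrder());
        int maxEdits = startEdits;
        for (String term : terms) {
            int d = editDistance(query, term);
            if (d > maxEdits) {
                continue; // a seeking enum would never visit this term
            }
            pq.add(d);
            if (pq.size() > queueSize) {
                pq.poll(); // evict the worst entry
            }
            if (pq.size() == queueSize) {
                maxEdits = Math.min(maxEdits, pq.peek());
            }
        }
        return maxEdits;
    }

    public static void main(String[] args) {
        String[] terms = {"lucien", "banana", "lucent", "lucene"};
        System.out.println(maxUsefulEdits("lucene", terms, 2, 2)); // prints 1
        System.out.println(maxUsefulEdits("lucene", terms, 1, 2)); // prints 0
    }
}
```

With a queue of size 2 the search starts at two edits and ends needing only 
one; with a queue of size 1 it ends needing only exact matches.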

With this design, we can add more DFAs at any time by adding additional tables. 
The tradeoff is the tables get rather large, so for very high K, we would start 
to increase the size of Lucene's jar file. The idea is that we don't have to 
include large tables for very high K, because we can use the 'competitive 
boost' attribute of the priority queue.

For more information, see http://en.wikipedia.org/wiki/Levenshtein_automaton

  was:
Mark brought this up on LUCENE-1606 (i will assign this to him, I know he is 
itching to write that nasty algorithm)

we can optimize fuzzyquery by using AutomatonTermsEnum, here is my idea
* up front, calculate the maximum required K edits needed to match the users 
supplied float threshold.
* for at least small common E up to some max K (1,2,3, etc) we should create a 
DFA for each E. 

if the required E is above our supported max, we use "dumb mode" at first (no 
seeking, no DFA, just brute force like now).
As the pq fills, we swap progressively lower DFAs into the enum, based upon the 
lowest score in the pq.
This should work well on avg, at high E, you will typically fill the pq very 
quickly since you will match many terms. 
This not only provides a mechanism to switch to more efficient DFAs during 
enumeration, but also to switch from "dumb mode" to "smart mode".

i modified my wildcard benchmark to generate random fuzzy queries.
* Pattern: 7N stands for NNNNNNN, etc.
* AvgMS_DFA: this is the time spent creating the automaton (constructor)

||Pattern||Iter||AvgHits||AvgMS(old)||AvgMS (new,total)||AvgMS_DFA||
|7N|10|64.0|4155.9|38.6|20.3|
|14N|10|0.0|2511.6|46.0|37.9|   
|28N|10|0.0|2506.3|93.0|86.6|
|56N|10|0.0|2524.5|304.4|298.5|

as you can see, this prototype is no good yet, because it creates the DFA in a 
slow way. right now it creates an NFA, and all this wasted time is in NFA->DFA 
conversion.
So, for a very long string, it just gets worse and worse. This has nothing to 
do with lucene, and here you can see, the TermEnum is fast (AvgMS - AvgMS_DFA), 
there is no problem there.

instead we should just build a DFA to begin with, maybe with this paper: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
we can precompute the tables with that algorithm up to some reasonable K, and 
then I think we are ok.

the paper references using http://portal.acm.org/citation.cfm?id=135907 for 
linear minimization, if someone wants to implement this they should not worry 
about minimization.
in fact, we need to at some point determine if AutomatonQuery should even 
minimize FSM's at all, or if it is simply enough for them to be deterministic 
with no transitions to dead states. (The only code that actually assumes 
minimal DFA is the "Dumb" vs "Smart" heuristic and this can be rewritten as a 
summation easily). we need to benchmark really complex DFAs (i.e. write a regex 
benchmark) to figure out if minimization is even helping right now.




edit the description, to hopefully be simpler.

> explore using automaton for fuzzyquery
> --
>
>

ComplexPhraseQuery problems with simple phrases

2010-02-19 Thread David Kaelbling
Hi,

ComplexPhraseQueryParser doesn't appear to handle some simple wildcard
phrases correctly.  In TestComplexPhraseQuery.testComplexPhrases() on
trunk I tried these two tests:

checkMatches("\"j*n sm*h\"", "1,2");
checkMatches("\"j*n\"", "1,2,3,4");

The first check succeeds.  The second throws an IllegalArgumentException
trying to rewrite the query, complaining that WildcardQuery is an
unknown query type.  If this is bad syntax I would have expected the
first query to have failed too.

Does anyone have a fix?

Thanks,
David

-- 
David Kaelbling
Senior Software Engineer
Black Duck Software, Inc.

dkaelbl...@blackducksoftware.com
T +1.781.810.2041
F +1.781.891.5145

http://www.blackducksoftware.com




Question on highlighting of nested SpanQuery instances

2010-02-19 Thread Goddard, Michael J.
Hello,

I initially posted a version of this question to java-user, but think it's more 
of a java-dev question.  I haven't yet been able to resolve why I'm seeing 
spurious highlighting in nested SpanQuery instances.  To illustrate this, I 
added the code below to the HighlighterTest class in lucene_2_9_1:

/*
 * Ref: http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
 */
public void testHighlightingNestedSpans2() throws Exception {

  String theText = "The Lucene was made by Doug Cutting and Lucene great Hadoop was"; // Problem
  //String theText = "The Lucene was made by Doug Cutting and the great Hadoop was"; // Works okay

  String fieldName = "SOME_FIELD_NAME";

  SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
      new SpanTermQuery(new Term(fieldName, "lucene")),
      new SpanTermQuery(new Term(fieldName, "doug")) }, 5, true);

  Query query = new SpanNearQuery(new SpanQuery[] { spanNear,
      new SpanTermQuery(new Term(fieldName, "hadoop")) }, 4, true);

  String expected = "The Lucene was made by Doug Cutting and Lucene great Hadoop was";
  //String expected = "The Lucene was made by Doug Cutting and the great Hadoop was";

  String observed = highlightField(query, fieldName, theText);
  System.out.println("Expected: \"" + expected + "\n" + "Observed: \"" + observed);

  assertEquals("Why is that second instance of the term \"Lucene\" highlighted?", expected, observed);
}

Is this an issue that's arisen before?  I've been reading through the source to 
QueryScorer, WeightedSpanTerm, WeightedSpanTermExtractor, Spans, and 
NearSpansOrdered, but haven't found the solution yet.  Initially, I thought 
that the extractWeightedSpanTerms method in WeightedSpanTermExtractor should be 
called on each clause of a SpanNearQuery or SpanOrQuery, but that didn't get me 
too far.

Any suggestions are welcome.

Thanks.

  Mike


[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2089:


Attachment: ContrivedFuzzyBenchmark.java

attached is a 'contrived fuzzy benchmark' derived from my wildcard benchmark 
(randomly generated 7-digit terms)

for the benchmark, i ran results for various combinations of minimum 
similarity, prefix length, and pq size for the test index of 10million terms.

Avg MS old is the current flex branch. Avg MS new is with the patch.

Notes:
* only the table for distance n=1 is implemented so far! 
* n=1 is fast.
* Use of the PQ boost attribute speeds up fuzzy queries for higher n slightly, 
too.
* adding a table for n=2 should be extremely helpful, and maybe even enough for 
the default PQ size of 1024 (BQ.maxClauseCount), to make all fuzzy queries 
reasonable.

{{Minimum Sim = 0.73f (edit distance of 1)}} 
||Prefix Length||PQ Size||Avg MS (old)||Avg MS (new)||
|0|1024|3286.0|10.6|
|0|64|3320.4|7.2|
|1|1024|316.8|5.3|
|1|64|314.3|5.3|
|2|1024|31.8|4.0|
|2|64|31.9|4.2|

{{Minimum Sim = 0.58f (edit distance of 2)}}
||Prefix Length||PQ Size||Avg MS (old)||Avg MS (new)||
|0|1024|4223.3|1341.6|
|0|64|4199.7|501.9|
|1|1024|430.1|304.1|
|1|64|392.8|44.7|
|2|1024|82.5|70.0|
|2|64|38.4|7.7|


{{Minimum Sim = 0.43f (edit distance of 3)}}
||Prefix Length||PQ Size||Avg MS (old)||Avg MS (new)||
|0|1024|5299.9|2617.0|
|0|64|5231.8|476.4|
|1|1024|522.9|318.9|
|1|64|480.9|73.9|
|2|1024|89.0|83.9|
|2|64|46.3|8.6|


{{Minimum Sim = 0.29f (edit distance of 4)}}
||Prefix Length||PQ Size||Avg MS (old)||Avg MS (new)||
|0|1024|6258.1|3114.0|
|0|64|6247.6|684.6|
|1|1024|609.9|380.0|
|1|64|567.1|69.3|
|2|1024|98.6|93.8|
|2|64|55.6|11.4|
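
The edit distances in the table captions can be reproduced with a small 
calculation. This is a sketch under an assumption: that similarity is defined 
as 1 - (edits / term length), with 7-character terms and a prefix length of 0. 
The class and method names are made up for the example.

```java
// Illustration only; assumes similarity = 1 - edits / termLength.
final class SimilarityToEdits {

    // Largest number of edits whose similarity is still >= minSimilarity.
    static int maxEdits(double minSimilarity, int termLength) {
        return (int) Math.floor((1.0 - minSimilarity) * termLength);
    }

    public static void main(String[] args) {
        System.out.println(maxEdits(0.73, 7)); // prints 1
        System.out.println(maxEdits(0.58, 7)); // prints 2
        System.out.println(maxEdits(0.43, 7)); // prints 3
        System.out.println(maxEdits(0.29, 7)); // prints 4
    }
}
```

Under that assumption, a non-zero prefix would shorten the part of the term 
that edits can touch without changing the mapping shown here.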


> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: ContrivedFuzzyBenchmark.java, LUCENE-2089.patch, 
> LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, 
> LUCENE-2089_concat.patch, Moman-0.2.1.tar.gz, TestFuzzy.java
>
>

[jira] Commented: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)

2010-02-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835903#action_12835903
 ] 

Uwe Schindler commented on LUCENE-2190:
---

During refactoring I found out:

CustomScoreQuery is more broken than that: the rewrite() method is wrong. For 
now it's not really a problem, but when we change to per-segment rewrite (as 
Mike plans) it will break. It's even broken today if you rewrite against one 
IndexReader and want to reuse the query later on another IndexReader. It 
rewrites all its subqueries and returns itself, which is wrong: if one of the 
subqueries was rewritten, it must return a new cloned instance (like 
BooleanQuery does). Also, hashCode and equals ignore strict.

Will provide a patch later. Now everything seems to work correctly.
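
The rewrite contract described above can be sketched as follows. RwQuery, 
RwTerm, RwPrefix and RwCustomScore are hypothetical stand-ins invented for this 
illustration, not the real Lucene classes, and the eventual patch may differ.

```java
import java.util.Objects;

// Hypothetical stand-ins for illustration only; not the real Lucene classes.
abstract class RwQuery {
    // Primitive queries rewrite to themselves.
    RwQuery rewrite() {
        return this;
    }
}

final class RwTerm extends RwQuery {
    final String text;

    RwTerm(String text) {
        this.text = text;
    }
}

final class RwPrefix extends RwQuery {
    final String prefix;

    RwPrefix(String prefix) {
        this.prefix = prefix;
    }

    // Stands in for a query that expands against a particular reader.
    @Override
    RwQuery rewrite() {
        return new RwTerm(prefix + "*expanded");
    }
}

final class RwCustomScore extends RwQuery {
    final RwQuery subQuery;
    final boolean strict;

    RwCustomScore(RwQuery subQuery, boolean strict) {
        this.subQuery = subQuery;
        this.strict = strict;
    }

    // Never mutates this instance: returns this only when nothing changed,
    // otherwise a new instance, so the original can be rewritten again
    // later against a different reader.
    @Override
    RwQuery rewrite() {
        RwQuery rewritten = subQuery.rewrite();
        if (rewritten == subQuery) {
            return this;
        }
        return new RwCustomScore(rewritten, strict);
    }

    // equals and hashCode take strict into account.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof RwCustomScore)) {
            return false;
        }
        RwCustomScore other = (RwCustomScore) o;
        return strict == other.strict && subQuery.equals(other.subQuery);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subQuery, strict);
    }
}
```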

> CustomScoreQuery (function query) is broken (due to per-segment searching)
> --
>
> Key: LUCENE-2190
> URL: https://issues.apache.org/jira/browse/LUCENE-2190
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.9, 2.9.1, 3.0, 3.0.1, 3.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2190.patch
>
>
> Spinoff from here:
>   http://lucene.markmail.org/message/psw2m3adzibaixbq
> With the cutover to per-segment searching, CustomScoreQuery is not really 
> usable anymore, because the per-doc custom scoring method (customScore) 
> receives a per-segment docID, yet there is no way to figure out which segment 
> you are currently searching.
> I think to fix this we must also notify the subclass whenever a new segment 
> is switched to.  I think if we copy Collector.setNextReader, that would be 
> sufficient.  It would by default do nothing in CustomScoreQuery, but a 
> subclass could override.




[jira] Commented: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)

2010-02-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835837#action_12835837
 ] 

Uwe Schindler commented on LUCENE-2190:
---

We can preserve backwards compatibility if the default impl of the new method 
(the one taking the reader) simply delegates to the deprecated old customScore 
function.
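
A minimal sketch of that delegation follows. ReaderStub and the four-argument 
customScore signature are assumptions made for the illustration; only the 
three-argument customScore(doc, subQueryScore, valSrcScore) shape is taken from 
the existing class, and the real patch may look different.

```java
// Hypothetical stand-ins for illustration only; not the real Lucene classes.
final class ReaderStub {
    final int docBase;

    ReaderStub(int docBase) {
        this.docBase = docBase;
    }
}

class CustomScoreSketch {
    /** Old per-document hook, kept so existing subclasses keep working. */
    @Deprecated
    public float customScore(int doc, float subQueryScore, float valSrcScore) {
        return subQueryScore * valSrcScore;
    }

    /** New hook (assumed signature): by default it ignores the reader. */
    public float customScore(ReaderStub reader, int doc, float subQueryScore, float valSrcScore) {
        return customScore(doc, subQueryScore, valSrcScore);
    }
}

// Existing subclass that only knows the old method: still honoured.
class LegacyScorer extends CustomScoreSketch {
    @Override
    @Deprecated
    public float customScore(int doc, float subQueryScore, float valSrcScore) {
        return subQueryScore + valSrcScore;
    }
}

// New subclass that needs to know which segment it is scoring.
class SegmentAwareScorer extends CustomScoreSketch {
    @Override
    public float customScore(ReaderStub reader, int doc, float subQueryScore, float valSrcScore) {
        return reader.docBase + doc;
    }
}
```

Because the reader arrives as a parameter, the query object itself holds no 
per-segment state.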

I will provide a patch tomorrow.




[jira] Reopened: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)

2010-02-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reopened LUCENE-2190:
---


The fix is invalid:
Adding setNextReader to CustomScoreQuery makes the Query itself stateful. This 
breaks when it is used together with e.g. ParallelMultiSearcher.
As the package is experimental, I see no problem in changing the method 
signature of customScore to pass in the affected IndexReader (CustomScorer can 
do this).




[jira] Updated: (LUCENE-2089) explore using automaton for fuzzyquery

2010-02-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2089:


Attachment: LUCENE-2089.patch

* implement the pq algorithm: when the value at the bottom of the pq changes 
(BoostAttribute maxCompetitiveBoost), the enum adjusts itself by decreasing the 
edit distance and swapping in more efficient code.
* remove the wasted prefix checks in automatonfuzzytermsenum, as Uwe noticed, 
because they are not necessary: the prefix is handled as part of the DFA itself 
(it will never seek to such terms).

here is a patch, which is complete... needs code beautification/tests/docs but 
it has all functionality.

we should also add a table for n=2, maybe n=3 also, but these can be separate 
issues.


> explore using automaton for fuzzyquery
> --
>
> Key: LUCENE-2089
> URL: https://issues.apache.org/jira/browse/LUCENE-2089
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Robert Muir
>Assignee: Mark Miller
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089.patch, 
> LUCENE-2089.patch, LUCENE-2089.patch, LUCENE-2089_concat.patch, 
> Moman-0.2.1.tar.gz, TestFuzzy.java
>
>



[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2271:
--

Attachment: LUCENE-2271.patch

Improved test that also checks for increasing doc ids when scores are identical

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, LUCENE-2271.patch, LUCENE-2271.patch, 
> TSDC.patch
>
>
> This is a followup to LUCENE-2270, where a part of this problem was fixed 
> (boost = 0 leading to NaN scores, which is also unintuitive), but in 
> general, function queries in Solr can create these invalid scores easily. In 
> previous versions of Lucene these scores were ordered correctly (except NaN, 
> which mixes up results), and invalid document ids (like Integer.MAX_VALUE) 
> were never returned.
> The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel 
> ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ 
> to work, this sentinel must be smaller than all possible values, which is not 
> the case:
> - -inf is equal and the document is not inserted into the HQ, as not 
> competitive, but the HQ is not yet full, so the sentinel values stay in the 
> HQ and the result is the Integer.MAX_VALUE docs. This problem is solvable (and 
> only affects the Ordered collector) by changing the exit condition to:
> {code}
> if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) {
>   // Since docs are returned in-order (i.e., increasing doc Id), a document
>   // with equal score to pqTop.score cannot compete since HitQueue favors
>   // documents with lower doc Ids. Therefore reject those docs too.
>   return;
> }
> {code}
> - The NaN case can be fixed in the same way, but then has another problem: 
> all comparisons with NaN result in false (none of these is true): 
> x < NaN, x > NaN, NaN == NaN. This means that HQ's lessThan always returns 
> false, leading to unexpected ordering in the PQ, and sometimes the sentinel 
> values do not stay at the top of the queue. A later hit then overrides the 
> top of the queue but leaves the incorrect sentinels unchanged -> invalid 
> results. This can be fixed in two ways in HQ:
> Force all sentinels to the top:
> {code}
> protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
>   if (hitA.doc == Integer.MAX_VALUE)
>     return true;
>   if (hitB.doc == Integer.MAX_VALUE)
>     return false;
>   if (hitA.score == hitB.score)
>     return hitA.doc > hitB.doc;
>   else
>     return hitA.score < hitB.score;
> }
> {code}
> or alternatively have a defined order for NaN (Float.compare sorts them after 
> +inf):
> {code}
> protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
>   if (hitA.score == hitB.score)
>     return hitA.doc > hitB.doc;
>   else
>     return Float.compare(hitA.score, hitB.score) < 0;
> }
> {code}
> The problem with both solutions is that we now have more comparisons per hit 
> and the use of sentinels is questionable. I would like to remove the 
> sentinels and use the old pre-2.9 code for comparing, calling PQ.add() when 
> a competitive hit arrives. The order of NaN would be unspecified.
> To fix the order of NaN, it would be better to replace all score comparisons 
> with Float.compare() [also in FieldComparator].
> I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and 
> solved.
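
The two effects described in the issue can be reproduced in isolation. The following is only a minimal sketch, not the Lucene code: the Hit class is a hypothetical stand-in for ScoreDoc, and the method names (rejectedOriginal, rejectedPatched, lessThanPlain, lessThanTotal) are made up for illustration. It shows that a -inf score is rejected against a -inf sentinel unless the sentinel's doc id is checked, and that plain operators give no order for NaN while Float.compare sorts NaN after every other value.

```java
public class NanOrderingDemo {
    // Hypothetical stand-in for ScoreDoc; assumption for illustration only.
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Exit condition without the sentinel check: a -inf score ties with the sentinel.
    static boolean rejectedOriginal(float score, Hit pqTop) {
        return score <= pqTop.score;
    }

    // Exit condition from the issue: never reject while the top is still a sentinel.
    static boolean rejectedPatched(float score, Hit pqTop) {
        return score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE;
    }

    // Ordering with plain operators: every comparison involving NaN is false.
    static boolean lessThanPlain(Hit a, Hit b) {
        if (a.score == b.score) return a.doc > b.doc;
        return a.score < b.score;
    }

    // Ordering via Float.compare, which places NaN after +inf.
    static boolean lessThanTotal(Hit a, Hit b) {
        if (a.score == b.score) return a.doc > b.doc;
        return Float.compare(a.score, b.score) < 0;
    }

    public static void main(String[] args) {
        Hit sentinel = new Hit(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY);
        System.out.println(rejectedOriginal(Float.NEGATIVE_INFINITY, sentinel));
        System.out.println(rejectedPatched(Float.NEGATIVE_INFINITY, sentinel));

        Hit nan = new Hit(1, Float.NaN);
        Hit one = new Hit(2, 1.0f);
        System.out.println(lessThanPlain(nan, one) + " " + lessThanPlain(one, nan));
        System.out.println(lessThanTotal(nan, one) + " " + lessThanTotal(one, nan));
    }
}
```

With plain operators neither hit is "less than" the other once NaN is involved, which is why the heap invariant breaks; with Float.compare the order is at least well defined.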

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2272) PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'

2010-02-19 Thread Peter Keegan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Keegan updated LUCENE-2272:
-

Attachment: payloadfunctin-patch.txt

This patch adds the 'explain' method to the 'PayloadFunction' interface, where 
the Scorer can call it. Added unit tests for 'explain' and for 
{Min,Max}PayloadFunction.
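
The idea of letting the function explain itself, instead of the scorer assuming an average, might look roughly like the sketch below. This is an illustration under assumptions and not the attached patch: PayloadFn, AverageFn, MaxFn and explainFromScorer are hypothetical names, and the real PayloadFunction signatures in Lucene differ.

```java
public class ExplainDelegationDemo {
    // Hypothetical payload function; an assumption, not Lucene's PayloadFunction.
    interface PayloadFn {
        float score(float[] payloads);
        String explain(float[] payloads);
    }

    static final class AverageFn implements PayloadFn {
        public float score(float[] payloads) {
            if (payloads.length == 0) return 0f;
            float sum = 0f;
            for (float p : payloads) sum += p;
            return sum / payloads.length;
        }
        public String explain(float[] payloads) {
            return "AverageFn = " + score(payloads);
        }
    }

    static final class MaxFn implements PayloadFn {
        public float score(float[] payloads) {
            if (payloads.length == 0) return 0f;
            float max = payloads[0];
            for (float p : payloads) max = Math.max(max, p);
            return max;
        }
        public String explain(float[] payloads) {
            return "MaxFn = " + score(payloads);
        }
    }

    // The scorer no longer hardwires "average"; it asks whichever function is in use.
    static String explainFromScorer(PayloadFn fn, float[] payloads) {
        return fn.explain(payloads);
    }

    public static void main(String[] args) {
        float[] payloads = {1f, 3f};
        System.out.println(explainFromScorer(new AverageFn(), payloads));
        System.out.println(explainFromScorer(new MaxFn(), payloads));
    }
}
```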

> PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
> ---
>
> Key: LUCENE-2272
> URL: https://issues.apache.org/jira/browse/LUCENE-2272
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Reporter: Peter Keegan
> Attachments: payloadfunctin-patch.txt
>
>
> The 'explain' method in PayloadNearSpanScorer assumes the 
> AveragePayloadFunction was used. This patch adds the 'explain' method to the 
> 'PayloadFunction' interface, where the Scorer can call it. Added unit tests 
> for 'explain' and for {Min,Max}PayloadFunction.




[jira] Created: (LUCENE-2272) PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'

2010-02-19 Thread Peter Keegan (JIRA)
PayloadNearQuery has hardwired explanation for 'AveragePayloadFunction'
---

 Key: LUCENE-2272
 URL: https://issues.apache.org/jira/browse/LUCENE-2272
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Reporter: Peter Keegan
 Attachments: payloadfunctin-patch.txt

The 'explain' method in PayloadNearSpanScorer assumes the 
AveragePayloadFunction was used. This patch adds the 'explain' method to the 
'PayloadFunction' interface, where the Scorer can call it. Added unit tests for 
'explain' and for {Min,Max}PayloadFunction.




[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835763#action_12835763
 ] 

Uwe Schindler commented on LUCENE-2271:
---

The cost to handle NaN is the modified lessThan() in HitQueue.

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
>
>




[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835762#action_12835762
 ] 

Yonik Seeley commented on LUCENE-2271:
--

bq. A design bug that function queries score docs with an invalid score (NaN) 
instead of throwing an exception?

No, a design bug that -Inf scores were disallowed, especially since they were 
handled just fine in the past.

NaN is different - it's more of a judgement call depending on the cost to 
handle it.

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
>
>




[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835758#action_12835758
 ] 

Robert Muir commented on LUCENE-2271:
-

bq. OK, so it was a design bug too.

A design bug that function queries score docs with an invalid score (NaN) 
instead of throwing an exception?

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
>
>




[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835750#action_12835750
 ] 

Yonik Seeley commented on LUCENE-2271:
--

bq. It's not a bug, as it's documented to work this way.

OK, so it was a design bug too.

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
>
>




[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2271:
--

Attachment: LUCENE-2271.patch

Patch with test cases for trunk; it should also work on the branches (after 
removing @Override). Without the fixes in HitQueue or TSDC the tests fail.

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, LUCENE-2271.patch, TSDC.patch
>
>




[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835717#action_12835717
 ] 

Robert Muir commented on LUCENE-2271:
-

bq. The cost of the additional checks in HitQueue.lessThan is negligible, as 
they only occur when a competitive hit is really inserted into the queue.

This should be benchmarked for MultiSearcher and ParallelMultiSearcher, too, as 
they also use HitQueue.

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, TSDC.patch
>
>




[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2271:
--

Attachment: (was: LUCENE-2271.patch)

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, TSDC.patch
>
>




[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2271:


Attachment: TSDC.patch

Attached is a patch, written by Uwe. As far as a "bugfix" goes, I prefer this
patch, as I think the more complicated, performance-intrusive NaN fixes should
be something we do in 3.1.

E.g., "fixing" NaN to work will likely slow down people getting large numbers
of results, and I don't think we should do that in bugfix releases.

But in 3.1 we could change it, include some large-results-oriented collectors
for these people, and the whole thing would make sense.


> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, TSDC.patch




[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2271:
--

Attachment: LUCENE-2271.patch

Sorry, I reverted the removal of a comment.

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch, TSDC.patch




[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2271:
--

Attachment: LUCENE-2271.patch

This is a patch that supports NaN and -inf.

The cost of the additional checks in HitQueue.lessThan is negligible, as they
only occur when a competitive hit is really inserted into the queue. The check
forces all sentinels to the top of the queue, regardless of what their score is
(because NaN != NaN always holds).
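
To illustrate why the sentinel check matters, here is a minimal standalone
sketch (it is not the patch itself). Hit, oldLessThan and
sentinelFirstLessThan are hypothetical names chosen for this example; only the
comparison logic follows the lessThan variants quoted in the issue description.

```java
// Illustrative sketch only; these are not Lucene classes.
final class SentinelFirstDemo {
    // Doc id used for sentinel entries, as described in the issue.
    static final int SENTINEL_DOC = Integer.MAX_VALUE;

    // Hypothetical stand-in for a scored hit.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Comparison without a sentinel check: any NaN score makes both branches false.
    static boolean oldLessThan(Hit a, Hit b) {
        if (a.score == b.score) {
            return a.doc > b.doc;
        }
        return a.score < b.score;
    }

    // Sentinels are decided by doc id first, so their score never matters.
    static boolean sentinelFirstLessThan(Hit a, Hit b) {
        if (a.doc == SENTINEL_DOC) {
            return true;
        }
        if (b.doc == SENTINEL_DOC) {
            return false;
        }
        if (a.score == b.score) {
            return a.doc > b.doc;
        }
        return a.score < b.score;
    }

    public static void main(String[] args) {
        Hit sentinel = new Hit(SENTINEL_DOC, Float.NEGATIVE_INFINITY);
        Hit nanHit = new Hit(3, Float.NaN);

        // Without the check, neither element is "less" than the other.
        System.out.println(oldLessThan(sentinel, nanHit));           // false
        System.out.println(oldLessThan(nanHit, sentinel));           // false

        // With the check, the sentinel is always the smaller element.
        System.out.println(sentinelFirstLessThan(sentinel, nanHit)); // true
        System.out.println(sentinelFirstLessThan(nanHit, sentinel)); // false
    }
}
```

The two extra doc id comparisons only run inside lessThan, i.e. when the queue
is actually reordered, which is the basis for the cost argument above.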

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1
>
> Attachments: LUCENE-2271.patch




[jira] Commented: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835683#action_12835683
 ] 

Robert Muir commented on LUCENE-2271:
-

I don't think we should do anything to fix NaN, such as using more expensive
comparisons (Float.compareTo) and so on. As it is not a number, it's an invalid
score.

I think function queries should throw an exception instead of producing NaN;
this problem is limited to function queries.

I think fixing scores of negative infinity makes more sense, as these are
unpreventable (again, only a problem with function queries!) and at least
negative infinity is actually a number.

I think "fixing" a top-N collector, or "fixing" anything that sorts NaN, is
wrong. NaN doesn't have a properly defined sort order. NaN has an order hacked
into Float.compareTo, but this is different. Sorting the primitive type makes
no sense, and the documentation should keep stating that it doesn't work with
TSDC.

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1




[jira] Updated: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2271:


  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

It's not a bug, as it's doc'ed to work this way.

{code}
 * NOTE: The values Float.NaN,
 * Float.NEGATIVE_INFINITY and Float.POSITIVE_INFINITY are
 * not valid scores.  This collector will not properly
 * collect hits with such scores.
{code}

> Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
> results with TopScoreDocCollector
> --
>
> Key: LUCENE-2271
> URL: https://issues.apache.org/jira/browse/LUCENE-2271
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.9
>Reporter: Uwe Schindler
>Priority: Minor
> Fix For: 2.9.2, 3.0.1, 3.1




[jira] Created: (LUCENE-2271) Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

2010-02-19 Thread Uwe Schindler (JIRA)
Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect 
results with TopScoreDocCollector
--

 Key: LUCENE-2271
 URL: https://issues.apache.org/jira/browse/LUCENE-2271
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
 Fix For: 2.9.2, 3.0.1, 3.1


This is a follow-up to LUCENE-2270, where a part of this problem was fixed
(boost = 0 leading to NaN scores, which is also unintuitive), but in general,
function queries in Solr can create these invalid scores easily. In previous
versions of Lucene these scores were ordered correctly (except NaN, which mixes
up results), and invalid document ids (like Integer.MAX_VALUE) were never
returned.

The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel
ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ to
work, this sentinel must be smaller than all possible values, which is not the
case:
- A score of -inf is equal to the sentinel's score, so the document is treated
as not competitive and is not inserted into the HQ. But the HQ is not yet full,
so the sentinel values stay in the HQ and the result contains docs with id
Integer.MAX_VALUE. This problem is solvable (and only affects the Ordered
collector) by changing the exit condition to:
{code}
if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) {
  // Since docs are returned in-order (i.e., increasing doc Id), a document
  // with equal score to pqTop.score cannot compete since HitQueue favors
  // documents with lower doc Ids. Therefore reject those docs too.
  return;
}
{code}

- The NaN case can be fixed in the same way, but then has another problem: all
comparisons with NaN evaluate to false (none of these is true): x < NaN,
x > NaN, NaN == NaN. As a result, HQ's lessThan always returns false when a NaN
score is involved, which leads to unexpected ordering in the PQ, and sometimes
the sentinel values do not stay at the top of the queue. A later hit then
overwrites the top of the queue but leaves the misplaced sentinels unchanged,
which yields invalid results. This can be fixed in two ways in HQ:
Force all sentinels to the top:
{code}
protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
  if (hitA.doc == Integer.MAX_VALUE)
    return true;
  if (hitB.doc == Integer.MAX_VALUE)
    return false;
  if (hitA.score == hitB.score)
    return hitA.doc > hitB.doc;
  else
    return hitA.score < hitB.score;
}
{code}
or alternatively have a defined order for NaN (Float.compare sorts them after 
+inf):
{code}
protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
  if (hitA.score == hitB.score)
    return hitA.doc > hitB.doc;
  else
    return Float.compare(hitA.score, hitB.score) < 0;
}
{code}
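
The -inf case from the first bullet can be reproduced with a small standalone
simulation. This is only a sketch under stated assumptions: it uses
java.util.PriorityQueue instead of Lucene's PriorityQueue, and
SentinelExitDemo, Hit, ORDER and collect are hypothetical names. It pre-fills
the queue with sentinels and then applies either the original or the changed
exit condition.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch only; not Lucene's TopScoreDocCollector.
final class SentinelExitDemo {
    static final int SENTINEL_DOC = Integer.MAX_VALUE;

    // Hypothetical stand-in for a scored hit.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Least element first: lower score, then higher doc id.
    // Only valid for non-NaN scores, which is all this sketch uses.
    static final Comparator<Hit> ORDER = (a, b) -> {
        if (a.score == b.score) {
            return Integer.compare(b.doc, a.doc);
        }
        return a.score < b.score ? -1 : 1;
    };

    // Collects docs 0..scores.length-1 in order and returns the doc ids left in the queue.
    static List<Integer> collect(float[] scores, int topN, boolean checkSentinel) {
        PriorityQueue<Hit> pq = new PriorityQueue<>(topN, ORDER);
        for (int i = 0; i < topN; i++) {
            pq.add(new Hit(SENTINEL_DOC, Float.NEGATIVE_INFINITY));
        }
        for (int doc = 0; doc < scores.length; doc++) {
            Hit top = pq.peek();
            boolean reject = checkSentinel
                ? scores[doc] <= top.score && top.doc != SENTINEL_DOC
                : scores[doc] <= top.score;
            if (reject) {
                continue;
            }
            pq.poll();                          // drop the current least entry
            pq.add(new Hit(doc, scores[doc]));  // insert the competitive hit
        }
        List<Integer> docs = new ArrayList<>();
        for (Hit h : pq) {
            docs.add(h.doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        float[] scores = {Float.NEGATIVE_INFINITY, 1.0f};
        // Original condition: the -inf hit is rejected and a sentinel stays in the queue.
        System.out.println(collect(scores, 2, false).contains(SENTINEL_DOC)); // true
        // Changed condition: both real docs replace the sentinels.
        System.out.println(collect(scores, 2, true).contains(SENTINEL_DOC));  // false
    }
}
```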

The problem with both solutions is that we now have more comparisons per hit,
and the use of sentinels is questionable. I would like to remove the sentinels
and use the old pre-2.9 code for comparing, calling PQ.add() when a
competitive hit arrives. The order of NaN would be unspecified.

To fix the order of NaN, it would be better to replace all score comparisons by 
Float.compare() [also in FieldComparator].
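
As a small illustration of the two comparison styles discussed here, the
following standalone sketch contrasts the primitive operators with
Float.compare. NanOrderDemo, primitiveLess and compareLess are hypothetical
names for this example only; Float.compare itself is standard Java and orders
NaN above +inf.

```java
// Illustrative sketch only; not Lucene code.
final class NanOrderDemo {
    // Primitive comparison, as in the first lessThan variant.
    static boolean primitiveLess(float a, float b) {
        return a < b;
    }

    // Float.compare-based comparison, which gives NaN a defined position.
    static boolean compareLess(float a, float b) {
        return Float.compare(a, b) < 0;
    }

    public static void main(String[] args) {
        float nan = Float.NaN;
        float posInf = Float.POSITIVE_INFINITY;

        // Every primitive comparison involving NaN is false.
        System.out.println(primitiveLess(1.0f, nan)); // false
        System.out.println(primitiveLess(nan, 1.0f)); // false
        System.out.println(nan == nan);               // false

        // Float.compare places NaN after +inf and treats NaN as equal to itself.
        System.out.println(compareLess(posInf, nan)); // true
        System.out.println(compareLess(nan, posInf)); // false
        System.out.println(Float.compare(nan, nan));  // 0
    }
}
```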

I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and 
solved.
