[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Mike Klaas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669843#action_12669843
 ] 

Mike Klaas commented on LUCENE-1534:


[quote]But if we feel that over-emphasizes terms with large idfs, then we 
should not remove an idf factor from one vector, but rather rework our weight 
heuristic, perhaps replacing idf with sqrt(idf), no?[/quote]

FWIW, having implemented web search on a large (500M-document) corpus, we found 
the stock idf factor in Lucene too high, and ended up sqrt()'ing it in 
Similarity.

That said, much of the score in this system came from anchor text, link-analysis 
scores, and term proximity, so it's hard to measure the impact of the idf 
change independently.
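For concreteness, here is a self-contained sketch (not the actual patch) of what damping buys you. The stock formula used below, log(numDocs/(docFreq+1)) + 1, is assumed to match DefaultSimilarity in the 2.x line; the document counts are made up.

```java
// Illustrative only: compares a Lucene-2.x-style default idf,
// log(numDocs/(docFreq+1)) + 1, against a sqrt-damped variant like
// the one described above. All numbers here are hypothetical.
public class IdfDamping {

    // Stock DefaultSimilarity-style idf (assumed formula).
    static double stockIdf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Damped variant: sqrt of the stock value.
    static double dampedIdf(int docFreq, int numDocs) {
        return Math.sqrt(stockIdf(docFreq, numDocs));
    }

    public static void main(String[] args) {
        int numDocs = 500_000_000;          // corpus size mentioned above
        int rare = 1_000, common = 50_000_000;
        // Ratio of a rare term's weight to a common term's weight:
        System.out.printf("stock ratio:  %.2f%n",
                stockIdf(rare, numDocs) / stockIdf(common, numDocs));
        System.out.printf("damped ratio: %.2f%n",
                dampedIdf(rare, numDocs) / dampedIdf(common, numDocs));
    }
}
```

Damping compresses the spread between rare and common terms, which is the "stock idf is too high" effect being described.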

> idf(t) is not actually squared during scoring?
> --
>
> Key: LUCENE-1534
> URL: https://issues.apache.org/jira/browse/LUCENE-1534
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> The javadocs for Similarity:
>   
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
> show idf(t) as being squared when computing net query score.  But I
> don't think it is actually squared, looking at the sources?  Maybe
> it used to be, e.g. this interesting discussion:
>   http://markmail.org/message/k5pl7scmiac5wosb
> Or am I missing something?  We just need to fix the javadocs to take
> away the "squared"...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669794#action_12669794
 ] 

Michael McCandless commented on LUCENE-1534:


bq. But if we feel that over-emphasizes terms with large idfs, then we should 
not remove an idf factor from one vector, but rather rework our weight 
heuristic, perhaps replacing idf with sqrt(idf), no?

I agree, that should be the approach if we decide idf^2 is too much, but I 
don't have an opinion (yet!) on whether it's too much (but that thread 
referenced above is nonetheless interesting).





[jira] Resolved: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1534.


Resolution: Invalid




[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669791#action_12669791
 ] 

Michael McCandless commented on LUCENE-1534:


bq. But the more important part of the scoring is how terms are scored relative 
to each other in the same query - and that is still idf**2

Ahh OK, now I get it -- idf is indeed factored in twice.  A single TermQuery is 
a somewhat degenerate case; queries with more than one term will show the 
effect.  Thanks for clarifying ;)




[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669756#action_12669756
 ] 

Yonik Seeley commented on LUCENE-1534:
--

{quote}
EG for a single TermQuery, the queryWeight will always be 1.0 (except
for roundoff errors), cancelling out that idf factor, leaving only one
idf factor?
{quote}

Yes, for a score returned to the user, only one idf factor remains because of 
the normalization.
*But* the more important part of the scoring is how terms are scored relative 
to each other in the same query - and that is still idf**2.
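That relative weighting can be sketched numerically. This is just arithmetic following the Similarity javadoc formulas, not Lucene code, and the idf values are made up: within one query the queryNorm is shared by every term, so it cancels out of any ratio of term contributions, leaving terms weighted against each other by idf squared.

```java
// Arithmetic sketch of "relative weighting is idf**2": a hypothetical
// two-term query where both terms have the same tf and fieldNorm.
public class RelativeIdf {

    // Per-term score contribution: tf * idf^2 * queryNorm * fieldNorm
    // (one idf from the query weight, one from the field weight).
    static double contribution(double tf, double idf,
                               double queryNorm, double fieldNorm) {
        return tf * idf * idf * queryNorm * fieldNorm;
    }

    public static void main(String[] args) {
        double idfA = 8.0, idfB = 2.0;   // made-up idf values
        // queryNorm is shared by both terms of the query:
        double queryNorm = 1.0 / Math.sqrt(idfA * idfA + idfB * idfB);
        double a = contribution(1.0, idfA, queryNorm, 1.0);
        double b = contribution(1.0, idfB, queryNorm, 1.0);
        System.out.println(a / b);                          // 16.0, i.e. (8/2)^2
        System.out.println((idfA / idfB) * (idfA / idfB));  // 16.0 again
    }
}
```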




[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669754#action_12669754
 ] 

Doug Cutting commented on LUCENE-1534:
--

sumOfSquaredWeights properly normalizes query vectors to the unit sphere.  We 
can't easily do that with document vectors, since idfs change as the collection 
changes.  So we instead use a heuristic to normalize documents, 
sqrt(numTokens), which is usually a good approximation.  Regardless of how it's 
normalized, the global term weight factors in twice in each addend, once from 
each vector.




[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669750#action_12669750
 ] 

Doug Cutting commented on LUCENE-1534:
--

I've always found "idf squared" an unhelpful description.  We're computing the 
dot-product of two vectors, which, once they are normalized, gives the cosine 
of the angle between them.  Terms are dimensions.  The magnitude in each 
dimension is the weight of the term in a query or document.
together, and you do indeed get an "idf squared" factor in each addend of the 
score.  But if we feel that over-emphasizes terms with large idfs, then we 
should not remove an idf factor from one vector, but rather rework our weight 
heuristic, perhaps replacing idf with sqrt(idf), no?
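Spelling that out with the weight heuristic given above (same symbols; norm_q and norm_d are the query-side and document-side normalizers):

```latex
\mathrm{score}(q,d) \;=\; \sum_{t \in q} w_q(t)\, w_d(t),
\qquad
w_v(t) \;=\; \frac{\sqrt{\mathrm{tf}_v(t)}\;\mathrm{idf}(t)}{\mathrm{norm}_v}
\quad \text{for } v \in \{q, d\},
```

so each addend is sqrt(tf_q(t) * tf_d(t)) * idf(t)^2 / (norm_q * norm_d): the "idf squared" factor, one idf contributed by each vector.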




[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669743#action_12669743
 ] 

Michael McCandless commented on LUCENE-1534:


But sumOfSquaredWeights is only used as a fixed normalization across
all sub-queries in the Query?

EG for a single TermQuery, the queryWeight will always be 1.0 (except
for roundoff errors), cancelling out that idf factor, leaving only one
idf factor?
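The cancellation described here falls straight out of the javadoc formulas; a sketch with made-up numbers (plain arithmetic, not Lucene code):

```java
// Single-TermQuery case: queryNorm = 1/sqrt((boost*idf)^2), so the
// query-side idf cancels and only the field-side idf survives.
// Values are made up; names follow the Similarity javadocs.
public class SingleTermNorm {

    static double queryWeight(double boost, double idf) {
        double sumOfSquaredWeights = (boost * idf) * (boost * idf);
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
        return boost * idf * queryNorm;   // always ~1.0 for one term
    }

    public static void main(String[] args) {
        double idf = 3.6390574;
        double qw = queryWeight(1.0, idf);
        double fieldWeight = 1.4142135 * idf * 0.125;  // tf * idf * fieldNorm
        System.out.println(qw);                              // ~1.0
        System.out.println(qw * fieldWeight - fieldWeight);  // ~0.0
    }
}
```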





[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669673#action_12669673
 ] 

Yonik Seeley commented on LUCENE-1534:
--

Right, and explain() shows this by including one idf factor in the queryWeight 
and one in the fieldWeight:

{code}
0.6433005 = (MATCH) weight(text:solr in 14), product of:
  0.9994 = queryWeight(text:solr), product of:
3.6390574 = idf(docFreq=1, numDocs=26)
0.27479643 = queryNorm
  0.64330053 = (MATCH) fieldWeight(text:solr in 14), product of:
1.4142135 = tf(termFreq(text:solr)=2)
3.6390574 = idf(docFreq=1, numDocs=26)
0.125 = fieldNorm(field=text, doc=14)
{code}
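As a sanity check (plain arithmetic, not Lucene code), multiplying the numbers from that explain output back together reproduces the score, with idf appearing once in each product:

```java
// Recomputes the explain output above by hand. Inputs are the exact
// values explain printed; small float/double rounding differences
// from the printed score are expected.
public class ExplainCheck {

    static double score(double idf, double queryNorm,
                        double tf, double fieldNorm) {
        double queryWeight = idf * queryNorm;        // first idf factor
        double fieldWeight = tf * idf * fieldNorm;   // second idf factor
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double s = score(3.6390574, 0.27479643, 1.4142135, 0.125);
        System.out.println(s);   // ~0.6433005
    }
}
```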





[jira] Commented: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669663#action_12669663
 ] 

Mark Miller commented on LUCENE-1534:
-

Hmmm... we do multiply it in twice, but a bit happens in between: we multiply 
by idf(t) in sumOfSquaredWeights() and then again in normalize(float 
queryNorm).

Technically that is boost * idf(t) * norm * idf(t), right? That is, idf(t)^2 * 
boost * norm? And then that times tf in the scorer...




[jira] Created: (LUCENE-1534) idf(t) is not actually squared during scoring?

2009-02-02 Thread Michael McCandless (JIRA)
idf(t) is not actually squared during scoring?
--

 Key: LUCENE-1534
 URL: https://issues.apache.org/jira/browse/LUCENE-1534
 Project: Lucene - Java
  Issue Type: Bug
  Components: Query/Scoring
Affects Versions: 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9


The javadocs for Similarity:

  
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html

show idf(t) as being squared when computing net query score.  But I
don't think it is actually squared, looking at the sources?  Maybe
it used to be, e.g. this interesting discussion:

  http://markmail.org/message/k5pl7scmiac5wosb

Or am I missing something?  We just need to fix the javadocs to take
away the "squared"...




[jira] Resolved: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-02-02 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1483.


Resolution: Fixed

Committed revision 740021.

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, 
> sortCollate.py
>
>
> This issue changes how an IndexSearcher searches over multiple segments. The 
> current method of searching multiple segments is to use a MultiSegmentReader 
> and treat all of the segments as one. This causes filters and FieldCaches to 
> be keyed to the MultiReader and makes reopen expensive. If only a few 
> segments change, the FieldCache is still loaded for all of them.
> This patch changes things by searching each individual segment one at a time, 
> but sharing the HitCollector used across each segment. This allows 
> FieldCaches and Filters to be keyed on individual SegmentReaders, making 
> reopen much cheaper. FieldCache loading over multiple segments can be much 
> faster as well - with the old method, all unique terms for every segment are 
> enumerated against each segment - because of the likely logarithmic change in 
> terms per segment, this can be very wasteful. Searching individual segments 
> avoids this cost. The term/document statistics from the multireader are used 
> to score results for each segment.
> When sorting, it's more difficult to use a single HitCollector for each sub 
> searcher. Ordinals are not comparable across segments. To account for this, a 
> new field sort enabled HitCollector is introduced that is able to collect and 
> sort across segments (because of its ability to compare ordinals across 
> segments). This TopFieldCollector class will collect the values/ordinals for 
> a given segment, and upon moving to the next segment, translate any 
> ordinals/values so that they can be compared against the values for the new 
> segment. This is done lazily.
> All in all, the switch seems to provide numerous performance benefits, in 
> both sorted and non-sorted search. We were seeing a noticeable loss on 
> indices with lots of segments (1000?) and certain queue sizes / queries, but 
> the latest results seem to show that's been mostly taken care of (you 
> shouldn't be using such a large queue on such a segmented index anyway).
> * Introduces
> ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
> IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
> IndexReaders and sort on fields.
> ** FieldValueHitQueue - a Priority queue that is part of the 
> TopFieldCollector implementation.
> ** FieldComparator - a new Comparator class that works across IndexReaders. 
> Part of the TopFieldCollector implementation.
> ** FieldComparatorSource - new class to allow for custom Comparators.
> * Alters
> ** IndexSearcher uses a single HitCollector to collect hits against each 
> individual SegmentReader. All the other changes stem from this ;)
> * Deprecates
> ** TopFieldDocCollector
> ** FieldSortedHitQueue
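The "single HitCollector shared across segments" idea can be sketched without the Lucene API (the names below are illustrative, not the classes the patch introduces): the searcher visits segments one at a time and hands the collector each segment's starting doc id, so segment-local hits can be remapped to global ids.

```java
// Hand-rolled sketch of per-segment collection with a shared collector.
import java.util.ArrayList;
import java.util.List;

public class PerSegmentCollect {

    interface Collector {
        void setNextReader(int docBase);  // called once per segment
        void collect(int segmentDoc);     // segment-local doc id
    }

    static class GlobalIdCollector implements Collector {
        final List<Integer> hits = new ArrayList<>();
        int docBase;
        public void setNextReader(int docBase) { this.docBase = docBase; }
        public void collect(int segmentDoc) { hits.add(docBase + segmentDoc); }
    }

    public static void main(String[] args) {
        int[] segmentSizes = {4, 3};          // two toy segments
        GlobalIdCollector c = new GlobalIdCollector();
        int docBase = 0;
        for (int size : segmentSizes) {
            c.setNextReader(docBase);         // tell collector where we are
            for (int doc = 0; doc < size; doc++) {
                c.collect(doc);               // pretend every doc matches
            }
            docBase += size;
        }
        System.out.println(c.hits);           // [0, 1, 2, 3, 4, 5, 6]
    }
}
```

Because the collector only ever sees one segment's reader at a time, per-segment structures (FieldCaches, Filters) can be keyed to that segment, which is the reopen-cost win described above.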




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-02-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669643#action_12669643
 ] 

Michael McCandless commented on LUCENE-1483:


bq. javadoc: maxDoc should be numDocs in "in order of decreasing maxDoc()"

Woops, right!  I'll fix & commit.  Thanks Yonik!




[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-02-02 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669632#action_12669632
 ] 

Yonik Seeley commented on LUCENE-1483:
--

+1, thanks Mike!

javadoc: maxDoc should be numDocs in "in order of decreasing maxDoc()"


> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, 
> sortCollate.py
>
>
> This issue changes how an IndexSearcher searches over multiple segments. The 
> current method of searching multiple segments is to use a MultiSegmentReader 
> and treat all of the segments as one. This causes filters and FieldCaches to 
> be keyed to the MultiReader and makes reopen expensive. If only a few 
> segments change, the FieldCache is still loaded for all of them.
> This patch changes things by searching each individual segment one at a time, 
> but sharing the HitCollector used across each segment. This allows 
> FieldCaches and Filters to be keyed on individual SegmentReaders, making 
> reopen much cheaper. FieldCache loading over multiple segments can be much 
> faster as well - with the old method, all unique terms for every segment is 
> enumerated against each segment - because of the likely logarithmic change in 
> terms per segment, this can be very wasteful. Searching individual segments 
> avoids this cost. The term/document statistics from the multireader are used 
> to score results for each segment.
> When sorting, its more difficult to use a single HitCollector for each sub 
> searcher. Ordinals are not comparable across segments. To account for this, a 
> new field sort enabled HitCollector is introduced that is able to collect and 
> sort across segments (because of its ability to compare ordinals across 
> segments). This TopFieldCollector class will collect the values/ordinals for 
> a given segment, and upon moving to the next segment, translate any 
> ordinals/values so that they can be compared against the values for the new 
> segment. This is done lazily.
> All and all, the switch seems to provide numerous performance benefits, in 
> both sorted and non sorted search. We were seeing a good loss on indices with 
> lots of segments (1000?) and certain queue sizes / queries, but the latest 
> results seem to show thats been mostly taken care of (you shouldnt be using 
> such a large queue on such a segmented index anyway).
> * Introduces
> ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
> IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
> IndexReaders and sort on fields.
> ** FieldValueHitQueue - a priority queue that is part of the 
> TopFieldCollector implementation.
> ** FieldComparator - a new Comparator class that works across IndexReaders. 
> Part of the TopFieldCollector implementation.
> ** FieldComparatorSource - new class to allow for custom Comparators.
> * Alters
> ** IndexSearcher uses a single HitCollector to collect hits against each 
> individual SegmentReader. All the other changes stem from this ;)
> * Deprecates
> ** TopFieldDocCollector
> ** FieldSortedHitQueue
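Mechanically, the per-segment search described above amounts to walking the sub-readers one at a time and re-basing segment-local doc ids into the global doc id space. A minimal sketch, with illustrative names (only the shared-collector/docBase idea comes from the description, not the exact patch API):

```java
// Sketch of per-segment collection; class and method names are
// illustrative stand-ins, not the exact API of the LUCENE-1483 patch.
abstract class PerSegmentCollector {
    // Called once before each segment; docBase offsets segment-local
    // doc ids into the global (MultiReader) doc id space.
    abstract void setNextReader(int docBase);

    abstract void collect(int segmentDoc, float score);
}

class SegmentedSearch {
    // Search each segment one at a time, sharing a single collector.
    static void search(int[] segmentSizes, PerSegmentCollector collector) {
        int docBase = 0;
        for (int size : segmentSizes) {
            collector.setNextReader(docBase);
            for (int doc = 0; doc < size; doc++) {
                collector.collect(doc, 1.0f); // a real scorer drives this loop
            }
            docBase += size;
        }
    }
}
```

Because each segment's FieldCache is keyed to its own SegmentReader here, reopening an index only reloads caches for the segments that actually changed.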

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669605#action_12669605
 ] 

Robert Muir commented on LUCENE-1532:
-

When you talk about hardcoding normalization, I really don't see how it's 
unfair, or even 'hardcoding', to assume a Zipfian distribution in any corpus of 
text when incorporating the frequency weight.

I agree the specific corpus determines some of these properties, but at the end 
of the day they all tend to have the same general distribution curve, even if 
the specifics are different.

> File based spellcheck with doc frequencies supplied
> ---
>
> Key: LUCENE-1532
> URL: https://issues.apache.org/jira/browse/LUCENE-1532
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/spellchecker
>Reporter: David Bowen
>Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally 
> valid, so it can suggest a very obscure word rather than a more common word 
> which is equally close to the misspelled word that was entered.  It would be 
> very useful to have the option of supplying an integer with each word which 
> indicates its commonness.  E.g., the integer could be the document frequency 
> in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by 
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
> word, and a class which implements the interface by looking up the frequency 
> in an index.  So Lucene users can provide alternative implementations of 
> DocFrequencyInfo.  I could submit this as a patch if there is interest.  
> Alternatively, it might be better to just extend the spellcheck API to have a 
> way to supply the frequencies when you create a PlainTextDictionary, but that 
> would mean storing the frequencies somewhere when building the spellcheck 
> index, and I'm not sure how best to do that.
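For concreteness, the proposed DocFrequencyInfo could look something like the sketch below. Only the interface name comes from the description; the method shape and the map-backed implementation are illustrative assumptions (a real implementation would presumably delegate to IndexReader.docFreq):

```java
import java.util.Map;

// Sketch of the proposed interface; only the name DocFrequencyInfo is
// from the description above - the rest is an illustrative guess.
interface DocFrequencyInfo {
    // Returns the document frequency of a word, or 0 if unknown.
    int docFreq(String word);
}

// Trivial in-memory implementation; a real one would look the word up
// in a Lucene index (e.g. via IndexReader.docFreq(new Term(field, word))).
class MapDocFrequencyInfo implements DocFrequencyInfo {
    private final Map<String, Integer> freqs;

    MapDocFrequencyInfo(Map<String, Integer> freqs) {
        this.freqs = freqs;
    }

    @Override
    public int docFreq(String word) {
        return freqs.getOrDefault(word, 0);
    }
}
```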




[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669603#action_12669603
 ] 

Mark Miller commented on LUCENE-1532:
-

Just to clear up the normalization (since I haven't done a good job of 
explaining it):

If you use a weight that combines frequency and edit distance (so that small 
edit-distance wins don't completely decide the suggestion), using the real term 
frequencies will blow up high-frequency terms unfairly. Normalizing down makes 
things a little better: in an index with term freqs that range from 1 to 
400,000, you don't want a term with freq 150,000 to be heavily preferred over 
something with freq 125,000. They should really be treated about the same, with 
edit distance the main factor.

- Mark
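A rough sketch of the normalization described above, assuming a 1-10 bucket scale and the 1-400,000 frequency range from the example (the exact scheme used in the experiments may differ):

```java
// Sketch: bucket raw term frequencies into a small 1-10 scale so that
// edit distance, not raw frequency, dominates the suggestion score.
class FreqNormalizer {
    private final int maxFreq;        // top of the scale, e.g. the max (or a high-percentile) freq
    private static final int BUCKETS = 10;

    FreqNormalizer(int maxFreq) {
        this.maxFreq = maxFreq;
    }

    // Maps a raw frequency to 1..10 (0 for unseen words).
    int normalize(int rawFreq) {
        if (rawFreq <= 0) return 0;
        int capped = Math.min(rawFreq, maxFreq);
        return Math.max(1, (int) Math.ceil(BUCKETS * capped / (double) maxFreq));
    }
}
```

With maxFreq = 400,000, both 150,000 and 125,000 land in bucket 4, so edit distance decides between them - the behavior the comment argues for.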




[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669601#action_12669601
 ] 

Robert Muir commented on LUCENE-1532:
-

I think we are on the same page here; I'm just suggesting that if the broad 
goal is to improve spellcheck, smarter distance metrics are also worth 
looking at.

In my tests I got significantly better results by tuning the ED function as 
mentioned. I also use FreeTTS/cmudict to incorporate phonetic edit distance and 
average the two (the idea being to help with true typos but also with genuinely 
bad spellers). The downside to these tricks is that they are 
language-dependent.

For reference, the other thing I will mention is that aspell has some test 
data here: http://aspell.net/test/orig/ - maybe it is useful in some way?
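A sketch of the blending described above - averaging an orthographic and a phonetic edit distance. Here `phoneticDistance` is a placeholder; the version described in the comment compares cmudict phoneme sequences obtained via FreeTTS:

```java
class BlendedDistance {
    // Standard Levenshtein distance over characters.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + sub);
            }
        }
        return d[a.length()][b.length()];
    }

    // Placeholder: a real phonetic distance would run Levenshtein over
    // phoneme sequences (e.g. from cmudict), not raw characters.
    static int phoneticDistance(String a, String b) {
        return levenshtein(a, b);
    }

    // Average the orthographic and phonetic distances.
    static double blended(String a, String b) {
        return (levenshtein(a, b) + phoneticDistance(a, b)) / 2.0;
    }
}
```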





[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669599#action_12669599
 ] 

Mark Miller commented on LUCENE-1532:
-

bq. In one corpus doc. frequency of 3 means it is probably a typo, in another 
this means nothing...

I think typos are another problem, though. There can be too many high-frequency 
typos and low-frequency correct spellings, so I think this has to be attacked in 
other ways anyway - maybe by slightly favoring a true dictionary over the user 
dictionary. It's certainly hard to use freq for it, though it does help keep 
typos from getting suggested - and you only pay by seeing fewer less common, 
but correct, suggestions.

bq. My proposal is to work with real frequency as you have no information loss 
there ... 

I don't think the info you are losing is helpful. If you start heavily favoring 
a word that occurs 70,000 times over words that occur 40,000 times, I think that 
works in favor of bad suggestions. On a scaled freq chart, they might actually 
be a 4 and a 5, or even the same value. Considering that 70k vs. 40k likely 
doesn't tell you much about which is the better suggestion, scaling allows edit 
distance to play the larger role that it should in deciding.

Of course it makes sense for the implementation to be able to work with the raw 
values, as you say; we wouldn't want to hardcode the normalization. You're 
right - who knows what is right to do for it, or whether you should even 
normalize at the end of the day. I don't. A casual experiment showed good 
results though, and I think supplying something like that out of the box will 
really improve Lucene's spelling suggestions.
- Mark




[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669595#action_12669595
 ] 

Eks Dev commented on LUCENE-1532:
-

bq. but I'm not sure the exact frequency number at just word-level is really 
that useful for spelling correction, assuming a normal zipfian distribution.

You are probably right - you cannot expect high resolution from frequency - but 
exact frequency information is your "source information". Clustering it into 
anything is just one algorithmic modification after which less information 
remains. Mark suggests 1-10, someone else would be happy with 1-3 ... who could 
tell? Therefore I would recommend keeping the real frequency information and 
leaving the end user the possibility to decide what to do with it.

Frequency distribution is not a simple measure; it depends heavily on corpus 
composition and size. In one corpus a doc frequency of 3 means a word is 
probably a typo; in another it means nothing...

My proposal is to work with the real frequency, as you have no information loss 
there ...




[jira] Issue Comment Edited: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669589#action_12669589
 ] 

markrmil...@gmail.com edited comment on LUCENE-1532 at 2/2/09 5:02 AM:
-

bq. but I'm not sure the exact frequency number at just word-level is really 
that useful for spelling correction, assuming a normal zipfian distribution. 

That's what normalizing down takes care of. 1-10 is just out of the hat; you 
could do 1-3 and have low, medium, and high frequency. (Note: I found that when 
normalizing, taking the top value at around the 90th-95th percentile created a 
better distribution - it knocks off a decent number of outliers that would 
otherwise push everything else to lower freq values.)

Consider I make a site called MarkMiller.com - it's full of stuff about Mark 
Miller. In my dictionary is Mike Muller, though, who is mentioned on the site 
twice; Mark Miller is mentioned thousands of times. Now if I type something 
like Mlller and it suggests Muller just using edit distance, that type of thing 
will create a lot of bad suggestions. Muller is practically unheard of on my 
site, but I am suggesting it over Miller, which is all over the place. Edit 
distance by itself as the first cutoff creates too many of these close bad 
suggestions. So it's not that freq should be used heavily - but it can clear up 
these little oddities quite nicely.
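The percentile cap in the note above could be computed along these lines (a sketch; only the 90th-95th percentile idea is from the comment, the rest is illustrative):

```java
import java.util.Arrays;

// Sketch: use a high-percentile frequency as the top of the
// normalization scale, so a few extreme outliers don't compress every
// other term into the lowest buckets.
class PercentileCap {
    static int percentile(int[] freqs, double p) {
        int[] sorted = freqs.clone();
        Arrays.sort(sorted);
        // Nearest-rank style index; a real implementation might interpolate.
        int idx = (int) Math.floor(p * (sorted.length - 1));
        return sorted[idx];
    }
}
```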


  was (Author: markrmil...@gmail.com):
bq. but I'm not sure the exact frequency number at just word-level is 
really that useful for spelling correction, assuming a normal zipfian 
distribution. 

That's what normalizing down takes care of. 1-10 is just out of the hat; you 
could do 1-3 and have low, medium, and high frequency.

Consider I make a site called MarkMiller.com - it's full of stuff about Mark 
Miller. In my dictionary is Mike Muller, though, who is mentioned on the site 
twice; Mark Miller is mentioned thousands of times. Now if I type something 
like Mlller and it suggests Muller just using edit distance, that type of thing 
will create a lot of bad suggestions. Muller is practically unheard of on my 
site, but I am suggesting it over Miller, which is all over the place. Edit 
distance by itself as the first cutoff creates too many of these close bad 
suggestions. So it's not that freq should be used heavily - but it can clear up 
these little oddities quite nicely.

  



[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669589#action_12669589
 ] 

Mark Miller commented on LUCENE-1532:
-

bq. but I'm not sure the exact frequency number at just word-level is really 
that useful for spelling correction, assuming a normal zipfian distribution. 

That's what normalizing down takes care of. 1-10 is just out of the hat; you 
could do 1-3 and have low, medium, and high frequency.

Consider I make a site called MarkMiller.com - it's full of stuff about Mark 
Miller. In my dictionary is Mike Muller, though, who is mentioned on the site 
twice; Mark Miller is mentioned thousands of times. Now if I type something 
like Mlller and it suggests Muller just using edit distance, that type of thing 
will create a lot of bad suggestions. Muller is practically unheard of on my 
site, but I am suggesting it over Miller, which is all over the place. Edit 
distance by itself as the first cutoff creates too many of these close bad 
suggestions. So it's not that freq should be used heavily - but it can clear up 
these little oddities quite nicely.





[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669582#action_12669582
 ] 

Robert Muir commented on LUCENE-1532:
-

I agree the frequency information is very useful, but I'm not sure the exact 
frequency number at just the word level is really that useful for spelling 
correction, assuming a normal Zipfian distribution.

Using the frequency as a basic guide - 'typo or non-typo', 'common or 
uncommon', etc. - might be the best use for it.





[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669579#action_12669579
 ] 

Eks Dev commented on LUCENE-1532:
-

bq. I got better results by refining edit distance costs by keyboard layout 

Sure, a better distance helps a lot, but even in that case frequency 
information brings a lot: frequency gives you some information about the corpus 
that is orthogonal to what you get from a pure "word1" vs. "word2" comparison.






[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-02-02 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1483:
---

Attachment: LUCENE-1483.patch

bq. public IndexSearcher(IndexReader r, boolean sortSegments) (or docsInOrder?)

Sounds good... I added an expert ctor that takes boolean docsInOrder (attached).

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, 
> sortCollate.py
>
>



[jira] Reopened: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-02-02 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-1483:


  Assignee: Michael McCandless

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py
>
>
> This issue changes how an IndexSearcher searches over multiple segments. The 
> current method of searching multiple segments is to use a MultiSegmentReader 
> and treat all of the segments as one. This causes filters and FieldCaches to 
> be keyed to the MultiReader and makes reopen expensive. If only a few 
> segments change, the FieldCache is still loaded for all of them.
> This patch changes things by searching each individual segment one at a time, 
> but sharing the HitCollector used across each segment. This allows 
> FieldCaches and Filters to be keyed on individual SegmentReaders, making 
> reopen much cheaper. FieldCache loading over multiple segments can be much 
> faster as well - with the old method, all unique terms across every segment 
> are enumerated against each segment; because the number of unique terms per 
> segment likely grows only logarithmically, this can be very wasteful. 
> Searching individual segments avoids this cost. The term/document statistics 
> from the MultiReader are still used to score results for each segment.
> When sorting, it's more difficult to use a single HitCollector for each sub 
> searcher, because ordinals are not comparable across segments. To account for 
> this, a new field-sort-enabled HitCollector is introduced that is able to 
> collect and sort across segments (because of its ability to compare ordinals 
> across segments). This TopFieldCollector class collects the values/ordinals 
> for a given segment and, upon moving to the next segment, translates any 
> ordinals/values so that they can be compared against the values for the new 
> segment. This is done lazily.
> All in all, the switch seems to provide numerous performance benefits, in 
> both sorted and non-sorted search. We were seeing a notable loss on indices 
> with lots of segments (1000?) and certain queue sizes / queries, but the 
> latest results seem to show that's been mostly taken care of (you shouldn't 
> be using such a large queue on such a segmented index anyway).
> * Introduces
> ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
> IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
> IndexReaders and sort on fields.
> ** FieldValueHitQueue - a Priority queue that is part of the 
> TopFieldCollector implementation.
> ** FieldComparator - a new Comparator class that works across IndexReaders. 
> Part of the TopFieldCollector implementation.
> ** FieldComparatorSource - new class to allow for custom Comparators.
> * Alters
> ** IndexSearcher uses a single HitCollector to collect hits against each 
> individual SegmentReader. All the other changes stem from this ;)
> * Deprecates
> ** TopFieldDocCollector
> ** FieldSortedHitQueue
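
The per-segment collection pattern described above can be sketched roughly as 
follows. This is an illustrative toy (SegmentCollector, GlobalDocCollector, and 
DOCS_PER_SEGMENT are made-up names, not the actual LUCENE-1483 API): the 
searcher visits one SegmentReader at a time, and the collector rebases 
segment-local doc ids with a docBase offset so hits remain unique index-wide.

```java
// Illustrative sketch of per-segment collection (hypothetical names, not
// the actual LUCENE-1483 API): the searcher visits one SegmentReader at a
// time, and the collector rebases segment-local doc ids with a docBase
// offset so they remain unique across the whole index.
import java.util.ArrayList;
import java.util.List;

interface SegmentCollector {
    void setNextReader(int docBase);   // called once when a new segment starts
    void collect(int segmentLocalDoc); // doc id relative to the current segment
}

class GlobalDocCollector implements SegmentCollector {
    final List<Integer> hits = new ArrayList<>();
    private int docBase;

    public void setNextReader(int docBase) { this.docBase = docBase; }

    public void collect(int segmentLocalDoc) {
        hits.add(docBase + segmentLocalDoc); // translate to an index-wide id
    }
}

public class PerSegmentSearchSketch {
    static final int DOCS_PER_SEGMENT = 10; // pretend every segment holds 10 docs

    // Each int[] stands in for the matching docs of one SegmentReader.
    static void search(int[][] segmentHits, SegmentCollector collector) {
        int docBase = 0;
        for (int[] segment : segmentHits) {
            collector.setNextReader(docBase); // caches stay keyed per segment
            for (int doc : segment) collector.collect(doc);
            docBase += DOCS_PER_SEGMENT;
        }
    }

    public static void main(String[] args) {
        GlobalDocCollector c = new GlobalDocCollector();
        search(new int[][] { {1, 4}, {0, 7} }, c); // local ids per segment
        System.out.println(c.hits); // [1, 4, 10, 17]
    }
}
```

Because each segment only ever reports local doc ids, filters and FieldCaches 
can stay keyed to the small per-segment readers, which is what makes reopen 
cheap.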

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

2009-02-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669576#action_12669576
 ] 

Robert Muir commented on LUCENE-1532:
-

Just a suggestion... I got better results by refining edit-distance costs 
based on keyboard layout (substituting a 'd' with an 'f' costs less than 
substituting a 'd' with a 'j'), and I also penalize transpositions less.

If you have lots of terms, it helps for the edit-distance function to be able 
to discriminate between terms better.
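
A keyboard-aware cost function along these lines could be sketched as follows. 
The adjacency table and the specific cost constants are toy assumptions for 
illustration, not the implementation described above:

```java
// Toy sketch of a keyboard-aware edit distance (illustrative costs, not
// Robert's actual implementation): substitutions between keys adjacent on
// a QWERTY row are cheap, and a transposition costs less than two edits.
public class KeyboardEditDistance {
    private static final String[] ROWS = { "qwertyuiop", "asdfghjkl", "zxcvbnm" };
    private static final double ADJACENT_SUB = 0.5; // neighboring keys
    private static final double SUB = 1.0;          // any other substitution
    private static final double INDEL = 1.0;        // insertion / deletion
    private static final double TRANSPOSE = 0.7;    // cheaper than two edits

    static boolean adjacent(char a, char b) {
        for (String row : ROWS) {
            int i = row.indexOf(a), j = row.indexOf(b);
            if (i >= 0 && j >= 0 && Math.abs(i - j) == 1) return true;
        }
        return false;
    }

    static double subCost(char a, char b) {
        if (a == b) return 0.0;
        return adjacent(a, b) ? ADJACENT_SUB : SUB;
    }

    /** Weighted Damerau-Levenshtein distance over the cost table above. */
    static double distance(String s, String t) {
        double[][] d = new double[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i * INDEL;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j * INDEL;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                double best = Math.min(
                    d[i - 1][j - 1] + subCost(s.charAt(i - 1), t.charAt(j - 1)),
                    Math.min(d[i - 1][j] + INDEL, d[i][j - 1] + INDEL));
                // Adjacent transposition, e.g. "ie" -> "ei".
                if (i > 1 && j > 1
                        && s.charAt(i - 1) == t.charAt(j - 2)
                        && s.charAt(i - 2) == t.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + TRANSPOSE);
                }
                d[i][j] = best;
            }
        }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("dog", "fog")); // 0.5: 'd' and 'f' are neighbors
        System.out.println(distance("dog", "jog")); // 1.0: 'd' and 'j' are not
        System.out.println(distance("recieve", "receive")); // 0.7: one transposition
    }
}
```

With many dictionary terms landing at integer distance 1 or 2, fractional 
costs like these spread the candidates out and make ranking discriminate 
between them better.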

> File based spellcheck with doc frequencies supplied
> ---
>
> Key: LUCENE-1532
> URL: https://issues.apache.org/jira/browse/LUCENE-1532
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/spellchecker
>Reporter: David Bowen
>Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally 
> valid, so it can suggest a very obscure word rather than a more common word 
> which is equally close to the misspelled word that was entered.  It would be 
> very useful to have the option of supplying an integer with each word which 
> indicates its commonness - e.g. its document frequency in some index or set 
> of indexes.
> I've implemented a modification to the spellcheck API to support this by 
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
> word, and a class which implements the interface by looking up the frequency 
> in an index.  So Lucene users can provide alternative implementations of 
> DocFrequencyInfo.  I could submit this as a patch if there is interest.  
> Alternatively, it might be better to just extend the spellcheck API to have a 
> way to supply the frequencies when you create a PlainTextDictionary, but that 
> would mean storing the frequencies somewhere when building the spellcheck 
> index, and I'm not sure how best to do that.
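
One way such a hook could be used to rank suggestions is sketched below. The 
DocFrequencyInfo interface name comes from the proposal above, but its shape 
here and the ranking policy (edit distance first, frequency as tie-breaker) 
are illustrative assumptions, not the attached patch:

```java
// Sketch of frequency-aware suggestion ranking (hypothetical interface
// shape, not the actual patch): candidates at the same edit distance are
// ordered by document frequency, so common words beat obscure ones.
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

interface DocFrequencyInfo {
    int docFreq(String word); // how common the word is in some index
}

public class FreqAwareSpellcheck {
    // Plain Levenshtein distance, used to score candidates.
    static int editDistance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++)
                d[i][j] = Math.min(
                    d[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1),
                    Math.min(d[i - 1][j], d[i][j - 1]) + 1);
        return d[s.length()][t.length()];
    }

    static List<String> suggest(String misspelled, List<String> dictionary,
                                DocFrequencyInfo freqs, int max) {
        return dictionary.stream()
            .sorted(Comparator
                .comparingInt((String w) -> editDistance(misspelled, w))
                .thenComparingInt(w -> -freqs.docFreq(w))) // higher freq first
            .limit(max)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy frequencies: "console" and "consul" are both distance 1 from
        // "consol", so frequency breaks the tie; "consult" is distance 2.
        DocFrequencyInfo freqs = w ->
            w.equals("console") ? 9000 : w.equals("consult") ? 4000 : 12;
        System.out.println(
            suggest("consol", List.of("consul", "consult", "console"), freqs, 3));
        // -> [console, consul, consult]
    }
}
```

Keeping the frequencies behind an interface like this means they can come from 
a live index, a second index, or a static file, without the spellchecker 
caring which.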

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org