[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-20 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552265#comment-14552265
 ] 

Mark Harwood commented on LUCENE-329:
-

Committed to 5.x branch and trunk

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
> LUCENE-329.patch, LUCENE-329.patch, LUCENE-329.patch
>
>
> Queries which automatically produce multiple terms (wildcard, range, prefix, 
> fuzzy, etc.) currently suffer from two problems:
> 1) Scores for matching documents are significantly smaller than those of term 
> queries because of the volume of terms introduced (a match on query Foo~ is 
> 0.1, whereas a match on query Foo is 1).
> 2) The rarer forms of expanded terms are favoured over the more common forms 
> because of the IDF. When using fuzzy queries, for example, rare misspellings 
> typically appear in results before the more common correct spellings.
> I will attach a patch that corrects the issues identified above by:
> 1) Overriding Similarity.coord to counteract the downplaying of scores 
> introduced by expanding terms.
> 2) Taking the IDF factor of the most common form of expanded terms as the 
> basis of scoring all other expanded terms.
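
The second point of the description above (scoring every expanded term with the IDF of the most common form) can be illustrated with a minimal sketch. It assumes the classic TF-IDF style formula idf = 1 + ln(numDocs / (docFreq + 1)); the class name, method names and document frequencies are hypothetical and are not taken from the attached patches.

```java
final class SharedIdfSketch {

    // Classic TF-IDF style inverse document frequency (assumed formula):
    // the rarer the term, the larger the value.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    // Document frequency of the most common form among the expanded terms.
    static long maxDocFreq(long... docFreqs) {
        long max = 0;
        for (long df : docFreqs) {
            max = Math.max(max, df);
        }
        return max;
    }

    public static void main(String[] args) {
        long numDocs = 1_000;
        long correctSpelling = 900; // hypothetical docFreq of the common form
        long rareMisspelling = 3;   // hypothetical docFreq of a rare variant

        // Per-term IDF: the rare misspelling gets the larger weight.
        System.out.println(idf(rareMisspelling, numDocs) > idf(correctSpelling, numDocs));

        // Shared IDF: every expansion is weighted with the most common form's docFreq.
        long shared = maxDocFreq(correctSpelling, rareMisspelling);
        System.out.println(idf(shared, numDocs) == idf(correctSpelling, numDocs));
    }
}
```

With a shared document frequency, all variants of one fuzzy term start from the same weight, so a rare misspelling no longer outranks the correct spelling purely because of its rarity.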



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-20 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552260#comment-14552260
 ] 

ASF subversion and git services commented on LUCENE-329:


Commit 1680548 from mharw...@apache.org in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1680548 ]

LUCENE-329: Fix FuzzyQuery defaults to rank exact matches highest




[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-20 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552134#comment-14552134
 ] 

ASF subversion and git services commented on LUCENE-329:


Commit 1680522 from mharw...@apache.org in branch 'dev/trunk'
[ https://svn.apache.org/r1680522 ]

LUCENE-329: Fix FuzzyQuery defaults to rank exact matches highest




[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550376#comment-14550376
 ] 

Mark Harwood commented on LUCENE-329:
-

Thanks, I'll commit tomorrow if there are no objections.




[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550377#comment-14550377
 ] 

Robert Muir commented on LUCENE-329:


+1 to this patch. I like this approach and it seems scoring-system agnostic, 
which was my major issue with IDF-specific stuff. When committing, maybe rename 
adjustDF() since actually it adjusts all term-level stats?




[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-19 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550334#comment-14550334
 ] 

Adrien Grand commented on LUCENE-329:
-

+1 to this patch!




[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-12 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540069#comment-14540069
 ] 

Adrien Grand commented on LUCENE-329:
-

It's not correct to do {{maxTtf = Math.max(ttf, maxTtf)}} because the ttf can 
sometimes be -1, so it would rather need to be something like 
{{maxTtf = ttf == -1 ? -1 : Math.max(ttf, maxTtf)}}.

Also, I preferred how the previous patch built a new TermContext instance 
instead of modifying the current one in place. Maybe you could add that back?
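
The -1 handling raised in the first paragraph above can be sketched as follows. The sketch assumes that a total term frequency of -1 means "not available", the convention visible in the TermContext code quoted elsewhere in this thread. Unlike the one-line expression in the comment, this variant returns -1 as soon as any term reports -1, so that a later valid value cannot overwrite it; that stickiness is an assumption of the sketch, and the class and method names are hypothetical.

```java
final class MaxTtfSketch {

    // Returns the largest total term frequency among the expanded terms,
    // or -1 if any term does not report one.
    static long maxTotalTermFreq(long[] ttfs) {
        long maxTtf = 0;
        for (long ttf : ttfs) {
            if (ttf == -1) {
                return -1; // unknown for one term means unknown for the blend
            }
            maxTtf = Math.max(maxTtf, ttf);
        }
        return maxTtf;
    }

    public static void main(String[] args) {
        // A plain Math.max would turn {3, -1, 5} into 5 and hide the missing value.
        System.out.println(maxTotalTermFreq(new long[] {3, 10, 5}));
        System.out.println(maxTotalTermFreq(new long[] {3, -1, 5}));
    }
}
```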

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch, 
> LUCENE-329.patch



[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2015-05-07 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533301#comment-14533301
 ] 

Adrien Grand commented on LUCENE-329:
-

I like the patch. Maybe we should blend the total term freq just like we blend 
the doc freq, so that it also works with a similarity that uses the ttf instead 
of the df?

Also, this is the second query (the other one is FuzzyLikeThis) where we need 
to hack TermContext a bit in order to decouple the computation of statistics 
from the registration of the term states. I'm wondering if we should improve 
TermContext to make this easier, e.g.

{code}
Index: lucene/core/src/java/org/apache/lucene/index/TermContext.java
===
--- lucene/core/src/java/org/apache/lucene/index/TermContext.java   (revision 1678141)
+++ lucene/core/src/java/org/apache/lucene/index/TermContext.java   (working copy)
@@ -117,16 +117,31 @@
    * should be derived from a {@link IndexReaderContext}'s leaf ord.
    */
   public void register(TermState state, final int ord, final int docFreq, final long totalTermFreq) {
+    register(state, ord);
+    accumulateStatistics(docFreq, totalTermFreq);
+  }
+
+  /**
+   * Expert: Registers and associates a {@link TermState} with an leaf ordinal. The
+   * leaf ordinal should be derived from a {@link IndexReaderContext}'s leaf ord.
+   * On the contrary to {@link #register(TermState, int, int, long)} this method
+   * does NOT update term statistics.
+   */
+  public void register(TermState state, final int ord) {
     assert state != null : "state must not be null";
     assert ord >= 0 && ord < states.length;
     assert states[ord] == null : "state for ord: " + ord
         + " already registered";
+    states[ord] = state;
+  }
+
+  /** Expert: Accumulate term statistics. */
+  public void accumulateStatistics(final int docFreq, final long totalTermFreq) {
     this.docFreq += docFreq;
     if (this.totalTermFreq >= 0 && totalTermFreq >= 0)
       this.totalTermFreq += totalTermFreq;
     else
       this.totalTermFreq = -1;
-    states[ord] = state;
   }
 
   /**
{code}
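
How a caller might use the split proposed in the diff above can be sketched with a simplified stand-in class. TermStatsSketch is hypothetical and only mirrors the shape of the proposed register / accumulateStatistics methods; it is not Lucene's TermContext, and it uses plain Object in place of TermState.

```java
final class TermStatsSketch {
    private final Object[] states; // one slot per index leaf
    private int docFreq;
    private long totalTermFreq;    // becomes -1 once any contribution is unknown

    TermStatsSketch(int leafCount) {
        this.states = new Object[leafCount];
    }

    // Registers a per-leaf state without touching the statistics.
    void register(Object state, int ord) {
        if (state == null) {
            throw new IllegalArgumentException("state must not be null");
        }
        if (states[ord] != null) {
            throw new IllegalStateException("state for ord " + ord + " already registered");
        }
        states[ord] = state;
    }

    // Accumulates statistics independently of state registration.
    void accumulateStatistics(int docFreq, long totalTermFreq) {
        this.docFreq += docFreq;
        if (this.totalTermFreq >= 0 && totalTermFreq >= 0) {
            this.totalTermFreq += totalTermFreq;
        } else {
            this.totalTermFreq = -1;
        }
    }

    // Convenience method that does both, like the original four-argument register.
    void register(Object state, int ord, int docFreq, long totalTermFreq) {
        register(state, ord);
        accumulateStatistics(docFreq, totalTermFreq);
    }

    int docFreq() { return docFreq; }

    long totalTermFreq() { return totalTermFreq; }

    public static void main(String[] args) {
        // Register the states of a rare variant on two leaves, then assign it
        // blended statistics (hypothetical numbers) in a single separate step.
        TermStatsSketch rareVariant = new TermStatsSketch(2);
        rareVariant.register("leaf-0 state", 0);
        rareVariant.register("leaf-1 state", 1);
        rareVariant.accumulateStatistics(900, 4_000L);
        System.out.println(rareVariant.docFreq() + " " + rareVariant.totalTermFreq());
    }
}
```

Keeping the two steps separate lets a query register where a term occurs while supplying statistics that were computed elsewhere, which is the decoupling the comment asks for.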

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 5.x
>
> Attachments: ASF.LICENSE.NOT.GRANTED--patch.txt, LUCENE-329.patch



[jira] [Commented] (LUCENE-329) Fuzzy query scoring issues

2012-03-09 Thread Paul taylor (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226066#comment-13226066
 ] 

Paul taylor commented on LUCENE-329:


Why has this been closed?

> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 1.2
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: patch.txt

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2011-01-27 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987650#action_12987650
 ] 

Mark Harwood commented on LUCENE-329:
-

bq.  I think you can safely implement a RewriteMethod to do whatever you want?

Yep, I've got workarounds using FuzzyLikeThis that work for me but have long 
had a general unease about the "out of the box" experience for others.

However, things are certainly better than they were when this issue was first 
raised, and the main concerns have been addressed.

bq. So FuzzyQuery behaves now more as one would expect

Is it worth explicitly stating those expectations? Mine would be based on these 
principles:
1) IDF is commonly accepted as useful when ranking partial matches of queries 
with multiple optional clauses.
2) IDF doesn't stop being useful if one of those clauses just happens to be a 
term flagged as "fuzzy".

So given a query: rareWord~ OR commonWord~ 
I would expect an exact match on "rareWord" to rank higher than an exact match 
on "commonWord".
I don't think the current implementation respects this.
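
The expectation stated above can be checked with a small sketch that compares the weight of two fuzzy clauses with and without an IDF factor. It assumes the classic formula idf = 1 + ln(numDocs / (docFreq + 1)) and one document frequency per clause; the class name, method name and numbers are hypothetical.

```java
final class ClauseWeightSketch {

    // Weight of one fuzzy clause: either a constant (IDF ignored) or an IDF
    // computed from the clause's document frequency.
    static double clauseWeight(long clauseDocFreq, long numDocs, boolean useIdf) {
        if (!useIdf) {
            return 1.0;
        }
        return 1.0 + Math.log((double) numDocs / (double) (clauseDocFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;
        long rareWordDf = 50;        // hypothetical docFreq for rareWord~
        long commonWordDf = 200_000; // hypothetical docFreq for commonWord~

        // With IDF, an exact match on the rare clause outweighs one on the common clause.
        System.out.println(clauseWeight(rareWordDf, numDocs, true)
                > clauseWeight(commonWordDf, numDocs, true));

        // Without IDF, both clauses weigh the same and the distinction is lost.
        System.out.println(clauseWeight(rareWordDf, numDocs, false)
                == clauseWeight(commonWordDf, numDocs, false));
    }
}
```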








> Fuzzy query scoring issues
> --
>
> Key: LUCENE-329
> URL: https://issues.apache.org/jira/browse/LUCENE-329
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 1.2rc5
> Environment: Operating System: All
> Platform: All
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: patch.txt

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2011-01-27 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987598#action_12987598
 ] 

Uwe Schindler commented on LUCENE-329:
--

I agree with your complaints, but this issue was more about MTQ queries in 
general and their strange scoring. So FuzzyQuery behaves now more as one would 
expect. We already have different variants of this query, like FuzzyLikeThis, 
that also solve this issue.

This is why I closed the issue.




[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2011-01-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987600#action_12987600
 ] 

Robert Muir commented on LUCENE-329:


Mark, I tend to agree, but at the same time I think you can safely implement a 
RewriteMethod to do whatever you want? (e.g. apply the logic of FuzzyLikeThis)

Doing something special with IDF is really specific to certain Similarities, 
for example your Similarity might not use the traditional IDF at all, but 
something involving totalTermFreq and sumOfTotalTermFreq (like language 
modelling).

So I am concerned about doing tricky things with the scoring system by default 
for this query... we provide the simple options in core (Scoring, BoostOnly, 
etc) though.

An idea would be to factor the logic out of FuzzyLikeThisQuery into a 
FuzzyLikeThisRewriteMethod, so you could just call .setRewriteMethod on your 
fuzzy query and use it.





[jira] Commented: (LUCENE-329) Fuzzy query scoring issues

2011-01-27 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987585#action_12987585
 ] 

Mark Harwood commented on LUCENE-329:
-

bq.So term's idf is not used at all. This should solve this problem

A sub-optimal solution, for the reasons I outlined earlier:

bq. The problem with ignoring IDF completely is that it doesn't help balance 
partial matches where there is >1 fuzzy element in the query, e.g. in a query 
for John~ Patitucci~ I'm probably more interested in a partial match on the 
rarer surname than a partial match on the common forename. Obliterating IDF 
completely as a factor would lose this feature (available in 
FuzzyLikeThisQuery).
