Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-19 Thread danield
Update: I have implemented my own subclasses of QueryParser, BooleanQuery,
BooleanScorer and Similarity to deal with this.

I have been successful in getting the exact behaviour I want... when
calling the .explain() method. However, the scores for some documents often
differ when calling IndexSearcher.search() vs IndexSearcher.explain().

I am a bit confused by this. coord() seems to be one of the things I need to
change, but it is clearly not the only element in the formula that I have
changed in the .explain() pipeline but not in the .search() one.

The implementation of BulkScorer remains perplexing to me, and I suspect I
have missed something in there. Any pointers?
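
For concreteness, this is a minimal sketch of the kind of side-by-side check
that exposes such a discrepancy, assuming Lucene 4.x APIs (the searcher and
query are placeholders for whatever the application already builds):

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public final class ScoreComparison {
    // Print the search()-time score next to the explain()-time value for the
    // same query and documents, to spot where the two code paths diverge.
    public static void compare(IndexSearcher searcher, Query query) throws java.io.IOException {
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            Explanation explanation = searcher.explain(query, sd.doc);
            System.out.printf("doc=%d  search()=%.6f  explain()=%.6f%n",
                    sd.doc, sd.score, explanation.getValue());
        }
    }
}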

Thanks!
Daniel


On 15 January 2015 at 23:00, Jack Krupansky wrote:

 File a Jira for this particular doc fix since it is significant and not
 just mere wordsmithing. Better yet, submit a patch since that's Javadoc,
 although the exact form of the doc fix might be debatable, so a general
 description of the problem should be sufficient, unless you feel
 motivated.

 -- Jack Krupansky


Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread Michael Sokolov

On 1/15/15 11:23 AM, danield wrote:

Hi Mike,

Thank you for your reply. Yes, I had thought of this, but it is not a
solution to my problem: the term frequency, and therefore the results, will
still be wrong, because prepending or appending a string to the term still
makes it a different term.

Similarly, I could use regex queries, but again that doesn't fix the TF
issue. I am not talking hypothetically here; I have experimental proof that
this doesn't work (i.e. the precision for my task goes down in my
experiments).

Also, I agree that when your fields are essentially different, as in /title/,
/author/ and /text/, normalizing by field length makes sense. But in my case
my fields are many and are all chunks of a larger text (extracted sentences
that have been labelled with a number of different classes), and in the
experiments I am running I am trying to establish whether weighting
sentences in different classes differently will lead to increased relevance
of results.

This also doesn't change the fact that the documentation is wrong! Any ideas
how to fix it?
Daniel

In Lucene a Term encodes the field and the term text, so the 
documentation is not incorrect.  In fact this is stated explicitly here:


Lucene is field based, hence each query term applies to a single field, 
document length normalization is by the length of the certain field, and 
in addition to document boost there are also document fields boosts.


You might consider indexing your sentences as multiple values of a 
single field.  If you need to label them you could possibly use payloads 
for that.
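
A minimal sketch of that layout, assuming the Lucene 4.x indexing API (field
names and sample text are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MultiValuedSentenceIndexing {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_4_9, new StandardAnalyzer(Version.LUCENE_4_9));
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            // Every sentence is added as another value of the same field, so
            // Lucene sees one multi-valued "sentence" field: tf and the length
            // norm are computed over all values together, not per sentence.
            for (String sentence : new String[] {"term1 term1", "term1 and some more text"}) {
                doc.add(new TextField("sentence", sentence, Field.Store.NO));
            }
            writer.addDocument(doc);
        }
    }
}

Attaching the class labels as payloads would additionally need a
payload-aware analysis chain (for example one built around
DelimitedPayloadTokenFilter), which is beyond this sketch.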


-Mike


Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread danield
Oh, thanks Mike, it did say so somewhere after all. I guess it wouldn't hurt
to make that explanation more prominent, as I clearly missed it.

Never mind, I am working on my own solution for this, through subclassing
QueryParser, BooleanQuery, BooleanScorer, Similarity and a bunch of other
classes.
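
A rough sketch of the Similarity piece of such an approach, assuming the
Lucene 4.x DefaultSimilarity API (the neutral values returned here are purely
illustrative, not the actual implementation):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class FieldAgnosticSimilarity extends DefaultSimilarity {

    // Do not reward matching more of the (per-field) query clauses.
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1.0f;
    }

    // Ignore the field length entirely.
    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1.0f;
    }
}

Note that lengthNorm is baked into the norms at index time, so the index has
to be rebuilt with this Similarity set on the IndexWriterConfig, and it must
also be set on the IndexSearcher at query time.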

Cheers,
Daniel







Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread danield
Hi Mike,

Thank you for your reply. Yes, I had thought of this, but it is not a
solution to my problem: the term frequency, and therefore the results, will
still be wrong, because prepending or appending a string to the term still
makes it a different term.

Similarly, I could use regex queries, but again that doesn't fix the TF
issue. I am not talking hypothetically here; I have experimental proof that
this doesn't work (i.e. the precision for my task goes down in my
experiments).

Also, I agree that when your fields are essentially different, as in /title/,
/author/ and /text/, normalizing by field length makes sense. But in my case
my fields are many and are all chunks of a larger text (extracted sentences
that have been labelled with a number of different classes), and in the
experiments I am running I am trying to establish whether weighting
sentences in different classes differently will lead to increased relevance
of results.
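
For illustration, per-class weights of that kind can be expressed as
query-time boosts on the per-class fields (a sketch against the Lucene 4.x
query API; the class names and boost values are invented), although this does
nothing about the per-field tf and coord behaviour discussed above:

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class PerClassBoosts {
    public static BooleanQuery buildQuery(String text) {
        // Hypothetical sentence classes mapped to experiment-specific boosts.
        Map<String, Float> classBoosts = new LinkedHashMap<>();
        classBoosts.put("background", 0.5f);
        classBoosts.put("method", 1.0f);
        classBoosts.put("conclusion", 2.0f);

        // One SHOULD clause per class field, each carrying its own boost.
        BooleanQuery query = new BooleanQuery();
        for (Map.Entry<String, Float> e : classBoosts.entrySet()) {
            TermQuery clause = new TermQuery(new Term(e.getKey(), text));
            clause.setBoost(e.getValue());
            query.add(clause, BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}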

None of this changes the fact that the documentation is wrong, though! Any
ideas how to fix it?
Daniel






Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-15 Thread Jack Krupansky
File a Jira for this particular doc fix since it is significant and not
just mere wordsmithing. Better yet, submit a patch since that's Javadoc,
although the exact form of the doc fix might be debatable, so a general
description of the problem should be sufficient, unless you feel motivated.

-- Jack Krupansky

On Thu, Jan 15, 2015 at 11:23 AM, danield danield...@gmail.com wrote:

 Hi Mike,

 Thank you for your reply. Yes, I had thought of this, but it is not a
 solution to my problem: the term frequency, and therefore the results, will
 still be wrong, because prepending or appending a string to the term still
 makes it a different term.

 Similarly, I could use regex queries, but again that doesn't fix the TF
 issue. I am not talking hypothetically here; I have experimental proof that
 this doesn't work (i.e. the precision for my task goes down in my
 experiments).

 Also, I agree that when your fields are essentially different, as in /title/,
 /author/ and /text/, normalizing by field length makes sense. But in my case
 my fields are many and are all chunks of a larger text (extracted sentences
 that have been labelled with a number of different classes), and in the
 experiments I am running I am trying to establish whether weighting
 sentences in different classes differently will lead to increased relevance
 of results.

 This also doesn't change the fact that the documentation is wrong! Any ideas
 how to fix it?
 Daniel







Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-14 Thread Michael Sokolov
In practice, normalization by field length proves to be more useful than
normalization by the sum of the lengths of all fields (document length),
which I think is what you are after. Think of a book-chapter document with
two fields: title and full text. It makes little sense to weight the terms in
the title differently for longer and shorter texts.


To get the behavior (I think) you want, you could index your documents 
like this:


document1={field:field1:term1 field1:term1}
document2={field:field1:term1 field2:term1}

and form queries like:

query1=field:field1\:term1
query2=field:(field1\:term1 OR field2\:term1)
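
Spelled out against the Lucene 4.x indexing API, that suggestion might look
roughly like this (a sketch, not a drop-in solution; WhitespaceAnalyzer is
assumed here so that the colon survives tokenization):

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PrefixedTermIndexing {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        // WhitespaceAnalyzer keeps "field1:term1" as a single token, so the
        // original field name travels inside the term text of one shared field.
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_4_9, new WhitespaceAnalyzer(Version.LUCENE_4_9));
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document document1 = new Document();
            document1.add(new TextField("field", "field1:term1 field1:term1", Field.Store.NO));
            writer.addDocument(document1);

            Document document2 = new Document();
            document2.add(new TextField("field", "field1:term1 field2:term1", Field.Store.NO));
            writer.addDocument(document2);
        }
        // Queries can then be built programmatically, e.g.
        // new TermQuery(new Term("field", "field1:term1")), or via the query
        // parser with the colon escaped as in the examples above.
    }
}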

-Mike

On 1/13/15 2:24 PM, danield wrote:

Hi all,

I have found, much to my dismay, that the documentation on Lucene’s default
similarity formula is very dangerously misleading. See it here:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf

In the IR literature, term frequency (TF) counts are expected to be
per-document, and this documentation doesn't say otherwise. However, it
turns out that in Lucene, TF scores are in fact PER-FIELD.

The same applies to the /coord/ component. I realise that /coord/ is the
ratio of matched query terms to total query terms, but I believe an effort
could be made to make it clear that field1:term1 and field2:term1 count as
two different query terms.

As an example, for 2 documents with fields field1 and field2, where
query1=”field1:term1”
query2=”field1:term1 OR field2:term1”

document1={field1:”term1 term1”, field2:””}
document2={field2:”term1”, field2:”term1”}

Coord(query1,document1)= 1/1 = 1
Coord(query2,document1)= 1/2 = 0.5
Coord(query1,document2)= 1/2 = 0.5
Coord(query2,document2)= 2/2 = 1

Now, the TF scores will be normalized by the fieldNorm component, which is
computed from the field length at indexing time and stored in a single byte,
with a significant loss of precision. Together, these things make it
impossible to run Lucene retrieval in such a way that

*similarity(query2,document1) == similarity(query2,document2)*

which is precisely what I need in my use case.

Here are my questions:
1. I think the documentation should be updated to make this clear! Can I do
this myself?
2. Has anyone encountered this problem before? Is there an easy fix?

Cheers,
Daniel






Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-13 Thread danield
Hi all,

I have found, much to my dismay, that the documentation on Lucene’s default
similarity formula is very dangerously misleading. See it here:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf

In the IR literature, term frequency (TF) counts are expected to be
per-document, and this documentation doesn't say otherwise. However, it
turns out that in Lucene, TF scores are in fact PER-FIELD.

The same applies to the /coord/ component. I realise that /coord/ is the
ratio of matched query terms to total query terms, but I believe an effort
could be made to make it clear that field1:term1 and field2:term1 count as
two different query terms.

As an example, for 2 documents with fields field1 and field2, where 
query1=”field1:term1”
query2=”field1:term1 OR field2:term1”

document1={field1:”term1 term1”, field2:””}
document2={field2:”term1”, field2:”term1”}

Coord(query1,document1)= 1/1 = 1
Coord(query2,document1)= 1/2 = 0.5
Coord(query1,document2)= 1/2 = 0.5
Coord(query2,document2)= 2/2 = 1

Now, the TF scores will be normalized by the fieldNorm component, which is
computed from the field length at indexing time and stored in a single byte,
with a significant loss of precision. Together, these things make it
impossible to run Lucene retrieval in such a way that

*similarity(query2,document1) == similarity(query2,document2)*

which is precisely what I need in my use case.
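
As an aside on the fieldNorm part: norms can be omitted per field at index
time, which removes the one-byte length norm from the score, though it does
nothing about per-field tf or coord (a sketch, assuming the Lucene 4.x
FieldType API; names are illustrative):

import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

public class NoNormsField {
    // A tokenized, indexed field type with norms disabled, so no per-field
    // length norm (and no one-byte encoding of it) enters the score.
    private static final FieldType NO_NORMS = new FieldType(TextField.TYPE_NOT_STORED);
    static {
        NO_NORMS.setOmitNorms(true);
        NO_NORMS.freeze();
    }

    public static Field sentenceField(String fieldName, String text) {
        return new Field(fieldName, text, NO_NORMS);
    }
}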

Here are my questions:
1. I think the documentation should be updated to make this clear! Can I do
this myself?
2. Has anyone encountered this problem before? Is there an easy fix?

Cheers,
Daniel






Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

2015-01-13 Thread danield
Corrections to the example in my original post:

document2={field1:”term1”, field2:”term1”}
Coord(query1,document2)= 1/1 = 1

(These don't affect the problem or the observation.)


