Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Update: I have implemented my own subclasses of QueryParser, BooleanQuery, BooleanScorer and Similarity to deal with this. I now get exactly the behaviour I want when calling the .explain() method. However, for some documents the score returned by IndexSearcher.search() differs from the one reported by IndexSearcher.explain(), and I am a bit confused by this.

coord() seems to be one of the things I still need to change, but it is not the only element of the formula that I have clearly changed for the .explain() pipeline but not for .search(). The implementation of BulkScorer remains perplexing to me, and I suspect I have missed something in there. Any pointers?

Thanks!
Daniel

On 15 January 2015 at 23:00, Jack Krupansky wrote:
> File a Jira for this particular doc fix since it is significant and not just mere wordsmithing. [...]

--
View this message in context: http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307p4180529.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
On 1/15/15 11:23 AM, danield wrote:
> [...] This also doesn't change the fact that documentation is wrong! Any ideas how to fix?

In Lucene a Term encodes the field and the term text, so the documentation is not incorrect. In fact this is stated explicitly here:

    "Lucene is field based, hence each query term applies to a single field, document length normalization is by the length of the certain field, and in addition to document boost there are also document fields boosts."

You might consider indexing your sentences as multiple values of a single field. If you need to label them you could possibly use payloads for that.

-Mike
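The single-field suggestion can be illustrated with a toy model. This is a minimal sketch in plain Python, not Lucene code: `index_single_field`, `term_frequency`, `weighted_tf` and the example weights are hypothetical names and values chosen for illustration, and the per-occurrence label only stands in for what a payload might carry. Real payload-aware scoring in Lucene would need its own query and similarity setup.

```python
def index_single_field(sentences):
    """Put every sentence into one field, keeping the class label of the
    sentence next to each token (a stand-in for a per-position payload)."""
    postings = []
    for label, text in sentences:
        for token in text.split():
            postings.append((token, label))
    return postings


def term_frequency(postings, term):
    """Count occurrences of a term across all sentences of the document."""
    return sum(1 for token, _ in postings if token == term)


def weighted_tf(postings, term, weights):
    """Like term_frequency, but each occurrence counts with the weight
    assigned to the class of the sentence it came from (default 1.0)."""
    return sum(weights.get(label, 1.0) for token, label in postings if token == term)


# The two documents of the thread, with the former field names used as labels.
document1 = index_single_field([("field1", "term1 term1")])
document2 = index_single_field([("field1", "term1"), ("field2", "term1")])

# Hypothetical class weights, only to show that labels remain usable.
weights = {"field1": 1.0, "field2": 2.0}

for name, doc in (("document1", document1), ("document2", document2)):
    print(name, len(doc), term_frequency(doc, "term1"), weighted_tf(doc, "term1", weights))
```

In this model both documents have the same length and the same unweighted term frequency for term1, while the labels still allow occurrences to be weighted by sentence class.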
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Oh, thanks Mike, so it did say so somewhere. I guess it wouldn't hurt to make that explanation more prominent, as I clearly missed it. Never mind, I am working on my own solution for this by subclassing QueryParser, BooleanQuery, BooleanScorer, Similarity and a bunch of other classes.

Cheers,
Daniel
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Hi Mike,

Thank you for your reply. Yes, I had thought of this, but it does not solve my problem: the term frequency, and therefore the results, will still be wrong, because prepending or appending a string to a term still makes it a different term. Similarly, I could use regex queries, but again that doesn't fix the TF issue. I am not speaking hypothetically here; I have experimental evidence that this doesn't work (i.e. the precision for my task goes down in my experiments).

Also, I agree that when your fields are essentially different, as in /title/, /author/ and /text/, normalizing by field length makes sense. In my case, however, my fields are many and are all chunks of a larger text (extracted sentences that have been labelled with a number of different classes), and in the experiments I am running I am trying to establish whether weighting sentences in different classes differently will lead to increased relevance of results.

This also doesn't change the fact that the documentation is wrong! Any ideas how to fix it?

Daniel
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
File a Jira for this particular doc fix, since it is significant and not just mere wordsmithing. Better yet, submit a patch, since that's Javadoc. The exact form of the doc fix might be debatable, though, so a general description of the problem should be sufficient, unless you feel motivated.

-- Jack Krupansky

On Thu, Jan 15, 2015 at 11:23 AM, danield wrote:
> [...] This also doesn't change the fact that documentation is wrong! Any ideas how to fix?
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
In practice, normalization by field length proves to be more useful than normalization by the sum of the lengths of all fields (document length), which I think is what you are after. Think of a book chapter document with two fields: title and full text. It makes little sense to weight the terms in the title differently for longer and shorter texts.

To get the behavior (I think) you want, you could index your documents like this:

    document1={field:field1:term1 field1:term1}
    document2={field:field1:term1 field2:term1}

and form queries like:

    query1=field:field1\:term1
    query2=field:(field1\:term1 OR field2\:term1)

-Mike

On 1/13/15 2:24 PM, danield wrote:
> Hi all, I have found, much to my dismay, that the documentation on Lucene's default similarity formula is very dangerously misleading. [...]
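What the prefixed-term scheme changes, and what it leaves unchanged, can be checked with a small toy model. This is a sketch in plain Python rather than Lucene code; `index_prefixed` and `coord` are hypothetical helpers, and `coord` here is simply the share of query terms found in the document.

```python
from collections import Counter


def index_prefixed(document):
    """Flatten {field: text} into the tokens of one catch-all field,
    prefixing every token with the name of the field it came from."""
    return [
        f"{field}:{token}"
        for field, text in document.items()
        for token in text.split()
    ]


def coord(query_terms, tokens):
    """Share of query terms that occur at least once among the tokens."""
    present = set(tokens)
    return sum(term in present for term in query_terms) / len(query_terms)


document1 = index_prefixed({"field1": "term1 term1", "field2": ""})
document2 = index_prefixed({"field1": "term1", "field2": "term1"})
query2 = ["field1:term1", "field2:term1"]

for name, tokens in (("document1", document1), ("document2", document2)):
    print(name, len(tokens), dict(Counter(tokens)), coord(query2, tokens))
```

In this model the single field has the same length in both documents, so a length-based norm would no longer differ. The prefixed tokens are still distinct terms, however, so the per-term counts and the coord value for query2 differ between the two documents.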
Similarity formula documentation is misleading + how to make field-agnostic queries?
Hi all,

I have found, much to my dismay, that the documentation on Lucene's default similarity formula is very dangerously misleading. See it here:

http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf

Term Frequency (TF) counts are expected to be per-document in the IR literature, and this documentation doesn't say any differently. However, it turns out that for Lucene, TF scores are in fact PER-FIELD. This furthermore applies to the /coord/ component. I realise that /coord/ is a ratio of query terms matched over total query terms, but I believe an effort could be made to make clear that field1:term1 and field2:term1 count as 2 different query terms.

As an example, for 2 documents with fields field1 and field2, where

    query1="field1:term1"
    query2="field1:term1 OR field2:term1"

    document1={field1:"term1 term1", field2:""}
    document2={field2:"term1", field2:"term1"}

    Coord(query1,document1)= 1/1 = 1
    Coord(query2,document1)= 1/2 = 0.5
    Coord(query1,document2)= 1/2 = 0.5
    Coord(query2,document2)= 2/2 = 1

Now, the TF scores will be normalized with the fieldNorm component, which is computed based on field length at indexing time and stored in a single byte, with a significant loss of precision.

These things together make it impossible to run Lucene retrieval in such a way that *similarity(query2,document1) == similarity(query2,document2)*, which is precisely what I need in my use case.

Here are my questions:

1. I think the documentation should be updated to make this clear! Can I do this myself?
2. Has anyone encountered this problem before? Is there an easy fix?

Cheers,
Daniel
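The precision loss mentioned for fieldNorm can be made concrete with a rough model. This is a sketch under two stated assumptions, not a copy of Lucene's encoder: the length norm is taken to be 1/sqrt(number of terms in the field), and the single-byte storage is approximated by keeping only the sign, the exponent and the two leading mantissa bits of a 32-bit float. The names `length_norm` and `lossy_norm` are hypothetical; the authoritative encoding is in the Lucene source.

```python
import math
import struct


def length_norm(num_terms):
    """Assumed length normalization: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


def lossy_norm(value):
    """Approximate a one-byte norm by truncating a float32 to its sign,
    exponent and two leading mantissa bits (an assumption, see lead-in)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    bits &= ~((1 << 21) - 1)  # clear the 21 low mantissa bits
    (approx,) = struct.unpack(">f", struct.pack(">I", bits))
    return approx


for n in range(1, 11):
    exact = length_norm(n)
    print(n, round(exact, 4), lossy_norm(exact))
```

Under these assumptions, fields of 3 and 4 terms end up with the same stored norm, so small differences in field length can disappear while larger ones are exaggerated.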
Re: Similarity formula documentation is misleading + how to make field-agnostic queries?
Corrections:

    document2={field1:"term1", field2:"term1"}
    Coord(query1,document2)= 1/1 = 1

(This doesn't affect the problem/observation.)
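The corrected coord values can be reproduced with a small sketch. It is plain Python rather than Lucene code and assumes, as described in the thread, that each (field, term) pair of a query counts as a separate clause; the function name `coord` and the data layout are chosen here only for illustration.

```python
from fractions import Fraction


def coord(query_clauses, document):
    """Ratio of (field, term) query clauses matched by the document
    to the total number of clauses in the query."""
    matched = sum(
        1
        for field, term in query_clauses
        if term in document.get(field, "").split()
    )
    return Fraction(matched, len(query_clauses))


query1 = [("field1", "term1")]
query2 = [("field1", "term1"), ("field2", "term1")]

document1 = {"field1": "term1 term1", "field2": ""}
document2 = {"field1": "term1", "field2": "term1"}  # corrected document2

for qname, query in (("query1", query1), ("query2", query2)):
    for dname, doc in (("document1", document1), ("document2", document2)):
        print(f"Coord({qname},{dname}) = {coord(query, doc)}")
```

Only Coord(query2,document1) falls below 1 here, because document1 has no match for the field2 clause even though it contains term1 twice.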