Re: Why it's boosted up?
Thanks! That makes sense :)

- Original Message -
From: "Ahmet Arslan"
Sent: Tuesday, August 24, 2010 4:30 PM
Subject: Re: Why it's boosted up?
Re: Why it's boosted up?
Thanks for your clear explanation! I got it :)

- Original Message -
From: "MitchK"
Sent: Tuesday, August 24, 2010 3:37 PM
Subject: Re: Why it's boosted up?
Re: Why it's boosted up?
> Then why short fields are boost up?

In other words, longer documents are penalized, because they are likely to contain many terms. Without this mechanism, longer documents would take over and usually show up on the first page of results.
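To make the effect concrete, here is a minimal sketch of how a long document can outrank a short one when no length normalization is applied. It is not Lucene's actual scoring formula: it assumes a simplified score (the sum of sqrt(term frequency) over the query terms), optionally multiplied by a 1/sqrt(number of tokens) length norm, which is the shape used by Lucene's DefaultSimilarity. The documents and the `score` function are made up for illustration.

```python
import math


def score(doc_tokens, query_terms, use_length_norm):
    """Simplified score: sum of sqrt(tf) per query term, optionally length-normalized."""
    raw = sum(math.sqrt(doc_tokens.count(term)) for term in query_terms)
    if use_length_norm:
        # Assumed norm shape: 1 / sqrt(number of tokens in the field).
        return raw / math.sqrt(len(doc_tokens))
    return raw


query = ["lucene", "boost"]

# Hypothetical documents: a short one and a long one that mentions "lucene" twice.
short_doc = ["lucene", "boost"] + ["filler"] * 8             # 10 tokens
long_doc = ["lucene", "lucene", "boost"] + ["filler"] * 197  # 200 tokens

for use_norm in (False, True):
    scores = {
        "short_doc": score(short_doc, query, use_norm),
        "long_doc": score(long_doc, query, use_norm),
    }
    winner = max(scores, key=scores.get)
    print(f"length norm={use_norm}: {scores} -> top hit: {winner}")
```

Without the norm, the long document wins simply because it has room for more occurrences of the query terms. With the norm, the short document comes first.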
Re: Why it's boosted up?
Hi Scott,

> (so shorter fields are automatically boosted up). "

The theory behind that is the following (in simple terms): Let's say you have two documents, each containing only one field (as in my example). Additionally, we have a query that contains two words. Let's say doc1 contains 10 words and doc2 contains 20 words. The query matches both docs with both words.

The idea of boosting shorter fields more strongly than longer fields is the following:
In doc1, 2/10 = 0.2 => 20% of the words match your query.
In doc2, 2/20 = 0.1 => 10% of the words match your query.
So doc1 should get a better score, because its ratio of matching words to total words is greater than doc2's.

This is the idea of using norms as an index-time boosting factor. NOTE: This does not mean that doc1 gets boosted by 20% and doc2 by 10%! It only illustrates the idea behind such norms.

From the Similarity class's documentation of lengthNorm():

> Matches in longer fields are less precise, so implementations of this
> method usually return smaller values when numTokens is large, and larger
> values when numTokens is small.

However, as a search application developer, it is your task to decide whether this theory applies to your application or not. In some cases using norms makes no sense, in others it does. If you think that norms apply to your project, omitting them is not a good way to save disk space.
Furthermore: If you think the theory does apply to the business needs of your application but its impact is currently too heavy, you can have a look at SweetSpotSimilarity in Lucene.

> The request is from our business team, they wish user of our product can
> type in partial string of a word that exists in title or body field.

You mean something like typing "note" and also getting results like "notebook"? The correct approach for something like that is not ShingleFilter but NGrams or edge NGrams.

Shingles do something like this:
"This is my shingle sentence" -> "This is", "is my", "my shingle", "shingle sentence"
That is, the sentence is broken up into smaller pieces. The benefit of doing so is that, if a query matches one of these shingles, you have found a short phrase without using the performance-consuming PhraseQuery feature.

Kind regards,
- Mitch

scott chu wrote:
>
> In Lucene's web page, there's a paragraph:
>
> "Indexing time boosts are preprocessed for storage efficiency and written
> to the directory (when writing the document) in a single byte (!) as follows:
> For each field of a document, all boosts of that field (i.e. all boosts
> under the same field name in that doc) are multiplied. The result is
> multiplied by the boost of the document, and also multiplied by a "field
> length norm" value that represents the length of that field in that doc
> (so shorter fields are automatically boosted up). "
>
> I though the greater the value, the boosting is upper. Then why short
> fields are boost up? Isn't Norm value for short fields smaller?
>

--
View this message in context: http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1306419.html
Sent from the Solr - User mailing list archive at Nabble.com.
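The difference between shingles and edge n-grams described in the message above can be sketched in a few lines of plain Python. This is only an illustration of the two ideas, not the Lucene/Solr filters themselves: the function names and parameters are hypothetical, tokenization is a plain whitespace split, and the real filters have more options (for example, whether single words are emitted alongside the shingles).

```python
def shingles(text, size=2):
    """Word-level shingles: every run of `size` adjacent tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def edge_ngrams(word, min_gram=1, max_gram=None):
    """Character-level prefixes of a single word, from min_gram to max_gram characters."""
    max_gram = len(word) if max_gram is None else min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, max_gram + 1)]


print(shingles("This is my shingle sentence"))
# ['This is', 'is my', 'my shingle', 'shingle sentence']

print(edge_ngrams("notebook", min_gram=2))
# ['no', 'not', 'note', 'noteb', 'notebo', 'noteboo', 'notebook']

# A partial word such as "note" is found among the edge n-grams of "notebook",
# but never among word-level shingles.
print("note" in edge_ngrams("notebook"))  # True
print("note" in shingles("my notebook"))  # False
```

Shingles work on whole words, so they help with short phrases. Edge n-grams work on the characters of one word, which is what partial-word matching needs.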
Why it's boosted up?
On Lucene's web page, there's a paragraph:

"Indexing time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte (!) as follows: For each field of a document, all boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied. The result is multiplied by the boost of the document, and also multiplied by a "field length norm" value that represents the length of that field in that doc (so shorter fields are automatically boosted up). "

I thought that the greater the value, the higher the boost. Then why are short fields boosted up? Isn't the norm value for short fields smaller?
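The product described in the quoted paragraph can be sketched as follows. This is a rough illustration under stated assumptions: it uses the 1/sqrt(number of terms) length norm of Lucene's DefaultSimilarity, it ignores the lossy single-byte encoding mentioned in the quote, and the function and parameter names are made up for this example.

```python
import math


def length_norm(num_terms):
    """Assumed length norm: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


def field_norm(doc_boost, field_boosts, num_terms):
    """Document boost * all boosts under the same field name * length norm."""
    product = doc_boost
    for boost in field_boosts:
        product *= boost
    return product * length_norm(num_terms)


# Same boosts, different field lengths: the shorter field gets the larger value.
print(round(field_norm(1.0, [1.0], 10), 3))  # 0.316
print(round(field_norm(1.0, [1.0], 20), 3))  # 0.224
```

Under this assumption the norm value for a short field is larger, not smaller, because the field length sits in the denominator. That is why shorter fields end up "boosted up".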