Re: Why it's boosted up?

2010-08-24 Thread 朱炎詹

Thanks! That makes sense :)

- Original Message - 
From: "Ahmet Arslan" 

To: 
Sent: Tuesday, August 24, 2010 4:30 PM
Subject: Re: Why it's boosted up?








Re: Why it's boosted up?

2010-08-24 Thread 朱炎詹

Thanks for your clear explanation! I got it :)
- Original Message - 
From: "MitchK" 

To: 
Sent: Tuesday, August 24, 2010 3:37 PM
Subject: Re: Why it's boosted up?









Re: Why it's boosted up?

2010-08-24 Thread Ahmet Arslan
> Then why are short fields boosted up?

In other words, longer documents are penalized, because they are likely to
contain many terms/words. If this mechanism did not exist, longer documents
would take over and would usually dominate the first page of results.


  


Re: Why it's boosted up?

2010-08-24 Thread MitchK

Hi Scott,



> (so  shorter fields are automatically boosted up). " 
> 
The theory behind that is the following (in simple terms):
Let's say you have two documents, and each doc contains one field (like it was
in my example).
Additionally, we have a query that contains two words.
Let's say doc1 contains 10 words and doc2 contains 20 words.
The query matches both docs with both words.
The idea behind boosting shorter fields more strongly than longer fields is
the following:
In doc1, 2/10 = 0.2 => 20% of the words match your query.
In doc2, 2/20 = 0.1 => 10% of the words match your query.

So doc1 should get a better score, because the ratio of matching words to the
total number of occurring words is greater than in doc2.
This is the idea of using norms as an index-time boosting factor. NOTE: This
does not mean that doc1 gets boosted by 20% and doc2 by 10%! It only
illustrates the idea behind such norms.

From the Similarity class's documentation of lengthNorm():



> Matches in longer fields are less precise, so implementations of this
> method usually return smaller values when numTokens is large, and larger
> values when numTokens is small.
> 
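
For reference, in the Lucene versions current at the time of this thread,
DefaultSimilarity implements lengthNorm() as 1/sqrt(numTokens). A small sketch
of the two-document example above (Python used here just for illustration):

```python
import math

def length_norm(num_terms):
    # DefaultSimilarity in older Lucene: 1 / sqrt(number of terms in the field)
    return 1.0 / math.sqrt(num_terms)

# doc1 has 10 words, doc2 has 20 words (the example above)
norm1 = length_norm(10)   # ~0.316
norm2 = length_norm(20)   # ~0.224

# The shorter field gets the larger norm, so doc1 scores higher,
# all else being equal.
assert norm1 > norm2
```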

However, you, as a search-application developer, have the task of deciding
whether this theory applies to your application or not. In some cases using
norms makes no sense, in others it does.
If you think that norms do apply to your project, omitting them is not a good
way to save disk space.
Furthermore: if you think the theory does apply to the business needs of your
application but its impact is currently too heavy, you can have a look at
SweetSpotSimilarity in Lucene.
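
Roughly, SweetSpotSimilarity keeps the length norm flat inside a configured
range of lengths and only degrades it outside that range. A sketch based on
its documented computeLengthNorm() formula (the parameter values here are
illustrative, not defaults):

```python
import math

def sweet_spot_length_norm(num_terms, min_terms=3, max_terms=10, steepness=0.5):
    # Inside [min_terms, max_terms] the expression in parentheses is 0,
    # so the norm is a flat 1.0; outside, it falls off gradually.
    l = num_terms
    return 1.0 / math.sqrt(
        steepness * (abs(l - min_terms) + abs(l - max_terms)
                     - (max_terms - min_terms)) + 1.0)

assert sweet_spot_length_norm(5) == 1.0    # inside the sweet spot: no penalty
assert sweet_spot_length_norm(50) < 1.0    # very long fields still penalized
```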



> The request is from our business team; they wish users of our product could 
> type in a partial string of a word that exists in the title or body field.
> 
You mean something like typing "note" and also getting results like
"notebook"?
The correct approach for something like that is not the ShingleFilter but
n-grams or edge n-grams.
Shingles do something like this:
"This is my shingle sentence" -> "This is, is my, my shingle, shingle
sentence" -> it breaks the sentence up into smaller pieces. The benefit of
doing so is that, if a query matches one of these shingles, you have found
a short phrase without using the expensive PhraseQuery feature.
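
To make the difference concrete, here is a minimal sketch (not the actual
Lucene filters) of what edge n-grams and word-level shingles produce:

```python
def edge_ngrams(term, min_gram=2, max_gram=8):
    # Front-anchored prefixes, roughly what an edge n-gram filter indexes.
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

def shingles(text, size=2):
    # Word-level shingles, roughly what a shingle filter with size 2 produces.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

grams = edge_ngrams("notebook")
# A user typing the prefix "note" now matches an indexed gram directly.
assert "note" in grams

assert shingles("This is my shingle sentence") == [
    "This is", "is my", "my shingle", "shingle sentence"]
```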

Kind regards,
- Mitch


scott chu wrote:
> 
> In Lucene's web page, there's a paragraph:
> 
> "Indexing time boosts are preprocessed for storage efficiency and written
> to 
> the directory (when writing the document) in a single byte (!) as follows: 
> For each field of a document, all boosts of that field (i.e. all boosts 
> under the same field name in that doc) are multiplied. The result is 
> multiplied by the boost of the document, and also multiplied by a "field 
> length norm" value that represents the length of that field in that doc
> (so 
> shorter fields are automatically boosted up). "
> 
> I thought the greater the value, the stronger the boost. Then why are short
> fields boosted up? Isn't the norm value for short fields smaller?
> 
> 
> 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Doing-Shingle-but-also-keep-special-single-word-tp1241204p1306419.html
Sent from the Solr - User mailing list archive at Nabble.com.


Why it's boosted up?

2010-08-23 Thread 朱炎詹

In Lucene's web page, there's a paragraph:

"Indexing time boosts are preprocessed for storage efficiency and written to 
the directory (when writing the document) in a single byte (!) as follows: 
For each field of a document, all boosts of that field (i.e. all boosts 
under the same field name in that doc) are multiplied. The result is 
multiplied by the boost of the document, and also multiplied by a "field 
length norm" value that represents the length of that field in that doc (so 
shorter fields are automatically boosted up). "


I thought the greater the value, the stronger the boost. Then why are short 
fields boosted up? Isn't the norm value for short fields smaller?
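
A note on the "single byte (!)" packing mentioned in the quoted paragraph: the
stored norm is lossy. In the Lucene versions of that era this was done by
SmallFloat.floatToByte315 (a 3-bit mantissa, 5-bit exponent mini-float); the
following is a rough Python port, for illustration of the precision loss:

```python
import struct

def float_to_byte315(f):
    # Pack a float into one byte: 3-bit mantissa, 5-bit exponent,
    # zero-exponent point of 15 (after Lucene's SmallFloat.floatToByte315).
    bits = struct.unpack('>i', struct.pack('>f', f))[0]
    small = bits >> (24 - 3)
    if small <= ((63 - 15) << 3):
        return 0 if bits <= 0 else 1   # underflow
    if small >= ((63 - 15) << 3) + 0x100:
        return 255                     # overflow
    return small - ((63 - 15) << 3)

def byte315_to_float(b):
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack('>f', struct.pack('>i', bits))[0]

# Only 256 distinct norm values exist, so nearby boosts collapse:
assert byte315_to_float(float_to_byte315(0.5)) == 0.5
assert byte315_to_float(float_to_byte315(0.9)) == 0.875
```

This is why two slightly different index-time boosts can end up
indistinguishable after they are multiplied together and stored.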