Thank you very much Nick for your response.

I would like to ask two more questions:
1- Are the tf/idf scores consistent across all segments in a non-optimized 
index? Or are they calculated separately for each segment (tf would not 
change, but idf might differ)?
2- (Same question, but for multiple indexes and PolySearcher.) If I use 
PolySearcher with 2 or more indexes, will the tf/idf scores be consistent? Or 
would they be calculated separately for each index?
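
To make the concern in question 1 concrete, here is a minimal sketch of why 
idf could differ if it were computed per segment. It assumes the idf formula 
from the Lucene 3.6 Similarity documentation, 1 + ln(numDocs / (docFreq + 1)). 
The segment names and counts are made up, and this does not claim to show 
what Lucy actually does:

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as documented for Lucene 3.6: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical segments: (docs containing the term, total docs in the segment).
segments = {"seg_1": (5, 1000), "seg_2": (40, 100)}

# idf if each segment only looked at its own statistics.
per_segment = {name: classic_idf(df, n) for name, (df, n) in segments.items()}

# idf if the statistics were pooled across the whole index.
total_df = sum(df for df, _ in segments.values())
total_docs = sum(n for _, n in segments.values())
global_idf = classic_idf(total_df, total_docs)

for name, value in per_segment.items():
    print(f"{name}: idf = {value:.3f}")
print(f"pooled: idf = {global_idf:.3f}")
```

With these numbers the term looks rare in seg_1 and common in seg_2, so the 
same term would get different weights depending on where a document lives.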

Regards,
Serkan

On 2017-11-21 01:49, Nick Wellnhofer <[email protected]> wrote: 
> 
> On Nov 21, 2017, at 02:09 , [email protected] wrote:
> > I have a question regarding the scoring mechanism for relevancy. Is the 
> > scoring mechanism tf/idf when the field is indexed with the EasyAnalyzer 
> > in the schema? What happens when multiple terms are used? Are the tf/idf 
> > values summed?
> 
> Lucy uses Lucene's Practical Scoring Function by default:
> 
> https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html
> 
> Essentially, tf/idf values are summed after being multiplied with each term's 
> boost and normalization factor.
> 
> > How does it incorporate the location of the words into the scoring 
> > mechanism for queries with multiple words?
> 
> > How about fields that use RegexTokenizer? Is it still the same 
> > mechanism? Does the type of the tokenizer affect the scoring?  I believe 
> > the important thing is the generated tokens (not the tokenizer itself), 
> > and maybe the order of the tokens in a document.
> 
> If you use the core Tokenizers, the type of Tokenizer or the location of 
> terms in a document don’t affect scoring. But you can write a custom 
> Tokenizer that sets different boost values for each Token, for example 
> depending on the location within the document.
> 
> > One more thing: if I wanted to change the scoring mechanism for different 
> > fields, how could I do it? Are there any predefined mechanisms, e.g. 
> > tf/idf, doc2vec, etc.? Or if I want to go further and come up with my 
> > own, how can I do it?
> 
> You can tweak the scoring formula by supplying your own Similarity subclass 
> for each FieldType, possibly in conjunction with your own 
> Query/Compiler/Matcher subclasses:
> 
> https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html
> 
> The public documentation for Similarity is incomplete, unfortunately. But the 
> class is similar to Lucene’s. The .cfh file contains more details:
> 
> https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD
> 
> You’d typically override methods TF, IDF, Coord, Length_Norm, or Query_Norm.
> 
> Nick
> 
> 
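
For reference, the summation described in the quoted reply can be sketched as 
follows. This follows the default formulas given in the linked Lucene 3.6 
Similarity documentation (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), 
lengthNorm = 1 / sqrt(numTerms)). The function and parameter names are 
illustrative only and are not Lucy's API. Lucy's own implementation may differ 
in details such as how norms are encoded:

```python
import math


def tf(freq):
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    return 1.0 / math.sqrt(num_terms)


def score(query_terms, doc_term_freqs, doc_freqs, num_docs, boosts=None):
    """Score one document for a multi-term query (single field, sketch only)."""
    boosts = boosts or {}
    matched = [t for t in query_terms if doc_term_freqs.get(t, 0) > 0]
    if not matched:
        return 0.0

    # Query-side weight of every query term: idf * boost.
    weights = {
        t: idf(doc_freqs.get(t, 0), num_docs) * boosts.get(t, 1.0)
        for t in query_terms
    }
    query_norm = 1.0 / math.sqrt(sum(w * w for w in weights.values()))
    coord = len(matched) / len(query_terms)
    norm = length_norm(sum(doc_term_freqs.values()))

    # Sum over matching terms: tf * idf^2 * boost * norm.
    total = sum(
        tf(doc_term_freqs[t])
        * idf(doc_freqs.get(t, 0), num_docs) ** 2
        * boosts.get(t, 1.0)
        * norm
        for t in matched
    )
    return coord * query_norm * total


# Made-up statistics for a two-term query against two documents of equal length.
doc_freqs = {"apache": 50, "lucy": 5}
doc_a = {"apache": 2, "lucy": 1, "search": 1}
doc_b = {"apache": 2, "index": 2}
query = ["apache", "lucy"]
score_a = score(query, doc_a, doc_freqs, num_docs=1000)
score_b = score(query, doc_b, doc_freqs, num_docs=1000)
print(score_a, score_b)
```

A custom Similarity subclass would correspond to swapping out the tf, idf, or 
length_norm functions in a sketch like this one.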
