Re: Jensen–Shannon divergence

Ahmet Arslan Sun, 13 Dec 2015 10:17:14 -0800

Hi Shay,

I suggest you extend o.a.l.search.similarities.SimilarityBase.
All you need to implement is a score() method. Behind all the fancy names (language 
models, etc.), a similarity is a function of seven salient statistics. It is 
actually six: avgFieldLength can be derived from two of the others 
(numberOfFieldTokens divided by numberOfDocuments).

The seven statistics come from:
Corpus statistics: numberOfDocuments, numberOfFieldTokens, avgFieldLength
Term statistics: totalTermFreq and docFreq
The document being scored: within-document term frequency (freq) and 
document length (docLen)
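
To make the bookkeeping concrete, here is a minimal plain-Java sketch of the 
corpus and term statistics. It is not Lucene code: the class name 
CorpusAndTermStats is made up for illustration, and only the field names follow 
the list above. It shows why avgFieldLength is not an independent statistic.

```java
// Hypothetical holder for the corpus and term statistics (not a Lucene class).
final class CorpusAndTermStats {
    final long numberOfDocuments;    // corpus: documents that have the field
    final long numberOfFieldTokens;  // corpus: total tokens in the field
    final long totalTermFreq;        // term: occurrences across the corpus
    final long docFreq;              // term: documents containing the term

    CorpusAndTermStats(long numberOfDocuments, long numberOfFieldTokens,
                       long totalTermFreq, long docFreq) {
        this.numberOfDocuments = numberOfDocuments;
        this.numberOfFieldTokens = numberOfFieldTokens;
        this.totalTermFreq = totalTermFreq;
        this.docFreq = docFreq;
    }

    // The "seventh" statistic is derived from two of the others.
    double avgFieldLength() {
        return (double) numberOfFieldTokens / numberOfDocuments;
    }
}
```

The two remaining inputs, freq and docLen, belong to the document being scored, 
so they are passed in at scoring time rather than stored with these statistics.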

If you can express your ranking method in terms of these seven variables, you 
are ready to go. For example, my Dirichlet LM implementation is nothing 
but:

return log2(1 + (tf / (c * (termFrequency / numberOfTokens)))) + log2(c / 
(docLength + c));
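
For illustration, the same expression as a standalone plain-Java method. This is 
a sketch under assumptions, not the Lucene class itself: I read tf as freq, 
termFrequency as totalTermFreq, numberOfTokens as numberOfFieldTokens, docLength 
as docLen, and c as the Dirichlet smoothing parameter. The class and method 
names are made up.

```java
// Hypothetical standalone version of the one-line formula (no Lucene dependency).
final class DirichletLmSketch {

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // freq: within-document term frequency; docLen: document length;
    // c: smoothing parameter (assumed to be positive).
    static double score(double freq, double docLen,
                        long totalTermFreq, long numberOfFieldTokens, double c) {
        // Collection language model: relative frequency of the term in the field.
        double collectionProbability = (double) totalTermFreq / numberOfFieldTokens;
        return log2(1.0 + freq / (c * collectionProbability))
             + log2(c / (docLen + c));
    }
}
```

Inside a SimilarityBase subclass the same expression would go in score(), with 
the corpus and term statistics read from the stats object that is passed in. 
The second term is negative whenever docLen > 0, so long documents with few 
matches can end up with a negative sum.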

If you need additional statistics, for example the number of unique terms in a 
document, you need to calculate them yourself and embed them in the index 
(possibly using DocValues). During scoring, you can retrieve them.
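
As a rough sketch of the "calculate it yourself" part, the unique-term count can 
be computed from the analyzed tokens at index time. This is plain Java with a 
made-up class name; it assumes you already have the document's tokens as a list 
of strings. Storing the number (for instance in a numeric DocValues field) and 
reading it back during scoring is left out here.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: counts distinct terms in an already-tokenized document.
final class UniqueTermCounter {
    static long countUniqueTerms(List<String> tokens) {
        Set<String> distinct = new HashSet<>(tokens); // duplicates collapse here
        return distinct.size();
    }
}
```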

Personally, I am curious about your similarity. If possible, please let the 
community know about its effectiveness.

Please also see Robert's write-up:
http://lucidworks.com/blog/2011/09/12/flexible-ranking-in-lucene-4/

Thanks,
Ahmet

On Sunday, December 13, 2015 6:28 PM, will martin <[email protected]> wrote:
Sorry, it was early.

If you go looking on the web, you can find, as I did, reputable work on 
implementing Dirichlet language models. However, at this hour you might get 
answers here. Extrapolating others' work into a Lucene implementation is only 
slightly different from getting answers here, imo.

g'luck

> On Dec 13, 2015, at 10:55 AM, Shay Hummel <[email protected]> wrote:
> 
> Hi
> 
> I am sorry but I didn't understand your answer. Can you please elaborate?
> 
> Shay
> 
> On Sun, Dec 13, 2015 at 3:41 PM will martin <[email protected]> wrote:
> 
>> expand your due diligence beyond wikipedia:
>> i.e.
>> 
>> http://ciir.cs.umass.edu/pubfiles/ir-464.pdf
>> 
>> 
>> 
>>> On Dec 13, 2015, at 8:30 AM, Shay Hummel <[email protected]> wrote:
>>> 
>>> LMDiricletbut its feasibilit
>> 
> -- 
> Regards,
> Shay Hummel

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
