Multi-field IDF

Nicolás Lichtmaier Thu, 17 Nov 2016 10:10:06 -0800

IDF measures the selectivity of a term, but it is calculated per field. That can be a problem for very short fields (like titles). One example: if I don't remove stop words, then "or", "and", etc. should end up with low IDF values. However, "or" is perhaps not so common in titles, so in that field it will get a high IDF value and be treated as an important term. That's bad.
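
To make the effect concrete, here is a minimal sketch using a textbook IDF of the form 1 + ln(N / (df + 1)). The formula and every count in it are assumptions chosen for illustration; the actual value depends on the Similarity in use, and the class name is hypothetical.

```java
// Illustrative only: the counts below are made up, not measured from any index.
class PerFieldIdfDemo {
    // Textbook IDF variant, assumed here; the Similarity in use may differ.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1_000_000;
        long dfOrInBody = 900_000;  // hypothetical: "or" appears in most bodies
        long dfOrInTitle = 5_000;   // hypothetical: "or" is rare in titles
        System.out.printf("body idf:  %.2f%n", idf(dfOrInBody, numDocs));
        System.out.printf("title idf: %.2f%n", idf(dfOrInTitle, numDocs));
    }
}
```

With numbers like these, the same stop word scores far higher in the title field than in the body, which is the distortion I am worried about.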

One solution I see is to modify the Similarity to use a global, or multi-field, IDF value. Its calculation would include longer fields that have more "normal text"-like statistics. However, this is not trivial, because I can't just add the document frequencies: I would be counting some documents several times if "or" is present in more than one field. I would need to OR the bit-vectors that signal the presence of the term, right? Not trivial.
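
The double-counting problem can be sketched with plain java.util.BitSet. This is not Lucene API: the class, the helper methods and the document ids are all hypothetical, and it assumes one bitset per field marking the documents that contain the term is already available. Building those bitsets from a real index is the expensive part.

```java
import java.util.BitSet;

// Illustrative only: uses java.util.BitSet, not any Lucene class.
class UnionDocFreq {
    // OR the per-field bitsets so that each document is counted once.
    static int unionDocFreq(BitSet... perFieldDocs) {
        BitSet union = new BitSet();
        for (BitSet docs : perFieldDocs) {
            union.or(docs);
        }
        return union.cardinality();
    }

    // Naive alternative: summing per-field document frequencies.
    static int summedDocFreq(BitSet... perFieldDocs) {
        int sum = 0;
        for (BitSet docs : perFieldDocs) {
            sum += docs.cardinality();
        }
        return sum;
    }

    // Hypothetical helper: build a bitset from a list of document ids.
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet title = docs(1, 4);       // made-up docs with "or" in the title
        BitSet body = docs(1, 2, 3, 4);  // made-up docs with "or" in the body
        System.out.println("summed: " + summedDocFreq(title, body));
        System.out.println("union:  " + unionDocFreq(title, body));
    }
}
```

Documents 1 and 4 contain the term in both fields, so the summed count overstates the number of distinct documents, while the union does not.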


Has anyone encountered this issue? Has it been solved? Is my thinking wrong?

Should I also try the developers' list?

Thanks!

Nicolás.-

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
