That depends on what you want. In this case I want the discrimination
power to be based on all the body text, not just the titles. Otherwise,
terms that are really not that relevant end up with very high IDF
values!
On 17/11/16 at 18:25, Ahmet Arslan wrote:
Hi Nicholas,
IDF, among others, is a measure of term specificity. If 'or' is not so common in
titles, then it has some discrimination power in that domain.
I think it's OK for 'or' to get a high IDF value in this case.
Ahmet
On Thursday, November 17, 2016 9:09 PM, Nicolás Lichtmaier
<nicol...@wolfram.com> wrote:
IDF measures the selectivity of a term. But the calculation is
per-field. That can be bad for very short fields (like titles). One
example of this problem: If I don't delete stop words, then "or", "and",
etc. should get low IDF values. However, "or" is, perhaps, not so
common in titles. Then "or" will have a high IDF value and be treated
as an important term. That's bad.
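
To make the per-field effect concrete, here is a minimal sketch in plain Python (not Lucene code). The toy corpus, the helper names and the smoothed formula 1 + ln((N + 1) / (df + 1)) are all assumptions made for illustration; the formula Lucene actually applies depends on the Similarity in use.

```python
import math

def idf(doc_count, doc_freq):
    # Assumed smoothed IDF for illustration: 1 + ln((N + 1) / (df + 1)).
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))

# Hypothetical toy corpus: short titles, longer bodies.
docs = [
    {"title": "lucene scoring basics",
     "body": "scoring in lucene uses tf or idf and norms"},
    {"title": "lucene or solr",
     "body": "pick lucene or solr and index the text"},
    {"title": "tuning lucene similarity",
     "body": "change the similarity or keep the default and measure"},
    {"title": "stop words",
     "body": "words such as or and and carry little meaning"},
]

def doc_freq(field, term):
    # Number of documents whose `field` contains `term` (whitespace tokens).
    return sum(1 for d in docs if term in d[field].split())

n = len(docs)
title_idf_or = idf(n, doc_freq("title", "or"))
title_idf_lucene = idf(n, doc_freq("title", "lucene"))
body_idf_or = idf(n, doc_freq("body", "or"))

print("title idf('or')     =", round(title_idf_or, 3))
print("title idf('lucene') =", round(title_idf_lucene, 3))
print("body  idf('or')     =", round(body_idf_or, 3))
```

In this made-up corpus "or" appears in only one title but in every body, so the title field rates it as more selective than "lucene", while the body field gives it the lowest possible value.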
One solution I see is to modify the Similarity to have a global, or
multi-field IDF value. This value would include in its calculation
longer fields that have more "normal text"-like stats. However this is
not trivial because I can't just add document-frequencies (I would be
counting some documents several times if "or" is present in more than
one field). I would need to OR the bit-vectors that signal the
presence of the term, right? Not trivial.
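
The double-counting concern can be sketched as follows, again in plain Python rather than with any Lucene API. Each field's postings for a term are modelled as an integer used as a bit-vector (bit i set means document i contains the term); the corpus and the function names are hypothetical.

```python
def postings_bits(docs, field, term):
    # Bit i is set when document i contains `term` in `field`.
    bits = 0
    for i, doc in enumerate(docs):
        if term in doc.get(field, "").split():
            bits |= 1 << i
    return bits

def summed_doc_freq(docs, fields, term):
    # Naive approach: add the per-field document frequencies.
    return sum(bin(postings_bits(docs, f, term)).count("1") for f in fields)

def union_doc_freq(docs, fields, term):
    # OR the per-field bit-vectors, then count each document once.
    combined = 0
    for f in fields:
        combined |= postings_bits(docs, f, term)
    return bin(combined).count("1")

# Hypothetical toy corpus.
docs = [
    {"title": "lucene or solr",
     "body": "pick lucene or solr and index the text"},   # "or" in both fields
    {"title": "stop words",
     "body": "words such as or and and carry little meaning"},  # body only
    {"title": "scoring basics",
     "body": "term frequency matters"},                   # "or" in neither
]

fields = ["title", "body"]
print("summed df:", summed_doc_freq(docs, fields, "or"))
print("union df: ", union_doc_freq(docs, fields, "or"))
```

The summed value counts the first document twice, while the union counts it once. A real index would have to compute that union per term from the postings of every field involved, which is where the cost lies.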
Has anyone encountered this issue? Has it been solved? Is my thinking wrong?
Should I also try the developers' list?
Thanks!
Nicolás.-
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org