[
https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512959#comment-14512959
]
Liang-Chi Hsieh commented on SPARK-7143:
I found some references and articles comparing BM25 and TF-IDF.
Two articles discuss BM25 and TF-IDF in Lucene and Elasticsearch:
[BM-vs-Lucene-default-similarity|https://www.found.no/foundation/BM-vs-Lucene-default-similarity/],
[Similarity in Elasticsearch|https://www.found.no/foundation/similarity/].
The first shows that in their experiments BM25 performs better than Lucene's
default similarity (TF-IDF based). However, it also notes that this is not a
general proof that BM25 is always better than the default similarity, only a
suggestion to prefer BM25 over the default similarity.
The second one explains the difference between BM25 and TF-IDF. In short, BM25
should perform better because of its term-frequency saturation function and
because it compensates for document length.
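The two effects mentioned above can be illustrated with a small sketch. It
assumes the Wikipedia form of the BM25 term-frequency component with the
commonly cited defaults k1 = 1.2 and b = 0.75; the function names are
hypothetical, and sqrt(tf) is used only as a stand-in for a TF-IDF style
term-frequency component, not as a statement of what any library does.

```python
import math


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component (Wikipedia form, with the (k1 + 1) factor).

    k1 and b are assumed defaults, not values taken from the comment.
    """
    # Length normalization: documents longer than average get a larger
    # denominator, so the same tf contributes less.
    norm = 1.0 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def sqrt_tf(tf):
    """Illustrative unbounded term-frequency component (assumed stand-in)."""
    return math.sqrt(tf)


if __name__ == "__main__":
    # Saturation: the BM25 component approaches k1 + 1 as tf grows,
    # while the sqrt component keeps growing.
    for tf in (1, 2, 5, 20, 100):
        print(tf, round(bm25_tf(tf, 100, 100), 3), round(sqrt_tf(tf), 3))

    # Length compensation: same tf, document twice the average length.
    print(bm25_tf(3, 100, 100), bm25_tf(3, 200, 100))
```

With these assumptions the BM25 component never exceeds k1 + 1, and the same
term frequency counts for less in a longer-than-average document.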
For recent academic papers, I found two papers [1][2] dealing with the document
ranking problem. Although they are not intended to compare BM25 and TF-IDF
(neither of which is a new method), we can indirectly observe the performance
difference between the two methods. In these experiments, BM25 clearly performs
better than TF-IDF.
This BM25 implementation is based on the formula described in the Wikipedia
page. Compared with the formula found in [3][4], it additionally multiplies
by the constant (k1 + 1), which normalizes the weight of terms with tf equal
to 1.
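For reference, the two forms can be written out as below. The notation
(f for term frequency, |D| for document length, avgdl for average document
length) follows the Wikipedia page and is not taken from the implementation;
the second line is the variant without the constant, as described above for
[3][4].

```latex
% Wikipedia form, with the (k_1 + 1) factor
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

% Variant without the (k_1 + 1) factor
\mathrm{score}'(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

% With f(q_i, D) = 1 and |D| = \mathrm{avgdl}:
%   first form:  (k_1 + 1) / (1 + k_1) = 1
%   second form: 1 / (1 + k_1)
```

Because the factor is the same constant for every term, it rescales each
per-term weight uniformly: for a document of average length, a term with
tf = 1 gets a term-frequency weight of exactly 1 in the first form.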
[1] Jiaul H. Paik, "A Novel TF-IDF Weighting Scheme for Effective Ranking,"
SIGIR'13.
[2] Lei Zheng, Ingemar J. Cox, "Re-ranking Documents Based on Query-Independent
Document Specificity," FQAS'09.
[3] Joaquín Pérez-Iglesias et al., "Integrating the Probabilistic Model
BM25/BM25F into Lucene."
[4] Stephen Robertson and Hugo Zaragoza, "The Probabilistic Relevance
Framework: BM25 and Beyond."
> Add BM25 Estimator
> --
>
> Key: SPARK-7143
> URL: https://issues.apache.org/jira/browse/SPARK-7143
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Liang-Chi Hsieh
>
> [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used
> to rank documents. It is commonly used in IR tasks and can be parallelized.
> This issue proposes adding it to Spark ML.