[ https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512959#comment-14512959 ]
Liang-Chi Hsieh commented on SPARK-7143:
----------------------------------------

I found some references and articles comparing BM25 and TF-IDF.

Two articles discuss BM25 and TF-IDF in Lucene and Elasticsearch: [BM-vs-Lucene-default-similarity|https://www.found.no/foundation/BM-vs-Lucene-default-similarity/], [Similarity in Elasticsearch|https://www.found.no/foundation/similarity/].

The first shows that, in the authors' experiments, BM25 performs better than Lucene's default similarity (which is TF-IDF based). However, it also notes that this is not general proof that BM25 is always better than the default similarity, only a suggestion to prefer BM25 over it.

The second explains the difference between BM25 and TF-IDF. In short, BM25 should perform better because of its term-frequency saturation function and its compensation for document length.

Among recent academic papers, I found two [1][2] that deal with the document ranking problem. Although they are not intended to compare BM25 and TF-IDF (neither is a new method), we can indirectly observe the performance difference between the two. In these experiments, BM25 clearly performs better than TF-IDF.

This BM25 implementation is based on the formula described on the Wikipedia page. Compared with the formula in [3][4], it additionally multiplies by the constant (k1 + 1) to normalize the weight of terms with tf equal to 1.

[1] Jiaul H. Paik, "A Novel TF-IDF Weighting Scheme for Effective Ranking," SIGIR'13.
[2] Lei Zheng, Ingemar J. Cox, "Re-ranking Documents Based on Query-Independent Document Specificity," FQAS'09.
[3] Joaquín Pérez-Iglesias et al., "Integrating the Probabilistic Model BM25/BM25F into Lucene."
[4] Stephen Robertson and Hugo Zaragoza, "The Probabilistic Relevance Framework: BM25 and Beyond."
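
To make the (k1 + 1) remark concrete, here is a minimal sketch of the per-term BM25 weight in plain Python. It is not the Spark implementation proposed in this issue: the function name `bm25_term_weight`, the `scale_by_k1_plus_1` flag, and the defaults k1 = 1.2 and b = 0.75 are assumptions chosen for illustration (those defaults are commonly used values, not ones stated in this comment). The idf is passed in as a plain number so the sketch stays independent of any particular idf variant.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, idf,
                     k1=1.2, b=0.75, scale_by_k1_plus_1=True):
    """Per-term BM25 weight (illustrative sketch, hypothetical helper).

    tf          -- term frequency in the document
    doc_len     -- length of the document (in terms)
    avg_doc_len -- average document length in the collection
    idf         -- precomputed inverse document frequency of the term
    k1, b       -- assumed defaults; tune for a real collection
    """
    # Document-length compensation: longer-than-average documents get a
    # larger denominator, hence a smaller weight for the same tf.
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)

    # Saturation: tf / (tf + length_norm) grows with tf but stays below 1.
    weight = idf * tf / (tf + length_norm)

    # Optional (k1 + 1) factor, as in the Wikipedia form of the formula.
    if scale_by_k1_plus_1:
        weight *= (k1 + 1.0)
    return weight


if __name__ == "__main__":
    # tf = 1 in an average-length document: with the factor the weight is
    # the idf itself; without it the weight is idf / (k1 + 1).
    print(bm25_term_weight(1, 100, 100, idf=1.0))
    print(bm25_term_weight(1, 100, 100, idf=1.0, scale_by_k1_plus_1=False))

    # Saturation: the weight rises with tf but never reaches (k1 + 1) * idf.
    for tf in (1, 2, 10, 100):
        print(tf, bm25_term_weight(tf, 100, 100, idf=1.0))
```

For an average-length document, the factor makes a single occurrence of a term weigh exactly its idf, which is the normalization mentioned above. Because the factor is the same constant for every term, it rescales scores without changing the ranking.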

> Add BM25 Estimator
> ------------------
>
>                 Key: SPARK-7143
>                 URL: https://issues.apache.org/jira/browse/SPARK-7143
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Liang-Chi Hsieh
>
> [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used
> to rank documents. It is commonly used in IR tasks and can be parallel. This
> issue is proposed to add it into Spark ML.