[ 
https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512959#comment-14512959
 ] 

Liang-Chi Hsieh commented on SPARK-7143:
----------------------------------------

I found some references and articles comparing BM25 and TF-IDF.

Two articles discuss BM25 and TF-IDF in Lucene and Elasticsearch: 
[BM-vs-Lucene-default-similarity|https://www.found.no/foundation/BM-vs-Lucene-default-similarity/],
 [Similarity in Elasticsearch|https://www.found.no/foundation/similarity/].

The first shows that in their experiments BM25 performs better than Lucene's 
default similarity (TF-IDF based). However, it also notes that this is not a 
general proof that BM25 is always better than the default similarity, only a 
suggestion to prefer BM25 over the default similarity.

The second one explains the difference between BM25 and TF-IDF. In short, BM25 
should perform better because of its term-frequency saturation function and its 
compensation for document length.
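
As a rough illustration of those two effects, the sketch below computes only 
the term-frequency part of the BM25 weight in the form given on the Wikipedia 
page, leaving out the IDF factor. The function name bm25_tf and the defaults 
k1 = 1.2 and b = 0.75 are assumptions chosen for illustration; they are not 
taken from the articles or from any Spark API.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency component of BM25 (Wikipedia form), without IDF.

    k1 and b are assumed illustrative defaults, not values from the source.
    """
    # Length normalization: documents longer than average get a larger
    # denominator, which lowers the weight for the same raw tf.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the result approaches (k1 + 1) as tf grows, rather than
    # growing without bound as a raw term frequency would.
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# Saturation: each additional occurrence adds less than the previous one.
weights = [bm25_tf(tf, doc_len=100, avg_doc_len=100) for tf in (1, 2, 5, 50)]

# Length compensation: same tf, but the longer document gets a lower weight.
short_doc = bm25_tf(3, doc_len=50, avg_doc_len=100)
long_doc = bm25_tf(3, doc_len=200, avg_doc_len=100)
```

Here weights increases with tf but stays below k1 + 1, and short_doc is larger 
than long_doc even though both documents contain the term three times.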

For recent academic papers, I found two papers [1][2] dealing with the document 
ranking problem. Although they are not intended to compare BM25 and TF-IDF 
(which are not new methods), we can indirectly observe the performance 
difference between the two methods. In these experiments, BM25 clearly performs 
better than TF-IDF.

This BM25 implementation is based on the formula described in the Wikipedia 
page. Compared with the formula found in [3][4], it additionally multiplies by 
the constant (k1 + 1) to normalize the weight of terms with tf equal to 1.
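
A minimal sketch of what that constant does, under some assumptions: the 
function names, the length_norm parameter (standing in for the document-length 
term 1 - b + b * |D| / avgdl), and the value k1 = 1.2 are all hypothetical and 
only serve the illustration; this is not the code in the patch.

```python
K1 = 1.2  # assumed value for illustration; not specified in this comment


def tf_part_with_factor(tf, length_norm=1.0, k1=K1):
    # Wikipedia form: the numerator carries the extra (k1 + 1) constant.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


def tf_part_without_factor(tf, length_norm=1.0, k1=K1):
    # Form without the constant, as described for [3][4] above.
    return tf / (tf + k1 * length_norm)


# With length_norm = 1 (a document of average length), tf = 1 maps to 1.
unit_weight = tf_part_with_factor(1)

# The two forms differ only by the constant (k1 + 1) for every tf.
ratios = [tf_part_with_factor(tf) / tf_part_without_factor(tf)
          for tf in (1, 3, 10)]
```

Because the two forms differ only by a constant multiplier shared by every 
term, the factor rescales the scores and should not change the relative order 
of documents for a fixed k1.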

[1] Jiaul H. Paik, "A Novel TF-IDF Weighting Scheme for Effective Ranking," 
SIGIR'13.
[2] Lei Zheng, Ingemar J. Cox, "Re-ranking Documents Based on Query-Independent 
Document Specificity," FQAS'09.
[3] Joaquín Pérez-Iglesias et al., "Integrating the Probabilistic Model 
BM25/BM25F into Lucene."
[4] Stephen Robertson and Hugo Zaragoza, "The Probabilistic Relevance 
Framework: BM25 and Beyond."

> Add BM25 Estimator
> ------------------
>
>                 Key: SPARK-7143
>                 URL: https://issues.apache.org/jira/browse/SPARK-7143
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Liang-Chi Hsieh
>
> [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used 
> to rank documents. It is commonly used in IR tasks and can be parallelized. 
> This issue proposes adding it to Spark ML.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
