[jira] [Commented] (SPARK-7143) Add BM25 Estimator

2015-04-26 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512959#comment-14512959
 ] 

Liang-Chi Hsieh commented on SPARK-7143:


I found some references and articles that compare BM25 and TF-IDF. 

Two articles discuss BM25 and TF-IDF in Lucene and Elasticsearch: 
[BM-vs-Lucene-default-similarity|https://www.found.no/foundation/BM-vs-Lucene-default-similarity/],
 [Similarity in Elasticsearch|https://www.found.no/foundation/similarity/].

The first shows that, in the authors' experiments, BM25 performs better than 
Lucene's default similarity (TF-IDF based). However, it also notes that this is 
not a general proof that BM25 is always better than the default similarity, 
only a suggestion to prefer BM25 over it.

The second one explains the difference between BM25 and TF-IDF. In short, BM25 
should perform better because it saturates the term-frequency contribution and 
compensates for document length.
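
To make these two effects concrete, here is a minimal Python sketch of the BM25 
term-frequency component, assuming the Wikipedia form of the formula. The 
function name bm25_tf and the defaults k1 = 1.2 and b = 0.75 are illustrative 
choices, not taken from the pull request.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 term-frequency component (Wikipedia form, IDF factor omitted).

    k1 and b are illustrative defaults, not values from the Spark patch.
    """
    # Length normalization: > 1 for documents longer than average, < 1 for shorter.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # Saturation: the ratio approaches (k1 + 1) as tf grows,
    # so each repeated occurrence adds less than the previous one.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


if __name__ == "__main__":
    # Saturation: compare the raw count with the bounded BM25 weight.
    for tf in (1, 2, 5, 10, 100):
        print("tf=%d raw=%d bm25=%.3f" % (tf, tf, bm25_tf(tf, 100, 100)))
    # Length compensation: same tf, but a longer document gets a smaller weight.
    for doc_len in (50, 100, 200):
        print("doc_len=%d bm25=%.3f" % (doc_len, bm25_tf(3, doc_len, 100)))
```

A raw term count grows without bound, while this component never exceeds 
k1 + 1, and the same count is worth less in a longer-than-average document.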

For recent academic papers, I found two papers [1][2] dealing with the document 
ranking problem. Although they are not intended to compare BM25 and TF-IDF 
(neither is a new method), we can indirectly observe the performance difference 
between the two methods in the results they report. In these experiments, BM25 
clearly performs better than TF-IDF.

This BM25 implementation is based on the formula described on the Wikipedia 
page. Compared with the formula found in [3][4], it additionally multiplies by 
the constant (k1 + 1) to normalize the weight of terms with tf equal to 1.
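
As a rough illustration of that extra factor, the sketch below evaluates the 
term-frequency component with and without (k1 + 1). The function name 
tf_weight, the scaled flag, and the defaults k1 = 1.2 and b = 0.75 are 
assumptions for this example only, not names from the patch. Note that the 
weight for tf = 1 comes out as exactly 1 only when the document has average 
length.

```python
def tf_weight(tf, doc_len=100.0, avg_doc_len=100.0, k1=1.2, b=0.75, scaled=True):
    """Term-frequency component of BM25, with or without the (k1 + 1) factor.

    scaled=True  -> Wikipedia form: tf * (k1 + 1) / (tf + k1 * norm)
    scaled=False -> form without the constant: tf / (tf + k1 * norm)
    All defaults are illustrative assumptions.
    """
    norm = 1.0 - b + b * doc_len / avg_doc_len
    weight = tf / (tf + k1 * norm)
    return weight * (k1 + 1.0) if scaled else weight


if __name__ == "__main__":
    for tf in (1, 2, 5):
        print("tf=%d unscaled=%.4f scaled=%.4f" % (
            tf, tf_weight(tf, scaled=False), tf_weight(tf, scaled=True)))
```

Because (k1 + 1) is the same for every term and document at a fixed k1, it 
rescales the scores without changing the order of the ranked documents.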

[1] Jiaul H. Paik, "A Novel TF-IDF Weighting Scheme for Effective Ranking," 
SIGIR'13.
[2] Lei Zheng, Ingemar J. Cox, "Re-ranking Documents Based on Query-Independent 
Document Specificity," FQAS'09.
[3] Joaquín Pérez-Iglesias et al., "Integrating the Probabilistic Model 
BM25/BM25F into Lucene."
[4] Stephen Robertson and Hugo Zaragoza, "The Probabilistic Relevance 
Framework: BM25 and Beyond."

> Add BM25 Estimator
> --
>
> Key: SPARK-7143
> URL: https://issues.apache.org/jira/browse/SPARK-7143
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Liang-Chi Hsieh
>
> [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used 
> to rank documents. It is commonly used in IR tasks and can be parallelized. 
> This issue proposes adding it to Spark ML.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7143) Add BM25 Estimator

2015-04-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512765#comment-14512765
 ] 

Joseph K. Bradley commented on SPARK-7143:
--

Do you have some references to recent papers and current use cases in industry, 
especially ones showing BM25 is much better than TF-IDF?  It would be good to 
figure out whether it is clearly better than TF-IDF, or only best in 
specialized cases (in which case it would be better as a Spark package).

Also, can you please comment on which variant you're implementing?  The 
Wikipedia page makes it sound like some corrections are necessary for the basic 
BM25 in order to make it more practical.

Thanks!







[jira] [Commented] (SPARK-7143) Add BM25 Estimator

2015-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512578#comment-14512578
 ] 

Apache Spark commented on SPARK-7143:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5701



