word weights using BM25

2014-09-23 Thread Arian Pasquali
Hi, I was wondering if it would be possible to support BM25 term weighting by extending Mahout's TF-IDF implementation. I was curious to know if anyone here has already tried to do so. If not, what would be your suggestion for such an implementation in Mahout? Arian Pasquali http://about.me/arianpasquali

Re: word weights using BM25

2014-09-23 Thread Ted Dunning
Should be pretty easy. I haven't heard of anyone doing it.

Re: word weights using BM25

2014-09-23 Thread Suneel Marthi
Lucene 4.x supports Okapi BM25, so it should be easy to implement.

Re: word weights using BM25

2014-09-24 Thread Arian Pasquali
Yes, I'm studying his work and Mahout's current TF-IDF code, trying to understand how I would port that to MR. I'll try to share something if I succeed. Arian Pasquali http://about.me/arianpasquali

Re: word weights using BM25

2014-09-24 Thread Marko
Hello everyone, I'm very sorry to butt in like this. I have been added to the mailing list (I think), but it seems that I'm somehow unable to ask a question; that is, I asked a question four times and got no answer. I hope this way will work. I'm new to Mahout and I've been struggling with Stre

Re: word weights using BM25

2014-09-24 Thread Ted Dunning
Marko, Sorry to be non-responsive. There is no good user manual for the streaming k-means software, and there are some known scaling pathologies with that code. I know something about it myself, but currently lack the time to provide detailed support. Can you remind me what your interest is? Is t

Re: word weights using BM25

2014-09-24 Thread Suneel Marthi
@Marko, Subject: Streaming KMeans. See http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means/18090471#18090471 for how to invoke Streaming KMeans. Also look at examples/bin/cluster-reuters.sh for the Streaming KMeans option.

Re: word weights using BM25

2014-09-24 Thread Ted Dunning
Marko, Suneel's answer is much better than mine.

Re: word weights using BM25

2014-10-01 Thread Arian Pasquali
Hey guys, I think it is fair to give you some feedback. I managed to implement BM25+ term scoring in Mahout. It was straightforward using the current TFIDF implementation as an example. Basically, what I did was implement the interface org.apache.mahout.vecto
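For readers unfamiliar with BM25+, the per-term weight mentioned above can be sketched as follows. This is only an illustration of the commonly cited BM25+ formula, not Arian's code and not a Mahout API: the function name `bm25_plus_weight`, its parameter names, and the defaults `k1=1.2`, `b=0.75`, `delta=1.0` are assumptions chosen for the example.

```python
import math


def bm25_plus_weight(tf, df, doc_length, avg_doc_length, num_docs,
                     k1=1.2, b=0.75, delta=1.0):
    """Illustrative BM25+ weight of one term in one document.

    tf             -- occurrences of the term in the document
    df             -- number of documents containing the term
    doc_length     -- length of the document, in terms
    avg_doc_length -- average document length in the collection
    num_docs       -- number of documents in the collection
    k1, b, delta   -- tuning parameters (assumed defaults, not from the thread)
    """
    if tf <= 0 or df <= 0:
        return 0.0  # a term that does not occur gets no weight
    # Inverse document frequency: rarer terms get a larger factor.
    idf = math.log((num_docs + 1) / df)
    # Length normalisation: longer-than-average documents are penalised.
    length_norm = 1.0 - b + b * (doc_length / avg_doc_length)
    # Saturating term-frequency component.
    tf_part = (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
    # BM25+ adds delta so that any occurring term keeps a lower bound.
    return idf * (tf_part + delta)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(bm25_plus_weight(tf=3, df=10, doc_length=120,
                           avg_doc_length=100, num_docs=1000))
```

Unlike plain TF-IDF, the weight depends on the document's length relative to the collection average, so a weighting function of this kind needs the average document length in addition to the term and document counts.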

Re: word weights using BM25

2014-10-01 Thread Suneel Marthi
How did you implement BM25PartialVectorReducer and BM25Converter? The present implementations of TFIDFConverter and Reducer are MR. Mahout is not accepting any new MapReduce code.

Re: word weights using BM25

2014-10-01 Thread Ted Dunning
Thanks so much for the feedback. Glad to hear it was straightforward. But the important question is how did BM25 work for you?

Re: word weights using BM25

2014-10-01 Thread Arian Pasquali
Hi Ted, My dataset is a collection of documents in German, and I can say that the scores seem better compared to my TFIDF scores. Results make more sense now, especially my bi-grams. Arian Pasquali http://about.me/arianpasquali

Re: word weights using BM25

2014-10-01 Thread Arian Pasquali
Yes Suneel, indeed it is in MR fashion. What exactly do you mean when you say Mahout is not accepting any new MapReduce code? Do you mean for submitting a patch? I'm sure there might be better ways to implement it, but I'm more interested in the results right now. What would be your suggestion?

Re: word weights using BM25

2014-10-01 Thread Ted Dunning
On Wed, Oct 1, 2014 at 7:52 AM, Arian Pasquali wrote: > My dataset is a collection of documents in German and I can say that the > scores seem better compared to my TFIDF scores. Results make more sense > now, especially my bi-grams. > OK. I will take note.

Re: word weights using BM25

2014-10-02 Thread Pat Ferrel
We are moving to higher-performance platforms than Hadoop MapReduce, like Spark. You can still write map/reduce-style code, but Mahout is not taking new Hadoop MR code.