error when evaluating recommender w/boolean prefs

2012-07-06 Thread Matt Mitchell
Hi, I have a recommender, with a boolean prefs model. I am following the instructions in the MIA book, but only get this exception: Illegal precision: NaN [Thrown class java.lang.IllegalArgumentException] Restarts: 0: [QUIT] Quit to the SLIME top level Backtrace: 0: com.google.common.base.

Re: Apache Mahout integration with Apache HIve

2012-07-06 Thread Robin Morris
Some Mahout algorithms use map-reduce; others (e.g. logistic regression) do not. If your data is in Hive, you could look into shoehorning the Mahout algorithm into a UDAF. This is what I'll be looking into in the next couple of weeks, so if it's of potential interest, ping me in a few weeks an

Re: A bunch of SVD questions...

2012-07-06 Thread Dmitriy Lyubimov
yes, that's the one. Thank you, Ted. On Fri, Jul 6, 2012 at 2:32 PM, Ted Dunning wrote: > I think that Dmitriy is referring to this: > > http://www.deepdyve.com/lp/association-for-computing-machinery/regression-based-latent-factor-models-1ebJXMCs0K > > On Fri, Jul 6, 2012 at 2:26 PM, Dmitriy Lyub

Re: A bunch of SVD questions...

2012-07-06 Thread Ted Dunning
I think that Dmitriy is referring to this: http://www.deepdyve.com/lp/association-for-computing-machinery/regression-based-latent-factor-models-1ebJXMCs0K On Fri, Jul 6, 2012 at 2:26 PM, Dmitriy Lyubimov wrote: > (it is in ACM library, or Ted knows a cheaper arrangement to pull it off). >

Re: A bunch of SVD questions...

2012-07-06 Thread Dmitriy Lyubimov
these guys show one way to combine content info with dyadic data factorization, which is pretty close to what i used. Unfortunately i don't have a free download link for them (it is in ACM library, or Ted knows a cheaper arrangement to pull it off). Agarwal, Chen : "Regression-based Latent Factor

Re: A bunch of SVD questions...

2012-07-06 Thread Sean Owen
That's right, in the formulation you are referring to you are not predicting the original input values, so you can't compare them with RMSE or something. To test precision / recall you hold out some of the top-rated items (these are the "relevant results"), and see how many come back in the recomm
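The hold-out test described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not Mahout's evaluator: the function name `precision_recall_at_k` and the item IDs are hypothetical, and "relevant" simply means the held-out top-rated items for one user.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compare the top-k recommendations with the held-out relevant items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # With nothing recommended or nothing held out, the ratio is undefined.
    precision = hits / len(top_k) if top_k else float("nan")
    recall = hits / len(relevant) if relevant else float("nan")
    return precision, recall


# Hypothetical data: two items were held out, four came back as recommendations.
held_out = {"i1", "i2"}
recommendations = ["i3", "i7", "i1", "i9"]
precision, recall = precision_recall_at_k(recommendations, held_out, k=3)
```

In this sketch an empty recommendation list or an empty held-out set yields NaN rather than a number, so such users would have to be skipped when averaging.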

RE: A bunch of SVD questions...

2012-07-06 Thread Razon, Oren
Thanks, Sean. I accidentally continued this thread under the thread you opened, so I'm moving back to my thread :) I will rephrase the question I asked there. Let's say that, as part of my held-out test, my model finds that user u2's connection to i1 has a strength of 28.94, to i2 17.9, and to i3 4.5. T

Re: What is the best factorizer for low-quality LSA?

2012-07-06 Thread Lance Norskog
Thanks. On Thu, Jul 5, 2012 at 9:55 AM, Ted Dunning wrote: > For this size a dense solver like in commons math should work. For larger > sizes (up to about a million non-zeros), the in-memory stochastic > projection SVD in Mahout should work well. > > On Thu, Jul 5, 2012 at 12:44 AM, Sean Owen

Re: Lucene Mahout Integration

2012-07-06 Thread Lance Norskog
I suggest asking this question on the lucene-users mailing list. On Thu, Jul 5, 2012 at 8:56 AM, Praveen Chandar wrote: > Hi, > I've used lucene as a data source for Mahout in the past. Recently, I > switched to Lucene 4.0 (trunk) and in lucene 4.0 the indexing/term vector > APIs have changed. >

Re: Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread Ted Dunning
It is critical to use randomized projections here in order to get the dimension independent characteristics. On Fri, Jul 6, 2012 at 11:32 AM, Sean Owen wrote: > LSH is probably my ticket, thanks all. I tried a form of this, but > just used the basis of the feature space to define the hyperplanes

Re: Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread Sean Owen
LSH is probably my ticket, thanks all. I tried a form of this, but just used the basis of the feature space to define the hyperplanes because I was lazy and experimenting. It didn't work well in the sense that the best recommendations were not hashed together unless you had fairly few buckets (i.e.,

Re: must numeric item and user IDs be sequential, for bin/mahout itemsimilarity?

2012-07-06 Thread Dan Brickley
On 6 July 2012 19:36, Sean Owen wrote: > I don't recall that it has ever caused a problem, no. The values are > just keys in a hashtable, so don't need to be sequential. Thanks, Sean. Quite possibly I was misinterpreting something; I've not managed to track down the source of my belief and am hap

Re: Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread Ted Dunning
There is a very lightweight LSH implementation in https://github.com/tdunning/knn that will be what I am bringing into Mahout as part of 0.8. It is specifically designed to approximate dot products to accelerate search of this sort. You should be able to decrease the number of actual dot products b

Re: must numeric item and user IDs be sequential, for bin/mahout itemsimilarity?

2012-07-06 Thread Sean Owen
I don't recall that it has ever caused a problem, no. The values are just keys in a hashtable, so don't need to be sequential. On Fri, Jul 6, 2012 at 8:26 PM, Dan Brickley wrote: > I recall having problems with this before, using the non-Mahout Taste > code. I have meaningful strings for content

must numeric item and user IDs be sequential, for bin/mahout itemsimilarity?

2012-07-06 Thread Dan Brickley
I recall having problems with this before, using the non-Mahout Taste code. I have meaningful strings for content IDs and had mapped them systematically to pseudo-meaningful (but non-sequential) numbers. I remember that causing some problems a year or so back, ... but I'm trying it again now with t

Measuring the quality of the model

2012-07-06 Thread Sean Owen
(Changed subject from unrelated thread) You measure precision / recall, or the related F1 measure, or normalized discounted cumulative gain, or ROC. They are different, standard metrics that are less complicated than they sound. On Fri, Jul 6, 2012 at 6:13 PM, Razon, Oren wrote: > Thanks, it help
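Two of the metrics named above are short enough to write out. The following Python is a minimal sketch using the standard textbook definitions with binary relevance; the function names and sample data are hypothetical and nothing here is taken from Mahout's code.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ndcg_at_k(ranked_items, relevant, k):
    """Normalized discounted cumulative gain with 0/1 relevance."""
    # Each hit at 1-based position `rank` contributes 1 / log2(rank + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked_items[:k], start=1)
        if item in relevant
    )
    # The ideal ranking puts every relevant item at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


score = ndcg_at_k(["a", "b", "c"], {"a", "c"}, k=3)
```

Unlike precision and recall, nDCG rewards placing the relevant items nearer the top of the list, which is why it is often reported alongside them.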

Re: Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread sam wu
One more thought. Cosine similarity kind of measures the ratio of different feature preferences. In a recommendation job, I think the ratio of feature preferences is more relevant than the score itself (it kind of reduces the impact of bias; some people give higher scores, ...). Sam On Fri, Jul 6, 2012 at 9:01 AM, sam

Re: Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread sam wu
LSH has many different flavors (based on different similarity metrics). The usual one is MinHash, which is good if you have boolean (yes-no, 0-1) features, and it fits well in the case of k-shingles. In a latent topics model, like ALS, the features are no longer 0-1. I think Random Hyperplane (cosin
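For real-valued feature vectors, the random hyperplane flavor mentioned above can be sketched as follows. This is an illustrative Python sketch only, not the implementation from any library discussed in these threads; the names `make_hyperplanes`, `lsh_signature` and `bucket_items` are hypothetical, and the hyperplanes are drawn from a Gaussian as an assumption.

```python
import random


def make_hyperplanes(num_planes, dim, seed=42):
    """Draw random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits


def bucket_items(item_vectors, hyperplanes):
    """Group item IDs by signature so only one bucket needs exact scoring."""
    buckets = {}
    for item_id, vec in item_vectors.items():
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(item_id)
    return buckets


planes = make_hyperplanes(num_planes=8, dim=3)
buckets = bucket_items({"a": [0.3, -1.2, 2.0], "b": [0.6, -2.4, 4.0]}, planes)
```

Because only the sign of each dot product is kept, vectors pointing in the same direction share a signature regardless of their length, which matches the cosine view of similarity. More hyperplanes give more, smaller buckets.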

Re: Apache Mahout integration with Apache HIve

2012-07-06 Thread AnilKumar B
Hi Vignesh, Hive is not a database; it is a query language on Hadoop. Hive internally converts queries into MapReduce jobs and executes them. Mahout is an implementation of ML algorithms using MapReduce. Both use HDFS for storage. What exactly do you want to do? Thanks, B Anil Kumar. On Fri, Jul 6, 2012

RE: Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread Razon, Oren
Thanks, it helped! After some thought about the predicted outcome, I have a question about measuring the quality of my model. If I'm using a technique in which, in the end, I'm predicting a preference value (implicit \ explicit) I could easily measure my model by applying it on a

RE: A bunch of SVD questions...

2012-07-06 Thread Razon, Oren
Hi Dmitriy, Thank you for the answer. I will be happy to read such a paper. -Original Message- From: Dmitriy Lyubimov [mailto:dlie...@gmail.com] Sent: Thursday, July 05, 2012 19:18 To: user@mahout.apache.org Subject: RE: A bunch of SVD questions... Cold start problem is usually best attacke

Re: Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread Jens Grivolla
Maybe locality-sensitive hashing can help to get candidates before calculating the exact distance? Bye, Jens On 07/06/2012 11:35 AM, Sean Owen wrote: Here's one I've been puzzling over for a bit. In a factorization based on the SVD or what have you, you reconstruct the approximate original mat

general mahout working / some solr questions / last version tests

2012-07-06 Thread Videnova, Svetlana
Can someone please answer the following questions: 1) What is the input of Mahout (an XML file? That is the output of Solr, which is what interests me!)? 2) What is the output of Mahout, I mean after clustering with k-means for example (an XML file again?)? 3) Where is the output stored? 4) Can somebod

Shortcut to finding the best recs from factored matrices?

2012-07-06 Thread Sean Owen
Here's one I've been puzzling over for a bit. In a factorization based on the SVD or what have you, you reconstruct the approximate original matrix (well, one row) by multiplying the matrices back together and looking for the largest elements. This is essentially multiplying a user feature vector b
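The brute-force step described above (score every item by a dot product with the user's feature vector, then keep the largest) can be sketched in Python. This is a minimal illustration with hypothetical names and made-up feature values, not code from Mahout.

```python
import heapq


def top_n_items(user_features, item_features, n):
    """Score every item by dot product with the user vector and keep the n best."""
    scores = (
        (sum(u * f for u, f in zip(user_features, feats)), item_id)
        for item_id, feats in item_features.items()
    )
    # nlargest keeps only n candidates in memory, but every item is still scored.
    return heapq.nlargest(n, scores)


# Hypothetical two-feature model with three items.
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = top_n_items([1.0, 0.5], items, n=2)
```

The cost is one dot product per item for every user, which is the part a shortcut would need to avoid.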

Re: nutch and mahout integration

2012-07-06 Thread Alexander Aristov
Thank you, it's very helpful. Best Regards Alexander Aristov On 5 July 2012 20:12, Andy Schlaikjer wrote: > Hi Lance, > > Elephant Bird includes support for SequenceFile i/o from Pig: > > > https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/sto