seq2sparse throwing java.lang.NoSuchFieldError: LUCENE_41 error

2013-02-27 Thread Kris Jack
Hello all, I checked out the latest Mahout 0.8 code this morning but I get an error when I run seq2sparse. $ mahout seq2sparse -i in -o out --namedVector --weight tf hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally 27-Feb-2013 17:08:58 org.slf4j.impl.JCLLoggerAdapter

Re: Error while compiling Ted dunning algo - knn-master

2013-02-27 Thread Dan Filimon
Yes, we need the snapshot because of the streaming k-means mapper and reducer tests. Specifically, we need to add more than one input to the mappers (we need the entire set of points). Only the mrunit SNAPSHOT has this feature. On Wed, Feb 27, 2013 at 12:04 AM, Ted Dunning wrote: > The problem h

Re: kmeans clustering - how to leave some docs unclustered

2013-02-27 Thread Matt Molek
Sorry for the confusion, I meant the same thing. I'm also looking at the content of my clusteredPoints/part-m-0 file. I'm having trouble filtering outliers from my clusters too. Depending on the clusterClassificationThreshold value, either all or none of my points are classified. I think it's

Re: Vector distance within a cluster

2013-02-27 Thread Sean Owen
A common measure of cluster coherence is the mean distance or mean squared difference between the members and the cluster centroid. It sounds like this is the kind of thing you're measuring with these all-pairs distances. That could be a measure too; I've usually seen that done by taking the maximum
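The centroid-based coherence measure described in this message can be sketched in a few lines of Python. This is a minimal illustration, not Mahout code: the function names (`centroid`, `euclidean`, `cluster_coherence`) are hypothetical, Euclidean distance is an assumption, and the points are plain lists of floats rather than Mahout vectors.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dim)]


def euclidean(a, b):
    """Euclidean distance between two vectors (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_coherence(points):
    """Return (mean distance, mean squared distance) of members to the centroid."""
    c = centroid(points)
    dists = [euclidean(p, c) for p in points]
    mean_dist = sum(dists) / len(dists)
    mean_sq_dist = sum(d * d for d in dists) / len(dists)
    return mean_dist, mean_sq_dist


# Hypothetical cluster: four corners of a square around (1, 1).
members = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mean_dist, mean_sq_dist = cluster_coherence(members)
```

Smaller values indicate a tighter cluster. Compared with all-pairs distances, this needs only one pass over the members once the centroid is known.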

Re: kmeans clustering - how to leave some docs unclustered

2013-02-27 Thread Chris Harrington
Clustering worked for me (sorry if I didn't make that part clear); the problem I'm having is the empty clusteredPoints/part-m-0 file. With any value greater than 0.025, clusteredPoints/part-m-0 is empty, and I use that file to map the document to the cluster it ended up in. If I c

RE: Using ALS job to extract the decomposed matrices

2013-02-27 Thread Razon, Oren
Thanks for the details! I don't believe it's a memory issue because our dataset is smaller than 1GB. Anyhow, I will go ahead and try to execute it on a much smaller dataset, just to be sure. As for my second question... How could I extract the 2 small matrices (U*K & I*K) into CSVs using Mahout

Re: Vector distance within a cluster

2013-02-27 Thread Chris Harrington
Hmmm, you may have to dumb things down for me here. I don't have much of a background in the area of ML and I'm just piecing things together and learning as I go. So I don't really understand what you mean by "Coherence against an external standard? Or internal consistency/homogeneity?" or

Re: How to remove popular items?

2013-02-27 Thread Sean Owen
It's true, although many of the algorithms will by nature not emphasize popular items. There is an old and semi-deprecated class in the project called InverseUserFrequency, which you can use to manually de-emphasize popular items internally. I wouldn't really recommend it. You can always use IDRes

Re: How to remove popular items?

2013-02-27 Thread Aleksei Udatšnõi
Consider using IDRescorer to penalize or skip items. On Mon, Feb 4, 2013 at 6:54 PM, Zia mel wrote: > Hi, is there a current way to remove the popular items in the > recommendations? Something like STOP words. > Thanks! >
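To illustrate the suggestion, here is a small Python sketch of the rescoring idea. Mahout's IDRescorer is a Java interface with two hooks, one that adjusts a score and one that filters an item out entirely; the class below only mirrors that shape. `PopularityRescorer`, `apply_rescorer`, the popularity counts, the cutoff, and the penalty formula are all assumptions made for illustration and are not part of Mahout.

```python
class PopularityRescorer:
    """Hypothetical rescorer: skips very popular items, penalizes the rest."""

    def __init__(self, popularity, filter_above, penalty):
        self.popularity = popularity      # item id -> interaction count (assumed input)
        self.filter_above = filter_above  # items more popular than this are skipped
        self.penalty = penalty            # strength of the popularity penalty

    def is_filtered(self, item_id):
        """True if the item should never be recommended."""
        return self.popularity.get(item_id, 0) > self.filter_above

    def rescore(self, item_id, original_score):
        """Shrink the score as popularity grows (one possible penalty, not Mahout's)."""
        count = self.popularity.get(item_id, 0)
        return original_score / (1.0 + self.penalty * count)


def apply_rescorer(candidates, rescorer, how_many):
    """Filter, rescore, and return the top `how_many` (item, score) pairs."""
    kept = [
        (item_id, rescorer.rescore(item_id, score))
        for item_id, score in candidates
        if not rescorer.is_filtered(item_id)
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:how_many]


# Hypothetical data: item 1 is a blockbuster, item 2 is mildly popular.
rescorer = PopularityRescorer({1: 1000, 2: 2, 3: 0}, filter_above=500, penalty=0.5)
top = apply_rescorer([(1, 5.0), (2, 4.0), (3, 3.0)], rescorer, how_many=2)
```

Filtering acts like a stop-word list for items, while the penalty keeps popular items available but ranks them lower.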

mahout for web page categorization

2013-02-27 Thread Rajesh Nikam
Hi, I am looking at how to use Mahout for web page categorization. The idea is to have various categories, such as Adult, Arts, Business, Computers, Games, Health, Home, Kids, News, Recreation, Reference, Science, Shopping, Society and Sports, and to classify a given web page into a specific category. After going through some

Re: Using ALS job to extract the decomposed matrices

2013-02-27 Thread Sebastian Schelter
The difference is that the job used a reduce-side join to join feature vectors and ratings in 0.5 which is scalable but very slow. We changed this to a broadcast join in later versions, which can be executed using a single map-only job. However, each of the feature matrices has to fit into the map
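The two join strategies contrasted in this message can be sketched in plain Python. This is a single-process illustration under stated assumptions, not the Mahout job: `reduce_side_join` and `broadcast_join` are hypothetical names, feature vectors are short lists, and ratings are `(user_id, item_id, rating)` tuples.

```python
from collections import defaultdict


def reduce_side_join(features, ratings):
    """Group both inputs by item id first (the shuffle), then join per group."""
    grouped = defaultdict(lambda: {"features": None, "ratings": []})
    for item_id, vector in features:
        grouped[item_id]["features"] = vector
    for user_id, item_id, rating in ratings:
        grouped[item_id]["ratings"].append((user_id, rating))

    joined = []
    for item_id, group in grouped.items():
        if group["features"] is None:
            continue  # no feature vector for this item, nothing to join
        for user_id, rating in group["ratings"]:
            joined.append((user_id, item_id, rating, group["features"]))
    return joined


def broadcast_join(features, ratings):
    """Hold the whole feature matrix in memory and stream the ratings past it."""
    lookup = dict(features)  # must fit in memory, as the message points out
    return [
        (user_id, item_id, rating, lookup[item_id])
        for user_id, item_id, rating in ratings
        if item_id in lookup
    ]


# Hypothetical inputs: two item feature vectors and four ratings.
item_features = [(10, [0.1, 0.2]), (20, [0.3, 0.4])]
user_ratings = [(1, 10, 5.0), (2, 20, 3.0), (1, 20, 4.0), (3, 30, 1.0)]
by_reduce = sorted(reduce_side_join(item_features, user_ratings))
by_broadcast = sorted(broadcast_join(item_features, user_ratings))
```

Both functions produce the same joined records. The broadcast variant avoids regrouping the ratings, which is why it can run as a map-only pass, at the cost of keeping the smaller input fully in memory.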

RE: Using ALS job to extract the decomposed matrices

2013-02-27 Thread Razon, Oren
Yes, I'm sure. We used some of our own code that executes the specific ParallelALSFactorizationJob. The same execution worked for Mahout 0.5 but not for 0.6 / 0.7. Is there anything different in the way this job is invoked? -Original Message- From: Sebastian Schelter [mailto:s...@apache.org] Sent

Re: Using ALS job to extract the decomposed matrices

2013-02-27 Thread Sebastian Schelter
Hello Razon, this is a strange bug that should not happen. It seems that some of the vectors supplied to the solver are null. Are you sure that there are no exceptions previous to this one? Best, Sebastian On 27.02.2013 09:53, Razon, Oren wrote: > Hi there, > I'm using Hadoop-core 0.20.3 and I want to u

Using ALS job to extract the decomposed matrices

2013-02-27 Thread Razon, Oren
Hi there, I'm using Hadoop-core 0.20.3 and I want to use the Mahout ALS algorithm. My purpose is to run the ALS model and extract the decomposed matrices for further usage in my application (I want to create 2 different csv files: [UserId, latentFeatureId, Value] and [ItemId, latentFeatureId, Value])
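The target CSV layout described in this message can be sketched as follows. This is only an illustration of the [Id, latentFeatureId, Value] format: it assumes the factor vectors have already been read into a Python dict, and it does not show how to read Mahout's output files. The names `factors_to_csv_rows` and `write_factor_csv` are hypothetical.

```python
import csv
import io


def factors_to_csv_rows(factors):
    """Yield [entity_id, latent_feature_id, value] rows from {id: [values...]}."""
    for entity_id in sorted(factors):
        for feature_id, value in enumerate(factors[entity_id]):
            yield [entity_id, feature_id, value]


def write_factor_csv(factors, handle):
    """Write one row per (entity, latent feature) pair to an open text handle."""
    writer = csv.writer(handle)
    for row in factors_to_csv_rows(factors):
        writer.writerow(row)


# Hypothetical user factors with K = 2 latent features.
user_factors = {7: [0.5, -1.0], 3: [0.25, 2.0]}
buffer = io.StringIO()
write_factor_csv(user_factors, buffer)
rows = list(factors_to_csv_rows(user_factors))
```

The same function serves both files: pass the user factors for the [UserId, latentFeatureId, Value] file and the item factors for the [ItemId, latentFeatureId, Value] file.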