Re: Question for RecommenderJob

2013-08-01 Thread Sebastian Schelter
The size should not matter; you should get output. What exactly do you mean by "it has null"? --sebastian On 02.08.2013 03:44, hahn jiang wrote: > The version of Mahout which I used is 0.7-cdh4.3.1 and I am sure that no > errors occur. I check the output but it has null. > I think the problem is

Re: Why is Lanczos deprecated?

2013-08-01 Thread Sebastian Schelter
I would also be fine with keeping it if there is demand. I just proposed to deprecate it and nobody voted against that at that point in time. --sebastian On 02.08.2013 03:12, Dmitriy Lyubimov wrote: > There's a part of Nathan Halko's dissertation referenced on algorithm page > running comparison.

Re: Question for RecommenderJob

2013-08-01 Thread hahn jiang
The version of Mahout that I used is 0.7-cdh4.3.1, and I am sure that no errors occurred. I checked the output but it has null. I think the problem is my data set. Is my item set too small, with only 200 elements? On Thu, Aug 1, 2013 at 9:57 PM, Sebastian Schelter wrote: > Which version of

Re: Why is Lanczos deprecated?

2013-08-01 Thread Dmitriy Lyubimov
There's a part of Nathan Halko's dissertation, referenced on the algorithm page, that runs a comparison. In particular, he was not able to compute more than 40 eigenvectors with Lanczos on the Wikipedia dataset. You may refer to that study. On the accuracy part, it was not observed that it was a problem, assum

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Yes, storing the similar_items in one field and cross_action_similar_items in another field, all on the same doc identified by item ID. Agree that there may be other fields. Storing the rows of [B'B] is ok because it's symmetric. However we did talk about the [B'A] case and I thought we agreed to store the
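The layout discussed in this thread, one document per item ID with each similarity row stored in its own field, might look roughly like the sketch below. This is plain Python rather than Solr client code. The field names `similar_items` and `cross_action_similar_items` come from the message; the `id` key, the space-separated encoding, the helper name `build_item_docs`, and the sample item IDs are assumptions made for illustration.

```python
def build_item_docs(similar_rows, cross_action_rows):
    """Merge rows from two similarity matrices into one doc per item ID."""
    docs = {}
    for field, rows in (
        ("similar_items", similar_rows),
        ("cross_action_similar_items", cross_action_rows),
    ):
        for item_id, neighbours in rows.items():
            # One doc per item; each matrix contributes one field to it.
            doc = docs.setdefault(item_id, {"id": item_id})
            # Assumed encoding: the row becomes a space-separated list of item IDs.
            doc[field] = " ".join(neighbours)
    return docs


# Hypothetical rows: item ID -> IDs of its most similar items.
similar = {"item1": ["item2", "item3"], "item2": ["item1"]}
cross = {"item1": ["item9"]}
docs = build_item_docs(similar, cross)
```

An item that has no row in one of the matrices simply lacks that field, which is how `item2` ends up with only `similar_items` here.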

Re: Setting up a recommender

2013-08-01 Thread B Lyon
I am wondering about row/column confusion as well - fleshing out the doc/design with more specifics (which Pat is kind of doing, basically) should make things obvious eventually, imo. The way Pat had phrased it got me to wondering what rationale you use to rank the results when you are querying th

Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel wrote: > Sorry to be dense but I think there is some miscommunication. The most > important question is: am I writing the item-item similarity matrix DRM out > to Solr, one row = one Solr doc? Each row = one *field* in a Solr doc. Different DRM's pro

Re: multi-class classification question

2013-08-01 Thread Ted Dunning
I have talked to one user who had ~60,000 classes and they were able to use OLR with success. The way that they did this was to arrange the output classes into a multi-level tree. Then they trained classifiers at each level of the tree. At any level, if there was a dominating result, then only th
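The multi-level tree idea can be sketched as below. This is a toy illustration, not the user's actual setup: the tree, the scorer functions, the `dominance` threshold, and the `beam` width are all hypothetical. The message is cut off after the dominating-result case, so the fallback used here (exploring the top few branches when no result dominates) is an assumption.

```python
def classify(node, x, dominance=0.8, beam=2):
    """Return (leaf_label, score) for input x by walking a tree of classifiers."""
    if not node["children"]:
        return node["label"], 1.0
    scores = node["scorer"](x)  # child label -> probability-like score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # A dominating child means only that branch is explored; otherwise
    # the top `beam` branches are explored (assumed fallback).
    keep = ranked[:1] if ranked[0][1] >= dominance else ranked[:beam]
    best = None
    for child_label, score in keep:
        leaf, leaf_score = classify(node["children"][child_label], x, dominance, beam)
        candidate = (leaf, score * leaf_score)
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best


def leaf(label):
    return {"label": label, "scorer": None, "children": {}}


# Hypothetical two-level tree with hand-written scorers standing in for
# trained per-level classifiers.
tree = {
    "label": "root",
    "scorer": lambda x: {"sweet": x["sugar"], "savory": 1 - x["sugar"]},
    "children": {
        "sweet": {
            "label": "sweet",
            "scorer": lambda x: {"candy": 0.7, "cake": 0.3},
            "children": {"candy": leaf("candy"), "cake": leaf("cake")},
        },
        "savory": {
            "label": "savory",
            "scorer": lambda x: {"chips": 0.6, "nuts": 0.4},
            "children": {"chips": leaf("chips"), "nuts": leaf("nuts")},
        },
    },
}

dominated = classify(tree, {"sugar": 0.9})  # only the "sweet" branch is explored
ambiguous = classify(tree, {"sugar": 0.4})  # both branches are explored
```

The point of the arrangement is that each classifier only has to separate a handful of children, and most inputs touch a small part of the tree.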

Re: k-means issues

2013-08-01 Thread Suneel Marthi
Thanks for pointing that out. I corrected the Wiki page. From: Marco To: "user@mahout.apache.org" Sent: Thursday, August 1, 2013 3:08 PM Subject: Re: k-means issues thanks a lot. will try your suggestions asap. i was sort of following this http://goo.gl/u

Re: k-means issues

2013-08-01 Thread Marco
thanks a lot. will try your suggestions asap. i was sort of following this http://goo.gl/u8VFZN - Original message - From: Jeff Eastman To: user@mahout.apache.org Cc: Sent: Thursday, 1 August 2013 21:02 Subject: Re: k-means issues The clustering arguments are usually directories, not

Re: k-means issues

2013-08-01 Thread Jeff Eastman
The clustering arguments are usually directories, not files. Try: mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i mahout/kmeans-clusters/clusters-1-final -n 20 -b 100 -o cdump.txt -p mahout/kmeans-clusters/clusteredPoints On 8/1/13 2:51 PM, Marco wrote: mahout

Re: k-means issues

2013-08-01 Thread Suneel Marthi
You also need to specify the distance measure '-dm' to clusterdump. This is the Distance Measure that was used for clustering. (Again look at the example in /examples/bin/cluster-reuters.sh - it has all the steps you are trying to accomplish) From: Marco To: "u

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Sorry to be dense but I think there is some miscommunication. The most important question is: am I writing the item-item similarity matrix DRM out to Solr, one row = one Solr doc? For the mapreduce Mahout Item-based recommender this is in "tmp/similarityMatrix". If not then please stop me. If I'

Re: k-means issues

2013-08-01 Thread Marco
 mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i mahout/kmeans-clusters/clusters-1-final/part-r-0 -n 20 -b 100 -o cdump.txt -p mahout/kmeans-clusters/clusteredPoints - Original message - From: Suneel Marthi To: "user@mahout.apache.org" ; Marco Cc: Inv

multi-class classification question

2013-08-01 Thread yikes aroni
Say that I am trying to determine which customers buy particular candy bars. So I want to classify training data consisting of candy bar attributes (an N dimensional vector of variables) into customer attributes (an M dimensional vector of customer attributes). Is there a preferred method when N a

Re: Why is Lanczos deprecated?

2013-08-01 Thread Jake Mannix
On Thu, Aug 1, 2013 at 7:08 AM, Sebastian Schelter wrote: > IIRC the main reasons for deprecating Lanczos were that in contrast to > SSVD, it does not use a constant number of MapReduce jobs and that our > implementation has the constraint that all the resulting vectors have to > fit into the memo

Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel wrote: > > For item similarities there is no need to do more than fetch one doc that > contains the similarities, right? I've successfully used this method with > the Mahout recommender but please correct me if something above is wrong. No. First, you

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Sebastian Schelter
Setting it to the maximum number should be enough. Would be great if you can share your dataset and tests. 2013/8/1 Rafal Lukawiecki > Should I have set that parameter to a value much much larger than the > maximum number of actually expressed preferences by a user? > > I'm working on an anonymi

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Should I have set that parameter to a value much much larger than the maximum number of actually expressed preferences by a user? I'm working on an anonymised data set. If it works as an error test case, I'd be happy to share it for your re-test. I am still hoping it is my error, not Mahout's.

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Sebastian Schelter
Ok, please file a bug report detailing what you've tested and what results you got. Just to clarify, setting maxPrefsPerUser to a high number still does not help? That surprises me. 2013/8/1 Rafal Lukawiecki > Hi Sebastian, > > I've rechecked the results, and, I'm afraid that the issue has not

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Hi Sebastian, I've rechecked the results, and, I'm afraid that the issue has not gone away, contrary to my enthusiastic response yesterday. Using 0.8 I have retested with and without the --maxPrefsPerUser 9000 parameter (no user has more than 5000 prefs). I have also supplied the prefs file, with

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Not following, so here is what I've done, in probably too much detail: 1) ingest raw log files and split them up by action 2) turn these into Mahout preference files using Mahout type IDs, keeping a map of IDs 3) run the Mahout Item-based recommender using LLR for similarity 4) created a Mahou
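Step 3 relies on the log-likelihood ratio (LLR) as the item-item similarity. A minimal sketch of the LLR score on a 2x2 co-occurrence table follows, written from the standard entropy formulation rather than copied from Mahout. The function names and the reading of the counts are assumptions: `k11` is taken to be the number of users who interacted with both items, `k12` and `k21` the users who interacted with only one of them, and `k22` the users who interacted with neither.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR score for a 2x2 contingency table of co-occurrence counts."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


# Two items that always occur together score high; items whose
# co-occurrence matches what independence predicts score about zero.
together = log_likelihood_ratio(10, 0, 0, 10)
independent = log_likelihood_ratio(5, 5, 5, 5)
```

A high score says the co-occurrence is unlikely to be chance, which is why LLR is less easily fooled by very popular items than raw co-occurrence counts.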

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Ryan Josal
Galit, yes this does sound like this is related, and as Matt said, you can test this by setting the max split size on the CLI. I didn't personally find this to be a reliable and efficient method, so I wrote the -m parameter to my job to set it right every time. It seems that this would be usef

Re: k-means issues

2013-08-01 Thread Suneel Marthi
Could you post the command line you are using for clusterdump? From: Marco To: "user@mahout.apache.org" ; Suneel Marthi Sent: Thursday, August 1, 2013 10:29 AM Subject: Re: k-means issues ok i did put -cl and got clusteredPoints, but then I do clusterdump an

Re: How to SSVD output to generate Clusters

2013-08-01 Thread Ted Dunning
The original motivation of spectral clustering talks about graphs. But the idea of clustering the reduced dimension form of a matrix simply depends on the fact[1] that the metric is approximately preserved by the reduced form and is thus applicable to any matrix. [1] Johnson-Lindenstrauss yet ag
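The Johnson-Lindenstrauss property mentioned here can be illustrated with a small experiment. Note that this sketch uses a plain Gaussian random projection, not SSVD, and the dimensions (500 reduced to 100), the seeds, and the helper names are arbitrary assumptions chosen only to keep the example quick to run.

```python
import math
import random


def random_projection(vectors, k, seed=0):
    """Project d-dimensional vectors to k dimensions with a Gaussian matrix."""
    rng = random.Random(seed)
    d = len(vectors[0])
    # Scaling by 1/sqrt(k) preserves squared lengths in expectation.
    r = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)] for _ in range(d)]
    return [[sum(v[i] * r[i][j] for i in range(d)) for j in range(k)] for v in vectors]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(1)
points = [[rng.random() for _ in range(500)] for _ in range(3)]
reduced = random_projection(points, k=100)

# Ratio of the distance after projection to the distance before it.
ratio = distance(reduced[0], reduced[1]) / distance(points[0], points[1])
```

Because the ratio stays close to 1, a distance-based method such as k-means sees roughly the same geometry in the reduced space as in the original one.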

Re: How to SSVD output to generate Clusters

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 5:49 AM, Stuti Awasthi wrote: > I think there is a problem because of NamedVector as after some search I > get this Jira. https://issues.apache.org/jira/browse/MAHOUT-1067 > Note also that this bug is fixed in 0.8

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Matt Molek
Oops, I'm sorry. I had one too many zeros there, should be '-Dmapred.max.split.size=10' Just (input size)/(desired number of mappers)
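The rule of thumb (input size)/(desired number of mappers) is easy to wrap in a helper. This is a minimal sketch: the function name and the example figures (a 1,000,000,000-byte input and 200 mappers) are hypothetical, and the actual number of mappers also depends on the input format and block size.

```python
def max_split_size(input_bytes, desired_mappers):
    """Split size in bytes that yields roughly `desired_mappers` input splits."""
    if desired_mappers <= 0:
        raise ValueError("desired_mappers must be positive")
    # Ceiling division, so the number of splits does not exceed the target.
    return -(-input_bytes // desired_mappers)


# Hypothetical job: 1,000,000,000 bytes of input, about 200 mappers wanted.
size = max_split_size(1_000_000_000, 200)
flag = f"-Dmapred.max.split.size={size}"
```

The resulting `flag` string is what would be passed on the command line, as described in the thread.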

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Matt Molek
One trick to getting more mappers on a job when running from the command line is to pass a '-Dmapred.max.split.size=<size>' argument. The <size> is a size in bytes. So if you have some hypothetical 10MB input set, but you want to force ~100 mappers, use '-Dmapred.max.split.size=100' On Wed, Jul 3

Re: k-means issues

2013-08-01 Thread Marco
ok i did put -cl and got clusteredPoints, but then I do clusterdump and always get "Wrote 0 clusters" - Original message - From: Suneel Marthi To: "user@mahout.apache.org" ; Marco Cc: Sent: Thursday, 1 August 2013 16:04 Subject: Re: k-means issues Check examples/bin/cluster_reute

Re: Why is Lanczos deprecated?

2013-08-01 Thread Sebastian Schelter
IIRC the main reasons for deprecating Lanczos were that in contrast to SSVD, it does not use a constant number of MapReduce jobs and that our implementation has the constraint that all the resulting vectors have to fit into the memory of the driver machine. Best, Sebastian On 01.08.2013 12:15, Fer

Re: k-means issues

2013-08-01 Thread Suneel Marthi
Check examples/bin/cluster-reuters.sh for kmeans (it exists in Mahout 0.7 too :)) You need to specify the clustering option -cl in your kmeans command. From: Marco To: "user@mahout.apache.org" Sent: Thursday, August 1, 2013 9:55 AM Subject: k-means issu

Re: Question for RecommenderJob

2013-08-01 Thread Sebastian Schelter
Which version of Mahout are you using? Did you check the output, are you sure that no errors occur? Best, Sebastian On 01.08.2013 09:59, hahn jiang wrote: > Hi all, > > > I have a question when I use RecommenderJob for item-based recommendation. > > My input data format is "userid,itemid,1", s

k-means issues

2013-08-01 Thread Marco
So I've got 13000 text files representing topics in certain newspaper articles. Each file is just a tab-separated list of topics (so something like "china    japan    senkaku    dispute" or "italy   lampedusa   immgration"). I want to run k-means clustering on them. Here's what I do (i'm ac

CHEMDNER CFP and training data

2013-08-01 Thread Martin Krallinger
CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name recognition task (see http://www.biocreative.org/tasks/biocreative-iv/chemdner) (1) The CHEMDNER task (part of The BioCreative IV competition) is a community challenge on named entity recognition of chemical compounds. The goal

Re: How to SSVD output to generate Clusters

2013-08-01 Thread Chirag Lakhani
Maybe someone can clarify this issue but the spectral clustering implementation assumes an affinity graph, am I correct? Are there direct ways of going from a list of feature vectors to an affinity matrix in order to then implement spectral clustering? On Thu, Aug 1, 2013 at 8:49 AM, Stuti Awast

RE: How to SSVD output to generate Clusters

2013-08-01 Thread Stuti Awasthi
Thanks Ted, Dmitriy. I'll check the Spectral Clustering as well as the PCA option, but first I want to run it once with the normal approach. Here is what I am doing with Mahout 0.7: 1. seqdirectory : ~/mahout-distribution-0.7/bin/mahout seqdirectory -i /stuti/SSVD/ClusteringInput -o /stuti/SSVD/data-seq

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Simon, my apologies for my dumb question. I found the web site for prediction IO—I did not realise it was a separate project, and I was looking for info in the existing Mahout documentation. I will research it now for our use case. -- Rafal Lukawiecki Strategic Consultant and Director Project Bo

Why is Lanczos deprecated?

2013-08-01 Thread Fernando Fernández
Hi everyone, Sorry if I duplicate the question but I've been looking for an answer and I haven't found an explanation other than it's not being used (together with some other algorithms). If it's been discussed in depth before maybe you can point me to some link with the discussion. I have succes

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Simon, is there any documentation available, or more info on PredictionIO? -- Rafal Lukawiecki Pardon brevity, mobile device. On 1 Aug 2013, at 09:13, "Simon Chan" wrote: > We are building PredictionIO that helps to handle a number of business > logics. Recommending only items that the user has

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Simon Chan
We are building PredictionIO, which helps to handle a number of business logic rules. Recommending only items for which the user has never expressed a preference is supported. It is a layer on top of Mahout. Hope it is helpful. Simon On Wed, Jul 31, 2013 at 4:57 PM, Ted Dunning wrote: > Go with 0.8

Question for RecommenderJob

2013-08-01 Thread hahn jiang
Hi all, I have a question about using RecommenderJob for item-based recommendation. My input data format is "userid,itemid,1", so I set the booleanData option to true. There are 9,000,000 users but only 200 items. When I run the RecommenderJob, the result is null. I try many time