error running lucene.vectors
I'm running the command mahout lucene.vectors (via Cygwin) on a Solr 4.4 index, using Mahout 0.8, and I'm getting the following error:

SEVERE: There are too many documents that do not have a term vector for text
Exception in thread "main" java.lang.IllegalStateException: There are too many documents that do not have a term vector for text
at org.apache.mahout.utils.vectors.lucene.AbstractLuceneIterator.computeNext(AbstractLuceneIterator.java:97)

I tried adding the flag --maxPercentErrorDocs 0.9 and I still get the same error. I have defined term vectors for my Solr 'text' field.
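An editorial sketch for narrowing this down: scan the index directly with the Lucene 4.x API and count the documents that actually carry a stored term vector for the field. The index path is a placeholder. Note that Solr only stores term vectors for documents indexed after termVectors="true" was added to the schema, so documents indexed earlier typically require a full reindex:

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class TermVectorCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder path: point this at the Solr core's data/index directory.
    IndexReader reader = DirectoryReader.open(
        FSDirectory.open(new File("/path/to/solr/core/data/index")));
    int missing = 0;
    for (int i = 0; i < reader.maxDoc(); i++) {
      // getTermVector returns null when no vector was stored for this doc/field.
      if (reader.getTermVector(i, "text") == null) {
        missing++;
      }
    }
    System.out.printf("%d of %d documents lack a term vector for 'text'%n",
        missing, reader.maxDoc());
    reader.close();
  }
}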
Re: Data distribution guidance for recommendation engines
On Thu, Aug 1, 2013 at 3:15 AM, Chloe Guszo chloe.gu...@gmail.com wrote:

> If I split my data into train and test sets, I can show good performance of the model on the train set.

Good performance according to what metric? It makes a lot of difference whether you are talking about precision/recall or RMSE.

> What might I expect given an uneven distribution of ratings? Imagine a situation where 50% of the ratings are 1s, and the rest 2-5. Will the model be biased towards rating items a 1? Do

In the general case, recommenders don't rate items at all, they rank items. So this may not be a question that matters.

> about the rating scale itself. For example, given [1:3] vs [1:10] ranges, with the former you've got a 1/3 chance of predicting the correct rating, say, while in the latter case it is 1/10. Or, when is sparse too sparse?

Why do you say that... the recommender is not choosing ratings randomly.

> Ultimately, I'm trying to figure out under what conditions one would look at a model and say that is crap, pardon my language. Do any more

You use evaluation metrics, which are imperfect, but about the best you can do in the lab. If you're already doing that, you're doing fine. This is true no matter what your input is like. If your input is things like click count, then they will certainly be mostly 1 and follow a power-law distribution. This is no problem, but you want to follow the 'implicit feedback' version of ALS, where you are not trying to reconstruct the input but use the input as weights.
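To make the "use evaluation metrics" advice concrete, here is a minimal editorial sketch using Mahout's in-memory Taste API; the data file is a placeholder and the item-based/LLR configuration is just one plausible choice. Precision/recall at 10 is the kind of ranking metric contrasted with RMSE above:

import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class PrecisionRecallCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder file of user,item,rating triples.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel m) throws TasteException {
        // One plausible configuration: item-based with log-likelihood similarity.
        return new GenericItemBasedRecommender(m, new LogLikelihoodSimilarity(m));
      }
    };
    GenericRecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    // Precision/recall at 10, letting the evaluator pick a relevance threshold.
    IRStatistics stats = evaluator.evaluate(builder, null, model, null, 10,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
    System.out.println("precision@10 = " + stats.getPrecision());
    System.out.println("recall@10    = " + stats.getRecall());
  }
}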
Question for RecommenderJob
Hi all, I have a question about using RecommenderJob for item-based recommendation. My input data format is userid,itemid,1, so I set the booleanData option to true. There are about 9,000,000 users but only 200 items. When I run the RecommenderJob, the result is null. I have tried many times with different arguments, but the result is always null. This is one of my commands. Could you tell me why it is null, please?

bash recommender-job.sh --input input/user-item-value --output output/recommender --numRecommendations 10 --similarityClassname SIMILARITY_PEARSON_CORRELATION --maxSimilaritiesPerItem 300 --maxPrefsPerUser 300 --minPrefsPerUser 1 --maxPrefsPerUserInItemSimilarity 1000 --booleanData true

Thanks
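An editorial note, not from the thread: Pearson correlation is undefined when every preference value is identical, which is exactly the boolean case, so SIMILARITY_LOGLIKELIHOOD or SIMILARITY_TANIMOTO_COEFFICIENT is the usual choice for 0/1 data. A quick in-memory sanity check with the Taste API, assuming a small sample file and placeholder item IDs:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class BooleanSimilarityCheck {
  public static void main(String[] args) throws Exception {
    // Placeholder: a small userid,itemid,1 sample of the real input.
    DataModel model = new FileDataModel(new File("sample.csv"));
    // Log-likelihood ratio ignores preference values, so it suits boolean data.
    ItemSimilarity llr = new LogLikelihoodSimilarity(model);
    // Placeholder item IDs: print the similarity of two items from the sample.
    System.out.println(llr.itemSimilarity(1L, 2L));
  }
}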
Re: RecommenderJob Recommending an Item Already Preferred by a User
We are building PredictionIO, which helps handle a number of common pieces of business logic. Recommending only items that the user has never expressed a preference for before is supported. It is a layer on top of Mahout. Hope it is helpful. Simon

On Wed, Jul 31, 2013 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com wrote: Go with 0.8. Definitely. Hadoop scaleout should be easy. On Wed, Jul 31, 2013 at 4:19 PM, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Thank you! In general, should I be putting our efforts into using 0.8, or stick with 0.7 for now, re RecommenderJob? On another note, which might be a different thread, but would you have any ready-made accuracy and reliability validation code to suggest when using RecommenderJob, or do I need to stick with predicting from test data/test partitions, and analysing the resulting confusion matrices in R etc.? Anything turnkey helps to entice new users. Rafal. PS. Another reason for using RJ in our use case is the hopeful, assumed promise of a Hadoop-derived scale-out, when needed in the near future. Mixed results so far on that end. -- Rafal Lukawiecki Pardon my brevity, sent from a telephone. On 1 Aug 2013, at 00:09, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Jul 31, 2013 at 4:06 PM, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Many thanks, I'll report the issue, when I figure out where. :) I can help with that! https://issues.apache.org/jira/browse/MAHOUT
Re: RecommenderJob Recommending an Item Already Preferred by a User
Simon, is there any documentation available, or more info on PredictionIO? -- Rafal Lukawiecki Pardon brevity, mobile device.

On 1 Aug 2013, at 09:13, Simon Chan simonc...@gmail.com wrote: We are building PredictionIO, which helps handle a number of common pieces of business logic. Recommending only items that the user has never expressed a preference for before is supported. It is a layer on top of Mahout. Hope it is helpful. Simon On Wed, Jul 31, 2013 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com wrote: Go with 0.8. Definitely. Hadoop scaleout should be easy. On Wed, Jul 31, 2013 at 4:19 PM, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Thank you! In general, should I be putting our efforts into using 0.8, or stick with 0.7 for now, re RecommenderJob? On another note, which might be a different thread, but would you have any ready-made accuracy and reliability validation code to suggest when using RecommenderJob, or do I need to stick with predicting from test data/test partitions, and analysing the resulting confusion matrices in R etc.? Anything turnkey helps to entice new users. Rafal. PS. Another reason for using RJ in our use case is the hopeful, assumed promise of a Hadoop-derived scale-out, when needed in the near future. Mixed results so far on that end. -- Rafal Lukawiecki Pardon my brevity, sent from a telephone. On 1 Aug 2013, at 00:09, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Jul 31, 2013 at 4:06 PM, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Many thanks, I'll report the issue, when I figure out where. :) I can help with that! https://issues.apache.org/jira/browse/MAHOUT
Why is Lanczos deprecated?
Hi everyone, Sorry if I duplicate the question, but I've been looking for an answer and I haven't found an explanation other than that it's not being used (together with some other algorithms). If it's been discussed in depth before, maybe you can point me to some link with the discussion. I have successfully used Lanczos in several projects, and it's been a surprise to me to find that the main reason (according to what I've read, which might not be the full story) is that it's not being used. At the beginning I supposed it was because SSVD is supposed to be much faster with similar results, but after making some tests I have found that running times are similar or even worse than Lanczos for some configurations (I have tried several combinations of parameters, given child processes enough memory, etc., and had no success in running SSVD in at least 3/4 of the time Lanczos takes, though there might be some combinations of parameters I have still not tried). It seems to be quite tricky to find a good combination of parameters for SSVD, and I have also seen a precision loss in some examples that makes me not confident in migrating from Lanczos to SSVD for now (how far can I trust results from a combination of parameters that runs in significantly less time, or at least a good time?). Can someone convince me that SSVD is actually a better option than Lanczos? (I'm totally willing to be convinced... :) ) Thank you very much in advance. Fernando.
Re: RecommenderJob Recommending an Item Already Preferred by a User
Simon, my apologies for my dumb question. I found the PredictionIO web site. I did not realise it was a separate project, and I was looking for info in the existing Mahout documentation. I will research it now for our use case. -- Rafal Lukawiecki Strategic Consultant and Director Project Botticelli Ltd

On 1 Aug 2013, at 09:52, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Simon, is there any documentation available, or more info on PredictionIO? -- Rafal Lukawiecki Pardon brevity, mobile device. On 1 Aug 2013, at 09:13, Simon Chan simonc...@gmail.com wrote: We are building PredictionIO, which helps handle a number of common pieces of business logic. Recommending only items that the user has never expressed a preference for before is supported. It is a layer on top of Mahout. Hope it is helpful. Simon On Wed, Jul 31, 2013 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com wrote: Go with 0.8. Definitely. Hadoop scaleout should be easy. On Wed, Jul 31, 2013 at 4:19 PM, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Thank you! In general, should I be putting our efforts into using 0.8, or stick with 0.7 for now, re RecommenderJob? On another note, which might be a different thread, but would you have any ready-made accuracy and reliability validation code to suggest when using RecommenderJob, or do I need to stick with predicting from test data/test partitions, and analysing the resulting confusion matrices in R etc.? Anything turnkey helps to entice new users. Rafal. PS. Another reason for using RJ in our use case is the hopeful, assumed promise of a Hadoop-derived scale-out, when needed in the near future. Mixed results so far on that end. -- Rafal Lukawiecki Pardon my brevity, sent from a telephone. On 1 Aug 2013, at 00:09, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Jul 31, 2013 at 4:06 PM, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Many thanks, I'll report the issue, when I figure out where. :) I can help with that! https://issues.apache.org/jira/browse/MAHOUT
RE: How to use SSVD output to generate Clusters
Thanks Ted, Dmitriy. I'll check the Spectral Clustering as well as the PCA option, but first I want to execute it once with the normal approach. Here is what I am doing with Mahout 0.7:

1. seqdirectory: ~/mahout-distribution-0.7/bin/mahout seqdirectory -i /stuti/SSVD/ClusteringInput -o /stuti/SSVD/data-seq
2. seq2sparse: ~/mahout-distribution-0.7/bin/mahout seq2sparse -i /stuti/SSVD/data-seq -o /stuti/SSVD/data-vectors -s 5 -ml 50 -nv -ng 3 -n 2 -x 70
3. ssvd: ~/mahout-distribution-0.7/bin/mahout ssvd -i /stuti/SSVD/data-vectors/tf-vectors -o /stuti/SSVD/Output -k 10 -U true -V true --reduceTasks 1
4. kmeans, with U as input: ~/mahout-distribution-0.7/bin/mahout kmeans -i /stuti/SSVD/Output/U -c /stuti/intial-centroids -o /stuti/SSVD/Cluster/kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 20 -cl -k 10
5. clusterdump: ~/mahout-distribution-0.7/bin/mahout clusterdump -dt sequencefile -i /stuti/SSVD/Cluster/kmeans-clusters/clusters-*-final -d /stuti/SSVD/data-vectors/dictionary.file-* -o ~/ClusterOutput/SSVD/KMeans_10 -p /stuti/SSVD/Cluster/kmeans-clusters/clusteredPoints -n 10 -b 200 -of CSV

Output: Normally when I use clusterdump with the CSV option I receive the cluster ID and the associated document names, but this time I'm getting output like:

120,_0_-0.06453357851086772_1_-0.11705342976172932_2_0.04432960668756471_3_0.10046604725589514_4_-0.06602768838676538_5_-0.16253383395031692_6_-0.0042184763959784155_7_0.03321981657725734_8_-0.04904708660966478_9_0.015635264416337353_, ...

I think there is a problem because of NamedVector, as after some searching I found this JIRA: https://issues.apache.org/jira/browse/MAHOUT-1067

My queries:
1. Is the process I'm following correct? Should U be fed directly as input to the clustering algorithm?
2. Is the output issue because of NamedVector? If yes, will the issue be resolved if I use Mahout 0.8?
3. I'm confused between the parameter -k in SSVD and -k in clustering (k-means). How are these different? -k in clustering means the number of clusters to be created; what is the purpose of -k (rank) in SSVD? (My apologies, but I am having some trouble grasping the SSVD algorithm; the concept of rank is not clear to me.)
4. If I set -k 100 in SSVD, will I still be able to create, say, 10 clusters using clustering on this data?

Thanks, Stuti Awasthi

-----Original Message----- From: Dmitriy Lyubimov [mailto:dlie...@gmail.com] Sent: Wednesday, July 31, 2013 11:15 PM To: user@mahout.apache.org Subject: Re: How to use SSVD output to generate Clusters

Many people also use the PCA option workflow with SSVD and then try to clusterize the output U*Sigma, which is a dimensionally reduced representation of the original row-wise dataset. To enable PCA and U*Sigma output, use ssvd -pca true -us true -u false -v false -k=... -q=1 ... -q=1 is recommended for accuracy.

On Wed, Jul 31, 2013 at 5:09 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi All, I wanted to group documents with the same context, which belong to one single domain, together. I have tried the KMeans and LDA implementations provided in Mahout to perform the clustering, but the groups which are generated are not very good. Hence I thought to use LSA to identify the context related to the words and then perform the clustering. I am able to run Mahout's SSVD and it generated 3 outputs: Sigma, U, V. I am not sure how to feed the output of SSVD to the clustering algorithm so that we can generate clusters of the documents which might be talking about the same context. Any pointers on how I can achieve this?
Regards, Stuti Awasthi
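On query 3, a brief editorial aside: the -k in SSVD is the rank of a low-rank approximation, not a cluster count. In standard truncated-SVD notation,

A \approx U_k \Sigma_k V_k^T, \qquad U_k \in \mathbb{R}^{m \times k}, \quad \Sigma_k \in \mathbb{R}^{k \times k}, \quad V_k \in \mathbb{R}^{n \times k},

so each of the m documents becomes a k-dimensional row of U_k (or of U_k \Sigma_k in the PCA workflow). That k only fixes the dimensionality of the reduced space; the -k of k-means then chooses how many clusters to form inside that space, so -k 100 in SSVD followed by 10 clusters in k-means is perfectly consistent (this also answers query 4).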
Re: How to use SSVD output to generate Clusters
Maybe someone can clarify this issue, but the spectral clustering implementation assumes an affinity graph, am I correct? Are there direct ways of going from a list of feature vectors to an affinity matrix in order to then implement spectral clustering?

On Thu, Aug 1, 2013 at 8:49 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Thanks Ted, Dmitriy. I'll check the Spectral Clustering as well as the PCA option, but first I want to execute it once with the normal approach. Here is what I am doing with Mahout 0.7: 1. seqdirectory: ~/mahout-distribution-0.7/bin/mahout seqdirectory -i /stuti/SSVD/ClusteringInput -o /stuti/SSVD/data-seq 2. seq2sparse: ~/mahout-distribution-0.7/bin/mahout seq2sparse -i /stuti/SSVD/data-seq -o /stuti/SSVD/data-vectors -s 5 -ml 50 -nv -ng 3 -n 2 -x 70 3. ssvd: ~/mahout-distribution-0.7/bin/mahout ssvd -i /stuti/SSVD/data-vectors/tf-vectors -o /stuti/SSVD/Output -k 10 -U true -V true --reduceTasks 1 4. kmeans, with U as input: ~/mahout-distribution-0.7/bin/mahout kmeans -i /stuti/SSVD/Output/U -c /stuti/intial-centroids -o /stuti/SSVD/Cluster/kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 20 -cl -k 10 5. clusterdump: ~/mahout-distribution-0.7/bin/mahout clusterdump -dt sequencefile -i /stuti/SSVD/Cluster/kmeans-clusters/clusters-*-final -d /stuti/SSVD/data-vectors/dictionary.file-* -o ~/ClusterOutput/SSVD/KMeans_10 -p /stuti/SSVD/Cluster/kmeans-clusters/clusteredPoints -n 10 -b 200 -of CSV Output: Normally when I use clusterdump with the CSV option I receive the cluster ID and the associated document names, but this time I'm getting output like: 120,_0_-0.06453357851086772_1_-0.11705342976172932_2_0.04432960668756471_3_0.10046604725589514_4_-0.06602768838676538_5_-0.16253383395031692_6_-0.0042184763959784155_7_0.03321981657725734_8_-0.04904708660966478_9_0.015635264416337353_, ... I think there is a problem because of NamedVector, as after some searching I found this JIRA: https://issues.apache.org/jira/browse/MAHOUT-1067 My queries: 1. Is the process I'm following correct? Should U be fed directly as input to the clustering algorithm? 2. Is the output issue because of NamedVector? If yes, will the issue be resolved if I use Mahout 0.8? 3. I'm confused between the parameter -k in SSVD and -k in clustering (k-means). How are these different? -k in clustering means the number of clusters to be created; what is the purpose of -k (rank) in SSVD? (My apologies, but I am having some trouble grasping the SSVD algorithm; the concept of rank is not clear to me.) 4. If I set -k 100 in SSVD, will I still be able to create, say, 10 clusters using clustering on this data? Thanks, Stuti Awasthi -----Original Message----- From: Dmitriy Lyubimov [mailto:dlie...@gmail.com] Sent: Wednesday, July 31, 2013 11:15 PM To: user@mahout.apache.org Subject: Re: How to use SSVD output to generate Clusters Many people also use the PCA option workflow with SSVD and then try to clusterize the output U*Sigma, which is a dimensionally reduced representation of the original row-wise dataset. To enable PCA and U*Sigma output, use ssvd -pca true -us true -u false -v false -k=... -q=1 ... -q=1 is recommended for accuracy. On Wed, Jul 31, 2013 at 5:09 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi All, I wanted to group documents with the same context, which belong to one single domain, together. I have tried the KMeans and LDA implementations provided in Mahout to perform the clustering, but the groups which are generated are not very good.
Hence I thought to use LSA to identify the context related to the words and then perform the clustering. I am able to run Mahout's SSVD and it generated 3 outputs: Sigma, U, V. I am not sure how to feed the output of SSVD to the clustering algorithm so that we can generate clusters of the documents which might be talking about the same context. Any pointers on how I can achieve this? Regards, Stuti Awasthi
CHEMDNER CFP and training data
CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name recognition task (see http://www.biocreative.org/tasks/biocreative-iv/chemdner)

(1) The CHEMDNER task (part of the BioCreative IV competition) is a community challenge on named entity recognition of chemical compounds. The goal of this task is to promote the implementation of systems that are able to detect mentions of chemical compounds and drugs in text.

(2) The datasets relevant to the CHEMDNER tasks will all be listed under the following link: http://www.biocreative.org/resources/corpora/bc-iv-chemdner-corpus The CHEMDNER training set is now online, together with an updated version of the annotation guidelines. They are available at: http://www.biocreative.org/media/store/files/2013/CHEMDNER_TRAIN_V01.zip

(3) Dates: Please note the following CHEMDNER schedule, in particular the test set prediction due date.
25th June: sample data collection, detailed task description, annotation and evaluation script
31st July: training data collection, annotations and updated guidelines
16th August: development data annotations
3rd September: test set release
12th September: test set prediction due
17th September: invite teams for workshop presentation talks
19th September: CHEMDNER workshop proceedings paper due (2-4 pages)
7th-9th October: BioCreative IV workshop http://www.biocreative.org/events/biocreative-iv/workshop/

(4) Frequently asked questions (FAQ). Considering the numerous questions we have received from various teams (many of them related to the CDI subtask ranking), we have placed a FAQ document online at: http://www.biocreative.org/media/store/files/2013/chemdner_faq.pdf

(5) Evaluation workshop. The evaluation results will be presented at this workshop and in the corresponding workshop proceedings evaluation paper. This is in line with other community challenges such as the Critical Assessment of protein Structure Prediction (CASP) experiments. There will be a session devoted to the CHEMDNER task during the workshop, as well as a poster session and selected talks from participating teams. The link for the workshop, as well as additional details and registration, is online at: http://www.biocreative.org/events/biocreative-iv/workshop

(6) Workshop proceedings. The second volume of the BioCreative workshop proceedings will be devoted entirely to the CHEMDNER task. The proceedings papers for the CHEMDNER task should be 2-4 pages long, describing your system and the results obtained for the training or development set (or both). For more details refer to: http://www.biocreative.org/events/biocreative-iv/workshop/#proceedings

(7) CHEMDNER special issue publications. There will be a journal issue devoted to BioCreative IV and also one for the CHEMDNER task. This is in line with previous BioCreative challenges, where special issues were published in BMC Bioinformatics, Genome Biology and the journal Database. We will announce more details on the selection process and the target journal after the workshop.

Martin Krallinger
Structural Computational Biology Group
Structural Biology and BioComputing Programme
Spanish National Cancer Research Centre (CNIO)
k-means issues
So I've got 13000 text files representing topics in certain newspaper articles. Each file is just a tab-separated list of topics (so something like "china japan senkaku dispute" or "italy lampedusa immigration"). I want to run k-means clustering on them. Here's what I do (I'm actually doing it on a subset of 100 files):

1) run seqdirectory to produce a sequence file from the raw text files
2) run seq2sparse to produce vectors from the sequence file (if I run seqdumper on tfidf-vectors/part-r-0 I get something like Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476} and if I run it on dictionary.file-0 I get Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable Key: china: Value: 0 Key: japan: Value: 1)
3) run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o mahout/kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters mahout/tmp)

The first thing I notice here is that it logs: INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {} The "Input Vectors: {}" part puzzles me. Even worse, this doesn't seem to create the clusteredPoints directory at all. What am I doing wrong?
Re: Question for RecommenderJob
Which version of Mahout are you using? Did you check the output? Are you sure that no errors occur? Best, Sebastian

On 01.08.2013 09:59, hahn jiang wrote: Hi all, I have a question about using RecommenderJob for item-based recommendation. My input data format is userid,itemid,1, so I set the booleanData option to true. There are about 9,000,000 users but only 200 items. When I run the RecommenderJob, the result is null. I have tried many times with different arguments, but the result is always null. This is one of my commands. Could you tell me why it is null, please? bash recommender-job.sh --input input/user-item-value --output output/recommender --numRecommendations 10 --similarityClassname SIMILARITY_PEARSON_CORRELATION --maxSimilaritiesPerItem 300 --maxPrefsPerUser 300 --minPrefsPerUser 1 --maxPrefsPerUserInItemSimilarity 1000 --booleanData true Thanks
Re: k-means issues
Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too :)). You need to specify the clustering option -cl in your kmeans command.

From: Marco zentrop...@yahoo.co.uk To: user@mahout.apache.org user@mahout.apache.org Sent: Thursday, August 1, 2013 9:55 AM Subject: k-means issues

So I've got 13000 text files representing topics in certain newspaper articles. Each file is just a tab-separated list of topics (so something like "china japan senkaku dispute" or "italy lampedusa immigration"). I want to run k-means clustering on them. Here's what I do (I'm actually doing it on a subset of 100 files): 1) run seqdirectory to produce a sequence file from the raw text files 2) run seq2sparse to produce vectors from the sequence file (if I run seqdumper on tfidf-vectors/part-r-0 I get something like Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476} and if I run it on dictionary.file-0 I get Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable Key: china: Value: 0 Key: japan: Value: 1) 3) run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o mahout/kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters mahout/tmp) The first thing I notice here is that it logs: INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {} The "Input Vectors: {}" part puzzles me. Even worse, this doesn't seem to create the clusteredPoints directory at all. What am I doing wrong?
Re: Why is Lanczos deprecated?
IIRC the main reasons for deprecating Lanczos were that, in contrast to SSVD, it does not use a constant number of MapReduce jobs, and that our implementation has the constraint that all the resulting vectors have to fit into the memory of the driver machine. Best, Sebastian

On 01.08.2013 12:15, Fernando Fernández wrote: Hi everyone, Sorry if I duplicate the question, but I've been looking for an answer and I haven't found an explanation other than that it's not being used (together with some other algorithms). If it's been discussed in depth before, maybe you can point me to some link with the discussion. I have successfully used Lanczos in several projects, and it's been a surprise to me to find that the main reason (according to what I've read, which might not be the full story) is that it's not being used. At the beginning I supposed it was because SSVD is supposed to be much faster with similar results, but after making some tests I have found that running times are similar or even worse than Lanczos for some configurations (I have tried several combinations of parameters, given child processes enough memory, etc., and had no success in running SSVD in at least 3/4 of the time Lanczos takes, though there might be some combinations of parameters I have still not tried). It seems to be quite tricky to find a good combination of parameters for SSVD, and I have also seen a precision loss in some examples that makes me not confident in migrating from Lanczos to SSVD for now (how far can I trust results from a combination of parameters that runs in significantly less time, or at least a good time?). Can someone convince me that SSVD is actually a better option than Lanczos? (I'm totally willing to be convinced... :) ) Thank you very much in advance. Fernando.
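An editorial note on why the job counts differ (standard algorithmic background, not from the thread): Lanczos builds its Krylov basis sequentially, one matrix-vector product per step,

\beta_j q_{j+1} = A^T A\, q_j - \alpha_j q_j - \beta_{j-1} q_{j-1},

so extracting k singular vectors takes on the order of k dependent passes over the data, each a separate MapReduce job, while SSVD touches the matrix only a small constant number of times (one random-projection pass plus q power iterations) regardless of k.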
Re: k-means issues
OK, I did put -cl and got clusteredPoints, but then I run clusterdump and always get "Wrote 0 clusters".

----- Original Message ----- From: Suneel Marthi suneel_mar...@yahoo.com To: user@mahout.apache.org user@mahout.apache.org; Marco zentrop...@yahoo.co.uk Cc: Sent: Thursday, 1 August 2013 16:04 Subject: Re: k-means issues

Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too :)). You need to specify the clustering option -cl in your kmeans command.

From: Marco zentrop...@yahoo.co.uk To: user@mahout.apache.org user@mahout.apache.org Sent: Thursday, August 1, 2013 9:55 AM Subject: k-means issues

So I've got 13000 text files representing topics in certain newspaper articles. Each file is just a tab-separated list of topics (so something like "china japan senkaku dispute" or "italy lampedusa immigration"). I want to run k-means clustering on them. Here's what I do (I'm actually doing it on a subset of 100 files): 1) run seqdirectory to produce a sequence file from the raw text files 2) run seq2sparse to produce vectors from the sequence file (if I run seqdumper on tfidf-vectors/part-r-0 I get something like Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476} and if I run it on dictionary.file-0 I get Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable Key: china: Value: 0 Key: japan: Value: 1) 3) run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o mahout/kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters mahout/tmp) The first thing I notice here is that it logs: INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {} The "Input Vectors: {}" part puzzles me. Even worse, this doesn't seem to create the clusteredPoints directory at all. What am I doing wrong?
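When clusterdump reports zero clusters, it can help to read the output sequence files directly. An editorial sketch for Mahout 0.7 (in 0.8 the value class written by -cl is WeightedPropertyVectorWritable, so substitute that if needed); the part-file name is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;

public class DumpClusteredPoints {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Placeholder part file; list the directory for the actual name.
    Path path = new Path("mahout/kmeans-clusters/clusteredPoints/part-m-00000");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable clusterId = new IntWritable();              // key: assigned cluster
    WeightedVectorWritable point = new WeightedVectorWritable(); // value: the point
    while (reader.next(clusterId, point)) {
      System.out.println(clusterId.get() + "\t" + point.getVector());
    }
    reader.close();
  }
}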
Re: Modify number of mappers for a mahout process?
One trick to getting more mappers on a job when running from the command line is to pass a '-Dmapred.max.split.size=' argument. The value is a size in bytes. So if you have some hypothetical 10MB input set, but you want to force ~100 mappers, use '-Dmapred.max.split.size=1000000'.

On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit galp...@ebay.com wrote: Hi, It sounds to me like this could be related to one of the questions I've posted several days ago (is it?): My Mahout clustering processes seem to be running very slow (several good hours on just ~1M items), and I'm wondering if there's anything that needs to be changed in settings/configuration (and how?). I'm running on a large cluster and could potentially use thousands of nodes (mappers/reducers). However, my Mahout processes (kmeans/canopy...) are only using at most 5 mappers (I tried it on several data sets). I've tried to define the number of mappers by something like -Dmapred.map.tasks=100, but this didn't seem to have an effect; it still only uses <= 5 mappers. Is there a different way to set the number of mappers/reducers for a Mahout process? Or is there another configuration issue I need to consider? I'd definitely be happy to use such a parameter, does it not exist? (I'm running Mahout as installed on the cluster.) Is there currently a workaround, besides running a Mahout jar as a Hadoop job? When I originally tried to run a Mahout jar that uses KMeansDriver (and that runs great on my local machine), it did not even initiate a job on the Hadoop cluster. It seemed to be running in parallel, but in fact it was running only on the local node. Is this a known issue? Is there a fix for this? (I ended up dropping it and calling Mahout step by step from the command line, but I'd be happy to know if there is a fix for this.) Thanks, Galit.

-----Original Message----- From: Ryan Josal [mailto:rjo...@gmail.com] Sent: Monday, July 29, 2013 9:33 PM To: Adam Baron Cc: Ryan Josal; user@mahout.apache.org Subject: Re: Run more than one mapper for TestForest? If you're running Mahout from the CLI, you'll have to modify the Hadoop config file or your env manually for each job. This is code I put into my custom job executions so I didn't have to calculate and set that up every time. Maybe that's your best route in that position. You could just provide your own Mahout jar and run it as you would any other Hadoop job and ignore the installed Mahout. I do think this could be a useful parameter for a number of standard Mahout jobs though; I know I would use it. Does anyone in the Mahout community see this as a generally useful feature for a Mahout job? Ryan

On Jul 29, 2013, at 10:25, Adam Baron adam.j.ba...@gmail.com wrote: Ryan, Thanks for the fix, the code looks reasonable to me. Which version of Mahout will this be in? 0.9? Unfortunately, I'm using a large shared Hadoop cluster which is not administered by my team. So I'm not in a position to push the latest from the Mahout dev trunk into our environment; the admins will only install official releases. Regards, Adam

On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal r...@josal.com wrote: Late reply, but for what it's still worth, since I've seen a couple of other threads here on the topic of too few mappers, I added a parameter to set a minimum number of mappers. Some of my Mahout jobs needed more mappers, but were not given many because of the small input file size.
addOption("minMapTasks", "m", "Minimum number of map tasks to run", String.valueOf(1));

int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
if (minMapTasks > mapTasksThatWouldRun) {
  String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
  log.info("Forcing mapred.max.split.size to " + splitSizeBytes
      + " to ensure minimum map tasks = " + minMapTasks);
  hadoopConf.set("mapred.max.split.size", splitSizeBytes);
}

// there is actually a private method in hadoop to calculate this
private long getSplitSize() {
  long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
  long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
  int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
  long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
  log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
      minSize, blockSize, maxSize, splitSize));
  return splitSize;
}

It seems like there should be a more straightforward way to do this, but it works for me and I've used it on a lot of jobs to set a minimum number of mappers. Ryan On Jul 5, 2013,
Re: Modify number of mappers for a mahout process?
Oops, I'm sorry. I had one too many zeros there; it should be '-Dmapred.max.split.size=100000'. Just (input size)/(desired number of mappers).
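To make the arithmetic explicit (an editorial addition): for a 10MB input and a target of ~100 mappers,

\text{split size} = \frac{10 \times 2^{20}\ \text{bytes}}{100} \approx 104{,}857\ \text{bytes},

so a round value of 100000 bytes yields roughly 100 map tasks.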
Re: How to use SSVD output to generate Clusters
On Thu, Aug 1, 2013 at 5:49 AM, Stuti Awasthi stutiawas...@hcl.com wrote: I think there is a problem because of NamedVector, as after some searching I found this JIRA: https://issues.apache.org/jira/browse/MAHOUT-1067 Note also that this bug is fixed in 0.8.
Re: How to use SSVD output to generate Clusters
The original motivation of spectral clustering talks about graphs. But the idea of clustering the reduced-dimension form of a matrix simply depends on the fact[1] that the metric is approximately preserved by the reduced form, and is thus applicable to any matrix. [1] Johnson-Lindenstrauss yet again.

On Thu, Aug 1, 2013 at 6:22 AM, Chirag Lakhani clakh...@zaloni.com wrote: Maybe someone can clarify this issue, but the spectral clustering implementation assumes an affinity graph, am I correct? Are there direct ways of going from a list of feature vectors to an affinity matrix in order to then implement spectral clustering? On Thu, Aug 1, 2013 at 8:49 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Thanks Ted, Dmitriy. I'll check the Spectral Clustering as well as the PCA option, but first I want to execute it once with the normal approach. Here is what I am doing with Mahout 0.7: 1. seqdirectory: ~/mahout-distribution-0.7/bin/mahout seqdirectory -i /stuti/SSVD/ClusteringInput -o /stuti/SSVD/data-seq 2. seq2sparse: ~/mahout-distribution-0.7/bin/mahout seq2sparse -i /stuti/SSVD/data-seq -o /stuti/SSVD/data-vectors -s 5 -ml 50 -nv -ng 3 -n 2 -x 70 3. ssvd: ~/mahout-distribution-0.7/bin/mahout ssvd -i /stuti/SSVD/data-vectors/tf-vectors -o /stuti/SSVD/Output -k 10 -U true -V true --reduceTasks 1 4. kmeans, with U as input: ~/mahout-distribution-0.7/bin/mahout kmeans -i /stuti/SSVD/Output/U -c /stuti/intial-centroids -o /stuti/SSVD/Cluster/kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 20 -cl -k 10 5. clusterdump: ~/mahout-distribution-0.7/bin/mahout clusterdump -dt sequencefile -i /stuti/SSVD/Cluster/kmeans-clusters/clusters-*-final -d /stuti/SSVD/data-vectors/dictionary.file-* -o ~/ClusterOutput/SSVD/KMeans_10 -p /stuti/SSVD/Cluster/kmeans-clusters/clusteredPoints -n 10 -b 200 -of CSV Output: Normally when I use clusterdump with the CSV option I receive the cluster ID and the associated document names, but this time I'm getting output like: 120,_0_-0.06453357851086772_1_-0.11705342976172932_2_0.04432960668756471_3_0.10046604725589514_4_-0.06602768838676538_5_-0.16253383395031692_6_-0.0042184763959784155_7_0.03321981657725734_8_-0.04904708660966478_9_0.015635264416337353_, ... I think there is a problem because of NamedVector, as after some searching I found this JIRA: https://issues.apache.org/jira/browse/MAHOUT-1067 My queries: 1. Is the process I'm following correct? Should U be fed directly as input to the clustering algorithm? 2. Is the output issue because of NamedVector? If yes, will the issue be resolved if I use Mahout 0.8? 3. I'm confused between the parameter -k in SSVD and -k in clustering (k-means). How are these different? -k in clustering means the number of clusters to be created; what is the purpose of -k (rank) in SSVD? (My apologies, but I am having some trouble grasping the SSVD algorithm; the concept of rank is not clear to me.) 4. If I set -k 100 in SSVD, will I still be able to create, say, 10 clusters using clustering on this data? Thanks, Stuti Awasthi -----Original Message----- From: Dmitriy Lyubimov [mailto:dlie...@gmail.com] Sent: Wednesday, July 31, 2013 11:15 PM To: user@mahout.apache.org Subject: Re: How to use SSVD output to generate Clusters Many people also use the PCA option workflow with SSVD and then try to clusterize the output U*Sigma, which is a dimensionally reduced representation of the original row-wise dataset. To enable PCA and U*Sigma output, use ssvd -pca true -us true -u false -v false -k=... -q=1 ... -q=1 is recommended for accuracy.
On Wed, Jul 31, 2013 at 5:09 AM, Stuti Awasthi stutiawas...@hcl.com wrote: Hi All, I wanted to group documents with the same context, which belong to one single domain, together. I have tried the KMeans and LDA implementations provided in Mahout to perform the clustering, but the groups which are generated are not very good. Hence I thought to use LSA to identify the context related to the words and then perform the clustering. I am able to run Mahout's SSVD and it generated 3 outputs: Sigma, U, V. I am not sure how to feed the output of SSVD to the clustering algorithm so that we can generate clusters of the documents which might be talking about the same context. Any pointers on how I can achieve this? Regards, Stuti Awasthi
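For reference, the fact Ted cites is the Johnson-Lindenstrauss lemma (standard statement, added editorially): a suitable random projection f into k = O(\epsilon^{-2} \log n) dimensions satisfies

(1 - \epsilon)\, \lVert x_i - x_j \rVert^2 \;\le\; \lVert f(x_i) - f(x_j) \rVert^2 \;\le\; (1 + \epsilon)\, \lVert x_i - x_j \rVert^2

for all n points simultaneously, with high probability. That is why k-means distances computed on the reduced rows stay close to distances in the original feature space, whether or not the data started as an affinity graph.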
Re: k-means issues
Could you post the command line you are using for clusterdump?

From: Marco zentrop...@yahoo.co.uk To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi suneel_mar...@yahoo.com Sent: Thursday, August 1, 2013 10:29 AM Subject: Re: k-means issues

OK, I did put -cl and got clusteredPoints, but then I run clusterdump and always get "Wrote 0 clusters".

----- Original Message ----- From: Suneel Marthi suneel_mar...@yahoo.com To: user@mahout.apache.org user@mahout.apache.org; Marco zentrop...@yahoo.co.uk Cc: Sent: Thursday, 1 August 2013 16:04 Subject: Re: k-means issues

Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too :)). You need to specify the clustering option -cl in your kmeans command.

From: Marco zentrop...@yahoo.co.uk To: user@mahout.apache.org user@mahout.apache.org Sent: Thursday, August 1, 2013 9:55 AM Subject: k-means issues

So I've got 13000 text files representing topics in certain newspaper articles. Each file is just a tab-separated list of topics (so something like "china japan senkaku dispute" or "italy lampedusa immigration"). I want to run k-means clustering on them. Here's what I do (I'm actually doing it on a subset of 100 files): 1) run seqdirectory to produce a sequence file from the raw text files 2) run seq2sparse to produce vectors from the sequence file (if I run seqdumper on tfidf-vectors/part-r-0 I get something like Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476} and if I run it on dictionary.file-0 I get Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable Key: china: Value: 0 Key: japan: Value: 1) 3) run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o mahout/kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters mahout/tmp) The first thing I notice here is that it logs: INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {} The "Input Vectors: {}" part puzzles me. Even worse, this doesn't seem to create the clusteredPoints directory at all. What am I doing wrong?
Re: Modify number of mappers for a mahout process?
Galit, yes this does sound like it is related, and as Matt said, you can test this by setting the max split size on the CLI. I didn't personally find this to be a reliable and efficient method, so I wrote the -m parameter for my job to set it right every time. It seems that this would be useful to have as a general parameter for Mahout jobs; is there agreement on this, and if so can I get some guidance on how to contribute? Ryan

On Aug 1, 2013, at 8:00, Matt Molek mpmo...@gmail.com wrote: One trick to getting more mappers on a job when running from the command line is to pass a '-Dmapred.max.split.size=' argument. The value is a size in bytes. So if you have some hypothetical 10MB input set, but you want to force ~100 mappers, use '-Dmapred.max.split.size=1000000'.

On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit galp...@ebay.com wrote: Hi, It sounds to me like this could be related to one of the questions I've posted several days ago (is it?): My Mahout clustering processes seem to be running very slow (several good hours on just ~1M items), and I'm wondering if there's anything that needs to be changed in settings/configuration (and how?). I'm running on a large cluster and could potentially use thousands of nodes (mappers/reducers). However, my Mahout processes (kmeans/canopy...) are only using at most 5 mappers (I tried it on several data sets). I've tried to define the number of mappers by something like -Dmapred.map.tasks=100, but this didn't seem to have an effect; it still only uses <= 5 mappers. Is there a different way to set the number of mappers/reducers for a Mahout process? Or is there another configuration issue I need to consider? I'd definitely be happy to use such a parameter, does it not exist? (I'm running Mahout as installed on the cluster.) Is there currently a workaround, besides running a Mahout jar as a Hadoop job? When I originally tried to run a Mahout jar that uses KMeansDriver (and that runs great on my local machine), it did not even initiate a job on the Hadoop cluster. It seemed to be running in parallel, but in fact it was running only on the local node. Is this a known issue? Is there a fix for this? (I ended up dropping it and calling Mahout step by step from the command line, but I'd be happy to know if there is a fix for this.) Thanks, Galit.

-----Original Message----- From: Ryan Josal [mailto:rjo...@gmail.com] Sent: Monday, July 29, 2013 9:33 PM To: Adam Baron Cc: Ryan Josal; user@mahout.apache.org Subject: Re: Run more than one mapper for TestForest? If you're running Mahout from the CLI, you'll have to modify the Hadoop config file or your env manually for each job. This is code I put into my custom job executions so I didn't have to calculate and set that up every time. Maybe that's your best route in that position. You could just provide your own Mahout jar and run it as you would any other Hadoop job and ignore the installed Mahout. I do think this could be a useful parameter for a number of standard Mahout jobs though; I know I would use it. Does anyone in the Mahout community see this as a generally useful feature for a Mahout job? Ryan

On Jul 29, 2013, at 10:25, Adam Baron adam.j.ba...@gmail.com wrote: Ryan, Thanks for the fix, the code looks reasonable to me. Which version of Mahout will this be in? 0.9? Unfortunately, I'm using a large shared Hadoop cluster which is not administered by my team. So I'm not in a position to push the latest from the Mahout dev trunk into our environment; the admins will only install official releases.
Regards, Adam

On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal r...@josal.com wrote: Late reply, but for what it's still worth, since I've seen a couple of other threads here on the topic of too few mappers, I added a parameter to set a minimum number of mappers. Some of my Mahout jobs needed more mappers, but were not given many because of the small input file size.

addOption("minMapTasks", "m", "Minimum number of map tasks to run", String.valueOf(1));

int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
if (minMapTasks > mapTasksThatWouldRun) {
  String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
  log.info("Forcing mapred.max.split.size to " + splitSizeBytes
      + " to ensure minimum map tasks = " + minMapTasks);
  hadoopConf.set("mapred.max.split.size", splitSizeBytes);
}

// there is actually a private method in hadoop to calculate this
private long getSplitSize() {
  long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
  long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
Re: Setting up a recommender
Not following, so here is what I've done in probably too much detail:

1) ingest raw log files and split them up by action
2) turn these into Mahout preference files using Mahout-style IDs, keeping a map of IDs
3) run the Mahout item-based recommender using LLR for similarity
4) create a Mahout-style cross-recommender using cooccurrence similarity using matrix math
5) given two similarity matrices and a user history matrix, write them to CSV files with Mahout IDs replaced by the original string external IDs for users and items

Input log file before splitting:

u1 purchase iphone
u1 purchase ipad
u2 purchase nexus-tablet
u2 purchase galaxy
u3 purchase surface
u4 purchase iphone
u4 purchase ipad
u1 view iphone
u1 view ipad
u1 view nexus-tablet
u1 view galaxy
u2 view iphone
u2 view ipad
u2 view nexus-tablet
u2 view galaxy
u3 view surface
u4 view iphone
u4 view ipad
u4 view nexus-tablet

Input user history DRM B, after ID translation to Mahout IDs and splitting for action "purchase":

user/item  iphone  ipad  nexus-tablet  galaxy  surface
u1         1       1     0             0       0
u2         0       0     1             1       0
u3         0       0     0             0       1
u4         1       1     0             0       0

Map of IDs, Mahout to original/external:

0 - iphone
1 - ipad
2 - nexus-tablet
3 - galaxy
4 - surface

To be specific, the DRM from the RecommenderJob with item-item similarities using LLR looks like this:

Input Path: out/p-recs/sims/part-r-0
Key class: class org.apache.hadoop.io.IntWritable
Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {1:0.8472157541208549}
Key: 1: Value: {0:0.8472157541208549}
Key: 2: Value: {3:0.8181382096075936}
Key: 3: Value: {2:0.8181382096075936}
Key: 4: Value: {}

This will be written to a directory for later Solr indexing as a CSV of the form:

item_id,similar_items,cross_action_similar_items
iphone,ipad,
ipad,iphone,
nexus-tablet,galaxy,
galaxy,nexus-tablet,
surface,,

By using a user's history vector as a query, you get results that are recommendations. So if the user is u1, the history vector is: iphone ipad. The Solr results for the query "iphone ipad" against the field similar_items will be:

1. Doc ID, ipad
2. Doc ID, iphone

If you want item similarities, for instance if a user is anonymous with no history and is looking at an iphone product page, you would fetch the doc for id = iphone and get: ipad. Perhaps a bad example for ordering, since there is only one ID in the doc, but the items in the similar_items field would be ordered by similarity strength. Likewise for the cross-action similarities, though the matrix will have cooccurrence [B'A] values in the DRM. For item similarities there is no need to do more than fetch the one doc that contains the similarities, right? I've successfully used this method with the Mahout recommender, but please correct me if something above is wrong.

On Jul 31, 2013, at 4:52 PM, Ted Dunning ted.dunn...@gmail.com wrote: Pat, See inline. On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote: So the XML as CSV would be:

item_id,similar_items,cross_action_similar_items
ipad,iphone,iphone nexus
iphone,ipad,ipad galaxy

Right. Doesn't matter what format. Might want quotes around space-delimited lists, but anything will do. Note: As I mentioned before, the order of the items in the field will encode the rank of the similarity strength. This is for cases where you want to find items similar to a context item. You would fetch the doc for the context item by its item ID and show the top k items in the doc. Ted's caveat would probably be to dither them. I always say dither, so that is an easy one.
But fetching similar items of a center item by fetching the center item and then fetching each of the referenced items is typically slower by about 2x than running the search for mentions of the center item.

Sounds like Ted is generating data. Andrew or M Lyon, do either of you want to set the demo system up? If so you'll need to find a system: free-tier AWS, Ted's box, etc. Then install all the needed stuff. I'll get the output working to CSV.

On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote: OK and yes. The docs will look like:

<add>
  <doc>
    <field name='item_id'>ipad</field>
    <field name='similar_items'>iphone</field>
    <field name='cross_action_similar_items'>iphone nexus</field>
  </doc>
  <doc>
    <field name='item_id'>iphone</field>
    <field name='similar_items'>ipad</field>
    <field name='cross_action_similar_items'>ipad galaxy</field>
  </doc>
</add>

On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote: I'm interested in helping as well. Btw I thought that what was stored in the solr fields were the llr-filtered items (ids I guess) for the
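An editorial sketch of the query side of this design, using SolrJ 4.x against a placeholder core URL, with the similar_items field indexed as described above; the user's history items simply become the query terms:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class HistoryQuery {
  public static void main(String[] args) throws Exception {
    // Placeholder URL for the Solr core holding the item docs.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
    // u1's history vector "iphone ipad" used as the query against similar_items.
    SolrQuery query = new SolrQuery("similar_items:(iphone ipad)");
    query.setRows(10); // top-10 recommendations
    QueryResponse rsp = solr.query(query);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("item_id"));
    }
  }
}

In production you would likely also filter the user's own history items out of the results, for example with a negative filter query per item.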
Re: RecommenderJob Recommending an Item Already Preferred by a User
Hi Sebastian, I've rechecked the results and I'm afraid that the issue has not gone away, contrary to my enthusiastic response yesterday. Using 0.8 I have retested with and without the --maxPrefsPerUser 9000 parameter (no user has more than 5000 prefs). I have also supplied the prefs file, without the preference value, that is as user,item (one per line), as a --filterFile, with and without --maxPrefsPerUser, and I am afraid we are still seeing recommendations for items the user has expressed a prior preference for. I suppose I need to file a bug report. Rafal -- Rafal Lukawiecki Pardon my brevity, sent from a telephone.

On 31 Jul 2013, at 22:35, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Dear Sebastian, It looks like setting --maxPrefsPerUser 10000 has resolved the issue in our case. It seems that the most preferences a user had was just about 5000, so I doubled it just in case, but when I operationalise this model, I will make sure to calculate the actual max number of preferences and set the parameter accordingly. I will double-check the resultset to make sure the issue is really gone, as I have only checked the few cases where we had spotted a recommendation of a previously preferred item. Would you like me to file a bug, and would you like me to test it on 0.8 or another version? I am using 0.7. Thanks for your kind support. Rafal -- Rafal Lukawiecki Strategic Consultant and Director Project Botticelli Ltd

On 31 Jul 2013, at 06:22, Sebastian Schelter ssc.o...@googlemail.com wrote: Hi Rafal, can you try to set the option --maxPrefsPerUser to the maximum number of interactions per user and see if you still get the error? Best, Sebastian

On 30.07.2013 19:29, Rafal Lukawiecki wrote: Thank you Sebastian. The data set is not that large, as we are running tests on a subset. It is about 24k users, 40k items, and the preference file has 65k preferences as triples. This was using Similarity Cooccurrence. I can see if I could anonymise the data set to share, if that would be helpful. Thanks for your kind help. Rafal -- Rafal Lukawiecki Pardon my brevity, sent from a telephone.

On 30 Jul 2013, at 18:18, Sebastian Schelter s...@apache.org wrote: Hi Rafal, can you issue a ticket for this problem at https://issues.apache.org/jira/browse/MAHOUT ? We have unit tests that check whether this happens, and currently they work fine. I can only imagine that the problem occurs in larger datasets where we sample the data in some places. Can you describe a scenario/dataset where this happens? Best, Sebastian

2013/7/30 Rafal Lukawiecki ra...@projectbotticelli.com I'm new here, just registered. Many thanks to everyone for working on an amazing piece of software, thank you for building Mahout and for your support. My apologies if this is not the right place to ask the question. I have searched for the issue, and I can see this problem has been reported here: http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items Unfortunately, the trail leads to the newsgroups, and I have not found a way, yet, to get an answer from them, without asking you. Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7, and I am finding that it is recommending items that the user has already expressed a preference for in their input file. I understand that this should not be happening, and I am not sure if there is a known fix or if I should be looking for a workaround (such as using the entire input as the filterFile).
I will double-check that there is no error on my side, but so far it does not seem that way. Many thanks and my regards from Ireland, Rafal Lukawiecki -- Rafal Lukawiecki Strategic Consultant and Director Project Botticelli Ltd
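An editorial sketch that may help verify the report before filing it: RecommenderJob writes one line per user in the form userID<TAB>[itemID:score,itemID:score,...], so the overlap with the input preferences can be checked with a few lines of plain Java. File names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class OverlapCheck {
  public static void main(String[] args) throws Exception {
    // Load user,item[,pref] triples into a map of known preferences.
    Map<String, Set<String>> prefs = new HashMap<String, Set<String>>();
    BufferedReader in = new BufferedReader(new FileReader("prefs.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.split(",");
      Set<String> items = prefs.get(f[0]);
      if (items == null) {
        items = new HashSet<String>();
        prefs.put(f[0], items);
      }
      items.add(f[1]);
    }
    in.close();

    // Scan recommendations: userID<TAB>[item:score,item:score,...]
    in = new BufferedReader(new FileReader("recommendations.txt"));
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t");
      if (parts.length < 2) continue;
      Set<String> known = prefs.get(parts[0]);
      if (known == null) continue;
      String list = parts[1].substring(1, parts[1].length() - 1); // strip [ ]
      for (String rec : list.split(",")) {
        String item = rec.split(":")[0];
        if (known.contains(item)) {
          System.out.println("user " + parts[0] + " was recommended known item " + item);
        }
      }
    }
    in.close();
  }
}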
Re: RecommenderJob Recommending an Item Already Preferred by a User
Ok, please file a bug report detailing what you've tested and what results you got. Just to clarify, setting maxPrefsPerUser to a high number still does not help? That surprises me.

2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com Hi Sebastian, I've rechecked the results and I'm afraid that the issue has not gone away, contrary to my enthusiastic response yesterday. Using 0.8 I have retested with and without the --maxPrefsPerUser 9000 parameter (no user has more than 5000 prefs). I have also supplied the prefs file, without the preference value, that is as user,item (one per line), as a --filterFile, with and without --maxPrefsPerUser, and I am afraid we are still seeing recommendations for items the user has expressed a prior preference for. I suppose I need to file a bug report. Rafal -- Rafal Lukawiecki Pardon my brevity, sent from a telephone. On 31 Jul 2013, at 22:35, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Dear Sebastian, It looks like setting --maxPrefsPerUser 10000 has resolved the issue in our case. It seems that the most preferences a user had was just about 5000, so I doubled it just in case, but when I operationalise this model, I will make sure to calculate the actual max number of preferences and set the parameter accordingly. I will double-check the resultset to make sure the issue is really gone, as I have only checked the few cases where we had spotted a recommendation of a previously preferred item. Would you like me to file a bug, and would you like me to test it on 0.8 or another version? I am using 0.7. Thanks for your kind support. Rafal -- Rafal Lukawiecki Strategic Consultant and Director Project Botticelli Ltd On 31 Jul 2013, at 06:22, Sebastian Schelter ssc.o...@googlemail.com wrote: Hi Rafal, can you try to set the option --maxPrefsPerUser to the maximum number of interactions per user and see if you still get the error? Best, Sebastian On 30.07.2013 19:29, Rafal Lukawiecki wrote: Thank you Sebastian. The data set is not that large, as we are running tests on a subset. It is about 24k users, 40k items, and the preference file has 65k preferences as triples. This was using Similarity Cooccurrence. I can see if I could anonymise the data set to share, if that would be helpful. Thanks for your kind help. Rafal -- Rafal Lukawiecki Pardon my brevity, sent from a telephone. On 30 Jul 2013, at 18:18, Sebastian Schelter s...@apache.org wrote: Hi Rafal, can you issue a ticket for this problem at https://issues.apache.org/jira/browse/MAHOUT ? We have unit tests that check whether this happens, and currently they work fine. I can only imagine that the problem occurs in larger datasets where we sample the data in some places. Can you describe a scenario/dataset where this happens? Best, Sebastian 2013/7/30 Rafal Lukawiecki ra...@projectbotticelli.com I'm new here, just registered. Many thanks to everyone for working on an amazing piece of software, thank you for building Mahout and for your support. My apologies if this is not the right place to ask the question. I have searched for the issue, and I can see this problem has been reported here: http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items Unfortunately, the trail leads to the newsgroups, and I have not found a way, yet, to get an answer from them, without asking you.
Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7, and I am finding that it is recommending items that the user has already expressed a preference for in their input file. I understand that this should not be happening, and I am not sure if there is a know fix or if I should be looking for a workaround (such as using the entire input as the filterFile). I will double-check that there is no error on my side, but so far it does not seem that way. Many thanks and my regards from Ireland, Rafal Lukawiecki -- Rafal Lukawiecki Strategic Consultant and Director Project Botticelli Ltd
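For reference, a minimal sketch of how the --filterFile input Rafal describes can be derived from the preference file. Per the job's documented options, --filterFile takes comma-separated userID,itemID pairs to exclude from that user's recommendations; the file paths here are hypothetical, and note that in Rafal's tests above this did not fully suppress the duplicates.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    // Hypothetical helper: turn user,item,value input lines into the
    // user,item pairs that RecommenderJob's --filterFile expects.
    public class BuildFilterFile {
      public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        PrintWriter out = new PrintWriter(args[1]);
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split(",");
          if (parts.length < 2) continue;          // skip malformed rows
          out.println(parts[0] + "," + parts[1]);  // drop the preference value
        }
        out.close();
        in.close();
      }
    }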
Re: RecommenderJob Recommending an Item Already Preferred by a User
Should I have set that parameter to a value much, much larger than the maximum number of actually expressed preferences by a user? I'm working on an anonymised data set; if it works as an error test case, I'd be happy to share it for your re-test. I am still hoping it is my error, not Mahout's. Rafal -- Rafal Lukawiecki Pardon brevity, mobile device.
Re: RecommenderJob Recommending an Item Already Preferred by a User
Setting it to the maximum number should be enough. It would be great if you could share your dataset and tests.
Re: Setting up a recommender
On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote: For item similarities there is no need to do more than fetch one doc that contains the similarities, right? I've successfully used this method with the Mahout recommender but please correct me if something above is wrong.

No. First, you need to retrieve all the other documents that are referenced to get their display meta-data, so this isn't just a one-document fetch. Second, the similar items point inwards, not outwards. Thus, the query you want takes the id of the current item and searches the similar_items field. The result of that search is all of the similar items. The confusion here may stem from the name of the field; a name like linked-from-items or some such might help.

Another way to look at this is that there should be no procedural difference between having 10 items or 20 in your history. Either way, your history is a query against the appropriate link fields. Likewise, there should be no difference between having 10 items or 2 items in your history, and there shouldn't be any difference even if you have just 1 item. Finding items similar to a single item is exactly like having 1 item in your history, so it should be done by searching with that one item in the appropriate link fields.
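A minimal SolrJ 4.x sketch of that inward-pointing query; the core URL, the item id "ipad", and the id/title/similar_items field names are assumptions, not an established schema. Each hit carries its own meta-data fields, so no per-item follow-up fetch is needed.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class SimilarItemsQuery {
      public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
        // One search returns every doc whose similar_items field links to
        // the current item, meta-data included.
        SolrQuery query = new SolrQuery("similar_items:ipad");
        query.setRows(10);
        for (SolrDocument doc : solr.query(query).getResults()) {
          System.out.println(doc.getFieldValue("id") + "\t" + doc.getFieldValue("title"));
        }
      }
    }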
Re: Why is Lanczos deprecated?
On Thu, Aug 1, 2013 at 7:08 AM, Sebastian Schelter s...@apache.org wrote: IIRC the main reasons for deprecating Lanczos were that, in contrast to SSVD, it does not use a constant number of MapReduce jobs, and that our implementation has the constraint that all the resulting vectors have to fit into the memory of the driver machine.

While it's true that Lanczos does not use a constant number of MR iterations, the phrase "our implementation" is key in saying we have to hold all the output vectors in memory. This wasn't even a very integral part of our impl. It's fairly simple to implement the linear combinations of the Ritz vectors after the iterations are complete as an operation keeping only 3 vectors in memory at a time; we just never made that optimization.

Best, Sebastian

On 01.08.2013 12:15, Fernando Fernández wrote:
Hi everyone, Sorry if I duplicate the question, but I've been looking for an answer and I haven't found an explanation other than "it's not being used" (together with some other algorithms). If it's been discussed in depth before, maybe you can point me to some link with the discussion. I have successfully used Lanczos in several projects, and it's been a surprise to me to find that the main reason (according to what I've read, which might not be the full story) is that it's not being used. At the beginning I supposed it was because SSVD is supposed to be much faster with similar results, but after making some tests I have found that running times are similar or even worse than Lanczos for some configurations (I have tried several combinations of parameters, given child processes enough memory, etc., and had no success in running SSVD in at least 3/4 of the time Lanczos runs, though there might be some combinations of parameters I have still not tried). It seems to be quite tricky to find a good combination of parameters for SSVD, and I have also seen a precision loss in some examples that makes me not confident about migrating from Lanczos to SSVD for now (how far can I trust results from a combination of parameters that runs in significantly less time, or at least a good time?). Can someone convince me that SSVD is actually a better option than Lanczos? (I'm totally willing to be convinced... :) ) Thank you very much in advance. Fernando.

-- -jake
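A sketch of the optimization Jake describes, assuming the Ritz coefficients are available as a dense array and the Lanczos basis vectors can be streamed from disk; this is the streaming idea only, not the actual Mahout code.

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    // Eigenvector j is the linear combination u_j = sum_i ritz[i][j] * basis_i.
    // Streaming the basis keeps only about three vectors (basis, scaled copy,
    // accumulator) in memory at a time instead of the whole basis.
    public class RitzCombination {
      static Vector combine(int j, double[][] ritz, Iterable<Vector> basisVectors, int dim) {
        Vector accumulator = new DenseVector(dim);
        int i = 0;
        for (Vector basis : basisVectors) {  // e.g. one vector per read from a SequenceFile
          accumulator = accumulator.plus(basis.times(ritz[i][j]));
          i++;
        }
        return accumulator;
      }
    }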
multi-class classification question
Say that I am trying to determine which customers buy particular candy bars, so I want to classify training data consisting of candy bar attributes (an N-dimensional vector of variables) into customer attributes (an M-dimensional vector of customer attributes). Is there a preferred method when N and M are large, say 100 or more? I have done binary classification using AdaptiveLogisticRegression and OnlineLogisticRegression with small numbers of input features, with relative success. As I'm trying to implement this for large N and M, I feel like I'm veering into the woods. Is there a code example anyone can point me to that uses the Mahout libraries to do multi-class classification when the number of classes is large?
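For the flat multi-class case, a minimal sketch with Mahout's SGD classifier; the sizes and hyperparameters are made up. Strictly, predicting an M-dimensional attribute vector is multi-output rather than multi-class, so one such model per attribute (or classes built from attribute combinations) would be needed.

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class MultiClassExample {
      public static void main(String[] args) {
        int numFeatures = 100;  // N, the candy-bar attribute vector
        int numClasses = 500;   // number of target categories; both sizes invented
        OnlineLogisticRegression olr =
            new OnlineLogisticRegression(numClasses, numFeatures, new L1())
                .learningRate(0.1)
                .lambda(1.0e-4);

        // Training: target is a class index in [0, numClasses),
        // features is a Vector of length numFeatures.
        Vector features = new DenseVector(numFeatures);
        features.set(3, 1.0);
        olr.train(42, features);

        // Scoring: classifyFull returns one score per class.
        Vector scores = olr.classifyFull(features);
        System.out.println("best class: " + scores.maxValueIndex());
      }
    }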
Re: k-means issues
mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i mahout/kmeans-clusters/clusters-1-final/part-r-0 -n 20 -b 100 -o cdump.txt -p mahout/kmeans-clusters/clusteredPoints

On Thursday, 1 August 2013 17:24, Suneel Marthi suneel_mar...@yahoo.com wrote:
Could u post the command line u r using for clusterdump?

On Thursday, August 1, 2013 10:29 AM, Marco zentrop...@yahoo.co.uk wrote:
Ok, I did put -cl and got clusteredPoints, but then I do clusterdump and always get "Wrote 0 clusters".

On Thursday, 1 August 2013 16:04, Suneel Marthi suneel_mar...@yahoo.com wrote:
Check examples/bin/cluster-reuters.sh for kmeans (it exists in Mahout 0.7 too :)). You need to specify the clustering option -cl in your kmeans command.

On Thursday, August 1, 2013 9:55 AM, Marco zentrop...@yahoo.co.uk wrote:
So I've got 13000 text files representing topics in certain newspaper articles. Each file is just a tab-separated list of topics (so something like "china japan senkaku dispute" or "italy lampedusa immigration"). I want to run k-means clustering on them. Here's what I do (I'm actually doing it on a subset of 100 files):
1) run seqdirectory to produce a sequence file from the raw text files
2) run seq2sparse to produce vectors from the sequence file (if I do seqdumper on tfidf-vectors/part-r-0 I get something like Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476}, and if I do it on dictionary.file-0 I get Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable Key: china: Value: 0 Key: japan: Value: 1)
3) run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o mahout/kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters mahout/tmp)
The first thing I notice here is that it logs: INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {} The "Input Vectors: {}" part puzzles me. Even worse, this doesn't seem to create the clusteredPoints directory at all. What am I doing wrong?
Re: Setting up a recommender
Sorry to be dense, but I think there is some miscommunication. The most important question is: am I writing the item-item similarity matrix DRM out to Solr, one row = one Solr doc? For the mapreduce Mahout item-based recommender this is in tmp/similarityMatrix. If not, then please stop me. If I'm off base here, maybe a skype or im session will straighten me out: pat.fer...@gmail.com or p...@occamsmachete.com

To be clear, below I'm not talking about history-based recs, which are the primary use case. I am talking about a query that does not use history, that only finds similar items based on training data. The item-item similarity matrix DRM contains Key = item ID, Value = list of item IDs with similarity strengths. This is equivalent to the list returned by ItemBasedRecommender's

public List<RecommendedItem> mostSimilarItems(long itemID, int howMany) throws TasteException

Parameters: itemID - ID of item for which to find most similar other items; howMany - desired number of most similar items to find. Returns: items most similar to the given item, ordered from most similar to least.

To get the list from Solr you would fetch the doc associated with itemID, no? When using the Mahout mapreduce item-based recommender we get the similarity matrix and do just that. We get the row associated with the Mahout itemID and recommend the top k items from the vector. This performs well in cross-validation tests.
Re: k-means issues
You also need to specify the distance measure '-dm' to clusterdump; this is the distance measure that was used for clustering. (Again, look at the example in examples/bin/cluster-reuters.sh - it has all the steps u r trying to accomplish.)
Re: k-means issues
The clustering arguments are usually directories, not files. Try:

mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i mahout/kmeans-clusters/clusters-1-final -n 20 -b 100 -o cdump.txt -p mahout/kmeans-clusters/clusteredPoints
Re: k-means issues
Thanks a lot, will try your suggestions asap. I was sort of following this: http://goo.gl/u8VFZN
Re: k-means issues
Thanks for pointing that out. I corrected the Wiki page.
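Once -cl has produced clusteredPoints, a quick way to sanity-check it is to read the sequence file directly. A sketch assuming Mahout 0.8 class locations (WeightedVectorWritable lives in a different package in 0.7) and a guessed part-file name; in practice, list the directory first.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.clustering.classify.WeightedVectorWritable;

    public class DumpClusteredPoints {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Part-file name is a guess; clusteredPoints is keyed by cluster id.
        Path path = new Path("mahout/kmeans-clusters/clusteredPoints/part-m-00000");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable clusterId = new IntWritable();
        WeightedVectorWritable point = new WeightedVectorWritable();
        while (reader.next(clusterId, point)) {
          System.out.println(clusterId.get() + "\t" + point.getVector());
        }
        reader.close();
      }
    }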
Re: multi-class classification question
I have talked to one user who had ~60,000 classes, and they were able to use OLR with success. The way that they did this was to arrange the output classes into a multi-level tree, and then they trained classifiers at each level of the tree. At any level, if there was a dominating result, then only that sub-tree would be searched; otherwise, all of the top few sub-trees would be searched. Thus, execution would proceed by evaluating the classifier at the root of the tree. One or more sub-trees would be selected, and each of the classifiers at the roots of these sub-trees would be evaluated. This would give a set of sub-sub-trees that eventually bottomed out with possible answers. These possible answers are combined to get a final set of categories. The detailed meanings of "dominating", "top few", and "answers are combined" are left as an exercise, but I think you can see the general outline. The detailed definitions are very likely application-specific in any case.
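A hypothetical sketch of the routing Ted outlines; the node class, the 0.8 "dominating" threshold, and the fan-out of 3 are all invented placeholders, and the per-node models are assumed to be already trained.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    // Node in the category tree: an OLR model routes among children;
    // leaves carry final labels.
    public class CategoryNode {
      OnlineLogisticRegression router;   // trained over this node's children
      List<CategoryNode> children = new ArrayList<CategoryNode>();
      String label;                      // set at leaves only

      static final double DOMINATING = 0.8;  // what counts as "dominating": assumed
      static final int TOP_FEW = 3;          // fan-out when nothing dominates: assumed

      void classify(Vector features, List<String> results) {
        if (children.isEmpty()) {
          results.add(label);
          return;
        }
        Vector scores = router.classifyFull(features);
        int best = scores.maxValueIndex();
        if (scores.get(best) >= DOMINATING) {
          // One sub-tree dominates: search only that branch.
          children.get(best).classify(features, results);
        } else {
          // Otherwise descend into the top few branches.
          Vector remaining = scores.clone();
          for (int n = 0; n < Math.min(TOP_FEW, children.size()); n++) {
            int idx = remaining.maxValueIndex();
            children.get(idx).classify(features, results);
            remaining.set(idx, Double.NEGATIVE_INFINITY);
          }
        }
      }
    }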
Re: Setting up a recommender
On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote: Sorry to be dense but I think there is some miscommunication. The most important question is: am I writing the item-item similarity matrix DRM out to Solr, one row = one Solr doc?

Each row = one *field* in a Solr doc. Different DRMs produce different fields in the same docs. There will also be item meta-data fields in the same docs.

For the mapreduce Mahout item-based recommender this is in tmp/similarityMatrix. If not then please stop me. If I'm off base here, maybe a skype or im session will straighten me out: pat.ferrel@gmail.com or p...@occamsmachete.com

Actually, that is a grand idea. Let's do a hangout. From the who-is-free-when survey (https://docs.google.com/forms/d/1skIaqe0CBWO4qemTyHCZwS40YjXJ9FeLCqwV8cw4Gno/viewform), it looks like lots of people are available tomorrow at 2PM PDT. Would that work?

To be clear below I'm not talking about history based recs, which is the primary use case. I am talking about a query that does not use history, that only finds similar items based on training data. The item-item similarity matrix DRM contains Key = item ID, Value = list of item IDs with similarity strengths.

Yes. I absolutely agree that you can do this. These should, strictly speaking, be columns in the item-item matrix. The item-item matrix may or may not be symmetric. If it is symmetric, then column or row doesn't matter.

This is equivalent to the list returned by ItemBasedRecommender's public List<RecommendedItem> mostSimilarItems(long itemID, int howMany) throws TasteException

Yes.

To get the list from Solr you would fetch the doc associated with itemID, no?

If you store the column, then yes. If you store the row, then using a query on the field containing the similar items is the right answer. The key difference is in what happens in the next step.

When using the Mahout mapreduce item-based recommender we get the similarity matrix and do just that. We get the row associated with the Mahout itemID and recommend the top k items from the vector. This performs well in cross-validation tests.

Good. I think that there is a row/column confusion here, but they are probably nearly identical in your application. The key point is what happens *after* you do the query that you are suggesting. In your case, you have to retrieve the meta-data associated with each of the related items. I like to store this meta-data in a Solr field (or three), so this involves at least one additional query. You can automatically chain this second query by using the join operation that Solr provides, but the second query still happens. If you do the query the way that I suggest, this second query doesn't need to happen. You get the meta-data directly.
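The chained second query Ted mentions can use Solr's {!join} query parser; a sketch with the same hypothetical schema as earlier, where the inner query selects the current item's doc and the join follows its similar_items values to the similar items' own docs, meta-data included, in one round trip.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class JoinedSimilarItems {
      public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
        // Inner query id:ipad selects the current item's doc; the join then
        // returns the docs whose id appears in that doc's similar_items field.
        SolrQuery joined = new SolrQuery("{!join from=similar_items to=id}id:ipad");
        for (SolrDocument doc : solr.query(joined).getResults()) {
          System.out.println(doc.getFieldValue("id") + "\t" + doc.getFieldValue("title"));
        }
      }
    }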
Re: Setting up a recommender
I am wondering about row/column confusion as well. Fleshing out the doc/design with more specifics (which Pat is kind of doing, basically) should make things obvious eventually, imo. The way Pat had phrased it got me wondering what rationale you use to rank the results when you are querying the columns (similar column, similar-via-action-2 column, etc.). He had mentioned the auxiliary case of simply getting the most similar items to a given docid by just going to the row for that docid and using the pre-sorted values in the similar column, and I thought Ted might have hinted that you could just as well do a Solr query of the column with that single docid as the query. However, in the latter case I wonder if the order and the list itself could be weird, as some items may show up simply because they are not similar to many things: lower LLR values that got filtered out of the list for the docid itself won't get filtered when you're looking at the other not-similar-to-very-many-items things when generating their list for the Solr field. I guess using an absolute cutoff for LLR in the filtering could deal with some of this issue. All hypothetical at the moment (for me, anyway), as real data might trivially dismiss some of these concerns as irrelevant. I think the hangout is a good idea, too, btw, and hope to be able to sit in if it happens. Very excited about this approach.
Re: Setting up a recommender
Yes, storing the similar_items in a field and cross_action_similar_items in another field, all on the same doc ided by item ID. Agree that there may be other fields. Storing the rows of [B'B] is ok because it's symmetric. However, we did talk about the [B'A] case, and I thought we agreed to store the rows there too, because they were from B's items. This was the discussion about having different items for cross actions. The excerpt below is Ted responding to my question. So do we want the columns of [B'A]? It's only a transpose away.

On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:

[B'A] =
            iphone  ipad  nexus  galaxy  surface
   iphone      2      2     2      1       0
   ipad        2      2     2      1       0
   nexus       1      1     1      1       0
   galaxy      1      1     1      1       0
   surface     0      0     0      0       1

The rows are what we want from [B'A], since the row items are from B, right?

Yes. It is easier to understand if you have different kinds of items as well as different actions. For instance, suppose that you have user x query terms (A) and user x device (B). B'A is then device x term, so that there is a row per device and the row contains terms. This is good when searching for devices using terms.

Talking about getting the actual doc field values, which will include the similar_items field and other metadata: the actual ids in the similar_items field work well for anonymous/no-history recs, but maybe there is a second query or fetch that I'm missing? I assumed that a fetch of the doc and its fields by item ID was as fast a way to do this as possible. If there is some way to get the same result by doing a query that is faster, I'm all for it. Can do tomorrow at 2.
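On the write side, the layout Pat describes (one doc per item, one field per matrix) might look like this in SolrJ; the ids, field names, and values are illustrative only, and the space-delimited id lists assume text fields.

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexItemRow {
      public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "iphone");
        // Row of [B'B]: ids of similar items, strongest links first.
        doc.addField("similar_items", "ipad nexus galaxy");
        // Row of [B'A]: cross-action links (here, hypothetical search terms).
        doc.addField("cross_action_similar_items", "smartphone touchscreen");
        doc.addField("title", "iPhone");  // item meta-data lives on the same doc
        solr.add(doc);
        solr.commit();
      }
    }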
Re: Why is Lanczos deprecated?
There's a part of Nathan Halko's dissertation, referenced on the algorithm page, running this comparison. In particular, he was not able to compute more than 40 eigenvectors with Lanczos on the wikipedia dataset. You may refer to that study. On the accuracy part, it was not observed to be a problem, assuming a high level of random noise is not the case, at least not in the LSA-like application used there. That said, I am all for diversity of tools. I would actually be +0 on deprecating Lanczos; it is not like we are lacking support for it. SSVD could use improvements too.
Re: Question for RecommenderJob
The version of Mahout I used is 0.7-cdh4.3.1, and I am sure that no errors occur. I checked the output, but it has null. I think the problem is my data set. Is my item set, with only 200 elements, too small?

On Thu, Aug 1, 2013 at 9:57 PM, Sebastian Schelter s...@apache.org wrote:
Which version of Mahout are you using? Did you check the output, are you sure that no errors occur? Best, Sebastian
Re: Why is Lanczos deprecated?
I would also be fine with keeping it if there is demand. I just proposed to deprecate it, and nobody voted against that at that point in time. --sebastian
Re: Question for RecommenderJob
The size should not matter; you should get output. What exactly do you mean by "it has null"? --sebastian
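Unresolved in the thread, but a quick sanity check of the user,item,1 input often helps before re-running the job, since malformed rows or an unexpected delimiter would also explain empty output. A hypothetical sketch:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashSet;
    import java.util.Set;

    // Counts prefs, distinct users, and distinct items in a user,item[,value] file.
    public class PrefFileCheck {
      public static void main(String[] args) throws Exception {
        Set<String> users = new HashSet<String>();
        Set<String> items = new HashSet<String>();
        long prefs = 0;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split(",");
          if (parts.length < 2) continue;  // malformed row: worth counting in practice
          users.add(parts[0]);
          items.add(parts[1]);
          prefs++;
        }
        in.close();
        System.out.printf("%d prefs, %d users, %d items%n", prefs, users.size(), items.size());
      }
    }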