error running lucene.vectors

2013-08-01 Thread Swami Kevala
I'm running the command mahout lucene.vectors (via cygwin) on a Solr (4.4) index

(using Mahout 0.8)

I'm getting the following error

SEVERE: There are too many documents that do not have a term vector for text

Exception in thread main java.lang.IllegalStateException: There are too
many documents that do not have a term vector for text at
org.apache.mahout.utils.vectors.lucene.AbstractLuceneIterator.computeNext
(AbstractLuceneIterator.java:97)

I tried adding the flag:  --maxPercentErrorDocs 0.9 and I still get the same
error.

I have defined termvectors for my Solr 'text' field
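For reference, a rough sketch of a full invocation with that flag (the index path, field name and id field here are placeholders, and the exact option set may differ slightly between Mahout releases); note the flag only tolerates a fraction of documents lacking term vectors, so documents indexed before termVectors was enabled on the field still count against it:

# placeholders: point --dir at the actual Solr/Lucene index and adjust field names
mahout lucene.vectors \
  --dir /path/to/solr/collection1/data/index \
  --field text \
  --idField id \
  --dictOut /tmp/lucene-dict.txt \
  --output /tmp/lucene-vectors \
  --maxPercentErrorDocs 0.9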






Re: Data distribution guidance for recommendation engines

2013-08-01 Thread Sean Owen
On Thu, Aug 1, 2013 at 3:15 AM, Chloe Guszo chloe.gu...@gmail.com wrote:
 If I split my data into train and test sets, I can show good performance of

Good performance according to what metric? It makes a lot of
difference whether you are talking about precision/recall or RMSE.

 the model on the train set. What might I expect given an uneven
 distribution of ratings? Imagine a situation where 50% of the ratings are
 1s, and the rest 2-5. Will the model be biased towards rating items a 1? Do

In the general case, recommenders don't rate items at all, they rank
items. So this may not be a question that matters.

 about the rating scale itself. For example, given [1:3] vs [1:10] ranges,
 with the former you've got a 1/3 chance of predicting the correct
 rating, say, while in the latter case it is 1/10. Or, when is sparse too

Why do you say that... the recommender is not choosing ratings randomly.


 Ultimately, I'm trying to figure out under what conditions one would look
 at a model and say that is crap, pardon my language. Do any more

You use evaluation metrics, which are imperfect, but about the best
you can do in the lab. If you're already doing that, you're doing
fine. This is true no matter what your input is like.

If your input is things like click count, then they will certainly be
mostly 1 and follow a power-law distribution. This is no problem but
you want to follow the 'implicit feedback' version of ALS, where you
are not trying to reconstruct the input but use the input as weights.
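For what it's worth, a hedged sketch of what that looks like with Mahout 0.8's distributed ALS driver (option names as I recall them; the input path, alpha and lambda values are placeholders, not recommendations):

mahout parallelALS \
  --input /path/to/click-counts.csv \
  --output /tmp/als \
  --implicitFeedback true \
  --alpha 40 \
  --lambda 0.065 \
  --numFeatures 20 \
  --numIterations 10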


Question for RecommenderJob

2013-08-01 Thread hahn jiang
Hi all,


I have a question when I use RecommenderJob for item-based recommendation.

My input data format is userid,itemid,1, so I set the booleanData option to
true.

The number of users is 9,000,000 but the number of items is 200.


When I run the RecommenderJob, the result is null. I have tried many times
with different arguments, but the result is still null.

This is one of my commands. Would you help me by telling me why it is null,
please?


bash recommender-job.sh --input input/user-item-value --output
output/recommender --numRecommendations 10 --similarityClassname
SIMILARITY_PEARSON_CORRELATION --maxSimilaritiesPerItem 300
--maxPrefsPerUser 300 --minPrefsPerUser 1 --maxPrefsPerUserInItemSimilarity
1000 --booleanData true
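For comparison, a hedged sketch of the same job with a similarity measure that ignores preference values (log-likelihood, the measure usually suggested on this list for boolean data); the script name and paths are taken from the command above:

bash recommender-job.sh --input input/user-item-value --output output/recommender-llr \
  --numRecommendations 10 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
  --booleanData true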


Thanks


Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Simon Chan
We are building PredictionIO, which helps to handle a number of pieces of
business logic. Recommending only items that the user has never expressed a
preference for before is supported.
It is a layer on top of Mahout. Hope it is helpful.


Simon

On Wed, Jul 31, 2013 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Go with 0.8.  Definitely.

 Hadoop scaleout should be easy.


 On Wed, Jul 31, 2013 at 4:19 PM, Rafal Lukawiecki 
 ra...@projectbotticelli.com wrote:

  Thank you!
 
  In general, should I be putting our efforts into using 0.8 or stick with
  0.7 for now, re RecommenderJob?
 
  On another note, which might be a different thread, but would you have
 any
  ready-made accuracy and reliability validation code to suggest when using
  RecommenderJob, or do I need to stick with predicting from test data/test
  partitions, and analysing resulting confusion matrices in R etc? Anything
  turnkey aides to entice new users.
 
  Rafal
 
  Ps. Another reason for using RJ in our use case is the hopeful, assumed
  promise of a Hadoop-derived scale-out, when needed in the near future.
  Mixed results so far on that end.
  --
  Rafal Lukawiecki
  Pardon my brevity, sent from a telephone.
 
  On 1 Aug 2013, at 00:09, Ted Dunning ted.dunn...@gmail.com wrote:
 
   On Wed, Jul 31, 2013 at 4:06 PM, Rafal Lukawiecki 
   ra...@projectbotticelli.com wrote:
  
   Many thanks, I'll report the issue, when I figure out where. :)
  
   I can help with that!
  
   https://issues.apache.org/jira/browse/MAHOUT
 



Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Simon, is there any documentation available, or more info on PredictionIO?
--
Rafal Lukawiecki
Pardon brevity, mobile device.

On 1 Aug 2013, at 09:13, Simon Chan simonc...@gmail.com wrote:

 We are building PredictionIO that helps to handle a number of business
 logics. Recommending only items that the user has never expressed a
 preference before is supported.
 It is a layer on top of Mahout. Hope it is helpful.
 
 
 Simon
 
 On Wed, Jul 31, 2013 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Go with 0.8.  Definitely.
 
 Hadoop scaleout should be easy.
 
 
 On Wed, Jul 31, 2013 at 4:19 PM, Rafal Lukawiecki 
 ra...@projectbotticelli.com wrote:
 
 Thank you!
 
 In general, should I be putting our efforts into using 0.8 or stick with
 0.7 for now, re RecommenderJob?
 
 On another note, which might be a different thread, but would you have
 any
 ready-made accuracy and reliability validation code to suggest when using
 RecommenderJob, or do I need to stick with predicting from test data/test
 partitions, and analysing resulting confusion matrices in R etc? Anything
 turnkey aides to entice new users.
 
 Rafal
 
 Ps. Another reason for using RJ in our use case is the hopeful, assumed
 promise of a Hadoop-derived scale-out, when needed in the near future.
 Mixed results so far on that end.
 --
 Rafal Lukawiecki
 Pardon my brevity, sent from a telephone.
 
 On 1 Aug 2013, at 00:09, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Wed, Jul 31, 2013 at 4:06 PM, Rafal Lukawiecki 
 ra...@projectbotticelli.com wrote:
 
 Many thanks, I'll report the issue, when I figure out where. :)
 
 I can help with that!
 
 https://issues.apache.org/jira/browse/MAHOUT
 
 


Why is Lanczos deprecated?

2013-08-01 Thread Fernando Fernández
Hi everyone,

Sorry if I duplicate the question but I've been looking for an answer and I
haven't found an explanation other than it's not being used (together with
some other algorithms). If it's been discussed in depth before maybe you
can point me to some link with the discussion.

I have successfully used Lanczos in several projects and it has been a
surprise to me to find that the main reason (according to what I've read,
which might not be the full story) is that it's not being used. At the
beginning I supposed it was because SSVD is supposed to be much faster with
similar results, but after making some tests I have found that running
times are similar or even worse than Lanczos for some configurations (I
have tried several combinations of parameters, given child processes enough
memory, etc., and had no success in running SSVD in at least 3/4 of the time
Lanczos takes, though there might be some combinations of parameters I have
still not tried). It also seems quite tricky to find a good combination of
parameters for SSVD, and I have seen a precision loss in some examples
that makes me not confident about migrating from Lanczos to SSVD for now (how
far can I trust results from a combination of parameters that runs in
significantly less time, or at least in a good time?).

Can someone convince me that SSVD is actually a better option than Lanczos?
(I'm totally willing to be convinced... :) )

Thank you very much in advance.

Fernando.


Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Simon, my apologies for my dumb question. I found the web site for
PredictionIO—I did not realise it was a separate project, and I was looking for
info in the existing Mahout documentation. I will research it now for our use case.
--
Rafal Lukawiecki
Strategic Consultant and Director 
Project Botticelli Ltd

On 1 Aug 2013, at 09:52, Rafal Lukawiecki ra...@projectbotticelli.com wrote:

Simon, is there any documentation available, or more info on PredictionIO?
--
Rafal Lukawiecki
Pardon brevity, mobile device.

On 1 Aug 2013, at 09:13, Simon Chan simonc...@gmail.com wrote:

 We are building PredictionIO that helps to handle a number of business
 logics. Recommending only items that the user has never expressed a
 preference before is supported.
 It is a layer on top of Mahout. Hope it is helpful.
 
 
 Simon
 
 On Wed, Jul 31, 2013 at 4:57 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Go with 0.8.  Definitely.
 
 Hadoop scaleout should be easy.
 
 
 On Wed, Jul 31, 2013 at 4:19 PM, Rafal Lukawiecki 
 ra...@projectbotticelli.com wrote:
 
 Thank you!
 
 In general, should I be putting our efforts into using 0.8 or stick with
 0.7 for now, re RecommenderJob?
 
 On another note, which might be a different thread, but would you have
 any
 ready-made accuracy and reliability validation code to suggest when using
 RecommenderJob, or do I need to stick with predicting from test data/test
 partitions, and analysing resulting confusion matrices in R etc? Anything
 turnkey aides to entice new users.
 
 Rafal
 
 Ps. Another reason for using RJ in our use case is the hopeful, assumed
 promise of a Hadoop-derived scale-out, when needed in the near future.
 Mixed results so far on that end.
 --
 Rafal Lukawiecki
 Pardon my brevity, sent from a telephone.
 
 On 1 Aug 2013, at 00:09, Ted Dunning ted.dunn...@gmail.com wrote:
 
 On Wed, Jul 31, 2013 at 4:06 PM, Rafal Lukawiecki 
 ra...@projectbotticelli.com wrote:
 
 Many thanks, I'll report the issue, when I figure out where. :)
 
 I can help with that!
 
 https://issues.apache.org/jira/browse/MAHOUT
 
 




RE: How to SSVD output to generate Clusters

2013-08-01 Thread Stuti Awasthi
Thanks Ted, Dmitriy

I'll check the Spectral Clustering as well as the PCA option, but first I want
to execute it once with the normal approach.

Here is what I am doing with Mahout 0.7:
1. seqdirectory :
 ~/mahout-distribution-0.7/bin/mahout seqdirectory -i 
/stuti/SSVD/ClusteringInput -o /stuti/SSVD/data-seq

2.seq2sparse
~/mahout-distribution-0.7/bin/mahout seq2sparse -i /stuti/SSVD/data-seq -o 
/stuti/SSVD/data-vectors -s 5 -ml 50 -nv -ng 3 -n 2 -x 70

3. ssvd
~/mahout-distribution-0.7/bin/mahout ssvd -i 
/stuti/SSVD/data-vectors/tf-vectors -o /stuti/SSVD/Output -k 10 -U true -V true 
--reduceTasks 1

4.kmeans: with U as input
~/mahout-distribution-0.7/bin/mahout kmeans -i /stuti/SSVD/Output/U -c 
/stuti/intial-centroids -o /stuti/SSVD/Cluster/kmeans-clusters -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 20 -cl -k 10

5. Clusterdump
~/mahout-distribution-0.7/bin/mahout clusterdump -dt sequencefile -i 
/stuti/SSVD/Cluster/kmeans-clusters/clusters-*-final -d 
/stuti/SSVD/data-vectors/dictionary.file-* -o ~/ClusterOutput/SSVD/KMeans_10 -p 
/stuti/SSVD/Cluster/kmeans-clusters/clusteredPoints -n 10 -b 200 -of CSV

Output :
Normally if I use clusterdump with the CSV option, then I receive the ClusterId
and associated document names, but this time I'm getting output like:

120,_0_-0.06453357851086772_1_-0.11705342976172932_2_0.04432960668756471_3_0.10046604725589514_4_-0.06602768838676538_5_-0.16253383395031692_6_-0.0042184763959784155_7_0.03321981657725734_8_-0.04904708660966478_9_0.015635264416337353_,
 ...

I think there is a problem because of NamedVector; after some searching I found
this Jira: https://issues.apache.org/jira/browse/MAHOUT-1067

My queries:
1. Is the process I'm following correct? Should U be fed directly as input to
the clustering algorithm?

2. Is the output issue because of NamedVector? If yes, will the issue be
resolved if I use Mahout 0.8?

3. I'm confused between the parameter -k in SSVD and -k in clustering (KMeans).
How are these different? -k in clustering means the number of clusters to be
created; what is the purpose of -k (rank) in SSVD?
(My apologies, but I am having some trouble grasping the SSVD algorithm. The
concept of rank is not clear to me.)

4. If I generate -k = 100 in SSVD, will I still be able to create, say, 10
clusters using clustering on this data? (See the sketch below.)
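A hedged sketch of how the two -k values stay independent, reusing the commands above (-k 100 is the decomposition rank, -k 10 the number of clusters; whether U or the PCA U*Sigma output is the better clustering input depends on the workflow chosen):

~/mahout-distribution-0.7/bin/mahout ssvd -i /stuti/SSVD/data-vectors/tf-vectors \
  -o /stuti/SSVD/Output -k 100 -U true -V true --reduceTasks 1
~/mahout-distribution-0.7/bin/mahout kmeans -i /stuti/SSVD/Output/U \
  -c /stuti/intial-centroids -o /stuti/SSVD/Cluster/kmeans-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 20 -cl -k 10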

Thanks
Stuti Awasthi

-Original Message-
From: Dmitriy Lyubimov [mailto:dlie...@gmail.com] 
Sent: Wednesday, July 31, 2013 11:15 PM
To: user@mahout.apache.org
Subject: Re: How to SSVD output to generate Clusters

Many people also use the PCA options workflow with SSVD and then try to
clusterize the output U*Sigma, which is a dimensionally reduced representation
of the original row-wise dataset. To enable PCA and U*Sigma output, use

ssvd -pca true -us true -u false -v false -k=... -q=1 ...

-q=1 recommended for accuracy.



On Wed, Jul 31, 2013 at 5:09 AM, Stuti Awasthi stutiawas...@hcl.com wrote:

 Hi All,

 I wanted to group together the documents with the same context but which
 belong to one single domain. I have tried KMeans and LDA provided in
 Mahout to perform the clustering, but the groups which are generated
 are not very good. Hence I thought to use LSA to identify the context
 related to the words and then perform the clustering.

 I am able to run Mahout's SSVD and have generated 3 files as output:
 Sigma, U, V.
 I am not sure how to use the output of SSVD to feed to the clustering
 algorithm so that we can generate clusters of the documents which
 might be talking about the same context.

 Any pointers how can I achieve this ?

 Regards
 Stuti Awasthi



Re: How to SSVD output to generate Clusters

2013-08-01 Thread Chirag Lakhani
Maybe someone can clarify this issue but the spectral clustering
implementation assumes an affinity graph, am I correct?  Are there direct
ways of going from a list of feature vectors to an affinity matrix in order
to then implement spectral clustering?


On Thu, Aug 1, 2013 at 8:49 AM, Stuti Awasthi stutiawas...@hcl.com wrote:

 Thanks Ted, Dmitriy

 Il check the Spectral Clustering as well PCA option but first with normal
 approach I want to execute it once.

 Here is what I am doing with Mahout 0.7:
 1. seqdirectory :
  ~/mahout-distribution-0.7/bin/mahout seqdirectory -i
 /stuti/SSVD/ClusteringInput -o /stuti/SSVD/data-seq

 2.seq2sparse
 ~/mahout-distribution-0.7/bin/mahout seq2sparse -i /stuti/SSVD/data-seq -o
 /stuti/SSVD/data-vectors -s 5 -ml 50 -nv -ng 3 -n 2 -x 70

 3. ssvd
 ~/mahout-distribution-0.7/bin/mahout ssvd -i
 /stuti/SSVD/data-vectors/tf-vectors -o /stuti/SSVD/Output -k 10 -U true -V
 true --reduceTasks 1

 4.kmeans: with U as input
 ~/mahout-distribution-0.7/bin/mahout kmeans -i /stuti/SSVD/Output/U -c
 /stuti/intial-centroids -o /stuti/SSVD/Cluster/kmeans-clusters -dm
 org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 20 -cl
 -k 10

 5. Clusterdump
 ~/mahout-distribution-0.7/bin/mahout clusterdump -dt sequencefile -i
 /stuti/SSVD/Cluster/kmeans-clusters/clusters-*-final -d
 /stuti/SSVD/data-vectors/dictionary.file-* -o
 ~/ClusterOutput/SSVD/KMeans_10 -p
 /stuti/SSVD/Cluster/kmeans-clusters/clusteredPoints -n 10 -b 200 -of CSV

 Output :
 Normally if I use Clusterdump with CSV option, the I receive the ClusterId
 and associated documents names but this time Im getting the output like :

 120,_0_-0.06453357851086772_1_-0.11705342976172932_2_0.04432960668756471_3_0.10046604725589514_4_-0.06602768838676538_5_-0.16253383395031692_6_-0.0042184763959784155_7_0.03321981657725734_8_-0.04904708660966478_9_0.015635264416337353_,
 ...

 I think there is a problem because of NamedVector as after some search I
 get this Jira. https://issues.apache.org/jira/browse/MAHOUT-1067

 My Queries :
 1. Is the process which Im doing is correct ? should U be directly fed as
 input to Clustering Algorithm

 2. The Output issue is because of NamedVector ?? If yes , then if I use
 Mahout 0.8 will the issue be resolved ?

 3. Im confused between parameter -k in SSVD and -k in
 Clustering(KMeans). How these are different ? As -k in Clustering means
 Number of cluster to be created . What is the purpose of -k(rank) in SSVD
 (My apologies, but I am having some problem in grasping the SSVD
 algorithm. The concept of Rank is not clear to me)

 4. If I generate -k =100 in SSVD, will I still be able to create say 10
 Clusters using the clustering with this data.

 Thanks
 Stuti Awasthi

 -Original Message-
 From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
 Sent: Wednesday, July 31, 2013 11:15 PM
 To: user@mahout.apache.org
 Subject: Re: How to SSVD output to generate Clusters

 many people also use PCA options workflow with SSVD and then try
 clusterize the output U*Sigma which is dimensionally reduced representation
 of original row-wise dataset. To enable PCA and U*Sigma output, use

 ssvd -pca true -us true -u false -v false -k=... -q=1 ...

 -q=1 recommended for accuracy.



 On Wed, Jul 31, 2013 at 5:09 AM, Stuti Awasthi stutiawas...@hcl.com
 wrote:

  Hi All,
 
  I wanted to group the documents with same context but which belongs to
  one single domain together. I have tried KMeans and LDA provided in
  Mahout to perform the clustering but the groups which are generated
  are not very good. Hence I thought to use LSA to indentify the context
  related to the word and then perform the Clustering.
 
  I am able to run SSVD of Mahout and generated 3 files : Sigma,U,V as
  output of SSVD.
  I am not sure how to use the output of SSVD to fed to the Clustering
  Algorithm so that we can generate the clusters of the documents which
  might be talking about same context.
 
  Any pointers how can I achieve this ?
 
  Regards
  Stuti Awasthi
 
 

CHEMDNER CFP and training data

2013-08-01 Thread Martin Krallinger
CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name
recognition task (see
http://www.biocreative.org/tasks/biocreative-iv/chemdner)

(1) The CHEMDNER task (part of The BioCreative IV competition) is a
community challenge on named entity recognition of chemical compounds. The
goal of this task is to promote the implementation of systems that are able
to detect mentions in text of chemical compounds and drugs.

(2) The datasets relevant to the CHEMDNER tasks will all be listed under
the following link:

http://www.biocreative.org/resources/corpora/bc-iv-chemdner-corpus

The CHEMDNER training set is now online, together with an updated version of
the annotation guidelines. They are available at:

http://www.biocreative.org/media/store/files/2013/CHEMDNER_TRAIN_V01.zip

(3) Dates: Please note the following CHEMDNER schedule, in particular
the test set prediction due date.

25th June: sample data collection, detailed task description, annotation
and evaluation script
31st July: training data collection, annotations and updated guidelines
16th August: development data annotations
3rd September: test set release
12th September: test set prediction due
17th September: invite teams for workshop presentation talks
19th September: CHEMDNER workshop proceedings paper due (2-4 pages)
7th-9th October: BioCreative IV workshop
http://www.biocreative.org/events/biocreative-iv/workshop/

(4) Frequently asked questions (FAQ). Considering the numerous questions we
have received from various teams (many of them related to the CDI subtask
ranking), we have placed a FAQ document online at:

http://www.biocreative.org/media/store/files/2013/chemdner_faq.pdf

(5) Evaluation workshop. The evaluation results will be presented at this
workshop and in the corresponding workshop proceedings evaluation paper.
This is in line with other community challenges such as Critical Assessment
of protein Structure Prediction (CASP) experiments. There will be a session
devoted to the CHEMDNER task during the workshop as well as a poster
session and selected talks from participating teams. The link of the
workshop as well as additional details and registration is online at:

http://www.biocreative.org/events/biocreative-iv/workshop

(6) Workshop proceedings. The second volume of the BioCreative workshop
proceedings will be devoted entirely to the CHEMDNER task. The proceedings
papers for the CHEMDNER task should be 2-4 pages long, describing your
system and results obtained for the training or development set (or both).
For more details refer to:

http://www.biocreative.org/events/biocreative-iv/workshop/#proceedings

(7) CHEMDNER special issue publications

There will be a journal issue devoted to BioCreative IV and also one for
the CHEMDNER task. This is in line with previous BioCreative challenges where
special issues were published in BMC Bioinformatics, Genome Biology and the
journal Database. We will announce more details on the selection process
and the target journal after the workshop.






Martin Krallinger

Structural Computational Biology Group

Structural Biology and BioComputing Programme

Spanish National Cancer Research Centre (CNIO)




k-means issues

2013-08-01 Thread Marco


So I've got 13,000 text files representing topics in certain newspaper articles.
Each file is just a tab-separated list of topics (so something like "china
japan senkaku dispute" or "italy lampedusa immigration").

I want to run k-means clustering on them.

Here's what I do (I'm actually doing it on a subset of 100 files):

1) run seqdirectory to produce sequence file from raw text files
2) run seq2sparse to produce vectors from sequence file 

(If I do seqdumper on tfidf-vectors/part-r-0 I get something like
Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476}
and if I do it on dictionary.file-0 I get
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.hadoop.io.IntWritable
Key: china: Value: 0
Key: japan: Value: 1)

3) I run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o
mahout/kmeans-clusters -dm
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters
mahout/tmp)
The first thing I notice here is that it logs:
INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks:
org.apache.mahout.math.VectorWritable Input Vectors: {}
The Input Vectors: {} part puzzles me.


Even worse, this doesn't seem to create the clusteredPoints directory at all.

What am I doing wrong?



Re: Question for RecommenderJob

2013-08-01 Thread Sebastian Schelter
Which version of Mahout are you using? Did you check the output, are you
sure that no errors occur?

Best,
Sebastian

On 01.08.2013 09:59, hahn jiang wrote:
 Hi all,
 
 
 I have a question when I use RecommenderJob for item-based recommendation.
 
 My input data format is userid,itemid,1, so I set booleanData option is
 true.
 
 The length of users is 9,000,000 but the length of item is 200.
 
 
 When I run the RecommenderJob, the result is null. I try many times use
 different arguments. But the result is also null.
 
 This is one of my commands. Would you help me for  tell me why it is null
 please?
 
 
 bash recommender-job.sh --input input/user-item-value --output
 output/recommender --numRecommendations 10 --similarityClassname
 SIMILARITY_PEARSON_CORRELATION --maxSimilaritiesPerItem 300
 --maxPrefsPerUser 300 --minPrefsPerUser 1 --maxPrefsPerUserInItemSimilarity
 1000 --booleanData true
 
 
 Thanks
 



Re: k-means issues

2013-08-01 Thread Suneel Marthi
Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too 
:))

You need to specify the clustering option -cl in your kmeans command. 
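Roughly, the command from the quoted message with the clustering pass added (paths unchanged; -cl is what writes the clusteredPoints directory):

mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o mahout/kmeans-clusters \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 \
  --clusters mahout/tmp -cl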







 From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org 
Sent: Thursday, August 1, 2013 9:55 AM
Subject: k-means issues
 



So I've got 13000 text files representing topics in certain newspaper articles.
Each file is just a tab-separated list of topics (so something like china    
japan    senkaku    dispute or italy   lampedusa   immgration).

I want to run k-means clusteriazion on them.

Here's what I do (i'm actually doing it on a subset of 100 files):

1) run seqdirectory to produce sequence file from raw text files
2) run seq2sparse to produce vectors from sequence file 

(if i do seqdumper on tfidf-vectors/part-r-0 i get something like 
Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476}
and if i do it on dictionary.fie-0 i get
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.hadoop.io.IntWritable
Key: china: Value: 0
Key: japan: Value: 1

3) i run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o 
mahout/kmeans-clusters -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters 
mahout/tmp)
first thing i notice here is it logs:
INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 
org.apache.mahout.math.VectorWritable Input Vectors: {}
the Input Vectors: {} part puzzles me. 


Even worse, this doesn't seem to create the clusteredPoints directory at all.

What am I doing wrong?

Re: Why is Lanczos deprecated?

2013-08-01 Thread Sebastian Schelter
IIRC the main reasons for deprecating Lanczos were that, in contrast to
SSVD, it does not use a constant number of MapReduce jobs, and that our
implementation has the constraint that all the resulting vectors have to
fit into the memory of the driver machine.

Best,
Sebastian

On 01.08.2013 12:15, Fernando Fernández wrote:
 Hi everyone,
 
 Sorry if I duplicate the question but I've been looking for an answer and I
 haven't found an explanation other than it's not being used (together with
 some other algorithms). If it's been discussed in depth before maybe you
 can point me to some link with the discussion.
 
 I have successfully used Lanczos in several projects and it's been a
 surprise to me finding that the main reason (according to what I've read
 that might not be the full story) is that it's not being used. At the
 begining I supposed it was because SSVD is supposed to be much faster with
 similar results, but after making some tests I have found that running
 times are similar or even worse than lanczos for some configurations (I
 have tried several combinations of parameters, given child processes enough
 memory, etc. and had no success in running SSVD at least in 3/4 of time
 Lanczos runs, thouh they might be some combinations of parameters I have
 still not tried). It seems to be quite tricky to find a good combination of
 parameters for SSVD and I have seen also a precision loss in some examples
 that makes me not confident in migrating Lanczos to SSVD from now on (How
 far can I trust results from a combination of parameters that runs in
 significant less time, or at least a good time?).
 
 Can someone convince me that SSVD is actually a better option than Lanczos?
 (I'm totally willing to be convinced... :) )
 
 Thank you very much in advance.
 
 Fernando.
 



Re: k-means issues

2013-08-01 Thread Marco
OK, I did put -cl and got clusteredPoints, but then I run clusterdump and always
get "Wrote 0 clusters".




- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco 
zentrop...@yahoo.co.uk
Cc: 
Sent: Thursday, 1 August 2013 16:04
Subject: Re: k-means issues

Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too 
:))

You need to specify the clustering option -cl in your kmeans command. 







From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org 
Sent: Thursday, August 1, 2013 9:55 AM
Subject: k-means issues




So I've got 13000 text files representing topics in certain newspaper articles.
Each file is just a tab-separated list of topics (so something like china    
japan    senkaku    dispute or italy   lampedusa   immgration).

I want to run k-means clusteriazion on them.

Here's what I do (i'm actually doing it on a subset of 100 files):

1) run seqdirectory to produce sequence file from raw text files
2) run seq2sparse to produce vectors from sequence file 

(if i do seqdumper on tfidf-vectors/part-r-0 i get something like 
Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476}
and if i do it on dictionary.fie-0 i get
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.hadoop.io.IntWritable
Key: china: Value: 0
Key: japan: Value: 1

3) i run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o 
mahout/kmeans-clusters -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters 
mahout/tmp)
first thing i notice here is it logs:
INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 
org.apache.mahout.math.VectorWritable Input Vectors: {}
the Input Vectors: {} part puzzles me. 


Even worse, this doesn't seem to create the clusteredPoints directory at all.

What am I doing wrong?


Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Matt Molek
One trick to getting more mappers on a job when running from the command
line is to pass a '-Dmapred.max.split.size=<size>' argument. The <size> is a
size in bytes. So if you have some hypothetical 10MB input set, but you
want to force ~100 mappers, use '-Dmapred.max.split.size=1000000'


On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit galp...@ebay.comwrote:


 Hi,

 It sounds to me like this could be related to one of the Qs I've posted
 several days ago (is it?):
 My mahout clustering processes seem to be running very slow (several good
 hours on just ~1M items), and I'm wondering if there's anything that needs
 to be changed in setting/configuration. (and how?)
 I'm running on a large cluster and could potentially use thousands
 of nodes (mappers/reducers). However, my mahout processes (kmeans/canopy...)
 are only using max 5 mappers (I tried it on several data sets).
 I've tried to define the number of mappers by something like
 -Dmapred.map.tasks=100, but this didn't seem to have an effect; it still
 only uses <=5 mappers.
 Is there a different way to set the number of mappers/reducers for
 a mahout process?
 Or is there another configuration issue I need to consider?

 I'd definitely be happy to use such a parameter, does it not exist?
 (I'm running mahout as installed on the cluster)

 Is there currently a workaround, besides running a mahout jar as an hadoop
 job?
 When I originally tried to run a mahout jar that uses KMeansDriver (and
 that runs great on my local machine)- it did not even initiate a job on the
 hadoop cluster. It seemed to be running parallel but in fact it was running
 only on the local node. Is this a known issue? Is there a fix for
 this? (I ended up dropping it and calling mahout step by step from command
 line, but I'd be happy to know if there a fix for this).

 Thanks,

 Galit.

 -Original Message-
 From: Ryan Josal [mailto:rjo...@gmail.com]
 Sent: Monday, July 29, 2013 9:33 PM
 To: Adam Baron
 Cc: Ryan Josal; user@mahout.apache.org
 Subject: Re: Run more than one mapper for TestForest?

 If you're running mahout from the CLI, you'll have to modify the Hadoop
 config file or your env manually for each job.  This is code I put in to my
 custom job executions so I didn't have to calculate and set that up every
 time.  Maybe that's your best route in that position.  You could just
 provide your own mahout jar and run it as you would any other Hadoop job
 and ignore the installed Mahout.  I do think this could be a useful
 parameter for a number of standard mahout jobs though; I know I would use
 it.  Does anyone in the mahout community see this as a generally useful
 feature for a Mahout job?

 Ryan

 On Jul 29, 2013, at 10:25, Adam Baron adam.j.ba...@gmail.com wrote:

  Ryan,
 
  Thanks for the fix, the code looks reasonable to me.  Which version of
 Mahout will this be in?  0.9?
 
  Unfortunately, I'm using a large shared Hadoop cluster which is not
 administered by my team.   So I'm not in a position push the latest from
 the Mahout dev trunk into our environment; the admins will only install
 official releases.
 
  Regards,
Adam
 
  On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal r...@josal.com wrote:
  Late reply, but for what it's still worth, since I've seen a couple
 other threads here on the topic of too few mappers, I added a parameter to
 set a minimum number of mappers.  Some of my mahout jobs needed more
 mappers, but were not given many because of the small input file size.
 
  addOption("minMapTasks", "m", "Minimum number of map tasks to run",
      String.valueOf(1));


  int minMapTasks = Integer.parseInt(getOption("minMapTasks"));
  int mapTasksThatWouldRun = (int) (vectorFileSizeBytes / getSplitSize()) + 1;
  log.info("map tasks min: " + minMapTasks + " current: " + mapTasksThatWouldRun);
  if (minMapTasks > mapTasksThatWouldRun) {
      String splitSizeBytes = String.valueOf(vectorFileSizeBytes / minMapTasks);
      log.info("Forcing mapred.max.split.size to " + splitSizeBytes
          + " to ensure minimum map tasks = " + minMapTasks);
      hadoopConf.set("mapred.max.split.size", splitSizeBytes);
  }

  // there is actually a private method in hadoop to calculate this
  private long getSplitSize() {
      long blockSize = hadoopConf.getLong("dfs.block.size", 64 * 1024 * 1024);
      long maxSize = hadoopConf.getLong("mapred.max.split.size", Long.MAX_VALUE);
      int minSize = hadoopConf.getInt("mapred.min.split.size", 1);
      long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
      log.info(String.format("min: %,d block: %,d max: %,d split: %,d",
          minSize, blockSize, maxSize, splitSize));
      return splitSize;
  }
 
  It seems like there should be a more straightforward way to do this,
 but it works for me and I've used it on a lot of jobs to set a minimum
 number of mappers.
 
  Ryan
 
  On Jul 5, 2013, 

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Matt Molek
Oops, I'm sorry. I had one too many zeros there; it should be
'-Dmapred.max.split.size=100000'.

Just (input size)/(desired number of mappers)
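A hedged sketch of the arithmetic and of where the property goes (for most Mahout drivers the generic -D options are placed right after the job name, before the job-specific flags; the 10MB input and the kmeans job are just the hypothetical from above):

# split size = input size / desired mappers = 10,000,000 / 100 = 100,000 bytes
mahout kmeans -Dmapred.max.split.size=100000 \
  -i mahout/vectors/tfidf-vectors -o mahout/kmeans-clusters \
  --clusters mahout/tmp -k 10 -x 10 -cl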


Re: How to SSVD output to generate Clusters

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 5:49 AM, Stuti Awasthi stutiawas...@hcl.com wrote:

 I think there is a problem because of NamedVector as after some search I
 get this Jira. https://issues.apache.org/jira/browse/MAHOUT-1067


Note also that this bug is fixed in 0.8


Re: How to SSVD output to generate Clusters

2013-08-01 Thread Ted Dunning
The original motivation of spectral clustering talks about graphs.

But the idea of clustering the reduced dimension form of a matrix simply
depends on the fact[1] that the metric is approximately preserved by the
reduced form and is thus applicable to any matrix.


[1] Johnson-Lindenstrauss yet again.


On Thu, Aug 1, 2013 at 6:22 AM, Chirag Lakhani clakh...@zaloni.com wrote:

 Maybe someone can clarify this issue but the spectral clustering
 implementation assumes an affinity graph, am I correct?  Are there direct
 ways of going from a list of feature vectors to an affinity matrix in order
 to then implement spectral clustering?


 On Thu, Aug 1, 2013 at 8:49 AM, Stuti Awasthi stutiawas...@hcl.com
 wrote:

  Thanks Ted, Dmitriy
 
  Il check the Spectral Clustering as well PCA option but first with normal
  approach I want to execute it once.
 
  Here is what I am doing with Mahout 0.7:
  1. seqdirectory :
   ~/mahout-distribution-0.7/bin/mahout seqdirectory -i
  /stuti/SSVD/ClusteringInput -o /stuti/SSVD/data-seq
 
  2.seq2sparse
  ~/mahout-distribution-0.7/bin/mahout seq2sparse -i /stuti/SSVD/data-seq
 -o
  /stuti/SSVD/data-vectors -s 5 -ml 50 -nv -ng 3 -n 2 -x 70
 
  3. ssvd
  ~/mahout-distribution-0.7/bin/mahout ssvd -i
  /stuti/SSVD/data-vectors/tf-vectors -o /stuti/SSVD/Output -k 10 -U true
 -V
  true --reduceTasks 1
 
  4.kmeans: with U as input
  ~/mahout-distribution-0.7/bin/mahout kmeans -i /stuti/SSVD/Output/U -c
  /stuti/intial-centroids -o /stuti/SSVD/Cluster/kmeans-clusters -dm
  org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 20 -cl
  -k 10
 
  5. Clusterdump
  ~/mahout-distribution-0.7/bin/mahout clusterdump -dt sequencefile -i
  /stuti/SSVD/Cluster/kmeans-clusters/clusters-*-final -d
  /stuti/SSVD/data-vectors/dictionary.file-* -o
  ~/ClusterOutput/SSVD/KMeans_10 -p
  /stuti/SSVD/Cluster/kmeans-clusters/clusteredPoints -n 10 -b 200 -of CSV
 
  Output :
  Normally if I use Clusterdump with CSV option, the I receive the
 ClusterId
  and associated documents names but this time Im getting the output like :
 
 
 120,_0_-0.06453357851086772_1_-0.11705342976172932_2_0.04432960668756471_3_0.10046604725589514_4_-0.06602768838676538_5_-0.16253383395031692_6_-0.0042184763959784155_7_0.03321981657725734_8_-0.04904708660966478_9_0.015635264416337353_,
  ...
 
  I think there is a problem because of NamedVector as after some search I
  get this Jira. https://issues.apache.org/jira/browse/MAHOUT-1067
 
  My Queries :
  1. Is the process which Im doing is correct ? should U be directly fed as
  input to Clustering Algorithm
 
  2. The Output issue is because of NamedVector ?? If yes , then if I use
  Mahout 0.8 will the issue be resolved ?
 
  3. Im confused between parameter -k in SSVD and -k in
  Clustering(KMeans). How these are different ? As -k in Clustering means
  Number of cluster to be created . What is the purpose of -k(rank) in SSVD
  (My apologies, but I am having some problem in grasping the SSVD
  algorithm. The concept of Rank is not clear to me)
 
  4. If I generate -k =100 in SSVD, will I still be able to create say 10
  Clusters using the clustering with this data.
 
  Thanks
  Stuti Awasthi
 
  -Original Message-
  From: Dmitriy Lyubimov [mailto:dlie...@gmail.com]
  Sent: Wednesday, July 31, 2013 11:15 PM
  To: user@mahout.apache.org
  Subject: Re: How to SSVD output to generate Clusters
 
  many people also use PCA options workflow with SSVD and then try
  clusterize the output U*Sigma which is dimensionally reduced
 representation
  of original row-wise dataset. To enable PCA and U*Sigma output, use
 
  ssvd -pca true -us true -u false -v false -k=... -q=1 ...
 
  -q=1 recommended for accuracy.
 
 
 
  On Wed, Jul 31, 2013 at 5:09 AM, Stuti Awasthi stutiawas...@hcl.com
  wrote:
 
   Hi All,
  
   I wanted to group the documents with same context but which belongs to
   one single domain together. I have tried KMeans and LDA provided in
   Mahout to perform the clustering but the groups which are generated
   are not very good. Hence I thought to use LSA to indentify the context
   related to the word and then perform the Clustering.
  
   I am able to run SSVD of Mahout and generated 3 files : Sigma,U,V as
   output of SSVD.
   I am not sure how to use the output of SSVD to fed to the Clustering
   Algorithm so that we can generate the clusters of the documents which
   might be talking about same context.
  
   Any pointers how can I achieve this ?
  
   Regards
   Stuti Awasthi
  
  

Re: k-means issues

2013-08-01 Thread Suneel Marthi


Could u post the Command line u r using for clusterdump?





 From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi 
suneel_mar...@yahoo.com 
Sent: Thursday, August 1, 2013 10:29 AM
Subject: Re: k-means issues
 

ok i did put -cl and got clusteredPoints, but then I do clusterdump and always 
get Wrote 0 clusters




- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco 
zentrop...@yahoo.co.uk
Cc: 
Sent: Thursday, 1 August 2013 16:04
Subject: Re: k-means issues

Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too 
:))

You need to specify the clustering option -cl in your kmeans command. 







From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org 
Sent: Thursday, August 1, 2013 9:55 AM
Subject: k-means issues




So I've got 13000 text files representing topics in certain newspaper articles.
Each file is just a tab-separated list of topics (so something like china    
japan    senkaku    dispute or italy   lampedusa   immgration).

I want to run k-means clusteriazion on them.

Here's what I do (i'm actually doing it on a subset of 100 files):

1) run seqdirectory to produce sequence file from raw text files
2) run seq2sparse to produce vectors from sequence file 

(if i do seqdumper on tfidf-vectors/part-r-0 i get something like 
Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476}
and if i do it on dictionary.fie-0 i get
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.hadoop.io.IntWritable
Key: china: Value: 0
Key: japan: Value: 1

3) i run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o 
mahout/kmeans-clusters -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters 
mahout/tmp)
first thing i notice here is it logs:
INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 
org.apache.mahout.math.VectorWritable Input Vectors: {}
the Input Vectors: {} part puzzles me. 


Even worse, this doesn't seem to create the clusteredPoints directory at all.

What am I doing wrong?

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Ryan Josal
Galit, yes, this does sound like it is related, and as Matt said, you can test
this by setting the max split size on the CLI. I didn't personally find this
to be a reliable and efficient method, so I wrote the -m parameter into my job to
set it right every time. It seems that this would be useful to have as a
general parameter for Mahout jobs; is there agreement on this, and if so can I
get some guidance on how to contribute?

Ryan

On Aug 1, 2013, at 8:00, Matt Molek mpmo...@gmail.com wrote:

 One trick to getting more mappers on a job when running from the command
 line is to pass a '-Dmapred.max.split.size=' argument. The  is a
 size in bytes. So if you have some hypothetical 10MB input set, but you
 want to force ~100 mappers, use '-Dmapred.max.split.size=100'
 
 
 On Wed, Jul 31, 2013 at 4:57 AM, Fuhrmann Alpert, Galit 
 galp...@ebay.comwrote:
 
 
 Hi,
 
 It sounds to me like this could be related to one of the Qs I've posted
 several days ago (is it?):
 My mahout clustering processes seem to be running very slow (several good
 hours on just ~1M items), and I'm wondering if there's anything that needs
 to be changed in setting/configuration. (and how?)
I'm running on a large cluster and could potentially use thousands
 of nodes (mappers/reducers). However, my mahout processes (kmeans/canopy.)
 are only using max 5 mappers (I tried it on several data sets).
I've tried to define the number of mappers by something like:
 -Dmapred.map.tasks=100 but this didn't seem to have an effect, it still
 only uses =5 mappers.
Is there a different way to set the number of mappers/reducers for
 a mahout process?
Or is there another configuration issue I need to consider?
 
 I'd definitely be happy to use such a parameter, does it not exist?
 (I'm running mahout as installed on the cluster)
 
 Is there currently a workaround, besides running a mahout jar as an hadoop
 job?
 When I originally tried to run a mahout jar that uses KMeansDriver (and
 that runs great on my local machine)- it did not even initiate a job on the
 hadoop cluster. It seemed to be running parallel but in fact it was running
 only on the local node. Is this a known issue? Is there a fix for
 this? (I ended up dropping it and calling mahout step by step from command
 line, but I'd be happy to know if there a fix for this).
 
 Thanks,
 
 Galit.
 
 -Original Message-
 From: Ryan Josal [mailto:rjo...@gmail.com]
 Sent: Monday, July 29, 2013 9:33 PM
 To: Adam Baron
 Cc: Ryan Josal; user@mahout.apache.org
 Subject: Re: Run more than one mapper for TestForest?
 
 If you're running mahout from the CLI, you'll have to modify the Hadoop
 config file or your env manually for each job.  This is code I put in to my
 custom job executions so I didn't have to calculate and set that up every
 time.  Maybe that's your best route in that position.  You could just
 provide your own mahout jar and run it as you would any other Hadoop job
 and ignore the installed Mahout.  I do think this could be a useful
 parameter for a number of standard mahout jobs though; I know I would use
 it.  Does anyone in the mahout community see this as a generally useful
 feature for a Mahout job?
 
 Ryan
 
 On Jul 29, 2013, at 10:25, Adam Baron adam.j.ba...@gmail.com wrote:
 
 Ryan,
 
 Thanks for the fix, the code looks reasonable to me.  Which version of
 Mahout will this be in?  0.9?
 
 Unfortunately, I'm using a large shared Hadoop cluster which is not
 administered by my team.   So I'm not in a position push the latest from
 the Mahout dev trunk into our environment; the admins will only install
 official releases.
 
 Regards,
  Adam
 
 On Sun, Jul 28, 2013 at 5:37 PM, Ryan Josal r...@josal.com wrote:
 Late reply, but for what it's still worth, since I've seen a couple
 other threads here on the topic of too few mappers, I added a parameter to
 set a minimum number of mappers.  Some of my mahout jobs needed more
 mappers, but were not given many because of the small input file size.
 
addOption(minMapTasks, m, Minimum number of map tasks to
 run, String.valueOf(1));
 
 
int minMapTasks = Integer.parseInt(getOption(minMapTasks));
int mapTasksThatWouldRun = (int)
 (vectorFileSizeBytes/getSplitSize()) + 1;
log.info(map tasks min:  + minMapTasks +  current:  +
 mapTasksThatWouldRun);
if (minMapTasks  mapTasksThatWouldRun) {
String splitSizeBytes =
 String.valueOf(vectorFileSizeBytes/minMapTasks);
log.info(Forcing mapred.max.split.size to  +
 splitSizeBytes +  to ensure minimum map tasks =  + minMapTasks);
hadoopConf.set(mapred.max.split.size, splitSizeBytes);
}
 
// there is actually a private method in hadoop to calculate this
private long getSplitSize() {
long blockSize = hadoopConf.getLong(dfs.block.size, 64 * 1024
 * 1024);
long maxSize = hadoopConf.getLong(mapred.max.split.size,
 Long.MAX_VALUE);

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Not following, so…

Here is what I've done, in probably too much detail:

1) ingest raw log files and split them up by action
2) turn these into Mahout preference files using Mahout type IDs, keeping a map 
of IDs
3) run the Mahout Item-based recommender using LLR for similarity
4) created a Mahout-style cross-recommender using cooccurrence similarity via
matrix math
5) given two similarity matrices and a user history matrix, I am writing them to
csv files with the Mahout IDs replaced by the original string external IDs for
users and items

input log file before splitting:
u1  purchaseiphone
u1  purchaseipad
u2  purchasenexus-tablet
u2  purchasegalaxy
u3  purchasesurface
u4  purchaseiphone
u4  purchaseipad
u1  viewiphone
u1  viewipad
u1  viewnexus-tablet
u1  viewgalaxy
u2  viewiphone
u2  viewipad
u2  viewnexus-tablet
u2  viewgalaxy
u3  viewsurface
u4  viewiphone
u4  viewipad
u4  viewnexus-tablet


Input user history DRM after ID translation to mahout IDs and splitting for 
action purchase

B   user/item   iphone  ipadnexus-tabletgalaxy  surface
u1  1   1   0   0   0
u2  0   0   1   1   0
u3  0   0   0   0   1
u4  1   1   0   0   0

Map of IDs Mahout to Original/External
0 - iphone
1 - ipad
2 - nexus-tablet
3 - galaxy
4 - surface

To be specific the DRM from the RecommenderJob with item-item similarities 
using LLR looks like this:
Input Path: out/p-recs/sims/part-r-0
Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
org.apache.mahout.math.VectorWritable
Key: 0: Value: {1:0.8472157541208549}
Key: 1: Value: {0:0.8472157541208549}
Key: 2: Value: {3:0.8181382096075936}
Key: 3: Value: {2:0.8181382096075936}
Key: 4: Value: {}

This will be written to a directory for later Solr indexing as a csv of the 
form:
item_id,similar_items,cross_action_similar_items
iphone,ipad,
ipad,iphone,
nexus-tablet,galaxy,
galaxy, nexus-tablet,
surface,,

By using a user's history vector as a query you get results = recommendations
So if the user is u1, the history vector is:
iphone ipad

The Solr results for query iphone ipad using field similar_items will be 
1. Doc ID, ipad
2. Doc ID, iphone
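A hedged sketch of that lookup as a plain Solr query (host, core name and the fl list are assumptions on my side; the user's history items simply become the query terms against the similar_items field):

curl 'http://localhost:8983/solr/items/select?q=similar_items:(iphone+ipad)&fl=item_id,score&wt=json'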

If you want item similarities, for instance if a user is anonymous with no
history and is looking at an iphone product page, you would fetch the doc for
id = iphone and get:
ipad

Perhaps a bad example for ordering, since there is only one ID in the doc but 
the items in the similar_items field would be ordered by similarity strength. 

Likewise for the cross-action similarities though the matrix will have 
cooccurrence [B'A] values in the DRM.

For item similarities there is no need to do more than fetch one doc that 
contains the similarities, right? I've successfully used this method with the 
Mahout recommender but please correct me if something above is wrong. 


On Jul 31, 2013, at 4:52 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Pat,

See inline


On Wed, Jul 31, 2013 at 1:29 PM, Pat Ferrel p...@occamsmachete.com wrote:

 So the XML as CSV would be:
 item_id,similar_items,cross_action_similar_items
 ipad,iphone,iphone nexus
 iphone,ipad,ipad galaxy
 

Right.  Doesn't matter what format.  Might want quotes around space
delimited lists, but anything will do.


 
 Note: As I mentioned before the order of the items in the field will
 encode rank of the similarity strength. This is for cases where you want to
 find similar items to a context item. You would fetch the doc for the
 context item by it's item ID and show the top k items in the doc. Ted's
 caveat would probably be to dither them.
 

I always say dither so that is an easy one.

But fetching similar items of a center item by fetching the center item and
then fetching each of the referenced items is typically slower by about 2x
than running the search for mentions of the center item.


 Sounds like Ted is generating data. Andrew or M Lyon do either of you want
 to set the demo system up? If so you'll need to find a system--free tier
 AWS, Ted's box, etc. Then install all the needed stuff.
 
 I'll get the output working to csv.
 
 On Jul 31, 2013, at 11:51 AM, Pat Ferrel pat.fer...@gmail.com wrote:
 
 OK and yes. The docs will look like:
 
 <add>
   <doc>
     <field name='item_id'>ipad</field>
     <field name='similar_items'>iphone</field>
     <field name='cross_action_similar_items'>iphone nexus</field>
   </doc>
   <doc>
     <field name='item_id'>iphone</field>
     <field name='similar_items'>ipad</field>
     <field name='cross_action_similar_items'>ipad galaxy</field>
   </doc>
 </add>
 
 
 On Jul 31, 2013, at 11:42 AM, B Lyon bradfl...@gmail.com wrote:
 
 I'm interested in helping as well.
 Btw I thought that what was stored in the solr fields were the llr-filtered
 items (ids I guess) for the 

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Hi Sebastian,

I've rechecked the results, and I'm afraid that the issue has not gone away,
contrary to yesterday's enthusiastic response. Using 0.8 I have retested
with and without the --maxPrefsPerUser 9000 parameter (no user has more than 5000
prefs). I have also supplied the prefs file, without the preference value, that
is as: user,item (one per line), as a --filterFile, with and without
--maxPrefsPerUser, and I am afraid we are still seeing recommendations for items
the user has expressed a prior preference for.
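For reference, a hedged sketch of the kind of invocation being described (the driver name and paths are assumptions on my side; --filterFile points at the user,item pairs to exclude from the output, --maxPrefsPerUser as discussed above):

mahout recommenditembased --input input/prefs.csv --output output/recs \
  --numRecommendations 10 --similarityClassname SIMILARITY_COOCCURRENCE \
  --filterFile input/prefs-as-pairs.csv --maxPrefsPerUser 9000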

I suppose I need to file a bug report. 

Rafal
--
Rafal Lukawiecki
Pardon my brevity, sent from a telephone.

On 31 Jul 2013, at 22:35, Rafal Lukawiecki ra...@projectbotticelli.com 
wrote:

 Dear Sebastian,
 
 It looks like setting --maxPrefsPerUser 10000 has resolved the issue in our 
 case—it seems that the most preferences a user had was just about 5000, so I 
 doubled it just in case, but when I operationalise this model, I will make 
 sure to calculate the actual max number of preferences and set the parameter 
 accordingly. I will double-check the resultset to make sure the issue is 
 really gone, as I have only checked the few cases where we have spotted a 
 recommendation of a previously preferred item.
 
 Would you like me to file a bug, and would you like me to test it on 0.8 or 
 another version? I am using 0.7.
 
 Thanks for your kind support.
 Rafal
 --
 Rafal Lukawiecki
 Strategic Consultant and Director 
 Project Botticelli Ltd
 
 On 31 Jul 2013, at 06:22, Sebastian Schelter ssc.o...@googlemail.com
 wrote:
 
 Hi Rafal,
 
 can you try to set the option --maxPrefsPerUser to the maximum number of
 interactions per user and see if you still get the error?
 
 Best,
 Sebastian
 
 On 30.07.2013 19:29, Rafal Lukawiecki wrote:
 Thank you Sebastian. The data set is not that large, as we are running tests 
 on a subset. It is about 24k users, 40k items, the preference file has 65k 
 preferences as triples. This was using Similarity Cooccurrence.
 
 I can see if I could anonymise the data set to share if that would be 
 helpful.
 
 Thanks for your kind help. 
 
 Rafal
 --
 Rafal Lukawiecki
 Pardon my brevity, sent from a telephone.
 
 On 30 Jul 2013, at 18:18, Sebastian Schelter s...@apache.org wrote:
 
 Hi Rafal,
 
 can you issue a ticket for this problem at
 https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests that
 check whether this happens and currently they work fine. I can only imagine
 that the problem occurs in larger datasets where we sample the data in some
 places. Can you describe a scenario/dataset where this happens?
 
 Best,
 Sebastian
 
 2013/7/30 Rafal Lukawiecki ra...@projectbotticelli.com
 
 I'm new here, just registered. Many thanks to everyone for working on an
 amazing piece of software, thank you for building Mahout and for your
 support. My apologies if this is not the right place to ask the question—I
 have searched for the issue, and I can see this problem has been reported
 here:
 http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
 
 Unfortunately, the trail leads to the newsgroups, and I have not found a
 way, yet, to get an answer from them, without asking you.
 
 Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7, and I
 am finding that it is recommending items that the user has already
 expressed a preference for in their input file. I understand that this
 should not be happening, and I am not sure if there is a know fix or if I
 should be looking for a workaround (such as using the entire input as the
 filterFile).
 
 I will double-check that there is no error on my side, but so far it does
 not seem that way.
 
 Many thanks and my regards from Ireland,
 Rafal Lukawiecki
 
 --
 
 Rafal Lukawiecki
 
 Strategic Consultant and Director
 
 Project Botticelli Ltd
 
 
 


Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Sebastian Schelter
Ok, please file a bug report detailing what you've tested and what results
you got.

Just to clarify, setting maxPrefsPerUser to a high number still does not
help? That surprises me.


2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com

 Hi Sebastian,

 I've rechecked the results, and, I'm afraid that the issue has not gone
 away, contrary to my yesterday's enthusiastic response. Using 0.8 I have
 retested with and without --maxPrefsPerUser 9000 parameter (no user has
 more than 5000 prefs). I have also supplied the prefs file, without the
 preference value, that is as: user,item (one per line) as a --filterFile,
 with and without the -maxPrefsPerUser, and I am afraid we are also seeing
 recommendations for items the user has expressed a prior preference for.

 I suppose I need to file a bug report.

 Rafal
 --
 Rafal Lukawiecki
 Pardon my brevity, sent from a telephone.

 On 31 Jul 2013, at 22:35, Rafal Lukawiecki ra...@projectbotticelli.com
 wrote:

  Dear Sebastian,
 
  It looks like setting --maxPrefsPerUser 10000 has resolved the issue in
 our case—it seems that the most preferences a user had was just about 5000,
 so I doubled it just-in-case, but when I operationalise this model, I will
 make sure to calculate the actual max number of preferences and set the
 parameter accordingly. I will double-check the resultset to make sure the
 issue is really gone, as I have only checked the few cases where we have
 spotted a recommendation of a previously preferred item.
 
  Would you like me to file a bug, and would you like me to test it on 0.8
 or another version? I am using 0.7.
 
  Thanks for your kind support.
  Rafal
  --
  Rafal Lukawiecki
  Strategic Consultant and Director
  Project Botticelli Ltd
 
  On 31 Jul 2013, at 06:22, Sebastian Schelter ssc.o...@googlemail.com
  wrote:
 
  Hi Rafal,
 
  can you try to set the option --maxPrefsPerUser to the maximum number of
  interactions per user and see if you still get the error?
 
  Best,
  Sebastian
 
  On 30.07.2013 19:29, Rafal Lukawiecki wrote:
  Thank you Sebastian. The data set is not that large, as we are running
 tests on a subset. It is about 24k users, 40k items, the preference file
 has 65k preferences as triples. This was using Similarity Cooccurrence.
 
  I can see if I could anonymise the data set to share if that would be
 helpful.
 
  Thanks for your kind help.
 
  Rafal
  --
  Rafal Lukawiecki
  Pardon my brevity, sent from a telephone.
 
  On 30 Jul 2013, at 18:18, Sebastian Schelter s...@apache.org wrote:
 
  Hi Rafal,
 
  can you issue a ticket for this problem at
  https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests that
  check whether this happens and currently they work fine. I can only
 imagine
  that the problem occurs in larger datasets where we sample the data in
 some
  places. Can you describe a scenario/dataset where this happens?
 
  Best,
  Sebastian
 
  2013/7/30 Rafal Lukawiecki ra...@projectbotticelli.com
 
  I'm new here, just registered. Many thanks to everyone for working on
 an
  amazing piece of software, thank you for building Mahout and for your
  support. My apologies if this is not the right place to ask the
 question—I
  have searched for the issue, and I can see this problem has been
 reported
  here:
 
 http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
 
  Unfortunately, the trail leads to the newsgroups, and I have not
 found a
  way, yet, to get an answer from them, without asking you.
 
  Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7,
 and I
  am finding that it is recommending items that the user has already
  expressed a preference for in their input file. I understand that this
   should not be happening, and I am not sure if there is a known fix or
 if I
  should be looking for a workaround (such as using the entire input as
 the
  filterFile).
 
  I will double-check that there is no error on my side, but so far it
 does
  not seem that way.
 
  Many thanks and my regards from Ireland,
  Rafal Lukawiecki
 
  --
 
  Rafal Lukawiecki
 
  Strategic Consultant and Director
 
  Project Botticelli Ltd
 
 
 



Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Rafal Lukawiecki
Should I have set that parameter to a value much much larger than the maximum 
number of actually expressed preferences by a user?

I'm working on an anonymised data set. If it works as an error test case, I'd 
be happy to share it for your re-test. I am still hoping it is my error, not 
Mahout's.

Rafal
--
Rafal Lukawiecki
Pardon brevity, mobile device.

On 1 Aug 2013, at 17:19, Sebastian Schelter s...@apache.org wrote:

 Ok, please file a bug report detailing what you've tested and what results
 you got.
 
 Just to clarify, setting maxPrefsPerUser to a high number still does not
 help? That surprises me.
 
 
 2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com
 
 Hi Sebastian,
 
 I've rechecked the results, and, I'm afraid that the issue has not gone
 away, contrary to my yesterday's enthusiastic response. Using 0.8 I have
 retested with and without --maxPrefsPerUser 9000 parameter (no user has
 more than 5000 prefs). I have also supplied the prefs file, without the
 preference value, that is as: user,item (one per line) as a --filterFile,
 with and without the -maxPrefsPerUser, and I am afraid we are also seeing
 recommendations for items the user has expressed a prior preference for.
 
 I suppose I need to file a bug report.
 
 Rafal
 --
 Rafal Lukawiecki
 Pardon my brevity, sent from a telephone.
 
 On 31 Jul 2013, at 22:35, Rafal Lukawiecki ra...@projectbotticelli.com
 wrote:
 
 Dear Sebastian,
 
  It looks like setting --maxPrefsPerUser 10000 has resolved the issue in
 our case—it seems that the most preferences a user had was just about 5000,
 so I doubled it just-in-case, but when I operationalise this model, I will
 make sure to calculate the actual max number of preferences and set the
 parameter accordingly. I will double-check the resultset to make sure the
 issue is really gone, as I have only checked the few cases where we have
 spotted a recommendation of a previously preferred item.
 
 Would you like me to file a bug, and would you like me to test it on 0.8
 or another version? I am using 0.7.
 
 Thanks for your kind support.
 Rafal
 --
 Rafal Lukawiecki
 Strategic Consultant and Director
 Project Botticelli Ltd
 
 On 31 Jul 2013, at 06:22, Sebastian Schelter ssc.o...@googlemail.com
 wrote:
 
 Hi Rafal,
 
 can you try to set the option --maxPrefsPerUser to the maximum number of
 interactions per user and see if you still get the error?
 
 Best,
 Sebastian
 
 On 30.07.2013 19:29, Rafal Lukawiecki wrote:
 Thank you Sebastian. The data set is not that large, as we are running
 tests on a subset. It is about 24k users, 40k items, the preference file
 has 65k preferences as triples. This was using Similarity Cooccurrence.
 
 I can see if I could anonymise the data set to share if that would be
 helpful.
 
 Thanks for your kind help.
 
 Rafal
 --
 Rafal Lukawiecki
 Pardon my brevity, sent from a telephone.
 
 On 30 Jul 2013, at 18:18, Sebastian Schelter s...@apache.org wrote:
 
 Hi Rafal,
 
 can you issue a ticket for this problem at
 https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests that
 check whether this happens and currently they work fine. I can only
 imagine
 that the problem occurs in larger datasets where we sample the data in
 some
 places. Can you describe a scenario/dataset where this happens?
 
 Best,
 Sebastian
 
 2013/7/30 Rafal Lukawiecki ra...@projectbotticelli.com
 
 I'm new here, just registered. Many thanks to everyone for working on
 an
 amazing piece of software, thank you for building Mahout and for your
 support. My apologies if this is not the right place to ask the
 question—I
 have searched for the issue, and I can see this problem has been
 reported
 here:
 http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
 
 Unfortunately, the trail leads to the newsgroups, and I have not
 found a
 way, yet, to get an answer from them, without asking you.
 
 Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7,
 and I
 am finding that it is recommending items that the user has already
 expressed a preference for in their input file. I understand that this
  should not be happening, and I am not sure if there is a known fix or
 if I
 should be looking for a workaround (such as using the entire input as
 the
 filterFile).
 
 I will double-check that there is no error on my side, but so far it
 does
 not seem that way.
 
 Many thanks and my regards from Ireland,
 Rafal Lukawiecki
 
 --
 
 Rafal Lukawiecki
 
 Strategic Consultant and Director
 
 Project Botticelli Ltd
 


Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-01 Thread Sebastian Schelter
Setting it to the maximum number should be enough. Would be great if you
can share your dataset and tests.

2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com

 Should I have set that parameter to a value much much larger than the
 maximum number of actually expressed preferences by a user?

 I'm working on an anonymised data set. If it works as an error test case,
 I'd be happy to share it for your re-test. I am still hoping it is my
 error, not Mahout's.

 Rafal
 --
 Rafal Lukawiecki
 Pardon brevity, mobile device.

 On 1 Aug 2013, at 17:19, Sebastian Schelter s...@apache.org wrote:

  Ok, please file a bug report detailing what you've tested and what
 results
  you got.
 
  Just to clarify, setting maxPrefsPerUser to a high number still does not
  help? That surprises me.
 
 
  2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com
 
  Hi Sebastian,
 
  I've rechecked the results, and, I'm afraid that the issue has not gone
  away, contrary to my yesterday's enthusiastic response. Using 0.8 I have
  retested with and without --maxPrefsPerUser 9000 parameter (no user has
  more than 5000 prefs). I have also supplied the prefs file, without the
  preference value, that is as: user,item (one per line) as a
 --filterFile,
  with and without the -maxPrefsPerUser, and I am afraid we are also
 seeing
  recommendations for items the user has expressed a prior preference for.
 
  I suppose I need to file a bug report.
 
  Rafal
  --
  Rafal Lukawiecki
  Pardon my brevity, sent from a telephone.
 
  On 31 Jul 2013, at 22:35, Rafal Lukawiecki 
 ra...@projectbotticelli.com
  wrote:
 
  Dear Sebastian,
 
   It looks like setting --maxPrefsPerUser 10000 has resolved the issue
 in
  our case—it seems that the most preferences a user had was just about
 5000,
  so I doubled it just-in-case, but when I operationalise this model, I
 will
  make sure to calculate the actual max number of preferences and set the
  parameter accordingly. I will double-check the resultset to make sure
 the
  issue is really gone, as I have only checked the few cases where we have
  spotted a recommendation of a previously preferred item.
 
  Would you like me to file a bug, and would you like me to test it on
 0.8
  or another version? I am using 0.7.
 
  Thanks for your kind support.
  Rafal
  --
  Rafal Lukawiecki
  Strategic Consultant and Director
  Project Botticelli Ltd
 
  On 31 Jul 2013, at 06:22, Sebastian Schelter ssc.o...@googlemail.com
  wrote:
 
  Hi Rafal,
 
  can you try to set the option --maxPrefsPerUser to the maximum number
 of
  interactions per user and see if you still get the error?
 
  Best,
  Sebastian
 
  On 30.07.2013 19:29, Rafal Lukawiecki wrote:
  Thank you Sebastian. The data set is not that large, as we are running
  tests on a subset. It is about 24k users, 40k items, the preference file
  has 65k preferences as triples. This was using Similarity Cooccurrence.
 
  I can see if I could anonymise the data set to share if that would be
  helpful.
 
  Thanks for your kind help.
 
  Rafal
  --
  Rafal Lukawiecki
  Pardon my brevity, sent from a telephone.
 
  On 30 Jul 2013, at 18:18, Sebastian Schelter s...@apache.org
 wrote:
 
  Hi Rafal,
 
  can you issue a ticket for this problem at
  https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests
 that
  check whether this happens and currently they work fine. I can only
  imagine
  that the problem occurs in larger datasets where we sample the data
 in
  some
  places. Can you describe a scenario/dataset where this happens?
 
  Best,
  Sebastian
 
  2013/7/30 Rafal Lukawiecki ra...@projectbotticelli.com
 
  I'm new here, just registered. Many thanks to everyone for working
 on
  an
  amazing piece of software, thank you for building Mahout and for
 your
  support. My apologies if this is not the right place to ask the
  question—I
  have searched for the issue, and I can see this problem has been
  reported
  here:
 
 http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
 
  Unfortunately, the trail leads to the newsgroups, and I have not
  found a
  way, yet, to get an answer from them, without asking you.
 
  Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7,
  and I
  am finding that it is recommending items that the user has already
  expressed a preference for in their input file. I understand that
 this
   should not be happening, and I am not sure if there is a known fix or
  if I
  should be looking for a workaround (such as using the entire input
 as
  the
  filterFile).
 
  I will double-check that there is no error on my side, but so far it
  does
  not seem that way.
 
  Many thanks and my regards from Ireland,
  Rafal Lukawiecki
 
  --
 
  Rafal Lukawiecki
 
  Strategic Consultant and Director
 
  Project Botticelli Ltd
 



Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:


 For item similarities there is no need to do more than fetch one doc that
 contains the similarities, right? I've successfully used this method with
 the Mahout recommender but please correct me if something above is wrong.


No.

First, you need to retrieve all the other documents that are referenced to
get their display meta-data. So this isn't just a one document fetch.

Second, the similar items point inwards, not outwards.  Thus, the query you
want has the id of the current item and searches the similar_items field.
 The result of that search is all of the similar items.

The confusion here may stem from the name of the field.  A name like
linked-from-items or some such might help here.


Another way to look at this is that there should be no procedural
difference if you have 10 items or 20 in your history.  Either way, your
history is a query against the appropriate link fields.  Likewise, there
should be no difference between having 10 items or 2 items in your history.
 There shouldn't even be any difference if you have even just 1 item in
your history.

Finding items similar to a single item is exactly like having 1 item in
your history.  So that should be done by searching with that one item in
the appropriate link fields.
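
To make the shape of that query concrete, here is a minimal SolrJ (Solr 4.x) sketch 
of the idea. The core URL, the field names (linked_from_items, title) and the item 
id are made-up placeholders; the point is only that a single query against the link 
field returns the related items together with their display meta-data, not that this 
is Ted's actual setup.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class SimilarItemsQuery {
      public static void main(String[] args) throws Exception {
        // Solr 4.x style client; URL and core name are assumptions
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/items");
        // search the link field with the current item's id (one item of "history")
        SolrQuery query = new SolrQuery("linked_from_items:item123");
        query.setFields("id", "title");  // display meta-data comes back with the hits
        query.setRows(10);
        for (SolrDocument doc : solr.query(query).getResults()) {
          System.out.println(doc.getFieldValue("id") + "\t" + doc.getFieldValue("title"));
        }
      }
    }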


Re: Why is Lanczos deprecated?

2013-08-01 Thread Jake Mannix
On Thu, Aug 1, 2013 at 7:08 AM, Sebastian Schelter s...@apache.org wrote:

 IIRC the main reasons for deprecating Lanczos was that in contrast to
 SSVD, it does not use a constant number of MapReduce jobs and that our
 implementation has the constraint that all the resulting vectors have to
 fit into the memory of the driver machine.


While it's true that Lanczos does not use a constant number of MR iterations,
the phrase "our implementation" is key in saying we have to hold all the
output vectors in memory.  This wasn't even a very integral part of our impl.
It's fairly simple to implement the linear combinations of the Ritz vectors
after iterations are complete as an operation keeping only 3 vectors in memory
at a time; we just never made that optimization.
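
For what it's worth, here is a rough Java sketch of that optimization as I read it 
(not code from Mahout's Lanczos solver): each output vector is a linear combination 
of the Lanczos basis vectors, so it can be accumulated by streaming the basis vectors 
one at a time instead of holding them all in memory. The class and method names and 
the way the basis is iterated are assumptions.

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.function.Functions;

    public class RitzCombiner {
      // Builds the j-th output vector as sum_i ritz.get(i, j) * basis_i,
      // keeping only the accumulator and the current basis vector in memory.
      public static Vector outputVector(Iterable<Vector> lanczosBasis, Matrix ritz, int j, int dim) {
        Vector acc = new DenseVector(dim);
        int i = 0;
        for (Vector basisVector : lanczosBasis) {  // streamed, e.g. from a SequenceFile
          acc.assign(basisVector, Functions.plusMult(ritz.get(i++, j)));
        }
        return acc;
      }
    }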



 Best,
 Sebastian

 On 01.08.2013 12:15, Fernando Fernández wrote:
  Hi everyone,
 
  Sorry if I duplicate the question but I've been looking for an answer
 and I
  haven't found an explanation other than it's not being used (together
 with
  some other algorithms). If it's been discussed in depth before maybe you
  can point me to some link with the discussion.
 
  I have successfully used Lanczos in several projects and it's been a
  surprise to me finding that the main reason (according to what I've read
  that might not be the full story) is that it's not being used. At the
  beginning I supposed it was because SSVD is supposed to be much faster
 with
  similar results, but after making some tests I have found that running
  times are similar or even worse than lanczos for some configurations (I
  have tried several combinations of parameters, given child processes
 enough
  memory, etc. and had no success in running SSVD at least in 3/4 of time
  Lanczos runs, though there might be some combinations of parameters I have
  still not tried). It seems to be quite tricky to find a good combination
 of
  parameters for SSVD and I have seen also a precision loss in some
 examples
  that makes me not confident in migrating Lanczos to SSVD from now on (How
  far can I trust results from a combination of parameters that runs in
  significantly less time, or at least a good time?).
 
  Can someone convince me that SSVD is actually a better option than
 Lanczos?
  (I'm totally willing to be convinced... :) )
 
  Thank you very much in advance.
 
  Fernando.
 




-- 

  -jake


multi-class classification question

2013-08-01 Thread yikes aroni
Say that I am trying to determine which customers buy particular candy
bars. So I want to classify training data consisting of candy bar
attributes (an N-dimensional vector of variables) into customer attributes
(an M-dimensional vector of customer attributes).

Is there a preferred method when N and M are large? That is, say, 100 or more?

I have done binary classification using AdaptiveLogisticRegression and
OnlineLogisticRegression and small numbers of input features with relative
success. As I'm trying to implement this for large N and M, I feel like I'm
veering into the woods. Is there a code example anyone can point me to that
uses Mahout libraries to do multi-class classification when the number of
classes is large?


Re: k-means issues

2013-08-01 Thread Marco
 mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i 
mahout/kmeans-clusters/clusters-1-final/part-r-0 -n 20 -b 100 -o cdump.txt 
-p mahout/kmeans-clusters/clusteredPoints



- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco 
zentrop...@yahoo.co.uk
Cc: 
Sent: Thursday, 1 August 2013 17:24
Subject: Re: k-means issues



Could u post the Command line u r using for clusterdump?





From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi 
suneel_mar...@yahoo.com 
Sent: Thursday, August 1, 2013 10:29 AM
Subject: Re: k-means issues


ok i did put -cl and got clusteredPoints, but then I do clusterdump and always 
get Wrote 0 clusters




- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco 
zentrop...@yahoo.co.uk
Cc: 
Sent: Thursday, 1 August 2013 16:04
Subject: Re: k-means issues

Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too 
:))

You need to specify the clustering option -cl in your kmeans command. 







From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org 
Sent: Thursday, August 1, 2013 9:55 AM
Subject: k-means issues




So I've got 13000 text files representing topics in certain newspaper articles.
Each file is just a tab-separated list of topics (so something like china    
japan    senkaku    dispute or italy   lampedusa   immigration).

I want to run k-means clustering on them.

Here's what I do (i'm actually doing it on a subset of 100 files):

1) run seqdirectory to produce sequence file from raw text files
2) run seq2sparse to produce vectors from sequence file 

(if i do seqdumper on tfidf-vectors/part-r-0 i get something like 
Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476}
and if I do it on dictionary.file-0 I get
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.hadoop.io.IntWritable
Key: china: Value: 0
Key: japan: Value: 1

3) i run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o 
mahout/kmeans-clusters -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters 
mahout/tmp)
first thing i notice here is it logs:
INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 
org.apache.mahout.math.VectorWritable Input Vectors: {}
the Input Vectors: {} part puzzles me. 


Even worse, this doesn't seem to create the clusteredPoints directory at all.

What am I doing wrong?


Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Sorry to be dense but I think there is some miscommunication. The most 
important question is: am I writing the item-item similarity matrix DRM out to 
Solr, one row = one Solr doc? For the mapreduce Mahout Item-based recommender 
this is in tmp/similarityMatrix. If not then please stop me. If I'm off base 
here, maybe a skype or im session will straighten me out. pat.fer...@gmail.com 
or p...@occamsmachete.com


To be clear below I'm not talking about history based recs, which is the 
primary use case. I am talking about a query that does not use history, that 
only finds similar items based on training data. The item-item similarity 
matrix DRM contains Key = item ID, Value = list of item IDs with similarity 
strengths.

This is equivalent to the list returned by ItemBasedRecommender's
public List<RecommendedItem> mostSimilarItems(long itemID, int howMany) throws 
TasteException

Specified by:
mostSimilarItems in interface ItemBasedRecommender

Parameters:
itemID - ID of item for which to find most similar other items
howMany - desired number of most similar items to find

Returns:
items most similar to the given item, ordered from most similar to least

To get the list from Solr you would fetch the doc associated with itemID, no? 

When using the Mahout mapreduce item-based recommender we get the similarity 
matrix and do just that. We get the row associated with the Mahout itemID and 
recommend the top k items from the vector. This performs well in 
cross-validation tests.
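
For reference, a minimal in-memory sketch of that Taste call; the preference file 
name, the similarity choice and the item id below are placeholders, not Pat's actual 
setup.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class MostSimilarItemsExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("prefs.csv"));  // userID,itemID[,value]
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        ItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);
        // anonymous item-to-item query: no user history involved
        List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
        for (RecommendedItem item : similar) {
          System.out.println(item.getItemID() + "\t" + item.getValue());
        }
      }
    }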



On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:

On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:

 
 For item similarities there is no need to do more than fetch one doc that
 contains the similarities, right? I've successfully used this method with
 the Mahout recommender but please correct me if something above is wrong.


No.

First, you need to retrieve all the other documents that are referenced to
get their display meta-data. So this isn't just a one document fetch.

Second, the similar items point inwards, not outwards.  Thus, the query you
want has the id of the current item and searches the similar_items field.
The result of that search is all of the similar items.

The confusion here may stem from the name of the field.  A name like
linked-from-items or some such might help here.


Another way to look at this is that there should be no procedural
difference if you have 10 items or 20 in your history.  Either way, your
history is a query against the appropriate link fields.  Likewise, there
should be no difference between having 10 items or 2 items in your history.
There shouldn't even be any difference if you have even just 1 item in
your history.

Finding items similar to a single item is exactly like having 1 item in
your history.  So that should be done by searching with that one item in
the appropriate link fields.



Re: k-means issues

2013-08-01 Thread Suneel Marthi
You also need to specify the distance measure '-dm' to clusterdump. This is the 
Distance Measure that was used for clustering.

(Again look at the example in /examples/bin/cluster-reuters.sh - it has all the 
steps u r trying to accomplish)





 From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi 
suneel_mar...@yahoo.com 
Sent: Thursday, August 1, 2013 2:51 PM
Subject: Re: k-means issues
 

 mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i 
mahout/kmeans-clusters/clusters-1-final/part-r-0 -n 20 -b 100 -o cdump.txt 
-p mahout/kmeans-clusters/clusteredPoints



- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco 
zentrop...@yahoo.co.uk
Cc: 
Sent: Thursday, 1 August 2013 17:24
Subject: Re: k-means issues



Could u post the Command line u r using for clusterdump?





From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi 
suneel_mar...@yahoo.com 
Sent: Thursday, August 1, 2013 10:29 AM
Subject: Re: k-means issues


ok i did put -cl and got clusteredPoints, but then I do clusterdump and always 
get Wrote 0 clusters




- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco 
zentrop...@yahoo.co.uk
Cc: 
Sent: Thursday, 1 August 2013 16:04
Subject: Re: k-means issues

Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too 
:))

You need to specify the clustering option -cl in your kmeans command. 







From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org 
Sent: Thursday, August 1, 2013 9:55 AM
Subject: k-means issues




So I've got 13000 text files representing topics in certain newspaper articles.
Each file is just a tab-separated list of topics (so something like china    
japan    senkaku    dispute or italy   lampedusa   immigration).

I want to run k-means clustering on them.

Here's what I do (i'm actually doing it on a subset of 100 files):

1) run seqdirectory to produce sequence file from raw text files
2) run seq2sparse to produce vectors from sequence file 

(if i do seqdumper on tfidf-vectors/part-r-0 i get something like 
Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476}
and if I do it on dictionary.file-0 I get
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.hadoop.io.IntWritable
Key: china: Value: 0
Key: japan: Value: 1

3) i run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o 
mahout/kmeans-clusters -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters 
mahout/tmp)
first thing i notice here is it logs:
INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 
org.apache.mahout.math.VectorWritable Input Vectors: {}
the Input Vectors: {} part puzzles me. 


Even worse, this doesn't seem to create the clusteredPoints directory at all.

What am I doing wrong?

Re: k-means issues

2013-08-01 Thread Jeff Eastman

The clustering arguments are usually directories, not files. Try:

 mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i 
mahout/kmeans-clusters/clusters-1-final -n 20 -b 100 -o cdump.txt -p 
mahout/kmeans-clusters/clusteredPoints



On 8/1/13 2:51 PM, Marco wrote:

  mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i 
mahout/kmeans-clusters/clusters-1-final/part-r-0 -n 20 -b 100 -o cdump.txt 
-p mahout/kmeans-clusters/clusteredPoints



- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco 
zentrop...@yahoo.co.uk
Cc:
Sent: Thursday, 1 August 2013 17:24
Subject: Re: k-means issues



Could u post the Command line u r using for clusterdump?





From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi 
suneel_mar...@yahoo.com
Sent: Thursday, August 1, 2013 10:29 AM
Subject: Re: k-means issues


ok i did put -cl and got clusteredPoints, but then I do clusterdump and always get 
Wrote 0 clusters




- Original Message -
From: Suneel Marthi suneel_mar...@yahoo.com
To: user@mahout.apache.org user@mahout.apache.org; Marco 
zentrop...@yahoo.co.uk
Cc:
Sent: Thursday, 1 August 2013 16:04
Subject: Re: k-means issues

Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too 
:))

You need to specify the clustering option -cl in your kmeans command.







From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org
Sent: Thursday, August 1, 2013 9:55 AM
Subject: k-means issues




So I've got 13000 text files representing topics in certain newspaper articles.
Each file is just a tab-separated list of topics (so something like china japan senkaku 
dispute or italy lampedusa immigration).

I want to run k-means clustering on them.

Here's what I do (i'm actually doing it on a subset of 100 files):

1) run seqdirectory to produce sequence file from raw text files
2) run seq2sparse to produce vectors from sequence file

(if i do seqdumper on tfidf-vectors/part-r-0 i get something like
Key: /filename1: Value: /filename1:{72:0.7071067811865476,0:0.7071067811865476}
and if I do it on dictionary.file-0 I get
Key class: class org.apache.hadoop.io.Text Value Class: class 
org.apache.hadoop.io.IntWritable
Key: china: Value: 0
Key: japan: Value: 1

3) i run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o 
mahout/kmeans-clusters -dm 
org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters 
mahout/tmp)
first thing i notice here is it logs:
INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: 
org.apache.mahout.math.VectorWritable Input Vectors: {}
the Input Vectors: {} part puzzles me.


Even worse, this doesn't seem to create the clusteredPoints directory at all.

What am I doing wrong?






Re: k-means issues

2013-08-01 Thread Marco
thanks a lot. will try your suggestions asap.
i was sort of following this http://goo.gl/u8VFZN


- Original Message -
From: Jeff Eastman j...@windwardsolutions.com
To: user@mahout.apache.org
Cc: 
Sent: Thursday, 1 August 2013 21:02
Subject: Re: k-means issues

The clustering arguments are usually directories, not files. Try:

  mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i 
mahout/kmeans-clusters/clusters-1-final -n 20 -b 100 -o cdump.txt -p 
mahout/kmeans-clusters/clusteredPoints



On 8/1/13 2:51 PM, Marco wrote:
   mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i 
mahout/kmeans-clusters/clusters-1-final/part-r-0 -n 20 -b 100 -o cdump.txt 
-p mahout/kmeans-clusters/clusteredPoints



 - Original Message -
 From: Suneel Marthi suneel_mar...@yahoo.com
 To: user@mahout.apache.org user@mahout.apache.org; Marco 
 zentrop...@yahoo.co.uk
 Cc:
 Sent: Thursday, 1 August 2013 17:24
 Subject: Re: k-means issues



 Could u post the Command line u r using for clusterdump?




 
 From: Marco zentrop...@yahoo.co.uk
 To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi 
 suneel_mar...@yahoo.com
 Sent: Thursday, August 1, 2013 10:29 AM
 Subject: Re: k-means issues


 ok i did put -cl and got clusteredPoints, but then I do clusterdump and 
 always get Wrote 0 clusters




 - Original Message -
 From: Suneel Marthi suneel_mar...@yahoo.com
 To: user@mahout.apache.org user@mahout.apache.org; Marco 
 zentrop...@yahoo.co.uk
 Cc:
 Sent: Thursday, 1 August 2013 16:04
 Subject: Re: k-means issues

 Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too 
 :))

 You need to specify the clustering option -cl in your kmeans command.






 
 From: Marco zentrop...@yahoo.co.uk
 To: user@mahout.apache.org user@mahout.apache.org
 Sent: Thursday, August 1, 2013 9:55 AM
 Subject: k-means issues




 So I've got 13000 text files representing topics in certain newspaper 
 articles.
 Each file is just a tab-separated list of topics (so something like china    
 japan    senkaku    dispute or italy   lampedusa   immigration).

 I want to run k-means clustering on them.

 Here's what I do (i'm actually doing it on a subset of 100 files):

 1) run seqdirectory to produce sequence file from raw text files
 2) run seq2sparse to produce vectors from sequence file

 (if i do seqdumper on tfidf-vectors/part-r-0 i get something like
 Key: /filename1: Value: 
 /filename1:{72:0.7071067811865476,0:0.7071067811865476}
 and if I do it on dictionary.file-0 I get
 Key class: class org.apache.hadoop.io.Text Value Class: class 
 org.apache.hadoop.io.IntWritable
 Key: china: Value: 0
 Key: japan: Value: 1

 3) i run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o 
 mahout/kmeans-clusters -dm 
 org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters 
 mahout/tmp)
 first thing i notice here is it logs:
 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce 
 Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
 the Input Vectors: {} part puzzles me.


 Even worse, this doesn't seem to create the clusteredPoints directory at all.

 What am I doing wrong?




Re: k-means issues

2013-08-01 Thread Suneel Marthi
Thanks for pointing that out. I corrected the Wiki page.





 From: Marco zentrop...@yahoo.co.uk
To: user@mahout.apache.org user@mahout.apache.org 
Sent: Thursday, August 1, 2013 3:08 PM
Subject: Re: k-means issues
 

thanks a lot. will try your suggestions asap.
i was sort of following this http://goo.gl/u8VFZN


- Original Message -
From: Jeff Eastman j...@windwardsolutions.com
To: user@mahout.apache.org
Cc: 
Sent: Thursday, 1 August 2013 21:02
Subject: Re: k-means issues

The clustering arguments are usually directories, not files. Try:

  mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i 
mahout/kmeans-clusters/clusters-1-final -n 20 -b 100 -o cdump.txt -p 
mahout/kmeans-clusters/clusteredPoints



On 8/1/13 2:51 PM, Marco wrote:
   mahout clusterdump -d mahout/vectors/dictionary.file-0 -dt sequencefile -i 
mahout/kmeans-clusters/clusters-1-final/part-r-0 -n 20 -b 100 -o cdump.txt 
-p mahout/kmeans-clusters/clusteredPoints



 - Original Message -
 From: Suneel Marthi suneel_mar...@yahoo.com
 To: user@mahout.apache.org user@mahout.apache.org; Marco 
 zentrop...@yahoo.co.uk
 Cc:
 Sent: Thursday, 1 August 2013 17:24
 Subject: Re: k-means issues



 Could u post the Command line u r using for clusterdump?




 
 From: Marco zentrop...@yahoo.co.uk
 To: user@mahout.apache.org user@mahout.apache.org; Suneel Marthi 
 suneel_mar...@yahoo.com
 Sent: Thursday, August 1, 2013 10:29 AM
 Subject: Re: k-means issues


 ok i did put -cl and got clusteredPoints, but then I do clusterdump and 
 always get Wrote 0 clusters




 - Original Message -
 From: Suneel Marthi suneel_mar...@yahoo.com
 To: user@mahout.apache.org user@mahout.apache.org; Marco 
 zentrop...@yahoo.co.uk
 Cc:
 Sent: Thursday, 1 August 2013 16:04
 Subject: Re: k-means issues

 Check examples/bin/cluster_reuters.sh for kmeans (it exists in Mahout 0.7 too 
 :))

 You need to specify the clustering option -cl in your kmeans command.






 
 From: Marco zentrop...@yahoo.co.uk
 To: user@mahout.apache.org user@mahout.apache.org
 Sent: Thursday, August 1, 2013 9:55 AM
 Subject: k-means issues




 So I've got 13000 text files representing topics in certain newspaper 
 articles.
 Each file is just a tab-separated list of topics (so something like china    
 japan    senkaku    dispute or italy   lampedusa   immigration).

 I want to run k-means clustering on them.

 Here's what I do (i'm actually doing it on a subset of 100 files):

 1) run seqdirectory to produce sequence file from raw text files
 2) run seq2sparse to produce vectors from sequence file

 (if i do seqdumper on tfidf-vectors/part-r-0 i get something like
 Key: /filename1: Value: 
 /filename1:{72:0.7071067811865476,0:0.7071067811865476}
 and if I do it on dictionary.file-0 I get
 Key class: class org.apache.hadoop.io.Text Value Class: class 
 org.apache.hadoop.io.IntWritable
 Key: china: Value: 0
 Key: japan: Value: 1

 3) i run k-means (mahout kmeans -i mahout/vectors/tfidf-vectors/ -k 10 -o 
 mahout/kmeans-clusters -dm 
 org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 --clusters 
 mahout/tmp)
 first thing i notice here is it logs:
 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce 
 Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
 the Input Vectors: {} part puzzles me.


 Even worse, this doesn't seem to create the clusteredPoints directory at all.

 What am I doing wrong?



Re: multi-class classification question

2013-08-01 Thread Ted Dunning
I have talked to one user who had ~60,000 classes and they were able to use
OLR with success.

The way that they did this was to arrange the output classes into a
multi-level tree.  Then they trained classifiers at each level of the tree.
 At any level, if there was a dominating result, then only that sub-tree
would be searched.  Otherwise, all of the top few trees would be searched.

Thus, execution would proceed by evaluating the classifier at the root of
the tree.  One or more sub-trees would be selected.  Each of the
classifiers at the roots of these sub-trees would be evaluated.  This would
give a set of sub-sub-trees that eventually bottomed out with possible
answers.  These possible answers are combined to get a final set of
categories.

The detailed meanings of "dominating", "top few" and "answers are
combined" are left as an exercise, but I think you can see the general
outline.  The detailed definitions are very likely application specific in
any case.
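
A very rough Java sketch of that tree-of-classifiers idea, in case it helps. The 
two-level layout and the grouping of fine classes into coarse groups are assumptions 
for illustration, not something taken from that user's system or from a dedicated 
Mahout API; the "dominating or search the top few sub-trees" logic is only noted in 
a comment.

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class TwoLevelClassifier {
      private final OnlineLogisticRegression root;      // chooses a coarse group of classes
      private final OnlineLogisticRegression[] leaves;  // one fine-grained model per group

      public TwoLevelClassifier(int numGroups, int[] classesPerGroup, int numFeatures) {
        root = new OnlineLogisticRegression(numGroups, numFeatures, new L1());
        leaves = new OnlineLogisticRegression[numGroups];
        for (int g = 0; g < numGroups; g++) {
          leaves[g] = new OnlineLogisticRegression(classesPerGroup[g], numFeatures, new L1());
        }
      }

      // train with (coarse group, class index within that group) labels
      public void train(int group, int classInGroup, Vector features) {
        root.train(group, features);
        leaves[group].train(classInGroup, features);
      }

      // returns the winning class index within the winning group; mapping back to a
      // global class id, and evaluating several sub-trees when no group dominates,
      // are left out to keep the sketch short
      public int classify(Vector features) {
        Vector groupScores = root.classifyFull(features);
        int bestGroup = groupScores.maxValueIndex();
        return leaves[bestGroup].classifyFull(features).maxValueIndex();
      }
    }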



On Thu, Aug 1, 2013 at 11:25 AM, yikes aroni yikesar...@gmail.com wrote:

 Say that I am trying to determine which customers buy particular candy
 bars. So I want to classify training data consisting of candy bar
 attributes (an N dimensional vector of variables) into customer attributes
 (an M dimensional vector of customer attributes).

 Is there a preferred method when N and M are large? That is say 100 or
 more?

 I have done binary classification using AdaptiveLogisticRegression and
 OnlineLogisticRegression and small numbers of input features with relative
 success. As I'm trying to implement this for large N and M, I feel like i'm
 veering into the woods. Is there a code example anyone can point me to that
 uses mahout libraries to do multi-class classification when the number of
 classes is large?



Re: Setting up a recommender

2013-08-01 Thread Ted Dunning
On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote:

 Sorry to be dense but I think there is some miscommunication. The most
 important question is: am I writing the item-item similarity matrix DRM out
 to Solr, one row = one Solr doc?


Each row = one *field* in a Solr doc.  Different DRM's produce different
fields in the same docs.

There will also be item meta-data in the field.


 For the mapreduce Mahout Item-based recommender this is in
 tmp/similarityMatrix. If not then please stop me. If I'm off base here,
 maybe a skype or im session will straighten me out. pat.ferrel@gmail.com or
 p...@occamsmachete.com


Actually, that is a grand idea.  Let's do a hangout.

From the who-is-free-when survey
(https://docs.google.com/forms/d/1skIaqe0CBWO4qemTyHCZwS40YjXJ9FeLCqwV8cw4Gno/viewform),
it looks like lots of people are available tomorrow at 2PM PDT.

Would that work?

To be clear below I'm not talking about history based recs, which is the
 primary use case. I am talking about a query that does not use history,
 that only finds similar items based on training data. The item-item
 similarity matrix DRM contains Key = item ID, Value = list of item IDs with
 similarity strengths.


Yes.  I absolutely agree that you can do this.

These should, strictly speaking, be columns in the item-item matrix.  The
item-item matrix may or may not be symmetric.  If it is symmetric, then
column or row doesn't matter.


 This is equivalent to the list returned by ItemBasedRecommender's
 public List<RecommendedItem> mostSimilarItems(long itemID, int howMany)
 throws TasteException


Yes.


 Specified by:
 mostSimilarItems in interface ItemBasedRecommender

 Parameters:
 itemID - ID of item for which to find most similar other items
 howMany - desired number of most similar items to find

 Returns:
 items most similar to the given item, ordered from most similar to least

 To get the list from Solr you would fetch the doc associated with
 itemID, no?


If you store the column, then yes.

If you store the row, then using a query on the field containing the
similar items is the right answer.

The key difference that I have is what happens in the next step.

When using the Mahout mapreduce item-based recommender we get the
 similarity matrix and do just that. We get the row associated with the
 Mahout itemID and recommend the top k items from the vector. This performs
 well in cross-validation tests.


Good.

I think that there is a row/column confusion here, but they are probably
nearly identical in your application.

The key point is what happens *after* you do the query that you are
suggesting.

In your case, you have to retrieve the meta-data associated with each of the
related items.  I like to store this meta-data in a Solr field (or three)
so this involves at least one additional query.  You can automatically
chain this second query by using the join operation that Solr provides,
but the second query still happens.

If you do the query the way that I suggest, this second query doesn't need
to happen.  You get the meta-data directly.
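
If it helps to see the contrast as queries, here is a tiny SolrJ-flavoured sketch; 
the field names (similar_items as the outward list stored on each doc, 
linked_from_items as the inward link field) and the item id are placeholders, and 
the join form is only my reading of the chaining Ted mentions.

    import org.apache.solr.client.solrj.SolrQuery;

    public class QueryShapes {
      public static void main(String[] args) {
        // (a) chained form of the doc-fetch approach: match the current item's doc and
        //     follow the ids in its similar_items field via Solr's join query parser
        SolrQuery chained = new SolrQuery("{!join from=similar_items to=id}id:item123");
        // (b) single direct query on the inward link field; meta-data comes with the hits
        SolrQuery direct = new SolrQuery("linked_from_items:item123");
        System.out.println(chained.getQuery());
        System.out.println(direct.getQuery());
      }
    }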








 On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com wrote:

 
  For item similarities there is no need to do more than fetch one doc that
  contains the similarities, right? I've successfully used this method with
  the Mahout recommender but please correct me if something above is wrong.


 No.

 First, you need to retrieve all the other documents that are referenced to
 get their display meta-data. So this isn't just a one document fetch.

 Second, the similar items point inwards, not outwards.  Thus, the query you
 want has the id of the current item and searches the similar_items field.
 The result of that search is all of the similar items.

 The confusion here may stem from the name of the field.  A name like
 linked-from-items or some such might help here.


 Another way to look at this is that there should be no procedural
 difference if you have 10 items or 20 in your history.  Either way, your
 history is a query against the appropriate link fields.  Likewise, there
 should be no difference between having 10 items or 2 items in your history.
 There shouldn't even be any difference if you have even just 1 item in
 your history.

 Finding items similar to a single item is exactly like having 1 item in
 your history.  So that should be done by searching with that one item in
 the appropriate link fields.




Re: Setting up a recommender

2013-08-01 Thread B Lyon
I am wondering about row/column confusion as well - fleshing out the
doc/design with more specifics (which Pat is kind of doing, basically)
should make things obvious eventually, imo.

The way Pat had phrased it got me wondering what rationale you use to
rank the results when you are querying the columns (the similar column, the
similar via action 2 column, etc.).

He had mentioned the auxiliary case of simply getting the most similar items
to a given docid by just going to the row for that docid and using the
pre-sorted values in the similar column, and I thought Ted might have
hinted that you could just as well do a Solr query of the column with that
single docid as the query; however, in the latter case I wonder if the
order and the list itself could be weird, as some items may show up simply
because they are not similar to many things: lower LLR values that got
filtered out of the list for the docid itself won't get filtered when you're
looking at the other "not similar to very many items" items when
generating their list for the Solr field.  I guess using an absolute
cutoff for LLR in the filtering could deal with some of this issue.  All
hypothetical at the moment (for me, anyway), as real data might trivially
dismiss some of these concerns as irrelevant.
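
For anyone who wants to poke at that, a tiny sketch using Mahout's LogLikelihood 
helper; the cooccurrence counts and the 10.0 cutoff are made-up numbers, just to 
show where an absolute threshold would sit.

    import org.apache.mahout.math.stats.LogLikelihood;

    public class LlrCutoff {
      public static void main(String[] args) {
        long k11 = 13;      // A and B seen together
        long k12 = 1000;    // A without B
        long k21 = 1000;    // B without A
        long k22 = 100000;  // neither A nor B
        double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
        // keep the pair only if it clears an absolute bar, regardless of per-row rank
        System.out.println("llr=" + llr + " keep=" + (llr > 10.0));
      }
    }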

I think the hangout is a good idea, too, btw, and hope to be able to sit in
if it happens.  Very excited about this approach.

On Thu, Aug 1, 2013 at 6:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Thu, Aug 1, 2013 at 11:58 AM, Pat Ferrel pat.fer...@gmail.com wrote:

  Sorry to be dense but I think there is some miscommunication. The most
  important question is: am I writing the item-item similarity matrix DRM
 out
  to Solr, one row = one Solr doc?


 Each row = one *field* in a Solr doc.  Different DRM's produce different
 fields in the same docs.

 There will also be item meta-data in the field.


  For the mapreduce Mahout Item-based recommender this is in
  tmp/similarityMatrix. If not then please stop me. If I'm off base here,
  maybe a skype or im session will straighten me out.
 pat.ferrel@gmail.com or
  p...@occamsmachete.com


 Actually, that is a grand idea.  Let's do a hangout.

 From the who-is-free-when
 https://docs.google.com/forms/d/1skIaqe0CBWO4qemTyHCZwS40YjXJ9FeLCqwV8cw4Gno/viewform
 survey,
 it looks like lots of people are available tomorrow at 2PM PDT.

 Would that work?

 To be clear below I'm not talking about history based recs, which is the
  primary use case. I am talking about a query that does not use history,
  that only finds similar items based on training data. The item-item
  similarity matrix DRM contains Key = item ID, Value = list of item IDs
 with
  similarity strengths.
 

 Yes.  I absolutely agree that you can do this.

 These should, strictly speaking, be columns in the item-item matrix.  The
 item-item matrix may or may not be symmetric.  If it is symmetric, then
 column or row doesn't matter.


  This is equivalent to the list returned by ItemBasedRecommender's
  public List<RecommendedItem> mostSimilarItems(long itemID, int howMany)
  throws TasteException
 

 Yes.


  Specified by:
  mostSimilarItems in interface ItemBasedRecommender
 
  Parameters:
  itemID - ID of item for which to find most similar other items
  howMany - desired number of most similar items to find
 
  Returns:
  items most similar to the given item, ordered from most similar to least
 
  To get the list from Solr you would fetch the doc associated with
  itemID, no?
 

 If you store the column, then yes.

 If you store the row, then using a query on the field containing the
 similar items is the right answer.

 The key difference that I have is what happens in the next step.

 When using the Mahout mapreduce item-based recommender we get the
  similarity matrix and do just that. We get the row associated with the
  Mahout itemID and recommend the top k items from the vector. This
 performs
  well in cross-validation tests.
 

 Good.

 I think that there is a row/column confusion here, but they are probably
 nearly identical in your application.

 The key point is what happens *after* you do the query that you are
 suggesting.

 In your case, you have to retrieve the meta-data associated with each of
 related items.  I like to store this meta-data in a Solr field (or three)
 so this involves at least one additional query.  You can automatically
 chain this second query by using the join operation that Solr provides,
 but the second query still happens.

 If you do the query the way that I suggest, this second query doesn't need
 to happen.  You get the meta-data directly.





 
 
 
  On Aug 1, 2013, at 9:49 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
  On Thu, Aug 1, 2013 at 8:46 AM, Pat Ferrel p...@occamsmachete.com
 wrote:
 
  
   For item similarities there is no need to do more than fetch one doc
 that
   contains the similarities, right? I've successfully used this method
 with
   the Mahout recommender but please correct me if something 

Re: Setting up a recommender

2013-08-01 Thread Pat Ferrel
Yes, storing the similar_items in a field, cross_action_similar_items in 
another field, all on the same doc identified by item ID. Agree that there may 
be other fields.

Storing the rows of [B'B] is ok because it's symmetric. However, we did talk 
about the [B'A] case and I thought we agreed to store the rows there too 
because they were from B's items. This was the discussion about having different 
items for cross actions. The excerpt below is Ted responding to my question. So 
do we want the columns of [B'A]? It's only a transpose away.


 On Tue, Jul 30, 2013 at 11:11 AM, Pat Ferrel p...@occamsmachete.com wrote:
 [B'A] =
          iphone  ipad    nexus   galaxy  surface
  iphone  2       2       2       1       0
  ipad    2       2       2       1       0
  nexus   1       1       1       1       0
  galaxy  1       1       1       1       0
  surface 0       0       0       0       1
 
 The rows are what we want from [B'A] since the row items are from B, right?
 
 Yes.
 
 It is easier to understand if you have different kinds of items as well as 
 different actions.  For instance, suppose that you have user x query terms 
 (A) and user x device (B).  B'A is then device x term so that there is a row 
 per device and the row contains terms.  This is good when searching for 
 devices using terms.


Talking about getting the actual doc field values, which will include the 
similar_items field and other metadata. The actual ids in the similar_items 
field work well for anonymous/no-history recs, but maybe there is a second query 
or fetch that I'm missing? I assumed that a fetch of the doc and its fields 
by item ID was as fast a way to do this as possible. If there is some way to 
get the same result by doing a query that is faster, I'm all for it.

Can do tomorrow at 2.

Re: Why is Lanczos deprecated?

2013-08-01 Thread Dmitriy Lyubimov
There's a part of Nathan Halko's dissertation, referenced on the algorithm
page, that runs a comparison.  In particular, he was not able to compute more
than 40 eigenvectors with Lanczos on the wikipedia dataset. You may refer to
that study.

On the accuracy side, it was not observed to be a problem, assuming a high
level of random noise is not the case, at least not in the LSA-like
application used there.

That said, I am all for diversity of tools. I would actually be +0 on
deprecating Lanczos; it is not like we are lacking support for it. SSVD
could use improvements too.


On Thu, Aug 1, 2013 at 3:15 AM, Fernando Fernández 
fernando.fernandez.gonza...@gmail.com wrote:

 Hi everyone,

 Sorry if I duplicate the question but I've been looking for an answer and I
 haven't found an explanation other than it's not being used (together with
 some other algorithms). If it's been discussed in depth before maybe you
 can point me to some link with the discussion.

 I have successfully used Lanczos in several projects and it's been a
 surprise to me finding that the main reason (according to what I've read
 that might not be the full story) is that it's not being used. At the
 beginning I supposed it was because SSVD is supposed to be much faster with
 similar results, but after making some tests I have found that running
 times are similar or even worse than lanczos for some configurations (I
 have tried several combinations of parameters, given child processes enough
 memory, etc. and had no success in running SSVD at least in 3/4 of time
 Lanczos runs, though there might be some combinations of parameters I have
 still not tried). It seems to be quite tricky to find a good combination of
 parameters for SSVD and I have seen also a precision loss in some examples
 that makes me not confident in migrating Lanczos to SSVD from now on (How
 far can I trust results from a combination of parameters that runs in
 significantly less time, or at least a good time?).

 Can someone convince me that SSVD is actually a better option than Lanczos?
 (I'm totally willing to be convinced... :) )

 Thank you very much in advance.

 Fernando.



Re: Question for RecommenderJob

2013-08-01 Thread hahn jiang
The version of Mahout I am using is 0.7-cdh4.3.1 and I am sure that no
errors occur. I checked the output, but it has null.
I think the problem is my data set.
Is my item set too small, with only 200 elements?



On Thu, Aug 1, 2013 at 9:57 PM, Sebastian Schelter s...@apache.org wrote:

 Which version of Mahout are you using? Did you check the output, are you
 sure that no errors occur?

 Best,
 Sebastian

 On 01.08.2013 09:59, hahn jiang wrote:
  Hi all,
 
 
  I have a question when I use RecommenderJob for item-based
 recommendation.
 
  My input data format is userid,itemid,1, so I set booleanData option is
  true.
 
  The length of users is 9,000,000 but the length of item is 200.
 
 
  When I run the RecommenderJob, the result is null. I try many times use
  different arguments. But the result is also null.
 
  This is one of my commands. Would you help me for  tell me why it is null
  please?
 
 
  bash recommender-job.sh --input input/user-item-value --output
  output/recommender --numRecommendations 10 --similarityClassname
  SIMILARITY_PEARSON_CORRELATION --maxSimilaritiesPerItem 300
  --maxPrefsPerUser 300 --minPrefsPerUser 1
 --maxPrefsPerUserInItemSimilarity
  1000 --booleanData true
 
 
  Thanks
 




Re: Why is Lanczos deprecated?

2013-08-01 Thread Sebastian Schelter
I would also be fine with keeping it if there is demand. I just proposed to
deprecate it and nobody voted against that at that point in time.

--sebastian


On 02.08.2013 03:12, Dmitriy Lyubimov wrote:
 There's a part of Nathan Halko's dissertation referenced on algorithm page
 running comparison.  In particular, he was not able to compute more than 40
 eigenvectors with Lanczos on wikipedia dataset. You may refer to that
 study.
 
 On the accuracy part, it was not observed that it was a problem, assuming
 high level of random noise is not the case, at least not in LSA-like
 application used there.
 
 That said, i am all for diversity of tools, I would actually be +0 on
 deprecating Lanczos, it is not like we are lacking support for it. SSVD
 could use improvements too.
 
 
 On Thu, Aug 1, 2013 at 3:15 AM, Fernando Fernández 
 fernando.fernandez.gonza...@gmail.com wrote:
 
 Hi everyone,

 Sorry if I duplicate the question but I've been looking for an answer and I
 haven't found an explanation other than it's not being used (together with
 some other algorithms). If it's been discussed in depth before maybe you
 can point me to some link with the discussion.

 I have successfully used Lanczos in several projects and it's been a
 surprise to me finding that the main reason (according to what I've read
 that might not be the full story) is that it's not being used. At the
 beginning I supposed it was because SSVD is supposed to be much faster with
 similar results, but after making some tests I have found that running
 times are similar or even worse than lanczos for some configurations (I
 have tried several combinations of parameters, given child processes enough
 memory, etc. and had no success in running SSVD at least in 3/4 of time
 Lanczos runs, though there might be some combinations of parameters I have
 still not tried). It seems to be quite tricky to find a good combination of
 parameters for SSVD and I have seen also a precision loss in some examples
 that makes me not confident in migrating Lanczos to SSVD from now on (How
 far can I trust results from a combination of parameters that runs in
 significantly less time, or at least a good time?).

 Can someone convince me that SSVD is actually a better option than Lanczos?
 (I'm totally willing to be convinced... :) )

 Thank you very much in advance.

 Fernando.

 



Re: Question for RecommenderJob

2013-08-01 Thread Sebastian Schelter
The size should not matter, you should get output. What exactly do you
mean by it has null?

--sebastian

On 02.08.2013 03:44, hahn jiang wrote:
 The version of Mahout which I used is 0.7-cdh4.3.1 and I am sure that no
 errors occur. I check the output but it has null.
 I think the problem is my data set.
 Is it too small about my item set that only 200 elements?
 
 
 
 On Thu, Aug 1, 2013 at 9:57 PM, Sebastian Schelter s...@apache.org wrote:
 
 Which version of Mahout are you using? Did you check the output, are you
 sure that no errors occur?

 Best,
 Sebastian

 On 01.08.2013 09:59, hahn jiang wrote:
 Hi all,


 I have a question when I use RecommenderJob for item-based
 recommendation.

 My input data format is userid,itemid,1, so I set booleanData option is
 true.

 The length of users is 9,000,000 but the length of item is 200.


 When I run the RecommenderJob, the result is null. I try many times use
 different arguments. But the result is also null.

 This is one of my commands. Would you help me for  tell me why it is null
 please?


 bash recommender-job.sh --input input/user-item-value --output
 output/recommender --numRecommendations 10 --similarityClassname
 SIMILARITY_PEARSON_CORRELATION --maxSimilaritiesPerItem 300
 --maxPrefsPerUser 300 --minPrefsPerUser 1
 --maxPrefsPerUserInItemSimilarity
 1000 --booleanData true


 Thanks