Re: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

2014-03-05 Thread Margusja

Hi

Here are my actions and the problematic result again:

[hduser@vm38 ~]$ git clone https://github.com/apache/mahout.git
remote: Reusing existing pack: 76099, done.
remote: Counting objects: 39, done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 76138 (delta 2), reused 0 (delta 0)
Receiving objects: 100% (76138/76138), 49.04 MiB | 275 KiB/s, done.
Resolving deltas: 100% (34449/34449), done.
[hduser@vm38 ~]$ cd mahout
[hduser@vm38 ~]$ mvn clean package -DskipTests=true -Dhadoop2.version=2.2.0
...
...
...
[INFO] Reactor Summary:
[INFO]
[INFO] Mahout Build Tools ... SUCCESS [15.529s]
[INFO] Apache Mahout ... SUCCESS [1.657s]
[INFO] Mahout Math ... SUCCESS [1:00.891s]
[INFO] Mahout Core ... SUCCESS [2:44.617s]
[INFO] Mahout Integration ... SUCCESS [38.195s]
[INFO] Mahout Examples ... SUCCESS [45.458s]
[INFO] Mahout Release Package ... SUCCESS [0.012s]
[INFO] Mahout Math/Scala wrappers ... SUCCESS [53.519s]
[INFO]
[INFO] BUILD SUCCESS
[INFO]
[INFO] Total time: 6:27.763s
[INFO] Finished at: Wed Mar 05 10:22:51 EET 2014
[INFO] Final Memory: 57M/442M
[INFO]
[hduser@vm38 mahout]$
[hduser@vm38 mahout]$ cd ../
[hduser@vm38 ~]$ /usr/lib/hadoop/bin/hadoop jar \
    mahout/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar \
    org.apache.mahout.classifier.df.mapreduce.BuildForest \
    -d input/data666.noheader.data -ds input/data666.noheader.data.info \
    -sl 5 -p -t 100 -o nsl-forest

14/03/05 10:26:39 INFO mapreduce.BuildForest: Partial Mapred implementation
14/03/05 10:26:39 INFO mapreduce.BuildForest: Building the forest...
14/03/05 10:26:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/03/05 10:26:51 INFO input.FileInputFormat: Total input paths to process : 1
14/03/05 10:26:51 INFO mapreduce.JobSubmitter: number of splits:1
14/03/05 10:26:51 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/03/05 10:26:51 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/03/05 10:26:51 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/03/05 10:26:51 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/03/05 10:26:51 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/03/05 10:26:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1393936067845_0018
14/03/05 10:26:52 INFO impl.YarnClientImpl: Submitted application application_1393936067845_0018 to ResourceManager at /0.0.0.0:8032
14/03/05 10:26:52 INFO mapreduce.Job: The url to track the job: http://vm38.dbweb.ee:8088/proxy/application_1393936067845_0018/
14/03/05 10:26:52 INFO mapreduce.Job: Running job: job_1393936067845_0018
14/03/05 10:27:05 INFO mapreduce.Job: Job job_1393936067845_0018 running in uber mode : false
14/03/05 10:27:05 INFO mapreduce.Job:  map 0% reduce 0%
14/03/05 10:27:22 INFO mapreduce.Job:  map 100% reduce 0%
14/03/05 10:27:48 INFO 

Re: Recommend items not rated by any user

2014-03-05 Thread Juan José Ramos
In case somebody runs into the same situation, the key seems to be the
CandidateItemsStrategy passed to the constructor
of GenericItemBasedRecommender. Looking at the code, if no
CandidateItemsStrategy is specified in the
constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is used and,
as the documentation says, its doGetCandidateItems method returns all
items that have not been rated by the user and that were preferred by
another user who has preferred at least one item that the current user has
preferred too.
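
A minimal sketch of that default rule, assuming plain Java collections in place
of Mahout's DataModel. The class and method names below are hypothetical and
this is not Mahout's implementation; it only illustrates why an item that
nobody has rated can never become a candidate under this rule.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the neighborhood-based candidate rule; not Mahout code. */
class NeighborhoodCandidatesSketch {

    /**
     * Candidates for userId: items the user has not rated that were rated by
     * another user who shares at least one rated item with userId.
     */
    static Set<Long> candidates(long userId, Map<Long, Set<Long>> ratedItemsByUser) {
        Set<Long> own = ratedItemsByUser.getOrDefault(userId, Collections.<Long>emptySet());
        Set<Long> result = new TreeSet<>();
        for (Map.Entry<Long, Set<Long>> entry : ratedItemsByUser.entrySet()) {
            if (entry.getKey().longValue() == userId) {
                continue; // skip the current user
            }
            if (Collections.disjoint(own, entry.getValue())) {
                continue; // this user shares no rated item with the current user
            }
            for (Long item : entry.getValue()) {
                if (!own.contains(item)) {
                    result.add(item);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Set<Long>> ratings = new HashMap<>();
        ratings.put(1L, new HashSet<>(Arrays.asList(10L, 11L)));
        ratings.put(2L, new HashSet<>(Arrays.asList(11L, 12L)));
        ratings.put(3L, new HashSet<>(Arrays.asList(13L)));
        // Item 14 is in the catalogue but unrated, so it never appears in any
        // user's set and cannot be reached by this rule.
        System.out.println(candidates(1L, ratings)); // prints [12]
    }
}
```

User 2 shares item 11 with user 1, so item 12 becomes a candidate. User 3
shares nothing, so item 13 is skipped, and the unrated item 14 is unreachable.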

So a different CandidateItemsStrategy needs to be passed. For this problem,
it seems to me that AllSimilarItemsCandidateItemsStrategy and
AllUnknownItemsCandidateItemsStrategy are good candidates. Does anybody
know where to find documentation about the different
CandidateItemsStrategy implementations? Based on the names, I would say that:
1) AllSimilarItemsCandidateItemsStrategy returns all similar items
regardless of whether they have already been rated by someone or not.
2) AllUnknownItemsCandidateItemsStrategy returns all similar items that
have not been rated by anyone yet.

Does anybody know if it works like that?
Thanks.


On Tue, Mar 4, 2014 at 9:16 AM, Juan José Ramos jjar...@gmail.com wrote:

 First of all, I know this requirement would not make sense in a CF
 Recommender. In my case, I am trying to use Mahout to create something
 closer to a Content-Based Recommender.

 In particular, I am pre-computing a similarity matrix between all the
 documents (items) of my catalogue and using that matrix as the
 ItemSimilarity for my Item-Based Recommender.

 So, when a user rates a document, how could I make the recommender output
 documents similar to the ones the user has already rated, even if no other
 user in the system has rated them yet? Is that even possible in the first
 place?

 Thanks a lot.



Re: Recommend items not rated by any user

2014-03-05 Thread Sebastian Schelter

Hi Juan,

that is a good catch. CandidateItemsStrategy is the right place to 
implement this. Maybe we should simply extend its interface with a 
parameter that says whether to keep or remove the current user's items?


We could even do this in the abstract base class then.

--sebastian



Rework our website

2014-03-05 Thread Sebastian Schelter

Hi everyone,

In our latest discussion, I argued that the lack of (and errors in) 
documentation on our website is one of the main pain points of Mahout 
atm. To be honest, I'm also not very happy with the design; the fonts 
and spacing in particular make it super hard to read long articles. This 
also prevents me from wanting to add articles and documentation.

I think we should have a beautiful website, where it is fun to add new 
stuff.

My design skills are pretty limited, but fortunately my brother is an 
art director! I asked him to make our website a bit more beautiful 
without changing too much of the structure, so that a redesign wouldn't 
take too long.


I really like the results and would volunteer to dig out my CSS skills 
and do the redesign, if people agree.


Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian


Re: Recommend items not rated by any user

2014-03-05 Thread Juan José Ramos
Thanks for the reply, Sebastian.

I am not sure that should be implemented in the abstract base class,
though, because PreferredItemsNeighborhoodCandidateItemsStrategy, for
instance, by definition returns the items not rated by the user but rated
by somebody else.

Back to my last post, I have been playing around with
AllSimilarItemsCandidateItemsStrategy
and AllUnknownItemsCandidateItemsStrategy, and although they both do what I
wanted (recommend items not previously rated by any user), I honestly can't
tell the difference between the two strategies. In my tests the output was
always the same. If the eventual output of the recommender will not include
items already rated by the user, as pointed out here (
http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E),
AllSimilarItemsCandidateItemsStrategy should be equivalent to
AllUnknownItemsCandidateItemsStrategy, shouldn't it?

Thanks.



Re: Recommend items not rated by any user

2014-03-05 Thread Sebastian Schelter

On 03/05/2014 01:23 PM, Juan José Ramos wrote:

> Thanks for the reply, Sebastian.
>
> I am not sure that should be implemented in the abstract base class,
> though, because PreferredItemsNeighborhoodCandidateItemsStrategy, for
> instance, by definition returns the items not rated by the user but rated
> by somebody else.

Good point. So we seem to need special implementations.

> Back to my last post, I have been playing around with
> AllSimilarItemsCandidateItemsStrategy
> and AllUnknownItemsCandidateItemsStrategy, and although they both do what I
> wanted (recommend items not previously rated by any user), I honestly can't
> tell the difference between the two strategies. In my tests the output was
> always the same. If the eventual output of the recommender will not include
> items already rated by the user, as pointed out here (
> http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E),
> AllSimilarItemsCandidateItemsStrategy should be equivalent to
> AllUnknownItemsCandidateItemsStrategy, shouldn't it?


AllSimilarItems returns all items that are similar to any item that the 
user already knows. AllUnknownItems simply returns all items that the 
user has not interacted with yet.


These are two different things, although they might overlap in some 
scenarios.


Best,
Sebastian





Re: Rework our website

2014-03-05 Thread Gokhan Capan
I liked both of them

Great work Lucas!

Gokhan




Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
Sorry there was a typo in the previous paragraph.

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:
 Hi Juan,

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value that is with at
 least one of the items preferred by the user.

 Tevfik



Re: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Kevin Moulart
Hi and thanks for your help!

I had been told that the version of mahout used by Cloudera (CDH 4.6) was
in fact 0.8 with a patch for mr2 support.
(
http://mail-archives.apache.org/mod_mbox/mahout-user/201402.mbox/%3CCAEccTywqSAKA_HeX4vTZ-5XPmKtj5b8zMGQUfn5qRsiq=7o=u...@mail.gmail.com%3E)

But I tried to install 0.9 on my own, by compiling it with mvn after I
changed the pom.xml:

- Added the Cloudera repository:

<repository>
  <id>cloudera-repo</id>
  <name>Cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>

- Changed the version of hadoop to use:
<hadoop.1.version>2.0.0-mr1-cdh4.6.0</hadoop.1.version>
- I tried adding this one too:
<hadoop2.version>2.0.0-cdh4.6.0</hadoop2.version>

But then I get a lot of errors when Maven begins to compile the core
package:
https://gist.github.com/kmoulart/9368193

Could you tell me what I did wrong ?


2014-03-04 19:02 GMT+01:00 Suneel Marthi suneel_mar...@yahoo.com:

 The -us option was fixed for Mahout 0.8, seems like u r using Mahout 0.7
 which had this issue (from ur stacktrace, its apparent u r using Mahout
 0.7).  Please upgrade to the latest mahout version.





 On Tuesday, March 4, 2014 8:54 AM, Kevin Moulart kevinmoul...@gmail.com
 wrote:

 Hi,

 I'm trying to apply a PCA to reduce the dimension of a matrix of 1603
 columns and 100.000 to 30.000.000 lines using ssvd with the pca option, and
 I always get a StackOverflowError :

 Here is my command line :
 mahout ssvd -i /user/myUser/Echant100k -o /user/myUser/Echant/SVD100 -k 100
 -pca true -U false -V false -t 3 -ow

 I also tried to put -us true as mentioned in
 https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=18&modificationDate=1381347063000&api=v2
 but the option is not available anymore.

 The output of the previous command is :
 MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
 Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
 and HADOOP_CONF_DIR=/etc/hadoop/conf
 MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.5.0-job.jar
 14/03/04 14:45:16 INFO common.AbstractJob: Command line arguments:
 {--abtBlockHeight=[20], --blockHeight=[1], --broadcast=[true],
 --computeU=[false], --computeV=[false], --endPhase=[2147483647],
 --input=[/user/myUser/Echant100k], --minSplitSize=[-1],
 --outerProdBlockHeight=[3], --output=[/user/myUser/Echant/SVD100],
 --oversampling=[15], --overwrite=null, --pca=[true], --powerIter=[0],
 --rank=[100], --reduceTasks=[3], --startPhase=[0], --tempDir=[temp],
 --uHalfSigma=[false], --vHalfSigma=[false]}
 Exception in thread "main" java.lang.StackOverflowError
 at org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
 at org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
 at org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
 ...

 I searched online and didn't find a solution to my problem.

 Can you help me ?

 Thanks in advance,

 --
 Kévin Moulart




-- 
Kévin Moulart
GSM France : +33 7 81 06 10 10
GSM Belgique : +32 473 85 23 85
Téléphone fixe : +32 2 771 88 45


Re: Recommend items not rated by any user

2014-03-05 Thread Juan José Ramos
Hi Tevfik,

Thanks for the response. I think what you say contradicts what Sebastian
pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy returns
all items that have not been rated by the user, what would
AllUnknownItemsCandidateItemsStrategy return?




Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
Juan,
you got me wrong.

AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and for which the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

So it does not simply return all items that have not been rated by
the user. For example, if there is an item X which has not been rated
by the user and the similarity value between X and every one of
the items rated (preferred) by the user is NaN, then X will not
be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
returned by AllUnknownItemsCandidateItemsStrategy.
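
The item X example can be sketched with plain Java collections. This is a toy
model under stated assumptions, not Mahout's code: the class name, the method
names and the similarity function are hypothetical, and NaN stands for "no
similarity value available".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.ToDoubleBiFunction;

/** Toy model of the two candidate rules discussed above; not Mahout code. */
class CandidateStrategiesSketch {

    /** All catalogue items the user has not rated. */
    static Set<Long> allUnknown(Set<Long> catalogue, Set<Long> ratedByUser) {
        Set<Long> result = new TreeSet<>(catalogue);
        result.removeAll(ratedByUser);
        return result;
    }

    /** Unrated items with a non-NaN similarity to at least one rated item. */
    static Set<Long> allSimilar(Set<Long> catalogue, Set<Long> ratedByUser,
                                ToDoubleBiFunction<Long, Long> similarity) {
        Set<Long> result = new TreeSet<>();
        for (Long candidate : allUnknown(catalogue, ratedByUser)) {
            for (Long rated : ratedByUser) {
                if (!Double.isNaN(similarity.applyAsDouble(rated, candidate))) {
                    result.add(candidate);
                    break; // one defined similarity is enough
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Long> catalogue = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> rated = new HashSet<>(Arrays.asList(1L));
        // Hypothetical similarity: only the pair (1, 2) has a defined value.
        ToDoubleBiFunction<Long, Long> sim =
                (a, b) -> (a == 1L && b == 2L) ? 0.8 : Double.NaN;

        System.out.println(allUnknown(catalogue, rated));      // prints [2, 3, 4]
        System.out.println(allSimilar(catalogue, rated, sim)); // prints [2]
    }
}
```

Items 3 and 4 play the role of X: the user has not rated them and their
similarity to the only rated item is NaN, so only the "all unknown" rule
returns them. When the similarity is defined for every pair, as it may be with
a fully pre-computed matrix, the two rules return the same set.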



On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote:
 Hi Tefik,

 Thanks for the response. I think what you says contradicts what Sebastian
 pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy returns
 all items that have not been rated by the user, what would
 AllUnknownItemsCandidateItemsStrategy return?


 On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.com wrote:

 Sorry there was a typo in the previous paragraph.

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin tevfik.ayte...@gmail.com
 wrote:
  Hi Juan,
 
  If I remember correctly, AllSimilarItemsCandidateItemsStrategy
 
  returns all items that have not been rated by the user and the
  similarity metric returns a non-NaN similarity value that is with at
  least one of the items preferred by the user.
 
  Tevfik
 
  On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org
 wrote:
  On 03/05/2014 01:23 PM, Juan José Ramos wrote:
 
  Thanks for the reply, Sebastian.
 
  I am not sure if that should be implemented in the Abstract base class
  though because for
  instance PreferredItemsNeighborhoodCandidateItemsStrategy, by
 definition,
  it returns the item not rated by the user and rated by somebody else.
 
 
  Good point. So we seem to need special implementations.
 
 
 
  Back to my last post, I have been playing around with
  AllSimilarItemsCandidateItemsStrategy
  and AllUnknownItemsCandidateItemsStrategy, and although they both do
 what
  I
  wanted (recommend items not previously rated by any user), I honestly
  can't
  tell the difference between the two strategies. In my tests the output
 was
  always the same. If the eventual output of the recommender will not
  include
  items already rated by the user as pointed out here (
 
 
 http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E
 ),
  AllSimilarItemsCandidateItemsStrategy should be equivalent to
  AllUnkownItemsCandidateItemsStrategy, shouldn't it?
 
 
  AllSimilarItems returns all items that are similar to any item that the
 user
  already knows. AllUnknownItems simply returns all items that the user
 has
  not interacted with yet.
 
  These are two different things, although they might overlap in some
  scenarios.
 
  Best,
  Sebastian
 
 
 
 
  Thanks.
 
  On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org
  wrote:
 
 
  Hi Juan,
 
  that is a good catch. CandidateItemsStrategy is the right place to
 
  implement this. Maybe we should simply extend its interface to add a
  parameter that says whether to keep or remove the current users items?
 
 
  We could even do this in the abstract base class then.
 
  --sebastian
 
 
  On 03/05/2014 10:42 AM, Juan José Ramos wrote:
 
 
  In case somebody runs into the same situation, the key seems to be in
  the
  CandidateItemStrategy being passed to the constructor
  of GenericItemBasedRecommender. Looking into the code, if no
  CandidateItemStrategy is specified in the
  constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is used
  and
  as the documentation says, the doGetCandidateItems method: returns
 all
  items that have not been rated by the user and that were preferred by
  another user that has preferred at least one item that the current
 user
 
  has
 
  preferred too.
 
  So, a different CandidateItemStrategy needs to be passed. For this
 
  problem,
 
  it seems to me that AllSimilarItemsCandidateItemsStrategy,
  AllUnknownItemsCandidateItemsStrategy are good candidates. Does
 anybody
  know where to find some documentation about the different
  CandidateItemStrategy? Based on the name I would say that:
  1) AllSimilarItemsCandidateItemsStrategy returns all similar items
  regardless of whether they have been already rated by someone or not.
  2) AllUnknownItemsCandidateItemsStrategy returns all similar items
 that
  have not been rated by anyone yet.
 
  Does anybody know if it works like that?
  Thanks.
 
 
  On Tue, Mar 4, 2014 at 9:16 AM, Juan José Ramos jjar...@gmail.com
 
  wrote:
 
 
  

Re: Rework our website

2014-03-05 Thread Ted Dunning
Both are nice.

I think you are right that the second is calmer.


On Wed, Mar 5, 2014 at 4:11 AM, Sebastian Schelter s...@apache.org wrote:

 Hi everyone,

 In our latest discussion, I argued that the lack (and errors) of
 documentation on our website is one of the main pain points of Mahout atm.
 To be honest, I'm also not very happy with the design, especially fonts and
 spacing make it super hard to read long articles. This also prevents me
 from wanting to add articles and documentation.

 I think we should have a beautiful website, where it is fun to add new
 stuff.

 My design skills are pretty limited, but fortunately my brother is an art
 director! I asked him to make our website a bit more beautiful without
 changing too much of the structure, so that a redesign wouldn't take too
 long.

 I really like the results and would volunteer to dig out my CSS skills and
 do the redesign, if people agree.

 Here are his drafts, I like the second one best:

 https://people.apache.org/~ssc/mahout/mahout.jpg
 https://people.apache.org/~ssc/mahout/mahout2.jpg

 Let me know what you think!

 Best,
 Sebastian



Re: Recommend items not rated by any user

2014-03-05 Thread Pat Ferrel
I am ignoring the rest of the thread because I suspect it may have gotten off 
track.

Your data is news articles, right? You would like to recommend from known 
articles to any user based on an article they rate or even view. You have no 
collaborative filtering data because the lifetime of a news article is short 
and so there is not enough usage data to create a CF type recommender. Is this 
a correct problem statement? If so I don’t believe you should be using a CF 
recommender from Mahout’s collection.

However you can use the Mahout text analysis pipeline to find all articles that 
are similar to each other. In this case when a user views any article in the 
training data you can show the most similar items precalculated with 
RowSimilarityJob and the rest of the text prep jobs. The pipeline is outlined 
here: 
https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
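
Serving the precalculated similarities then amounts to a lookup and a sort. A
minimal sketch in plain Java, assuming the job output has already been loaded
into an in-memory map of rows (the class MostSimilarSketch, its method, and the
map layout are hypothetical, not part of Mahout):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical lookup over precomputed item-item similarities (not a Mahout API). */
final class MostSimilarSketch {

    /** Returns up to n item IDs most similar to the viewed item, best first. */
    static List<Long> mostSimilar(Map<Long, Map<Long, Double>> rows, long viewed, int n) {
        // An article that is not in the training data simply has no row.
        Map<Long, Double> row = rows.getOrDefault(viewed, Map.of());
        List<Map.Entry<Long, Double>> entries = new ArrayList<>(row.entrySet());
        // Sort by similarity, highest first.
        entries.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        List<Long> out = new ArrayList<>();
        for (Map.Entry<Long, Double> e : entries) {
            if (out.size() == n) {
                break;
            }
            out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> rows = new HashMap<>();
        rows.put(10L, Map.of(11L, 0.9, 12L, 0.2, 13L, 0.5));
        System.out.println(mostSimilar(rows, 10L, 2)); // [11, 13]
        System.out.println(mostSimilar(rows, 99L, 2)); // [] because article 99 was never indexed
    }
}
```

The empty result for an unseen article is the limitation that the Solr
approach is meant to address.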

But this will only work for news articles already in the training data. Another 
approach is to not use Mahout at all. Simply index all docs as they come in 
with Solr. Then when a user rates or even views an article, even if it has not 
been indexed yet, you can use the viewed article as the query on the indexed 
articles and Solr will return articles ranked by similarity. This is a 
content-based recommender based solely on Solr.

Does this describe your situation?


On Mar 4, 2014, at 1:16 AM, Juan José Ramos jjar...@gmail.com wrote:

First thing is that I know this requirement would not make sense in a CF
Recommender. In my case, I am trying to use Mahout to create something
closer to a Content-Based Recommender.

In particular, I am pre-computing a similarity matrix between all the
documents (items) of my catalogue and using that matrix as the
ItemSimilarity for my Item-Based Recommender.

So, when a user rates a document, how could I make the recommender outputs
similar documents to that ones the user has already rated even if no other
user in the system has rated them yet? Is that even possible in the first
place?

Thanks a lot.



Re: Rework our website

2014-03-05 Thread Pat Ferrel
What no centered text??

;-)

Love either.

BTW users are no longer able to contribute content to the wiki. Most CMSs have 
a way to allow input that is moderated. Might this make getting documentation 
help easier? Allow anyone to contribute but committers can filter out the 
bad—sort of like submitting patches.

On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote:

Hi everyone,

In our latest discussion, I argued that the lack (and errors) of documentation 
on our website is one of the main pain points of Mahout atm. To be honest, I'm 
also not very happy with the design, especially fonts and spacing make it super 
hard to read long articles. This also prevents me from wanting to add articles 
and documentation.

I think we should have a beautiful website, where it is fun to add new stuff.

My design skills are pretty limited, but fortunately my brother is an art 
director! I asked him to make our website a bit more beautiful without changing 
too much of the structure, so that a redesign wouldn't take too long.

I really like the results and would volunteer to dig out my CSS skills and do 
the redesign, if people agree.

Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian



Re: Rework our website

2014-03-05 Thread Andrew Musselman
On Wed, Mar 5, 2014 at 7:47 AM, Pat Ferrel p...@occamsmachete.com wrote:

 What no centered text??

 ;-)

 Love either.

 BTW users are no longer able to contribute content to the wiki. Most CMSs
 have a way to allow input that is moderated. Might this make getting
 documentation help easier? Allow anyone to contribute but committers can
 filter out the bad--sort of like submitting patches.


Yes, that's a good idea.  They both look good, thanks Sebastian and Lucas.


Re: Rework our website

2014-03-05 Thread Scott C. Cote
I had recently taken the text tour of mahout, but I couldn't decipher a
way to contribute updates to the tour (some of the file names have
changed, etc).

How would I start?   (this was part of my offer to help with the
documentation of Mahout).

Scott

On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote:

What no centered text??

;-)

Love either.

BTW users are no longer able to contribute content to the wiki. Most CMSs
have a way to allow input that is moderated. Might this make getting
documentation help easier? Allow anyone to contribute but committers can
filter out the bad--sort of like submitting patches.

On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote:

Hi everyone,

In our latest discussion, I argued that the lack (and errors) of
documentation on our website is one of the main pain points of Mahout
atm. To be honest, I'm also not very happy with the design, especially
fonts and spacing make it super hard to read long articles. This also
prevents me from wanting to add articles and documentation.

I think we should have a beautiful website, where it is fun to add new
stuff.

My design skills are pretty limited, but fortunately my brother is an art
director! I asked him to make our website a bit more beautiful without
changing too much of the structure, so that a redesign wouldn't take too
long.

I really like the results and would volunteer to dig out my CSS skills
and do the redesign, if people agree.

Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian





Re: Recommend items not rated by any user

2014-03-05 Thread Juan José Ramos
@Pat. You described my situation very well. The only additional thing is
that I am also interested in creating some sort of profile of the user
from all the information s/he has provided by interacting with the articles,
and not only in recommending similar items (news) based on a specific input.
That is why I thought using the output of RowSimilarityJob as the
ItemSimilarity of an ItemBasedRecommender would behave as I want, since I use
the Mahout dataModel to create that profile.


On Wed, Mar 5, 2014 at 3:40 PM, Pat Ferrel p...@occamsmachete.com wrote:

 I am ignoring the rest of the thread because I suspect it may have gotten
 off track.

 Your data is new articles, right? You would like to recommend from known
 articles to any user based on an article they rate or even view. You have
 no collaborative filtering data because the lifetime of a news article is
 short and so there is not enough usage data to create a CF type
 recommender. Is this a correct problem statement? If so I don't believe you
 should be using a CF recommender from Mahout's collection.

 However you can use the Mahout text analysis pipeline to find all articles
 that are similar to each other. In this case when a user views any article
 in the training data you can show the most similar items precalculated with
 RowSimilarityJob and the rest of the text prep jobs. The pipeline is
 outlined here:
 https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

 But this will only work for news articles already in the training data.
 Another approach it to not use Mahout at all. Simply index all docs as they
 come in with Solr. Then when a user rates or even views an article, even if
 it has not been indexed yet, you can use the viewed article as the query on
 the indexed articles and Solr will return articles ranked by similarity.
 This is a content based recommender based solely on Solr.

 Does this describe your situation?


 On Mar 4, 2014, at 1:16 AM, Juan José Ramos jjar...@gmail.com wrote:

 First thing is that I know this requirement would not make sense in a CF
 Recommender. In my case, I am trying to use Mahout to create something
 closer to a Content-Based Recommender.

 In particular, I am pre-computing a similarity matrix between all the
 documents (items) of my catalogue and using that matrix as the
 ItemSimilarity for my Item-Based Recommender.

 So, when a user rates a document, how could I make the recommender outputs
 similar documents to that ones the user has already rated even if no other
 user in the system has rated them yet? Is that even possible in the first
 place?

 Thanks a lot.




Re: Recommend items not rated by any user

2014-03-05 Thread Juan José Ramos
@Tevfik, running this recommender:

GenericItemBasedRecommender itemRecommender =
    new GenericItemBasedRecommender(dataModel, itemSimilarity,
        new AllSimilarItemsCandidateItemsStrategy(itemSimilarity),
        new AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


With this dataModel:
1,1,1.0
1,2,2.0
1,3,1.0
1,4,2.0
2,1,1.0
2,2,4.0


And these similarities
1,2,0.1
1,3,0.2
1,4,0.3
2,3,0.5
3,4,0.5
5,1,0.2
5,2,1.0

It returns item 5 for user 1. So item 5 has not been preferred by user 1, and
the similarities between item 5 and two of the items user 1 preferred are not
NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So,
I'm truly sorry to insist on this, but I still really do not get the
difference.
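
The numbers above can be replayed by hand. A minimal sketch in plain Java,
assuming the common item-based estimate (a similarity-weighted average of the
user's own ratings); that formula and the class WeightedEstimateSketch are
assumptions made for illustration, not code taken from Mahout:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical replay of the example data; the weighting formula is an assumption. */
final class WeightedEstimateSketch {

    /** Similarity-weighted average of the user's ratings; NaN if nothing is comparable. */
    static double estimate(Map<Long, Float> userRatings, Map<Long, Double> simsToCandidate) {
        double weighted = 0.0;
        double total = 0.0;
        for (Map.Entry<Long, Float> rating : userRatings.entrySet()) {
            Double sim = simsToCandidate.get(rating.getKey());
            if (sim == null || sim.isNaN()) {
                continue; // no usable similarity to this rated item
            }
            weighted += sim * rating.getValue();
            total += sim;
        }
        return total == 0.0 ? Double.NaN : weighted / total;
    }

    public static void main(String[] args) {
        // User 1's rows from the dataModel: itemID -> rating.
        Map<Long, Float> user1 = new LinkedHashMap<>();
        user1.put(1L, 1.0f);
        user1.put(2L, 2.0f);
        user1.put(3L, 1.0f);
        user1.put(4L, 2.0f);
        // Similarities that involve item 5: other itemID -> similarity.
        Map<Long, Double> simsToItem5 = Map.of(1L, 0.2, 2L, 1.0);

        // Item 5 is unrated by user 1 and comparable to items 1 and 2.
        System.out.println("item 5 unrated by user 1: " + !user1.containsKey(5L));
        // (0.2 * 1.0 + 1.0 * 2.0) / (0.2 + 1.0), roughly 1.83
        System.out.println(String.format(Locale.ROOT, "estimate: %.2f", estimate(user1, simsToItem5)));
    }
}
```

Under that assumption item 5 gets a finite estimate because two of its
similarities to user 1's items are defined, which is consistent with it being
returned.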


On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:

 Juan,
 You got me wrong,

 AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 So, it does not simply return all items that have not been rated by
 the user. For example, if there is an item X which has not been rated
 by the user and if the similarity value between X and at least one of
 the items rated (preferred) by the user is not NaN, then X will be not
 be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
 returned by AllUnknownItemsCandidateItemsStrategy.



 On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote:
  Hi Tefik,
 
  Thanks for the response. I think what you says contradicts what Sebastian
  pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy
 returns
  all items that have not been rated by the user, what would
  AllUnknownItemsCandidateItemsStrategy return?
 
 
  On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin tevfik.ayte...@gmail.com
 wrote:
 
  Sorry there was a typo in the previous paragraph.
 
  If I remember correctly, AllSimilarItemsCandidateItemsStrategy
 
  returns all items that have not been rated by the user and the
  similarity metric returns a non-NaN similarity value with at
  least one of the items preferred by the user.
 
  On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.com
  wrote:
   Hi Juan,
  
   If I remember correctly, AllSimilarItemsCandidateItemsStrategy
  
   returns all items that have not been rated by the user and the
   similarity metric returns a non-NaN similarity value that is with at
   least one of the items preferred by the user.
  
   Tevfik
  
   On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org
  wrote:
   On 03/05/2014 01:23 PM, Juan José Ramos wrote:
  
   Thanks for the reply, Sebastian.
  
   I am not sure if that should be implemented in the Abstract base
 class
   though because for
   instance PreferredItemsNeighborhoodCandidateItemsStrategy, by
  definition,
   it returns the item not rated by the user and rated by somebody
 else.
  
  
   Good point. So we seem to need special implementations.
  
  
  
   Back to my last post, I have been playing around with
   AllSimilarItemsCandidateItemsStrategy
   and AllUnknownItemsCandidateItemsStrategy, and although they both do
  what
   I
   wanted (recommend items not previously rated by any user), I
 honestly
   can't
   tell the difference between the two strategies. In my tests the
 output
  was
   always the same. If the eventual output of the recommender will not
   include
   items already rated by the user as pointed out here (
  
  
 
 http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E
  ),
   AllSimilarItemsCandidateItemsStrategy should be equivalent to
   AllUnkownItemsCandidateItemsStrategy, shouldn't it?
  
  
   AllSimilarItems returns all items that are similar to any item that
 the
  user
   already knows. AllUnknownItems simply returns all items that the user
  has
   not interacted with yet.
  
   These are two different things, although they might overlap in some
   scenarios.
  
   Best,
   Sebastian
  
  
  
  
   Thanks.
  
   On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org
 
   wrote:
  
  
   Hi Juan,
  
   that is a good catch. CandidateItemsStrategy is the right place to
  
   implement this. Maybe we should simply extend its interface to add a
   parameter that says whether to keep or remove the current users
 items?
  
  
   We could even do this in the abstract base class then.
  
   --sebastian
  
  
   On 03/05/2014 10:42 AM, Juan José Ramos wrote:
  
  
   In case somebody runs into the same situation, the key seems to
 be in
   the
   CandidateItemStrategy being passed to the constructor
   of GenericItemBasedRecommender. Looking into the code, if no
   CandidateItemStrategy is specified in the
   constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is
 used
   and
   as the documentation says, the 

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Suneel Marthi
Not sure if the CDH4 patches on top of 0.7 have fixes for M-1067 and M-1098, 
which address the issues you are seeing.



The second part of the issue you are seeing with the Mahout 0.9 distro seems to 
be related to how you set it up on CDH4. I apologize for not being helpful here 
as I am not a CDH4 user or expert.

Sean?




On Wednesday, March 5, 2014 10:23 AM, Kevin Moulart kevinmoul...@gmail.com 
wrote:
 
Previous mail sent only to Suneel : (my bad sorry)

According to my stacktrace it seems that I am running mahout 0.7 indeed.
 That's the version provided by Cloudera when I install mahout using yum.
 But according to Sean Owen, it really is a 0.8 inside...
 Anyway I tried with the compiled version and it didn't work :
 Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
 and HADOOP_CONF_DIR=
 Exception in thread main java.lang.NoSuchMethodError:
 org.apache.hadoop.util.ProgramDriver.driver([Ljava/lang/String;)V
  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:122)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

MAHOUT-JOB:
 /home/cacf/Downloads/mahout-distribution-0.9/mahout-examples-0.9-job.jar


And now I changed the conf directory of mahout 0.9 to be linked to the one
used by the existing working mahout and the trace changes :

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB:
/home/myCompany/Downloads/mahout-distribution-0.9/mahout-examples-0.9-job.jar
14/03/05 16:16:23 WARN driver.MahoutDriver: Unable to add class:
org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver
java.lang.ClassNotFoundException:
org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:118)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
14/03/05 16:16:23 WARN driver.MahoutDriver: Unable to add class:
org.apache.mahout.clustering.spectral.eigencuts.EigencutsDriver
java.lang.ClassNotFoundException:
org.apache.mahout.clustering.spectral.eigencuts.EigencutsDriver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:118)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
14/03/05 16:16:23 WARN driver.MahoutDriver: Unable to add class:
org.apache.mahout.clustering.minhash.MinHashDriver
java.lang.ClassNotFoundException:
org.apache.mahout.clustering.minhash.MinHashDriver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:118)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Andrew Musselman
I'm not sure about this either but I think these are all the changes to
Mahout in CDH 4.6.0:
http://archive.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.6.0.CHANGES.txt

MAHOUT-1291

MAHOUT-1033

MAHOUT-1142



On Wed, Mar 5, 2014 at 8:30 AM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 Not sure if the CDH4 patches on top of 0.7 has fixes for M-1067 and M-1098
 which address the issues u r seeing.



 The second part of the issue u r seeing with Mahout 0.9 distro seems to be
 related to how u set it up on CDH4. I apologize for not being helpful here
 as I am not a CDH4 user or expert.

 Sean?




 On Wednesday, March 5, 2014 10:23 AM, Kevin Moulart 
 kevinmoul...@gmail.com wrote:

 Previous mail sent only to Suneel : (my bad sorry)

 According to my stacktrace it seems that I am running mahout 0.7 indeed.
  That's the version provided by Cloudera when I install mahout using yum.
  But according to Sean Owen, it really is a 0.8 inside...
  Anyway I tried with the compiled version and it didn't work :
  Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
  and HADOOP_CONF_DIR=
  Exception in thread main java.lang.NoSuchMethodError:
  org.apache.hadoop.util.ProgramDriver.driver([Ljava/lang/String;)V
   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:122)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

 MAHOUT-JOB:
  /home/cacf/Downloads/mahout-distribution-0.9/mahout-examples-0.9-job.jar
 

 And now I changed the conf directory of mahout 0.9 to be linked to the one
 used by the existing working mahout and the trace changes :

 MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
 Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
 and HADOOP_CONF_DIR=/etc/hadoop/conf
 MAHOUT-JOB:

 /home/myCompany/Downloads/mahout-distribution-0.9/mahout-examples-0.9-job.jar
 14/03/05 16:16:23 WARN driver.MahoutDriver: Unable to add class:
 org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver
 java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:190)
 at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:118)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
 14/03/05 16:16:23 WARN driver.MahoutDriver: Unable to add class:
 org.apache.mahout.clustering.spectral.eigencuts.EigencutsDriver
 java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.spectral.eigencuts.EigencutsDriver
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:190)
 at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:118)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
 14/03/05 16:16:23 WARN driver.MahoutDriver: Unable to add class:
 org.apache.mahout.clustering.minhash.MinHashDriver
 java.lang.ClassNotFoundException:
 org.apache.mahout.clustering.minhash.MinHashDriver
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at 

Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
If the similarities between item 5 and two of the items user 1 preferred are
not NaN, then it will return item 5; that is what I'm saying. If the
similarities were all NaN, then it would not return it.

But surely, you might wonder: if all similarities between an item and the
user's items are NaN, then AllUnknownItemsCandidateItemsStrategy probably will
not return it either.

So both strategies seem to be effectively the same. I don't know what
the implementers had in mind when designing
AllSimilarItemsCandidateItemsStrategy.

On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:
 @Tevfik, running this recommender:

 GenericItemBasedRecommender itemRecommender = new
 GenericItemBasedRecommender(dataModel, itemSimilarity, new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


 With this dataModel:
 1,1,1.0
 1,2,2.0
 1,3,1.0
 1,4,2.0
 2,1,1.0
 2,2,4.0


 And these similarities
 1,2,0.1
 1,3,0.2
 1,4,0.3
 2,3,0.5
 3,4,0.5
 5,1,0.2
 5,2,1.0

 Returns item 5 for User 1. So item 5 has not been preferred by user 1, and
 the similarity between item 5 and two of the items user 1 preferred are not
 NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So,
 I'm truly sorry to insist on this, but I still really do not get the
 difference.


 On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.com wrote:

 Juan,
 You got me wrong,

 AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 So, it does not simply return all items that have not been rated by
 the user. For example, if there is an item X which has not been rated
 by the user and if the similarity value between X and at least one of
 the items rated (preferred) by the user is not NaN, then X will be not
 be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
 returned by AllUnknownItemsCandidateItemsStrategy.



 On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote:
  Hi Tefik,
 
  Thanks for the response. I think what you says contradicts what Sebastian
  pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy
 returns
  all items that have not been rated by the user, what would
  AllUnknownItemsCandidateItemsStrategy return?
 
 
  On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin tevfik.ayte...@gmail.com
 wrote:
 
  Sorry there was a typo in the previous paragraph.
 
  If I remember correctly, AllSimilarItemsCandidateItemsStrategy
 
  returns all items that have not been rated by the user and the
  similarity metric returns a non-NaN similarity value with at
  least one of the items preferred by the user.
 
  On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.com
  wrote:
   Hi Juan,
  
   If I remember correctly, AllSimilarItemsCandidateItemsStrategy
  
   returns all items that have not been rated by the user and the
   similarity metric returns a non-NaN similarity value that is with at
   least one of the items preferred by the user.
  
   Tevfik
  
   On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org
  wrote:
   On 03/05/2014 01:23 PM, Juan José Ramos wrote:
  
   Thanks for the reply, Sebastian.
  
   I am not sure if that should be implemented in the Abstract base
 class
   though because for
   instance PreferredItemsNeighborhoodCandidateItemsStrategy, by
  definition,
   it returns the item not rated by the user and rated by somebody
 else.
  
  
   Good point. So we seem to need special implementations.
  
  
  
   Back to my last post, I have been playing around with
   AllSimilarItemsCandidateItemsStrategy
   and AllUnknownItemsCandidateItemsStrategy, and although they both do
  what
   I
   wanted (recommend items not previously rated by any user), I
 honestly
   can't
   tell the difference between the two strategies. In my tests the
 output
  was
   always the same. If the eventual output of the recommender will not
   include
   items already rated by the user as pointed out here (
  
  
 
 http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E
  ),
   AllSimilarItemsCandidateItemsStrategy should be equivalent to
   AllUnkownItemsCandidateItemsStrategy, shouldn't it?
  
  
   AllSimilarItems returns all items that are similar to any item that
 the
  user
   already knows. AllUnknownItems simply returns all items that the user
  has
   not interacted with yet.
  
   These are two different things, although they might overlap in some
   scenarios.
  
   Best,
   Sebastian
  
  
  
  
   Thanks.
  
   On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org
 
   wrote:
  
  
   Hi Juan,
  
   that is a good catch. CandidateItemsStrategy is the right place to
  
   implement this. Maybe we should simply extend its 

Re: Recommend items not rated by any user

2014-03-05 Thread Sebastian Schelter

 So both strategies seems to be effectively the same, I don't know what
 the implementers had in mind when designing
 AllSimilarItemsCandidateItemsStrategy.

It can take a long time to estimate preferences for all items a user 
doesn't know, especially if you have a lot of items. Traditional 
item-based recommenders will not recommend any item that is not similar 
to at least one of the items the user interacted with, so 
AllSimilarItemsStrategy already selects the maximum set of items that 
could be potentially recommended to the user.
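
To make the distinction concrete, here is a toy sketch in plain Java collections. It only models the two definitions given in this thread (AllUnknownItems: items in the data model the user has not rated; AllSimilarItems: items with a known similarity to something the user rated). The class and method names are made up for illustration and this is not Mahout's implementation, so real behaviour may differ in details. The data is Juan's example, where item 5 appears only in the similarity file.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of the two candidate strategies; hypothetical names, not Mahout classes. */
final class CandidateSketch {

    /** Items present in the data model that the user has not rated. */
    static Set<Long> allUnknown(Set<Long> modelItems, Set<Long> userItems) {
        Set<Long> candidates = new TreeSet<>(modelItems);
        candidates.removeAll(userItems);
        return candidates;
    }

    /** Items with a known similarity to at least one item the user rated. */
    static Set<Long> allSimilar(Map<Long, Set<Long>> similarTo, Set<Long> userItems) {
        Set<Long> candidates = new TreeSet<>();
        for (Long item : userItems) {
            candidates.addAll(similarTo.getOrDefault(item, Collections.<Long>emptySet()));
        }
        candidates.removeAll(userItems);
        return candidates;
    }

    /** Builds a symmetric lookup from (itemA, itemB) pairs that have a similarity value. */
    static Map<Long, Set<Long>> symmetric(long[][] pairs) {
        Map<Long, Set<Long>> lookup = new TreeMap<>();
        for (long[] p : pairs) {
            lookup.computeIfAbsent(p[0], k -> new TreeSet<>()).add(p[1]);
            lookup.computeIfAbsent(p[1], k -> new TreeSet<>()).add(p[0]);
        }
        return lookup;
    }

    public static void main(String[] args) {
        // Items 1-4 appear in the preference data; item 5 only has similarities.
        Set<Long> modelItems = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> user1 = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> user2 = new TreeSet<>(Arrays.asList(1L, 2L));
        Map<Long, Set<Long>> sims = symmetric(new long[][] {
            {1, 2}, {1, 3}, {1, 4}, {2, 3}, {3, 4}, {5, 1}, {5, 2}});

        System.out.println(allUnknown(modelItems, user1)); // []
        System.out.println(allSimilar(sims, user1));       // [5]
        System.out.println(allUnknown(modelItems, user2)); // [3, 4]
        System.out.println(allSimilar(sims, user2));       // [3, 4, 5]
    }
}
```

In this toy model the two sets differ exactly where an item has similarities but no ratings (item 5), or has ratings but no similarity to anything the user knows.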


--sebastian



On 03/05/2014 05:38 PM, Tevfik Aytekin wrote:

If the similarity between item 5 and two of the items user 1 preferred are not
NaN then it will return 1, that is what I'm saying. If the
similarities were all NaN then
it will not return it.

But surely, you might wonder if all similarities between an item and
user's items are NaN, then
AllUnknownItemsCandidateItemsStrategy probably will not return it.




On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:

@Tevfik, running this recommender:

GenericItemBasedRecommender itemRecommender = new
GenericItemBasedRecommender(dataModel, itemSimilarity, new
AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


With this dataModel:
1,1,1.0
1,2,2.0
1,3,1.0
1,4,2.0
2,1,1.0
2,2,4.0


And these similarities
1,2,0.1
1,3,0.2
1,4,0.3
2,3,0.5
3,4,0.5
5,1,0.2
5,2,1.0

Returns item 5 for User 1. So item 5 has not been preferred by user 1, and
the similarity between item 5 and two of the items user 1 preferred are not
NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So,
I'm truly sorry to insist on this, but I still really do not get the
difference.


On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:


Juan,
You got me wrong,

AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

So, it does not simply return all items that have not been rated by
the user. For example, if there is an item X which has not been rated
by the user and if the similarity value between X and at least one of
the items rated (preferred) by the user is not NaN, then X will be not
be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
returned by AllUnknownItemsCandidateItemsStrategy.



On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote:

Hi Tefik,

Thanks for the response. I think what you says contradicts what Sebastian
pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy

returns

all items that have not been rated by the user, what would
AllUnknownItemsCandidateItemsStrategy return?


On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin tevfik.ayte...@gmail.com
wrote:


Sorry there was a typo in the previous paragraph.

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 

tevfik.ayte...@gmail.com

wrote:

Hi Juan,

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value that is with at
least one of the items preferred by the user.

Tevfik

On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org

wrote:

On 03/05/2014 01:23 PM, Juan José Ramos wrote:


Thanks for the reply, Sebastian.

I am not sure if that should be implemented in the Abstract base class
though because for instance
PreferredItemsNeighborhoodCandidateItemsStrategy, by definition, it
returns the item not rated by the user and rated by somebody else.


Good point. So we seem to need special implementations.


Back to my last post, I have been playing around with
AllSimilarItemsCandidateItemsStrategy
and AllUnknownItemsCandidateItemsStrategy, and although they both do what
I wanted (recommend items not previously rated by any user), I honestly
can't tell the difference between the two strategies. In my tests the
output was always the same. If the eventual output of the recommender
will not include items already rated by the user as pointed out here (
http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E
),
AllSimilarItemsCandidateItemsStrategy should be equivalent to
AllUnknownItemsCandidateItemsStrategy, shouldn't it?


AllSimilarItems returns all items that are similar to any item that the
user already knows. AllUnknownItems simply returns all items that the
user has not interacted with yet.

These are two different things, although they might overlap in some
scenarios.

Best,

Re: Rework our website

2014-03-05 Thread Suneel Marthi
+1 for Option# 2.





On Wednesday, March 5, 2014 7:11 AM, Sebastian Schelter s...@apache.org wrote:
 
Hi everyone,

In our latest discussion, I argued that the lack (and errors) of 
documentation on our website is one of the main pain points of Mahout 
atm. To be honest, I'm also not very happy with the design, especially 
fonts and spacing make it super hard to read long articles. This also 
prevents me from wanting to add articles and documentation.

I think we should have a beautiful website, where it is fun to add new 
stuff.

My design skills are pretty limited, but fortunately my brother is an 
art director! I asked him to make our website a bit more beautiful 
without changing too much of the structure, so that a redesign wouldn't 
take too long.

I really like the results and would volunteer to dig out my CSS skills 
and do the redesign, if people agree.

Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian

Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
Hi Sebastian,
But in order not to select items that are not similar to at least one
of the items the user interacted with, you have to compute the
similarity with all user items (which is the main task for estimating
the preference of an item in an item-based method). So, it seems to me
that AllSimilarItemsStrategy does not bring much advantage over
AllUnknownItemsCandidateItemsStrategy.
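
To illustrate the point, here is a minimal sketch of a textbook item-based estimate: a similarity-weighted average of the user's ratings. This is an assumption about the general technique, not a copy of what GenericItemBasedRecommender does internally, and the class and method names are hypothetical. The similarity lookups against every item the user rated are the same ones needed to decide whether an item is a candidate at all. The numbers are Juan's (user 1, item 5).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Textbook item-based estimate; hypothetical helper, not Mahout's implementation. */
final class EstimateSketch {

    /**
     * Similarity-weighted average of the user's ratings for one target item.
     * Returns NaN when no rated item has a usable similarity to the target.
     */
    static double estimate(Map<Long, Double> userPrefs, Map<Long, Double> simsToTarget) {
        double weighted = 0.0;
        double totalSim = 0.0;
        for (Map.Entry<Long, Double> pref : userPrefs.entrySet()) {
            Double sim = simsToTarget.get(pref.getKey());
            if (sim == null || sim.isNaN()) {
                continue; // no similarity known between this rated item and the target
            }
            weighted += sim * pref.getValue();
            totalSim += sim;
        }
        return totalSim == 0.0 ? Double.NaN : weighted / totalSim;
    }

    public static void main(String[] args) {
        Map<Long, Double> user1 = new LinkedHashMap<>();
        user1.put(1L, 1.0);
        user1.put(2L, 2.0);
        user1.put(3L, 1.0);
        user1.put(4L, 2.0);

        Map<Long, Double> simsToItem5 = new LinkedHashMap<>();
        simsToItem5.put(1L, 0.2);
        simsToItem5.put(2L, 1.0);

        // (0.2 * 1.0 + 1.0 * 2.0) / (0.2 + 1.0), roughly 1.83
        System.out.println(String.format(Locale.ROOT, "%.2f", estimate(user1, simsToItem5)));

        // An item with no known similarity to anything the user rated yields NaN.
        System.out.println(estimate(user1, new LinkedHashMap<Long, Double>()));
    }
}
```

With on-the-fly similarities, filtering candidates first saves little in this sketch, since the filter and the estimate walk the same similarities.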

On Wed, Mar 5, 2014 at 6:46 PM, Sebastian Schelter s...@apache.org wrote:
 So both strategies seems to be effectively the same, I don't know what
 the implementers had in mind when designing
 AllSimilarItemsCandidateItemsStrategy.

 It can take a long time to estimate preferences for all items a user doesn't
 know. Especially if you have a lot of items. Traditional item-based
 recommenders will not recommend any item that is not similar to at least one
 of the items the user interacted with, so AllSimilarItemsStrategy already
 selects the maximum set of items that could be potentially recommended to
 the user.

 --sebastian




 On 03/05/2014 05:38 PM, Tevfik Aytekin wrote:

 If the similarity between item 5 and two of the items user 1 preferred are
 not
 NaN then it will return 1, that is what I'm saying. If the
 similarities were all NaN then
 it will not return it.

 But surely, you might wonder if all similarities between an item and
 user's items are NaN, then
 AllUnknownItemsCandidateItemsStrategy probably will not return it.


 On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:

 @Tevfik, running this recommender:

 GenericItemBasedRecommender itemRecommender = new
 GenericItemBasedRecommender(dataModel, itemSimilarity, new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


 With this dataModel:
 1,1,1.0
 1,2,2.0
 1,3,1.0
 1,4,2.0
 2,1,1.0
 2,2,4.0


 And these similarities
 1,2,0.1
 1,3,0.2
 1,4,0.3
 2,3,0.5
 3,4,0.5
 5,1,0.2
 5,2,1.0

 Returns item 5 for User 1. So item 5 has not been preferred by user 1,
 and
 the similarity between item 5 and two of the items user 1 preferred are
 not
 NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item.
 So,
 I'm truly sorry to insist on this, but I still really do not get the
 difference.


 On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin
 tevfik.ayte...@gmail.com wrote:

 Juan,
 You got me wrong,

 AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 So, it does not simply return all items that have not been rated by
 the user. For example, if there is an item X which has not been rated
 by the user and if the similarity value between X and at least one of
 the items rated (preferred) by the user is not NaN, then X will be not
 be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
 returned by AllUnknownItemsCandidateItemsStrategy.



 On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com
 wrote:

 Hi Tefik,

 Thanks for the response. I think what you says contradicts what
 Sebastian
 pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy

 returns

 all items that have not been rated by the user, what would
 AllUnknownItemsCandidateItemsStrategy return?


 On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin
 tevfik.ayte...@gmail.com
 wrote:

 Sorry there was a typo in the previous paragraph.

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 

 tevfik.ayte...@gmail.com

 wrote:

 Hi Juan,

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value that is with at
 least one of the items preferred by the user.

 Tevfik

 On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org

 wrote:

 On 03/05/2014 01:23 PM, Juan José Ramos wrote:


 Thanks for the reply, Sebastian.

 I am not sure if that should be implemented in the Abstract base

 class

 though because for
 instance PreferredItemsNeighborhoodCandidateItemsStrategy, by

 definition,

 it returns the item not rated by the user and rated by somebody

 else.



 Good point. So we seem to need special implementations.



 Back to my last post, I have been playing around with
 AllSimilarItemsCandidateItemsStrategy
 and AllUnknownItemsCandidateItemsStrategy, and although they both
 do

 what

 I
 wanted (recommend items not previously rated by any user), I

 honestly

 can't
 tell the difference between the two strategies. In my tests the

 output

 was

 always the same. If the eventual output of the recommender will not
 include
 items already rated by the user as 

Re: Recommend items not rated by any user

2014-03-05 Thread Pat Ferrel
I agree. IMHO using the Mahout recommenders is wrong for this. The recommenders 
are the CF/cooccurrence type that expect usage or rating data on fairly long 
lived items from a somewhat static catalog. Trying to make them work for 
content based recommendations is needlessly difficult, especially since other 
tools are custom made for this, like RowSimilarityJob and Solr. Each finds 
content-based similarity with no rating or CF data needed.

Profile creation is another subject and still does not use a Mahout 
recommender. You can keep the text of articles the user has rated, read, 
whatever. These will form the basis of your user profile. For each of them (if 
there are not too many) you could use them as the query to Solr returning 
similar docs for each in the profile. You could also lump them all together and 
use this as the query. You can also experiment with various ways to process 
profile data. If there are enough articles in the profile you might categorize 
them with clustering, then use the centroid of each cluster as a Solr query. 

The same thing can be done in batch mode with Mahout’s RowSimilarityJob. Take 
the user's cluster centroids as synthetic items, add them to the item DRM of 
news articles you get out of the text pipeline and run RSJ on that. For each 
synthetic item (cluster centroid) you’ll get a list of articles that are most 
similar. 
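
As a rough sketch of the centroid idea (plain Java, neither Solr nor RowSimilarityJob): average the term vectors of the profile articles and rank candidate articles by cosine similarity to that centroid. The three-term vectors and all names below are made up for illustration; a real pipeline would use sparse TF-IDF vectors.

```java
import java.util.Arrays;
import java.util.List;

/** Toy content-based matching: profile centroid vs. candidate articles (hypothetical names). */
final class ProfileSketch {

    /** Component-wise mean of equally sized term vectors. */
    static double[] centroid(List<double[]> vectors) {
        double[] mean = new double[vectors.get(0).length];
        for (double[] v : vectors) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += v[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= vectors.size();
        }
        return mean;
    }

    /** Cosine similarity; defined as 0 when either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Index of the candidate most similar to the query vector. */
    static int mostSimilar(double[] query, List<double[]> candidates) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < candidates.size(); i++) {
            double score = cosine(query, candidates.get(i));
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two articles the user has read, as counts over three made-up terms.
        List<double[]> profile = Arrays.asList(
            new double[] {1, 0, 1},
            new double[] {1, 0, 0});
        // Incoming articles to rank against the profile centroid.
        List<double[]> incoming = Arrays.asList(
            new double[] {0, 1, 0},
            new double[] {2, 0, 1},
            new double[] {0, 1, 1});

        double[] query = centroid(profile); // {1.0, 0.0, 0.5}
        System.out.println("most similar article index: " + mostSimilar(query, incoming)); // 1
    }
}
```

With several clusters per user you would repeat the ranking once per centroid, which is what adding the centroids as synthetic rows to the item matrix achieves in batch.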

Not sure clustering the user profile is the best idea though since it would 
require quite a few articles for the user in question. If you have some method 
of labeling your articles (categories, tags, or the like) you can build 
classifiers for each. Then see what categories your user reads from the most by 
classifying the articles in their profile based on the labeled training data. 
As new articles come in and are classified you can funnel them to the right 
users. 

You can do this with clustering too but generally clustering is not as good as 
classifying since it is unsupervised learning. However clustering all news will 
probably give better results than clustering the user’s profile articles. So 
you would cluster your news corpus, which will include the articles your user 
has read, then recommend other articles that the user’s profile articles were 
clustered with (from the same cluster). This is only slightly different than 
using the profile articles as Solr queries but may produce better results. 
However the Solr queries will work even if the query (profile news article) is 
not in the index and will return results in realtime, requiring no batch RSJ.

BTW I did just this as an experiment. I used my own browsing history as the 
profile, clustered the pages I read, then took the top terms from the centroids 
and did Google searches with them. Since the sources are so varied in Google I 
had to create a custom search engine to include only specific sites. It worked 
pretty well for discovering related pages.

On Mar 5, 2014, at 8:46 AM, Sebastian Schelter s...@apache.org wrote:

 So both strategies seems to be effectively the same, I don't know what
 the implementers had in mind when designing
 AllSimilarItemsCandidateItemsStrategy.

It can take a long time to estimate preferences for all items a user doesn't 
know. Especially if you have a lot of items. Traditional item-based 
recommenders will not recommend any item that is not similar to at least one of 
the items the user interacted with, so AllSimilarItemsStrategy already selects 
the maximum set of items that could be potentially recommended to the user.

--sebastian



On 03/05/2014 05:38 PM, Tevfik Aytekin wrote:
 If the similarity between item 5 and two of the items user 1 preferred are not
 NaN then it will return 1, that is what I'm saying. If the
 similarities were all NaN then
 it will not return it.
 
 But surely, you might wonder if all similarities between an item and
 user's items are NaN, then
 AllUnknownItemsCandidateItemsStrategy probably will not return it.
 

 On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:
 @Tevfik, running this recommender:
 
 GenericItemBasedRecommender itemRecommender = new
 GenericItemBasedRecommender(dataModel, itemSimilarity, new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity));
 
 
 With this dataModel:
 1,1,1.0
 1,2,2.0
 1,3,1.0
 1,4,2.0
 2,1,1.0
 2,2,4.0
 
 
 And these similarities
 1,2,0.1
 1,3,0.2
 1,4,0.3
 2,3,0.5
 3,4,0.5
 5,1,0.2
 5,2,1.0
 
 Returns item 5 for User 1. So item 5 has not been preferred by user 1, and
 the similarity between item 5 and two of the items user 1 preferred are not
 NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So,
 I'm truly sorry to insist on this, but I still really do not get the
 difference.
 
 
 On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.com wrote:
 
 Juan,
 You got me wrong,
 
 AllSimilarItemsCandidateItemsStrategy
 
 returns all items that have not been 

Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
It can even make things worse in SVD-based algorithms for which
preference estimation is very fast.

On Wed, Mar 5, 2014 at 7:00 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:
 Hi Sebastian,
 But in order not to select items that is not similar to at least one
 of the items the user interacted with you have to compute the
 similarity with all user items (which is the main task for estimating
 the preference of an item in item-based method). So, it seems to me
 that AllSimilarItemsStrategy does not bring much advantage over
 AllUnknownItemsCandidateItemsStrategy.

 On Wed, Mar 5, 2014 at 6:46 PM, Sebastian Schelter s...@apache.org wrote:
 So both strategies seems to be effectively the same, I don't know what
 the implementers had in mind when designing
 AllSimilarItemsCandidateItemsStrategy.

 It can take a long time to estimate preferences for all items a user doesn't
 know. Especially if you have a lot of items. Traditional item-based
 recommenders will not recommend any item that is not similar to at least one
 of the items the user interacted with, so AllSimilarItemsStrategy already
 selects the maximum set of items that could be potentially recommended to
 the user.

 --sebastian




 On 03/05/2014 05:38 PM, Tevfik Aytekin wrote:

 If the similarity between item 5 and two of the items user 1 preferred are
 not
 NaN then it will return 1, that is what I'm saying. If the
 similarities were all NaN then
 it will not return it.

 But surely, you might wonder if all similarities between an item and
 user's items are NaN, then
 AllUnknownItemsCandidateItemsStrategy probably will not return it.


 On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:

 @Tevfik, running this recommender:

 GenericItemBasedRecommender itemRecommender = new
 GenericItemBasedRecommender(dataModel, itemSimilarity, new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


 With this dataModel:
 1,1,1.0
 1,2,2.0
 1,3,1.0
 1,4,2.0
 2,1,1.0
 2,2,4.0


 And these similarities
 1,2,0.1
 1,3,0.2
 1,4,0.3
 2,3,0.5
 3,4,0.5
 5,1,0.2
 5,2,1.0

 Returns item 5 for User 1. So item 5 has not been preferred by user 1,
 and
 the similarity between item 5 and two of the items user 1 preferred are
 not
 NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item.
 So,
 I'm truly sorry to insist on this, but I still really do not get the
 difference.


 On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin
 tevfik.ayte...@gmail.com wrote:

 Juan,
 You got me wrong,

 AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 So, it does not simply return all items that have not been rated by
 the user. For example, if there is an item X which has not been rated
 by the user and if the similarity value between X and at least one of
 the items rated (preferred) by the user is not NaN, then X will be not
 be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
 returned by AllUnknownItemsCandidateItemsStrategy.



 On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com
 wrote:

 Hi Tefik,

 Thanks for the response. I think what you says contradicts what
 Sebastian
 pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy

 returns

 all items that have not been rated by the user, what would
 AllUnknownItemsCandidateItemsStrategy return?


 On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin
 tevfik.ayte...@gmail.com
 wrote:

 Sorry there was a typo in the previous paragraph.

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 

 tevfik.ayte...@gmail.com

 wrote:

 Hi Juan,

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value that is with at
 least one of the items preferred by the user.

 Tevfik

 On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org

 wrote:

 On 03/05/2014 01:23 PM, Juan José Ramos wrote:


 Thanks for the reply, Sebastian.

 I am not sure if that should be implemented in the Abstract base

 class

 though because for
 instance PreferredItemsNeighborhoodCandidateItemsStrategy, by

 definition,

 it returns the item not rated by the user and rated by somebody

 else.



 Good point. So we seem to need special implementations.



 Back to my last post, I have been playing around with
 AllSimilarItemsCandidateItemsStrategy
 and AllUnknownItemsCandidateItemsStrategy, and although they both
 do

 what

 I
 wanted (recommend items not previously rated by any user), I

 honestly

 can't
 tell the 

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
CDH 4.5 and 4.6 are both 0.7 + patches. Neither contains 0.8, since it
has (tiny) breaking changes vs 0.7 and this is a minor version update.
CDH5 contains 0.8 + patches. I did not say CDH4 has 0.8 -- re-read the
message of mine that was quoted.

http://archive.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.5.0.CHANGES.txt
http://archive.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.6.0.CHANGES.txt

Those two patches are not in CDH 4.x, no.

The non-upstream changes are basically all internal packaging stuff,
and that can include modifying dependency versions to harmonize with
the rest of the platform. That's the sense in which it works with
Hadoop 2.

I don't think the change you cite is sufficient to work with Hadoop 2.
You also, for example, must build against the Hadoop 2 profile in
Mahout in Maven. For that you do not need the CDH repo even, just
point to the Hadoop 2.x release if you like.

I know there has been a patch in even just the past few weeks that
makes it work even better with 2.x. So I suppose I would build from
HEAD if possible to take advantage.

On Wed, Mar 5, 2014 at 4:30 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
 Not sure if the CDH4 patches on top of 0.7 has fixes for M-1067 and M-1098 
 which address the issues u r seeing.



 The second part of the issue u r seeing with Mahout 0.9 distro seems to be 
 related to how u set it up on CDH4. I apologize for not being helpful here as 
 I am not a CDH4 user or expert.

 Sean?



Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Dmitriy Lyubimov
Yeah, it would seem CDH releases of Mahout produce some sort of cut-down
version of such. I suggest switching to the official release tarball (or
writing to Cloudera support about it).


On Wed, Mar 5, 2014 at 8:38 AM, Andrew Musselman andrew.mussel...@gmail.com
 wrote:

 I'm not sure about this either but I think these are all the changes to
 Mahout in CDH 4.6.0:
 http://archive.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.6.0.CHANGES.txt

 MAHOUT-1291

 MAHOUT-1033

 MAHOUT-1142



 On Wed, Mar 5, 2014 at 8:30 AM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  Not sure if the CDH4 patches on top of 0.7 has fixes for M-1067 and
 M-1098
  which address the issues u r seeing.
 
 
 
  The second part of the issue u r seeing with Mahout 0.9 distro seems to
 be
  related to how u set it up on CDH4. I apologize for not being helpful
 here
  as I am not a CDH4 user or expert.
 
  Sean?
 
 
 
 
  On Wednesday, March 5, 2014 10:23 AM, Kevin Moulart 
  kevinmoul...@gmail.com wrote:
 
  Previous mail sent only to Suneel : (my bad sorry)
 
  According to my stacktrace it seems that I am running mahout 0.7 indeed.
   That's the version provided by Cloudera when I install mahout using
 yum.
   But according to Sean Owen, it really is a 0.8 inside...
   Anyway I tried with the compiled version and it didn't work :
   Running on hadoop, using
 /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
   and HADOOP_CONF_DIR=
   Exception in thread main java.lang.NoSuchMethodError:
   org.apache.hadoop.util.ProgramDriver.driver([Ljava/lang/String;)V
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:122)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
 
  MAHOUT-JOB:
  
 /home/cacf/Downloads/mahout-distribution-0.9/mahout-examples-0.9-job.jar
  
 
  And now I changed the conf directory of mahout 0.9 to be linked to the
 one
  used by the existing working mahout and the trace changes :
 
  MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
  Running on hadoop, using /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop
  and HADOOP_CONF_DIR=/etc/hadoop/conf
  MAHOUT-JOB:
 
 
 /home/myCompany/Downloads/mahout-distribution-0.9/mahout-examples-0.9-job.jar
  14/03/05 16:16:23 WARN driver.MahoutDriver: Unable to add class:
  org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver
  java.lang.ClassNotFoundException:
  org.apache.mahout.clustering.meanshift.MeanShiftCanopyDriver
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:190)
  at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:118)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
  14/03/05 16:16:23 WARN driver.MahoutDriver: Unable to add class:
  org.apache.mahout.clustering.spectral.eigencuts.EigencutsDriver
  java.lang.ClassNotFoundException:
  org.apache.mahout.clustering.spectral.eigencuts.EigencutsDriver
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:190)
  at org.apache.mahout.driver.MahoutDriver.addClass(MahoutDriver.java:237)
  at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:118)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
  14/03/05 16:16:23 WARN driver.MahoutDriver: Unable to add class:
  

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
I don't follow what here makes you say they are cut down releases?
They are release plus patches not release minus patches.

The question is not about how to use 0.7, but how to use 1.0-SNAPSHOT.
Why would switching to the official 0.7 release help?

I think the answer is you build Mahout for Hadoop 2, right? This has
always been the case. Mahout has always been Hadoop 1, with 2 support
on the side.

On Wed, Mar 5, 2014 at 5:04 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 Yeah. it would seem CDH releases of Mahout produce some sort of cut-down
 version of such. I suggest to switch to official release tarbal (or write
 to Cloudera support about it).



Re: Recommend items not rated by any user

2014-03-05 Thread Sebastian Schelter
For SVD-based algorithms, you should use the AllUnknownItems 
strategy then, that's correct.


In the majority of industry use cases that I have seen, people use 
pre-computed item similarities (Mahout has lots of machinery for doing 
this, btw), so AllSimilarItems totally makes sense there.


--sebastian

On 03/05/2014 06:01 PM, Tevfik Aytekin wrote:

It can even make things worse in SVD-based algorithms for which
preference estimation is very fast.

On Wed, Mar 5, 2014 at 7:00 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:

Hi Sebastian,
But in order not to select items that is not similar to at least one
of the items the user interacted with you have to compute the
similarity with all user items (which is the main task for estimating
the preference of an item in item-based method). So, it seems to me
that AllSimilarItemsStrategy does not bring much advantage over
AllUnknownItemsCandidateItemsStrategy.

On Wed, Mar 5, 2014 at 6:46 PM, Sebastian Schelter s...@apache.org wrote:

So both strategies seems to be effectively the same, I don't know what
the implementers had in mind when designing
AllSimilarItemsCandidateItemsStrategy.


It can take a long time to estimate preferences for all items a user doesn't
know. Especially if you have a lot of items. Traditional item-based
recommenders will not recommend any item that is not similar to at least one
of the items the user interacted with, so AllSimilarItemsStrategy already
selects the maximum set of items that could be potentially recommended to
the user.
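
The point above, that an item-based recommender cannot score an item with no usable similarity to anything the user rated, can be sketched in plain Java without Mahout on the classpath. The class and method names here are hypothetical, and the similarity-weighted average is a generic textbook estimator; Mahout's own estimator may differ in detail. The ratings and similarities are the ones Juan posted in this thread for user 1 and item 5.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical standalone sketch, not Mahout API.
class ItemBasedEstimateSketch {

    // Similarity-weighted average of the user's ratings.
    // Missing or NaN similarities are skipped; with no usable similarity the result is NaN.
    static double estimate(Map<Long, Double> userRatings, Map<Long, Double> simsToCandidate) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (Map.Entry<Long, Double> rating : userRatings.entrySet()) {
            Double sim = simsToCandidate.get(rating.getKey());
            if (sim == null || sim.isNaN()) {
                continue;
            }
            weightedSum += sim * rating.getValue();
            totalWeight += Math.abs(sim);
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    // User 1's ratings from the data model quoted in this thread.
    static Map<Long, Double> user1Ratings() {
        Map<Long, Double> ratings = new LinkedHashMap<>();
        ratings.put(1L, 1.0);
        ratings.put(2L, 2.0);
        ratings.put(3L, 1.0);
        ratings.put(4L, 2.0);
        return ratings;
    }

    public static void main(String[] args) {
        // Similarities of item 5 to items 1 and 2, as quoted in this thread.
        Map<Long, Double> simsToItem5 = new LinkedHashMap<>();
        simsToItem5.put(1L, 0.2);
        simsToItem5.put(2L, 1.0);
        System.out.println(estimate(user1Ratings(), simsToItem5));

        // An item with no similarity to anything the user rated cannot be scored.
        System.out.println(estimate(user1Ratings(), new LinkedHashMap<>()));
    }
}
```

An item whose estimate is NaN can never be ranked, so restricting the candidate set to items with at least one usable similarity loses nothing for this kind of recommender and avoids estimating preferences for every unknown item.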

--sebastian




On 03/05/2014 05:38 PM, Tevfik Aytekin wrote:


If the similarity between item 5 and two of the items user 1 preferred are
not NaN then it will return 1, that is what I'm saying. If the
similarities were all NaN then it will not return it.

But surely, you might wonder if all similarities between an item and
user's items are NaN, then AllUnknownItemsCandidateItemsStrategy probably
will not return it.




On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:


@Tevfik, running this recommender:

GenericItemBasedRecommender itemRecommender = new
GenericItemBasedRecommender(dataModel, itemSimilarity, new
AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


With this dataModel:
1,1,1.0
1,2,2.0
1,3,1.0
1,4,2.0
2,1,1.0
2,2,4.0


And these similarities
1,2,0.1
1,3,0.2
1,4,0.3
2,3,0.5
3,4,0.5
5,1,0.2
5,2,1.0

Returns item 5 for User 1. So item 5 has not been preferred by user 1, and
the similarity between item 5 and two of the items user 1 preferred are not
NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So,
I'm truly sorry to insist on this, but I still really do not get the
difference.
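
The example above can be replayed in a small standalone sketch that follows the two definitions given in this thread: "all unknown" means every item the user has not rated, and "all similar" means unrated items with a non-NaN similarity to at least one item the user preferred. Everything here is an assumption made for illustration: the class and method names are hypothetical (this is not Mahout code), a missing similarity pair is treated as NaN, and item 6 is an invented item with no similarity entries, added only to make the two rules diverge.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical standalone sketch, not Mahout API.
class CandidateStrategySketch {

    // Symmetric lookup; a pair that is not in the table counts as NaN (assumption).
    static double similarity(Map<String, Double> sims, long a, long b) {
        Double s = sims.get(Math.min(a, b) + "," + Math.max(a, b));
        return s == null ? Double.NaN : s;
    }

    // "All unknown": every item in the universe that the user has not rated.
    static Set<Long> allUnknown(Set<Long> universe, Set<Long> preferred) {
        Set<Long> out = new TreeSet<>(universe);
        out.removeAll(preferred);
        return out;
    }

    // "All similar": unrated items with a non-NaN similarity to at least one preferred item.
    static Set<Long> allSimilar(Set<Long> universe, Set<Long> preferred, Map<String, Double> sims) {
        Set<Long> out = new TreeSet<>();
        for (long candidate : allUnknown(universe, preferred)) {
            for (long rated : preferred) {
                if (!Double.isNaN(similarity(sims, candidate, rated))) {
                    out.add(candidate);
                    break;
                }
            }
        }
        return out;
    }

    // The similarity table quoted in this thread, keyed as "smallerId,largerId".
    static Map<String, Double> exampleSimilarities() {
        Map<String, Double> sims = new HashMap<>();
        sims.put("1,2", 0.1);
        sims.put("1,3", 0.2);
        sims.put("1,4", 0.3);
        sims.put("2,3", 0.5);
        sims.put("3,4", 0.5);
        sims.put("1,5", 0.2);
        sims.put("2,5", 1.0);
        return sims;
    }

    public static void main(String[] args) {
        Set<Long> preferredByUser1 = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L));
        // Item 6 is hypothetical: it has no entry in the similarity table.
        Set<Long> universe = new TreeSet<>(Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L));
        Map<String, Double> sims = exampleSimilarities();

        System.out.println("all unknown: " + allUnknown(universe, preferredByUser1));
        System.out.println("all similar: " + allSimilar(universe, preferredByUser1, sims));
    }
}
```

Under these assumptions item 5 is a candidate under both rules, which matches what Juan observed, while the invented item 6 is a candidate only under the "all unknown" rule. The two rules differ only for items whose similarities to all of the user's items are NaN.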


On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:


Juan,
You got me wrong,

AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

So, it does not simply return all items that have not been rated by
the user. For example, if there is an item X which has not been rated
by the user and if the similarity value between X and at least one of
the items rated (preferred) by the user is not NaN, then X will not
be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
returned by AllUnknownItemsCandidateItemsStrategy.



On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote:


Hi Tevfik,

Thanks for the response. I think what you say contradicts what Sebastian
pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy returns
all items that have not been rated by the user, what would
AllUnknownItemsCandidateItemsStrategy return?


On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:


Sorry there was a typo in the previous paragraph.

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:


Hi Juan,

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and the
similarity metric returns a non-NaN similarity value that is with at
least one of the items preferred by the user.

Tevfik

On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org wrote:


On 03/05/2014 01:23 PM, Juan José Ramos wrote:



Thanks for the reply, Sebastian.

I am not sure if that should be implemented in the Abstract base class
though, because for instance
PreferredItemsNeighborhoodCandidateItemsStrategy, by definition, returns
the items not rated by the user and rated by somebody else.




Good point. So we seem to need special 

Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Suneel Marthi
I apologize, Sean; I wasn't aware of the complete history in this thread. I
didn't know Hadoop 2.x was involved here. If so, yes, you need to build
Mahout from HEAD with the Hadoop 2 profile to get it working.







On Wednesday, March 5, 2014 12:04 PM, Sean Owen sro...@gmail.com wrote:
 
CDH 4.5 and 4.6 are both 0.7 + patches. Neither contains 0.8, since it
has (tiny) breaking changes vs 0.7 and this is a minor version update.
CDH5 contains 0.8 + patches. I did not say CDH4 has 0.8 -- re-read the
message of mine that was quoted.

http://archive.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.5.0.CHANGES.txt
http://archive.cloudera.com/cdh4/cdh/4/mahout-0.7-cdh4.6.0.CHANGES.txt

Those two patches are not in CDH 4.x, no.

The non-upstream changes are basically all internal packaging stuff,
and that can include modifying dependency versions to harmonize with
the rest of the platform. That's the sense in which it works with
Hadoop 2.

I don't think the change you cite is sufficient to work with Hadoop 2.
You also, for example, must build against the Hadoop 2 profile in
Mahout in Maven. For that you do not need the CDH repo even, just
point to the Hadoop 2.x release if you like.

I know there has been a patch in even just the past few weeks that
makes it work even better with 2.x. So I suppose I would build from
HEAD if possible to take advantage.


On Wed, Mar 5, 2014 at 4:30 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:
 Not sure if the CDH4 patches on top of 0.7 has fixes for M-1067 and M-1098 
 which address the issues u r seeing.



 The second part of the issue u r seeing with Mahout 0.9 distro seems to be 
 related to how u set it up on CDH4. I apologize for not being helpful here as 
 I am not a CDH4 user or expert.

 Sean?


Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Dmitriy Lyubimov
On Wed, Mar 5, 2014 at 9:08 AM, Sean Owen sro...@gmail.com wrote:

 I don't follow what here makes you say they are cut down releases?


meaning it seems to be pretty much 2 releases behind the official. But i
definitely don't follow CDH developments in this department, you seem in a
better position to explain the existing patchlevel there so I defer to you
to explain why this patchlevel is not there.

-d


Re: Rework our website

2014-03-05 Thread Frank Scholten
+1 for design 2


On Wed, Mar 5, 2014 at 6:00 PM, Suneel Marthi suneel_mar...@yahoo.com wrote:

 +1 for Option# 2.





 On Wednesday, March 5, 2014 7:11 AM, Sebastian Schelter s...@apache.org
 wrote:

 Hi everyone,

 In our latest discussion, I argued that the lack (and errors) of
 documentation on our website is one of the main pain points of Mahout
 atm. To be honest, I'm also not very happy with the design, especially
 fonts and spacing make it super hard to read long articles. This also
 prevents me from wanting to add articles and documentation.

 I think we should have a beautiful website, where it is fun to add new
 stuff.

 My design skills are pretty limited, but fortunately my brother is an
 art director! I asked him to make our website a bit more beautiful
 without changing too much of the structure, so that a redesign wouldn't
 take too long.

 I really like the results and would volunteer to dig out my CSS skills
 and do the redesign, if people agree.

 Here are his drafts, I like the second one best:

 https://people.apache.org/~ssc/mahout/mahout.jpg
 https://people.apache.org/~ssc/mahout/mahout2.jpg

 Let me know what you think!

 Best,
 Sebastian



Re: Rework our website

2014-03-05 Thread Matthew Parent
I also prefer design 2


On Wed, Mar 5, 2014 at 11:08 AM, Frank Scholten fr...@frankscholten.nl wrote:

 +1 for design 2


 On Wed, Mar 5, 2014 at 6:00 PM, Suneel Marthi suneel_mar...@yahoo.com
 wrote:

  +1 for Option# 2.
 
 
 
 
 
  On Wednesday, March 5, 2014 7:11 AM, Sebastian Schelter s...@apache.org
  wrote:
 
  Hi everyone,
 
  In our latest discussion, I argued that the lack (and errors) of
  documentation on our website is one of the main pain points of Mahout
  atm. To be honest, I'm also not very happy with the design, especially
  fonts and spacing make it super hard to read long articles. This also
  prevents me from wanting to add articles and documentation.
 
  I think we should have a beautiful website, where it is fun to add new
  stuff.
 
  My design skills are pretty limited, but fortunately my brother is an
  art director! I asked him to make our website a bit more beautiful
  without changing too much of the structure, so that a redesign wouldn't
  take too long.
 
  I really like the results and would volunteer to dig out my CSS skills
  and do the redesign, if people agree.
 
  Here are his drafts, I like the second one best:
 
 
  https://people.apache.org/~ssc/mahout/mahout.jpg
  https://people.apache.org/~ssc/mahout/mahout2.jpg
 
  Let me know what you think!
 
  Best,
  Sebastian
 



Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
I don't understand this -- CDH always bundles the latest release.

You know that CDH4 was released in July 2012, right? So it included
0.7 + patches. CDH5 includes 0.8 because 0.9 was released about a
month after it began beta 2.

CDH follows semantic versioning and won't introduce changes that are
not backwards-compatible in a minor version update. 0.x releases of
Mahout act like major version changes -- not backwards compatible. So
4.x will always be 0.7 and 5.x will always be 0.8.

On Wed, Mar 5, 2014 at 5:34 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
 On Wed, Mar 5, 2014 at 9:08 AM, Sean Owen sro...@gmail.com wrote:

 I don't follow what here makes you say they are cut down releases?


 meaning it seems to be pretty much 2 releases behind the official. But i
 definitely don't follow CDH developments in this department, you seem in a
 better position to explain the existing patchlevel there so I defer to you
 to explain why this patchlevel is not there.

 -d


Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Sean Owen
You can always install whatever version of anything on your cluster
that you want. It may or may not work, but often happens to, at least
for whatever you need it to do.

It's just the same as it is without a packaged distribution -- dump
new tarballs and cross your fingers. Nothing is weird or different
about the setup or layout. That is already the "here be dragons"
solution.

You go with support from a packaged distribution when you want a "here
be no dragons" solution. Everything else is by definition already
something you can and should do yourself outside of a packaged
distribution. And really -- you freely can, and it's not hard, if you
know what you are doing.

On Wed, Mar 5, 2014 at 9:15 PM, Andrew Musselman
andrew.mussel...@gmail.com wrote:
 Feels like just yesterday :)

 Consider this a feature request to have more flexible component versioning,
 even with a caveat/here be dragons warning.  I know that complicates
 things but people do use your releases a long time.  I personally wished I
 could upgrade Pig on CDH 4 for new features but there was no simple way on
 a managed cluster.


 On Wed, Mar 5, 2014 at 12:12 PM, Sean Owen sro...@gmail.com wrote:

 I don't understand this -- CDH always bundles the latest release.

 You know that CDH4 was released in July 2012, right? So it included
 0.7 + patches. CDH5 includes 0.8 because 0.9 was released about a
 month after it began beta 2.

 CDH follows semantic versioning and won't introduce changes that are
 not backwards-compatible in a minor version update. 0.x releases of
 Mahout act like major version changes -- not backwards compatible. So
 4.x will always be 0.7 and 5.x will always be 0.8.

 On Wed, Mar 5, 2014 at 5:34 PM, Dmitriy Lyubimov dlie...@gmail.com
 wrote:
  On Wed, Mar 5, 2014 at 9:08 AM, Sean Owen sro...@gmail.com wrote:
 
  I don't follow what here makes you say they are cut down releases?
 
 
  meaning it seems to be pretty much 2 releases behind the official. But i
  definitely don't follow CDH developments in this department, you seem in a
  better position to explain the existing patchlevel there so I defer to you
  to explain why this patchlevel is not there.
 
  -d



Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Andrew Musselman
Yeah, for sure; balancing clients' risk aversion to technical features is
why we often recommend vendor solutions.

Having a little button to choose a newer version of a component in the
Manager UI (even with a confirmation dialog that said "Are you sure? Are
you crazy?") would be more palatable to some teams than installing
tarballs, is what I'm getting at.


On Wed, Mar 5, 2014 at 1:30 PM, Sean Owen sro...@gmail.com wrote:

 You can always install whatever version of anything on your cluster
 that you want. It may or may not work, but often happens to, at least
 for whatever you need it to do.

 It's just the same as it is without a packaged distribution -- dump
 new tarballs and cross your fingers. Nothing is weird or different
 about the setup or layout. That is already the "here be dragons"
 solution.

 You go with support from a packaged distribution when you want a here
 be no dragons solution. Everything else is by definition already
 something you can and should do yourself outside of a packaged
 distribution. And really -- you freely can, and it's not hard, if you
 know what you are doing.

 On Wed, Mar 5, 2014 at 9:15 PM, Andrew Musselman
 andrew.mussel...@gmail.com wrote:
  Feels like just yesterday :)
 
  Consider this a feature request to have more flexible component versioning,
  even with a caveat/here be dragons warning.  I know that complicates
  things but people do use your releases a long time.  I personally wished I
  could upgrade Pig on CDH 4 for new features but there was no simple way on
  a managed cluster.
 
 
  On Wed, Mar 5, 2014 at 12:12 PM, Sean Owen sro...@gmail.com wrote:
 
  I don't understand this -- CDH always bundles the latest release.
 
  You know that CDH4 was released in July 2012, right? So it included
  0.7 + patches. CDH5 includes 0.8 because 0.9 was released about a
  month after it began beta 2.
 
  CDH follows semantic versioning and won't introduce changes that are
  not backwards-compatible in a minor version update. 0.x releases of
  Mahout act like major version changes -- not backwards compatible. So
  4.x will always be 0.7 and 5.x will always be 0.8.
 
  On Wed, Mar 5, 2014 at 5:34 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
   On Wed, Mar 5, 2014 at 9:08 AM, Sean Owen sro...@gmail.com wrote:
  
   I don't follow what here makes you say they are cut down releases?
  
  
   meaning it seems to be pretty much 2 releases behind the official. But i
   definitely don't follow CDH developments in this department, you seem in a
   better position to explain the existing patchlevel there so I defer to you
   to explain why this patchlevel is not there.
  
   -d
 



Re: Fwd: PCA with ssvd leads to StackOverFlowError

2014-03-05 Thread Andrew Musselman
I mean balance the risk aversion against the value of new features duh.


On Wed, Mar 5, 2014 at 1:39 PM, Andrew Musselman andrew.mussel...@gmail.com
 wrote:

 Yeah, for sure; balancing clients' risk aversion to technical features is
 why we often recommend vendor solutions.

 Having a little button to choose a newer version of a component in the
 Manager UI (even with a confirmation dialog that said Are you sure? Are
 you crazy?) would be more palatable to some teams than installing
 tarballs, is what I'm getting at.


 On Wed, Mar 5, 2014 at 1:30 PM, Sean Owen sro...@gmail.com wrote:

 You can always install whatever version of anything on your cluster
 that you want. It may or may not work, but often happens to, at least
 for whatever you need it to do.

 It's just the same as it is without a packaged distribution -- dump
 new tarballs and cross your fingers. Nothing is weird or different
 about the setup or layout. That is already the "here be dragons"
 solution.

 You go with support from a packaged distribution when you want a here
 be no dragons solution. Everything else is by definition already
 something you can and should do yourself outside of a packaged
 distribution. And really -- you freely can, and it's not hard, if you
 know what you are doing.

 On Wed, Mar 5, 2014 at 9:15 PM, Andrew Musselman
 andrew.mussel...@gmail.com wrote:
  Feels like just yesterday :)
 
  Consider this a feature request to have more flexible component versioning,
  even with a caveat/here be dragons warning.  I know that complicates
  things but people do use your releases a long time.  I personally wished I
  could upgrade Pig on CDH 4 for new features but there was no simple way on
  a managed cluster.
 
 
  On Wed, Mar 5, 2014 at 12:12 PM, Sean Owen sro...@gmail.com wrote:
 
  I don't understand this -- CDH always bundles the latest release.
 
  You know that CDH4 was released in July 2012, right? So it included
  0.7 + patches. CDH5 includes 0.8 because 0.9 was released about a
  month after it began beta 2.
 
  CDH follows semantic versioning and won't introduce changes that are
  not backwards-compatible in a minor version update. 0.x releases of
  Mahout act like major version changes -- not backwards compatible. So
  4.x will always be 0.7 and 5.x will always be 0.8.
 
  On Wed, Mar 5, 2014 at 5:34 PM, Dmitriy Lyubimov dlie...@gmail.com
  wrote:
   On Wed, Mar 5, 2014 at 9:08 AM, Sean Owen sro...@gmail.com wrote:
  
   I don't follow what here makes you say they are cut down releases?
  
  
   meaning it seems to be pretty much 2 releases behind the official. But i
   definitely don't follow CDH developments in this department, you seem in a
   better position to explain the existing patchlevel there so I defer to you
   to explain why this patchlevel is not there.
  
   -d
 





Re: Rework our website

2014-03-05 Thread Sebastian Schelter
At the moment, only committers can change the website unfortunately. If 
you have a text to add, I'm happy to work it in and add your name to our 
contributors list in the CHANGELOG.


Best,
Sebastian


On 03/05/2014 04:58 PM, Scott C. Cote wrote:

I had recently taken the text tour of mahout, but I couldn't decipher a
way to contribute updates to the tour (some of the file names have
changed, etc).

How would I start?   (this was part of my offer to help with the
documentation of Mahout).

Scott

On 3/5/14 9:47 AM, Pat Ferrel p...@occamsmachete.com wrote:


What no centered text??

;-)

Love either.

BTW users are no longer able to contribute content to the wiki. Most CMSs
have a way to allow input that is moderated. Might this make getting
documentation help easier? Allow anyone to contribute but committers can
filter out the bad, sort of like submitting patches.

On Mar 5, 2014, at 4:11 AM, Sebastian Schelter s...@apache.org wrote:

Hi everyone,

In our latest discussion, I argued that the lack (and errors) of
documentation on our website is one of the main pain points of Mahout
atm. To be honest, I'm also not very happy with the design, especially
fonts and spacing make it super hard to read long articles. This also
prevents me from wanting to add articles and documentation.

I think we should have a beautiful website, where it is fun to add new
stuff.

My design skills are pretty limited, but fortunately my brother is an art
director! I asked him to make our website a bit more beautiful without
changing too much of the structure, so that a redesign wouldn't take too
long.

I really like the results and would volunteer to dig out my CSS skills
and do the redesign, if people agree.

Here are his drafts, I like the second one best:

https://people.apache.org/~ssc/mahout/mahout.jpg
https://people.apache.org/~ssc/mahout/mahout2.jpg

Let me know what you think!

Best,
Sebastian