Re: Running a single test

2012-09-13 Thread Dhruv
Your command is correct, and it should run a single test. I just tried
running a new test I wrote from the $MAHOUT_HOME/core/ directory.

From where are you firing this command?

On Wed, Sep 12, 2012 at 12:09 PM, Nick Kolegraff wrote:

> Does this work for anyone?
> mvn -Dtest=YourTest install
>
> Doing a bit of practice and trying to understand mahout a bit better
> internally so I wanted to try and implement something basic to get a feel,
> yet, I'm running into some issues.
> /* Disclaimer I'm new to java/maven */
> /* 'mvn install' will find and run the test fine but that got annoying
> pretty quick */
>
> I have the following files:
>
> $MAHOUT_HOME/core/src/main/java/org/apache/mahout/cf/taste/impl/common/HarmonicRunningAverage.java
>
> $MAHOUT_HOME/core/src/test/java/org/apache/mahout/cf/taste/impl/common/HarmonicRunningAverageTest.java
>
> $MAHOUT_HOME/core/target/test-classes/org/apache/mahout/cf/taste/impl/common/HarmonicRunningAverageTest.class
>
> $MAHOUT_HOME/core/target/classes/org/apache/mahout/cf/taste/impl/common/HarmonicRunningAverage.class
>
> However, this:
> mvn -Dtest=HarmonicRunningAverageTest install
> doesn't seem to work for ANY single test I run and I get this error:
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-surefire-plugin:2.11:test (default-test) on
> project mahout-buildtools: No tests were executed!  (Set
> -DfailIfNoTests=false to ignore this error.) -> [Help 1]
>
> Looks like a surefire issue, so I found this:
> http://jira.codehaus.org/browse/SUREFIRE-827
> and changed my version of surefire to 2.12.{1,2} (which should contain the
> fix per the jira ticket) yet the same problem persists.
>
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-surefire-plugin:2.12.1:test (default-test)
> on project mahout-buildtools: No tests were executed!  (Set
> -DfailIfNoTests=false to ignore this error.) -> [Help 1]
>
> Is this consistent with everyone else or am I missing something trivial?
> How is everyone else running single tests?
>
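For readers following the thread, here is a minimal sketch of the kind of class a test such as HarmonicRunningAverageTest might exercise. It is an assumption for illustration only: it is neither Nick's code nor part of Mahout, the class name HarmonicRunningAverageSketch is hypothetical, and it does not implement Mahout's RunningAverage interface (the method names merely echo it).

```java
// Hypothetical sketch, not Mahout code: a running harmonic mean n / sum(1/x_i).
final class HarmonicRunningAverageSketch {
    private int count;
    private double reciprocalSum;

    /** Adds a strictly positive datum; the harmonic mean is undefined otherwise. */
    void addDatum(double datum) {
        if (datum <= 0.0) {
            throw new IllegalArgumentException("datum must be positive: " + datum);
        }
        count++;
        reciprocalSum += 1.0 / datum;
    }

    int getCount() {
        return count;
    }

    /** Returns the harmonic mean of all data added so far, or NaN when empty. */
    double getAverage() {
        return count == 0 ? Double.NaN : count / reciprocalSum;
    }

    public static void main(String[] args) {
        HarmonicRunningAverageSketch avg = new HarmonicRunningAverageSketch();
        avg.addDatum(1.0);
        avg.addDatum(2.0);
        avg.addDatum(4.0);
        // 3 / (1 + 0.5 + 0.25) = 12/7, roughly 1.714
        System.out.println(avg.getAverage());
    }
}
```

A unit test for such a class would typically add a few values and compare getAverage() against the hand-computed harmonic mean within a small tolerance.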


Building Mahout

2012-09-13 Thread David Scarlatti
Hi, I'm installing Mahout, following these steps (
http://cloudblog.8kmiles.com/2012/01/31/apache-mahout-installation-on-hadoop-cluster/
):

user1@ubuntu-server:~$ apt-get install maven2
user1@ubuntu-server:~$ cd /opt
user1@ubuntu-server:~$ svn co http://svn.apache.org/repos/asf/mahout/trunk
user1@ubuntu-server:~$ mv trunk mahout_trunk
user1@ubuntu-server:~$ ln -s mahout_trunk/ mahout
user1@ubuntu-server:~$ cd mahout
user1@ubuntu-server:~$ mvn install


I get this result:

Results :

Failed tests:
testCanopyEuclideanMRJobNoClustering(org.apache.mahout.clustering.meanshift.TestMeanShift):
count expected:<3> but was:<4>

Tests run: 676, Failures: 1, Errors: 0, Skipped: 0

[ERROR] BUILD FAILURE
[INFO] There are test failures.

Please refer to /opt/mahout_trunk/core/target/surefire-reports for the
individual test results.
[INFO] For more information, run Maven with the -e switch
[INFO] Total time: 51 minutes 43 seconds
[INFO] Finished at: Thu Sep 13 12:27:34 CEST 2012
[INFO] Final Memory: 50M/399M

This seems to be a known issue (the same web page says: P.S.: Sometimes some tests fail
while building mahout from source. In such cases use – user1@ubuntu-server:~$
mvn -DskipTests install)

I'd like to understand what exactly the tests that are run are. Are they optional?

Is an installation built with -DskipTests the same as the "clean" one?

Is there any documentation available that explains the installation a bit?

Thanks in advance.

-- 
-
David.


Re: Building Mahout

2012-09-13 Thread Paritosh Ranjan
The current build is broken, as sometimes happens during development:
https://builds.apache.org/job/Mahout-Quality/1658/console

Until it is fixed, I would suggest skipping the tests when you build.


Re: Running a single test

2012-09-13 Thread Nick Kolegraff
Thanks for the response.

I have tried firing the command from both
$MAHOUT_HOME (as this suggests
https://cwiki.apache.org/MAHOUT/buildingmahout.html)
$MAHOUT_HOME/core

both with the same errors as reported above.
($MAHOUT_HOME is set to the root of mahout)

using:
git: a92140e3b546de25096079ef0e029b8cb1908ddc
git-svn-id: https://svn.apache.org/repos/asf/mahout/trunk@1377380 13f79535-47bb-0310-9956-ffa450edef68


Re: Running a single test

2012-09-13 Thread Nick Kolegraff
Ok, sorry for the bother.  I took a recent pull and everything seems to be
working fine now.

For others' reference, I am now working from:
commit: 79313f55c3c3d38a4999a5cb0656170bc9e29434
git-svn-id: https://svn.apache.org/repos/asf/mahout/trunk@1381780 13f79535-47bb-0310-9956-ffa450edef68


RE: Building Mahout

2012-09-13 Thread I-Scarlatti, David
Ok. So tests are just tests... not needed for having mahout running 

Thanks!



Re: Is mahout kmeans slow ?

2012-09-13 Thread Pat Ferrel
What distance measure?

On Sep 12, 2012, at 10:37 PM, Elaine Gan wrote:

My -cd was quite loose; I set it to 0.1.

Hmm, maybe the data is too small, causing the low performance?


> 200 iterations?
> 
> What is your convergence delta? If it is too small for your distance measure 
> you will perform all 200 iterations, every time you cluster. 
> 
> --convergenceDelta (-cd) convergenceDelta   The convergence delta value. Default is 0.5
> 
> I would set the convergence delta looser and see if 100 or even 20 iterations 
> produces good results. You can always tweak your other parameters to get them 
> tuned and up your convergence if needed. Also remember that a good 
> convergence is related to your distance measure so you need to think about 
> which distance measure works for your data.
> 
> I generally only take 10-20 iterations using cosine distance and 0.001 as the 
> convergence delta, which would be 20-40 minutes for you.
> 
> On Sep 12, 2012, at 7:26 PM, Elaine Gan  wrote:
> 
> Hi,
> 
> I'm trying to do some text analysis using mahout kmeans (clustering),
> processing the data on hadoop.
> --numClusters = 160 
> --maxIter (-x) maxIter = 200
> 
> My data is small, around 500 MB.
> I have 4 servers, each with 4 CPUs, and the TaskTrackers are set to a
> maximum of 4.
> When I run the mahout task, I can see that there are at most 3 map
> tasks, so I guess I do not need to do any tuning on this at the
> moment.
> 
> One iteration took around 1.5 to 2 minutes to finish.
> I am not sure whether this is normal or considered slow; can anyone
> give me some advice on this?
> 
> With x = 200, it took around 200 x 2 min = 400 min, i.e. more than 6 hours,
> to finish the whole analysis.
> Is this unavoidable?
> Does a bigger "x" always mean the kmeans job takes longer to finish?
> 
> Are there any ways to speed up mahout kmeans?
> 
> Thank you.
> 
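The arithmetic in this exchange can be checked with a few lines of Java. This is only a back-of-the-envelope sketch using the figures quoted in the thread (1.5 to 2 minutes per iteration, 200 iterations as the cap, 10 to 20 iterations as the target); the class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate: total time = iterations x time per iteration.
final class KMeansRuntimeEstimate {

    /** Total minutes for a job that performs the given number of iterations. */
    static double totalMinutes(int iterations, double minutesPerIteration) {
        return iterations * minutesPerIteration;
    }

    /** Formats a duration in minutes as "H h M min", rounded to the nearest minute. */
    static String format(double minutes) {
        long rounded = Math.round(minutes);
        return (rounded / 60) + " h " + (rounded % 60) + " min";
    }

    public static void main(String[] args) {
        System.out.println(format(totalMinutes(200, 2.0))); // 6 h 40 min
        System.out.println(format(totalMinutes(200, 1.5))); // 5 h 0 min
        System.out.println(format(totalMinutes(20, 2.0)));  // 0 h 40 min
        System.out.println(format(totalMinutes(10, 1.5)));  // 0 h 15 min
    }
}
```

The per-iteration cost is roughly fixed by the data and the cluster, so the number of iterations actually performed is the main lever on total time.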




Re: Is mahout kmeans slow ?

2012-09-13 Thread Pat Ferrel
Actually, if it is really taking 200 iterations then it is never matching your
convergence delta. That means either your data does not cluster well or your
convergence delta is still too tight.

I was suggesting that you loosen the convergence delta until it only takes
10-20 iterations to cluster, then look at the data, tune your other parameters,
scrub your input, etc. before tightening your delta. If it takes 6 hours to
cluster then tuning your other params will take too long, so do them first.
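To make the interplay between the convergence delta and the iteration cap concrete, here is a toy one-dimensional k-means loop. It is a sketch under stated assumptions, not Mahout's KMeansDriver: the class name, the absolute-difference distance and the sample data are all invented for illustration. The loop stops as soon as no centre moves by more than the delta, and otherwise runs until maxIter.

```java
// Toy 1-D k-means, for illustration only (not Mahout's implementation).
final class ConvergenceDeltaSketch {

    /**
     * Runs k-means on 1-D points, updating centers in place.
     * Returns the number of iterations actually performed.
     */
    static int run(double[] points, double[] centers, double convergenceDelta, int maxIter) {
        for (int iter = 1; iter <= maxIter; iter++) {
            double[] sum = new double[centers.length];
            int[] n = new int[centers.length];
            // Assignment step: each point goes to its nearest centre.
            for (double p : points) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) {
                        best = c;
                    }
                }
                sum[best] += p;
                n[best]++;
            }
            // Update step: move each centre to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < centers.length; c++) {
                if (n[c] == 0) {
                    continue; // an empty cluster keeps its centre
                }
                double updated = sum[c] / n[c];
                if (Math.abs(updated - centers[c]) > convergenceDelta) {
                    converged = false;
                }
                centers[c] = updated;
            }
            if (converged) {
                return iter;
            }
        }
        return maxIter; // the cap was reached
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        System.out.println(run(points, new double[] {0, 5}, 0.5, 200)); // 3
        System.out.println(run(points, new double[] {0, 5}, 5.0, 200)); // 1
        System.out.println(run(points, new double[] {0, 5}, 0.5, 2));   // 2 (capped)
    }
}
```

With the same data and starting centres, the looser delta stops after one iteration while the tighter one needs three; a cap lower than that simply truncates the run, which is what a job that always uses all 200 iterations is experiencing.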



Re: Building Mahout

2012-09-13 Thread Ted Dunning
Yes.  It is a grave embarrassment to us, but not a functional requirement.

On Thu, Sep 13, 2012 at 6:42 AM, I-Scarlatti, David <
david.scarla...@boeing.com> wrote:

> Ok. So tests are just tests... not needed for having mahout running
>
> Thanks!


Re: Mahout Kmeans

2012-09-13 Thread Gustavo Enrique Salazar Torres
Hi Paritosh:

I made it work in Hadoop mode, not local mode. I don't know if that's desirable.
When running locally I also got an error saying that the Hadoop libraries are
missing and, from what I saw in the mahout script, it simply discards all
libraries when MAHOUT_LOCAL is set.
So, is the local mode used for anything? (please forgive my ignorance, I
don't know the whole project)

Gustavo

On Sat, Sep 8, 2012 at 2:35 AM, Paritosh Ranjan  wrote:

> Can you open up a jira describing the problem and submitting the patch for
> your fix?
> https://issues.apache.org/jira/browse/MAHOUT
>
>
> On 08-09-2012 09:40, Gustavo Enrique Salazar Torres wrote:
>
>> Nevermind, got it to work, had to fix the script though.
>>
>> Thanks.
>> Gustavo
>>
>> On Fri, Sep 7, 2012 at 5:54 PM, Gustavo Enrique Salazar Torres <
>> gsala...@ime.usp.br> wrote:
>>
>>> Hi there:
>>>
>>> I'm trying to finish an improvement to the Kmeans algorithm but I first
>>> need to get it run in order to compare results.
>>> But running the cluster-reuters.sh script I get this error:
>>>
>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>> Running on hadoop, using /home/gustavo/Desktop/yandex_data/hadoop-0.20.203.0/bin/hadoop and
>>> HADOOP_CONF_DIR=/home/gustavo/Desktop/yandex_data/hadoop-0.20.203.0/conf
>>> MAHOUT-JOB: /home/gustavo/Desktop/yandex_data/mahout-distribution-0.7/mahout-examples-0.7-job.jar
>>> 12/09/07 17:47:43 INFO common.AbstractJob: Command line arguments:
>>> {--clustering=null, --clusters=[./reuters-kmeans-clusters],
>>> --convergenceDelta=[0.5],
>>> --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
>>> --endPhase=[2147483647],
>>> --input=[./reuters_out_seqdir_kmeans/tfidf-vectors], --maxIter=[10],
>>> --method=[mapreduce], --numClusters=[20], --output=[./reuters-kmeans],
>>> --overwrite=null, --startPhase=[0], --tempDir=[temp]}
>>> 12/09/07 17:47:44 INFO common.HadoopUtil: Deleting reuters-kmeans-clusters
>>> 12/09/07 17:47:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>> 12/09/07 17:47:44 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>> 12/09/07 17:47:44 INFO compress.CodecPool: Got brand-new compressor
>>> 12/09/07 17:47:44 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to reuters-kmeans-clusters/part-randomSeed
>>> 12/09/07 17:47:44 INFO kmeans.KMeansDriver: Input: reuters_out_seqdir_kmeans/tfidf-vectors Clusters In: reuters-kmeans-clusters/part-randomSeed Out: reuters-kmeans Distance: org.apache.mahout.common.distance.CosineDistanceMeasure
>>> 12/09/07 17:47:44 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
>>> 12/09/07 17:47:44 INFO compress.CodecPool: Got brand-new decompressor
>>> Exception in thread "main" java.lang.IllegalStateException: No input clusters found in reuters-kmeans-clusters/part-randomSeed. Check your -c argument.
>>>     at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:218)
>>>
>>> As you can see, the initial clusters are being created, but for a reason I
>>> don't understand they are not being found.
>>> Below is the 'cat' command on the part file containing clusters:
>>>
>>> $ dfs -cat reuters-kmeans-clusters/part-randomSeed
>>> SEQ org.apache.hadoop.io.Text5org.apache.mahout.clustering.iterator.ClusterWritable*org.apache.hadoop.io.compress.DefaultCodec [binary data]
>>>
>>> Can anyone help me please?
>>>
>>> Thanks
>>> Gustavo Salazar
>>>
>>>
>
>


hadoop-0.19 and mahout 0.7: throwing incompatible errors, how can I fix it?

2012-09-13 Thread Phoenix Bai
Hi guys,

I am trying to build and run my application code using mahout 0.7 and hadoop 0.19.
When I run it, it throws the errors below:

$ hadoop jar cluster-0.0.1-SNAPSHOT-jar-with-dependencies.jar
mahout.sample.ClusterVideos
12/09/13 20:36:18 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
12/09/13 20:36:31 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
12/09/13 20:36:31 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
java.lang.VerifyError: (class: org/apache/hadoop/mapreduce/Job, method: submit signature: ()V) Incompatible argument to function
    at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:78)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:253)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:55)
    at mahout.sample.ClusterVideos.runSeq2Sparse(ClusterVideos.java:133)
    at mahout.sample.ClusterVideos.main(ClusterVideos.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


This is due to an incompatibility between hadoop 0.19 and mahout 0.7, right?
If so, how can I fix it?
I can't upgrade hadoop 0.19 because that is not up to me,
and I don't want to use mahout 0.5 either because, in that case, I might
have to rewrite my application code.

So, is there any way to solve this, for example with a patch?

Thanks
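One way to narrow down an error like this is to check which Hadoop classes the JVM actually sees. The sketch below is a diagnostic aid written for a current JDK, not a fix, and it rests on an assumption the thread does not confirm: that the jar-with-dependencies bundles Hadoop classes that differ from the ones installed on the cluster. The class name HadoopClasspathProbe is hypothetical; it uses reflection so that it compiles and runs even when Hadoop is absent.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Diagnostic sketch: reports which Hadoop version and jar are visible to the JVM.
final class HadoopClasspathProbe {

    /** Returns where a class would be loaded from, without initializing it. */
    static String locationOf(String className) {
        try {
            Class<?> c = Class.forName(className, false,
                    HadoopClasspathProbe.class.getClassLoader());
            CodeSource source = c.getProtectionDomain().getCodeSource();
            return source == null
                    ? "(bootstrap or unknown)"
                    : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    /** Returns the version reported by Hadoop's VersionInfo, if it is on the classpath. */
    static String hadoopVersion() {
        try {
            Class<?> info = Class.forName("org.apache.hadoop.util.VersionInfo");
            Method getVersion = info.getMethod("getVersion");
            return String.valueOf(getVersion.invoke(null));
        } catch (ReflectiveOperationException | LinkageError e) {
            return "(unavailable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println("Hadoop version: " + hadoopVersion());
        System.out.println("Job loaded from: "
                + locationOf("org.apache.hadoop.mapreduce.Job"));
    }
}
```

If the class named in the VerifyError turns out to come from the application jar while the reported version is the cluster's, that would point to two Hadoop versions being mixed on one classpath.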


Re: Newbie question on modeling a Recommender using Mahout when the matrix is sparse

2012-09-13 Thread Gokul Pillai
Very true, good catch. I think I was interpreting the results the wrong way.
I expect only the top 5, so I changed the parameter to "5" instead of "10"
and the results are as expected now.

Thanks.

On Wed, Sep 12, 2012 at 11:36 PM, Sean Owen  wrote:

> Well there are only 7 products in the universe! If you ask for 10
> recommendations, you will always get all unrated items back in the
> recommendations. That's always true unless the algorithm can't
> actually establish a value for some items.
>
> What result were you expecting: fewer than 10 recs? Fewer than 7?
>
> On Thu, Sep 13, 2012 at 6:55 AM, Gokul Pillai 
> wrote:
> > I am trying out Mahout to come up with product recommendations for users
> > based on data that shows which products they use today.
> > The data is not web-scale, just about 300,000 users and 7 products. A few
> > comments about the data:
> > 1. Since users either have or do not have a particular product, the value in
> > the matrix is either "1" or "0" for all the columns (rows being the userids).
> > 2. All the users have one basic product, so I left it out of the
> > data model passed to the Mahout recommender, since I assume that if everyone
> > has the same product, its effect on the recommendations is trivial.
> > 3. The matrix itself is sparse; the total counts of users having each
> > product are:
> > A=31847, 54754,1897 |23154 |2201 |2766 |33585
> >
> > Steps followed:
> > 1. Created a data source from the user-product table in the database:
> > File ratingsFile = new File("datasets/products.csv");
> > DataModel model = new FileDataModel(ratingsFile);
> > 2. Created a recommender on this data:
> > CachingRecommender recommender = new CachingRecommender(new SlopeOneRecommender(model));
> > 3. Looped through all users and got the top ten recommendations:
> > List<RecommendedItem> recommendations = recommender.recommend(userId, 10);
> >
> > Issue faced:
> > The problem I am facing is that the recommendations that come out are way
> > too simple: all that seems to be recommended is "if a user does not have
> > product A, then recommend it; if they don't have product B, then recommend
> > it; and so on." Basically a simple inverse of their ownership status.
> >
> > Obviously, I am not doing something right here. How can I do the modeling
> > better to get the right recommendations? Or is it that my dataset (30
> > users times 7 products) is too small for Mahout to work with?
> >
> > Look forward to your comments. Thanks.
>
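Sean's point can be reproduced without Mahout. The sketch below is an illustration only: it is not the SlopeOneRecommender algorithm, and the class name, the scores and the ownership flags are made up. It shows that when the catalogue has 7 items and a user owns some of them, asking for 10 recommendations necessarily returns every unowned item, so the result looks like a plain inverse of ownership.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: top-N selection over a tiny catalogue (not Mahout code).
final class TopNSketch {

    /** Returns up to howMany unowned item indexes, highest score first. */
    static List<Integer> recommend(boolean[] owned, double[] score, int howMany) {
        List<Integer> candidates = new ArrayList<>();
        for (int item = 0; item < owned.length; item++) {
            if (!owned[item]) {
                candidates.add(item);
            }
        }
        // Sort candidates by descending score.
        candidates.sort((a, b) -> Double.compare(score[b], score[a]));
        int n = Math.min(howMany, candidates.size());
        return new ArrayList<>(candidates.subList(0, n));
    }

    public static void main(String[] args) {
        boolean[] owned = {true, false, true, false, false, true, false};
        double[] score = {0.9, 0.2, 0.8, 0.7, 0.1, 0.5, 0.4}; // hypothetical scores
        System.out.println(recommend(owned, score, 10)); // [3, 6, 1, 4]: every unowned item
        System.out.println(recommend(owned, score, 2));  // [3, 6]: only the best two
    }
}
```

Only when the requested count is smaller than the number of unowned items does the ranking start to matter, which matches the observation that asking for fewer recommendations gave more sensible output.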


Re: Mahout Kmeans

2012-09-13 Thread Paritosh Ranjan
The general convention is that if there is a MAHOUT_LOCAL env variable, 
this means run 'pseudo-distributed' rather than against a cluster.
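
As a rough illustration of that convention, the decision can be written as a tiny check on the environment. This is a sketch, not the logic of the real bin/mahout script (which is a shell script and may consider more than this one variable); the class and method names are hypothetical, and treating an empty value as "not set" is an assumption.

```java
import java.util.Map;

// Sketch of the MAHOUT_LOCAL convention; not the actual bin/mahout logic.
final class MahoutModeSketch {

    /** Returns true when MAHOUT_LOCAL is present and non-empty in the given environment. */
    static boolean runsLocally(Map<String, String> env) {
        String value = env.get("MAHOUT_LOCAL");
        return value != null && !value.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(runsLocally(System.getenv())
                ? "MAHOUT_LOCAL is set: not running against the cluster"
                : "MAHOUT_LOCAL is not set: running on hadoop");
    }
}
```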

