Re: mahout failing with -c as required option

2015-03-10 Thread Suneel Marthi
Try

./mahout kmeans \
  -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-0 \
  -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters \
  -c some-folder \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -x 5 -ow -cl -k 25

I don't have a machine in front of me, so I have no way to try this out.

But IIRC the way this works is:

a) If you specify an initial seed of centroids via -c, you don't need to
specify -k, since the number of centroids in the seed becomes k.

b) If you let the algorithm choose random centroids by specifying -k, it
needs a path to write the random centroids to, hence -c is required even
with -k.
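For illustration, the two styles would look roughly like this (a sketch only:
the input/output paths are placeholders, and per the caveat above none of
this has been tried here):

  # (a) you supply the seed centroids: -c points at existing centroid
  # SequenceFiles and -k is omitted
  ./mahout kmeans -i <input-vectors> -o <output-dir> -c my-seed-centroids \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -x 5 -ow -cl

  # (b) random seeding: -k 25 samples 25 input vectors and writes them
  # under the -c path before iterating
  ./mahout kmeans -i <input-vectors> -o <output-dir> -c some-empty-folder \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -x 5 -ow -cl -k 25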

Re: mahout failing with -c as required option

2015-03-10 Thread Suneel Marthi
Oops! I meant to say that -c is required for the random centroid
initialization if -k is specified. It initializes k random centroids in the
folder specified by -c, so yes, -c is required.

On Tue, Mar 10, 2015 at 1:42 AM, Raghuveer alwaysra...@yahoo.com.invalid
wrote:

 No, I have removed the -c option now, so I get the mentioned exception that
 -c is mandatory.


  On Tuesday, March 10, 2015 11:06 AM, Suneel Marthi 
 suneel.mar...@gmail.com wrote:


  Are you still specifying the -c option? It's only needed if you have
 initial centroids to launch KMeans from; otherwise KMeans picks random
 centroids.

 Also, CosineDistanceMeasure doesn't make sense with KMeans, which operates
 in Euclidean space - try using SquaredEuclidean or Euclidean distances.

 On Tue, Mar 10, 2015 at 1:27 AM, Raghuveer alwaysra...@yahoo.com.invalid
 wrote:

  Hi All,
  I am trying to run the command:

  ./mahout kmeans \
    -i hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 \
    -o hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer \
    -c hdfs://master:54310/user/netlog/upload/mahoutoutput \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
    -x 5 -ow -cl -k 25 -xm mapreduce

  Since I don't have any clusters yet to give as input, the forums suggested
  I could remove it. But now I get the error:
 
  Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
  MAHOUT-JOB: /home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
  15/03/10 10:52:53 ERROR common.AbstractJob: Missing required option --clusters
  Missing required option --clusters
 
  Usage:
   [--input <input> --output <output> --distanceMeasure <distanceMeasure>
   --clusters <clusters> --numClusters <k> --randomSeed <randomSeed1>
   [<randomSeed2> ...] --convergenceDelta <convergenceDelta> --maxIter <maxIter>
   --overwrite --clustering --method <method> --outlierThreshold
   <outlierThreshold> --help --tempDir <tempDir> --startPhase <startPhase>
   --endPhase <endPhase>]
    --clusters (-c) clusters    The input centroids, as Vectors.  Must be a
                                SequenceFile of Writable, Cluster/Canopy.  If
                                k is also specified, then a random set of
                                vectors will be selected and written out to
                                this path first
  15/03/10 10:52:53 INFO driver.MahoutDriver: Program took 370 ms (Minutes: 0.006167)
  Kindly help me out.
  Thanks
 
 
 





Re: mahout failing with -c as required option

2015-03-10 Thread Raghuveer
OK, so if -c is required, then how can I give it, or at least is there a way
to remove -k itself?

./mahout kmeans \
  -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-0 \
  -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -x 5 -ow -cl -k 25

and

./mahout kmeans \
  -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-0 \
  -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters \
  -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
  -x 5 -ow -cl

Both still give the same exception. Kindly suggest.

Re: mahout failing with -c as required option

2015-03-10 Thread Raghuveer
I see the error below:
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
15/03/10 11:50:20 INFO common.AbstractJob: Command line arguments: {--clustering=null,
--clusters=[hdfs://master:54310/user/netlog/upload/mahoutoutput],
--convergenceDelta=[0.5],
--distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure],
--endPhase=[2147483647],
--input=[hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0],
--maxIter=[5], --method=[mapreduce], --numClusters=[25],
--output=[hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer],
--overwrite=null, --startPhase=[0], --tempDir=[temp]}
15/03/10 11:50:21 INFO common.HadoopUtil: Deleting hdfs://master:54310/user/netlog/upload/mahoutoutput
15/03/10 11:50:21 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/03/10 11:50:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
15/03/10 11:50:21 INFO kmeans.RandomSeedGenerator: Wrote 25 Klusters to hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed
15/03/10 11:50:21 INFO kmeans.KMeansDriver: Input: hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 Clusters In: hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed Out: hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
15/03/10 11:50:21 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 5
15/03/10 11:50:21 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed. Check your -c argument.
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)
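A quick way to sanity-check this (a sketch, assuming your HDFS client is
configured for the master:54310 cluster above): list the -c path and dump the
seed file that RandomSeedGenerator claims to have written, using Mahout's
seqdumper utility:

  # confirm the random-seed file exists and is non-empty
  hadoop fs -ls hdfs://master:54310/user/netlog/upload/mahoutoutput
  # inspect the serialized centroids it contains
  ./mahout seqdumper -i hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed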

Re: implementation of context-aware recommender in Mahout

2015-03-10 Thread Efi Koulouri
Things got clearer with your help!

Thank you very much

On 9 March 2015 at 01:50, Ted Dunning ted.dunn...@gmail.com wrote:

 Efi,

 Only you can really tell which is best for your efforts.  All the rest is
 our own partially informed opinions.

 Pre-filtering can often be accomplished in the search context by creating
 more than one indicator field and using different combinations of
 indicators for different tasks.  For instance, you could create indicators
 for last one, two, three, five and seven days.  Then when you query the
 engine, you can pick which indicators to try.  That way the same search
 engine can embody multiple recommendation engines.
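  As a rough sketch of what picking indicators at query time could look like
  (Elasticsearch syntax here; the index, field, and item names are invented
  for illustration):

    # one task queries only the 7-day indicator field...
    curl -s 'http://localhost:9200/recs/items/_search' -d '
    { "query": { "terms": { "indicators_7day": ["item42", "item91"] } } }'

    # ...another combines several indicator fields over the same index
    curl -s 'http://localhost:9200/recs/items/_search' -d '
    { "query": { "bool": { "should": [
        { "terms": { "indicators_1day": ["item42"] } },
        { "terms": { "indicators_7day": ["item42", "item91"] } } ] } } }'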

 I would also tend toward search-based approaches for your testing, if only
 because any deployed system is likely to use a search approach and thus
 testing that approach in your off-line testing gives you the most realistic
 results.


 On Sun, Mar 8, 2015 at 10:21 AM, Efi Koulouri ekoulou...@gmail.com
 wrote:

  Thanks for your help!
 
  Actually, I want to build a recommender for experimental purposes following
  the pre-filtering and post-filtering approaches that I described. I already
  have two datasets, and I want to show the benefits of using a context-aware
  recommender. So, the recommender is going to work offline.

  I saw that the search engine approach is very interesting, but in my case I
  think that building the recommender using the Java classes is more
  appropriate, as I need to use both approaches (post-filtering and
  pre-filtering). Am I right?
 
  On 8 March 2015 at 16:08, Ted Dunning ted.dunn...@gmail.com wrote:
 
   By far the easiest way to build a recommender (especially for production)
   is to use the search engine approach (what Pat was recommending).

   Post-filtering can be done using the search engine far more easily than
   using Java classes.
  



Re: implementation of context-aware recommender in Mahout

2015-03-10 Thread Ted Dunning
Glad to help.

You can help us by reporting your results when you get them.

We look forward to that!





Re: mahout spark-itemsimilarity from command line

2015-03-10 Thread Jeff Isenhart
OK, so the solution to the issue was to add the following to my core-site.xml:

<!-- Added to try and solve mahout issue claiming 'No FileSystem for scheme: hdfs' -->
<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
  <description>The FileSystem for file: uris.</description>
</property>
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>
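With that in place, a quick check that the hdfs: scheme now resolves (the
path assumes the localhost:9000 NameNode used earlier in this thread):

  # should list the directory instead of throwing "No FileSystem for scheme: hdfs"
  hadoop fs -ls hdfs://localhost:9000/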

 On Monday, March 9, 2015 11:38 AM, Pat Ferrel p...@occamsmachete.com 
wrote:
   

Mahout is on Spark 1.1.0 (before last week) and 1.1.1 as of current master.
Running locally should use these, but make sure they are installed if you run
with anything other than --master local.

The next thing to try is to see which version of Hadoop both Mahout and Spark
are compiled for; it must be the one you have installed. Check the build
instructions for Spark at https://spark.apache.org/docs/latest/building-spark.html
(this page is for 1.2.1, but make sure you have source for 1.1.0 or 1.1.1)
and for Mahout at http://mahout.apache.org/developers/buildingmahout.html
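For reference, building Spark against a specific Hadoop version looks roughly
like this (per the linked build page; the Hadoop version shown is just an
example, use whatever your cluster runs):

  # from the Spark source tree
  mvn -Dhadoop.version=2.4.0 -DskipTests clean package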

On Mar 9, 2015, at 11:20 AM, Jeff Isenhart jeffi...@yahoo.com.INVALID wrote:

Here is what I get with hadoop fs -ls
-rw-r--r--  1 username supergroup    5510526 2015-03-09 11:10 transactions.csv
Yes, I am trying to run a local version of Spark (trying to run everything 
local at the moment)
and when I run 
./bin/mahout spark-itemsimilarity -i transactions.csv -o output -fc 1 -ic 2
15/03/09 11:18:30 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@10.0.1.20:50565/user/HeartbeatReceiver
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
    at org.apache.mahout.common.HDFSPathSearch.<init>(HDFSPathSearch.scala:36)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.readIndexedDatasets(ItemSimilarityDriver.scala:152)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.process(ItemSimilarityDriver.scala:213)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:116)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:114)
    at scala.Option.map(Option.scala:145)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.main(ItemSimilarityDriver.scala:114)
    at org.apache.mahout.drivers.ItemSimilarityDriver.main(ItemSimilarityDriver.scala)

    On Monday, March 9, 2015 10:51 AM, Pat Ferrel p...@occamsmachete.com 
wrote:


From the command line, can you run:

    hadoop fs -ls

and see SomeDir/transactions.csv? It looks like HDFS is not accessible from
wherever you are running spark-itemsimilarity.

Are you trying to run a local version of Spark? The default is "--master
local". This can still access a clustered HDFS if you are configured to
access it from your machine.


On Mar 9, 2015, at 10:35 AM, Jeff Isenhart jeffi...@yahoo.com.INVALID wrote:

bump...anybody??? 

    On Wednesday, March 4, 2015 9:22 PM, Jeff Isenhart 
jeffi...@yahoo.com.INVALID wrote:


I am having an issue getting a simple itemsimilarity example to work. I know
Hadoop is up and functional (I ran the example MapReduce program, anyway).
But when I run either of these:

./mahout spark-itemsimilarity -i SomeDir/transactions.csv \
  -o hdfs://localhost:9000/users/someuser/output -fc 1 -ic 2

./mahout spark-itemsimilarity -i SomeDir/transactions.csv -o SomeDir/output \
  -fc 1 -ic 2

I get:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
    at org.apache.mahout.common.HDFSPathSearch.<init>(HDFSPathSearch.scala:36)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.readIndexedDatasets(ItemSimilarityDriver.scala:152)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.process(ItemSimilarityDriver.scala:213)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:116)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:114)

spark-item-similarity incremental update

2015-03-10 Thread Kevin Zhang
Hi,

Does anybody have any idea how to do an incremental update for the item
similarity? I mean, how can I apply the latest user action data, for example
today's data? Do I have to run it again over the entire dataset?

Thanks,
Kevin

Re: spark-item-similarity incremental update

2015-03-10 Thread Pat Ferrel
The latest user actions work just fine as the query against the model from
the last time you ran spark-itemsimilarity. Go to the demo site
https://guide.finderbots.com and run through the “trainer”; the things you
pick are instantly used to make recs. spark-itemsimilarity was not re-run.
The only times you really have to re-run it are:
1) you have new items with interactions. You can only recommend what you
trained with.
2) you have enough new user data to significantly change the model.

There is no incremental way to update the model (yet), but it can be rerun
in a few minutes, and as I said you get recs with realtime user history,
even for new users not in the training data.
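A periodic re-run can be as simple as repeating the original command over the
full interaction log (same flags as used earlier in this thread; the paths
here are hypothetical):

  # e.g. nightly: rebuild the indicator model from the complete log
  ./mahout spark-itemsimilarity -i hdfs://localhost:9000/users/someuser/transactions.csv \
    -o hdfs://localhost:9000/users/someuser/output -fc 1 -ic 2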




Re: spark-item-similarity incremental update

2015-03-10 Thread Pat Ferrel
Just to be clear, #1 was about new items, not users. New users will work as
long as you have history for them.



Re: spark-item-similarity incremental update

2015-03-10 Thread Kevin Zhang
I see. Thank you, Pat. 



