Re: mahout failing with -c as required option
Try:

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -c some-folder -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl -k 25

I don't have a machine in front of me, so no way to try this out. But IIRC the way this works is:

a) You specify an initial seed of centroids via -c. You then don't need to specify -k, since the number of centroids supplied as the seed becomes k.
b) You let the algorithm choose random centroids by specifying -k. It then needs -c as the location to write the random centroids to, hence -c is required along with -k.

On Tue, Mar 10, 2015 at 2:09 AM, Raghuveer alwaysra...@yahoo.com wrote:

OK, so if -c is required, how can I supply it? Or at least, is there a way to remove -k itself?

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl -k 25

and

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl

both still give the same exception. Kindly suggest.

On Tuesday, March 10, 2015 11:35 AM, Suneel Marthi suneel.mar...@gmail.com wrote:

Oops! I meant to say that -c is required for the random centroid initialization if -k is specified. It initializes k random centroids in the folder specified by -c, so yes, -c is required.

On Tue, Mar 10, 2015 at 1:42 AM, Raghuveer alwaysra...@yahoo.com.invalid wrote:

No, I have removed the -c option, and now I get the mentioned exception that -c is mandatory.

On Tuesday, March 10, 2015 11:06 AM, Suneel Marthi suneel.mar...@gmail.com wrote:

Are you still specifying the -c option? It's only needed if you have initial centroids to launch the KMeans from; otherwise KMeans picks random centroids. Also, CosineDistanceMeasure doesn't make sense with KMeans, which operates in Euclidean space; try using SquaredEuclidean or Euclidean distances.

On Tue, Mar 10, 2015 at 1:27 AM, Raghuveer alwaysra...@yahoo.com.invalid wrote:

Hi All, I am trying to run the command:

./mahout kmeans -i hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer -c hdfs://master:54310/user/netlog/upload/mahoutoutput -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25 -xm mapreduce

Since I don't have any clusters yet to give as input, the forums suggested I could remove -c. But now I get the error:

Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
15/03/10 10:52:53 ERROR common.AbstractJob: Missing required option --clusters
Missing required option --clusters
Usage: [--input input --output output --distanceMeasure distanceMeasure --clusters clusters --numClusters k --randomSeed randomSeed1 [randomSeed2 ...] --convergenceDelta convergenceDelta --maxIter maxIter --overwrite --clustering --method method --outlierThreshold outlierThreshold --help --tempDir tempDir --startPhase startPhase --endPhase endPhase]
--clusters (-c) clusters    The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first
15/03/10 10:52:53 INFO driver.MahoutDriver: Program took 370 ms (Minutes: 0.006167)

Kindly help me out. Thanks
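Suneel's two cases (seed centroids supplied via -c, or -k random centroids written out under -c) can be sketched in miniature. This is an illustrative Python toy, not Mahout's actual RandomSeedGenerator; the function and data names here are invented for demonstration:

```python
import random

def init_centroids(points, seed_centroids=None, k=None):
    """Mirror of the -c / -k semantics described above (illustrative only):
    if seed centroids are given, k is implied by their count; if only k is
    given, k random points are chosen (Mahout still needs a -c directory
    to write those random seeds to)."""
    if seed_centroids is not None:
        # case (a): -c supplies the seed, so -k is unnecessary
        return list(seed_centroids)
    if k is None:
        raise ValueError("need either seed centroids (-c) or k (-k)")
    # case (b): -k picks random seeds, which get written under -c
    return random.sample(points, k)

points = [[0, 0], [1, 1], [2, 2], [9, 9]]
assert len(init_centroids(points, k=2)) == 2
assert init_centroids(points, seed_centroids=[[0, 0], [9, 9]]) == [[0, 0], [9, 9]]
```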
Re: mahout failing with -c as required option
Oops! I meant to say that -c is required for the random centroid initialization if -k is specified. It initializes k random centroids in the folder specified by -c, so yes, -c is required.

On Tue, Mar 10, 2015 at 1:42 AM, Raghuveer alwaysra...@yahoo.com.invalid wrote:

No, I have removed the -c option, and now I get the mentioned exception that -c is mandatory.

On Tuesday, March 10, 2015 11:06 AM, Suneel Marthi suneel.mar...@gmail.com wrote:

Are you still specifying the -c option? It's only needed if you have initial centroids to launch the KMeans from; otherwise KMeans picks random centroids. Also, CosineDistanceMeasure doesn't make sense with KMeans, which operates in Euclidean space; try using SquaredEuclidean or Euclidean distances.

On Tue, Mar 10, 2015 at 1:27 AM, Raghuveer alwaysra...@yahoo.com.invalid wrote:

Hi All, I am trying to run the command:

./mahout kmeans -i hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer -c hdfs://master:54310/user/netlog/upload/mahoutoutput -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25 -xm mapreduce

Since I don't have any clusters yet to give as input, the forums suggested I could remove -c. But now I get the "Missing required option --clusters" error (full output quoted above). Kindly help me out. Thanks
Re: mahout failing with -c as required option
OK, so if -c is required, how can I supply it? Or at least, is there a way to remove -k itself?

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl -k 25

and

./mahout kmeans -i http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o /usr/netlog/upload/output4/tfidf-vectors-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -x 5 -ow -cl

both still give the same exception. Kindly suggest.

On Tuesday, March 10, 2015 11:35 AM, Suneel Marthi suneel.mar...@gmail.com wrote:

Oops! I meant to say that -c is required for the random centroid initialization if -k is specified. It initializes k random centroids in the folder specified by -c, so yes, -c is required.

On Tue, Mar 10, 2015 at 1:42 AM, Raghuveer alwaysra...@yahoo.com.invalid wrote:

No, I have removed the -c option, and now I get the mentioned exception that -c is mandatory.

On Tuesday, March 10, 2015 11:06 AM, Suneel Marthi suneel.mar...@gmail.com wrote:

Are you still specifying the -c option? It's only needed if you have initial centroids to launch the KMeans from; otherwise KMeans picks random centroids. Also, CosineDistanceMeasure doesn't make sense with KMeans, which operates in Euclidean space; try using SquaredEuclidean or Euclidean distances.

On Tue, Mar 10, 2015 at 1:27 AM, Raghuveer alwaysra...@yahoo.com.invalid wrote:

Hi All, I am trying to run the command:

./mahout kmeans -i hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 -o hdfs://master:54310//user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer -c hdfs://master:54310/user/netlog/upload/mahoutoutput -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25 -xm mapreduce

Since I don't have any clusters yet to give as input, the forums suggested I could remove -c. But now I get the "Missing required option --clusters" error (full output quoted above). Kindly help me out. Thanks
Re: mahout failing with -c as required option
I see the error below:

Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /home/raghuveer/trunk/examples/target/mahout-examples-1.0-SNAPSHOT-job.jar
15/03/10 11:50:20 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[hdfs://master:54310/user/netlog/upload/mahoutoutput], --convergenceDelta=[0.5], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0], --maxIter=[5], --method=[mapreduce], --numClusters=[25], --output=[hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
15/03/10 11:50:21 INFO common.HadoopUtil: Deleting hdfs://master:54310/user/netlog/upload/mahoutoutput
15/03/10 11:50:21 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/03/10 11:50:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
15/03/10 11:50:21 INFO kmeans.RandomSeedGenerator: Wrote 25 Klusters to hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed
15/03/10 11:50:21 INFO kmeans.KMeansDriver: Input: hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0 Clusters In: hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed Out: hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors-kmeans-clusters-raghuveer
15/03/10 11:50:21 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 5
15/03/10 11:50:21 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in hdfs://master:54310/user/netlog/upload/mahoutoutput/part-randomSeed. Check your -c argument.
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:213)

On Tuesday, March 10, 2015 11:53 AM, Raghuveer alwaysra...@yahoo.com.INVALID wrote:

I see the error below:

On Tuesday, March 10, 2015 11:45 AM, Suneel Marthi suneel.mar...@gmail.com wrote:

[earlier messages in this thread, quoted in full above]
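One thing worth double-checking in the commands earlier in this thread: some of them pass the NameNode's web UI address (http://master:50070/explorer.html#/...) as the -i argument. That is a browser URL, not a filesystem path a Hadoop job can read; the data itself lives at an hdfs:// URI. A small standard-library sketch of that distinction (the helper name is made up for illustration):

```python
from urllib.parse import urlparse

def looks_like_data_path(uri):
    """The -i/-o/-c arguments should be filesystem paths (hdfs://, file://,
    or a bare path), not the NameNode web UI URL shown in a browser."""
    scheme = urlparse(uri).scheme
    return scheme in ("", "hdfs", "file")

# The browser URL used in some commands above is not a usable input path:
assert not looks_like_data_path(
    "http://master:50070/explorer.html#/user/netlog/upload/output4/tfidf-vectors/part-r-0")
# The equivalent HDFS URI, used in the original command, is:
assert looks_like_data_path(
    "hdfs://master:54310/user/netlog/upload/output4/tfidf-vectors/part-r-0")
```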
Re: mahout failing with -c as required option
I see the error below:

On Tuesday, March 10, 2015 11:45 AM, Suneel Marthi suneel.mar...@gmail.com wrote:

[earlier messages in this thread, quoted in full above]
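Suneel's point earlier in the thread, that CosineDistanceMeasure doesn't fit k-means, comes from the update step: k-means recomputes each centroid as the coordinate-wise mean of its cluster, and the mean is precisely the point minimizing total squared Euclidean distance to the members. A small self-contained sketch (toy data, names invented):

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """The k-means update step: each centroid is the coordinate-wise mean,
    which is exactly the minimizer of total squared Euclidean distance."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

cluster = [[0.0, 2.0], [2.0, 0.0]]
c = mean_point(cluster)
assert c == [1.0, 1.0]
# The mean beats (or ties) every cluster member as a centroid choice
# under squared Euclidean distance:
total = sum(squared_euclidean(c, p) for p in cluster)
assert all(total <= sum(squared_euclidean(q, p) for p in cluster) for q in cluster)
```

Cosine distance ignores vector length, so the mean of a cluster need not be the cosine-optimal center; that mismatch is why SquaredEuclidean or Euclidean are the recommended measures here.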
Re: implementation of context-aware recommender in Mahout
Things became much clearer with your help! Thank you very much.

On 9 March 2015 at 01:50, Ted Dunning ted.dunn...@gmail.com wrote:

Efi,

Only you can really tell which is best for your efforts. All the rest is our own partially informed opinions.

Pre-filtering can often be accomplished in the search context by creating more than one indicator field and using different combinations of indicators for different tasks. For instance, you could create indicators for the last one, two, three, five, and seven days. Then when you query the engine, you can pick which indicators to try. That way the same search engine can embody multiple recommendation engines.

I would also lean toward search-based approaches for your testing, if only because any deployed system is likely to use a search approach, and thus testing that approach in your off-line testing gives you the most realistic results.

On Sun, Mar 8, 2015 at 10:21 AM, Efi Koulouri ekoulou...@gmail.com wrote:

Thanks for your help! Actually, I want to build a recommender for experimental purposes following the pre-filtering and post-filtering approaches that I described. I already have two datasets, and I want to show the benefits of using a context-aware recommender, so the recommender is going to work offline. I saw that the search-engine approach is very interesting, but in my case I think that building the recommender using the Java classes is more appropriate, as I need to use both approaches (post-filtering and pre-filtering). Am I right?

On 8 March 2015 at 16:08, Ted Dunning ted.dunn...@gmail.com wrote:

By far the easiest way to build a recommender (especially for production) is to use the search-engine approach (what Pat was recommending). Post-filtering can be done far more easily with the search engine than with the Java classes.
Re: implementation of context-aware recommender in Mahout
Glad to help. You can help us by reporting your results when you get them. We look forward to that!

On Tue, Mar 10, 2015 at 4:22 AM, Efi Koulouri ekoulou...@gmail.com wrote:

[earlier messages in this thread, quoted in full above]
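Ted's pre-filtering idea from this thread, one indicator field per trailing time window with the combination chosen at query time, can be sketched as a tiny mock. The field names and query shape below are invented for illustration and do not correspond to a real search-engine schema:

```python
# Hypothetical indicator fields, one per trailing time window (in days),
# as described above. Names are illustrative assumptions.
INDICATOR_FIELDS = {1: "ind_1d", 2: "ind_2d", 3: "ind_3d", 5: "ind_5d", 7: "ind_7d"}

def build_query(user_history, windows):
    """Pre-filtering via field selection: the same index serves several
    'recommenders' simply by querying different indicator-field
    combinations with the user's recent history as the query text."""
    return {INDICATOR_FIELDS[w]: " ".join(user_history) for w in windows}

# A 'recent behavior' recommender queries only the short windows;
# a 'stable taste' recommender might query the long ones instead.
q = build_query(["item42", "item7"], windows=[1, 7])
assert q == {"ind_1d": "item42 item7", "ind_7d": "item42 item7"}
```

The design point this illustrates is Ted's: no retraining or re-indexing is needed to switch recommenders, only a different choice of fields at query time.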
Re: mahout spark-itemsimilarity from command line
OK, so the solution to the issue was to add the following to my core-site.xml:

<!-- Added to try and solve mahout issue claiming 'No FileSystem for scheme: hdfs' -->
<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
  <description>The FileSystem for file: uris.</description>
</property>
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>

On Monday, March 9, 2015 11:38 AM, Pat Ferrel p...@occamsmachete.com wrote:

Mahout is on Spark 1.1.0 (before last week) and 1.1.1 as of current master. Running locally should use these, but make sure they are installed if you run with anything other than --master local.

The next thing to try is to see which versions of Hadoop both Mahout and Spark are compiled for; they must match the one you have installed. Check the build instructions for Spark at https://spark.apache.org/docs/latest/building-spark.html (this is for 1.2.1, but make sure you have source for 1.1.0 or 1.1.1) and for Mahout at http://mahout.apache.org/developers/buildingmahout.html

On Mar 9, 2015, at 11:20 AM, Jeff Isenhart jeffi...@yahoo.com.INVALID wrote:

Here is what I get with hadoop fs -ls:

-rw-r--r-- 1 username supergroup 5510526 2015-03-09 11:10 transactions.csv

Yes, I am trying to run a local version of Spark (trying to run everything locally at the moment), and when I run

./bin/mahout spark-itemsimilarity -i transactions.csv -o output -fc 1 -ic 2

15/03/09 11:18:30 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@10.0.1.20:50565/user/HeartbeatReceiver
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166)
    at org.apache.mahout.common.HDFSPathSearch.<init>(HDFSPathSearch.scala:36)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.readIndexedDatasets(ItemSimilarityDriver.scala:152)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.process(ItemSimilarityDriver.scala:213)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:116)
    at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:114)
    at scala.Option.map(Option.scala:145)
    at org.apache.mahout.drivers.ItemSimilarityDriver$.main(ItemSimilarityDriver.scala:114)
    at org.apache.mahout.drivers.ItemSimilarityDriver.main(ItemSimilarityDriver.scala)

On Monday, March 9, 2015 10:51 AM, Pat Ferrel p...@occamsmachete.com wrote:

From the command line, can you run

hadoop fs -ls

and see SomeDir/transactions.csv? It looks like HDFS is not accessible from wherever you are running spark-itemsimilarity. Are you trying to run a local version of Spark? The default is --master local. This can still access a clustered HDFS if you are configured to access it from your machine.

On Mar 9, 2015, at 10:35 AM, Jeff Isenhart jeffi...@yahoo.com.INVALID wrote:

bump... anybody???

On Wednesday, March 4, 2015 9:22 PM, Jeff Isenhart jeffi...@yahoo.com.INVALID wrote:

I am having an issue getting a simple itemsimilarity example to work. I know hadoop is up and functional (I ran the example mapreduce program, anyway). But when I run either of these:

./mahout spark-itemsimilarity -i SomeDir/transactions.csv -o hdfs://localhost:9000/users/someuser/output -fc 1 -ic 2
./mahout spark-itemsimilarity -i SomeDir/transactions.csv -o SomeDir/output -fc 1 -ic 2

I get the same "No FileSystem for scheme: hdfs" stack trace quoted in full above.
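The fix Jeff posted at the top of the thread amounts to two property elements in core-site.xml. As a sketch, those elements can be generated with Python's standard-library XML module (the helper name is invented; the property names and values come from the thread):

```python
import xml.etree.ElementTree as ET

def fs_property(name, value, description):
    """Build one Hadoop <property> element for core-site.xml."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    ET.SubElement(prop, "description").text = description
    return prop

props = [
    fs_property("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem",
                "The FileSystem for file: uris."),
    fs_property("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem",
                "The FileSystem for hdfs: uris."),
]
xml = b"".join(ET.tostring(p) for p in props)
assert b"<name>fs.hdfs.impl</name>" in xml
assert b"org.apache.hadoop.hdfs.DistributedFileSystem" in xml
```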
spark-item-similarity incremental update
Hi,

Does anybody have an idea of how to do an incremental update for item similarity? I mean, how can I apply the latest user action data, for example today's data? Do I have to run it again over the entire dataset?

Thanks,
Kevin
Re: spark-item-similarity incremental update
The latest user actions work just fine as the query against the last time you ran spark-itemsimilarity. Go to the demo site https://guide.finderbots.com and run through the "trainer": the things you pick are instantly used to make recs, and spark-itemsimilarity was not re-run. The only times you really have to re-run it are:

1) You have new items with interactions. You can only recommend what you trained with.
2) You have enough new user data to significantly change the model.

There is no incremental way to update the model (yet), but it can be rerun in a few minutes, and as I said, you get recs with realtime user history, even for new users not in the training data.

On Mar 10, 2015, at 3:07 PM, Kevin Zhang zhangyongji...@yahoo.com.INVALID wrote:

Hi, does anybody have an idea of how to do an incremental update for item similarity? I mean, how can I apply the latest user action data, for example today's data? Do I have to run it again over the entire dataset? Thanks, Kevin
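Pat's point, that recs come from querying a fixed model with the user's realtime history, so fresh actions work without retraining, can be sketched with a toy stand-in for the item-item similarity output. The model contents and item names below are made up for illustration:

```python
# Illustrative stand-in for the item-item similarity model that
# spark-itemsimilarity produces; the data here is invented.
MODEL = {
    "item_a": ["item_b", "item_c"],
    "item_b": ["item_a"],
    "item_d": ["item_c"],
}

def recommend(recent_history, model=MODEL):
    """Query a *fixed* model with realtime history: new actions (and new
    users, as long as they have some history) work without re-running
    the trainer."""
    seen = set(recent_history)
    recs = []
    for item in recent_history:
        for similar in model.get(item, []):
            if similar not in seen and similar not in recs:
                recs.append(similar)
    return recs

assert recommend(["item_a"]) == ["item_b", "item_c"]
assert recommend(["item_a", "item_b"]) == ["item_c"]
# A brand-new item ("item_z") yields nothing until the model is re-run,
# matching case 1) above:
assert recommend(["item_z"]) == []
```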
Re: spark-item-similarity incremental update
Just to be clear, #1 was about new items, not users. New users will work as long as you have history for them.

On Mar 10, 2015, at 3:34 PM, Kevin Zhang zhangyongji...@yahoo.com.INVALID wrote:

I see. Thank you, Pat.

On Tuesday, March 10, 2015 3:17 PM, Pat Ferrel p...@occamsmachete.com wrote:

[earlier messages in this thread, quoted in full above]
Re: spark-item-similarity incremental update
I see. Thank you, Pat.

On Tuesday, March 10, 2015 3:17 PM, Pat Ferrel p...@occamsmachete.com wrote:

[earlier messages in this thread, quoted in full above]