A Dropbox link now: https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
And here is the script I use to test different sizes/partitions (example: 10 parts of 10k):

#!/bin/sh
set -e -u

mkdir -p ratings-split
rm -rf ratings-split/part*
hdfs dfs -rm -r ratings-split spark-itemsimilarity temp
cat ~/input_ratings/part* 2>/dev/null | head -n100k | split -l10k -d - "ratings-split/part-"
hdfs dfs -mkdir -p ratings-split
hdfs dfs -copyFromLocal ratings-split
time mahout spark-itemsimilarity --input ratings-split/ --output spark-itemsimilarity \
    --maxSimilaritiesPerItem 10 --master yarn-client |& tee spark-itemsimilarity.out

Thanks!

On Thu, 29 Sep 2016 19:46:03 +0200 Arnau Sanchez <pyar...@gmail.com> wrote:

> Hi Sebastian,
>
> That's weird, it works here. Anyway, a Dropbox link:
>
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
>
> Thanks!
>
> On Thu, 29 Sep 2016 18:50:23 +0200 Sebastian <s...@apache.org> wrote:
>
> > Hi Arnau,
> >
> > The links to your logfiles don't work for me unfortunately. Are you sure
> > you correctly set up Spark? That can be a bit tricky in YARN settings,
> > sometimes one machine idles around...
> >
> > Best,
> > Sebastian
> >
> > On 25.09.2016 18:01, Pat Ferrel wrote:
> > > AWS EMR is usually not very well suited for Spark. Spark gets most of
> > > its speed from in-memory calculations. So to see speed gains you have to
> > > have enough memory. Also partitioning will help in many cases. If you
> > > read in data from a single file, that partitioning will usually follow the
> > > calculation throughout all intermediate steps. If the data is from a
> > > single file the partition may be 1 and therefore it will only use one
> > > machine. The most recent Mahout snapshot (therefore the next release)
> > > allows you to pass in the partitioning for each event pair (this is only
> > > in the library use, not CLI). To get this effect in the current release,
> > > try splitting the input into multiple files.
> > >
> > > I'm probably the one that reported the 10x speed up and used input from
> > > Kafka DStreams, which causes very small default partition sizes. Also
> > > other comparisons for other calculations give a similar speedup result.
> > > There is little question about Spark being much faster when used the way
> > > it is meant to be.
> > >
> > > I use Mahout as a library all the time in the Universal Recommender
> > > implemented in Apache PredictionIO. As a library we get greater control
> > > than the CLI. The CLI is really only a proof of concept, not really meant
> > > for production.
> > >
> > > BTW there is a significant algorithm benefit of the code behind
> > > spark-itemsimilarity that is probably more important than the speed
> > > increase and that is Correlated Cross-Occurrence, which allows the use of
> > > many indicators of user taste, not just the primary/conversion event,
> > > which is all any other CF-style recommender that I know of can use.
> > >
> > >
> > > On Sep 22, 2016, at 1:49 AM, Arnau Sanchez <pyar...@gmail.com> wrote:
> > >
> > > I've been using the Mahout itemsimilarity job for a while, with good
> > > results. I read that the new spark-itemsimilarity job is typically
> > > faster, by a factor of 10, so I wanted to give it a try. I must be doing
> > > something wrong because, with the same EMR infrastructure, the spark job
> > > is slower than the old one (16 min vs 6 min) working on the same data. I
> > > took a small sample dataset (766k rating pairs) to compare numbers; this
> > > is the result:
> > >
> > > Input ratings: http://download.zaudera.com/public/ratings
> > >
> > > Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)
> > >
> > > Old itemsimilarity:
> > >
> > > $ mahout itemsimilarity --input ratings --output itemsimilarity
> > > --booleanData TRUE --maxSimilaritiesPerItem 10 --similarityClassname
> > > SIMILARITY_COOCCURRENCE
> > > [5m54s]
> > >
> > > (logs: http://download.zaudera.com/public/itemsimilarity.out)
> > >
> > > New spark-itemsimilarity:
> > >
> > > $ mahout spark-itemsimilarity --input ratings --output
> > > spark-itemsimilarity --maxSimilaritiesPerItem 10 --master yarn-client
> > > [15m51s]
> > >
> > > (logs: http://download.zaudera.com/public/spark-itemsimilarity.out)
> > >
> > > Any ideas? Thanks!
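
For reference, a minimal sketch of the "split the input into multiple files" suggestion from Pat's reply, so that the number of parts is chosen directly instead of a fixed line count per part. It assumes GNU coreutils (`split -d`, `seq`); the helper name `split_into_parts`, the file names and the part count are hypothetical and not part of Mahout or of the script in this thread. The demo runs on generated data rather than real ratings.

```sh
#!/bin/sh
set -e -u

# split_into_parts INPUT PARTS OUTDIR  (hypothetical helper)
# Writes INPUT as at most PARTS files named OUTDIR/part-NN.
split_into_parts() {
    input=$1
    parts=$2
    outdir=$3

    # Arithmetic expansion strips any padding wc may add around the count.
    total=$(( $(wc -l < "$input") ))
    if [ "$total" -eq 0 ]; then
        echo "empty input: $input" >&2
        return 1
    fi

    # Ceiling division, so no more than $parts files are produced.
    per_part=$(( (total + parts - 1) / parts ))

    mkdir -p "$outdir"
    rm -f "$outdir"/part-*
    # -d (numeric suffixes) is a GNU split option, as used earlier in the thread.
    split -l "$per_part" -d "$input" "$outdir/part-"
}

# Demo on generated data: 1000 lines split into 10 parts.
workdir=$(mktemp -d)
seq 1 1000 > "$workdir/ratings.txt"
split_into_parts "$workdir/ratings.txt" 10 "$workdir/ratings-split"

# Print how many part files were written.
ls "$workdir/ratings-split" | wc -l
```

The resulting directory would then be copied to HDFS and passed as --input, as in the script at the top of this message. Whether more parts actually speed up the job on a given cluster is something to measure, not something this sketch guarantees.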