Re: mahout 1.0 on EMR with spark item-similarity

Pat Ferrel Tue, 07 Apr 2015 16:17:38 -0700

We are working on a release, which will be 0.10.0, so give it a try if you can. 
It fixes one problem that you may encounter with an out-of-range index in a 
vector. You may not see it.


1) The search engine must be able to take one query with multiple fields and 
apply each field in the query to separate fields in the index. Solr and ES 
work, not sure about Amazon.
2) Config and experiment seem good.
3) It is good practice to save your interactions in something like a DB so they 
can be replayed to create indicators if needed and to maintain a time range of 
data. I use a key-value store like the search engine itself or a NoSQL DB. The 
value is that you can look at the collection as an item catalog and so put 
metadata in columns or doc fields. This metadata can be used to match context 
later, so if you are on a “men’s clothes” page you may want “people who bought 
this also bought these” but biased or filtered by the “men’s clothes” category.
4) Tags can be used for CF or for content-based recs, and CF would generally be 
better. In the case you ask about, the query is the items the user favored, 
since spark-rowsimilarity will produce similar items (similar in their tags, 
not in the users who preferred them). So the query is items. Extend this and 
text tokens (bag-of-words) can be used to calculate content-based indicators 
and recs that are personalized, not just “more items like this”. But again, CF 
data would be preferable if available.
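
A minimal sketch of the multi-field query from point 1, assuming an 
Elasticsearch-style bool query built as a plain Python dict. The field names 
(purchase, view, tags) are the ones used in the Mahout docs example; the item 
IDs and the helper name build_recs_query are hypothetical. Following point 4, 
the tags field is queried with item IDs rather than raw tags.

```python
import json

def build_recs_query(history_by_field, size=10):
    """Build one bool query that applies each part of the user's history
    to its own indicator field in the index (hypothetical helper)."""
    clauses = []
    for field, item_ids in history_by_field.items():
        if item_ids:  # skip fields for which the user has no history
            clauses.append({"match": {field: " ".join(item_ids)}})
    # "should" clauses are OR-ed; documents matching more fields score higher.
    return {"size": size, "query": {"bool": {"should": clauses}}}

# Hypothetical item IDs taken from one user's history.
query = build_recs_query({
    "purchase": ["ipad", "nexus"],
    "view": ["galaxy"],
    "tags": ["ipad", "nexus"],  # items favored, matched against tag-similar items
})
print(json.dumps(query, indent=2))
```

The resulting dict would be sent as the body of a search request. A Solr 
version would express the same idea with one clause per field in its own 
query syntax.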

As with your cluster investment, I’d start small with clear usage-based 
indicators and build up as needed based on your application data.

Let us know how it goes.


On Apr 7, 2015, at 7:01 AM, Pasmanik, Paul <[email protected]> wrote:

Thanks, Pat.
We are only running an EMR cluster with 1 master and 1 core node right now and 
were using EMR AMI 3.2.3, which has Hadoop 2.4.0. We are using the default 
configuration for Spark (using the AWS script for Spark), which I believe sets 
the number of instances to 2. Spark version 1.1.0h 
(https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md)
 
We are not in production yet as we are experimenting right now.   

I have a question about the choice of the search engine to do recommendations.
I know the Practical Machine Learning book and the Mahout docs talk about Solr. 
Do you see any issues with using Elasticsearch or Amazon CloudSearch? 
Also, looking at the content-based indicator example on the 
intro-cooccurrence-spark Mahout page, I see that the spark-rowsimilarity job is 
used to produce an item ID to items matrix, but then it says to use tags 
associated with purchases in the query for tags, like this:
Query:
 field: purchase; q:user's-purchase-history
 field: view; q:user's view-history
 field: tags; q:user's-tags-associated-with-purchases

So, we are not providing the actual tags in the tags field query, are we?

Thanks


-----Original Message-----
From: Pat Ferrel [mailto:[email protected]] 
Sent: Monday, April 06, 2015 2:33 PM
To: [email protected]
Subject: Re: mahout 1.0 on EMR with spark item-similarity

OK, this seems fine. So you used "-ma yarn-client"; I’ve verified that this 
works in other cases.

BTW we are nearing a new release. It fixes one cooccurrence problem that you 
may run into if you are using lots of small files to contain the initial 
interaction input. This happens often when using Spark Streaming for input.

If you want to try the source on github make sure to compile with -DskipTests 
since there is a failing test unrelated to the Spark code. Be aware that jar 
names have changed if that matters.

Can you report the cluster version of Spark and Hadoop as well as how many 
nodes?

Thanks


On Apr 6, 2015, at 11:19 AM, Pasmanik, Paul <[email protected]> wrote:

Pat, I was not using spark-submit script.  I am using mahout 
spark-itemsimilarity exactly how it is specified in 
http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

So, what I did is I created a bootstrap action that installs Spark and Mahout 
on the EMR cluster. Then, I used the AWS Java APIs to create an EMR job step 
which can call a script (Amazon provides script-runner, which can run any 
script). So, I basically create a command (mahout spark-itemsimilarity 
<parameters>) and pass it to script-runner, which runs it. One of the 
parameters is -ma, so I pass in yarn-client. 

We use the AWS Java API to programmatically start the EMR cluster (triggered by 
a Quartz job) with whatever parameters that job needs.
I used the instructions here: 
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to install 
Spark as a bootstrap action. I built mahout-1.0 locally and uploaded a package 
to S3. I also created a bash script to copy that package from S3 to EMR, 
unpack it, and remove the Mahout 0.9 version that is part of the EMR AMI. Then 
I used another bootstrap action to invoke that script and install Mahout. I 
also had to make changes to the mahout script: I added 
SPARK_HOME=/home/hadoop/spark (this is where I installed Spark on EMR) and 
changed CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR to CLASSPATH=$MAHOUT_CONF_DIR 
to avoid including the classpath passed in by Amazon's script-runner, since it 
contains the path to the 2.11 version of Scala (installed on EMR by Amazon), 
which conflicts with the 2.10.x version used by Spark/Mahout.
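
A rough sketch of how the command and environment described above might be 
assembled before handing them to a script, assuming the documented 
spark-itemsimilarity options -i (input), -o (output) and -ma (master). The 
input and output paths and the helper name itemsimilarity_step are 
hypothetical. Dropping CLASSPATH from the environment is an alternative to 
editing the mahout script and has not been verified on EMR.

```python
import os
import shlex

def itemsimilarity_step(input_path, output_path, master="yarn-client",
                        spark_home="/home/hadoop/spark"):
    """Return (argv, env) for launching spark-itemsimilarity (hypothetical helper)."""
    argv = ["mahout", "spark-itemsimilarity",
            "-i", input_path, "-o", output_path, "-ma", master]
    env = dict(os.environ)          # copy, so the caller's environment is untouched
    env["SPARK_HOME"] = spark_home  # where Spark was installed by the bootstrap action
    env.pop("CLASSPATH", None)      # assumption: keeps the Scala 2.11 jars off the classpath
    return argv, env

# Hypothetical paths; the command line is printed rather than executed.
argv, env = itemsimilarity_step("/data/interactions", "/data/indicators")
print(" ".join(shlex.quote(a) for a in argv))
```

The argv list and env dict could then be passed to subprocess.run(argv, 
env=env) on a node where the mahout script is on the PATH.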


-----Original Message-----
From: Pat Ferrel [mailto:[email protected]] 
Sent: Thursday, March 26, 2015 3:49 PM
To: [email protected]
Subject: Re: mahout 1.0 on EMR with spark item-similarity

Finally getting to Yarn. Paul, were you trying to run spark-itemsimilarity with 
the spark-submit script? That shouldn’t work; the job is a standalone app and 
does not require spark-submit, nor is it likely to work with it.

Were you able to run on Yarn? How?

On Jan 29, 2015, at 9:15 AM, Pat Ferrel <[email protected]> wrote:

There are two indices (Guava HashBiMaps) that map your IDs into and out of 
Mahout IDs (HashBiMap<int, string>). There is one copy of each (row/user IDs 
and column/item IDs) per physical machine that all local tasks consult. They 
are Spark broadcast values. These will grow linearly as the number of items and 
users grows and as the size of your IDs, treated as strings, grows. The 
hashmaps have some overhead, but in large collections the main cost is the size 
of the application IDs stored as strings; Mahout’s IDs are ints.
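
A back-of-envelope sketch of that linear growth, assuming 2 bytes per 
character for JVM strings and a guessed per-entry overhead of 100 bytes for 
the map entries and object headers. The overhead figure, the ID counts and the 
function name are assumptions for illustration, not measurements.

```python
def estimate_id_dictionary_bytes(num_ids, avg_id_chars,
                                 per_entry_overhead=100, bytes_per_char=2):
    """Rough size of one ID dictionary (a bidirectional int <-> string map).

    Grows linearly in the number of IDs and in the average ID length.
    per_entry_overhead is a guess, not a measured value.
    """
    return num_ids * (avg_id_chars * bytes_per_char + per_entry_overhead)

# Hypothetical collection: 10M users and 1M items with 20-character IDs.
users = estimate_id_dictionary_bytes(10_000_000, 20)
items = estimate_id_dictionary_bytes(1_000_000, 20)
print(f"approx {(users + items) / 2**20:.0f} MiB for both dictionaries")
```

Since the dictionaries are broadcast values, an estimate like this applies to 
each machine, on top of whatever the job itself needs.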

On Jan 22, 2015, at 8:04 AM, Pasmanik, Paul <[email protected]> wrote:

I was able to get Spark and Mahout installed on an EMR cluster as bootstrap 
actions and was able to run the spark-itemsimilarity job via an EMR step with 
some modifications to the mahout script (defining SPARK_HOME and making sure 
CLASSPATH is not picked up from the invoking script, which is Amazon's 
script-runner).

I was only able to run this job using yarn-client (yarn-master is not able to 
submit to resource manager).  

In yarn-client mode the driver program runs in the client process and submits 
jobs to executors via yarn manager, so my question is how much memory does this 
driver need?
Will the memory requirement vary based on the size of the input to 
spark-itemsimilarity?

Thanks. 


-----Original Message-----
From: Pasmanik, Paul [mailto:[email protected]] 
Sent: Thursday, January 15, 2015 12:46 PM
To: [email protected]
Subject: mahout 1.0 on EMR with spark

Has anyone tried running mahout 1.0 on EMR with Spark?
I've used the instructions at 
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to get an 
EMR cluster running Spark. I am now able to deploy an EMR cluster with Spark 
using the AWS Java APIs.
EMR allows running a custom script as a bootstrap action, which I can use to 
install Mahout.
What I am trying to figure out is whether I would need to build Mahout every 
time I start an EMR cluster, or have pre-built artifacts and develop a script 
similar to what awslabs is using to install Spark?

Thanks.



