I gave you the wrong schema entries in the advice about queries. Check the Solr
documentation, which always trumps my guesses.
To use space-delimited token fields, do the following (note the tokenizer must
sit inside an <analyzer> element):
<fieldType name="indicator" class="solr.TextField" omitNorms="false">
  <analyzer>
    <!-- This simple tokenizer will split the text by spaces (and other
         punctuation) to separate item-id tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="purchase" stored="true" type="indicator" multiValued="false"
  indexed="true"/>
> On Apr 16, 2015, at 9:15 AM, Pat Ferrel <[email protected]> wrote:
>
> OK, this is cool. Almost there!
>
> In order to answer the question you have to decide how you will persist the
> indicators. The output of spark-itemsimilarity can be indexed directly, but
> that requires phrase searches on text fields, and it requires the item-id
> strings to be tokenizable so they aren't broken up by the analyzer used for
> Solr phrase queries. The better way is to store the indicators as
> multi-valued fields in Solr or a DB. There are many ways to slice this.
>
> If you want to index the output of Mahout directly, we will assume the item
> IDs are tokenized and so contain no spaces, periods, commas, or other
> punctuation that would break a phrase, so we can encode the user history as
> a single string of space-separated item tokens.
>
> To do a query with something like "ipad iphone" we'll need to set up Solr
> like this, which is a bit of a guess since I use a DB, not the raw output
> files:
>
> <fieldType name="indicator" class="solr.TextField" omitNorms="false"/>
> <!-- NOTICE: NO TOKENIZER OR ANALYZER USED -->
> <field name="purchase" stored="true" type="indicator" multiValued="true"
>   indexed="true"/>
My bad: this is for multi-valued fields; see above for space-delimited token
fields. I believe the above should also use class="solr.StrField" (Solr's
string type) rather than a text type.
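A corrected multi-valued version would then look something like this (a
sketch, untested; verify the class name against the Solr docs):

<fieldType name="indicator" class="solr.StrField" omitNorms="false"/>
<field name="purchase" stored="true" type="indicator" multiValued="true"
  indexed="true"/>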
>
> "OR" is the default query operator, so unless you've messed with that it
> should be fine. You need that because if you add multiple fields in the
> future you want them ORed, as well as the terms. The query would be
> something like:
>
> q=purchase:("iphone ipad")
>
> So you are applying the items the user has purchased only to the purchase
> indicator field. As you add cross-cooccurrence actions the fieldType will
> stay the same and you will add a field for "view".
>
> q=purchase:("iphone ipad") view:("history of items viewed")
> And so on.
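> A concrete sketch of that query over HTTP (hypothetical collection name
> "recs" and made-up item tokens; quotes and spaces are URL-encoded):
>
> curl 'http://localhost:8983/solr/recs/select?q=purchase:(%22iphone%20ipad%22)%20view:(%22nexus%20galaxy%22)&fl=id,score&rows=10'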
>
> You'll also need to index the output in Solr as CSV files with no header.
> The output is tab-delimited by default, so more correctly a TSV.
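> Solr's CSV update handler can ingest that tab-separated output directly; a
> sketch, assuming a collection named "recs", columns id and purchase, and one
> part file from the job (separator=%09 selects the tab character):
>
> curl 'http://localhost:8983/solr/recs/update/csv?commit=true&separator=%09&fieldnames=id,purchase&header=false' \
>   --data-binary @part-00000 -H 'Content-type:text/plain; charset=utf-8'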
>
> This can be set up to use multi-valued fields, but you'd have to store the
> output of spark-itemsimilarity in Solr or a DB. I'd actually recommend this
> for several reasons, including that it is faster than HDFS, but it requires
> you to write storage code and customize the Solr config differently.
>
> other answers below:
>
>
>> On Apr 16, 2015, at 7:35 AM, Pasmanik, Paul <[email protected]>
>> wrote:
>>
>> Thanks, Pat.
>> I have a question regarding the search on the multi-valued field.
>> So, once I have indexed the results of spark-itemsimilarity for the
>> purchase field as a multi-valued field in Solr, what kind of search do I
>> perform using the user's purchase history? Is it a phrase search (purchased
>> items separated by spaces, something like purchase:"iphone ipad", with or
>> without a high slop value), an OR query using each purchased item
>> (purchase:("iphone" OR "ipad")), or something totally different?
>>
>> My understanding is that if I have a document with purchase field that has
>> values: 1,2,3,4,5 and another document that has values 3,4,5 and my purchase
>> history has 1,2,4 then the first document should rank higher.
>
> Yes. The longer answer is that Solr with omitNorms="false" will TF-IDF
> weight terms (in this case individual indicators). So very frequently
> preferred items will be down-weighted, on the theory that because you and I
> both like "motherhood and apple pie", it doesn't say much about our
> similarity of taste: everyone likes those. So actual results will depend on
> the frequency of item preferences in the corpus. This down-weighting is a
> good thing, since otherwise the popular things would always be recommended.
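> For reference, the classic Lucene/Solr similarity computes the weight of a
> term roughly as
>
>   idf(t) = 1 + ln(numDocs / (docFreq(t) + 1))
>
> so an indicator that appears in many item documents (high docFreq) gets a
> low weight and a rare one gets a high weight. The exact formula depends on
> the Similarity class Solr is configured with.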
>
>> Thanks.
>>
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:[email protected]]
>> Sent: Tuesday, April 07, 2015 7:15 PM
>> To: [email protected]; Pasmanik, Paul
>> Subject: Re: mahout 1.0 on EMR with spark item-similarity
>>
>> We are working on a release, which will be 0.10.0, so give it a try if you
>> can. It fixes a problem you may encounter with an out-of-range index in a
>> vector, though you may not see it.
>>
>> 1) The search engine must be able to take one query with multiple fields and
>> apply each field in the query to separate fields in the index. Solr and ES
>> work, not sure about Amazon.
>> 2) Config and experiment seem good.
>> 3) It is good practice to save your interactions in something like a DB so
>> they can be replayed to create indicators if needed and to maintain a time
>> range of data. I use a key-value store like the search engine itself or a
>> NoSQL DB. The value is that you can treat the collection as an item catalog
>> and so put metadata in columns or doc fields. This metadata can be used to
>> match context later, so if you are on a "men's clothes" page you may want
>> "people who bought this also bought these" but biased or filtered by the
>> "men's clothes" category, as sketched below.
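>> For example, a hedged sketch of that contextual filtering in Solr (the
>> "category" field is hypothetical):
>>
>> q=purchase:("iphone ipad")&fq=category:"mens-clothes"
>>
>> Use a boost query (bq, with the dismax or edismax parser) instead of fq if
>> you want biasing rather than hard filtering.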
>> 4) Tags can be used for CF or for content-based recs, and CF would
>> generally be better. In the case you ask about, the query is the items the
>> user favored, since spark-rowsimilarity will produce similar items (similar
>> in their tags, not in the users who preferred them). So the query is items;
>> see the sketch below. Extend this and text tokens (bag-of-words) can be
>> used to calculate content-based indicators and recs that are personalized,
>> not just "more items like this". But again, CF data would be preferable if
>> available.
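>> In other words, the content-based indicator field is still queried with the
>> item IDs the user preferred, something like (made-up field name and tokens):
>>
>> q=purchase:("iphone ipad") tag_indicators:("iphone ipad")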
>>
>> As with your cluster investment, I'd start small with clear usage-based
>> indicators and build up as needed based on your application data.
>>
>> Let us know how it goes
>>
>>
>> On Apr 7, 2015, at 7:01 AM, Pasmanik, Paul <[email protected]>
>> wrote:
>>
>> Thanks, Pat.
>> We are only running EMR cluster with 1 master and 1 core node right now and
>> were using EMR AMI 3.2.3 which has Hadoop 2.4.0. We are using default
>> configuration for spark (using aws script for spark) which I believe sets
>> number of instances to 2. Spark version 1.1.0h
>> (https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md)
>>
>> We are not in production yet as we are experimenting right now.
>>
>> I have a question about the choice of the search engine to do
>> recommendations.
>> I know the Practical Machine Learning book and mahout docs talk about Solr.
>> Do you see any issues with using Elastic Search or AWS Cloud Search?
>> Also, looking at the content-based indicator example on the
>> intro-cooccurrence-spark mahout page, I see that the spark-rowsimilarity
>> job is used to produce an item-id-to-items matrix, but then it says to use
>> tags associated with purchases in the query for tags, like this:
>> Query:
>> field: purchase; q:user's-purchase-history
>> field: view; q:user's view-history
>> field: tags; q:user's-tags-associated-with-purchases
>>
>> So, we are not providing the actual tags in the tags field query, are we?
>>
>> Thanks
>>
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:[email protected]]
>> Sent: Monday, April 06, 2015 2:33 PM
>> To: [email protected]
>> Subject: Re: mahout 1.0 on EMR with spark item-similarity
>>
>> OK, this seems fine. So you used "-ma yarn-client"; I've verified that this
>> works in other cases.
>>
>> BTW we are nearing a new release. It fixes one cooccurrence problem that you
>> may run into if you are using lots of small files to contain the initial
>> interaction input. This happens often when using Spark Streaming for input.
>>
>> If you want to try the source on github make sure to compile with
>> -DskipTests since there is a failing test unrelated to the Spark code. Be
>> aware that jar names have changed if that matters.
>>
>> Can you report the cluster version of Spark and Hadoop as well as how many
>> nodes?
>>
>> Thanks
>>
>>
>> On Apr 6, 2015, at 11:19 AM, Pasmanik, Paul <[email protected]>
>> wrote:
>>
>> Pat, I was not using spark-submit script. I am using mahout
>> spark-itemsimilarity exactly how it is specified in
>> http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
>>
>> So, what I did is I created a bootstrap action that installs spark and
>> mahout on EMR cluster. Then, I used AWS Java APIs to create an EMR job step
>> which can call a script (amazon provides scriptRunner that can run any
>> script). So, I basically create a command (mahout spark-itemsimilarity
>> <parameters>) and pass it to the script runner, which runs it. One of the
>> parameters is -ma, so I pass in yarn-client.
>>
>> We use the AWS Java API to programmatically start an EMR cluster (triggered
>> by a Quartz job) with whatever parameters that job needs.
>> I used instructions in here:
>> https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to
>> install spark as a bootstrap action. I built mahout-1.0 locally and
>> uploaded a package to S3. I also created a bash script to copy that package
>> from S3 to EMR, unpack it, and remove the mahout 0.9 version that is part
>> of the EMR AMI. Then I used another bootstrap action to invoke that script
>> and install mahout. I also had to make changes to the mahout script: added
>> SPARK_HOME=/home/hadoop/spark (this is where I installed spark on EMR), and
>> modified CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR to
>> CLASSPATH=$MAHOUT_CONF_DIR to avoid including the classpath passed in by
>> amazon's script-runner, since it contains a path to the 2.11 version of
>> scala (installed on EMR by Amazon) that conflicts with the 2.10.x version
>> spark/mahout use. A sketch of those changes follows.
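>> A sketch of those two edits to the mahout launch script (paths exactly as
>> described above):
>>
>> # added near the top of the mahout script
>> export SPARK_HOME=/home/hadoop/spark
>>
>> # was: CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR
>> CLASSPATH=$MAHOUT_CONF_DIR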
>>
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:[email protected]]
>> Sent: Thursday, March 26, 2015 3:49 PM
>> To: [email protected]
>> Subject: Re: mahout 1.0 on EMR with spark item-similarity
>>
>> Finally getting to Yarn. Paul, were you trying to run spark-itemsimilarity
>> with the spark-submit script? That shouldn't work: the job is a standalone
>> app and does not require spark-submit, nor is it likely to work with it.
>>
>> Were you able to run on Yarn? How?
>>
>> On Jan 29, 2015, at 9:15 AM, Pat Ferrel <[email protected]> wrote:
>>
>> There are two indices (guava HashBiMaps) that map your IDs into and out of
>> Mahout's IDs (HashBiMap<int, string>). There is one copy of each (row/user
>> IDs and column/item IDs) per physical machine, which all local tasks
>> consult. They are Spark broadcast values. These will grow linearly as the
>> number of items and users grows, and as the size of your IDs, treated as
>> strings, grows. The hashmaps have some overhead, but in large collections
>> the main cost is the size of the application IDs stored as strings;
>> Mahout's IDs are ints.
>>
>> On Jan 22, 2015, at 8:04 AM, Pasmanik, Paul <[email protected]>
>> wrote:
>>
>> I was able to get spark and mahout installed on EMR cluster as bootstrap
>> actions and was able to run spark-itemsimilarity job via an EMR step with
>> some modifications to mahout script (defining SPARK_HOME and making sure
>> CLASSPATH is not picked up from the invoking script which is amazon's
>> script-runner).
>>
>> I was only able to run this job using yarn-client (yarn-cluster mode was
>> not able to submit to the resource manager).
>>
>> In yarn-client mode the driver program runs in the client process and
>> submits jobs to executors via yarn manager, so my question is how much
>> memory does this driver need?
>> Will the memory requirement vary based on the size of the input to
>> spark-itemsimilarity?
>>
>> Thanks.
>>
>>
>> -----Original Message-----
>> From: Pasmanik, Paul [mailto:[email protected]]
>> Sent: Thursday, January 15, 2015 12:46 PM
>> To: [email protected]
>> Subject: mahout 1.0 on EMR with spark
>>
>> Has anyone tried running mahout 1.0 on EMR with Spark?
>> I've used the instructions at
>> https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to get
>> an EMR cluster running spark. I am now able to deploy an EMR cluster with
>> Spark using the AWS Java APIs.
>> EMR allows running a custom script as a bootstrap action, which I can use
>> to install mahout.
>> What I am trying to figure out is whether I would need to build mahout
>> every time I start an EMR cluster, or whether I can use pre-built artifacts
>> and a script similar to what awslabs uses to install spark.
>>
>> Thanks.