I gave you the wrong schema entries in the advice about queries. Check the Solr
documentation, which always trumps my guesses.
To use space-delimited token fields, do the following (note the tokenizer must
sit inside an <analyzer> element):
<fieldType name="indicator" class="solr.TextField" omitNorms="false">
  <analyzer>
    <!-- This simple tokenizer will split the text by spaces (and other
         punctuation) to separate item-id tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="purchase" stored="true" type="indicator" multiValued="false"
  indexed="true"/>
> On Apr 16, 2015, at 9:15 AM, Pat Ferrel <[email protected]> wrote:
>
> OK, this is cool. Almost there!
>
> In order to answer the question you have to decide how you will persist the
> indicators. The output of spark-itemsimilarity can be indexed directly, but
> that requires phrase searches on text fields, and it requires the item-id
> strings to be tokenizable so they aren't broken up by the analyzer used for
> Solr phrase queries. The better way is to store the indicators as
> multi-valued fields in Solr or a DB. There are many ways to slice this.
>
> If you want to index the output of Mahout directly, we will assume the item
> IDs are tokenized and so contain no spaces, periods, commas, or other
> punctuation that would break a phrase, so we can encode the user history as
> a single string of space-separated item tokens.
>
> To do a query with something like "ipad iphone" we'll need to set up Solr
> like this, which is a bit of a guess since I use a DB, not the raw output
> files:
>
> <fieldType name="indicator" class="solr.TextField" omitNorms="false"/>
> <!-- NOTICE: NO TOKENIZER OR ANALYZER USED -->
> <field name="purchase" stored="true" type="indicator" multiValued="true"
>   indexed="true"/>
My bad: this is for multi-valued fields; see above for space-delimited token
fields. I believe the above should also use class="solr.StrField" (Solr's
string type) rather than a text type.
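A corrected multi-valued version would then look something like this (a
sketch, untested; verify the class name against the Solr docs):

<fieldType name="indicator" class="solr.StrField" omitNorms="false"/>
<field name="purchase" stored="true" type="indicator" multiValued="true"
  indexed="true"/>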
>
> "OR" is the default query operator, so unless you've messed with that it
> should be fine. You need that because if you add multiple fields in the
> future you want them ORed, as well as the terms. The query would be
> something like:
>
> q=purchase:("iphone ipad")
>
> So you are applying the items the user has purchased only to the purchase
> indicator field. As you add cross-cooccurrence actions the fieldType will
> stay the same and you will add a field for "view".
>
> q=purchase:("iphone ipad") view:("history of items viewed")
> And so on.
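> A concrete sketch of that query over HTTP (hypothetical collection name
> "recs" and made-up item tokens; quotes and spaces are URL-encoded):
>
> curl 'http://localhost:8983/solr/recs/select?q=purchase:(%22iphone%20ipad%22)%20view:(%22nexus%20galaxy%22)&fl=id,score&rows=10'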
>
> You'll also need to index the output in Solr as CSV files with no header.
> The output is tab-delimited by default, so more correctly a TSV.
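> Solr's CSV update handler can ingest that tab-separated output directly; a
> sketch, assuming a collection named "recs", columns id and purchase, and one
> part file from the job (separator=%09 selects the tab character):
>
> curl 'http://localhost:8983/solr/recs/update/csv?commit=true&separator=%09&fieldnames=id,purchase&header=false' \
>   --data-binary @part-00000 -H 'Content-type:text/plain; charset=utf-8'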
>
> This can be set up to use multi-valued fields, but you'd have to store the
> output of spark-itemsimilarity in Solr or a DB. I'd actually recommend this
> for several reasons, including that it is faster than HDFS, but it requires
> you to write storage code and customize the Solr config differently.
>
> other answers below:
>
>
>> On Apr 16, 2015, at 7:35 AM, Pasmanik, Paul <[email protected]>
>> wrote:
>>
>> Thanks, Pat.
>> I have a question regarding the search on the multi-valued field.
>> So, once I have indexed the results of spark-itemsimilarity for the
>> purchase field as a multi-valued field in Solr, what kind of search do I
>> perform using the user's purchase history? Is it a phrase search (purchased
>> items separated by spaces, something like purchase:"iphone ipad", with or
>> without a high slop value), an OR query using each purchased item
>> (purchase:("iphone" OR "ipad")), or something totally different?
>>
>> My understanding is that if I have a document with purchase field that has
>> values: 1,2,3,4,5 and another document that has values 3,4,5 and my purchase
>> history has 1,2,4 then the first document should rank higher.
>
> Yes. The longer answer is that Solr with omitNorms="false" will TF-IDF
> weight terms (in this case individual indicators). So very frequently
> preferred items will be down-weighted, on the theory that because you and I
> both like "motherhood and apple pie", it doesn't say much about our
> similarity of taste: everyone likes those. So actual results will depend on
> the frequency of item preferences in the corpus. This down-weighting is a
> good thing, since otherwise the popular things would always be recommended.
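> For reference, the classic Lucene/Solr similarity computes the weight of a
> term roughly as
>
>   idf(t) = 1 + ln(numDocs / (docFreq(t) + 1))
>
> so an indicator that appears in many item documents (high docFreq) gets a
> low weight and a rare one gets a high weight. The exact formula depends on
> the Similarity class Solr is configured with.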
>
>> Thanks.
>>
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:[email protected]]
>> Sent: Tuesday, April 07, 2015 7:15 PM
>> To: [email protected]; Pasmanik, Paul
>> Subject: Re: mahout 1.0 on EMR with spark item-similarity
>>
>> We are working on a release, which will be 0.10.0, so give it a try if you
>> can. It fixes a problem you may encounter with an out-of-range index in a
>> vector, though you may not see it.
>>
>> 1) The search engine must be able to take one query with multiple fields and
>> apply each field in the query to separate fields in the index. Solr and ES
>> work, not sure about Amazon.
>> 2) Config and experiment seem good.
>> 3) It is good practice to save your interactions in something like a DB so
>> they can be replayed to create indicators if needed and to maintain a time
>> range of data. I use a key-value store like the search engine itself or a
>> NoSQL DB. The value is that you can treat the collection as an item catalog
>> and so put metadata in columns or doc fields. This metadata can be used to
>> match context later, so if you are on a "men's clothes" page you may want
>> "people who bought this also bought these" but biased or filtered by the
>> "men's clothes" category, as sketched below.
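>> For example, a hedged sketch of that contextual filtering in Solr (the
>> "category" field is hypothetical):
>>
>> q=purchase:("iphone ipad")&fq=category:"mens-clothes"
>>
>> Use a boost query (bq, with the dismax or edismax parser) instead of fq if
>> you want biasing rather than hard filtering.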
>> 4) Tags can be used for CF or for content-based recs, and CF would
>> generally be better. In the case you ask about, the query is the items the
>> user favored, since spark-rowsimilarity will produce similar items (similar
>> in their tags, not in the users who preferred them). So the query is items;
>> see the sketch below. Extend this and text tokens (bag-of-words) can be
>> used to calculate content-based indicators and recs that are personalized,
>> not just "more items like this". But again, CF data would be preferable if
>> available.
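>> In other words, the content-based indicator field is still queried with the
>> item IDs the user preferred, something like (made-up field name and tokens):
>>
>> q=purchase:("iphone ipad") tag_indicators:("iphone ipad")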
>>
>> As with your cluster investment, I'd start small with clear usage-based
>> indicators and build up as needed based on your application data.
>>
>> Let us know how it goes
>>
>>
>> On Apr 7, 2015, at 7:01 AM, Pasmanik, Paul <[email protected]>
>> wrote:
>>
>> Thanks, Pat.
>> We are only running EMR cluster with 1 master and 1 core node right now and
>> were using EMR AMI 3.2.3 which has Hadoop 2.4.0. We are using default
>> configuration for spark (using aws script for spark) which I believe sets
>> number of instances to 2. Spark version 1.1.0h
>> (https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md)
>>
>> We are not in production yet as we are experimenting right now.
>>
>> I have a question about the choice of the search engine to do
>> recommendations.
>> I know the Practical Machine Learning book and mahout docs talk about Solr.
>> Do you see any issues with using Elastic Search or AWS Cloud Search?
>> Also, looking at the content-based indicator example on the
>> intro-cooccurrence-spark mahout page, I see that the spark-rowsimilarity
>> job is used to produce an item-id-to-items matrix, but then it says to use
>> tags associated with purchases in the query for tags, like this:
>> Query:
>> field: purchase; q:user's-purchase-history
>> field: view; q:user's view-history
>> field: tags; q:user's-tags-associated-with-purchases
>>
>> So, we are not providing the actual tags in the tags field query, are we?
>>
>> Thanks
>>
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:[email protected]]
>> Sent: Monday, April 06, 2015 2:33 PM
>> To: [email protected]
>> Subject: Re: mahout 1.0 on EMR with spark item-similarity
>>
>> OK, this seems fine. So you used "-ma yarn-client"; I've verified that this
>> works in other cases.
>>
>> BTW we are nearing a new release. It fixes one cooccurrence problem that you
>> may run into if you are using lots of small files to contain the initial
>> interaction input. This happens often when using Spark Streaming for input.
>>
>> If you want to try the source on github make sure to compile with
>> -DskipTests since there is a failing test unrelated to the Spark code. Be
>> aware that jar names have changed if that matters.
>>
>> Can you report the cluster version of Spark and Hadoop as well as how many
>> nodes?
>>
>> Thanks
>>
>>
>> On Apr 6, 2015, at 11:19 AM, Pasmanik, Paul <[email protected]>
>> wrote:
>>
>> Pat, I was not using spark-submit script. I am using mahout
>> spark-itemsimilarity exactly how it is specified in
>> http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
>>
>> So, what I did is I created a bootstrap action that installs spark and
>> mahout on EMR cluster. Then, I used AWS Java APIs to create an EMR job step
>> which can call a script (amazon provides scriptRunner that can run any
>> script). So, I basically create a command (mahout spark-itemsimilarity
>> <parameters>) and pass it to the script runner, which runs it. One of the
>> parameters is -ma, so I pass in yarn-client.
>>
>> We use the AWS Java API to programmatically start an EMR cluster (triggered
>> by a Quartz job) with whatever parameters that job needs.
>> I used instructions in here:
>> https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to
>> install spark as a bootstrap action. I built mahout-1.0 locally and
>> uploaded a package to S3. I also created a bash script to copy that package
>> from S3 to EMR, unpack it, and remove the mahout 0.9 version that is part
>> of the EMR AMI. Then I used another bootstrap action to invoke that script
>> and install mahout. I also had to make changes to the mahout script: added
>> SPARK_HOME=/home/hadoop/spark (this is where I installed spark on EMR), and
>> modified CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR to
>> CLASSPATH=$MAHOUT_CONF_DIR to avoid including the classpath passed in by
>> amazon's script-runner, since it contains a path to the 2.11 version of
>> scala (installed on EMR by Amazon) that conflicts with the 2.10.x version
>> spark/mahout use. A sketch of those changes follows.
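>> A sketch of those two edits to the mahout launch script (paths exactly as
>> described above):
>>
>> # added near the top of the mahout script
>> export SPARK_HOME=/home/hadoop/spark
>>
>> # was: CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR
>> CLASSPATH=$MAHOUT_CONF_DIR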
>>
>>
>> -----Original Message-----
>> From: Pat Ferrel [mailto:[email protected]]
>> Sent: Thursday, March 26, 2015 3:49 PM
>> To: [email protected]
>> Subject: Re: mahout 1.0 on EMR with spark item-similarity
>>
>> Finally getting to Yarn. Paul, were you trying to run spark-itemsimilarity
>> with the spark-submit script? That shouldn't work: the job is a standalone
>> app and does not require spark-submit, nor is it likely to work with it.
>>
>> Were you able to run on Yarn? How?
>>
>> On Jan 29, 2015, at 9:15 AM, Pat Ferrel <[email protected]> wrote:
>>
>> There are two indices (guava HashBiMaps) that map your IDs into and out of
>> Mahout's IDs (HashBiMap<int, string>). There is one copy of each (row/user
>> IDs and column/item IDs) per physical machine, which all local tasks
>> consult. They are Spark broadcast values. These will grow linearly as the
>> number of items and users grows, and as the size of your IDs, treated as
>> strings, grows. The hashmaps have some overhead, but in large collections
>> the main cost is the size of the application IDs stored as strings;
>> Mahout's IDs are ints.
>>
>> On Jan 22, 2015, at 8:04 AM, Pasmanik, Paul <[email protected]>
>> wrote:
>>
>> I was able to get spark and mahout installed on EMR cluster as bootstrap
>> actions and was able to run spark-itemsimilarity job via an EMR step with
>> some modifications to mahout script (defining SPARK_HOME and making sure
>> CLASSPATH is not picked up from the invoking script which is amazon's
>> script-runner).
>>
>> I was only able to run this job using yarn-client (yarn-cluster mode was
>> not able to submit to the resource manager).
>>
>> In yarn-client mode the driver program runs in the client process and
>> submits jobs to executors via yarn manager, so my question is how much
>> memory does this driver need?
>> Will the memory requirement vary based on the size of the input to
>> spark-itemsimilarity?
>>
>> Thanks.
>>
>>
>> -----Original Message-----
>> From: Pasmanik, Paul [mailto:[email protected]]
>> Sent: Thursday, January 15, 2015 12:46 PM
>> To: [email protected]
>> Subject: mahout 1.0 on EMR with spark
>>
>> Has anyone tried running mahout 1.0 on EMR with Spark?
>> I've used the instructions at
>> https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to get
>> an EMR cluster running spark. I am now able to deploy an EMR cluster with
>> Spark using the AWS Java APIs.
>> EMR allows running a custom script as a bootstrap action, which I can use
>> to install mahout.
>> What I am trying to figure out is whether I would need to build mahout
>> every time I start an EMR cluster, or whether I can use pre-built artifacts
>> and a script similar to what awslabs uses to install spark.
>>
>> Thanks.