BTW if you want to try it out quickly, the CLI can be run once for each pair. This
recalculates A'A multiple times but requires less memory per node and no code
changes.

Run it once for every A & B pair, where B is one of the secondary actions.
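For concreteness, a sketch of the per-pair runs (the flag names follow the spark-itemsimilarity driver's documented options, but verify them with `mahout spark-itemsimilarity --help` for your version; all paths here are made up):

```shell
# A = primary action (e.g. purchase); run once per secondary action B
mahout spark-itemsimilarity \
  --input purchase-data/ \
  --input2 view-data/ \
  --output indicators-view/

mahout spark-itemsimilarity \
  --input purchase-data/ \
  --input2 add-to-cart-data/ \
  --output indicators-cart/
```

Each run recomputes A'A for its pair; only the cross-indicator output differs.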


On Feb 3, 2015, at 12:33 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

Yes, multiple cross-cooccurrence inputs are fully supported by the API.

Whether you write your own app/driver or use the shell, you can pass in as many
inputs as you need. The driver CLI is already too complicated.

Passing a script to the shell doesn't require you to create a project, but it
has limited debugging capabilities. The script shouldn't be too complicated,
though.

I’m doing this for a client now. If you want to input tuples <userID, itemID>
from separate files or from directories of part-xxxx files, you can use
TDIndexedDatasetReader#elementReader, which does a parallel read of csv-type
text files and creates an IndexedDataset from each.

Pass these in to SimilarityAnalysis.cooccurrencesIDSs. It takes the primary
action and a list of secondary actions and returns a list of indicators as
IndexedDatasets. You can then use TDIndexedDatasetWriter to do parallel
writes, creating a directory of csv part-xxxxx files for each indicator
matrix.
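The pipeline above can be sketched roughly like this. The identifiers come from the names in this thread (TDIndexedDatasetReader#elementReader, SimilarityAnalysis.cooccurrencesIDSs, TDIndexedDatasetWriter), but the imports, the `reader`/`writer` instances, and the exact method signatures are assumptions — check the Mahout spark module source before relying on them:

```scala
// Sketch only -- reader/writer construction and method signatures are
// assumptions based on the class names mentioned above.
import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.IndexedDataset

// 1. Read each action's <userID, itemID> tuples into an IndexedDataset.
//    elementReader does a parallel read of csv-type text files.
val purchases: IndexedDataset = reader.elementReader(mc, purchasePath) // hypothetical reader instance
val views: IndexedDataset     = reader.elementReader(mc, viewPath)
val carts: IndexedDataset     = reader.elementReader(mc, cartPath)

// 2. Primary action first, then the secondary actions.
//    Returns one IndexedDataset per input: the cooccurrence indicator (A'A)
//    for the primary action, then one cross-cooccurrence indicator per
//    secondary action, all LLR-downsampled.
val indicators: List[IndexedDataset] =
  SimilarityAnalysis.cooccurrencesIDSs(Array(purchases, views, carts))

// 3. Parallel write: one directory of csv part files per indicator matrix.
indicators.zip(List("purchase", "view", "cart")).foreach { case (ids, name) =>
  writer.writeTo(ids, s"indicators/$name") // hypothetical TDIndexedDatasetWriter call
}
```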

If you are going straight into a search engine, make sure to set omitScore in
the schema. The LLR cooccurrence score is really only needed for downsampling;
the search engine will re-weight the indicators using TF-IDF, which is good for
recs.
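For reference, the score used for downsampling is Dunning's log-likelihood ratio over the 2x2 contingency table of two events' counts. A minimal self-contained sketch (this mirrors the standard formulation Mahout uses; the object and method names here are just illustrative):

```scala
object Llr {
  // x * ln(x), with 0 * ln(0) defined as 0
  private def xLogX(x: Long): Double =
    if (x == 0) 0.0 else x * math.log(x.toDouble)

  // Unnormalized entropy of a set of counts: xLogX(sum) - sum of xLogX(x_i)
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // Dunning's log-likelihood ratio over the 2x2 contingency table:
  //   k11 = both events, k12 = A only, k21 = B only, k22 = neither
  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy    = entropy(k11 + k12, k21 + k22)
    val columnEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    // clamp tiny negative values from floating point to 0
    math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
  }
}
```

Independent counts score 0 and strongly correlated counts score high, which is why the score is good for choosing *which* cooccurrences to keep, while the ranking weight is better left to the search engine's TF-IDF.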

Caveat emptor: using more than one secondary input has not been thoroughly
tested, but since I’m doing that myself you will get fast support. Also remember
that IndexedDatasets keep HashMaps in memory on each cluster machine. There will
be one per userID and itemID collection, so you need enough memory on each node
to hold them.

Let me know how it goes; I’ll be doing the same thing.

On Feb 3, 2015, at 12:06 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

On Tue, Feb 3, 2015 at 11:57 AM, Олег Зотов <olegzoto...@gmail.com> wrote:

> Hello.
> I am developing a recommendation system using mahout on spark (1.0 snapshot).
> In the process I found that the spark-itemsimilarity driver does not allow
> processing more than two action types. After reading the documentation, I
> found that I should either run it multiple times or use the
> SimilarityAnalysis.cooccurrence API. But running it multiple times is not
> efficient, and writing java/scala code is not always convenient.
> 

Don't you think writing a script for the spark shell is better for this type of
stuff? IDEA supports full scala syntax even for scala scripts.

(One problem with the shell is that there's a bug where the MAHOUT_OPTS
environment variable doesn't work for adjusting spark application specifics
with -D....)
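For example (the script name is made up; -i is the standard Spark shell option for loading a script at startup, which the Mahout shell wrapper should pass through):

```shell
# load and run a scala script in the Mahout spark shell
mahout spark-shell -i cooccurrence.scala
```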


> Furthermore, in the sources of ItemSimilarityDriver.scala (at line 217) I
> found this comment: "// todo: allow more than one cross-similarity matrix?"
> 
> This is my first experience working with open source, and I hear that
> writing here before creating an issue is preferred. So my question: what
> about extending the spark-itemsimilarity driver API with something like this:
> mahout spark-itemsimilarity --main-filter purchase --secondary-filter
> view,addToCart,like
> (other parameters are omitted)
> The result would be one indicator matrix and a set of cross-indicator
> matrices (one for each secondary action).
> 
> If it's a helpful feature, I'll do it.
> 
> P.S. Sorry for my poor English; it is not my native language.
> 
Your English seems fine to me. Nothing to apologize for, IMO.

> 
> Regards, Oleg.
> 

