Re: mapreduce ItemSimilarity input optimization

Ted Dunning Mon, 18 Aug 2014 09:15:42 -0700

Serega,

Use the LLR similarity.
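
For reference, here is a minimal sketch of what the LLR score measures for a pair of items A and B. It uses the standard entropy form of the log-likelihood ratio test on a 2x2 table of user counts. The function names and the example counts are illustrative assumptions, not Mahout's code.

```python
import math


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users who interacted with both A and B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating point rounding
    return max(0.0, 2.0 * (row + col - mat))


if __name__ == "__main__":
    print(llr(10, 10, 10, 10))  # independent items, score is about 0
    print(llr(10, 0, 0, 10))    # items always seen together, score is large
```

Only counts of users go in, so preference weights such as 3 or 5 play no part in the score.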


Make sure you reduce the downsampling setting as Pat F suggested.  That
will make a huge difference.
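
The reason it matters: the number of item pairs a single user contributes grows roughly with the square of that user's history length, so capping long histories cuts the cooccurrence work sharply. Below is a minimal sketch of the idea, assuming (user, item) pairs held in memory. `downsample` and `max_prefs_per_user` are made-up names for illustration and are not Mahout options or Mahout's implementation.

```python
import random
from collections import defaultdict


def downsample(prefs, max_prefs_per_user, seed=42):
    """Keep at most max_prefs_per_user randomly chosen items per user.

    prefs is an iterable of (user_id, item_id) pairs.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in prefs:
        by_user[user].append(item)

    sampled = []
    for user, items in by_user.items():
        if len(items) > max_prefs_per_user:
            # Heavy users are sampled down; everyone else is kept whole
            items = rng.sample(items, max_prefs_per_user)
        sampled.extend((user, item) for item in items)
    return sampled


def pair_count(n):
    # Number of unordered item pairs produced by a history of n items
    return n * (n - 1) // 2


if __name__ == "__main__":
    prefs = [("heavy", i) for i in range(1000)] + [("light", 1), ("light", 2)]
    kept = downsample(prefs, max_prefs_per_user=100)
    print(len(prefs), len(kept))
    print(pair_count(1000), pair_count(100))
```

In this toy case, a user with 1000 items yields 499500 pairs before sampling and 4950 after capping at 100.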

The filtering of low frequency items is already done.

Also, consider delivering your similar items and recommendations via a
search engine.  Search engine deployment really facilitates exploration.




On Mon, Aug 18, 2014 at 1:43 AM, Serega Sheypak <serega.shey...@gmail.com>
wrote:

> Thank you very much. I'll do what you are saying in bullets 1...5 and try
> again.
>
> I also tried:
> 1. calc data using COSINE_SIMILARITY
> 2. calc the same data using COOCCURRENCE_SIMILARITY
> 3. join #1 and #2 where COOCCURRENCE >= $threshold
>
> Where threshold is some empirical integer value. I've used "2". The idea is
> to filter out item pairs which never occurred together...
> Please see this link:
>
> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
>
> If I replace COSINE_SIMILARITY with LLR and booleanData=true, does this
> approach still make sense, or is it a useless waste of time?
>
> "What do you mean the similar items are terrible? How are you measuring
> that?" I have eye testing only.
> I did automate preparation -> calculation -> HBase upload -> web-app
> serving, but I didn't automate testing.
>
>
>
>
> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <p...@occamsmachete.com>:
>
> > the things that stand out:
> >
> > 1) remove your maxSimilaritiesPerItem option! 50000 maxSimilaritiesPerItem
> > will _kill_ performance and give no gain, leave this setting at the
> > default of 500
> > 2) use only one action. What do you want the user to do? Do you want them
> > to read a page? Then train on item page views. If those pages lead to a
> > purchase then you want to recommend purchases so train on user purchases.
> > 3) remove your minPrefsPerUser option, this should never be 0 or it will
> > leave users in the training data that have no data and may contribute to
> > longer runs with no gain.
> > 4) this is a pretty small Hadoop cluster for the size of your data but I
> > bet changing #1 will noticeably reduce the runtime
> > 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
> > 6) remove your --booleanData option since LLR ignores weights.
> >
> > Remember that this is not the same as personalized recommendations. This
> > method alone will show the same “similar items” for all users.
> >
> > Sorry but both your “recommendation” types sound like the same thing.
> > Item page views _and_ clicks on recommended items both lead to an item
> > page view, so you have two actions that lead to the same thing, right?
> > Just train on item page views (unless you really want the user to make
> > a purchase).
> >
> > What do you mean the similar items are terrible? How are you measuring
> > that? Are you doing cross-validation measuring precision or A/B testing?
> > What looks bad to you may be good, the eyeball test is not always
> > reliable. If they are coming up completely crazy or random then you may
> > have a bug in your ID translation logic.
> >
> > It sounds like you have enough data to produce good results.
> >
> > On Aug 17, 2014, at 11:14 AM, Serega Sheypak <serega.shey...@gmail.com>
> > wrote:
> >
> > 1. 7 nodes 4 CPU per node, 48 GB ram, 2 HDD for MR and HDFS. Not too much
> > but enough for the start..
> > 2. I run it as oozie action.
> > <action name="run-mahout-primary-similarity-ItemSimilarityJob">
> >        <java>
> >            <job-tracker>${jobTracker}</job-tracker>
> >            <name-node>${nameNode}</name-node>
> >            <prepare>
> >                <delete path="${mahoutOutputDir}/primary" />
> >                <delete
> > path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
> >            </prepare>
> >            <configuration>
> >                <property>
> >                    <name>mapred.queue.name</name>
> >                    <value>default</value>
> >                </property>
> >
> >            </configuration>
> >
> >
> >
> >            <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
> >            <arg>--input</arg>
> >            <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
> > item_id, pref [can be 3 or 5, 3 is VIEW item, 5 is CLICK on recommendation,
> > a kind of try to increase quality of recommender...]-->
> >
> >            <arg>--output</arg>
> >            <arg>${mahoutOutputDir}/primary</arg>
> >
> >            <arg>--similarityClassname</arg>
> >            <arg>SIMILARITY_COSINE</arg>
> >
> >            <arg>--maxSimilaritiesPerItem</arg>
> >            <arg>50000</arg>
> >
> >            <arg>--minPrefsPerUser</arg>
> >            <arg>0</arg>
> >
> >            <arg>--booleanData</arg>
> >            <arg>false</arg>
> >
> >            <arg>--tempDir</arg>
> >            <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> >
> >        </java>
> >        <ok to="to-narrow-table"/>
> >        <error to="kill"/>
> >    </action>
> >
> > 3) RANK does it, here is a script:
> >
> > --user, item, pref previously prepared by hive
> > user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
> > (user_id:chararray, item_id:long, pref:double);
> >
> > --get distinct user from the whole input
> > distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> >
> > --get distinct item from the whole input
> > distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> >
> > --rank user 1....N
> > rankUsers_ = RANK distUserId;
> > rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> >
> > --rank items 1....M
> > rankItems_ = RANK distItemId;
> > rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> >
> > --join and remap natural user_id, item_id, to RANKS: 1.N, 1..M
> > joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
> > 'skewed';
> > joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
> > item_id using 'replicated';
> >
> > projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id
> > as user_id,
> >                                         rankItems::rank_id
> > as item_id,
> >                                         joinedUsers::user_item_pref::pref
> > as pref;
> >
> > --store mapping for later remapping from RANK back to natural values
> > STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers'
> > using PigStorage('\t');
> > STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems'
> > using PigStorage('\t');
> > STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into
> > '$projPrefs' using PigStorage('\t');
> >
> > 4) I've seen this idea in a different discussion, that different weights
> > for different actions are not good. Sorry, I don't understand what you
> > suggest.
> > I have two kind of actions: user viewed item, user clicked on recommended
> > item (recommended item produced by my item similarity system).
> > I want to produce two kinds of recommendations:
> > 1. current item + recommend other items which other users visit in
> > conjunction with current item
> > 2. similar item: recommend items similar to current viewed item.
> > What can I try?
> > LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = LOG_LIKELIHOOD?
> >
> > Right now I do get awful recommendations and I can't understand what I
> > can try next :((((((((((((
> >
> >
> > 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pat.fer...@gmail.com>:
> >
> > > 1) how many cores in the cluster? The whole idea behind mapreduce is
> > > that if you buy more CPUs you get a nearly linear decrease in runtime.
> > > 2) what is your mahout command line with options, or how are you
> > > invoking mahout? I have seen the Mahout mapreduce recommender take this
> > > long, so we should check what you are doing with downsampling.
> > > 3) do you really need to RANK your ids? That’s a full sort. When using
> > > pig I usually get the DISTINCT ones and assign an incrementing integer
> > > as the corresponding Mahout ID.
> > > 4) your #2, assigning different weights to different actions, usually
> > > does not work. I’ve done this before and compared offline metrics and
> > > seen precision go down. I’d get this working using only your primary
> > > action first. What are you trying to get the user to do? View
> > > something, buy something? Use that action as the primary preference
> > > and start out with a weight of 1 using LLR. With LLR the weights are
> > > not used anyway so your data may not produce good results with mixed
> > > actions.
> > >
> > > A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> > > 1) output from 2 can be directly ingested and will create output.
> > > 2) multiple actions can be used with cross-cooccurrence, not by
> > > guessing at weights.
> > > 3) output has your application specific IDs preserved.
> > > 4) it's about 10x faster than mapreduce and will do away with your ID
> > > translation steps
> > >
> > > One caveat is that your cluster machines will need lots of memory. I
> > > have 8-16g on mine.
> > >
> > > On Aug 17, 2014, at 1:26 AM, Serega Sheypak <serega.shey...@gmail.com>
> > > wrote:
> > >
> > > 1. I do collect preferences for items using a 60-day sliding window:
> > > today - 60 days.
> > > 2. I do prepare triples user_id, item_id, discrete_pref_value (3 for
> > > item view, 5 for clicking the recommendation block. The idea is to give
> > > more value to recommendations which attract visitor attention). I get
> > > ~20.000.000 lines with ~1.000.000 distinct items and ~2.000.000
> > > distinct users
> > > 3. I do use the apache pig RANK function to rank all distinct user_id
> > > 4. I do the same for item_id
> > > 5. I do join the input dataset with the ranked datasets and provide
> > > input to mahout with dense integer user_id, item_id
> > > 6. I do get mahout output and join the integer item_id back to get the
> > > natural key value.
> > >
> > > step #1-2 takes ~ 40min
> > > step #3-5 takes ~1 hour
> > > mahout calc takes ~3hours
> > >
> > >
> > >
> > > 2014-08-17 10:45 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:
> > >
> > >> This really doesn't sound right.  It should be possible to process
> > >> almost a thousand times that much data every night without that much
> > >> problem.
> > >>
> > >> How are you preparing the input data?
> > >>
> > >> How are you converting to Mahout id's?
> > >>
> > >> Even using python, you should be able to do the conversion in just a
> > >> few minutes without any parallelism whatsoever.
> > >>
> > >>
> > >>
> > >>
> > >> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak
> > >> <serega.shey...@gmail.com> wrote:
> > >>
> > >>> Hi, we are trying to calculate ItemSimilarity.
> > >>> Right now we have 2*10^7 input lines. I do provide input data as raw
> > >>> text each day to recalculate item similarities. We do get +100..1000
> > >>> new items each day.
> > >>> 1. It takes too much time to prepare input data.
> > >>> 2. It takes too much time to convert user_id, item_id to mahout ids
> > >>>
> > >>> Is there any possibility to provide data to mahout mapreduce
> > >>> ItemSimilarity using some binary format with compression?
> > >>>
> > >>
> > >
> > >
> >
> >
>
