Re: mapreduce ItemSimilarity input optimization

Serega Sheypak Tue, 19 Aug 2014 08:19:19 -0700

Hi, what is "emon"?
1. I do create "look-with" recommendations. Really it's just the "raw"
output from itemSimilarityJob with booleanData=true and LLR as the
similarity function (your suggestion).
2. I do create "similar" recommendations. I apply a category filter before
serving them.


"look-with", means other users watched iPhone case and other accessory with
iPhone. I do get iPhone accessories here, but also a water heating
device...
"similar" means show only other smartphones as recommendations for the iPhone.
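
A minimal sketch of the category filter used for "similar", assuming an
in-memory catalog that maps item ids to category names. The names
`filter_same_category`, `similar_items` and `catalog` are hypothetical and
not part of Mahout:

```python
def filter_same_category(item_id, similar_items, catalog):
    """Keep only similar items that share the viewed item's category.

    similar_items: list of (other_item_id, score) pairs, e.g. rows read
    from the ItemSimilarityJob output after mapping ids back.
    catalog: dict of item_id -> category name (assumed to be available).
    """
    category = catalog.get(item_id)
    if category is None:
        # Unknown item: nothing can be matched by category.
        return []
    kept = [(other, score) for other, score in similar_items
            if catalog.get(other) == category]
    # Highest score first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

With "look-with" this filter is deliberately not applied, which is why items
from unrelated categories can show up there.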

Right now the problem is the water heating device in "look-with" (category
filter not applied). How can I get rid of such recommendations, and
why do I get them?
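
To check a suspicious pair by hand, the LLR score can be recomputed from the
2x2 cooccurrence counts. This is a minimal sketch, assuming the usual
entropy-based form of the log-likelihood ratio (as far as I know the same
shape as Mahout's LogLikelihood.logLikelihoodRatio). The function names here
are made up for illustration:

```python
import math


def x_log_x(x):
    # x * ln(x), with the convention 0 * ln(0) = 0
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users who viewed both A and B
    k12: users who viewed A but not B
    k21: users who viewed B but not A
    k22: users who viewed neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    if row + col < mat:
        # Guard against small negative values from round-off
        return 0.0
    return 2.0 * (row + col - mat)
```

Plugging in the real counts for iPhone and the water heating device (users
who viewed both, only one of them, or neither) shows whether the high score
comes from the data itself or from a bug in the ID mapping.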



2014-08-19 18:01 GMT+04:00 Pat Ferrel <pat.fer...@gmail.com>:

> That sounds much better.
>
> Do you have metadata like product category? Electronics vs. home
> appliance? One easy thing to do if you have categories in your catalog is
> filter by the same category as the item being viewed.
>
> BTW it sounds like you have an emon
> On Aug 19, 2014, at 12:53 AM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
> Hi, I've used LLR with the properties you suggested.
> Right now I have one problem:
> Water heat device (
> http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg
> )
> is recommended for iPhone, and it has one of the highest scores.
> good things:
> iPhone cases (
>
> https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg
> )
> are recommended for iPhone. It's good.
> Other smartphones are recommended for iPhone. It's good.
> Other iPhones are recommended for iPhone. It's good. 16GB recommended for
> 32GB, etc.
>
> What could be the reason for recommending "Water heat device" for iPhone?
> iPhone is one of the most popular items. Does it mean there must be a lot
> of people viewing iPhone together with "Water heat device"?
>
>
>
> 2014-08-18 20:15 GMT+04:00 Pat Ferrel <pat.fer...@gmail.com>:
>
> > Oh, and as to using different algorithms, this is an “ensemble” method. In
> > the paper they are talking about using widely differing algorithms like ALS
> > + Cooccurrence + … This technique was used to win the Netflix prize but in
> > practice the improvements may be too small to warrant running multiple
> > pipelines. In any case it isn’t the first improvement you may want to try.
> > For instance your UI will have a drastic effect on how well your recs do,
> > and there are other much easier techniques that we can talk about once you
> > get the basics working.
> >
> >
> > On Aug 18, 2014, at 9:04 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> >
> > When beginning to use a recommender from Mahout I always suggest you start
> > from the defaults. These often give the best results; then tune afterwards
> > to improve.
> >
> > Your intuition is correct that multiple actions can be used to improve
> > results, but get the basics working first. The easiest way to use multiple
> > actions is spark-itemsimilarity, so since you are using mapreduce for
> > now, just use one action.
> >
> > I would not try to combine the results from two similarity measures. There
> > is no benefit since LLR is better than any of them; at least I’ve never
> > seen it lose. Below is my experience with trying many of the similarity
> > metrics on exactly the same data. I did cross-validation with precision
> > (MAP, mean average precision). LLR wins in other cases I’ve tried too. So
> > LLR is the only method presently used in the Spark version of
> > itemsimilarity.
> >
> > <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg>
> >
> > If you still get weird results double check your ID mapping. Run a small
> > bit of data through and spot check the mapping by hand.
> >
> > At some point you will want to create a cross-validation test. This is
> > good as a sort of integration sanity check when making changes to the
> > recommender. You run cross-validation using standard test data to see if
> > the score changes drastically between releases. Big changes may indicate a
> > bug. At the beginning it will help you tune, as in the case above where it
> > helped decide on LLR.
> >
> >
> >
> > On Aug 18, 2014, at 1:43 AM, Serega Sheypak <serega.shey...@gmail.com>
> > wrote:
> >
> > Thank you very much. I'll do what you are saying in bullets 1...5 and try
> > again.
> >
> > I also tried:
> > 1. calc data using SIMILARITY_COSINE
> > 2. calc the same data using SIMILARITY_COOCCURRENCE
> > 3. join #1 and #2 where COOCCURRENCE >= $threshold
> >
> > Where threshold is some empirical integer value. I've used "2". The idea is
> > to filter out item pairs which never occurred together...
> > Please see this link:
> >
> >
> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
> >
> > If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this
> > approach still make sense, or is it a useless waste of time?
> >
> > "What do you mean the similar items are terrible? How are you measuring
> > that?" I have eye testing only.
> > I did automate preparation -> calculation -> hbase upload -> web-app
> > serving, but I didn't automate testing.
> >
> >
> >
> >
> > 2014-08-18 5:16 GMT+04:00 Pat Ferrel <p...@occamsmachete.com>:
> >
> >> the things that stand out:
> >>
> >> 1) remove your maxSimilaritiesPerItem option! 50000 maxSimilaritiesPerItem
> >> will _kill_ performance and give no gain, leave this setting at the default
> >> of 500
> >> 2) use only one action. What do you want the user to do? Do you want them
> >> to read a page? Then train on item page views. If those pages lead to a
> >> purchase then you want to recommend purchases so train on user purchases.
> >> 3) remove your minPrefsPerUser option, this should never be 0 or it will
> >> leave users in the training data that have no data and may contribute to
> >> longer runs with no gain.
> >> 4) this is a pretty small Hadoop cluster for the size of your data but I
> >> bet changing #1 will noticeably reduce the runtime
> >> 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
> >> 6) remove your --booleanData option since LLR ignores weights.
> >>
> >> Remember that this is not the same as personalized recommendations. This
> >> method alone will show the same “similar items” for all users.
> >>
> >> Sorry but both your “recommendation” types sound like the same thing.
> >> Using both item page view _and_ clicks on recommended items will both lead
> >> to an item page view so you have two actions that lead to the same thing,
> >> right? Just train on an item page view (unless you really want the user to
> >> make a purchase)
> >>
> >> What do you mean the similar items are terrible? How are you measuring
> >> that? Are you doing cross-validation measuring precision or A/B testing?
> >> What looks bad to you may be good, the eyeball test is not always reliable.
> >> If they are coming up completely crazy or random then you may have a bug in
> >> your ID translation logic.
> >>
> >> It sounds like you have enough data to produce good results.
> >>
> >> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <serega.shey...@gmail.com>
> >> wrote:
> >>
> >> 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too
> >> much, but enough for a start.
> >> 2. I run it as oozie action.
> >> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
> >>      <java>
> >>          <job-tracker>${jobTracker}</job-tracker>
> >>          <name-node>${nameNode}</name-node>
> >>          <prepare>
> >>              <delete path="${mahoutOutputDir}/primary" />
> >>              <delete
> >> path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
> >>          </prepare>
> >>          <configuration>
> >>              <property>
> >>                  <name>mapred.queue.name</name>
> >>                  <value>default</value>
> >>              </property>
> >>
> >>          </configuration>
> >>
> >>          <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
> >>          <arg>--input</arg>
> >>          <arg>${tempDir}/to-mahout-id/projPrefs</arg>
> >>          <!-- dense user_id, item_id, pref [can be 3 or 5: 3 is VIEW item,
> >>               5 is CLICK on recommendation, an attempt to increase the
> >>               quality of the recommender...] -->
> >>
> >>          <arg>--output</arg>
> >>          <arg>${mahoutOutputDir}/primary</arg>
> >>
> >>          <arg>--similarityClassname</arg>
> >>          <arg>SIMILARITY_COSINE</arg>
> >>
> >>          <arg>--maxSimilaritiesPerItem</arg>
> >>          <arg>50000</arg>
> >>
> >>          <arg>--minPrefsPerUser</arg>
> >>          <arg>0</arg>
> >>
> >>          <arg>--booleanData</arg>
> >>          <arg>false</arg>
> >>
> >>          <arg>--tempDir</arg>
> >>          <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
> >>
> >>      </java>
> >>      <ok to="to-narrow-table"/>
> >>      <error to="kill"/>
> >>  </action>
> >>
> >> 3) RANK does it, here is a script:
> >>
> >> --user, item, pref previously prepared by hive
> >> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
> >> (user_id:chararray, item_id:long, pref:double);
> >>
> >> --get distinct user from the whole input
> >> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
> >>
> >> --get distinct item from the whole input
> >> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
> >>
> >> --rank user 1....N
> >> rankUsers_ = RANK distUserId;
> >> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
> >>
> >> --rank items 1....M
> >> rankItems_ = RANK distItemId;
> >> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
> >>
> >> --join and remap natural user_id, item_id, to RANKS: 1..N, 1..M
> >> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
> >> 'skewed';
> >> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
> >> item_id using 'replicated';
> >>
> >> projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id
> >> as user_id,
> >>                                       rankItems::rank_id
> >> as item_id,
> >>                                       joinedUsers::user_item_pref::pref
> >> as pref;
> >>
> >> --store mapping for later remapping from RANK back to natural values
> >> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers'
> >> using PigStorage('\t');
> >> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems'
> >> using PigStorage('\t');
> >> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into
> >> '$projPrefs' using PigStorage('\t');
> >>
> >> 4) I've seen this idea in a different discussion, that different weights
> >> for different actions are not good. Sorry, I don't understand what you
> >> suggest.
> >> I have two kinds of actions: user viewed item, user clicked on recommended
> >> item (recommended item produced by my item similarity system).
> >> I want to produce two kinds of recommendations:
> >> 1. current item + recommend other items which other users visit in
> >> conjunction with current item
> >> 2. similar item: recommend items similar to the currently viewed item.
> >> What can I try?
> >> LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = LOG_LIKELIHOOD?
> >>
> >> Right now I do get awful recommendations and I can't understand what I can
> >> try next :((((((((((((
> >>
> >>
> >> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pat.fer...@gmail.com>:
> >>
> >>> 1) how many cores in the cluster? The whole idea behind mapreduce is that
> >>> if you buy more cpus you get a nearly linear decrease in runtime.
> >>> 2) what is your mahout command line with options, or how are you invoking
> >>> mahout? I have seen the Mahout mapreduce recommender take this long so we
> >>> should check what you are doing with downsampling.
> >>> 3) do you really need to RANK your ids? That’s a full sort. When using pig
> >>> I usually get DISTINCT ones and assign an incrementing integer as the
> >>> corresponding Mahout ID.
> >>> 4) your #2, assigning different weights to different actions, usually does
> >>> not work. I’ve done this before and compared offline metrics and seen
> >>> precision go down. I’d get this working using only your primary actions
> >>> first. What are you trying to get the user to do? View something, buy
> >>> something? Use that action as the primary preference and start out with a
> >>> weight of 1 using LLR. With LLR the weights are not used anyway so your
> >>> data may not produce good results with mixed actions.
> >>>
> >>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
> >>> 1) output from 2 can be directly ingested and will create output.
> >>> 2) multiple actions can be used with cross-cooccurrence, not by guessing
> >>> at weights.
> >>> 3) output has your application specific IDs preserved.
> >>> 4) it's about 10x faster than mapreduce and will do away with your ID
> >>> translation steps
> >>>
> >>> One caveat is that your cluster machines will need lots of memory. I have
> >>> 8-16g on mine.
> >>>
> >>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <serega.shey...@gmail.com>
> >>> wrote:
> >>>
> >>> 1. I do collect preferences for items using a 60-day sliding window:
> >>> today - 60 days.
> >>> 2. I do prepare triples user_id, item_id, discrete_pref_value (3 for item
> >>> view, 5 for clicking the recommendation block. The idea is to give more
> >>> value to recommendations which attract visitor attention). I get
> >>> ~20.000.000 lines with ~1.000.000 distinct items and ~2.000.000 distinct
> >>> users
> >>> 3. I do use the apache pig RANK function to rank all distinct user_id
> >>> 4. I do the same for item_id
> >>> 5. I do join the input dataset with the ranked datasets and provide input
> >>> to mahout with dense integer user_id, item_id
> >>> 6. I do get mahout output and join integer item_id back to get the natural
> >>> key value.
> >>>
> >>> step #1-2 takes ~ 40min
> >>> step #3-5 takes ~1 hour
> >>> mahout calc takes ~3hours
> >>>
> >>>
> >>>
> >>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:
> >>>
> >>>> This really doesn't sound right.  It should be possible to process almost
> >>>> a thousand times that much data every night without that much problem.
> >>>>
> >>>> How are you preparing the input data?
> >>>>
> >>>> How are you converting to Mahout id's?
> >>>>
> >>>> Even using python, you should be able to do the conversion in just a few
> >>>> minutes without any parallelism whatsoever.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak
> >>>> <serega.shey...@gmail.com> wrote:
> >>>>
> >>>>> Hi, we are trying to calculate ItemSimilarity.
> >>>>> Right now we have 2*10^7 input lines. I do provide input data as raw text
> >>>>> each day to recalculate item similarities. We do get +100..1000 new items
> >>>>> each day.
> >>>>> 1. It takes too much time to prepare input data.
> >>>>> 2. It takes too much time to convert user_id, item_id to mahout ids
> >>>>>
> >>>>> Is there any possibility to provide data to mahout mapreduce
> >>>>> ItemSimilarity using some binary format with compression?
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
>
>
