Re: mapreduce ItemSimilarity input optimization

Pat Ferrel Tue, 19 Aug 2014 07:02:52 -0700

That sounds much better.

Do you have metadata like product category? Electronics vs. home appliance? One 
easy thing to do if you have categories in your catalog is filter by the same 
category as the item being viewed.
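
A minimal sketch of that filter in Python, assuming a hypothetical catalog lookup from item ID to category and a list of (item, score) pairs as they might come out of the similarity job. The function name, the IDs, and the scores are made up for illustration:

```python
def filter_by_category(viewed_item, similar_items, category_of):
    """Keep only similar items that share the viewed item's category.

    similar_items: list of (item_id, score) pairs, e.g. LLR scores.
    category_of:   hypothetical catalog lookup, item_id -> category name.
    """
    viewed_category = category_of.get(viewed_item)
    if viewed_category is None:
        # No category metadata for the viewed item: leave the list as-is.
        return list(similar_items)
    return [
        (item, score)
        for item, score in similar_items
        if category_of.get(item) == viewed_category
    ]


# Illustrative data only, not taken from a real catalog.
category_of = {
    "iphone_5s_16gb": "electronics",
    "iphone_5s_32gb": "electronics",
    "iphone_case": "electronics",
    "water_heater": "home_appliance",
}
similar = [("water_heater", 0.97), ("iphone_5s_32gb", 0.95), ("iphone_case", 0.90)]
filtered = filter_by_category("iphone_5s_16gb", similar, category_of)
print(filtered)
```

The filter runs at serving time, so the similarity job itself does not have to change.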


BTW it sounds like you have an emon
On Aug 19, 2014, at 12:53 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:

Hi, I've used LLR with the properties you suggested.
Right now I have one problem.
The problem:
Water heat device (
http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg)
is recommended for iPhone, and it has one of the highest scores.
The good things:
iPhone cases (
https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg)
are recommended for iPhone. That's good.
Other smartphones are recommended for iPhone. That's good.
Other iPhones are recommended for iPhone. That's good: 16GB is recommended for
32GB, etc.

What could be the reason for recommending "Water heat device" for iPhone?
iPhone is one of the most popular items. Does that mean a lot of people
view iPhone together with "Water heat device"?



2014-08-18 20:15 GMT+04:00 Pat Ferrel <pat.fer...@gmail.com>:

> Oh, and as to using different algorithms, this is an “ensemble” method. In
> the paper they are talking about using widely differing algorithms like ALS
> + Cooccurrence + … This technique was used to win the Netflix prize but in
> practice the improvements may be too small to warrant running multiple
> pipelines. In any case it isn’t the first improvement you may want to try.
> For instance your UI will have a drastic effect on how well your recs do,
> and there are other much easier techniques that we can talk about once you
> get the basics working.
> 
> 
> On Aug 18, 2014, at 9:04 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
> 
> When beginning to use a recommender from Mahout I always suggest you start
> from the defaults. These often give the best results—then tune afterwards
> to improve.
> 
> Your intuition is correct that multiple actions can be used to improve
> results but get the basics working first. The easiest way to use multiple
> actions is to use spark-itemsimilarity, so since you are using mapreduce
> for now, just use one action.
> 
> I would not try to combine the results from two similarity measures. There
> is no benefit since LLR is better than any of them, at least I’ve never
> seen it lose. Below is my experience with trying many of the similarity
> metrics on exactly the same data. I did cross-validation with precision
> (MAP, mean average precision). LLR wins in other cases I’ve tried too. So
> LLR is the only method presently used in the Spark version of
> itemsimilarity.
> 
> <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg>
> 
> If you still get weird results double check your ID mapping. Run a small
> bit of data through and spot check the mapping by hand.
> 
> At some point you will want to create a cross-validation test. This is
> good as a sort of integration sanity check when making changes to the
> recommender. You run cross-validation using standard test data to see if
> the score changes drastically between releases. Big changes may indicate a
> bug. At the beginning it will help you tune as in the case above where it
> helped decide on LLR.
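
A rough sketch of the MAP score mentioned above, in Python. It assumes each test user has a ranked list of recommended items without duplicates and a held-out set of items they actually interacted with. The names and data are hypothetical, and dividing by the number of held-out items is only one of several common conventions:

```python
def average_precision(recommended, relevant):
    """Average precision of one ranked list against a held-out set."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            # Precision at this rank, counted only where a hit occurs.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(recs_by_user, held_out_by_user):
    """Mean of the per-user average precision over all test users."""
    users = list(held_out_by_user)
    if not users:
        return 0.0
    total = sum(
        average_precision(recs_by_user.get(user, []), held_out_by_user[user])
        for user in users
    )
    return total / len(users)


# Made-up example: two test users.
recs_by_user = {"u1": ["a", "b", "c"], "u2": ["x", "y"]}
held_out_by_user = {"u1": {"a", "c"}, "u2": {"y"}}
map_score = mean_average_precision(recs_by_user, held_out_by_user)
print(map_score)
```

Running the same script against fixed test data after each change gives a number that can be compared between releases.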
> 
> 
> 
> On Aug 18, 2014, at 1:43 AM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
> 
> Thank you very much. I'll do what you are saying in bullets 1..5 and try
> again.
> 
> I also tried:
> 1. calc data using SIMILARITY_COSINE
> 2. calc the same data using SIMILARITY_COOCCURRENCE
> 3. join #1 and #2 where cooccurrence >= $threshold
> 
> Where threshold is some empirical integer value. I've used "2". The idea is
> to filter out item pairs which never occurred together...
> Please see this link:
> 
> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
> 
> If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this
> approach still make sense, or is it a useless waste of time?
> 
> "What do you mean the similar items are terrible? How are you measuring
> that?" I have eyeball testing only.
> I did automate preparation -> calculation -> hbase upload -> web-app
> serving, but I didn't automate testing.
> 
> 
> 
> 
> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <p...@occamsmachete.com>:
> 
>> the things that stand out:
>> 
>> 1) remove your maxSimilaritiesPerItem option! 50000 maxSimilaritiesPerItem
>> will _kill_ performance and give no gain, leave this setting at the
>> default of 500
>> 2) use only one action. What do you want the user to do? Do you want them
>> to read a page? Then train on item page views. If those pages lead to a
>> purchase then you want to recommend purchases so train on user purchases.
>> 3) remove your minPrefsPerUser option, this should never be 0 or it will
>> leave users in the training data that have no data and may contribute to
>> longer runs with no gain.
>> 4) this is a pretty small Hadoop cluster for the size of your data but I
>> bet changing #1 will noticeably reduce the runtime
>> 5) change --similarityClassname to SIMILARITY_LOGLIKELIHOOD
>> 6) remove your --booleanData option since LLR ignores weights.
>> 
>> Remember that this is not the same as personalized recommendations. This
>> method alone will show the same “similar items” for all users.
>> 
>> Sorry but both your “recommendation” types sound like the same thing.
>> Using both item page view _and_ clicks on recommended items will both
>> lead to an item page view so you have two actions that lead to the same
>> thing, right? Just train on an item page view (unless you really want the
>> user to make a purchase)
>> 
>> What do you mean the similar items are terrible? How are you measuring
>> that? Are you doing cross-validation measuring precision or A/B testing?
>> What looks bad to you may be good, the eyeball test is not always
>> reliable. If they are coming up completely crazy or random then you may
>> have a bug in your ID translation logic.
>> 
>> It sounds like you have enough data to produce good results.
>> 
>> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <serega.shey...@gmail.com>
>> wrote:
>> 
>> 1. 7 nodes 4 CPU per node, 48 GB ram, 2 HDD for MR and HDFS. Not too much
>> but enough for the start..
>> 2. I run it as oozie action.
>> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
>>      <java>
>>          <job-tracker>${jobTracker}</job-tracker>
>>          <name-node>${nameNode}</name-node>
>>          <prepare>
>>              <delete path="${mahoutOutputDir}/primary" />
>>              <delete path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
>>          </prepare>
>>          <configuration>
>>              <property>
>>                  <name>mapred.queue.name</name>
>>                  <value>default</value>
>>              </property>
>> 
>>          </configuration>
>> 
>> 
>> 
>>          <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
>>          <arg>--input</arg>
>>          <arg>${tempDir}/to-mahout-id/projPrefs</arg>
>>          <!-- dense user_id, item_id, pref [can be 3 or 5: 3 is VIEW item,
>>               5 is CLICK on recommendation, a kind of try to increase
>>               quality of recommender...] -->
>> 
>>          <arg>--output</arg>
>>          <arg>${mahoutOutputDir}/primary</arg>
>> 
>>          <arg>--similarityClassname</arg>
>>          <arg>SIMILARITY_COSINE</arg>
>> 
>>          <arg>--maxSimilaritiesPerItem</arg>
>>          <arg>50000</arg>
>> 
>>          <arg>--minPrefsPerUser</arg>
>>          <arg>0</arg>
>> 
>>          <arg>--booleanData</arg>
>>          <arg>false</arg>
>> 
>>          <arg>--tempDir</arg>
>>          <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
>> 
>>      </java>
>>      <ok to="to-narrow-table"/>
>>      <error to="kill"/>
>>  </action>
>> 
>> 3) RANK does it, here is a script:
>> 
>> --user, item, pref previously prepared by hive
>> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
>> (user_id:chararray, item_id:long, pref:double);
>> 
>> --get distinct user from the whole input
>> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
>> 
>> --get distinct item from the whole input
>> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
>> 
>> --rank user 1....N
>> rankUsers_ = RANK distUserId;
>> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
>> 
>> --rank items 1....M
>> rankItems_ = RANK distItemId;
>> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
>> 
>> --join and remap natural user_id, item_id, to RANKS: 1..N, 1..M
>> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
>> 'skewed';
>> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
>> item_id using 'replicated';
>> 
>> projPrefs = FOREACH joinedItems GENERATE joinedUsers::rankUsers::rank_id
>> as user_id,
>>                                       rankItems::rank_id
>> as item_id,
>>                                       joinedUsers::user_item_pref::pref
>> as pref;
>> 
>> --store mapping for later remapping from RANK back to natural values
>> STORE (FOREACH rankUsers GENERATE rank_id, user_id) into '$rankUsers'
>> using PigStorage('\t');
>> STORE (FOREACH rankItems GENERATE rank_id, item_id) into '$rankItems'
>> using PigStorage('\t');
>> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) into
>> '$projPrefs' using PigStorage('\t');
>> 
>> 4) I've seen this idea in a different discussion, that different weights
>> for different actions are not good. Sorry, I don't understand what you
>> suggest.
>> I have two kinds of actions: user viewed item, user clicked on recommended
>> item (recommended item produced by my item similarity system).
>> I want to produce two kinds of recommendations:
>> 1. current item + recommend other items which other users visit in
>> conjunction with current item
>> 2. similar item: recommend items similar to current viewed item.
>> What can I try?
>> LLR = http://en.wikipedia.org/wiki/Log-likelihood_ratio = LOG_LIKELIHOOD?
>> 
>> Right now I do get awful recommendations and I can't understand what I
>> can try next :((((((((((((
>> 
>> 
>> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pat.fer...@gmail.com>:
>> 
>>> 1) how many cores in the cluster? The whole idea behind mapreduce is you
>>> buy more cpus you get nearly linear decrease in runtime.
>>> 2) what is your mahout command line with options, or how are you
>>> invoking mahout. I have seen the Mahout mapreduce recommender take this
>>> long so we should check what you are doing with downsampling.
>>> 3) do you really need to RANK your ids, that’s a full sort? When using
>>> pig I usually get DISTINCT ones and assign an incrementing integer as the
>>> Mahout ID corresponding
>>> 4) your #2 assigning different weights to different actions usually does
>>> not work. I’ve done this before and compared offline metrics and seen
>>> precision go down. I’d get this working using only your primary actions
>>> first. What are you trying to get the user to do? View something, buy
>>> something? Use that action as the primary preference and start out with
>>> a weight of 1 using LLR. With LLR the weights are not used anyway so your
>>> data may not produce good results with mixed actions.
>>> 
>>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
>>> 1) output from 2 can be directly ingested and will create output.
>>> 2) multiple actions can be used with cross-cooccurrence, not by guessing
>>> at weights.
>>> 3) output has your application specific IDs preserved.
>>> 4) it's about 10x faster than mapreduce and will do away with your ID
>>> translation steps
>>> 
>>> One caveat is that your cluster machines will need lots of memory. I
>>> have 8-16g on mine.
>>> 
>>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <serega.shey...@gmail.com>
>>> wrote:
>>> 
>>> 1. I do collect preferences for items using a 60-day sliding window:
>>> today - 60 days.
>>> 2. I do prepare triples user_id, item_id, discrete_pref_value (3 for item
>>> view, 5 for clicking the recommendation block. The idea is to give more
>>> value to recommendations which attract visitor attention). I get
>>> ~20,000,000 lines with ~1,000,000 distinct items and ~2,000,000 distinct
>>> users
>>> 3. I do use the apache pig RANK function to rank all distinct user_id
>>> 4. I do the same for item_id
>>> 5. I do join the input dataset with the ranked datasets and provide input
>>> to mahout with dense integer user_id, item_id
>>> 6. I do get the mahout output and join integer item_id back to get the
>>> natural key value.
>>> 
>>> step #1-2 takes ~ 40min
>>> step #3-5 takes ~1 hour
>>> mahout calc takes ~3hours
>>> 
>>> 
>>> 
>>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:
>>> 
>>>> This really doesn't sound right.  It should be possible to process
>>>> almost a thousand times that much data every night without that much
>>>> problem.
>>>> 
>>>> How are you preparing the input data?
>>>> 
>>>> How are you converting to Mahout id's?
>>>> 
>>>> Even using python, you should be able to do the conversion in just a
>>>> few minutes without any parallelism whatsoever.
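
A single-process conversion along those lines could look like the sketch below. It assumes the input is a sequence of (user_id, item_id, pref) rows that fits in memory; the function names and sample rows are hypothetical:

```python
def build_id_map(keys):
    """Assign dense integer IDs 0..N-1 in order of first appearance."""
    mapping = {}
    for key in keys:
        if key not in mapping:
            mapping[key] = len(mapping)
    return mapping


def to_mahout_ids(rows):
    """Translate natural user and item keys into dense integer IDs."""
    rows = list(rows)
    user_ids = build_id_map(user for user, _, _ in rows)
    item_ids = build_id_map(item for _, item, _ in rows)
    converted = [(user_ids[u], item_ids[i], pref) for u, i, pref in rows]
    return converted, user_ids, item_ids


# Made-up sample rows: (natural user key, natural item key, preference).
rows = [("alice", 1001, 1.0), ("bob", 1001, 1.0), ("alice", 2002, 1.0)]
converted, user_ids, item_ids = to_mahout_ids(rows)

# Reverse map for translating the job's output back to natural item keys.
item_keys = {dense: key for key, dense in item_ids.items()}
print(converted, item_keys)
```

Two dictionary passes replace the sort-based ranking, and the reverse map covers the final join back to natural keys.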
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak
>>>> <serega.shey...@gmail.com> wrote:
>>>> 
>>>>> Hi, We are trying to calculate ItemSimilarity.
>>>>> Right now we have 2*10^7 input lines. I do provide input data as raw
>>>>> text each day to recalculate item similarities. We do get +100..1000
>>>>> new items each day.
>>>>> 1. It takes too much time to prepare input data.
>>>>> 2. It takes too much time to convert user_id, item_id to mahout ids
>>>>>
>>>>> Is there any possibility to provide data to mahout mapreduce
>>>>> ItemSimilarity using some binary format with compression?
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
>
