Sorry, "emon" was a typo; I meant to say it sounds like you have an ecom site (see below).

I still don’t understand the difference between these “recommendations”: 1)
"look-with recommendations" = clicks on recommended items? 2) "similar" =
items viewed by other users? A click on a recommendation leads to viewing an
item, so the #2 data includes the #1 data, right? I would drop #1 and use only
the #2 data. Besides, if you only recommend items that have already been
recommended you will decrease sales because you will never show other items.
Over time the recommended items will become out of date, since you never mix
in new items; you may keep recommending an iPhone 5 even after it has been
discontinued.

If you know the category of an item, filter the recs by that category or by
related categories. You are already doing this for #2 below, so if you drop #1
there is no problem, correct? Users will not see the water heater with the
iPhone.
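
For example, if the catalog is available as an (item_id, category) file, a
minimal Pig sketch of that filter, in the style of the script further down the
thread, could look like this (paths, aliases and field names are placeholders,
not taken from your pipeline):

-- similarity output: viewed item, recommended item, score
recs = LOAD '$recs' USING PigStorage('\t')
       AS (item_id:long, rec_item_id:long, score:double);

-- product catalog: item and its category, loaded twice so it can be joined twice
catA = LOAD '$catalog' USING PigStorage('\t') AS (item_id:long, category:chararray);
catB = LOAD '$catalog' USING PigStorage('\t') AS (item_id:long, category:chararray);

-- attach the category of the viewed item
withA = JOIN recs BY item_id, catA BY item_id;
flatA = FOREACH withA GENERATE recs::item_id     AS item_id,
                               recs::rec_item_id AS rec_item_id,
                               recs::score       AS score,
                               catA::category    AS item_category;

-- attach the category of the recommended item, keep only matching categories
withB    = JOIN flatA BY rec_item_id, catB BY item_id;
filtered = FILTER withB BY flatA::item_category == catB::category;

out = FOREACH filtered GENERATE flatA::item_id, flatA::rec_item_id, flatA::score;
STORE out INTO '$filtered_recs' USING PigStorage('\t');

A table of related categories could be joined in the same way if an exact
category match turns out to be too strict.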

Question) Why do you get a water heater with an iPhone? Unless there is a bug
somewhere, the data says that similar people looked at both. Item view data is
not very predictive, and in any case you will get this type of thing if it
exists in the user behavior. There may even be a correlation between the need
for an iPhone and a water heater that you don’t know about, or it may just be
a coincidence. But for now let’s say it’s an anomaly in the data and just
filter those out by category.

What I was beginning to say is that it sounds like you have an ecom site. If
so, do you have purchase data? Purchase data is usually much, much better than
item view data. People tend to look at a lot of things, but a purchase signals
a much stronger preference than merely viewing something.

The first rule of making a good recommender is to find the best action, the
one that expresses user preference in the strongest possible way. For
ecommerce that usually means a purchase. Once you have that working you can
add more actions, but only with cross-cooccurrence; adding them by weighting
will not work with this type of recommender, it will only pollute your strong
data with weaker actions.
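
To give a rough idea, here is approximately what that looks like with the
pre-alpha spark-itemsimilarity, assuming a single log of user,action,item
lines (the paths and action names are made up, and the option names may still
change while it is pre-alpha, so check the --help output):

mahout spark-itemsimilarity \
  --input /logs/actions.csv \
  --output /recs/indicators \
  --master spark://your-spark-master:7077 \
  --inDelim "," \
  --rowIDColumn 0 \
  --filterColumn 1 \
  --itemIDColumn 2 \
  --filter1 purchase \
  --filter2 view

--filter1 names the primary action and produces the cooccurrence indicators;
--filter2 names the secondary action and produces the cross-cooccurrence
indicators. Your own user and item IDs are preserved in the output, so the
ID-translation steps go away.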

On Aug 19, 2014, at 8:18 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:

Hi, what is "emon"?
1. I do create "look-with" recommendations. Really, it's just the "raw" output
from itemSimilarityJob with booleanData=true and LLR as the similarity function
(your suggestion).
2. I do create "similar" recommendations. I apply a category filter before
serving the recommendations.

"look-with", means other users watched iPhone case and other accessory with
iphone. I do have accessory for iPhone here, but also water heating
device...
similar - means show only other smarphones as recommendations to iPhone.

Right now the problem is in water heating device in 'look-with' (category
filter not applied). How can I put away such kind of recommendations and
why Do I get them?



2014-08-19 18:01 GMT+04:00 Pat Ferrel <pat.fer...@gmail.com>:

> That sounds much better.
> 
> Do you have metadata like product category? Electronics vs. home
> appliances? One easy thing to do, if you have categories in your catalog, is
> to filter by the same category as the item being viewed.
> 
> BTW it sounds like you have an emon
> On Aug 19, 2014, at 12:53 AM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
> 
> Hi, I've used LLR with the properties you've suggested.
> Right now I have a problem:
> A water heating device (
> http://www.vasko.ru/upload/iblock/58a/58ac7efe640a551f00bec156601e9035.jpg
> )
> is recommended for the iPhone, and it has one of the highest scores.
> Good things:
> iPhone cases (
> 
> https://www.apple.com/euro/iphone/c/generic/accessories/images/accessories_iphone_5s_case_colors.jpg
> )
> are recommended for the iPhone. That's good.
> Other smartphones are recommended for the iPhone. That's good.
> Other iPhones are recommended for the iPhone. That's good: 16GB recommended
> with 32GB, etc.
> 
> What could be the reason for recommending a water heating device for the
> iPhone? The iPhone is one of the most popular items. Does this mean a lot of
> people viewed the iPhone together with the water heating device?
> 
> 
> 
> 2014-08-18 20:15 GMT+04:00 Pat Ferrel <pat.fer...@gmail.com>:
> 
>> Oh, and as to using different algorithms, this is an “ensemble” method. In
>> the paper they are talking about using widely differing algorithms like ALS
>> + cooccurrence + … This technique was used to win the Netflix prize, but in
>> practice the improvements may be too small to warrant running multiple
>> pipelines. In any case it isn’t the first improvement you may want to try.
>> For instance, your UI will have a drastic effect on how well your recs do,
>> and there are other, much easier techniques that we can talk about once you
>> get the basics working.
>> 
>> 
>> On Aug 18, 2014, at 9:04 AM, Pat Ferrel <pat.fer...@gmail.com> wrote:
>> 
>> When beginning to use a recommender from Mahout I always suggest you start
>> from the defaults. These often give the best results; then tune afterwards
>> to improve.
>> 
>> Your intuition is correct that multiple actions can be used to improve
>> results, but get the basics working first. The easiest way to use multiple
>> actions is with spark-itemsimilarity, so since you are using mapreduce for
>> now, just use one action.
>> 
>> I would not try to combine the results from two similarity measures; there
>> is no benefit, since LLR is better than any of them, at least I’ve never
>> seen it lose. Below is my experience from trying many of the similarity
>> metrics on exactly the same data. I did cross-validation with precision
>> (MAP, mean average precision). LLR wins in other cases I’ve tried too, so
>> LLR is the only method presently used in the Spark version of
>> itemsimilarity.
>> 
>> <map-comparison.xlsx 2014-08-18 08-50-44 2014-08-18 08-51-53.jpeg>
>> 
>> If you still get weird results, double check your ID mapping. Run a small
>> bit of data through and spot-check the mapping by hand.
>> 
>> At some point you will want to create a cross-validation test. This is
>> good as a sort of integration sanity check when making changes to the
>> recommender. You run cross-validation on standard test data to see if
>> the score changes drastically between releases; big changes may indicate a
>> bug. At the beginning it will also help you tune, as in the case above
>> where it helped decide on LLR.
>> 
>> 
>> 
>> On Aug 18, 2014, at 1:43 AM, Serega Sheypak <serega.shey...@gmail.com>
>> wrote:
>> 
>> Thank you very much. I'll do what you are saying in bullets 1...5 and try
>> again.
>> 
>> I also tried:
>> 1. calc the data using SIMILARITY_COSINE
>> 2. calc the same data using SIMILARITY_COOCCURRENCE
>> 3. join #1 and #2 where cooccurrence >= $threshold
>> 
>> where threshold is some empirical integer value. I've used "2". The idea is
>> to filter out item pairs which rarely or never occur together...
>> Please see this link:
>> http://blog.godatadriven.com/merge-mahout-recommendations-results-from-different-algorithms.html
>> 
>> If I replace SIMILARITY_COSINE with LLR and booleanData=true, does this
>> approach still make sense, or is it a useless waste of time?
>> 
>> "What do you mean the similar items are terrible? How are you measuring
>> that?" I have only eyeball testing.
>> I did automate preparation -> calculation -> HBase upload -> web-app
>> serving, but I didn't automate the testing.
>> 
>> 
>> 
>> 
>> 2014-08-18 5:16 GMT+04:00 Pat Ferrel <p...@occamsmachete.com>:
>> 
>>> The things that stand out:
>>> 
>>> 1) Remove your maxSimilaritiesPerItem option! A value of 50000 will _kill_
>>> performance and give no gain; leave this setting at the default of 500.
>>> 2) Use only one action. What do you want the user to do? Do you want them
>>> to read a page? Then train on item page views. If those pages lead to a
>>> purchase, then you want to recommend purchases, so train on user purchases.
>>> 3) Remove your minPrefsPerUser option. This should never be 0 or it will
>>> leave users in the training data that have no data, which may contribute
>>> to longer runs with no gain.
>>> 4) This is a pretty small Hadoop cluster for the size of your data, but I
>>> bet changing #1 will noticeably reduce the runtime.
>>> 5) Change --similarityClassname to SIMILARITY_LOGLIKELIHOOD.
>>> 6) Remove your --booleanData option since LLR ignores weights.
>>> 
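>>> Putting 1), 3), 5) and 6) together, the job invocation boils down to
>>> roughly the following (shown as a plain mahout command; the same flags map
>>> one-to-one onto your Oozie <arg> elements further down, and
>>> maxSimilaritiesPerItem / minPrefsPerUser are simply left at their
>>> defaults):
>>> 
>>>   mahout itemsimilarity \
>>>     --input ${tempDir}/to-mahout-id/projPrefs \
>>>     --output ${mahoutOutputDir}/primary \
>>>     --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>     --tempDir ${tempDir}/run-mahout-ItemSimilarityJob/primary
>>> 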
>>> Remember that this is not the same as personalized recommendations. This
>>> method alone will show the same “similar items” for all users.
>>> 
>>> Sorry, but both your “recommendation” types sound like the same thing.
>>> Both item page views _and_ clicks on recommended items lead to an item
>>> page view, so you have two actions that lead to the same thing, right?
>>> Just train on item page views (unless you really want the user to make a
>>> purchase).
>>> 
>>> What do you mean the similar items are terrible? How are you measuring
>>> that? Are you doing cross-validation measuring precision, or A/B testing?
>>> What looks bad to you may be good; the eyeball test is not always reliable.
>>> If they are coming up completely crazy or random then you may have a bug
>>> in your ID translation logic.
>>> 
>>> It sounds like you have enough data to produce good results.
>>> 
>>> On Aug 17, 2014, at 11:14 AM, Serega Sheypak <serega.shey...@gmail.com>
>>> wrote:
>>> 
>>> 1. 7 nodes, 4 CPUs per node, 48 GB RAM, 2 HDDs for MR and HDFS. Not too
>>> much, but enough for a start.
>>> 2. I run it as an Oozie action.
>>> <action name="run-mahout-primary-similarity-ItemSimilarityJob">
>>>     <java>
>>>         <job-tracker>${jobTracker}</job-tracker>
>>>         <name-node>${nameNode}</name-node>
>>>         <prepare>
>>>             <delete path="${mahoutOutputDir}/primary" />
>>>             <delete
>>> path="${tempDir}/run-mahout-ItemSimilarityJob/primary" />
>>>         </prepare>
>>>         <configuration>
>>>             <property>
>>>                 <name>mapred.queue.name</name>
>>>                 <value>default</value>
>>>             </property>
>>> 
>>>         </configuration>
>>> 
>>>         <main-class>org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob</main-class>
>>>         <arg>--input</arg>
>>>         <arg>${tempDir}/to-mahout-id/projPrefs</arg><!-- dense user_id,
>>>             item_id, pref [pref is 3 or 5: 3 is a VIEW of an item, 5 is a
>>>             CLICK on a recommendation; an attempt to increase the quality
>>>             of the recommender...] -->
>>> 
>>>         <arg>--output</arg>
>>>         <arg>${mahoutOutputDir}/primary</arg>
>>> 
>>>         <arg>--similarityClassname</arg>
>>>         <arg>SIMILARITY_COSINE</arg>
>>> 
>>>         <arg>--maxSimilaritiesPerItem</arg>
>>>         <arg>50000</arg>
>>> 
>>>         <arg>--minPrefsPerUser</arg>
>>>         <arg>0</arg>
>>> 
>>>         <arg>--booleanData</arg>
>>>         <arg>false</arg>
>>> 
>>>         <arg>--tempDir</arg>
>>>         <arg>${tempDir}/run-mahout-ItemSimilarityJob/primary</arg>
>>> 
>>>     </java>
>>>     <ok to="to-narrow-table"/>
>>>     <error to="kill"/>
>>> </action>
>>> 
>>> 3) RANK does it, here is a script:
>>> 
>>> --user, item, pref previously prepared by hive
>>> user_item_pref = LOAD '$user_item_pref' using PigStorage(',') as
>>> (user_id:chararray, item_id:long, pref:double);
>>> 
>>> --get distinct user from the whole input
>>> distUserId = distinct(FOREACH user_item_pref GENERATE user_id);
>>> 
>>> --get distinct item from the whole input
>>> distItemId = distinct(FOREACH user_item_pref GENERATE item_id);
>>> 
>>> --rank user 1....N
>>> rankUsers_ = RANK distUserId;
>>> rankUsers = FOREACH rankUsers_ GENERATE $0 as rank_id, user_id;
>>> 
>>> --rank items 1....M
>>> rankItems_ = RANK distItemId;
>>> rankItems = FOREACH rankItems_ GENERATE $0 as rank_id, item_id;
>>> 
>>> --join and remap natural user_id, item_id, to RANKS: 1.N, 1..M
>>> joinedUsers = join user_item_pref by user_id, rankUsers by user_id USING
>>> 'skewed';
>>> joinedItems = join joinedUsers by user_item_pref::item_id, rankItems by
>>> item_id using 'replicated';
>>> 
>>> projPrefs = FOREACH joinedItems GENERATE
>>>                 joinedUsers::rankUsers::rank_id   as user_id,
>>>                 rankItems::rank_id                as item_id,
>>>                 joinedUsers::user_item_pref::pref as pref;
>>> 
>>> --store mapping for later remapping from RANK back to natural values
>>> STORE (FOREACH rankUsers GENERATE rank_id, user_id) INTO '$rankUsers'
>>>     USING PigStorage('\t');
>>> STORE (FOREACH rankItems GENERATE rank_id, item_id) INTO '$rankItems'
>>>     USING PigStorage('\t');
>>> STORE (FOREACH projPrefs GENERATE user_id, item_id, pref) INTO '$projPrefs'
>>>     USING PigStorage('\t');
>>> 
>>> 4) I've seen this idea in a different discussion, that different weights
>>> for different actions are not good. Sorry, I don't understand what you
>>> suggest.
>>> I have two kinds of actions: user viewed an item, user clicked on a
>>> recommended item (the recommended item produced by my item-similarity
>>> system).
>>> I want to produce two kinds of recommendations:
>>> 1. current item + recommend other items which other users view in
>>> conjunction with the current item
>>> 2. similar items: recommend items similar to the currently viewed item.
>>> What can I try?
>>> Is LLR (http://en.wikipedia.org/wiki/Log-likelihood_ratio) the same as
>>> SIMILARITY_LOGLIKELIHOOD?
>>> 
>>> Right now I get awful recommendations and I can't understand what I can
>>> try next :((((((((((((
>>> 
>>> 
>>> 2014-08-17 19:02 GMT+04:00 Pat Ferrel <pat.fer...@gmail.com>:
>>> 
>>>> 1) How many cores in the cluster? The whole idea behind mapreduce is that
>>>> if you buy more CPUs you get a nearly linear decrease in runtime.
>>>> 2) What is your Mahout command line with options, or how are you invoking
>>>> Mahout? I have seen the Mahout mapreduce recommender take this long, so
>>>> we should check what you are doing with downsampling.
>>>> 3) Do you really need to RANK your IDs? That's a full sort. When using
>>>> Pig I usually take the DISTINCT ones and assign an incrementing integer
>>>> as the corresponding Mahout ID.
>>>> 4) Your #2, assigning different weights to different actions, usually
>>>> does not work. I've done this before, compared offline metrics, and seen
>>>> precision go down. I'd get this working using only your primary action
>>>> first. What are you trying to get the user to do? View something, buy
>>>> something? Use that action as the primary preference and start out with a
>>>> weight of 1 using LLR. With LLR the weights are not used anyway, so your
>>>> data may not produce good results with mixed actions.
>>>> 
>>>> A plug for the (admittedly pre-alpha) spark-itemsimilarity:
>>>> 1) the output of your step 2 (the triples) can be ingested directly and
>>>> it will create the output.
>>>> 2) multiple actions can be used with cross-cooccurrence, not by guessing
>>>> at weights.
>>>> 3) the output has your application-specific IDs preserved.
>>>> 4) it's about 10x faster than mapreduce and does away with your ID
>>>> translation steps.
>>>> 
>>>> One caveat is that your cluster machines will need lots of memory. I have
>>>> 8-16g on mine.
>>>> 
>>>> On Aug 17, 2014, at 1:26 AM, Serega Sheypak <serega.shey...@gmail.com>
>>>> wrote:
>>>> 
>>>> 1. I collect preferences for items using a 60-day sliding window: today -
>>>> 60 days.
>>>> 2. I prepare triples of user_id, item_id, discrete_pref_value (3 for an
>>>> item view, 5 for clicking the recommendation block; the idea is to give
>>>> more value to recommendations which attract visitor attention). I get
>>>> ~20,000,000 lines with ~1,000,000 distinct items and ~2,000,000 distinct
>>>> users.
>>>> 3. I use the Apache Pig RANK function to rank all distinct user_ids.
>>>> 4. I do the same for item_id.
>>>> 5. I join the input dataset with the ranked datasets and provide Mahout
>>>> with input of dense integer user_id, item_id.
>>>> 6. I take the Mahout output and join the integer item_id back to get the
>>>> natural key values.
>>>> 
>>>> Steps #1-2 take ~40 min,
>>>> steps #3-5 take ~1 hour,
>>>> the Mahout calculation takes ~3 hours.
>>>> 
>>>> 
>>>> 
>>>> 2014-08-17 10:45 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:
>>>> 
>>>>> This really doesn't sound right.  It should be possible to process
>>>>> almost a thousand times that much data every night without that much
>>>>> trouble.
>>>>> 
>>>>> How are you preparing the input data?
>>>>> 
>>>>> How are you converting to Mahout IDs?
>>>>> 
>>>>> Even using Python, you should be able to do the conversion in just a few
>>>>> minutes without any parallelism whatsoever.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <serega.shey...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi, we are trying to calculate ItemSimilarity.
>>>>>> Right now we have 2*10^7 input lines. I provide the input data as raw
>>>>>> text each day to recalculate item similarities. We get +100..1000 new
>>>>>> items each day.
>>>>>> 1. It takes too much time to prepare the input data.
>>>>>> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
>>>>>> 
>>>>>> Is there any possibility to provide data to the Mahout mapreduce
>>>>>> ItemSimilarity using some binary format with compression?
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
