My personal comments:
1. Data cleansing. One beautiful characteristic of Mahout’s CF recommendation 
is the simplicity of its input data, often just three columns (user, item, 
preference). If any value is missing, just leave the record out of the input 
file. Therefore I don’t see any need for data cleaning, provided that the 
application has recorded user-item-preference correctly and you have 
translated user-id and item-id properly.
2. Log-likelihood often performs better than Pearson correlation in Mahout’s 
Collaborative Filtering. The former works on discrete data (co-occurrence 
counts) and the latter on continuous values (the preference ratings). Refer to 
Ted’s popular post “Surprise and Coincidence” for more on the former.
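
To make the discrete-vs-continuous point concrete, here is a rough Python sketch of the log-likelihood ratio computed from co-occurrence counts, using the entropy form described in that post. The function names and the toy user sets are made up for illustration, and the final 1 - 1/(1 + LLR) scaling is only how I understand Mahout turns the ratio into a similarity, so treat it as an assumption and not as the library's exact code.

```python
import math


def x_log_x(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """LLR for a 2x2 contingency table of co-occurrence counts.

    k11: users who touched both items, k12: only item A,
    k21: only item B, k22: neither item.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def llr_similarity(users_a, users_b, num_users):
    """Map the LLR of two items' user sets into the range [0, 1).

    The 1 - 1/(1 + llr) scaling is an assumption about how the
    ratio is turned into a similarity score.
    """
    k11 = len(users_a & users_b)
    k12 = len(users_a - users_b)
    k21 = len(users_b - users_a)
    k22 = num_users - k11 - k12 - k21
    llr = log_likelihood_ratio(k11, k12, k21, k22)
    return 1.0 - 1.0 / (1.0 + llr)


if __name__ == "__main__":
    # Hypothetical data: ids of users who visited each item.
    item_a = {11, 333, 123}
    item_b = {11, 333}
    print(llr_similarity(item_a, item_b, num_users=10))
```

Note that only the counts enter the formula; the preference values themselves play no role in this measure.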


Peng Zhang
pzhang.x...@gmail.com





On Jul 21, 2014, at 3:37 PM, Serega Sheypak <serega.shey...@gmail.com> wrote:

> Thanks! I'll report this evening.
> 
> Are there any articles about data preparation for mahout item
> recommendation? There are many books but most of them are copy-paste of
> javadoc and guides from mahout site.
> I'm -1 at math, my challenges are:
> 
> 1. approaches for data cleaning, do I have to apply dead-simple statistical
> rules?
> "The empirical rule also states that approximately 95 percent of the data
> values will fall within two standard deviations from the mean."
> So if my user visits follow a normal distribution, does it make
> sense? The idea is to throw away all the noise.
> 
> 2. similarityClassname - don't have any intuition here... I see that people
> use SIMILARITY_LOGLIKELIHOOD and PEARSON
> 
> 
> 2014-07-21 11:18 GMT+04:00 Peng Zhang <pzhang.x...@gmail.com>:
> 
>> Serega,
>> 
>> See the last line for how to pass the --outputPathForSimilarityMatrix option
>> to the recommenditembased command:
>> 
>> sudo -u oozie mahout recommenditembased \
>>                   --input visited_items_with_inverted_items \
>>                   --output result \
>>                   --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>                   --usersFile inverted_items \
>>                   --numRecommendations 500 \
>>                   --booleanData false \
>>                   --maxPrefsPerUser 100 \
>>                   --maxSimilaritiesPerItem 500 \
>>                   --minPrefsPerUser 0 \
>>                   --maxPrefsPerUserInItemSimilarity 30 \
>>                   --threshold 0.91 \
>>                   --tempDir temp \
>>                   --outputPathForSimilarityMatrix similarityMatri
>> 
>> 
>> Peng Zhang
>> pzhang.x...@gmail.com
>> 
>> 
>> 
>> 
>> 
>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.shey...@gmail.com>
>> wrote:
>> 
>>> I've inspected the code; our approach wouldn't work with booleanData=false.
>>> We do calculate item similarity in the wrong way...(((
>>> Thank you
>>> 1. We provide a "fake" user_id and pass --usersFile in order to get
>>> recommendations for the "fake" user_id, where user_id is a negative item_id.
>>> It worked when we provided user_id->item_id pairs without preferences.
>>> 2. Our target is to get item similarities. We tried
>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
>>> returns bad results compared to RecommenderJob with our "fake" user_id
>>> (inverted item_id)
>>> 
>>> 1. I'll try the option you provided.
>>> 2. I will remove the input with fake user_ids and the usersFile with these
>>> fake ids.
>>> 
>>> 3.
>>> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>> I don't understand how to pass the --outputPathForSimilarityMatrix option to
>>> RecommenderJob
>>> 
>>> 
>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <pzhang.x...@gmail.com>:
>>> 
>>>> Serega,
>>>> 
>>>> I have two comments:
>>>> 1. Don’t use negative user ids. Since Mahout uses user id as well as item
>>>> id as the row/column index, you’d better use 0, 1, 2, etc. as ids.
>>>> 2. If you want to get the item similarity information, you can use
>>>> --outputPathForSimilarityMatrix in the command.
>>>> 
>>>> Regards,
>>>> Peng Zhang
>>>> M: +86 186-1658-7856
>>>> pzhang.x...@gmail.com
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.shey...@gmail.com>
>>>> wrote:
>>>> 
>>>>> All bad things happen here:
>>>>> 
>>>>> 
>>>>> 
>>>>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
>>>>> User: oozie
>>>>> Process User: oozie
>>>>> Group: oozie
>>>>> Mapper Class: PartialMultiplyMapper
>>>>> Reducer Class: AggregateAndRecommendReducer
>>>>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
>>>>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>>>>> 
>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
>>>>> 
>>>>> Why does mahout return 0 rows? It works when booleanData=true (preferences
>>>>> are ignored...?)
>>>>> 
>>>>> 
>>>>> 
>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.shey...@gmail.com>:
>>>>> 
>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>>>> users_file:
>>>>>> --inverted_item_id
>>>>>> -1
>>>>>> -2
>>>>>> -3
>>>>>> -4
>>>>>> 
>>>>>> users_items_prefs
>>>>>> --inverted item_id
>>>>>> -1 1 1.0
>>>>>> -2 2 1.0
>>>>>> -3 3 1.0
>>>>>> -4 4 1.0
>>>>>> --user_id item_id pref_value
>>>>>> 11   1 1.6
>>>>>> 11   2 1.6
>>>>>> 123 3 2.0
>>>>>> 123 4 2.0
>>>>>> 333 1 2.0
>>>>>> 333 2 1.6
>>>>>> --e.t.c.
>>>>>> 
>>>>>> if I set --booleanData true
>>>>>> then mahout returns the result.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.mussel...@gmail.com>:
>>>>>> 
>>>>>>> I'm confused about how you're constructing the user file, and why there
>>>>>>> are negated item ids here.
>>>>>>> 
>>>>>>> Can you post some more details please, including Mahout version and some
>>>>>>> sample data sets?
>>>>>>> 
>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.shey...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Hi, I'm trying to create item similarity.
>>>>>>>> I gather items which users visit during shopping and then create a file:
>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends
>>>>>>>> on user action type and data source)
>>>>>>>> UNION
>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>>>>>> 
>>>>>>>> and I do provide a usersFile, where user_id = -item_id
>>>>>>>> 
>>>>>>>> The idea is to get item similarity. If any user visits an item named "A",
>>>>>>>> I want to show him items "B", "c", "xxx" using preferences of other users.
>>>>>>>> 
>>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>>>>>>>> 
>>>>>>>> Here are my settings:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>>                 --input visited_items_with_inverted_items \
>>>>>>>>                 --output result \
>>>>>>>>                 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>>>>                 --usersFile inverted_items \
>>>>>>>>                 --numRecommendations 500 \
>>>>>>>>                 --booleanData false \
>>>>>>>>                 --maxPrefsPerUser 100 \
>>>>>>>>                 --maxSimilaritiesPerItem 500 \
>>>>>>>>                 --minPrefsPerUser 0 \
>>>>>>>>                 --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>>>                 --threshold 0.91 \
>>>>>>>>                 --tempDir temp
>>>>>>>> 
>>>>>>>> Some counters... I don't get what they mean....
>>>>>>>> 
>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:   org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:   org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
>>>>>>>> 
>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>> 
>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>> 
>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
>>>>>>>> 
>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>> 
>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
>>>>>>>> --------
>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
>>>>>>>> --------
>>>>>>>> 
>>>>>>>> why 0???
>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
