Thanks! I'll report this evening.

Are there any articles about data preparation for Mahout item
recommendation? There are many books, but most of them are copy-pastes of
the javadoc and the guides from the Mahout site.
I'm -1 at math, so my challenges are:

1. Approaches for data cleaning: do I have to apply dead-simple statistical
rules?
"The empirical rule also states that approximately 95 percent of the data
values will fall within two standard deviations from the mean."
So if my user visits follow a normal distribution, does it make sense to
drop everything outside that range? The idea is to filter out all the noise.
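
To make question 1 concrete, here is a minimal Python sketch of the "two standard deviations" rule applied to per-user visit counts. The function name and the sample data are made up for illustration, and it assumes the counts are roughly normal; for heavily skewed counts this is only a rough heuristic.

```python
import statistics


def within_two_sigma(visits_per_user):
    """Keep users whose visit count lies within mean +/- 2 standard deviations."""
    counts = list(visits_per_user.values())
    mean = statistics.mean(counts)
    std = statistics.pstdev(counts)  # population standard deviation
    low, high = mean - 2 * std, mean + 2 * std
    return {user: c for user, c in visits_per_user.items() if low <= c <= high}


# Hypothetical data: user 999 behaves like a crawler, not like a shopper.
visits = {11: 4, 123: 5, 333: 6, 444: 5, 555: 4, 666: 6, 777: 5, 888: 5, 999: 500}
kept = within_two_sigma(visits)
```

Note that a single extreme value inflates the standard deviation itself, so on small samples the rule can miss outliers it was meant to catch.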

2. similarityClassname: I don't have any intuition here... I see that people
use SIMILARITY_LOGLIKELIHOOD and SIMILARITY_PEARSON_CORRELATION.
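
For some intuition on question 2: as far as I understand, a log-likelihood similarity is built only from co-occurrence counts (how many users touched both items, only one of them, or neither), so the preference weights play no role in it, while a Pearson correlation is computed from the preference values themselves. Below is a minimal Python sketch of the log-likelihood ratio (G² statistic) for a 2x2 count table. It is an illustration, not Mahout's code; the function names and the counts are made up.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: users with both items, k12: only item A,
    k21: only item B, k22: users with neither item."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


independent = log_likelihood_ratio(10, 10, 10, 10)  # items co-occur by chance
associated = log_likelihood_ratio(10, 0, 0, 10)     # items always appear together
```

The score is near zero when the two items co-occur about as often as chance predicts and grows as the co-occurrence becomes more surprising.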


2014-07-21 11:18 GMT+04:00 Peng Zhang <pzhang.x...@gmail.com>:

> Serega,
>
> See the last line on how to pass outputPathForSimilarityMatrix options to
> the recommenditembased command:
>
> sudo -u oozie mahout recommenditembased \
>                    --input visited_items_with_inverted_items \
>                    --output result \
>                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>                    --usersFile inverted_items \
>                    --numRecommendations 500 \
>                    --booleanData false \
>                    --maxPrefsPerUser 100 \
>                    --maxSimilaritiesPerItem 500 \
>                    --minPrefsPerUser 0 \
>                    --maxPrefsPerUserInItemSimilarity 30 \
>                    --threshold 0.91 \
>                    --tempDir temp \
>                    --outputPathForSimilarityMatrix similarityMatri
>
>
> Peng Zhang
> pzhang.x...@gmail.com
>
>
>
>
>
> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
> > I've inspected the code; our approach wouldn't work with booleanData=false.
> > We do calculate item similarity in the wrong way...(((
> > Thank you
> > 1. We provide a "fake" user_id and pass --usersFile in order to get
> > recommendations for the "fake" user_id, where user_id is a negative item_id.
> > It worked when we provided user_id->item_id pairs without preferences.
> > 2. Our target is to get item similarities. We tried
> > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob, but it
> > returns worse results compared to RecommenderJob with our "fake" user_id
> > (inverted item_id).
> >
> > 1. I'll try the option you provided.
> > 2. I will remove the input with fake user_ids and the usersFile with these
> > fake ids.
> >
> > 3. https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> > I don't understand how to pass the --outputPathForSimilarityMatrix option to
> > RecommenderJob.
> >
> >
> > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pzhang.x...@gmail.com>:
> >
> >> Serega,
> >>
> >> I have two comments:
> >> 1. Don’t use negative user ids. Since Mahout uses the user id as well as
> >> the item id as the row/column index, you’d better use 0, 1, 2, etc. as ids.
> >> 2. If you want to get the item similarity information, you can use
> >> --outputPathForSimilarityMatrix in the command.
> >>
> >> Regards,
> >> Peng Zhang
> >> M: +86 186-1658-7856
> >> pzhang.x...@gmail.com
> >>
> >>
> >>
> >>
> >>
> >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.shey...@gmail.com>
> >> wrote:
> >>
> >>> All bad things happen here:
> >>>
> >>>
> >>>
> >>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
> >>> User: oozie
> >>> Process User: oozie
> >>> Group: oozie
> >>> Mapper Class: PartialMultiplyMapper
> >>> Reducer Class: AggregateAndRecommendReducer
> >>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
> >>> Job Output Directory: hdfs://nameservice1/itemrec/output/
> >>>
> >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
> >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
> >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
> >>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
> >>>
> >>> Why does Mahout return 0 rows? It works when booleanData=true (preferences
> >>> are ignored...?)
> >>>
> >>>
> >>>
> >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.shey...@gmail.com>:
> >>>
> >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> >>>> users_file:
> >>>> --inverted_item_id
> >>>> -1
> >>>> -2
> >>>> -3
> >>>> -4
> >>>>
> >>>> users_items_prefs
> >>>> --inverted item_id
> >>>> -1 1 1.0
> >>>> -2 2 1.0
> >>>> -3 3 1.0
> >>>> -4 4 1.0
> >>>> --user_id item_id pref_value
> >>>> 11   1 1.6
> >>>> 11   2 1.6
> >>>> 123 3 2.0
> >>>> 123 4 2.0
> >>>> 333 1 2.0
> >>>> 333 2 1.6
> >>>> --e.t.c.
> >>>>
> >>>> if I set --booleanData true
> >>>> then mahout returns the result.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.mussel...@gmail.com>:
> >>>>
> >>>>> I'm confused about how you're constructing the user file, and why there
> >>>>> are negated item ids here.
> >>>>>
> >>>>> Can you post some more details please, including Mahout version and some
> >>>>> sample data sets?
> >>>>>
> >>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi, I'm trying to compute item similarity.
> >>>>>> I gather the items which users visit during shopping and then create a file:
> >>>>>> user_id, item_id, weight (where weight can be one of [1.0, 1.6, 1.9],
> >>>>>> depending on the user action type and data source)
> >>>>>> UNION
> >>>>>> -item_id, item_id, 1.0 (from the items dictionary)
> >>>>>>
> >>>>>> and I provide a usersFile, where user_id = -item_id.
> >>>>>>
> >>>>>> The idea is to get item similarity. If any user visits an item named "A", I
> >>>>>> want to show him items "B", "c", "xxx" using the preferences of other users.
> >>>>>>
> >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> >>>>>>
> >>>>>> Here are my settings:
> >>>>>>
> >>>>>>
> >>>>>> sudo -u oozie mahout recommenditembased \
> >>>>>>                  --input visited_items_with_inverted_items \
> >>>>>>                  --output result \
> >>>>>>                  --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> >>>>>>                  --usersFile inverted_items \
> >>>>>>                  --numRecommendations 500 \
> >>>>>>                  --booleanData false \
> >>>>>>                  --maxPrefsPerUser 100 \
> >>>>>>                  --maxSimilaritiesPerItem 500 \
> >>>>>>                  --minPrefsPerUser 0 \
> >>>>>>                  --maxPrefsPerUserInItemSimilarity 30 \
> >>>>>>                  --threshold 0.91 \
> >>>>>>                  --tempDir temp
> >>>>>>
> >>>>>> Some counters... I don't understand what they mean....
> >>>>>>
> >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
> >>>>>>
> >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
> >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
> >>>>>>
> >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
> >>>>>>
> >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
> >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
> >>>>>>
> >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
> >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
> >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
> >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>
> >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
> >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>
> >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
> >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
> >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
> >>>>>>
> >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
> >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
> >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
> >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
> >>>>>>
> >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
> >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
> >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
> >>>>>> --------
> >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
> >>>>>> --------
> >>>>>>
> >>>>>> why 0???
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
>
