Sorry, I haven’t read this thread carefully, but it looks like you may be using 
the wrong IDs.

For most Mahout jobs you have to prepare your data to use Mahout IDs. You do 
this by looking at each datum and, as you see a new unique application-specific 
user or item ID, giving it the next Mahout ID, starting from 0. Mahout IDs can 
then be thought of as row and column numbers in a matrix: the Mahout IDs for 
rows will be 0 through (number of rows - 1), and likewise for columns.

This always requires that you translate into Mahout IDs and then, after the job 
is run, translate back into your application IDs. You need a bi-directional 
dictionary of some type. I use a HashBiMap from Guava.
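
A minimal sketch of such a dictionary, using only the JDK rather than Guava 
(HashBiMap gives you the same two directions through put() and inverse()). The 
class and method names here are made up for illustration, and it assumes your 
application IDs are strings:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps application IDs to dense Mahout IDs (0, 1, 2, ...) and back. */
final class IdDictionary {
    // application ID -> Mahout ID
    private final Map<String, Integer> toMahout = new HashMap<>();
    // Mahout ID -> application ID (the list index is the Mahout ID)
    private final List<String> toApp = new ArrayList<>();

    /** Returns the Mahout ID for an application ID, assigning the next free one on first sight. */
    int toMahoutId(String appId) {
        Integer id = toMahout.get(appId);
        if (id == null) {
            id = toApp.size();
            toMahout.put(appId, id);
            toApp.add(appId);
        }
        return id;
    }

    /** Translates a Mahout ID found in the job output back to the application ID. */
    String toAppId(int mahoutId) {
        return toApp.get(mahoutId);
    }

    /** Number of distinct IDs seen so far, i.e. the number of rows (or columns). */
    int size() {
        return toApp.size();
    }
}
```

You would keep one instance for users and a separate one for items, run every 
input line through toMahoutId() before the job, and run the output through 
toAppId() afterwards.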

Also I’d avoid the threshold for now. If you get that wrong it will mess things 
up badly and is very hard to tune. It’s there for completeness but I never use 
it.


On Jul 25, 2014, at 12:55 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:

Hi, nothing helps...
I do use mahout 0.9 compiled for CDH 4.7
I do provide only positive values
I do use itemsimilarityJob and do get 2000 similarities for 1400 unique
items
Input data is:
16*10^6 preferences
4*10^6 users
0.6*10^ items
I do use Pearson correlation and preference values are: 1.0 and 2.0


2014-07-22 9:32 GMT+04:00 Serega Sheypak <serega.shey...@gmail.com>:

> Ok, I have recompiled mahout 0.9 for CDH 4.7. I'll try this evening.
> Right now I don't see how can it help me. As far as I know the stuff I try
> to use is pretty old and stable.
> looks like I do apply it in a wrong way.
> 
> There is an option for recommenditembased named "--threshold". I do
> provide data for recommenditembased with preference values in range
> [1.1..2.0].
> I set --threshold to 1.2
> --threshold is absolute and can be from [1.1 . .2+] or it's relative and
> can be [0.0 .. 0.99999]?
> 
> 
> 2014-07-22 3:54 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:
> 
>> That version is no longer supported.  You should upgrade to 0.9
>> 
>> 
>> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
>> 
>>> 0.7-cdh4.7.0
>>> Anyway, recommenditembased does produce these catalogs:
>>> 
>>> /recommenditembased/temp/maxValues.bin
>>> /recommenditembased/temp/norms.bin
>>> /recommenditembased/temp/numNonZeroEntries.bin
>>> /recommenditembased/temp/pairwiseSimilarity
>>> /recommenditembased/temp/partialMultiply
>>> /recommenditembased/temp/prePartialMultiply1
>>> /recommenditembased/temp/prePartialMultiply2
>>> /recommenditembased/temp/preparePreferenceMatrix
>>> /recommenditembased/temp/similarityMatrix
>>> /recommenditembased/temp/weights
>>> 
>>> I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
>>> need. Right now I try to read it using
>>> 
>>> matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
>>> com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>    '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
>>>    '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
>>> )  as (intId: int, vector:tuple(cardinality:int,
>>> entries:bag{t:tuple(some_id:long, some_value:double)}));
>>> 
>>> 
>>> Looks like the vector is empty... Or I am doing something wrong.
>>> 
>>> 
>>> 
>>> 2014-07-21 22:09 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:
>>> 
>>>> Which version of Mahout?
>>>> 
>>>> 
>>>> On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
>>>> 
>>>>> Hi, I've tried: Unexpected --outputPathForSimilarityMatrix while processing
>>>>> Job-Specific
>>>>> 
>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
>>>>> sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
>>>>> sudo -u oozie mahout recommenditembased \
>>>>>                    --input hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
>>>>>                    --output hdfs://nameservice1/recommenditembased/output \
>>>>>                    --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>                    --numRecommendations 500 \
>>>>>                    --booleanData false \
>>>>>                    --maxPrefsPerUser 1000 \
>>>>>                    --maxSimilaritiesPerItem 1000 \
>>>>>                    --minPrefsPerUser 5 \
>>>>>                    --maxPrefsPerUserInItemSimilarity 30 \
>>>>>                    --threshold 1.1 \
>>>>>                    --tempDir hdfs://nameservice1/recommenditembased/temp \
>>>>>                    --outputPathForSimilarityMatrix hdfs://nameservice1/recommenditembased/sim_matrix
>>>>> 
>>>>> 
>>>>> I'm on Cloudera cdh 4.7, looks like this feature is not supported.
>>>>> 
>>>>> 
>>>>> 2014-07-21 11:18 GMT+04:00 Peng Zhang <pzhang.x...@gmail.com>:
>>>>> 
>>>>>> Serega,
>>>>>> 
>>>>>> See the last line on how to pass the outputPathForSimilarityMatrix option
>>>>>> to the recommenditembased command:
>>>>>> 
>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>                   --input visited_items_with_inverted_items \
>>>>>>                   --output result \
>>>>>>                   --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>>                   --usersFile inverted_items \
>>>>>>                   --numRecommendations 500 \
>>>>>>                   --booleanData false \
>>>>>>                   --maxPrefsPerUser 100 \
>>>>>>                   --maxSimilaritiesPerItem 500 \
>>>>>>                   --minPrefsPerUser 0 \
>>>>>>                   --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>                   --threshold 0.91 \
>>>>>>                   --tempDir  temp \
>>>>>>                   --outputPathForSimilarityMatrix similarityMatri
>>>>>> 
>>>>>> 
>>>>>> Peng Zhang
>>>>>> pzhang.x...@gmail.com
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.shey...@gmail.com> wrote:
>>>>>> 
>>>>>>> I've inspected the code; our approach wouldn't work with
>>>>>>> booleanData=false.
>>>>>>> We do calculate item similarity in the wrong way...(((
>>>>>>> Thank you
>>>>>>> 1. We provide a "fake" user_id and provide --usersFile in order to get
>>>>>>> recommendations for the "fake" user_id, where user_id is a negative
>>>>>>> item_id. It worked when we provided user_id->item_id pairs without
>>>>>>> preference.
>>>>>>> 2. Our target is to get item similarities. We tried
>>>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but
>>>>>>> it returns bad results compared to RecommenderJob with our "fake"
>>>>>>> user_id (inverted item_id)
>>>>>>> 
>>>>>>> 1. I'll try the option you provided.
>>>>>>> 2. I will remove input with fake user_id and usersFile with these fake
>>>>>>> ids
>>>>>>> 
>>>>>>> 3. https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
>>>>>>> I don't understand how to pass the --outputPathForSimilarityMatrix
>>>>>>> option to RecommenderJob
>>>>>>> 
>>>>>>> 
>>>>>>> 2014-07-21 4:58 GMT+04:00 Peng Zhang <pzhang.x...@gmail.com>:
>>>>>>> 
>>>>>>>> Serega,
>>>>>>>> 
>>>>>>>> I have two comments:
>>>>>>>> 1. Don’t use negative user ids. Since Mahout uses user id as well as
>>>>>>>> item id as the row/column index, you’d better use 0, 1, 2, etc. as ids
>>>>>>>> 2. If you want to get the item similarity information, you can use
>>>>>>>> --outputPathForSimilarityMatrix in the command
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Peng Zhang
>>>>>>>> M: +86 186-1658-7856
>>>>>>>> pzhang.x...@gmail.com
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> All bad things happen here:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
>>>>>>>>> User: oozie
>>>>>>>>> Process User: oozie
>>>>>>>>> Group: oozie
>>>>>>>>> Mapper Class: PartialMultiplyMapper
>>>>>>>>> Reducer Class: AggregateAndRecommendReducer
>>>>>>>>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
>>>>>>>>> Job Output Directory: hdfs://nameservice1/itemrec/output/
>>>>>>>>> 
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map input records=3312879
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Map output records=3313251
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce input records=3313251
>>>>>>>>> 14/07/20 23:57:47 INFO mapred.JobClient:     Reduce output records=0
>>>>>>>>> 
>>>>>>>>> Why does mahout return 0 rows? It works when booleanData=true
>>>>>>>>> (preferences are ignored...?)
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.shey...@gmail.com>:
>>>>>>>>> 
>>>>>>>>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
>>>>>>>>>> users_file:
>>>>>>>>>> --inverted_item_id
>>>>>>>>>> -1
>>>>>>>>>> -2
>>>>>>>>>> -3
>>>>>>>>>> -4
>>>>>>>>>> 
>>>>>>>>>> users_items_prefs
>>>>>>>>>> --inverted item_id
>>>>>>>>>> -1 1 1.0
>>>>>>>>>> -2 2 1.0
>>>>>>>>>> -3 3 1.0
>>>>>>>>>> -4 4 1.0
>>>>>>>>>> --user_id item_id pref_value
>>>>>>>>>> 11   1 1.6
>>>>>>>>>> 11   2 1.6
>>>>>>>>>> 123 3 2.0
>>>>>>>>>> 123 4 2.0
>>>>>>>>>> 333 1 2.0
>>>>>>>>>> 333 2 1.6
>>>>>>>>>> --e.t.c.
>>>>>>>>>> 
>>>>>>>>>> if I set --booleanData true
>>>>>>>>>> then mahout returns the result.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.mussel...@gmail.com>:
>>>>>>>>>> 
>>>>>>>>>>> I'm confused about how you're constructing the user file, and why
>>>>>>>>>>> there are negated item ids here.
>>>>>>>>>>> 
>>>>>>>>>>> Can you post some more details please, including Mahout version and
>>>>>>>>>>> some sample data sets?
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi, I'm trying to create item similarity.
>>>>>>>>>>>> I gather items which users visit during shopping and then create a
>>>>>>>>>>>> file:
>>>>>>>>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9],
>>>>>>>>>>>> depending on user action type and data source)
>>>>>>>>>>>> UNION
>>>>>>>>>>>> -item_id, item_id, 1.0 (from items dictionary)
>>>>>>>>>>>> 
>>>>>>>>>>>> and I do provide a userFile, where user_id = -item_id
>>>>>>>>>>>> 
>>>>>>>>>>>> The idea is to get item similarity. If any user visits an item named
>>>>>>>>>>>> "A", I want to show him items "B", "c", "xxx" using preferences of
>>>>>>>>>>>> other users.
>>>>>>>>>>>> 
>>>>>>>>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
>>>>>>>>>>>> 
>>>>>>>>>>>> Here are my settings:
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> sudo -u oozie mahout recommenditembased \
>>>>>>>>>>>>                 --input visited_items_with_inverted_items \
>>>>>>>>>>>>                 --output result \
>>>>>>>>>>>>                 --similarityClassname SIMILARITY_LOGLIKELIHOOD \
>>>>>>>>>>>>                 --usersFile inverted_items \
>>>>>>>>>>>>                 --numRecommendations 500 \
>>>>>>>>>>>>                 --booleanData false \
>>>>>>>>>>>>                 --maxPrefsPerUser 100 \
>>>>>>>>>>>>                 --maxSimilaritiesPerItem 500 \
>>>>>>>>>>>>                 --minPrefsPerUser 0 \
>>>>>>>>>>>>                 --maxPrefsPerUserInItemSimilarity 30 \
>>>>>>>>>>>>                 --threshold 0.91 \
>>>>>>>>>>>>                 --tempDir  temp
>>>>>>>>>>>> 
>>>>>>>>>>>> Some counters... I don't get what they mean....
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
>>>>>>>>>>>> 14/07/20 22:43:08 INFO mapred.JobClient:     USERS=7528530
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_NEGLECTED=1,798,738
>>>>>>>>>>>> 14/07/20 22:43:43 INFO mapred.JobClient:     USER_RATINGS_USED=12,429,693
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>> 14/07/20 22:44:24 INFO mapred.JobClient:     ROWS=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     COOCCURRENCES=35882374
>>>>>>>>>>>> 14/07/20 22:45:18 INFO mapred.JobClient:     PRUNED_COOCCURRENCES=0
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map input records=3312879
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Map output records=17570268
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce input records=5221907
>>>>>>>>>>>> 14/07/20 22:46:00 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce input records=3312879
>>>>>>>>>>>> 14/07/20 22:46:34 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map input records=7528530
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Map output records=3313251
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce input records=3313251
>>>>>>>>>>>> 14/07/20 22:47:06 INFO mapred.JobClient:     Reduce output records=3313251
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map input records=6626130
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Map output records=6626130
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce input records=6626130
>>>>>>>>>>>> 14/07/20 22:47:40 INFO mapred.JobClient:     Reduce output records=3312879
>>>>>>>>>>>> 
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map input records=3312879
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Map output records=3313251
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce input records=3313251
>>>>>>>>>>>> 
>>>>>>>>>>>> --------
>>>>>>>>>>>> 14/07/20 22:48:26 INFO mapred.JobClient:     Reduce output records=0
>>>>>>>>>>>> --------
>>>>>>>>>>>> 
>>>>>>>>>>>> why 0???
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 
