OK, I have recompiled Mahout 0.9 for CDH 4.7 and will try it this evening. Right now I don't see how it can help me: as far as I know, the functionality I'm trying to use is pretty old and stable, so it looks like I'm applying it the wrong way.
recommenditembased has an option named "--threshold". The data I feed to recommenditembased has preference values in the range [1.1..2.0], and I set --threshold to 1.2. Is --threshold absolute, so that it can take values from [1.1..2+], or is it relative, so that it must lie within [0.0..0.99999]?

2014-07-22 3:54 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:

> That version is no longer supported. You should upgrade to 0.9
>
> On Mon, Jul 21, 2014 at 11:41 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
>
> > 0.7-cdh4.7.0
> > Anyway, recommenditembased does produce these catalogs:
> >
> > /recommenditembased/temp/maxValues.bin
> > /recommenditembased/temp/norms.bin
> > /recommenditembased/temp/numNonZeroEntries.bin
> > /recommenditembased/temp/pairwiseSimilarity
> > /recommenditembased/temp/partialMultiply
> > /recommenditembased/temp/prePartialMultiply1
> > /recommenditembased/temp/prePartialMultiply2
> > /recommenditembased/temp/preparePreferenceMatrix
> > /recommenditembased/temp/similarityMatrix
> > /recommenditembased/temp/weights
> >
> > I suppose that "/recommenditembased/temp/similarityMatrix" is the thing I
> > need. Right now I try to read it using
> >
> > matrix = LOAD '/recommenditembased/temp/similarityMatrix' USING
> >     com.twitter.elephantbird.pig.load.SequenceFileLoader(
> >         '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
> >         '-c com.twitter.elephantbird.pig.mahout.VectorWritableConverter'
> >     ) as (intId: int, vector:tuple(cardinality:int,
> >         entries:bag{t:tuple(some_id:long, some_value:double)}));
> >
> > Looks like the vector is empty... Or I do something wrong.
> >
> > 2014-07-21 22:09 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:
> >
> > > Which version of Mahout?
> > >
> > > On Mon, Jul 21, 2014 at 11:05 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
> > >
> > > > Hi, I've tried:
> > > > Unexpected --outputPathForSimilarityMatrix while processing Job-Specific
> > > >
> > > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/output
> > > > sudo -u hdfs hadoop fs -rm -r hdfs://nameservice1/recommenditembased/temp
> > > > sudo -u oozie mahout recommenditembased \
> > > >     --input \
> > > >     hdfs://nameservice1/user/hive/warehouse/staging_weighted_visits_and_rec_clicks \
> > > >     --output \
> > > >     hdfs://nameservice1/recommenditembased/output \
> > > >     --similarityClassname \
> > > >     SIMILARITY_LOGLIKELIHOOD \
> > > >     --numRecommendations \
> > > >     500 \
> > > >     --booleanData \
> > > >     false \
> > > >     --maxPrefsPerUser \
> > > >     1000 \
> > > >     --maxSimilaritiesPerItem \
> > > >     1000 \
> > > >     --minPrefsPerUser \
> > > >     5 \
> > > >     --maxPrefsPerUserInItemSimilarity \
> > > >     30 \
> > > >     --threshold \
> > > >     1.1 \
> > > >     --tempDir \
> > > >     hdfs://nameservice1/recommenditembased/temp \
> > > >     --outputPathForSimilarityMatrix \
> > > >     hdfs://nameservice1/recommenditembased/sim_matrix
> > > >
> > > > I'm on Cloudera cdh 4.7, looks like this feature is not supported.
> > > >
> > > > 2014-07-21 11:18 GMT+04:00 Peng Zhang <pzhang.x...@gmail.com>:
> > > >
> > > > > Serega,
> > > > >
> > > > > See the last line on how to pass outputPathForSimilarityMatrix options to
> > > > > the recommenditembased command:
> > > > >
> > > > > sudo -u oozie mahout recommenditembased \
> > > > >     --input visited_items_with_inverted_items \
> > > > >     --output result \
> > > > >     --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> > > > >     --usersFile inverted_items \
> > > > >     --numRecommendations 500 \
> > > > >     --booleanData false \
> > > > >     --maxPrefsPerUser 100 \
> > > > >     --maxSimilaritiesPerItem 500 \
> > > > >     --minPrefsPerUser 0 \
> > > > >     --maxPrefsPerUserInItemSimilarity 30 \
> > > > >     --threshold 0.91 \
> > > > >     --tempDir temp \
> > > > >     --outputPathForSimilarityMatrix similarityMatri \
> > > > >
> > > > > Peng Zhang
> > > > > pzhang.x...@gmail.com
> > > > >
> > > > > On Jul 21, 2014, at 3:09 PM, Serega Sheypak <serega.shey...@gmail.com> wrote:
> > > > >
> > > > > > I've inspected the code, our approach wouldn't work with booleanData=false.
> > > > > > We do calculate item similarity in the wrong way...(((
> > > > > > Thank you
> > > > > > 1. We provide "fake" user_id and provide --usersFile in order to get
> > > > > > recommendations for "fake" user_id, where user_id is a negative item_id.
> > > > > > It worked when we did provide user_id->item_id pairs without preference.
> > > > > > 2. Our target is to get item similarities. We tried
> > > > > > org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob but it
> > > > > > returns bad result comparing to RecommenderJob with our "fake" user_id
> > > > > > (inverted item_id)
> > > > > >
> > > > > > 1. I'll try the option you provided.
> > > > > > 2. I will remove input with fake user_id and usersFile with these fake ids
> > > > > >
> > > > > > 3.
> > > > > > https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
> > > > > > I don't understand how to pass --outputPathForSimilarityMatrix option to
> > > > > > RecommenderJob
> > > > > >
> > > > > > 2014-07-21 4:58 GMT+04:00 Peng Zhang <pzhang.x...@gmail.com>:
> > > > > >
> > > > > >> Serega,
> > > > > >>
> > > > > >> I have two comments:
> > > > > >> 1. Don't use negative user ids. Since Mahout uses user id as well as item
> > > > > >> id as the row/column index, you'd better use 0, 1, 2, etc as ids
> > > > > >> 2. If you want to get the item similarity information, you can use
> > > > > >> --outputPathForSimilarityMatrix in the command
> > > > > >>
> > > > > >> Regards,
> > > > > >> Peng Zhang
> > > > > >> M: +86 186-1658-7856
> > > > > >> pzhang.x...@gmail.com
> > > > > >>
> > > > > >> On Jul 21, 2014, at 4:00 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
> > > > > >>
> > > > > >>> All bad things happen here:
> > > > > >>>
> > > > > >>> Name: RecommenderJob-PartialMultiplyMapper-Reducer
> > > > > >>> User: oozie
> > > > > >>> Process User: oozie
> > > > > >>> Group: oozie
> > > > > >>> Mapper Class: PartialMultiplyMapper
> > > > > >>> Reducer Class: AggregateAndRecommendReducer
> > > > > >>> Job Input Directory: hdfs://nameservice1/itemrec/temp/partialMultiply
> > > > > >>> Job Output Directory: hdfs://nameservice1/itemrec/output/
> > > > > >>>
> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Map input records=3312879
> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Map output records=3313251
> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce input records=3313251
> > > > > >>> 14/07/20 23:57:47 INFO mapred.JobClient: Reduce output records=0
> > > > > >>>
> > > > > >>> Why does mahout return 0 rows? It works when booleanData=true
> > > > > >>> (preferences are ignored...?)
> > > > > >>>
> > > > > >>> 2014-07-20 23:19 GMT+04:00 Serega Sheypak <serega.shey...@gmail.com>:
> > > > > >>>
> > > > > >>>> the version is: CDH-4.7.0-1.cdh4.7.0.p0.40
> > > > > >>>> users_file:
> > > > > >>>> --inverted_item_id
> > > > > >>>> -1
> > > > > >>>> -2
> > > > > >>>> -3
> > > > > >>>> -4
> > > > > >>>>
> > > > > >>>> users_items_prefs
> > > > > >>>> --inverted item_id
> > > > > >>>> -1 1 1.0
> > > > > >>>> -2 2 1.0
> > > > > >>>> -3 3 1.0
> > > > > >>>> -4 4 1.0
> > > > > >>>> --user_id item_id pref_value
> > > > > >>>> 11 1 1.6
> > > > > >>>> 11 2 1.6
> > > > > >>>> 123 3 2.0
> > > > > >>>> 123 4 2.0
> > > > > >>>> 333 1 2.0
> > > > > >>>> 333 2 1.6
> > > > > >>>> --e.t.c.
> > > > > >>>>
> > > > > >>>> if I set --booleanData true
> > > > > >>>> then mahout returns the result.
> > > > > >>>>
> > > > > >>>> 2014-07-20 23:12 GMT+04:00 Andrew Musselman <andrew.mussel...@gmail.com>:
> > > > > >>>>
> > > > > >>>>> I'm confused about how you're constructing the user file, and why there
> > > > > >>>>> are negated item ids here.
> > > > > >>>>>
> > > > > >>>>> Can you post some more details please, including Mahout version and some
> > > > > >>>>> sample data sets?
> > > > > >>>>>
> > > > > >>>>>> On Jul 20, 2014, at 11:57 AM, Serega Sheypak <serega.shey...@gmail.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>> Hi, I'm trying to create item similarity.
> > > > > >>>>>> I gather items which users visit during shopping and then create a file:
> > > > > >>>>>> user_id, item_id, weight (where weight can be: [1.0, 1.6, 1.9], depends on
> > > > > >>>>>> user action type and data source)
> > > > > >>>>>> UNION
> > > > > >>>>>> -item_id, item_id, 1.0 (from items dictionary)
> > > > > >>>>>>
> > > > > >>>>>> and I do provide a userFile, where user_id = -item_id
> > > > > >>>>>>
> > > > > >>>>>> The idea is to get item similarity. If any user visits item named "A", I want
> > > > > >>>>>> to show him items "B", "c", "xxx" using preferences of other users.
> > > > > >>>>>>
> > > > > >>>>>> The problem is that the last (???) mapreduce job returns 0 rows:
> > > > > >>>>>>
> > > > > >>>>>> Here are my settings:
> > > > > >>>>>>
> > > > > >>>>>> sudo -u oozie mahout recommenditembased \
> > > > > >>>>>>     --input visited_items_with_inverted_items \
> > > > > >>>>>>     --output result \
> > > > > >>>>>>     --similarityClassname SIMILARITY_LOGLIKELIHOOD \
> > > > > >>>>>>     --usersFile inverted_items \
> > > > > >>>>>>     --numRecommendations 500 \
> > > > > >>>>>>     --booleanData false \
> > > > > >>>>>>     --maxPrefsPerUser 100 \
> > > > > >>>>>>     --maxSimilaritiesPerItem 500 \
> > > > > >>>>>>     --minPrefsPerUser 0 \
> > > > > >>>>>>     --maxPrefsPerUserInItemSimilarity 30 \
> > > > > >>>>>>     --threshold 0.91 \
> > > > > >>>>>>     --tempDir temp \
> > > > > >>>>>>
> > > > > >>>>>> Some counters... I don't get what they mean....
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.item.ToUserVectorsReducer$Counters
> > > > > >>>>>> 14/07/20 22:43:08 INFO mapred.JobClient: USERS=7528530
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper$Elements
> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_NEGLECTED=1,798,738
> > > > > >>>>>> 14/07/20 22:43:43 INFO mapred.JobClient: USER_RATINGS_USED=12,429,693
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> > > > > >>>>>> 14/07/20 22:44:24 INFO mapred.JobClient: ROWS=3312879
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: COOCCURRENCES=35882374
> > > > > >>>>>> 14/07/20 22:45:18 INFO mapred.JobClient: PRUNED_COOCCURRENCES=0
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map input records=3312879
> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Map output records=17570268
> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce input records=5221907
> > > > > >>>>>> 14/07/20 22:46:00 INFO mapred.JobClient: Reduce output records=3312879
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce input records=3312879
> > > > > >>>>>> 14/07/20 22:46:34 INFO mapred.JobClient: Reduce output records=3312879
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map input records=7528530
> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Map output records=3313251
> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce input records=3313251
> > > > > >>>>>> 14/07/20 22:47:06 INFO mapred.JobClient: Reduce output records=3313251
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map input records=6626130
> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Map output records=6626130
> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce input records=6626130
> > > > > >>>>>> 14/07/20 22:47:40 INFO mapred.JobClient: Reduce output records=3312879
> > > > > >>>>>>
> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map input records=3312879
> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Map output records=3313251
> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce input records=3313251
> > > > > >>>>>>
> > > > > >>>>>> --------
> > > > > >>>>>> 14/07/20 22:48:26 INFO mapred.JobClient: Reduce output records=0
> > > > > >>>>>> --------
> > > > > >>>>>>
> > > > > >>>>>> why 0???
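
For reference on the --threshold question, here is a rough Python sketch, not Mahout code. It rests on two assumptions that should be checked against the Mahout sources for your version: (a) that --threshold is compared against the item-item similarity score rather than against the preference values, and (b) that SIMILARITY_LOGLIKELIHOOD normalizes the log-likelihood ratio as 1 - 1/(1 + LLR), which keeps the score in [0, 1). All function names and the 2x2 counts are hypothetical and only for illustration. If both assumptions hold, a threshold of 1.1 or 1.2 can never be reached, while 0.91 can.

```python
import math


def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    # k11: users who saw both items, k12/k21: users who saw only one of them,
    # k22: users who saw neither.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    if row + col < mat:
        return 0.0  # guard against tiny negative values from rounding
    return 2.0 * (row + col - mat)


def llr_similarity(k11, k12, k21, k22):
    # ASSUMPTION: normalization 1 - 1/(1 + LLR), so the result is in [0, 1).
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


def keep_pair(similarity, threshold):
    # ASSUMPTION: pairs scoring below the threshold are discarded.
    return similarity >= threshold


if __name__ == "__main__":
    # Made-up counts for a strongly co-occurring item pair.
    sim = llr_similarity(10, 5, 5, 1000)
    print("similarity:", sim)
    print("kept at --threshold 0.91:", keep_pair(sim, 0.91))
    print("kept at --threshold 1.1:", keep_pair(sim, 1.1))
```

Under these assumptions the threshold is a cut-off on a score that is always below 1, independent of whether the input preferences are 1.0, 1.6 or 2.0.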