1. I collect preferences for items using a 60-day sliding window (today - 60 days).
2. I prepare triples (user_id, item_id, discrete_pref_value): 3 for an item view, 5 for clicking a recommendation block. The idea is to give more weight to recommendations which attract visitor attention. I get ~20.000.000 lines with ~1.000.000 distinct items and ~2.000.000 distinct users.
3. I use the Apache Pig RANK function to rank all distinct user_ids.
4. I do the same for item_ids.
5. I join the input dataset with the ranked datasets and provide input to Mahout with dense integer user_id, item_id (see the sketch after this list).
6. I take the Mahout output and join the integer item_id back to get the natural key value.
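For concreteness, here is a minimal Pig sketch of steps 3-6 as described above (RANK ... DENSE needs Pig 0.11+). The paths, alias names, and the Mahout output layout are assumptions for illustration, not our production script:

-- Sketch only: load the (user_id, item_id, pref) triples from step 2;
-- paths and alias names are illustrative.
prefs = LOAD 'prefs' USING PigStorage(',')
        AS (user_id:chararray, item_id:chararray, pref:int);

-- Steps 3-4: dense integer ids for the natural keys.
u = FOREACH prefs GENERATE user_id;
users = DISTINCT u;
i = FOREACH prefs GENERATE item_id;
items = DISTINCT i;
ranked_users = RANK users BY user_id DENSE;  -- prepends a rank_users field
ranked_items = RANK items BY item_id DENSE;  -- prepends a rank_items field

-- Step 5: replace the natural keys with the dense integer ids for Mahout.
with_uid = JOIN prefs BY user_id, ranked_users BY user_id;
with_ids = JOIN with_uid BY prefs::item_id, ranked_items BY item_id;
mahout_input = FOREACH with_ids GENERATE rank_users AS uid,
                                         rank_items AS iid,
                                         pref;
STORE mahout_input INTO 'mahout_input' USING PigStorage(',');

-- Step 6: join the integer item id in the Mahout output back to the
-- natural key (output layout assumed here: itemA, itemB, similarity;
-- the second item id would be mapped back the same way).
sims = LOAD 'mahout_output' AS (iid:long, sim_iid:long, score:double);
back = JOIN sims BY iid, ranked_items BY rank_items;
result = FOREACH back GENERATE item_id, sim_iid, score;

DENSE makes the assigned ids consecutive integers starting at 1, which keeps the integer ID space compact for Mahout.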
Steps #1-2 take ~40 min, steps #3-5 take ~1 hour, and the Mahout calculation takes ~3 hours.

2014-08-17 10:45 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:

> This really doesn't sound right. It should be possible to process almost a
> thousand times that much data every night without that much problem.
>
> How are you preparing the input data?
>
> How are you converting to Mahout ids?
>
> Even using Python, you should be able to do the conversion in just a few
> minutes without any parallelism whatsoever.
>
>
> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
> > Hi, we are trying to calculate ItemSimilarity.
> > Right now we have 2*10^7 input lines. I provide the input data as raw
> > text each day to recalculate item similarities. We get +100..1000 new
> > items each day.
> > 1. It takes too much time to prepare the input data.
> > 2. It takes too much time to convert user_id, item_id to Mahout ids.
> >
> > Is there any possibility to provide data to the Mahout MapReduce
> > ItemSimilarity using some binary format with compression?