1. I collect preferences for items using a 60-day sliding window (today - 60 days).
2. I prepare triples (user_id, item_id, discrete_pref_value): 3 for an item view, 5 for clicking a recommendation block. The idea is to give more weight to recommendations which attract visitor attention. I get ~20.000.000 lines with ~1.000.000 distinct items and ~2.000.000 distinct users.
3. I use the Apache Pig RANK function to rank all distinct user_ids.
4. I do the same for item_ids.
5. I join the input dataset with the ranked datasets and provide input to Mahout with dense integer user_id, item_id (see the sketch after this list).
6. I take the Mahout output and join the integer item_id back to get the natural key value.
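For concreteness, here is a minimal Pig sketch of steps 3-6 as described above (RANK ... DENSE needs Pig 0.11+). The paths, alias names, and the Mahout output layout are assumptions for illustration, not our production script:

-- Sketch only: load the (user_id, item_id, pref) triples from step 2;
-- paths and alias names are illustrative.
prefs = LOAD 'prefs' USING PigStorage(',')
        AS (user_id:chararray, item_id:chararray, pref:int);

-- Steps 3-4: dense integer ids for the natural keys.
u = FOREACH prefs GENERATE user_id;
users = DISTINCT u;
i = FOREACH prefs GENERATE item_id;
items = DISTINCT i;
ranked_users = RANK users BY user_id DENSE;  -- prepends a rank_users field
ranked_items = RANK items BY item_id DENSE;  -- prepends a rank_items field

-- Step 5: replace the natural keys with the dense integer ids for Mahout.
with_uid = JOIN prefs BY user_id, ranked_users BY user_id;
with_ids = JOIN with_uid BY prefs::item_id, ranked_items BY item_id;
mahout_input = FOREACH with_ids GENERATE rank_users AS uid,
                                         rank_items AS iid,
                                         pref;
STORE mahout_input INTO 'mahout_input' USING PigStorage(',');

-- Step 6: join the integer item id in the Mahout output back to the
-- natural key (output layout assumed here: itemA, itemB, similarity;
-- the second item id would be mapped back the same way).
sims = LOAD 'mahout_output' AS (iid:long, sim_iid:long, score:double);
back = JOIN sims BY iid, ranked_items BY rank_items;
result = FOREACH back GENERATE item_id, sim_iid, score;

DENSE makes the assigned ids consecutive integers starting at 1, which keeps the integer ID space compact for Mahout.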
Steps #1-2 take ~40 min, steps #3-5 take ~1 hour, and the Mahout calculation takes ~3 hours.

2014-08-17 10:45 GMT+04:00 Ted Dunning <ted.dunn...@gmail.com>:

> This really doesn't sound right. It should be possible to process almost a
> thousand times that much data every night without that much problem.
>
> How are you preparing the input data?
>
> How are you converting to Mahout ids?
>
> Even using Python, you should be able to do the conversion in just a few
> minutes without any parallelism whatsoever.
>
>
> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
> > Hi, we are trying to calculate ItemSimilarity.
> > Right now we have 2*10^7 input lines. I provide the input data as raw
> > text each day to recalculate item similarities. We get +100..1000 new
> > items each day.
> > 1. It takes too much time to prepare the input data.
> > 2. It takes too much time to convert user_id, item_id to Mahout ids.
> >
> > Is there any possibility to provide data to the Mahout MapReduce
> > ItemSimilarity using some binary format with compression?