Hi Ted, I have seen some benchmark results comparing different versions of the co-occurrence computation; I will share them if I can find them, today or tomorrow.
On Thu, Aug 12, 2010 at 10:30 PM, Ted Dunning <[email protected]> wrote:
> Jimmy Lin's stripes work was presented at the last Summit and there was
> heated (well, warm and cordial at least) discussion with the Map-reduce
> committers about whether good use of a combiner wouldn't do just as well.
>
> My take-away as a spectator is that a combiner was
>
> a) vastly easier to code
>
> b) would be pretty certain to be within 2x as performant and likely very
> close to the same speed
>
> c) would not need changing each time the underlying map-reduce changed
>
> My conclusion was that combiners were the way to go (for me). Your mileage,
> as always, will vary.
>
> On Thu, Aug 12, 2010 at 7:45 AM, Gökhan Çapan <[email protected]> wrote:
>
> > Hi,
> > I haven't seen the code, but maybe Mahout needs some optimization while
> > computing item-item co-occurrences. It could be re-implemented using the
> > "stripes" approach with in-mapper combining, if it is not already. The
> > approach is described in:
> >
> > 1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
> >
> > If it already is, sorry for the post.
> >
> > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde <
> > [email protected]> wrote:
> >
> > > Sebastian, thanks for the reply. The step name is
> > > RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and each map
> > > task takes around 10 hours to finish.
> > >
> > > The reduce task dir
> > > (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> > > has map output files (files like map_2.out), and each one is 5 GB in size.
> > >
> > > I have been looking at the code and saw what you describe in the e-mail.
> > > It makes sense. But 160 GB of intermediate data from a 2.6 GB input file
> > > still makes me wonder if something is wrong.
> > >
> > > Should I just wait for the patch?
> > > Thanks again!
> > > Charly
> > >
> > > On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <
> > > [email protected]> wrote:
> > >
> > > > Hi Charly,
> > > >
> > > > can you tell which Map/Reduce step was executed last before you ran out
> > > > of disk space?
> > > >
> > > > I'm not familiar with the Netflix dataset and can only guess what
> > > > happened, but I would say that you ran out of disk space because
> > > > ItemSimilarityJob currently uses all preferences to compute the
> > > > similarities. This makes it scale with the square of the number of
> > > > occurrences of the most popular item, which is a bad thing if that
> > > > number is huge. We need a way to limit the number of preferences
> > > > considered per item; there is already a ticket for this
> > > > (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan to
> > > > provide a patch in the next few days.
> > > >
> > > > --sebastian
> > > >
> > > > Am 12.08.2010 00:15, schrieb Charly Lizarralde:
> > > > > Hi, I am testing ItemSimilarityJob with the Netflix data (2.6 GB) and
> > > > > I have just run out of disk space (160 GB) in my mapred.local.dir
> > > > > when running RowSimilarityJob.
> > > > >
> > > > > Is this normal behaviour? How can I improve this?
> > > > >
> > > > > Thanks!
> > > > > Charly
> >
> > --
> > Gökhan Çapan

--
Gökhan Çapan
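For readers following along, the "stripes" approach with in-mapper combining that Gökhan refers to can be sketched in a few lines. This is an illustrative Python simulation only, not Mahout's actual Java implementation: each item maps to a "stripe" (an associative array of co-occurring item counts), and the mapper accumulates stripes in memory across all of its input records, emitting each stripe once at close instead of emitting one key-value pair per co-occurrence.

```python
from collections import defaultdict, Counter

def map_with_in_mapper_combining(baskets):
    """Mapper with in-mapper combining: accumulate one stripe
    (a Counter of co-occurring items) per item across all input
    records, and emit each stripe once at mapper close, rather
    than emitting a pair for every single co-occurrence."""
    stripes = defaultdict(Counter)  # item -> {other_item: count}
    for basket in baskets:
        for a in basket:
            for b in basket:
                if a != b:
                    stripes[a][b] += 1
    # simulate the mapper's close()/cleanup(): emit accumulated stripes
    return list(stripes.items())

def reduce_stripes(emitted):
    """Reducer: element-wise sum of all stripes for the same item."""
    merged = defaultdict(Counter)
    for item, stripe in emitted:
        merged[item].update(stripe)
    return merged

# toy data: each "basket" is the set of items seen together
baskets = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
counts = reduce_stripes(map_with_in_mapper_combining(baskets))
# counts["a"]["b"] == 2, counts["b"]["c"] == 2, counts["a"]["c"] == 1
```

In a real Hadoop job the accumulation would live in the Mapper instance and the emission in its close()/cleanup() method; the trade-off Ted describes is that a plain combiner achieves much of the same network savings without holding state in the mapper.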
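Sebastian's scaling argument can be made concrete with some back-of-the-envelope arithmetic (illustrative numbers, not the actual Netflix statistics): if an item has n preferences, a pairwise approach that considers every pair of those preferences emits on the order of n*(n-1)/2 intermediate records for that item alone, which is why a very popular item dominates the intermediate data volume and why capping preferences per item (MAHOUT-460) helps.

```python
def pairs_emitted(n_prefs):
    """Number of unordered pairs generated from n preferences
    of a single item: n choose 2."""
    return n_prefs * (n_prefs - 1) // 2

# A hypothetical item with 100,000 preferences alone produces
# ~5 billion intermediate pairs, dwarfing items with modest counts.
for n in (100, 10_000, 100_000):
    print(n, pairs_emitted(n))
```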
