I really must read the papers in order to keep commenting on this thread... if I get the chance to dive into the Job "internals", maybe I can try to write an optimization.
For now, I will buy another disk and retry the tests. Thanks for the replies!
Charly

On Thu, Aug 12, 2010 at 5:12 PM, Ted Dunning <[email protected]> wrote:

> I am sympathetic with the goals of stripes and have not analyzed the
> situation in detail. Instead, I am simply reporting that at least one guy
> with very deep knowledge of the Hadoop map-reduce framework felt that
> similar results could be achieved without quite so much fanciness.
>
> On Thu, Aug 12, 2010 at 1:06 PM, Gökhan Çapan <[email protected]> wrote:
>
> > Hi Ted,
> > The combining phase is also applicable to the stripes approach (as is
> > in-mapper combining). The experiment I remember is from the paper;
> > according to it, though, the stripes approach produces larger
> > intermediate key-value pairs before the combiners. Below is from the
> > paper:
> >
> > "Results demonstrate that the stripes approach is far more efficient
> > than the pairs approach: 666 seconds (11m 6s) compared to 3758 seconds
> > (62m 38s) for the entire APW sub-corpus (improvement by a factor of
> > 5.7). On the entire sub-corpus, the mappers in the pairs approach
> > generated 2.6 billion intermediate key-value pairs totaling 31.2 GB.
> > After the combiners, this was reduced to 1.1 billion key-value pairs,
> > which roughly quantifies the amount of data involved in the shuffling
> > and sorting of the keys. On the other hand, the mappers in the stripes
> > approach generated 653 million intermediate key-value pairs totaling
> > 48.1 GB; after the combiners, only 28.8 million key-value pairs were
> > left. The stripes approach provides more opportunities for combiners
> > to aggregate intermediate results, thus greatly reducing network
> > traffic in the sort and shuffle phase."
> >
> > On Thu, Aug 12, 2010 at 10:34 PM, Gökhan Çapan <[email protected]> wrote:
> >
> > > Hi Ted,
> > > I have seen some benchmark results between different versions of the
> > > co-occurrence computation; I will share them if I can find them,
> > > today or tomorrow.
> > >
> > > On Thu, Aug 12, 2010 at 10:30 PM, Ted Dunning <[email protected]> wrote:
> > >
> > > > Jimmy Lin's stripes work was presented at the last Summit and there
> > > > was heated (well, warm and cordial at least) discussion with the
> > > > Map-reduce committers about whether good use of a combiner wouldn't
> > > > do just as well.
> > > >
> > > > My take-away as a spectator was that a combiner:
> > > >
> > > > a) was vastly easier to code
> > > >
> > > > b) would be pretty certain to be within 2x as performant and likely
> > > > very close to the same speed
> > > >
> > > > c) would not need changing each time the underlying map-reduce
> > > > changed
> > > >
> > > > My conclusion was that combiners were the way to go (for me). Your
> > > > mileage, as always, will vary.
> > > >
> > > > On Thu, Aug 12, 2010 at 7:45 AM, Gökhan Çapan <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > > I haven't seen the code, but maybe Mahout needs some optimization
> > > > > while computing item-item co-occurrences. It could be
> > > > > re-implemented using the "stripes" approach with in-mapper
> > > > > combining, if it is not already. The approach can be found at:
> > > > >
> > > > > 1. www.aclweb.org/anthology/D/D08/D08-1044.pdf
> > > > >
> > > > > If it already is, sorry for the post.
> > > > >
> > > > > On Thu, Aug 12, 2010 at 3:51 PM, Charly Lizarralde <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Sebastian, thanks for the reply. The step name is
> > > > > > RowSimilarityJob-CooccurrencesMapper-SimilarityReducer, and
> > > > > > each map task takes around 10 hours to finish.
> > > > > >
> > > > > > The reduce task dir
> > > > > > (var/lib/hadoop-0.20/cache/hadoop/mapred/local/taskTracker/jobcache/job_201008111833_0007/attempt_201008111833_0007_r_000000_0/output)
> > > > > > has map output files (files like map_2.out) and each one is
> > > > > > 5 GB in size.
> > > > > >
> > > > > > I have been looking at the code and saw what you describe in
> > > > > > the e-mail. It makes sense. But 160 GB of intermediate data
> > > > > > from a 2.6 GB input file still makes me wonder if something is
> > > > > > wrong.
> > > > > >
> > > > > > Should I just wait for the patch?
> > > > > > Thanks again!
> > > > > > Charly
> > > > > >
> > > > > > On Thu, Aug 12, 2010 at 2:34 AM, Sebastian Schelter <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Hi Charly,
> > > > > > >
> > > > > > > can you tell which Map/Reduce step was executed last before
> > > > > > > you ran out of disk space?
> > > > > > >
> > > > > > > I'm not familiar with the Netflix dataset and can only guess
> > > > > > > what happened, but I would say that you ran out of disk space
> > > > > > > because ItemSimilarityJob currently uses all preferences to
> > > > > > > compute the similarities. This makes it scale with the square
> > > > > > > of the number of occurrences of the most popular item, which
> > > > > > > is a bad thing if that number is huge. We need a way to limit
> > > > > > > the number of preferences considered per item; there is
> > > > > > > already a ticket for this
> > > > > > > (https://issues.apache.org/jira/browse/MAHOUT-460) and I plan
> > > > > > > to provide a patch in the next few days.
> > > > > > >
> > > > > > > --sebastian
> > > > > > >
> > > > > > > On 12.08.2010 00:15, Charly Lizarralde wrote:
> > > > > > > > Hi, I am testing ItemSimilarityJob with Netflix data
> > > > > > > > (2.6 GB) and I have just run out of disk space (160 GB) in
> > > > > > > > my mapred.local.dir when running RowSimilarityJob.
> > > > > > > >
> > > > > > > > Is this normal behaviour? How can I improve this?
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Charly
> > > > >
> > > > > --
> > > > > Gökhan Çapan
> > >
> > > --
> > > Gökhan Çapan
> >
> > --
> > Gökhan Çapan
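
For concreteness, here is a minimal sketch of the "stripes" pattern with in-mapper combining that the paper above describes, written against the Hadoop 0.20 mapreduce API. The input format (one basket of co-occurring item IDs per line) and all class and field names are assumptions for illustration; this is not Mahout's actual RowSimilarityJob code.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one "stripe" (a partial co-occurrence row) per item. Counts are
// accumulated in memory across map() calls and flushed once in cleanup(),
// which is the in-mapper combining part.
public class StripesCooccurrenceMapper
    extends Mapper<LongWritable, Text, Text, MapWritable> {

  private final Map<String, Map<String, Integer>> stripes =
      new HashMap<String, Map<String, Integer>>();

  @Override
  protected void map(LongWritable key, Text value, Context ctx) {
    String[] items = value.toString().trim().split("\\s+");
    for (String a : items) {
      Map<String, Integer> stripe = stripes.get(a);
      if (stripe == null) {
        stripe = new HashMap<String, Integer>();
        stripes.put(a, stripe);
      }
      for (String b : items) {
        if (!a.equals(b)) {
          Integer count = stripe.get(b);
          stripe.put(b, count == null ? 1 : count + 1);
        }
      }
    }
  }

  @Override
  protected void cleanup(Context ctx) throws IOException, InterruptedException {
    // One output record per item, no matter how many input lines mentioned it.
    for (Map.Entry<String, Map<String, Integer>> e : stripes.entrySet()) {
      MapWritable stripe = new MapWritable();
      for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
        stripe.put(new Text(c.getKey()), new IntWritable(c.getValue()));
      }
      ctx.write(new Text(e.getKey()), stripe);
    }
  }
}

A reducer (and optionally a combiner) would then element-wise sum the stripes for each item. The cost is that the accumulated stripes have to fit in the map task's heap.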
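For comparison, a sketch of the plain "pairs" formulation that leans on a combiner, roughly the alternative Ted describes. Same hypothetical input format; again, the names are illustrative only.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsCooccurrence {

  // Emits one ((itemA, itemB), 1) record per observed co-occurrence.
  public static class PairsMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] items = value.toString().trim().split("\\s+");
      for (String a : items) {
        for (String b : items) {
          if (!a.equals(b)) {
            pair.set(a + '\t' + b);
            ctx.write(pair, ONE);
          }
        }
      }
    }
  }

  // Used as both combiner and reducer; the local aggregation it performs on
  // each map task's output is what keeps the shuffle size in check.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }
}

In the driver, job.setCombinerClass(SumReducer.class) and job.setReducerClass(SumReducer.class) wire it together; that single setCombinerClass call is the "good use of a combiner" the discussion refers to.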
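Finally, on the 160 GB itself: Sebastian's point is that the intermediate data grows with the square of the most popular item's preference count. A cap of the kind MAHOUT-460 asks for might look roughly like the helper below. This is only a guess at the idea, with made-up names and threshold, not the actual patch.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public final class PreferenceSampling {

  private PreferenceSampling() {}

  // Returns at most maxPrefs of the given preferences, chosen uniformly at
  // random, so a very popular item contributes at most maxPrefs^2 pairs
  // instead of the square of its full preference count.
  public static <T> List<T> capPreferences(List<T> prefs, int maxPrefs, Random random) {
    if (prefs.size() <= maxPrefs) {
      return prefs;
    }
    List<T> sampled = new ArrayList<T>(prefs);
    Collections.shuffle(sampled, random);
    return sampled.subList(0, maxPrefs);
  }
}

With a cap of, say, 500 preferences per item, the head items in a dataset like Netflix no longer dominate the intermediate output, which is presumably where most of the disk usage comes from.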
