Re: Pipelining Mappers and Reducers

2010-08-13 Thread Ferdy Galema
Hello, thank you for your efforts. We've also been trying some things, and currently we have implemented a solution that we are satisfied with. It allows us to instantly start merging any indexing job. It's a single class that wraps an indexing Job and submits it to the JobTracker, asynchronously.
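A driver-side sketch of such a wrapper (this is not Ferdy's actual class; it assumes the old 0.20-era mapred API, and the class/method names of the wrapper itself are made up for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class AsyncIndexJob {
    // Submits the job to the JobTracker and returns immediately.
    // JobClient.runJob() would block until completion; submitJob() does not.
    public static RunningJob submitAsync(JobConf conf) throws IOException {
        JobClient client = new JobClient(conf);
        return client.submitJob(conf);
    }
}
```

The caller can then poll the returned `RunningJob` (e.g. `isComplete()`, `mapProgress()`) and start merging finished sub-indexes while the rest of the job is still running.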

Re: Pipelining Mappers and Reducers

2010-08-08 Thread Shai Erera
Hi, I've done some work and thought I'd report back the results (which are not too encouraging).
Approach 1:
* Mappers output a side-effect Lucene Directory (written on-disk) and a pair where the value is the location of the index on disk and the key is unimportant for now.
* Reducer merges the on-disk indexes.
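Approach 1's mapper might look roughly like this (a sketch, not Shai's actual code; it assumes the old mapred API, Lucene 3.x, and a local scratch directory, and the field name "body" and output key "index" are placeholders):

```java
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private IndexWriter writer;                    // the side-effect index
    private String indexPath;                      // where it lives on local disk
    private OutputCollector<Text, Text> collector; // remembered so close() can emit

    @Override
    public void configure(JobConf job) {
        try {
            indexPath = "/tmp/index-" + job.get("mapred.task.id");
            writer = new IndexWriter(FSDirectory.open(new File(indexPath)),
                    new StandardAnalyzer(Version.LUCENE_30),
                    true, IndexWriter.MaxFieldLength.UNLIMITED);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(LongWritable offset, Text doc,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        collector = output;
        Document d = new Document();
        d.add(new Field("body", doc.toString(), Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(d);
    }

    @Override
    public void close() throws IOException {
        writer.close();
        // The key is unimportant; the value tells the reducer where the index is.
        collector.collect(new Text("index"), new Text(indexPath));
    }
}
```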

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Ferdy Galema
Very well. Could you keep us informed on how your instant merging plans work out? We're actually running a similar indexing process. It's very interesting to be able to start merging Lucene indexes once the first mappers have finished, instead of waiting until ALL mappers have finished. Shai

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Shai Erera
Well, the current scale does not warrant sharding the index. 13GB of data and an ~8-10GB index is something a single machine (even a modest one) can handle pretty well. And as the data grows, so will the number of Mappers and the number of sub-indexes. So at some point I will need to merge in

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Ferdy Galema
You're right that merging cannot be done by simply appending. Have you thought about the possibility of actually taking advantage of the fact that your final index will be split into several segments? Especially if you plan to increase the scale of the input data, you may eventually want

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Shai Erera
Specifically, at the moment a Mapper's output is a Lucene index, and the Reducer's job is to take all the indexes and merge them down to 1. Searching an index w/ hundreds of segments is inefficient. Basically, if you think of the Reducer as holding the 'final' index (to be passed on in the larger p
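The merge step the reducer performs could be sketched like this (an assumption-laden sketch, not the thread's actual code; it assumes Lucene 3.x, where `addIndexesNoOptimize` copies segments from other directories without re-analyzing documents):

```java
import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexMerger {
    // Merges several on-disk sub-indexes into one target index, then
    // optimizes so searches don't have to visit hundreds of tiny segments.
    public static void merge(File target, File... subIndexes) throws IOException {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(target),
                new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        try {
            Directory[] dirs = new Directory[subIndexes.length];
            for (int i = 0; i < subIndexes.length; i++) {
                dirs[i] = FSDirectory.open(subIndexes[i]);
            }
            writer.addIndexesNoOptimize(dirs); // segment copy, no re-indexing
            writer.optimize();                 // collapse down to few segments
        } finally {
            writer.close();
        }
    }
}
```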

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Ferdy Galema
Could you elaborate on how you merge your data? If you have independent map tasks with single key-value outputs that can be written as soon as possible, I'm still curious why you need to reduce at all. Surely there must be a way to merge your data 'on read'. About multi-level reducing; if you merge

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Shai Erera
Yes, I'm aware of the fact that I can run w/ 0 reducers. However, I do need my output to be merged, and unfortunately the merge is not so simple that I can use 'getmerge'. I think I'll give the Combiner a chance now - hopefully its reduce() method will get called as Mappers finish their work

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Ferdy Galema
Just a quick pointer here: you are aware of the fact that you can configure the number of reduce tasks to be zero? If I read correctly, you mention that the order of your map outputs in the merge does not really matter, as well as having a single value for every key. Perhaps you could eliminate your reducer
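A driver-side configuration fragment for the map-only setup Ferdy suggests (assuming the old mapred API; `IndexingDriver` is a placeholder class name). With zero reduce tasks the sort/shuffle phase is skipped entirely, and each mapper's output file lands in HDFS as soon as that mapper finishes:

```java
import org.apache.hadoop.mapred.JobConf;

// Placeholder driver class; only the setNumReduceTasks(0) call matters here.
JobConf conf = new JobConf(IndexingDriver.class);
conf.setNumReduceTasks(0); // map-only job: no shuffle, no reduce phase
```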

Re: Pipelining Mappers and Reducers

2010-07-28 Thread Shai Erera
Without going too much into the detail (so that we keep the discussion simple) -- I'm processing large volumes of text documents. Currently I'm training (myself :)) on a 13GB collection, but after I figure out the recipe for writing the proper mix of Mappers/Reducers/Jobs, I will move to a TB collection

Re: Pipelining Mappers and Reducers

2010-07-27 Thread Gregory Lawrence
Shai, it's hard to determine what the best solution would be without knowing more about your problem. In general, combiner functions work well, but they will be of little value if each mapper output contains a unique key. This is because combiner functions only "combine" multiple values associated with the same key.
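Gregory's point about combiners can be shown with a small in-memory simulation (plain Java, no Hadoop; this only illustrates the semantics, not the real combiner API):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Simulates what a combiner does: collapse repeated keys in one mapper's
    // output into (key, partial-sum) pairs before they reach the reducer.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Repeated keys: the combiner shrinks 4 records down to 2.
        List<Map.Entry<String, Integer>> repeated = List.of(
                Map.entry("a", 1), Map.entry("b", 1), Map.entry("a", 1), Map.entry("a", 1));
        System.out.println(combine(repeated)); // {a=3, b=1}

        // Unique keys (Shai's case): nothing to combine, no savings at all.
        List<Map.Entry<String, Integer>> unique = List.of(
                Map.entry("x", 1), Map.entry("y", 1));
        System.out.println(combine(unique)); // {x=1, y=1}
    }
}
```

Since every one of Shai's mappers emits a distinct key, the second case applies: a combiner would pass every record through unchanged.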

Re: Pipelining Mappers and Reducers

2010-07-27 Thread Shai Erera
Thanks for the prompt response, Amogh! I'm kinda rookie w/ Hadoop, so please forgive my perhaps "too rookie" questions :).
> Check the property mapred.reduce.slowstart.completed.maps
From what I read here (http://hadoop.apache.org/common/docs/current/mapred-default.html), this parameter controls

Re: Pipelining Mappers and Reducers

2010-07-27 Thread Amogh Vasekar
Hi,
> What would really be great for me is if I could have the Reducer start processing the map outputs as they are ready, and not after all Mappers finish
Check the property mapred.reduce.slowstart.completed.maps
> I've read about chaining mappers, but to the best of my understanding the se
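Setting the property Amogh mentions is a one-liner (a configuration fragment using the old mapred property name from the thread). Note what it actually controls: the fraction of map tasks that must complete before reduce tasks are scheduled. Early-started reducers begin copying finished map outputs, but their reduce() calls still run only after ALL maps are done:

```java
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
// Schedule reducers once 5% of maps have completed (0.05 is also the
// documented default), so the shuffle/copy phase overlaps the map phase.
conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.05f);
</imports>
```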

Pipelining Mappers and Reducers

2010-07-27 Thread Shai Erera
Hi, I have a scenario for which I'd like to write an MR job in which Mappers do some work and eventually the output of all Mappers needs to be combined by a single Reducer. Each Mapper outputs a key that is distinct from all other Mappers', meaning the Reducer.reduce() method always receives a single element
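The scenario described above (distinct keys, one value per key, a single Reducer) could be sketched like this with the old mapred API; the class name and Text types are placeholders, not Shai's actual code:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MergeReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Keys are distinct across Mappers, so this iterator always holds
        // exactly one element; reduce() just folds it into the combined output.
        Text single = values.next();
        output.collect(key, single);
    }
}
```

With `setNumReduceTasks(1)` in the driver, all such singleton groups funnel through this one Reducer, which is where the index-merging work described later in the thread would live.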