Re: hadoop without disk i/o

2010-07-29 Thread juber patel
Thanks, Ken, for the reply. Well, that's what I am doing right now. But the output from the mappers needs to be processed together. Due to the nature of the problem, sorting is trivial once the map output becomes available. That's why I don't want to spend time in Hadoop's built-in sort, which involves disk i/o.

Re: hadoop without disk i/o

2010-07-29 Thread Ken Goodhope
If you don't need sorted input, then you probably don't even need a reducer. Try putting all your functionality in the mapper and then set reduce tasks to zero. On Thursday, July 29, 2010, juber patel wrote: > Hi, > > Is it possible to use Hadoop and not use disk i/o, apart from the > initial input?
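A map-only job of the kind Ken describes could look like the following Hadoop Streaming mapper - a minimal, hypothetical sketch (the line-length logic is a stand-in for the real per-record work). With zero reduce tasks, each mapper's output is written directly to the output directory with no sort or shuffle phase.

```python
#!/usr/bin/env python
# Minimal sketch of a Hadoop Streaming mapper for a map-only job.
# With reduce tasks set to zero, each mapper's output goes straight
# to HDFS; the sort/shuffle phase (and its disk i/o) is skipped.
import sys

def map_line(line):
    """Emit one tab-separated key/value pair per input line.
    The word-count-per-line logic here is a placeholder."""
    line = line.rstrip("\n")
    return "%s\t%d" % (line, len(line.split()))

if __name__ == "__main__":
    for line in sys.stdin:
        print(map_line(line))
```

Submission would be along the lines of `hadoop jar hadoop-streaming.jar -D mapred.reduce.tasks=0 -mapper mapper.py -file mapper.py -input in -output out` (the exact jar path and property name depend on your Hadoop version).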

hadoop without disk i/o

2010-07-29 Thread juber patel
Hi, Is it possible to use Hadoop and not use disk i/o, apart from the initial input? I am asking this with the assumption that disk i/o is the bottleneck in overall processing, even more than network access if you are on a dedicated, high-speed cluster. (Does anyone have experience to confirm this?)

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Ferdy Galema
Very well. Could you keep us informed on how your instant merging plans work out? We're actually running a similar indexing process. It's very interesting to be able to start merging Lucene indexes once the first mappers have finished, instead of waiting until ALL mappers have finished.

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Shai Erera
Well, the current scale does not warrant sharding up the index. 13GB of data and an ~8-10GB index is something a single machine (even not a strong one) can handle pretty well. And as the data grows, so will the number of Mappers, and with them the number of sub-indexes, so at some point I will need to merge indexes.

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Ferdy Galema
You're right that merging cannot be done by simply appending. Have you thought about the possibility of actually taking advantage of the fact that your final index will be split into several segments? Especially if you plan to increase the scale of the input data, you may eventually want to keep the index sharded.

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Shai Erera
Specifically, at the moment a Mapper's output is a (Lucene) index, and the Reducer's job is to take all the indexes and merge them down to 1. Searching an index w/ hundreds of segments is inefficient. Basically, you can think of the Reducer as holding the 'final' index (to be passed on in the larger pipeline).

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Ferdy Galema
Could you elaborate on how you merge your data? If you have independent map tasks with single key-value outputs that can be written as soon as possible, I'm still curious why you need to reduce at all. Surely there must be a way to merge your data 'on read'. About multi-level reducing: if you merge…
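The 'merge on read' idea can be illustrated with a small, hypothetical sketch: if each mapper writes its part of the output sorted by key, a reader can lazily merge the per-mapper files into one sorted stream at read time, with no reduce phase at all. The in-memory lists below stand in for mapper part-files.

```python
import heapq

def merge_on_read(*sorted_runs):
    """Lazily merge several already-sorted mapper outputs into one
    sorted stream; nothing is re-sorted or materialized up front."""
    return heapq.merge(*sorted_runs)

# Each list stands in for one mapper's sorted part-file on HDFS.
part0 = [("apple", 1), ("cherry", 2)]
part1 = [("banana", 3), ("date", 1)]
merged = list(merge_on_read(part0, part1))
```

This only works if every mapper's output is individually sorted, which matches the thread's premise that the map output is already effectively sorted when it becomes available.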

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Shai Erera
Yes, I'm aware of the fact that I can run w/ 0 reducers. However, I do need my output to be merged, and unfortunately the merge is not so simple that I can use 'getmerge'. I think I'll give the Combiner a chance now - hopefully its reduce() method will get called as Mappers finish their work.
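One caveat with the Combiner plan: Hadoop may run a combiner zero, one, or several times on a mapper's output, so the combine operation must be associative and commutative, i.e. safe to re-apply to its own output. A small, hypothetical Python simulation of that property (using per-key summation as the example operation):

```python
def combine(pairs):
    """Mapper-side local aggregation: sum values per key.
    Because Hadoop may invoke a combiner zero, one, or many times,
    the operation must be safe to re-apply to its own output."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())

once = combine([("a", 1), ("b", 2), ("a", 3)])
# Re-applying the combiner to its own output must not change it.
twice = combine(once)
```

A merge step like Shai's (combining sub-indexes) satisfies this as long as merging already-merged indexes is equivalent to merging the originals.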

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Ferdy Galema
Just a quick pointer here: are you aware of the fact that you can configure the number of reduce tasks to be zero? If I read correctly, you mention that the order of your map outputs in the merge does not really matter, as well as having a single value for every key. Perhaps you could eliminate your reducers entirely.