What does the data look like? You mention 30k records: is that for the 10MB case or the 600MB case, or do you have a constant 30k records with vastly varying file sizes?
If the data is 10MB and you have 30k records, each taking ~2 minutes to process, I'd suggest using the map phase only to distribute the records across several reducers, then doing the actual processing in the reduce phase. A rough sketch follows below the quote.

On Fri, Nov 27, 2009 at 7:07 PM, CubicDesign <cubicdes...@gmail.com> wrote:
> Ok. I have set the number of maps to about 1760 (11 nodes * 16 cores/node *
> 10, as recommended by the Hadoop documentation) and my job still takes
> several hours to run instead of one.
>
> Can the overhead added by Hadoop really be that big? I mean, I have over
> 30000 small tasks (about one minute each), each one starting its own JVM.
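For what it's worth, here is a minimal sketch of that pattern against the Hadoop 0.20 mapreduce API. It's not your actual job: the bucket count, the line-per-record input format, and the process() placeholder are all assumptions you'd replace with your own.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FanOut {

  // Map side does no real work: it just tags each record with a
  // bucket key so records spread evenly across the reducers.
  public static class SpreadMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    private static final int BUCKETS = 176; // e.g. 11 nodes * 16 cores
    private final LongWritable bucket = new LongWritable();

    @Override
    protected void map(LongWritable offset, Text record, Context ctx)
        throws IOException, InterruptedException {
      // Mask off the sign bit so the modulo is always non-negative.
      bucket.set((record.hashCode() & Integer.MAX_VALUE) % BUCKETS);
      ctx.write(bucket, record);
    }
  }

  // Reduce side is where the expensive ~2-minute-per-record work
  // happens, so one reducer JVM chews through many records instead
  // of Hadoop paying task + JVM startup for every single record.
  public static class ProcessReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable bucket, Iterable<Text> records,
        Context ctx) throws IOException, InterruptedException {
      for (Text record : records) {
        // process(record) stands in for your real computation.
        ctx.write(bucket, record);
      }
    }
  }
}

You'd set the number of reduce tasks in the job driver (job.setNumReduceTasks) to match BUCKETS. That way the map task count is tied to input splits rather than records, and with ~176 reducers each one handles roughly 170 records per JVM instead of one task per record.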