What does the data look like?

You mention 30k records: is that for 10MB or for 600MB of data, or do you have a
constant 30k records with vastly varying file sizes?

If the data is 10MB and you have 30k records, and it takes ~2 mins to
process each record, I'd suggest using the map phase just to spread the
records across several reducers and doing the actual processing in reduce.
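Something along these lines (a rough sketch, not your actual job: it assumes
TextInputFormat with one record per line, and processRecord() is just a
placeholder for your ~2-minute step):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class SpreadAndProcess {

    // The mapper does no heavy work; it only tags each record with a
    // bucket key so the records get spread evenly across the reducers.
    public static class SpreadMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {
      private static final int NUM_BUCKETS = 176; // roughly your reducer slots; adjust for your cluster
      private final IntWritable bucket = new IntWritable();

      @Override
      protected void map(LongWritable offset, Text record, Context context)
          throws IOException, InterruptedException {
        bucket.set((int) (offset.get() % NUM_BUCKETS));
        context.write(bucket, record);
      }
    }

    // The reducer is where the real per-record processing happens.
    public static class ProcessReducer
        extends Reducer<IntWritable, Text, Text, Text> {
      @Override
      protected void reduce(IntWritable bucket, Iterable<Text> records, Context context)
          throws IOException, InterruptedException {
        for (Text record : records) {
          String result = processRecord(record.toString());
          context.write(record, new Text(result));
        }
      }

      // Stand-in for the ~2-minute computation per record.
      private String processRecord(String record) {
        return record;
      }
    }
  }

That way you only pay the per-task startup cost once per reducer instead of
once per record, and you can size the number of reducers to your cluster.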



On Fri, Nov 27, 2009 at 7:07 PM, CubicDesign <cubicdes...@gmail.com> wrote:

> Ok. I have set the number of maps to about 1760 (11 nodes * 16 cores/node *
> 10 as recommended by the Hadoop documentation) and my job still takes several
> hours to run instead of one.
>
> Can the overhead added by Hadoop really be that big? I mean, I have over 30000
> small tasks (about one minute each), each one starting its own JVM.
>
>
>
