Hello Jason,

> In general if the values become very large, it becomes simpler to store
> them out-of-line in HDFS, and just pass the HDFS path for the item as
> the value in the map reduce task. This greatly reduces the amount of IO
> done, and doesn't blow up the sort space on the reducer. You lose the
> magic of data locality, but given the item size, you gain the IO back
> by not having to pass the full values to the reducer, or handle them
> when sorting the map outputs.

Ah, that actually sounds like a nice idea: instead of having the reducer emit
the huge value, it can write the value to a temporary file and emit the
filename instead.
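A minimal sketch of that spill-to-file pattern, in Python for illustration (a local temp directory stands in for HDFS, and the `emit_value` helper and 1 MB threshold are my own hypothetical names, not anything from the Hadoop API):

```python
import os
import tempfile

def emit_value(key, value, threshold=1024 * 1024, spill_dir=None):
    """Emit (key, value) directly if the value is small; otherwise
    spill the value to a side file and emit (key, path) instead,
    so only the reference travels through the shuffle/sort."""
    if len(value) <= threshold:
        return (key, value)
    # Large value: write it out-of-line and pass a reference.
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".blob")
    with os.fdopen(fd, "wb") as f:
        f.write(value)
    return (key, "file://" + path)
```

The consumer side would then check whether the value is a path reference and read the payload back from the file system, which is the part where data locality is given up in exchange for a much smaller sort space.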

I wasn't really planning on having huge values anyway (values above 1 MB will be
the exception rather than the rule), but since it's theoretically possible for
our software to generate them, it seemed like a good idea to investigate any 
real constraints that we might run into.

Your idea sounds like a good workaround for this. Thanks!


Regards,

Leon Mergen
