In general, if the values become very large, it is simpler to store them
out of line in HDFS and just pass the HDFS path for the item as the value in
the map/reduce task.
This greatly reduces the amount of I/O done, and it doesn't blow up the sort
space on the reducer.
You lose the magic of data locality, but given the item size you gain the
I/O back by not having to pass the full values to the reducer or handle
them when sorting the map outputs.
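
For what it's worth, here is a minimal sketch of that pattern with the old
org.apache.hadoop.mapred API; the /tmp/large-values side directory, the
100 MB cut-off, and the one-file-per-key naming are just illustrative
choices, not anything Hadoop requires:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LargeValueMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, Text> {

  private static final long THRESHOLD = 100L * 1024 * 1024; // illustrative cut-off
  private FileSystem fs;
  private Path sideDir;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
      sideDir = new Path("/tmp/large-values");   // hypothetical side-file directory
      fs.mkdirs(sideDir);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(Text key, BytesWritable value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    if (value.getLength() < THRESHOLD) {
      // Small enough to go through the shuffle directly (payload assumed UTF-8 text).
      output.collect(key,
          new Text(new String(value.getBytes(), 0, value.getLength(), "UTF-8")));
    } else {
      // Too large: write the payload out of line in HDFS and emit only its path.
      // Naming the side file after the key assumes one payload per key.
      Path side = new Path(sideDir, key.toString());
      FSDataOutputStream out = fs.create(side, true);
      out.write(value.getBytes(), 0, value.getLength());
      out.close();
      output.collect(key, new Text(side.toString()));
    }
  }
}

The reducer then opens each such path with FileSystem.open() and streams the
bytes straight from HDFS, so only the short path strings ever go through the
sort and shuffle.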

On Thu, Jun 18, 2009 at 8:45 AM, Leon Mergen <l.p.mer...@solatis.com> wrote:

> Hello Owen,
>
> > Keys and values can be large. They are certainly capped above by
> > Java's 2GB limit on byte arrays. More practically, you will have
> > problems running out of memory with keys or values of 100 MB. There is
> > no restriction that a key/value pair fits in a single hdfs block, but
> > performance would suffer. (In particular, the FileInputFormats split
> > at block sized chunks, which means you will have maps that scan an
> > entire block without processing anything.)
>
> Thanks for the quick reply.
>
> Could you perhaps elaborate on that 100 MB limit? Is it caused by the Java
> VM heap size? If so, could it, for example, be increased to 512 MB by
> setting mapred.child.java.opts to '-Xmx512m'?
>
> Regards,
>
> Leon Mergen
>
>
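
On the -Xmx512m question in the quoted mail: the child heap can be set per
job. A rough skeleton of doing that from a driver with the old JobConf API
(LargeValueDriver is a placeholder name, and 512 MB is just the figure from
Leon's example):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LargeValueDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(LargeValueDriver.class);
    // Give every map and reduce child JVM a 512 MB heap.
    job.set("mapred.child.java.opts", "-Xmx512m");
    // ... input/output paths, formats, mapper and reducer classes go here ...
    JobClient.runJob(job);
  }
}

Keep in mind that a big key or value still has to fit in that heap (and under
Java's 2 GB byte array ceiling, as Owen noted), so raising -Xmx only stretches
the limit rather than removing it.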


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
