Hello Jason,

> In general, if the values become very large, it becomes simpler to store
> them out-of-line in HDFS, and just pass the HDFS path for the item as the
> value in the map reduce task. This greatly reduces the amount of IO done,
> and doesn't blow up the sort space on the reducer. You lose the magic of
> data locality, but given the item size, you gain the IO back by not
> having to pass the full values to the reducer, or handle them when
> sorting the map outputs.
Ah, that actually sounds like a nice idea: instead of having the job emit the huge value directly, it can write it to a temporary file and emit the filename instead. I wasn't really planning on having huge values anyway (values above 1 MB will be the exception rather than the rule), but since it's theoretically possible for our software to generate them, it seemed like a good idea to investigate any real constraints we might run into. Your idea sounds like a good workaround for this. Thanks!

Regards,

Leon Mergen
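For what it's worth, the out-of-line pattern discussed above could be sketched roughly as below. This is a minimal illustration in Python, with the local filesystem standing in for HDFS; the `emit_value`/`resolve_value` helper names and the 1 MB threshold are made up for the example, not part of any Hadoop API.

```python
import os
import tempfile

# Values larger than this get spilled to a file; only the path is emitted.
# The 1 MB threshold is illustrative -- tune it to the job's sort-buffer size.
SPILL_THRESHOLD = 1 * 1024 * 1024

def emit_value(key, value, spill_dir):
    """Return the (key, value) pair to emit, spilling large values out-of-line.

    In a real Hadoop job the file would be written to HDFS via the
    FileSystem API and the path emitted as the map output value; here the
    local filesystem stands in for HDFS.
    """
    if len(value) <= SPILL_THRESHOLD:
        return key, value
    # Write the oversized value to its own file and emit the path instead,
    # so the sort/shuffle phase only ever sees the short path string.
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".spill")
    with os.fdopen(fd, "wb") as f:
        f.write(value)
    return key, path.encode()

def resolve_value(value):
    """Reducer-side helper: load the value back if it is a spill path.

    A real implementation would tag spilled values explicitly rather than
    probe the filesystem like this; the check here keeps the sketch short.
    """
    if os.path.isfile(value):
        with open(value, "rb") as f:
            return f.read()
    return value
```

The reducer then reads the file only when it actually needs the payload, which is what keeps the sort space and shuffle IO small.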