Problem Background:
I have a Hadoop MapReduce program that uses an IPv6 radix tree to provide
auxiliary input during the reduce phase of the second job in its
workflow, but doesn't need the data at any other point.
It seems pretty straightforward to use the distributed cache to build
this data structure inside each reducer in the setup() method.
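Roughly what I'm doing now, assuming the Hadoop 1.x mapreduce API
(IPv6RadixTree and its loadFrom() method are stand-ins for my actual
structure):

    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LookupReducer extends Reducer<Text, Text, Text, Text> {
        private IPv6RadixTree tree;  // stand-in for the real radix tree type

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Every reducer task rebuilds its own copy of the tree from the
            // locally cached file: this is the per-task memory cost at issue.
            Path[] cached =
                DistributedCache.getLocalCacheFiles(context.getConfiguration());
            tree = IPv6RadixTree.loadFrom(cached[0].toString());
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // ... look up values in the tree while reducing ...
        }
    }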
This solution is functional, but it ends up using a large amount of
memory when 3 or more reducers run on the same node, and the setup time
for the radix tree is non-trivial.
Additionally, the IPv6 version of the structure is quite a bit larger in
memory (100MB+).
Question:
Is there a "good" way to share this data structure across all reducers
on the same node within the Hadoop framework?
Initial Thoughts:
It seems like this might be possible by altering the Task JVM Reuse
parameters, but from what I have read this would also affect map tasks,
and I'm concerned about drawbacks and side effects.
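For reference, here is a rough sketch of how I imagine the JVM reuse idea
could be combined with a static holder, assuming Hadoop 1.x where
mapred.job.reuse.jvm.num.tasks controls reuse (IPv6RadixTree is the same
stand-in as above):

    // With mapred.job.reuse.jvm.num.tasks set to -1, a child JVM is reused
    // for an unlimited number of this job's tasks, so a static field built
    // once survives across the reducers that JVM runs in sequence.
    public class SharedTreeHolder {
        private static volatile IPv6RadixTree shared;  // lives across reused tasks

        // Called from each reducer's setup(): only the first task in a given
        // child JVM pays the build cost; later tasks reuse the instance.
        public static IPv6RadixTree get(String localCachePath) {
            if (shared == null) {
                synchronized (SharedTreeHolder.class) {
                    if (shared == null) {
                        shared = IPv6RadixTree.loadFrom(localCachePath);
                    }
                }
            }
            return shared;
        }
    }

    // In the driver, when configuring the job:
    //   conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);  // -1 = unlimited reuse

From what I understand, though, a reused JVM runs its tasks one after
another in a single slot, so this would amortize the build time, but
reducers running concurrently in separate slots would still each hold
their own copy, which is why I'm not sure it solves the memory problem.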
Thanks for your help!