When programming in Hadoop it is possible to guarantee that:

1. All keys sent to a specific partition will be handled by the same machine (thread).
2. All keys received by a specific machine (thread) will arrive in sorted order.
3. These conditions hold even if the values associated with a specific key are too large to fit in memory.
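To illustrate why guarantee 2 matters, here is a minimal sketch (my own function names, not from any particular library) of how sorted arrival lets a reducer process each key's values as a lazy stream, in the spirit of Hadoop's reduce contract: because keys arrive sorted, a group ends as soon as the key changes, so no group ever has to be fully materialized.

```python
import itertools

def stream_reduce(sorted_pairs, reduce_fn):
    """Consume (key, value) pairs that arrive in key-sorted order,
    applying reduce_fn to each key's values as a lazy stream.
    Since the keys are sorted, groupby can detect the end of a group
    without buffering it, so memory use is O(1) per group."""
    for key, group in itertools.groupby(sorted_pairs, key=lambda kv: kv[0]):
        # `group` is a lazy iterator over this key's values
        yield key, reduce_fn(v for _, v in group)

# Word-count-style sum over pre-sorted pairs
pairs = [("a", 1), ("a", 2), ("b", 5), ("b", 1), ("b", 1)]
result = dict(stream_reduce(iter(pairs), sum))
# result == {"a": 3, "b": 7}
```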
In my Hadoop code I rely on all of these conditions; in particular, with my larger data sets the data I wish to group exceeds the available memory. I think I understand how groupby operates, but my understanding is that it requires the results for a single key, and perhaps for all keys, to fit on a single machine. Is there a way to perform a grouping like Hadoop's without requiring that an entire group fit in memory?
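To make the concern concrete, this is the in-memory grouping pattern I am trying to avoid (a sketch with my own names, not any specific library's implementation): every value for a key is collected into a list before any reduction runs, so a single oversized group exhausts memory.

```python
from collections import defaultdict

def in_memory_group(pairs):
    """Naive grouping: materializes every value for every key.
    If the values for one key exceed available memory, this fails,
    unlike Hadoop's streaming reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # holds all values at once
    return dict(groups)

grouped = in_memory_group([("a", 1), ("a", 2), ("b", 5)])
# grouped == {"a": [1, 2], "b": [5]}
```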