When programming in Hadoop it is possible to guarantee:
1) All keys sent to a specific partition will be handled by the same
machine (thread)
2) All keys received by a specific machine (thread) will be received in
sorted order
3) These conditions will hold even if the values associated with a specific
key are too large to fit in memory.
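For concreteness, guarantee 1 comes from the partitioner: Hadoop's default HashPartitioner routes each key by its hash code, so every record with a given key lands in the same partition and is therefore handled by the same reduce task. A minimal sketch of that rule in plain Java (the class name and key values here are illustrative, not from my job):

```java
import java.util.Objects;

public class PartitionSketch {
    // Mirrors the logic of Hadoop's default HashPartitioner:
    // (hash & Integer.MAX_VALUE) % numPartitions. The mask clears the
    // sign bit so negative hash codes still yield a valid partition index.
    static int partitionFor(Object key, int numPartitions) {
        return (Objects.hashCode(key) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        // The same key always resolves to the same partition, so one
        // reduce task (machine/thread) sees every value for that key.
        int p1 = partitionFor("user-42", numPartitions);
        int p2 = partitionFor("user-42", numPartitions);
        System.out.println(p1 == p2);                  // prints true
        System.out.println(p1 >= 0 && p1 < numPartitions); // prints true
    }
}
```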

In my Hadoop code I rely on all of these guarantees; in particular, with my
larger data sets the data I wish to group exceeds the available memory.
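To illustrate what I mean by grouping data that exceeds memory: in a Hadoop reducer, the values for a key arrive as an Iterable that can be consumed one record at a time, so an aggregate can be computed in constant memory without ever materializing the whole group. A hypothetical stand-in in plain Java (no Hadoop dependency; the names are mine):

```java
import java.util.Iterator;
import java.util.List;

public class StreamingReduceSketch {
    // Consumes the values for one key as a stream, keeping only a running
    // total in memory -- the group itself is never buffered as a whole.
    static long sumForKey(Iterator<Long> values) {
        long total = 0;
        while (values.hasNext()) {
            total += values.next();
        }
        return total;
    }

    public static void main(String[] args) {
        Iterator<Long> values = List.of(1L, 2L, 3L).iterator();
        System.out.println(sumForKey(values)); // prints 6
    }
}
```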

I think I understand how groupby operates, but my understanding is that it
requires the values for a single key (and perhaps for all keys) to fit on
a single machine.

Is there a way to perform grouping the way Hadoop does, without requiring
that an entire group fit in memory?
