When programming in Hadoop it is possible to guarantee that:

1. All keys sent to a specific partition will be handled by the same machine (thread).
2. All keys received by a specific machine (thread) will arrive in sorted order.
3. These conditions hold even if the values associated with a specific key are too large to fit in memory.
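To illustrate why guarantee 2 matters, here is a minimal sketch (my own function names, not from any particular library) of how sorted arrival lets a reducer process each key's values as a lazy stream, in the spirit of Hadoop's reduce contract: because keys arrive sorted, a group ends as soon as the key changes, so no group ever has to be fully materialized.

```python
import itertools

def stream_reduce(sorted_pairs, reduce_fn):
    """Consume (key, value) pairs that arrive in key-sorted order,
    applying reduce_fn to each key's values as a lazy stream.
    Since the keys are sorted, groupby can detect the end of a group
    without buffering it, so memory use is O(1) per group."""
    for key, group in itertools.groupby(sorted_pairs, key=lambda kv: kv[0]):
        # `group` is a lazy iterator over this key's values
        yield key, reduce_fn(v for _, v in group)

# Word-count-style sum over pre-sorted pairs
pairs = [("a", 1), ("a", 2), ("b", 5), ("b", 1), ("b", 1)]
result = dict(stream_reduce(iter(pairs), sum))
# result == {"a": 3, "b": 7}
```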
In my Hadoop code I rely on all of these conditions; in particular, with my larger data sets the data I wish to group exceeds the available memory. I think I understand how groupby operates, but my understanding is that it requires the results for a single key, and perhaps for all keys, to fit on a single machine. Is there a way to perform a grouping like Hadoop's without requiring that an entire group fit in memory?
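To make the concern concrete, this is the in-memory grouping pattern I am trying to avoid (a sketch with my own names, not any specific library's implementation): every value for a key is collected into a list before any reduction runs, so a single oversized group exhausts memory.

```python
from collections import defaultdict

def in_memory_group(pairs):
    """Naive grouping: materializes every value for every key.
    If the values for one key exceed available memory, this fails,
    unlike Hadoop's streaming reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # holds all values at once
    return dict(groups)

grouped = in_memory_group([("a", 1), ("a", 2), ("b", 5)])
# grouped == {"a": [1, 2], "b": [5]}
```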