Matei, it is good to hear that the restriction that keys need to fit in memory no longer applies to combineByKey. However, join requiring keys to fit in memory is still a big deal to me. Does it apply to both sides of the join, or only one (while the other side is streamed)?
On Sat, Aug 30, 2014 at 1:30 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> In 1.1, you'll be able to get all of these properties using sortByKey, and
> then mapPartitions on top to iterate through the key-value pairs.
> Unfortunately sortByKey does not let you control the Partitioner, but it's
> fairly easy to write your own version that does if this is important.
>
> In previous versions, the values for each key had to fit in memory (though
> we could have data on disk across keys), and this is still true for
> groupByKey, cogroup and join. Those restrictions will hopefully go away in
> a later release. But sortByKey + mapPartitions lets you just iterate
> through the key-value pairs without worrying about this.
>
> Matei
>
> On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:
>
> When programming in Hadoop it is possible to guarantee:
> 1) All keys sent to a specific partition will be handled by the same
> machine (thread)
> 2) All keys received by a specific machine (thread) will be received in
> sorted order
> 3) These conditions will hold even if the values associated with a
> specific key are too large to fit in memory.
>
> In my Hadoop code I use all of these conditions - specifically, with my
> larger data sets the size of the data I wish to group exceeds the
> available memory.
>
> I think I understand the operation of groupBy, but my understanding is
> that it requires that the results for a single key, and perhaps all keys,
> fit on a single machine.
>
> Is there a way to group as in Hadoop without requiring that an entire
> group fit in memory?
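
The pattern Matei suggests - sort by key, then walk the partition's iterator so that no single key's values are ever materialized at once - can be illustrated outside Spark. Below is a minimal plain-Python sketch (not Spark code): itertools.groupby over an already key-sorted stream plays the role of the function you would pass to mapPartitions after sortByKey, yielding each key's values as a lazy iterator.

```python
from itertools import groupby
from operator import itemgetter

def iter_groups(sorted_pairs):
    """Iterate over (key, values) groups of an already key-sorted stream
    of (key, value) pairs. Each group's values come back as a lazy
    iterator, so a single key's values never need to fit in memory -
    analogous to what a mapPartitions function sees after sortByKey."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        # 'group' is a lazy iterator over this key's pairs; strip the key
        yield key, (value for _, value in group)

# Example: sum the values per key without building per-key lists
pairs = sorted([("b", 2), ("a", 1), ("a", 3), ("b", 4)])
sums = {key: sum(values) for key, values in iter_groups(pairs)}
# sums == {"a": 4, "b": 6}
```

In Spark itself, the analogous call would be rdd.sortByKey().mapPartitions(f), where f receives the sorted (key, value) iterator for its partition. Note the same caveat as groupby: each group must be consumed before advancing to the next, since the underlying iterator is shared.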