Just to be clear, no operation requires all the keys to fit in memory, only the 
values for each specific key: all the values for an individual key need to fit, 
but the system can spill to disk across keys. Right now this restriction 
applies to both sides of the join, unless you do a broadcast join by hand with 
something like mapPartitions.
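
For illustration, a minimal sketch of that kind of hand-rolled broadcast join 
(the RDD names and element types are my own assumptions, not anything from this 
thread):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def broadcastJoin(sc: SparkContext,
                  large: RDD[(String, Int)],
                  small: RDD[(String, String)]): RDD[(String, (Int, String))] = {
  // Collect the small side on the driver and ship it to every executor once.
  val smallMap = sc.broadcast(small.collectAsMap())
  // Stream through the large side partition by partition; only the small
  // side ever has to fit in memory.
  large.mapPartitions { iter =>
    iter.flatMap { case (k, v) =>
      smallMap.value.get(k).map(w => (k, (v, w)))
    }
  }
}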

Matei

On August 31, 2014 at 12:44:26 PM, Koert Kuipers (ko...@tresata.com) wrote:

matei,
it is good to hear that the restriction that keys need to fit in memory no 
longer applies to combineByKey. however, join requiring keys to fit in memory 
is still a big deal to me. does it apply to both sides of the join, or only one 
(while the other side is streaming)?


On Sat, Aug 30, 2014 at 1:30 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
In 1.1, you'll be able to get all of these properties using sortByKey, and then 
mapPartitions on top to iterate through the key-value pairs. Unfortunately 
sortByKey does not let you control the Partitioner, but it's fairly easy to 
write your own version that does if this is important.

In previous versions, the values for each key had to fit in memory (though we 
could have data on disk across keys), and this is still true for groupByKey, 
cogroup and join. Those restrictions will hopefully go away in a later release. 
But sortByKey + mapPartitions lets you just iterate through the key-value pairs 
without worrying about this.
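
To make that concrete, here is a rough sketch of the sortByKey + mapPartitions 
pattern; the per-key aggregation (summing Long values) is just a placeholder 
assumption. Because sortByKey range-partitions the data, every occurrence of a 
key lands in the same partition in sorted order, so consecutive runs of equal 
keys can be consumed one pair at a time:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def sumPerKey(pairs: RDD[(String, Long)]): RDD[(String, Long)] = {
  pairs.sortByKey().mapPartitions { iter =>
    new Iterator[(String, Long)] {
      private val in = iter.buffered
      def hasNext: Boolean = in.hasNext
      def next(): (String, Long) = {
        // Consume one run of equal keys without materializing the whole
        // group; only the running sum is held in memory.
        val key = in.head._1
        var sum = 0L
        while (in.hasNext && in.head._1 == key) {
          sum += in.next()._2
        }
        (key, sum)
      }
    }
  }
}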

Matei

On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:

When programming in Hadoop it is possible to guarantee
1) All keys sent to a specific partition will be handled by the same machine 
(thread)
2) All keys received by a specific machine (thread) will be received in sorted 
order
3) These conditions will hold even if the values associated with a specific key 
are too large to fit in memory.

In my Hadoop code I rely on all of these conditions; specifically, with my 
larger data sets the size of the data I wish to group exceeds the available 
memory.

I think I understand the operation of groupByKey, but my understanding is that 
it requires that the results for a single key, and perhaps all keys, fit on a 
single machine.

Is there a way to operate as Hadoop does and not require that an entire group 
fit in memory?
