Matei, it is good to hear that the restriction that keys need to fit in memory no longer applies to combineByKey. However, join requiring keys to fit in memory is still a big deal to me. Does it apply to both sides of the join, or only one (while the other side is streamed)?
On Sat, Aug 30, 2014 at 1:30 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> In 1.1, you'll be able to get all of these properties using sortByKey, and
> then mapPartitions on top to iterate through the key-value pairs.
> Unfortunately sortByKey does not let you control the Partitioner, but it's
> fairly easy to write your own version that does if this is important.
>
> In previous versions, the values for each key had to fit in memory (though
> we could have data on disk across keys), and this is still true for
> groupByKey, cogroup and join. Those restrictions will hopefully go away in
> a later release. But sortByKey + mapPartitions lets you just iterate
> through the key-value pairs without worrying about this.
>
> Matei
>
> On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:
>
> When programming in Hadoop it is possible to guarantee:
> 1) All keys sent to a specific partition will be handled by the same
> machine (thread)
> 2) All keys received by a specific machine (thread) will be received in
> sorted order
> 3) These conditions will hold even if the values associated with a
> specific key are too large to fit in memory.
>
> In my Hadoop code I use all of these conditions - specifically, with my
> larger data sets the size of the data I wish to group exceeds the
> available memory.
>
> I think I understand the operation of groupBy, but my understanding is
> that it requires that the results for a single key, and perhaps all keys,
> fit on a single machine.
>
> Is there a way to group as in Hadoop without requiring that an entire
> group fit in memory?
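
The pattern Matei suggests - sort by key, then walk the partition's iterator so that no single key's values are ever materialized at once - can be illustrated outside Spark. Below is a minimal plain-Python sketch (not Spark code): itertools.groupby over an already key-sorted stream plays the role of the function you would pass to mapPartitions after sortByKey, yielding each key's values as a lazy iterator.

```python
from itertools import groupby
from operator import itemgetter

def iter_groups(sorted_pairs):
    """Iterate over (key, values) groups of an already key-sorted stream
    of (key, value) pairs. Each group's values come back as a lazy
    iterator, so a single key's values never need to fit in memory -
    analogous to what a mapPartitions function sees after sortByKey."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        # 'group' is a lazy iterator over this key's pairs; strip the key
        yield key, (value for _, value in group)

# Example: sum the values per key without building per-key lists
pairs = sorted([("b", 2), ("a", 1), ("a", 3), ("b", 4)])
sums = {key: sum(values) for key, values in iter_groups(pairs)}
# sums == {"a": 4, "b": 6}
```

In Spark itself, the analogous call would be rdd.sortByKey().mapPartitions(f), where f receives the sorted (key, value) iterator for its partition. Note the same caveat as groupby: each group must be consumed before advancing to the next, since the underlying iterator is shared.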