Re: Mapping Hadoop Reduce to Spark

Steve Lewis Sun, 31 Aug 2014 12:27:49 -0700

Is there a sample of how to do this?
I see that 1.1 is out, but I cannot find samples of mapPartitions.
A Java sample would be very useful.



On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> In 1.1, you'll be able to get all of these properties using sortByKey, and
> then mapPartitions on top to iterate through the key-value pairs.
> Unfortunately sortByKey does not let you control the Partitioner, but it's
> fairly easy to write your own version that does if this is important.
>
> In previous versions, the values for each key had to fit in memory (though
> we could have data on disk across keys), and this is still true for
> groupByKey, cogroup and join. Those restrictions will hopefully go away in
> a later release. But sortByKey + mapPartitions lets you just iterate
> through the key-value pairs without worrying about this.
>
> Matei
>
> On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com)
> wrote:
>
>  When programming in Hadoop it is possible to guarantee
> 1) All keys sent to a specific partition will be handled by the same
> machine (thread)
> 2) All keys received by a specific machine (thread) will be received in
> sorted order
> 3) These conditions will hold even if the values associated with a
> specific key are too large to fit in memory.
>
> In my Hadoop code I use all of these conditions - specifically with my
> larger data sets the size of data I wish to group exceeds the available
> memory.
>
> I think I understand the operation of groupby, but my understanding is that
> this requires that the results for a single key, and perhaps for all keys,
> fit on a single machine.
>
> Is there a way to behave like Hadoop and not require that an entire group
> fit in memory?
>
>
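
For illustration, the sortByKey + mapPartitions approach described in the
quoted reply comes down to walking a key-sorted iterator and detecting key
boundaries, so that only the current key and a running aggregate are held in
memory. Below is a minimal sketch of that loop in plain Java with no Spark
dependency. The class SortedGroupReducer and the methods pair and sumByKey are
hypothetical names chosen for this example, and summing stands in for whatever
per-group reduction is actually needed. In Spark, a loop like this would
presumably form the body of the function passed to mapPartitions on an RDD
that has already been through sortByKey; the exact function interface differs
between Spark versions, so it is not shown here.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of Spark): reduces a key-sorted stream of
 * (key, value) pairs to one (key, sum) pair per key. Only the current key
 * and its running total are kept, never the full list of values for a key.
 */
final class SortedGroupReducer {

    /** Convenience factory for an immutable (key, value) pair. */
    static Map.Entry<String, Long> pair(String key, long value) {
        return new SimpleImmutableEntry<String, Long>(key, value);
    }

    /**
     * Assumes the iterator yields pairs in sorted key order, as the records
     * of one partition would be after a sort by key.
     */
    static List<Map.Entry<String, Long>> sumByKey(Iterator<Map.Entry<String, Long>> sorted) {
        List<Map.Entry<String, Long>> out = new ArrayList<Map.Entry<String, Long>>();
        String currentKey = null;
        long total = 0L;
        boolean haveKey = false;

        while (sorted.hasNext()) {
            Map.Entry<String, Long> kv = sorted.next();
            // A change of key marks the end of the previous group: emit it.
            if (haveKey && !kv.getKey().equals(currentKey)) {
                out.add(pair(currentKey, total));
                total = 0L;
            }
            currentKey = kv.getKey();
            haveKey = true;
            total += kv.getValue();
        }
        // Emit the final group, if the input was not empty.
        if (haveKey) {
            out.add(pair(currentKey, total));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for one sorted partition.
        List<Map.Entry<String, Long>> partition =
                Arrays.asList(pair("a", 1L), pair("a", 2L), pair("b", 5L));
        System.out.println(sumByKey(partition.iterator())); // prints [a=3, b=5]
    }
}
```

The sketch collects one result per key into a list for simplicity. If the
number of distinct keys in a partition is itself very large, the same boundary
logic could be wrapped in a lazy Iterator instead, so that results are also
produced one at a time.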


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
