Assume I define a partitioner like:

    /**
     * Partition on the first letter.
     */
    public class PartitionByStart extends Partitioner {

        @Override
        public int numPartitions() {
            return 26;
        }
        @Override
        public int getPartition(final Object key) {
            String s = (String) key;
            if (s.length() == 0)
                throw new IllegalStateException("problem"); // ToDo change
            int ret = s.charAt(0) - 'A';
            ret = Math.min(25, ret);
            ret = Math.max(0, ret);
            return 25 - ret;
        }
    }

How, short of running on a large cluster, can I test that code which might look like (unrolling all the chained methods):

    ones = ones.partitionBy(new PartitionByStart());
    JavaPairRDD<String, Integer> sorted = ones.sortByKey();
    JavaRDD<WordNumber> answer = sorted.mapPartitions(new WordCountFlatMapFunction());

partitions properly? In other words, on a local instance, how would partitioning work, and what do I expect to see in switching from one partition to another as the code runs?

On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> In 1.1, you'll be able to get all of these properties using sortByKey, and
> then mapPartitions on top to iterate through the key-value pairs.
> Unfortunately sortByKey does not let you control the Partitioner, but it's
> fairly easy to write your own version that does if this is important.
>
> In previous versions, the values for each key had to fit in memory (though
> we could have data on disk across keys), and this is still true for
> groupByKey, cogroup and join. Those restrictions will hopefully go away in
> a later release. But sortByKey + mapPartitions lets you just iterate
> through the key-value pairs without worrying about this.
>
> Matei
>
> On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:
>
> When programming in Hadoop it is possible to guarantee:
> 1) All keys sent to a specific partition will be handled by the same
> machine (thread).
> 2) All keys received by a specific machine (thread) will be received in
> sorted order.
> 3) These conditions will hold even if the values associated with a
> specific key are too large to fit in memory.
>
> In my Hadoop code I use all of these conditions; specifically, with my
> larger data sets the size of the data I wish to group exceeds the
> available memory.
>
> I think I understand the operation of groupBy, but my understanding is
> that it requires that the results for a single key, and perhaps all keys,
> fit on a single machine.
>
> Is there a way to perform grouping the way Hadoop does and not require
> that an entire group fit in memory?
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
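Short of a cluster run, one cheap sanity check is to pull the partition arithmetic out of `PartitionByStart` into a plain JVM test that needs no SparkContext. This is only a sketch: the class name `PartitionByStartCheck` and the sample keys are invented here, and it exercises the math rather than Spark's shuffle behavior.

```java
// A minimal, Spark-free check of the arithmetic in PartitionByStart.getPartition,
// copied out as a static method so it can be exercised in a plain unit test.
public class PartitionByStartCheck {

    // Mirrors the body of PartitionByStart.getPartition for String keys.
    static int getPartition(String s) {
        if (s.length() == 0)
            throw new IllegalStateException("problem");
        int ret = s.charAt(0) - 'A';
        ret = Math.min(25, ret);
        ret = Math.max(0, ret);
        return 25 - ret;
    }

    public static void main(String[] args) {
        // The scheme reverses the alphabet: 'A' -> partition 25, 'Z' -> partition 0.
        System.out.println(getPartition("Apple")); // 25
        System.out.println(getPartition("Zebra")); // 0
        // A lowercase first letter ('a' - 'A' = 32) clamps to 25, so every
        // lowercase key lands in partition 0, worth knowing before a cluster run.
        System.out.println(getPartition("apple")); // 0
    }
}
```

To observe the actual partition boundaries on a local instance, one option (assuming a recent Spark build) is to run with `local[n]` and use `mapPartitionsWithIndex` in place of `mapPartitions`, logging the partition index next to each key as it is consumed. With `partitionBy` followed by `sortByKey`, you would expect the keys inside each partition iterator to arrive in sorted order, and the partition index to change only between iterator invocations, never in the middle of one.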