BTW you can also use rdd.partitions() to get a list of Partition objects and see how many there are.
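Both suggestions (partitions() for the count, glom() below for the contents) can be mimicked with no Spark at all. A plain-Java sketch of the List-per-partition shape that rdd.glom().collect() returns in local mode (the helper name and sample partitioner are mine, not Spark API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;

// Sketch (no Spark required): given a partition function, group
// elements into one list per partition, mirroring the List-per-partition
// result that rdd.glom().collect() produces.
public class GlomSketch {
    static <T> List<List<T>> glom(List<T> elements, int numPartitions,
                                  ToIntFunction<T> partitionOf) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++)
            partitions.add(new ArrayList<>());
        for (T e : elements)
            partitions.get(partitionOf.applyAsInt(e)).add(e);
        return partitions;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Apple", "Banana", "Avocado", "Cherry");
        // Hypothetical partitioner: 3 partitions keyed on the first letter.
        List<List<String>> parts = glom(words, 3, w -> (w.charAt(0) - 'A') % 3);
        System.out.println(parts); // prints [[Apple, Avocado], [Banana], [Cherry]]
    }
}
```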
On September 4, 2014 at 5:18:30 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote:

Partitioners also work in local mode; the only question is how to see which data fell into each partition, since most RDD operations hide the fact that the RDD is partitioned. You can do rdd.glom().collect() -- the glom() operation turns an RDD of elements of type T into an RDD of List<T>, with a separate list for each partition. You should get back 26 lists, with the right elements in each one.

Matei

On September 4, 2014 at 4:58:19 PM, Steve Lewis (lordjoe2...@gmail.com) wrote:

Assume I define a partitioner like

    /**
     * Partition on the first letter of the key.
     */
    public class PartitionByStart extends Partitioner {
        @Override
        public int numPartitions() {
            return 26;
        }

        @Override
        public int getPartition(final Object key) {
            String s = (String) key;
            if (s.length() == 0)
                throw new IllegalStateException("problem"); // ToDo change
            int ret = s.charAt(0) - 'A';
            ret = Math.min(25, ret);
            ret = Math.max(0, ret);
            return 25 - ret;
        }
    }

How, short of running on a large cluster, can I test that code which might look like (unrolling all the chained methods)

    ones = ones.partitionBy(new PartitionByStart());
    JavaPairRDD<String, Integer> sorted = ones.sortByKey();
    JavaRDD<WordNumber> answer = sorted.mapPartitions(new WordCountFlatMapFunction());

partitions properly? In other words, on a local instance, how would partitioning work, and what do I expect to see when the code switches from one partition to another as it runs?

On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

In 1.1, you'll be able to get all of these properties using sortByKey, and then mapPartitions on top to iterate through the key-value pairs. Unfortunately sortByKey does not let you control the Partitioner, but it's fairly easy to write your own version that does if this is important.
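One low-tech way to test a partitioner like the one above, short of a cluster, is to pull the arithmetic out of getPartition into a plain static method and assert on it directly. A sketch (the class and test values are mine, and no Spark is needed since only the arithmetic is exercised):

```java
// The arithmetic from PartitionByStart.getPartition, extracted into a
// plain static method so it can be unit-tested without Spark.
public class PartitionByStartLogic {
    static final int NUM_PARTITIONS = 26;

    static int getPartition(String s) {
        if (s.length() == 0)
            throw new IllegalStateException("empty key");
        int ret = s.charAt(0) - 'A';
        ret = Math.min(25, ret);
        ret = Math.max(0, ret);
        return 25 - ret; // note the reversal: 'A' -> 25, 'Z' -> 0
    }

    public static void main(String[] args) {
        // Every index must land in [0, NUM_PARTITIONS), and the
        // ordering is reversed: 'A' goes to the last partition.
        System.out.println(getPartition("Apple")); // prints 25
        System.out.println(getPartition("Zebra")); // prints 0
        System.out.println(getPartition("Mango")); // 'M' - 'A' = 12, so prints 13
    }
}
```

A check like this catches off-by-one and ordering mistakes before any RDD is involved; rdd.glom().collect() then confirms the end-to-end placement in local mode.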
In previous versions, the values for each key had to fit in memory (though we could hold data on disk across keys), and this is still true for groupByKey, cogroup and join. Those restrictions will hopefully go away in a later release. But sortByKey + mapPartitions lets you just iterate through the key-value pairs without worrying about this.

Matei

On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:

When programming in Hadoop it is possible to guarantee that:

1) All keys sent to a specific partition will be handled by the same machine (thread).
2) All keys received by a specific machine (thread) will be received in sorted order.
3) These conditions will hold even if the values associated with a specific key are too large to fit in memory.

In my Hadoop code I use all of these conditions -- specifically, with my larger data sets the size of the data I wish to group exceeds the available memory. I think I understand the operation of groupBy, but my understanding is that it requires that the results for a single key, and perhaps all keys, fit on a single machine. Is there a way to do what Hadoop does and not require that an entire group fit in memory?

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
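The sortByKey + mapPartitions pattern Matei describes works because, once sorted, all pairs for a key sit adjacent within a partition, so a group can be consumed by watching for key boundaries instead of buffering it. A minimal pure-Java sketch of that boundary detection (the names are illustrative, not Spark API; summing stands in for whatever per-group work is needed):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch: consume a key-sorted iterator of pairs one at a time, summing
// each key's values as its run of adjacent entries goes by. Nothing is
// buffered per key, so a group larger than memory is fine as long as
// each individual pair fits.
public class StreamingGroup {
    static List<Map.Entry<String, Integer>> sumByKey(
            Iterator<Map.Entry<String, Integer>> sorted) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String currentKey = null;
        int runningTotal = 0;
        while (sorted.hasNext()) {
            Map.Entry<String, Integer> e = sorted.next();
            if (currentKey != null && !currentKey.equals(e.getKey())) {
                out.add(new SimpleEntry<>(currentKey, runningTotal)); // key boundary: emit
                runningTotal = 0;
            }
            currentKey = e.getKey();
            runningTotal += e.getValue();
        }
        if (currentKey != null)
            out.add(new SimpleEntry<>(currentKey, runningTotal)); // flush final group
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                new SimpleEntry<>("a", 1), new SimpleEntry<>("a", 2),
                new SimpleEntry<>("b", 5));
        System.out.println(sumByKey(pairs.iterator())); // prints [a=3, b=5]
    }
}
```

Inside mapPartitions, the same loop would run over the iterator Spark hands to each partition, which is why the per-key values never need to fit in memory at once.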