Assume I define a partitioner like:

    /**
     * Partition on the first letter.
     */
    public class PartitionByStart extends Partitioner {

        @Override
        public int numPartitions() {
            return 26;
        }
        @Override
        public int getPartition(final Object key) {
            String s = (String) key;
            if (s.length() == 0)
                throw new IllegalStateException("problem"); // ToDo change
            int ret = s.charAt(0) - 'A';
            ret = Math.min(25, ret);
            ret = Math.max(0, ret);
            return 25 - ret;
        }
    }

How, short of running on a large cluster, can I test that code which might look like (unrolling all the chained methods):

    ones = ones.partitionBy(new PartitionByStart());
    JavaPairRDD<String, Integer> sorted = ones.sortByKey();
    JavaRDD<WordNumber> answer = sorted.mapPartitions(new WordCountFlatMapFunction());

partitions properly? In other words, on a local instance, how would partitioning work, and what do I expect to see in switching from one partition to another as the code runs?

On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> In 1.1, you'll be able to get all of these properties using sortByKey, and
> then mapPartitions on top to iterate through the key-value pairs.
> Unfortunately sortByKey does not let you control the Partitioner, but it's
> fairly easy to write your own version that does if this is important.
>
> In previous versions, the values for each key had to fit in memory (though
> we could have data on disk across keys), and this is still true for
> groupByKey, cogroup and join. Those restrictions will hopefully go away in
> a later release. But sortByKey + mapPartitions lets you just iterate
> through the key-value pairs without worrying about this.
>
> Matei
>
> On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com) wrote:
>
> When programming in Hadoop it is possible to guarantee:
> 1) All keys sent to a specific partition will be handled by the same
> machine (thread).
> 2) All keys received by a specific machine (thread) will be received in
> sorted order.
> 3) These conditions will hold even if the values associated with a
> specific key are too large to fit in memory.
>
> In my Hadoop code I use all of these conditions; specifically, with my
> larger data sets the size of the data I wish to group exceeds the
> available memory.
>
> I think I understand the operation of groupBy, but my understanding is
> that it requires that the results for a single key, and perhaps all keys,
> fit on a single machine.
>
> Is there a way to perform grouping the way Hadoop does and not require
> that an entire group fit in memory?
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
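Short of a cluster run, one cheap sanity check is to pull the partition arithmetic out of `PartitionByStart` into a plain JVM test that needs no SparkContext. This is only a sketch: the class name `PartitionByStartCheck` and the sample keys are invented here, and it exercises the math rather than Spark's shuffle behavior.

```java
// A minimal, Spark-free check of the arithmetic in PartitionByStart.getPartition,
// copied out as a static method so it can be exercised in a plain unit test.
public class PartitionByStartCheck {

    // Mirrors the body of PartitionByStart.getPartition for String keys.
    static int getPartition(String s) {
        if (s.length() == 0)
            throw new IllegalStateException("problem");
        int ret = s.charAt(0) - 'A';
        ret = Math.min(25, ret);
        ret = Math.max(0, ret);
        return 25 - ret;
    }

    public static void main(String[] args) {
        // The scheme reverses the alphabet: 'A' -> partition 25, 'Z' -> partition 0.
        System.out.println(getPartition("Apple")); // 25
        System.out.println(getPartition("Zebra")); // 0
        // A lowercase first letter ('a' - 'A' = 32) clamps to 25, so every
        // lowercase key lands in partition 0, worth knowing before a cluster run.
        System.out.println(getPartition("apple")); // 0
    }
}
```

To observe the actual partition boundaries on a local instance, one option (assuming a recent Spark build) is to run with `local[n]` and use `mapPartitionsWithIndex` in place of `mapPartitions`, logging the partition index next to each key as it is consumed. With `partitionBy` followed by `sortByKey`, you would expect the keys inside each partition iterator to arrive in sorted order, and the partition index to change only between iterator invocations, never in the middle of one.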