Re: Mapping Hadoop Reduce to Spark

2014-09-05 Thread Lukas Nalezenec
Hi, FYI: There is a bug in the Java mapPartitions API - SPARK-3369 https://issues.apache.org/jira/browse/SPARK-3369. In Java, results from mapPartitions and similar functions must fit in memory. Look at the example below - it returns a List. Lukas On 1.9.2014 00:50, Matei Zaharia wrote: mapPartitions just
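
A minimal sketch of the pattern SPARK-3369 describes, assuming the Spark 1.x Java API where the FlatMapFunction passed to mapPartitions returns an Iterable rather than an Iterator, so each partition's output is typically materialized as a List first (the "lines" variable is illustrative):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.FlatMapFunction;

    // 'lines' is assumed to be a JavaRDD<String>
    JavaRDD<Integer> lengths = lines.mapPartitions(
        new FlatMapFunction<Iterator<String>, Integer>() {
          public Iterable<Integer> call(Iterator<String> it) {
            List<Integer> out = new ArrayList<Integer>();   // whole partition's output is buffered here
            while (it.hasNext()) {
              out.add(it.next().length());
            }
            return out;                                     // the returned List must fit in memory
          }
        });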

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Steve Lewis
Assume I define a partitioner like

    /** partition on the first letter */
    public class PartitionByStart extends Partitioner {
        @Override
        public int numPartitions() {
            return 26;
        }
        @Override
        public int getPartition(final Object key) {
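
A possible completion and usage sketch, assuming the intent is to map a string key's first letter to one of the 26 partitions (the "pairs" variable is illustrative):

    @Override
    public int getPartition(final Object key) {
        // assumed completion: keys are non-empty strings starting with a letter
        return Character.toLowerCase(key.toString().charAt(0)) - 'a';
    }

    // applying it to a pair RDD, e.g. a JavaPairRDD<String, Integer>
    JavaPairRDD<String, Integer> byFirstLetter = pairs.partitionBy(new PartitionByStart());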

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Matei Zaharia
Partitioners also work in local mode; the only question is how to see which data fell into each partition, since most RDD operations hide the fact that it's partitioned. You can do rdd.glom().collect() -- the glom() operation turns an RDD of elements of type T into an RDD of List<T>, with a
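
A small sketch of that inspection trick with the Java API (the "rdd" variable is illustrative, here a JavaRDD<Integer>):

    // imports: java.util.List, org.apache.spark.api.java.JavaRDD
    List<List<Integer>> byPartition = rdd.glom().collect();  // one List per partition
    for (int i = 0; i < byPartition.size(); i++) {
      System.out.println("partition " + i + ": " + byPartition.get(i));
    }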

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Matei Zaharia
BTW you can also use rdd.partitions() to get a list of Partition objects and see how many there are. On September 4, 2014 at 5:18:30 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: Partitioners also work in local mode, the only question is how to see which data fell into each partition,
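
For instance, on a Java RDD (the "rdd" variable is illustrative; very old Java API releases expose this as splits() rather than partitions()):

    // prints how many partitions the RDD currently has
    System.out.println("number of partitions: " + rdd.partitions().size());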

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Steve Lewis
Is there a sample of how to do this - I see 1.1 is out but cannot find samples of mapPartitions. A Java sample would be very useful. On Sat, Aug 30, 2014 at 10:30 AM, Matei Zaharia matei.zaha...@gmail.com wrote: In 1.1, you'll be able to get all of these properties using sortByKey, and then

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Koert Kuipers
matei, it is good to hear that the restriction that keys need to fit in memory no longer applies to combineByKey. however, join requiring keys to fit in memory is still a big deal to me. does it apply to both sides of the join, or only one (while the other side is streaming)? On Sat, Aug 30,

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Matei Zaharia
Just to be clear, no operation requires all the keys to fit in memory, only the values for each specific key. All the values for each individual key need to fit, but the system can spill to disk across keys. Right now it's for both sides of it, unless you do a broadcast join by hand with
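
A rough sketch of the "broadcast join by hand" idea, assuming the small side fits in a Map that is broadcast to all executors (all identifiers here are illustrative):

    // imports: java.util.*, org.apache.spark.api.java.*, org.apache.spark.broadcast.Broadcast,
    //          org.apache.spark.api.java.function.PairFlatMapFunction, scala.Tuple2
    // 'smallMap' is a Map<String, Integer> small enough to broadcast; 'large' is a JavaPairRDD<String, Integer>
    final Broadcast<Map<String, Integer>> smallSide = sc.broadcast(smallMap);
    JavaPairRDD<String, Tuple2<Integer, Integer>> joined = large.flatMapToPair(
        new PairFlatMapFunction<Tuple2<String, Integer>, String, Tuple2<Integer, Integer>>() {
          public Iterable<Tuple2<String, Tuple2<Integer, Integer>>> call(Tuple2<String, Integer> kv) {
            List<Tuple2<String, Tuple2<Integer, Integer>>> out =
                new ArrayList<Tuple2<String, Tuple2<Integer, Integer>>>();
            Integer other = smallSide.value().get(kv._1());
            if (other != null) {   // keep only keys present on the small side (inner join semantics)
              out.add(new Tuple2<String, Tuple2<Integer, Integer>>(
                  kv._1(), new Tuple2<Integer, Integer>(kv._2(), other)));
            }
            return out;
          }
        });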

Re: Mapping Hadoop Reduce to Spark

2014-08-31 Thread Matei Zaharia
mapPartitions just gives you an Iterator of the values in each partition, and lets you return an Iterator of outputs. For instance, take a look at  https://github.com/apache/spark/blob/master/core/src/test/java/org/apache/spark/JavaAPISuite.java#L694. Matei On August 31, 2014 at 12:26:51 PM,
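
A short Java sketch of that pattern (illustrative data; note that the 1.x Java API has you return an Iterable from the function):

    // imports: java.util.*, org.apache.spark.api.java.*, org.apache.spark.api.java.function.FlatMapFunction
    JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);
    JavaRDD<Integer> partitionSums = nums.mapPartitions(
        new FlatMapFunction<Iterator<Integer>, Integer>() {
          public Iterable<Integer> call(Iterator<Integer> it) {
            int sum = 0;
            while (it.hasNext()) {
              sum += it.next();
            }
            return Collections.singletonList(sum);   // one output value per partition
          }
        });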

Mapping Hadoop Reduce to Spark

2014-08-30 Thread Steve Lewis
When programming in Hadoop it is possible to guarantee: 1) All keys sent to a specific partition will be handled by the same machine (thread); 2) All keys received by a specific machine (thread) will be received in sorted order; 3) These conditions will hold even if the values associated with a

Re: Mapping Hadoop Reduce to Spark

2014-08-30 Thread Matei Zaharia
In 1.1, you'll be able to get all of these properties using sortByKey, and then mapPartitions on top to iterate through the key-value pairs. Unfortunately sortByKey does not let you control the Partitioner, but it's fairly easy to write your own version that does if this is important. In
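
Putting the two together, a rough sketch of the sortByKey-then-mapPartitions pattern described above (identifiers are illustrative; "pairs" is a JavaPairRDD<String, Integer>):

    // imports: java.util.*, org.apache.spark.api.java.*, org.apache.spark.api.java.function.FlatMapFunction, scala.Tuple2
    JavaPairRDD<String, Integer> sorted = pairs.sortByKey();   // range-partitions the data and sorts within partitions
    JavaRDD<String> rendered = sorted.mapPartitions(
        new FlatMapFunction<Iterator<Tuple2<String, Integer>>, String>() {
          public Iterable<String> call(Iterator<Tuple2<String, Integer>> it) {
            List<String> out = new ArrayList<String>();
            while (it.hasNext()) {
              Tuple2<String, Integer> kv = it.next();          // keys arrive in sorted order within this partition
              out.add(kv._1() + "=" + kv._2());
            }
            return out;
          }
        });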