Spark hook to create external process

2014-05-29 Thread ansriniv
I have a requirement where, for every Spark executor threadpool thread, I need to launch an associated external process. My job will consist of some processing in the Spark executor thread and some processing by its associated external process, with the two communicating via some IPC mechanism. Is
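For the simple stdin/stdout case, Spark's built-in `rdd.pipe()` already streams each partition through an external command. For richer, long-lived IPC, a common pattern is to launch one process per task inside `mapPartitions` with `ProcessBuilder` and talk to it over its stdin/stdout. The sketch below shows only that process-side plumbing, outside Spark, using `/bin/cat` as a stand-in for the real external binary (an assumption for illustration):

```java
import java.io.*;

public class ExternalProcessDemo {
    public static void main(String[] args) throws Exception {
        // Inside mapPartitions, each task could start one external process
        // and reuse it for the lifetime of the partition. /bin/cat here is
        // a stand-in that simply echoes what it receives.
        Process proc = new ProcessBuilder("/bin/cat").start();

        BufferedWriter toChild = new BufferedWriter(
                new OutputStreamWriter(proc.getOutputStream()));
        BufferedReader fromChild = new BufferedReader(
                new InputStreamReader(proc.getInputStream()));

        // Line-oriented IPC: send one record, read one result back.
        toChild.write("record-1");
        toChild.newLine();
        toChild.flush();
        String echoed = fromChild.readLine();
        System.out.println("child echoed: " + echoed);

        toChild.close();   // closing stdin lets the child see EOF and exit
        proc.waitFor();
    }
}
```

In a real job the process handle would be created once at the top of the `mapPartitions` closure and closed after the partition's iterator is exhausted, so the per-process startup cost is paid once per task rather than once per record.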

getPreferredLocations

2014-05-29 Thread ansriniv
I am building my own custom RDD class. 1) Is there a guarantee that a partition will only be processed on a node which is in the getPreferredLocations set of nodes returned by the RDD? 2) I am implementing this custom RDD in Java and plan to extend JavaRDD. However, I don't see a
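On question 1: `getPreferredLocations` is a locality hint, not a guarantee. The scheduler waits a bounded time (the `spark.locality.wait` settings) for a slot on a preferred node, then falls back to running the task elsewhere. A minimal sketch of that preference-with-fallback behavior, with hypothetical host names and a simplified `chooseHost` helper (not Spark's actual scheduler code):

```java
import java.util.*;

public class LocalityDemo {
    // Simplified model: prefer any host from the RDD's preferred set,
    // but if none is free, run on whatever host is available.
    static String chooseHost(List<String> preferred, Set<String> freeHosts) {
        for (String h : preferred) {
            if (freeHosts.contains(h)) return h;
        }
        return freeHosts.iterator().next(); // fall back to any free host
    }

    public static void main(String[] args) {
        // Preferred nodes are busy; only node3 has a free slot.
        Set<String> free = new TreeSet<>(Collections.singletonList("node3"));
        String picked = chooseHost(Arrays.asList("node1", "node2"), free);
        System.out.println("picked: " + picked); // prints "picked: node3"
    }
}
```

On question 2: custom RDDs are defined by subclassing the Scala `RDD` class (overriding `compute` and `getPartitions`, optionally `getPreferredLocations`); `JavaRDD` is a thin wrapper around a Scala `RDD` rather than an extension point, which is likely why the expected hooks are missing from it.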

parallel Reduce within a key

2014-06-20 Thread ansriniv
Hi, I am on Spark 0.9.0. I have a 2-node cluster (2 worker nodes) with 16 cores on each node (so, 32 cores in the cluster). I have an input RDD with 64 partitions. I am running sc.mapPartitions(...).reduce(...). I can see that I get full parallelism on the mapper (all my 32 cores are busy
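This is the expected shape of `reduce`: each of the 64 partitions is reduced in parallel on the executors, but the 64 partial results are then merged sequentially on the driver, so the final phase does not use the cluster's cores. The sketch below mirrors those two phases with plain Java streams (not Spark itself), using 64 lists of the numbers 1..10 as stand-in partitions:

```java
import java.util.*;
import java.util.stream.*;

public class TwoPhaseReduceDemo {
    public static void main(String[] args) {
        // 64 "partitions", each holding the numbers 1..10 (sum 55 each).
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < 64; p++) {
            partitions.add(IntStream.rangeClosed(1, 10)
                    .boxed().collect(Collectors.toList()));
        }

        // Phase 1: per-partition reduction runs in parallel, like the
        // reduce tasks on the executors.
        List<Integer> partial = partitions.parallelStream()
                .map(part -> part.stream().reduce(0, Integer::sum))
                .collect(Collectors.toList());

        // Phase 2: the driver folds the 64 partial results sequentially;
        // this single-threaded merge is the part that does not scale out.
        int total = partial.stream().reduce(0, Integer::sum);
        System.out.println("total: " + total); // prints "total: 3520"
    }
}
```

If the driver-side merge becomes the bottleneck, one option (in later Spark releases than 0.9) is `treeReduce`, which performs intermediate merge rounds on the executors before the final result reaches the driver.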