I have a requirement where, for every Spark executor thread-pool thread, I need
to launch an associated external process.
My job will consist of some processing in the Spark executor thread and some
processing by its associated external process, with the two communicating via
some IPC mechanism.
Is this feasible, and is there a recommended way to set it up?
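
For illustration only, here is a minimal sketch of one way this could look: since each partition is handled by a single executor thread for the duration of a task, launching the external process inside mapPartitions gives one process per thread. The helper binary path (/usr/local/bin/worker), the class name, and the one-reply-line-per-input-line protocol are assumptions, and the FlatMapFunction signature shown matches the older (0.9/1.x-era) Java API.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

public class ExternalProcessPipe {

  // Pipe every record of each partition through one external helper process.
  // "/usr/local/bin/worker" and the line-per-record protocol are placeholders.
  public static JavaRDD<String> pipeThroughWorker(JavaRDD<String> input) {
    return input.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
      @Override
      public Iterable<String> call(Iterator<String> records) throws Exception {
        // One process per task, i.e. per executor thread working on this partition.
        Process proc = new ProcessBuilder("/usr/local/bin/worker").start();
        BufferedWriter toProc = new BufferedWriter(
            new OutputStreamWriter(proc.getOutputStream()));
        BufferedReader fromProc = new BufferedReader(
            new InputStreamReader(proc.getInputStream()));

        List<String> results = new ArrayList<String>();
        while (records.hasNext()) {
          // Send one record, read one reply; any other IPC mechanism
          // (sockets, shared memory, ...) could be substituted here.
          toProc.write(records.next());
          toProc.newLine();
          toProc.flush();
          results.add(fromProc.readLine());
        }
        toProc.close();
        proc.waitFor();  // error handling and stderr draining omitted in this sketch
        return results;
      }
    });
  }
}

If a simple line-oriented stdin/stdout protocol is enough, Spark's built-in RDD.pipe() already follows a similar pattern.
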
I am building my own custom RDD class.
1) Is there a guarantee that a partition will only be processed on a node
that is in the set of nodes returned by the RDD's getPreferredLocations?
2) I am implementing this custom RDD in Java and plan to extend JavaRDD.
However, I don't see a way to plug my own compute() / getPartitions() logic
into JavaRDD.
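
For illustration, a minimal sketch of a custom RDD written from Java directly against the Scala RDD base class, since JavaRDD is only a thin wrapper around an underlying RDD. The class name, the host array, and the one-partition-per-host layout are assumptions made for the sketch; also note that, as far as I understand, getPreferredLocations is treated by the scheduler as a locality preference rather than a hard placement guarantee.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;

import org.apache.spark.Dependency;
import org.apache.spark.Partition;
import org.apache.spark.SparkContext;
import org.apache.spark.TaskContext;
import org.apache.spark.rdd.RDD;

import scala.collection.Iterator;
import scala.collection.JavaConversions;
import scala.collection.Seq;
import scala.reflect.ClassTag$;

public class HostPinnedRDD extends RDD<String> {

  // One partition per host; the host list stands in for whatever
  // placement information the real job has.
  private final String[] hosts;

  public HostPinnedRDD(SparkContext sc, String[] hosts) {
    super(sc,
          JavaConversions.asScalaBuffer(new ArrayList<Dependency<?>>()),  // no parent dependencies
          ClassTag$.MODULE$.<String>apply(String.class));
    this.hosts = hosts;
  }

  private static class SimplePartition implements Partition {
    private final int idx;
    SimplePartition(int idx) { this.idx = idx; }
    @Override public int index() { return idx; }
  }

  @Override
  public Partition[] getPartitions() {
    Partition[] parts = new Partition[hosts.length];
    for (int i = 0; i < hosts.length; i++) {
      parts[i] = new SimplePartition(i);
    }
    return parts;
  }

  @Override
  public Iterator<String> compute(Partition split, TaskContext context) {
    // Placeholder data for the partition; the real per-partition work goes here.
    return JavaConversions.asScalaIterator(
        Arrays.asList("partition-" + split.index()).iterator());
  }

  // A locality preference for the scheduler, not a hard guarantee of placement.
  @Override
  public Seq<String> getPreferredLocations(Partition split) {
    return JavaConversions.asScalaBuffer(
        Collections.singletonList(hosts[split.index()]));
  }
}

The resulting RDD can then be handed back to the Java API by wrapping it, e.g. with new JavaRDD<String>(rdd, classTag).
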
Hi,
I am on Spark 0.9.0
I have a 2 node cluster (2 worker nodes) with 16 cores on each node (so, 32
cores in the cluster).
I have an input RDD with 64 partitions.
I am running mapPartitions(...).reduce(...) on that RDD.
I can see that I get full parallelism on the map side (all 32 of my cores are
busy).
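
For reference, a minimal sketch of the mapPartitions(...).reduce(...) shape on a JavaRDD (note that mapPartitions is called on the RDD, not on the SparkContext). The count-style job is purely an assumption for illustration. With 64 partitions and 32 cores, the map tasks should run 32 at a time, in two waves.

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;

public class MapPartitionsReduceExample {

  // Count records per partition (map tasks run in parallel across the cores),
  // then sum the per-partition counts with reduce on the driver.
  public static long countViaMapPartitions(JavaRDD<String> input) {
    JavaRDD<Long> perPartitionCounts =
        input.mapPartitions(new FlatMapFunction<Iterator<String>, Long>() {
          @Override
          public Iterable<Long> call(Iterator<String> records) {
            long n = 0;
            while (records.hasNext()) { records.next(); n++; }
            return Arrays.asList(n);  // one count per partition
          }
        });

    return perPartitionCounts.reduce(new Function2<Long, Long, Long>() {
      @Override
      public Long call(Long a, Long b) {
        return a + b;
      }
    });
  }
}
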