Hi,
I am on Spark 0.9.0
I have a 2 node cluster (2 worker nodes) with 16 cores on each node (so, 32
cores in the cluster).
I have an input rdd with 64 partitions.
I am running "rdd.mapPartitions(...).reduce(...)" on it.
I can see that I get full parallelism in the map phase (all my 32 cores are
busy simultaneously).
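(Not from the original thread.) As a rough pure-JDK analogue, no Spark involved, of the shape of mapPartitions(...).reduce(...): one task per partition produces a partial result, and a final combine folds the partials together. With 64 tasks on a 32-thread pool, the tasks run in two waves, which is why all cores stay busy. The partition count, record count, and class name below are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MapPartitionsSketch {
    // Map each of numPartitions partitions to a partial sum on a fixed
    // pool, then "reduce" by summing the partials -- the same shape as
    // rdd.mapPartitions(...).reduce(...).
    static long computeTotal(int numPartitions, int cores) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Future<Long>> partials = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            final int partition = p;
            partials.add(pool.submit(() -> {
                long sum = 0;
                // 100 synthetic records per partition
                for (int i = 0; i < 100; i++) sum += partition * 100L + i;
                return sum;
            }));
        }
        long total = 0;                 // the "reduce" step
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(computeTotal(64, 32)); // prints 20476800
    }
}
```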
I am building my own custom RDD class.
1) Is there a guarantee that a partition will only be processed on a node
which is in the "getPreferredLocations" set of nodes returned by the RDD?
2) I am implementing this custom RDD in Java and plan to extend JavaRDD.
However, I don't see a "getPreferredLocations" method on JavaRDD that I can
override.
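(Editorial note, not from the original messages.) In Spark's Scala API the hook is getPreferredLocations(split: Partition) on the RDD class itself; JavaRDD is a thin wrapper and does not expose it for overriding, so custom RDDs normally subclass the Scala RDD class, even from Java. Also relevant to question 1: preferred locations are a scheduling preference, not a guarantee; with delay scheduling the task can run elsewhere after the locality wait expires. A pure-JDK mock of the shape of the pattern, with stand-in types and made-up hostnames (none of these are the real Spark classes):

```java
import java.util.Arrays;
import java.util.List;

// Minimal stand-ins for Spark's Partition/RDD types, just to show the
// override pattern; they are NOT the real Spark classes.
interface Partition { int index(); }

abstract class SketchRDD {
    // The scheduler would call this per partition for candidate hostnames.
    List<String> getPreferredLocations(Partition split) {
        return Arrays.asList(); // default: no locality preference
    }
}

class MyCustomRDD extends SketchRDD {
    private final List<String> hosts; // e.g. the two worker hostnames

    MyCustomRDD(List<String> hosts) { this.hosts = hosts; }

    @Override
    List<String> getPreferredLocations(Partition split) {
        // Pin each partition to one host round-robin (illustrative only;
        // the scheduler treats this as a preference, not a guarantee).
        return Arrays.asList(hosts.get(split.index() % hosts.size()));
    }
}
```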
Hi Matei,
Thanks for the reply.
I would like to avoid having to spawn these external processes every time
during the processing of a task, to reduce task latency. I'd like these to
be pre-spawned as much as possible - tying them to the lifecycle of the
corresponding threadpool thread would simplify management.
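(Not Spark API, just an editorial sketch.) The "one pre-spawned external process per pool thread" idea can be expressed with a ThreadLocal, so each executor thread lazily starts and then reuses its own child process. Here "cat" stands in for the real helper binary (an assumption), echoing lines back over stdin/stdout as a trivial IPC mechanism:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class PerThreadProcess {
    /** One child process plus its pipes, owned by a single pool thread. */
    static final class Peer {
        final Process proc;
        final BufferedWriter toChild;
        final BufferedReader fromChild;
        Peer() {
            try {
                // "cat" echoes stdin to stdout; placeholder for the real helper
                proc = new ProcessBuilder("cat").start();
                toChild = new BufferedWriter(new OutputStreamWriter(
                        proc.getOutputStream(), StandardCharsets.UTF_8));
                fromChild = new BufferedReader(new InputStreamReader(
                        proc.getInputStream(), StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Lazily spawn one Peer per thread; reused across tasks on that thread.
    private static final ThreadLocal<Peer> PEER =
            ThreadLocal.withInitial(Peer::new);

    /** Send one line to this thread's process and read the echoed reply. */
    public static String roundTrip(String line) throws IOException {
        Peer peer = PEER.get();
        peer.toChild.write(line);
        peer.toChild.newLine();
        peer.toChild.flush();
        return peer.fromChild.readLine();
    }
}
```

One caveat with this sketch: nothing destroys the children when a pool thread dies, so a real version would also need cleanup (e.g. Process.destroy() from a shutdown hook or when the executor is torn down).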
I have a requirement where for every Spark executor threadpool thread, I need
to launch an associated external process.
My job will consist of some processing in the Spark executor thread and some
processing by its associated external process, with the two communicating via
some IPC mechanism.
Is this possible?