RE: # because I already have a bunch of InputSplits, do I still need to specify the number of executors to get processing parallelized?
I would say it’s best practice to have as many executors as data nodes and as many cores as you can get from the cluster – if YARN has enough resources it will deploy the executors distributed across the cluster, then each of them will try to process the data locally (check the spark ui for NODE_LOCAL), with as many splits in parallel as you defined in spark.executor.cores -adrian From: Sandy Ryza Date: Thursday, September 24, 2015 at 2:43 AM To: Anfernee Xu Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers Hi Anfernee, That's correct that each InputSplit will map to exactly a Spark partition. On YARN, each Spark executor maps to a single YARN container. Each executor can run multiple tasks over its lifetime, both parallel and sequentially. If you enable dynamic allocation, after the stage including the InputSplits gets submitted, Spark will try to request an appropriate number of executors. The memory in the YARN resource requests is --executor-memory + what's set for spark.yarn.executor.memoryOverhead, which defaults to 10% of --executor-memory. -Sandy On Wed, Sep 23, 2015 at 2:38 PM, Anfernee Xu <anfernee...@gmail.com<mailto:anfernee...@gmail.com>> wrote: Hi Spark experts, I'm coming across these terminologies and having some confusions, could you please help me understand them better? For instance I have implemented a Hadoop InputFormat to load my external data in Spark, in turn my custom InputFormat will create a bunch of InputSplit's, my questions is about # Each InputSplit will exactly map to a Spark partition, is that correct? # If I run on Yarn, how does Spark executor/task map to Yarn container? # because I already have a bunch of InputSplits, do I still need to specify the number of executors to get processing parallelized? # How does -executor-memory map to the memory requirement in Yarn's resource request? -- --Anfernee