Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

Adrian Tanase Thu, 24 Sep 2015 02:27:53 -0700

RE: # because I already have a bunch of InputSplits, do I still need to specify 
the number of executors to get processing parallelized?


I would say it’s best practice to have as many executors as data nodes and as 
many cores as you can get from the cluster – if YARN has enough  resources it 
will deploy the executors distributed across the cluster, then each of them 
will try to process the data locally (check the spark ui for NODE_LOCAL), with 
as many splits in parallel as you defined in spark.executor.cores

-adrian

From: Sandy Ryza
Date: Thursday, September 24, 2015 at 2:43 AM
To: Anfernee Xu
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task 
and Yarn containers

Hi Anfernee,

That's correct that each InputSplit will map to exactly a Spark partition.

On YARN, each Spark executor maps to a single YARN container.  Each executor 
can run multiple tasks over its lifetime, both parallel and sequentially.

If you enable dynamic allocation, after the stage including the InputSplits 
gets submitted, Spark will try to request an appropriate number of executors.

The memory in the YARN resource requests is --executor-memory + what's set for 
spark.yarn.executor.memoryOverhead, which defaults to 10% of --executor-memory.

-Sandy

On Wed, Sep 23, 2015 at 2:38 PM, Anfernee Xu 
<anfernee...@gmail.com<mailto:anfernee...@gmail.com>> wrote:
Hi Spark experts,

I'm coming across these terminologies and having some confusions, could you 
please help me understand them better?

For instance I have implemented a Hadoop InputFormat to load my external data 
in Spark, in turn my custom InputFormat will create a bunch of InputSplit's, my 
questions is about

# Each InputSplit will exactly map to a Spark partition, is that correct?

# If I run on Yarn, how does Spark executor/task map to Yarn container?

# because I already have a bunch of InputSplits, do I still need to specify the 
number of executors to get processing parallelized?

# How does -executor-memory map to the memory requirement in Yarn's resource 
request?

--
--Anfernee

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

Reply via email to