Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-24 Thread Adrian Tanase
RE: # Because I already have a bunch of InputSplits, do I still need to specify
the number of executors to get processing parallelized?

I would say it's best practice to have as many executors as data nodes and as
many cores as you can get from the cluster. If YARN has enough resources, it
will deploy the executors spread across the cluster, and each of them will try
to process the data locally (check the Spark UI for NODE_LOCAL), with as many
splits in parallel as you defined in spark.executor.cores.
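
As a rough sketch, that sizing translates to something like the following
(the 10-node / 8-core figures are made up purely for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical cluster: 10 data nodes, 8 cores each.
    val conf = new SparkConf()
      .setAppName("CustomInputFormatJob")
      .set("spark.executor.instances", "10")  // one executor per data node
      .set("spark.executor.cores", "8")       // splits processed in parallel per executor
    val sc = new SparkContext(conf)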

-adrian

From: Sandy Ryza
Date: Thursday, September 24, 2015 at 2:43 AM
To: Anfernee Xu
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task 
and Yarn containers

Hi Anfernee,

That's correct: each InputSplit will map to exactly one Spark partition.

On YARN, each Spark executor maps to a single YARN container. Each executor
can run multiple tasks over its lifetime, both in parallel and sequentially.

If you enable dynamic allocation, then after the stage that includes the
InputSplits is submitted, Spark will try to request an appropriate number of
executors.

The memory in the YARN resource requests is --executor-memory + what's set for
spark.yarn.executor.memoryOverhead, which defaults to 10% of --executor-memory
(with a minimum of 384 MB).

-Sandy

On Wed, Sep 23, 2015 at 2:38 PM, Anfernee Xu <anfernee...@gmail.com> wrote:
Hi Spark experts,

I've come across these terms and have some confusion; could you please help me
understand them better?

For instance, I have implemented a Hadoop InputFormat to load my external data
into Spark; my custom InputFormat will in turn create a bunch of InputSplits.
My questions are:

# Each InputSplit will map to exactly one Spark partition, is that correct?

# If I run on Yarn, how does Spark executor/task map to Yarn container?

# Because I already have a bunch of InputSplits, do I still need to specify the
number of executors to get processing parallelized?

# How does --executor-memory map to the memory requirement in Yarn's resource
request?

--
--Anfernee



Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-24 Thread Sabarish Sasidharan
A little caution is needed, as one executor per node may not always be ideal,
especially when your nodes have lots of RAM. But yes, using a smaller number of
executors has benefits, like more efficient broadcasts.
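
For example (the figures below are invented to illustrate the trade-off), on
large-memory nodes you might prefer a few mid-sized executors per node over
one huge one, since very large JVM heaps tend to suffer from long GC pauses:

    import org.apache.spark.SparkConf

    // Hypothetical 128 GB / 16-core nodes: rather than one 96 GB executor
    // per node, run ~3 executors per node with smaller, GC-friendlier heaps.
    val conf = new SparkConf()
      .set("spark.executor.instances", "30")  // ~3 executors on each of 10 nodes
      .set("spark.executor.cores", "5")
      .set("spark.executor.memory", "30g")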

Regards
Sab
On 24-Sep-2015 2:57 pm, "Adrian Tanase" <atan...@adobe.com> wrote:

> RE: # Because I already have a bunch of InputSplits, do I still need to
> specify the number of executors to get processing parallelized?
>
> I would say it's best practice to have as many executors as data nodes and
> as many cores as you can get from the cluster. If YARN has enough resources,
> it will deploy the executors spread across the cluster, and each of them
> will try to process the data locally (check the Spark UI for NODE_LOCAL),
> with as many splits in parallel as you defined in spark.executor.cores.
>
> -adrian
>
> From: Sandy Ryza
> Date: Thursday, September 24, 2015 at 2:43 AM
> To: Anfernee Xu
> Cc: "user@spark.apache.org"
> Subject: Re: Custom Hadoop InputSplit, Spark partitions, spark
> executors/task and Yarn containers
>
> Hi Anfernee,
>
> That's correct: each InputSplit will map to exactly one Spark partition.
>
> On YARN, each Spark executor maps to a single YARN container. Each executor
> can run multiple tasks over its lifetime, both in parallel and sequentially.
>
> If you enable dynamic allocation, then after the stage that includes the
> InputSplits is submitted, Spark will try to request an appropriate number of
> executors.
>
> The memory in the YARN resource requests is --executor-memory + what's set
> for spark.yarn.executor.memoryOverhead, which defaults to 10% of
> --executor-memory (with a minimum of 384 MB).
>
> -Sandy
>
> On Wed, Sep 23, 2015 at 2:38 PM, Anfernee Xu <anfernee...@gmail.com>
> wrote:
>
>> Hi Spark experts,
>>
>> I've come across these terms and have some confusion; could you please help
>> me understand them better?
>>
>> For instance, I have implemented a Hadoop InputFormat to load my external
>> data into Spark; my custom InputFormat will in turn create a bunch of
>> InputSplits. My questions are:
>>
>> # Each InputSplit will map to exactly one Spark partition, is that correct?
>>
>> # If I run on Yarn, how does Spark executor/task map to Yarn container?
>>
>> # Because I already have a bunch of InputSplits, do I still need to specify
>> the number of executors to get processing parallelized?
>>
>> # How does --executor-memory map to the memory requirement in Yarn's
>> resource request?
>>
>> --
>> --Anfernee
>>
>
>


Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Anfernee Xu
Hi Spark experts,

I've come across these terms and have some confusion; could you please help me
understand them better?

For instance, I have implemented a Hadoop InputFormat to load my external data
into Spark; my custom InputFormat will in turn create a bunch of InputSplits.
My questions are:

# Each InputSplit will map to exactly one Spark partition, is that correct?

# If I run on Yarn, how does Spark executor/task map to Yarn container?

# Because I already have a bunch of InputSplits, do I still need to specify the
number of executors to get processing parallelized?

# How does --executor-memory map to the memory requirement in Yarn's resource
request?

-- 
--Anfernee


Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-23 Thread Sandy Ryza
Hi Anfernee,

That's correct: each InputSplit will map to exactly one Spark partition.
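
A quick way to check this yourself (a sketch only; CustomInputFormat and the
key/value types below stand in for your own classes) is to load the data
through the Hadoop API and compare the partition count with the number of
splits your InputFormat returns:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("SplitCheck"))

    // CustomInputFormat is a placeholder for your own InputFormat, assumed
    // here to extend the new-API InputFormat[LongWritable, Text].
    val rdd = sc.newAPIHadoopRDD(
      sc.hadoopConfiguration,
      classOf[CustomInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // One Spark partition per InputSplit from CustomInputFormat.getSplits().
    println(rdd.partitions.length)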

On YARN, each Spark executor maps to a single YARN container. Each executor
can run multiple tasks over its lifetime, both in parallel and sequentially.

If you enable dynamic allocation, then after the stage that includes the
InputSplits is submitted, Spark will try to request an appropriate number of
executors.
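
So you don't have to pick an executor count up front. A minimal sketch of the
relevant settings (the values are illustrative; the external shuffle service
must also be running on each NodeManager):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")    // required on YARN
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "50")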

The memory in the YARN resource requests is --executor-memory + what's set
for spark.yarn.executor.memoryOverhead, which defaults to 10% of
--executor-memory (with a minimum of 384 MB).
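
As a worked example (assuming the 10% default with its 384 MB floor):

    // Container size YARN is asked for when using --executor-memory 4g.
    val executorMemoryMb = 4096
    val overheadMb = math.max((executorMemoryMb * 0.10).toInt, 384)  // = 409
    val containerRequestMb = executorMemoryMb + overheadMb           // = 4505 MB
    // YARN then rounds the request up to a multiple of
    // yarn.scheduler.minimum-allocation-mb.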

-Sandy

On Wed, Sep 23, 2015 at 2:38 PM, Anfernee Xu wrote:

> Hi Spark experts,
>
> I've come across these terms and have some confusion; could you please help
> me understand them better?
>
> For instance, I have implemented a Hadoop InputFormat to load my external
> data into Spark; my custom InputFormat will in turn create a bunch of
> InputSplits. My questions are:
>
> # Each InputSplit will map to exactly one Spark partition, is that correct?
>
> # If I run on Yarn, how does Spark executor/task map to Yarn container?
>
> # Because I already have a bunch of InputSplits, do I still need to specify
> the number of executors to get processing parallelized?
>
> # How does --executor-memory map to the memory requirement in Yarn's
> resource request?
>
> --
> --Anfernee
>