Spark In Memory Shuffle

2018-10-17 Thread thomas lavocat

Hi everyone,


The possibility of in-memory shuffling was discussed back in 2015 in this 
pull request: https://github.com/apache/spark/pull/5403.


In 2016, the paper "Scaling Spark on HPC Systems" said that Spark still 
shuffles using disks. I would like to know:



What is the current state of in-memory shuffling?

Is it implemented in production?

Does the current shuffle still use disks to work?

Is it possible to do it entirely in RAM?
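For context, one common workaround (not a true in-memory shuffle; the paths and script name below are illustrative assumptions) is to point spark.local.dir at a RAM-backed tmpfs mount such as /dev/shm, so shuffle files are written to memory-backed storage:

```shell
# Sketch: keep shuffle/spill files on a RAM-backed filesystem.
# /dev/shm is a common tmpfs mount on Linux; make sure it is sized
# within your node's memory budget before trying this.
spark-submit \
  --conf spark.local.dir=/dev/shm/spark-local \
  --conf spark.shuffle.compress=true \
  your_job.py
```

This only changes where the shuffle files live; Spark's shuffle code path still goes through the filesystem API.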


Regards,

Thomas


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-05 Thread Thomas Lavocat
Excerpts from Prem Sure's message of 2018-07-04 19:39:29 +0530:
> Hoping the below helps clear some of this up.
> Executors don't have control to share data among themselves, except for
> sharing accumulators via the driver's support.
> Based on data locality (local or remote), tasks/stages are defined to
> run, which may result in a shuffle.

If I understand correctly:

* Only shuffle data goes through the driver
* The receivers' data stays node-local until a shuffle occurs

Is that right?
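To make the shuffle step concrete, here is a toy sketch in plain Python (not Spark code; all names are made up for illustration) of how map-side output is bucketed by key hash into reduce partitions, similar in spirit to Spark's default HashPartitioner. In real Spark, each reduce task fetches these partitions directly from the executors that wrote them; the driver only schedules the stages.

```python
# Toy sketch (no Spark): shuffle assigns map output to reduce
# partitions by hashing the key.

def partition(key, num_partitions):
    # Non-negative bucket index for a key (Python's % keeps it >= 0).
    return hash(key) % num_partitions

def shuffle_write(records, num_partitions):
    """One 'map task' buckets its output by destination reduce partition."""
    buckets = {p: [] for p in range(num_partitions)}
    for key, value in records:
        buckets[partition(key, num_partitions)].append((key, value))
    return buckets

# Two map tasks (think: two executors) each produce some records...
map_out_1 = shuffle_write([("a", 1), ("b", 1)], num_partitions=2)
map_out_2 = shuffle_write([("a", 1), ("c", 1)], num_partitions=2)

# ...and each reduce task gathers its partition from every map task.
reduce_0 = map_out_1[0] + map_out_2[0]
reduce_1 = map_out_1[1] + map_out_2[1]

# All records for a given key end up in exactly one reduce partition.
p = partition("a", 2)
winner = reduce_0 if p == 0 else reduce_1
assert winner.count(("a", 1)) == 2
```

The point of the sketch: the bucketing is deterministic per key, so reducers know exactly which writers to fetch from, without routing records through the driver.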

> On Wed, Jul 4, 2018 at 1:56 PM, thomas lavocat <
> thomas.lavo...@univ-grenoble-alpes.fr> wrote:
> 
> > Hello,
> >
> > I have a question on Spark dataflow. If I understand correctly, all
> > received data is sent from the executors to the driver of the application
> > prior to task creation.
> >
> > Then the tasks embedding the data transit from the driver to the executors
> > in order to be processed.
> >
> > As executors cannot exchange data themselves, in a shuffle, data also
> > transits through the driver.
> >
> > Is that correct?
> >
> > Thomas
> >
> >




[Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-04 Thread thomas lavocat

Hello,

I have a question on Spark dataflow. If I understand correctly, all 
received data is sent from the executors to the driver of the application 
prior to task creation.

Then the tasks embedding the data transit from the driver to the executors 
in order to be processed.

As executors cannot exchange data themselves, in a shuffle, data also 
transits through the driver.

Is that correct?

Thomas





Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-11 Thread thomas lavocat

Thank you very much for your answer.

Since I don't have dependent jobs, I will continue to use this functionality.


On 05/06/2018 13:52, Saisai Shao wrote:
"dependent" I mean this batch's job relies on the previous batch's 
result. So this batch should wait for the finish of previous batch, if 
you set "spark.streaming.concurrentJobs" larger than 1, then the 
current batch could start without waiting for the previous batch (if 
it is delayed), which will lead to unexpected results.
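For reference, the property is set like any other Spark conf; a minimal sketch (the job script name is made up):

```shell
# Sketch: allow up to 4 streaming batch jobs to run concurrently.
# Per the caveat above, this is only safe when batches do not depend
# on the results of previous batches.
spark-submit \
  --conf spark.streaming.concurrentJobs=4 \
  streaming_job.py
```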



thomas lavocat <thomas.lavo...@univ-grenoble-alpes.fr> wrote on Tue, Jun 5, 2018 at 7:48 PM:



On 05/06/2018 13:44, Saisai Shao wrote:

You need to read the code; this is an undocumented configuration.

I'm on it right now, but Spark is a big piece of software.

Basically this will break the ordering of streaming jobs; AFAIK it
may produce unexpected results if your streaming jobs are not
independent.

What do you mean exactly by not independent?
Are several sources joined together dependent?

Thanks,
Thomas


thomas lavocat <thomas.lavo...@univ-grenoble-alpes.fr> wrote on Tue, Jun 5, 2018 at 7:17 PM:

Hello,

Thanks for your answer.


On 05/06/2018 11:24, Saisai Shao wrote:

spark.streaming.concurrentJobs is a driver-side internal
configuration; it controls how many streaming jobs can
be submitted concurrently in one batch. Usually it should
not be configured by users, unless you're familiar with Spark
Streaming internals and know the implications of this
configuration.


How can I find documentation about those implications?

I've experimented with some configurations of this parameter and
found that my overall throughput increases in correlation with
this property.
But I'm experiencing scalability issues: with more than 16
receivers spread over 8 executors, my executors no longer
receive work from the driver and fall idle.
Is there an explanation?

Thanks,
Thomas







Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat


On 05/06/2018 13:44, Saisai Shao wrote:

You need to read the code; this is an undocumented configuration.

I'm on it right now, but Spark is a big piece of software.

Basically this will break the ordering of streaming jobs; AFAIK it may 
produce unexpected results if your streaming jobs are not independent.

What do you mean exactly by not independent?
Are several sources joined together dependent?

Thanks,
Thomas


thomas lavocat <thomas.lavo...@univ-grenoble-alpes.fr> wrote on Tue, Jun 5, 2018 at 7:17 PM:


Hello,

Thanks for your answer.


On 05/06/2018 11:24, Saisai Shao wrote:

spark.streaming.concurrentJobs is a driver-side internal
configuration; it controls how many streaming jobs can be
submitted concurrently in one batch. Usually it should not be
configured by users, unless you're familiar with Spark Streaming
internals and know the implications of this configuration.


How can I find documentation about those implications?

I've experimented with some configurations of this parameter and
found that my overall throughput increases in correlation with
this property.
But I'm experiencing scalability issues: with more than 16
receivers spread over 8 executors, my executors no longer receive
work from the driver and fall idle.
Is there an explanation?

Thanks,
Thomas





Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat

Hello,

Thanks for your answer.


On 05/06/2018 11:24, Saisai Shao wrote:
spark.streaming.concurrentJobs is a driver-side internal 
configuration; it controls how many streaming jobs can be 
submitted concurrently in one batch. Usually it should not be 
configured by users, unless you're familiar with Spark Streaming 
internals and know the implications of this configuration.


How can I find documentation about those implications?

I've experimented with some configurations of this parameter and found 
that my overall throughput increases in correlation with this property.
But I'm experiencing scalability issues: with more than 16 receivers 
spread over 8 executors, my executors no longer receive work from the 
driver and fall idle.

Is there an explanation?

Thanks,
Thomas



[Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat

Hi everyone,

I'm wondering whether the property spark.streaming.concurrentJobs 
reflects the total number of possible concurrent tasks on the cluster, or 
a local number of concurrent tasks on one compute node.


Thanks for your help.

Thomas

