Spark In Memory Shuffle

2018-10-17 Thread thomas lavocat
Hi everyone, The possibility of having in-memory shuffling is discussed in this pull request: https://github.com/apache/spark/pull/5403. That was in 2015. In 2016, the paper "Scaling Spark on HPC Systems" says that Spark still shuffles using disks. I would like to know: what is the current state of
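For readers who land on this thread: shuffle files are written to the directories listed in spark.local.dir, so a common stopgap on HPC clusters (a sketch of a workaround, not an official in-memory shuffle) is to point that setting at a RAM-backed filesystem. The /dev/shm path below is an assumption about the node, and note that on YARN the NodeManager's local dirs take precedence over this setting:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: redirect shuffle spill files to a tmpfs mount so they never
    // touch spinning disk. This does not change Spark's shuffle code path;
    // it only changes where the intermediate files are written.
    val conf = new SparkConf()
      .setAppName("tmpfs-shuffle-sketch")
      .set("spark.local.dir", "/dev/shm/spark-tmp") // assumed tmpfs mount

    val sc = new SparkContext(conf)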

Re: [Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-05 Thread Thomas Lavocat
tasks/stages are defined to perform, which may result in a shuffle. If I understand correctly: * Only shuffle data goes through the driver * The receivers' data stays node-local until a shuffle occurs. Is that right? > On Wed, Jul 4, 2018 at 1:56 PM, thomas lavocat <thomas.lavo...@univ-
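For what it's worth, the shuffle boundary in question is easy to see in code. A minimal sketch (paths are placeholders, and it assumes an existing SparkContext sc): narrow transformations stay node-local until a wide one forces an exchange, and that exchange happens directly between executors while the driver only schedules the stages:

    // Narrow transformations run where the blocks already live; the wide
    // reduceByKey triggers a shuffle in which executors fetch blocks from
    // each other -- the records themselves do not pass through the driver.
    val counts = sc.textFile("hdfs:///input")   // placeholder path
      .flatMap(_.split(" "))                    // narrow: node-local
      .map(word => (word, 1))                   // narrow: node-local
      .reduceByKey(_ + _)                       // wide: shuffle between executors

    counts.saveAsTextFile("hdfs:///output")     // placeholder path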

[Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-04 Thread thomas lavocat
Hello, I have a question on Spark dataflow. If I understand correctly, all received data is sent from the executor to the driver of the application prior to task creation. Then the task embedding the data transits from the driver to the executor in order to be processed. As executors cannot
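For context, the receiver-based path the subject's MEMORY_ONLY refers to looks like the sketch below (host, port, and batch interval are placeholders). The receiver runs inside an executor and stores incoming blocks there; the driver only receives block metadata and schedules tasks onto the executors that already hold the blocks:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Receiver-based input stream kept in memory only: blocks live on the
    // executor running the receiver (no replication, no disk), and tasks
    // are scheduled to read them where they sit.
    val conf = new SparkConf().setAppName("dataflow-sketch")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()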

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-11 Thread thomas lavocat
previous batch, if you set "spark.streaming.concurrentJobs" larger than 1, then the current batch could start without waiting for the previous batch (if it is delayed), which will lead to unexpected results. thomas lavocat <thomas.lavo...@univ-grenoble-alpes.fr> wrote on June 5, 2018
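To make that caveat concrete, here is a sketch of the kind of pipeline where overlapping batches hurt (checkpoint path, host, and port are placeholders): updateStateByKey assumes each batch's updates are applied in order, so two batches running concurrently can interleave their state updates:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch of why concurrentJobs > 1 can surprise you: stateful operators
    // expect batches to commit their updates in batch order.
    val conf = new SparkConf()
      .setAppName("concurrent-jobs-caveat")
      .set("spark.streaming.concurrentJobs", "2") // > 1: batches may overlap

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/checkpoint") // required by updateStateByKey; placeholder path

    val counts = ssc.socketTextStream("localhost", 9999)
      .map(word => (word, 1))
      .updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + values.sum))

    counts.print()
    ssc.start()
    ssc.awaitTermination()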

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
are not independent. What do you mean exactly by not independent? Are several sources joined together dependent? Thanks, Thomas thomas lavocat <thomas.lavo...@univ-grenoble-alpes.fr> wrote on Tuesday, June 5, 2018 at 7:17 PM: Hello, Thanks for your answer. On 05/06/2018 11:24, Saisai S
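As one possible illustration of the "sources joined together" case being asked about (hosts and ports are placeholders, and an existing StreamingContext ssc is assumed): when two input streams feed a single output, each batch's output job reads from both sources, so the resulting jobs are not independent of one another:

    // Two receiver streams combined into a single result: the per-batch
    // output job depends on data from both sources.
    val left  = ssc.socketTextStream("host-a", 9999).map(line => (line, 1))
    val right = ssc.socketTextStream("host-b", 9999).map(line => (line, 2))

    left.join(right).print() // one dependent job per batch interval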

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
Hello, Thanks for your answer. On 05/06/2018 11:24, Saisai Shao wrote: spark.streaming.concurrentJobs is a driver-side internal configuration; it determines how many streaming jobs can be submitted concurrently in one batch. Usually this should not be configured by the user, unless you're
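For reference, the property is an application-wide setting read by the driver, where the streaming JobScheduler uses it to size its job-execution thread pool. A minimal sketch of setting it (the value 2 is only an example):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Driver-side setting: it sizes the thread pool that runs each batch's
    // jobs, so it applies per application, not per node.
    val conf = new SparkConf()
      .setAppName("concurrent-jobs-sketch")
      .set("spark.streaming.concurrentJobs", "2") // default is 1

    val ssc = new StreamingContext(conf, Seconds(1))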

[Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
Hi everyone, I'm wondering if the property spark.streaming.concurrentJobs should reflect the total number of possible concurrent tasks on the cluster, or the local number of concurrent tasks on one compute node. Thanks for your help. Thomas