Hi everyone,
The possibility of in-memory shuffling was discussed in this pull request
back in 2015: https://github.com/apache/spark/pull/5403.
In 2016, the paper "Scaling Spark on HPC Systems" says that Spark still
shuffles using disks. I would like to know:
What is the current state of in-memory shuffling in Spark?

> tasks/stages are defined to perform which may result in shuffle.
If I understand correctly :
* Only shuffle data goes through the driver
* The receivers' data stays node-local until a shuffle occurs
Is that right?
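To make the "data stays node-local until a shuffle" point concrete, here is a toy model in plain Python (not Spark's actual implementation, just an illustrative sketch): a narrow, map-only transformation leaves every record in its original partition, while a reduceByKey-style aggregation must rehash records by key and move them between partitions — which, on a cluster, means moving them between nodes.

```python
from collections import defaultdict

def map_side(partitions, fn):
    # Narrow (map-only) transformation: each record stays in the
    # partition where it already lives -- no shuffle, no data movement.
    return [[fn(rec) for rec in part] for part in partitions]

def shuffle_by_key(partitions, num_out):
    # Wide transformation: every (key, value) pair is rehashed by key
    # and sent to the partition that owns that key, so records move
    # between partitions (on a cluster: between executors/nodes).
    out = [defaultdict(int) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[hash(key) % num_out][key] += value  # reduceByKey-style sum
    return [dict(d) for d in out]

# Two input partitions, as if held on two different executors.
parts = [[("a", 1), ("b", 1)], [("a", 2), ("c", 3)]]
mapped = map_side(parts, lambda kv: (kv[0], kv[1] * 10))
summed = shuffle_by_key(parts, 2)
```

After the shuffle, each key lives in exactly one output partition (here "a" has been summed to 3 even though its values started on different partitions), whereas the mapped result preserves the original partitioning untouched.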
> On Wed, Jul 4, 2018 at 1:56 PM, thomas lavocat
> <thomas.lavo...@univ-grenoble-alpes.fr> wrote:
Hello,
I have a question on Spark Dataflow. If I understand correctly, all
received data is sent from the executor to the driver of the application
prior to task creation.
Then the tasks embedding the data transit from the driver to the
executors in order to be processed.
As executors cannot […]

[…] the previous batch: if you set "spark.streaming.concurrentJobs"
larger than 1, then the current batch could start without waiting for
the previous batch (if it is delayed), which will lead to unexpected
results.
thomas lavocat <thomas.lavo...@univ-grenoble-alpes.fr> wrote on June 5, 2018:

> are not independent.

What do you mean exactly by "not independent"?
Are several sources joined together dependent?
Thanks,
Thomas
thomas lavocat <thomas.lavo...@univ-grenoble-alpes.fr> wrote on Tuesday,
June 5, 2018 at 7:17 PM:

Hello,
Thanks for your answer.
On 05/06/2018 11:24, Saisai Shao wrote:
spark.streaming.concurrentJobs is a driver-side internal configuration;
it controls how many streaming jobs can be submitted concurrently in
one batch. Usually this should not be configured by the user, unless
you're […]
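For reference, if one does decide to change it, the property can be passed like any other Spark configuration at submission time (a sketch only; the value 2 and the jar name are illustrative, not a recommendation):

```shell
spark-submit \
  --conf spark.streaming.concurrentJobs=2 \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar
```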
Hi everyone,
I'm wondering whether the property spark.streaming.concurrentJobs should
reflect the total number of possible concurrent tasks on the cluster, or
a local number of concurrent tasks on one compute node.
Thanks for your help.
Thomas