Hello everyone,

We are running Spark Streaming jobs on Spark 2.1 in YARN cluster mode. We
have an RDD (3 GB) that we periodically (every 30 minutes) refresh by
reading from HDFS. That is, we create a DataFrame df using
sqlContext.read.parquet, and then create an RDD as rdd = df.as[T].rdd. The
first unexpected thing is that although df has a parallelism of 90 (because
that is how many HDFS files we read), rdd has a parallelism of 18
(executors x cores = 9 x 2 in our setup). In the final stage, we
repartition rdd using a HashPartitioner with a parallelism of 18 (we denote
the result final_rdd) and cache it with MEMORY_ONLY_SER for 30 minutes.
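Roughly, the refresh step looks like the sketch below (Event, hdfsPath and
the key handling are simplified placeholders, not our actual code):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel
import sqlContext.implicits._

case class Event(key: String, value: String)

// Read the ~90 parquet files we wrote out to HDFS.
val df = sqlContext.read.parquet(hdfsPath)

// Convert to a typed RDD; this is where we observe only 18 partitions.
val rdd = df.as[Event].rdd
println(s"rdd partitions: ${rdd.getNumPartitions}")  // observed: 18

// Key the records and repartition with the same key used for the HDFS
// layout, then cache the result for the next 30 minutes.
val finalRdd = rdd
  .map(e => (e.key, e))
  .partitionBy(new HashPartitioner(18))
  .persist(StorageLevel.MEMORY_ONLY_SER)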
The repartitioning uses the same key that was used to partition the HDFS
files in the first place. Finally, we leftOuterJoin a DStream read from 9
Kafka partitions (about 300 MB in total) with final_rdd (3 GB). This
DStream is partitioned by the same (join) key. Our batch interval is 15
seconds, and we read new data from Kafka in each batch.
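The join itself happens per batch via DStream.transform, roughly like this
(Kafka stream creation and the 30-minute refresh bookkeeping are omitted;
kafkaStream is assumed to already be a pair DStream keyed by the join key):

// kafkaStream: DStream[(String, Event)]; finalRdd: the cached pair RDD
// from the sketch above.
val joined = kafkaStream.transform { batchRdd =>
  // Each 15-second batch of Kafka data is left-outer-joined against the
  // cached RDD.
  batchRdd.leftOuterJoin(finalRdd)
}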
                                                           
We noticed that final_rdd is sometimes (non-deterministically) scheduled
unevenly across executors; sometimes only 1 or 2 executors end up executing
all the tasks. The problem with this uneven assignment is that it persists
until we reload the HDFS data, i.e. for half an hour.
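For reference, a quick way to check this kind of placement is something
like the following sketch; it reports, for each partition of final_rdd, the
host that ran the task and the record count:

// Diagnostic only: after caching, locality makes tasks run where the
// blocks live, so this roughly shows how the cached partitions are spread.
val placement = finalRdd.mapPartitionsWithIndex { (idx, it) =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  Iterator((idx, host, it.size))
}.collect()

placement.groupBy(_._2).foreach { case (host, parts) =>
  println(s"$host -> ${parts.length} partitions, ${parts.map(_._3).sum} records")
}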
Why is this happening? We are aware that Spark takes locality into account
when assigning tasks to executors, but we also tried setting
spark.shuffle.reduceLocality.enabled=false. Unfortunately, this did not
help, neither for rdd nor for final_rdd.
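We set that flag along the following lines on the SparkConf (which should
be equivalent to passing it via --conf to spark-submit; the app name below
is just a placeholder):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-join")  // placeholder name
  .set("spark.shuffle.reduceLocality.enabled", "false")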

Any ideas on how to address this problem?

Many thanks!


