Hello everyone,

We are running Spark Streaming jobs on Spark 2.1, in cluster mode on YARN. We have a 3 GB RDD that we refresh every 30 minutes by re-reading it from HDFS. Specifically, we create a DataFrame df using sqlContext.read.parquet, and then derive an RDD from it: rdd = df.as[T].rdd. (A condensed sketch of the code is included below.)

The first unexpected thing is that although df has a parallelism of 90 (because we read that many HDFS files), rdd has a parallelism of only 18 (executors x cores = 9 x 2 in our setup).

In the final stage, we repartition rdd using a HashPartitioner with a parallelism of 18 (we denote the result final_rdd), and cache it with MEMORY_ONLY_SER for the 30 minutes until the next refresh. We repartition by the same key that was used to partition the HDFS files in the first place.

Finally, we leftOuterJoin a DStream reading from 9 Kafka partitions (about 300 MB in total) with final_rdd (3 GB). This DStream is partitioned by the same (join) key. Our batch interval is 15 seconds, and we read new data from Kafka in each batch.

We noticed that the tasks over final_rdd are sometimes (non-deterministically) scheduled unevenly across the executors, and sometimes only 1 or 2 executors end up executing all the tasks. The problem with the uneven assignment is that it persists until we reload the HDFS data, i.e. for up to half an hour. Why is this happening? We are aware that Spark uses locality when assigning tasks to executors, so we also tried setting spark.shuffle.reduceLocality.enabled=false. Unfortunately, this did not help, neither for rdd nor for final_rdd.
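For reference, here is a condensed sketch of the pipeline (Scala). The case class Record, the HDFS path, and the kafkaStream stub are placeholders for our actual types and sources, and the keying map before partitionBy stands in for however the data gets paired by the join key:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

// Placeholder for our actual element type T, keyed by the join key.
case class Record(key: String, payload: String)

val spark = SparkSession.builder().getOrCreate()
val sqlContext = spark.sqlContext
import sqlContext.implicits._ // encoder for the case class

// Stand-in for our Kafka direct stream over 9 partitions, already keyed
// by the join key (KafkaUtils.createDirectStream and the
// StreamingContext are elided here).
def kafkaStream: DStream[(String, Record)] = ???

// Refreshed every 30 minutes; 90 parquet files, so df has 90 partitions.
val df = sqlContext.read.parquet("hdfs:///path/to/reference-data")

// Here the parallelism drops from 90 to 18 (9 executors x 2 cores).
val rdd = df.as[Record].rdd.map(r => (r.key, r))

// Repartition by the same key the HDFS files were partitioned by,
// then cache serialized until the next refresh.
val finalRdd = rdd
  .partitionBy(new HashPartitioner(18))
  .persist(StorageLevel.MEMORY_ONLY_SER)

// Every 15-second batch: left outer join the new Kafka data against the
// cached reference RDD. (In the real job, the reference to finalRdd is
// swapped on each 30-minute refresh.)
val joined = kafkaStream.transform { batchRdd =>
  batchRdd.leftOuterJoin(finalRdd)
}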
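And for completeness, this is how we disabled the reduce-side locality preference. A minimal sketch, assuming the conf is applied before the context is created; the same setting can equivalently be passed on the spark-submit command line:

import org.apache.spark.SparkConf

// Must be set before the SparkContext/SparkSession is created;
// equivalent to: spark-submit --conf spark.shuffle.reduceLocality.enabled=false
val conf = new SparkConf()
  .set("spark.shuffle.reduceLocality.enabled", "false")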
Any ideas how to address the problem? Many thanks!