Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-20 Thread LongVehicle
Hi Vijay, Thanks for the follow-up. The reason we have 90 HDFS files (which gives the HDFS read stage a parallelism of 90) is that we load the same HDFS data in different jobs, and these jobs have parallelisms (executors x cores) of 9, 18, and 30. The uneven assignment problem that we had
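As a quick sanity check on the numbers above, 90 input files divide evenly across all three slot counts (9, 18, 30), so any uneven task-to-machine assignment is not a simple divisibility problem. A minimal sketch of that arithmetic (the object and variable names are ours, not from the thread):

```scala
object TaskDistribution {
  // One task per HDFS file in the read stage, per the message above.
  val numTasks = 90
  // Total slots (executors x cores) for the three jobs mentioned.
  val slotCounts = Seq(9, 18, 30)

  def main(args: Array[String]): Unit = {
    slotCounts.foreach { slots =>
      val perSlot  = numTasks / slots  // whole tasks per slot
      val leftover = numTasks % slots  // tasks left over after even division
      println(s"$slots slots -> $perSlot tasks per slot, $leftover leftover")
    }
  }
}
```

With an even division like this, a balanced scheduler would give every slot exactly 10, 5, or 3 tasks respectively; skew therefore points at locality or scheduling behavior rather than task counts.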

[Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-01-30 Thread LongVehicle
Hello everyone, We are running Spark Streaming jobs on Spark 2.1 in cluster mode on YARN. We have an RDD (3GB) that we periodically (every 30min) refresh by reading from HDFS. Namely, we create a DataFrame df using sqlContext.read.parquet, and then we create an RDD via rdd = df.as[T].rdd. The
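The refresh step described above might look roughly like the following sketch. Only sqlContext.read.parquet and the df.as[T].rdd conversion come from the post; the Record case class, the refresh method, and the HDFS path parameter are hypothetical stand-ins:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Hypothetical record type standing in for T in df.as[T].
case class Record(key: String, value: Long)

// Sketch of one refresh cycle (the 30-min scheduling loop is omitted).
def refresh(sqlContext: SQLContext, path: String): RDD[Record] = {
  import sqlContext.implicits._          // needed for the .as[Record] encoder
  val df = sqlContext.read.parquet(path) // DataFrame from HDFS parquet files
  df.as[Record].rdd                      // typed Dataset -> RDD[Record]
}
```

Note that the read stage's parallelism is driven by the number of parquet files at `path`, which is what ties this refresh to the 90-task figure discussed later in the thread.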