Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-23 Thread vijay.bvp
Thanks for adding the RDD lineage graph. I could see 18 parallel tasks for the HDFS read; was it changed? What is the Spark job configuration, i.e. how many executors and cores per executor? I would say keep the partitioning a multiple of (number of executors * cores) for all the RDDs. If you have 3 executors
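The arithmetic behind this advice can be sketched as follows. The 3 executors come from the thread; the 6 cores per executor is an illustrative assumption, not a figure from the discussion.

```python
# Sketch of the partition-alignment rule suggested above: keep each RDD's
# partition count a multiple of (executors * cores per executor), so every
# scheduling wave fills all available task slots evenly.

def is_balanced(num_partitions: int, executors: int, cores_per_executor: int) -> bool:
    slots = executors * cores_per_executor
    return num_partitions % slots == 0

executors, cores = 3, 6                    # 18 task slots in total (cores assumed)
print(is_balanced(90, executors, cores))   # 90 = 5 full waves of 18 tasks -> True
print(is_balanced(100, executors, cores))  # 100 leaves a partial last wave -> False
```

A partition count that is not a multiple of the slot count leaves the final wave partially filled, so some executors sit idle while stragglers finish.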

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-20 Thread LongVehicle
Hi Vijay, Thanks for the follow-up. The reason we have 90 HDFS files (which causes a parallelism of 90 for the HDFS read stage) is that we load the same HDFS data in different jobs, and these jobs have parallelisms (executors X cores) of 9, 18, 30. The uneven assignment problem that we had
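The choice of 90 files lines up with those job parallelisms: 90 is the least common multiple of 9, 18, and 30, so each job can split the files evenly across its task slots. A quick check (assuming Python 3.9+ for `math.lcm`):

```python
import math

# The thread's jobs read the same 90 HDFS files with parallelisms
# (executors * cores) of 9, 18, and 30. 90 is the least common multiple
# of the three, so every job gets a whole number of files per task slot.
parallelisms = [9, 18, 30]
lcm = math.lcm(*parallelisms)
print(lcm)                              # 90
print([90 // p for p in parallelisms])  # files per task slot: [10, 5, 3]
```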

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-19 Thread vijay.bvp
Apologies for the long answer. Understanding partitioning at each stage of the RDD graph/lineage is important for efficient parallelism and balanced load. This applies to working with any sources, streaming or static. You have a tricky situation here of one source, Kafka, with 9

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-19 Thread Aleksandar Vitorovic
Hi Vijay, Thank you very much for your reply. Setting the number of partitions explicitly in the join, and the influence of memory pressure on partitioning, were definitely very good insights. In the end, we avoided the issue of uneven load balancing completely by doing the following two things: a) Reducing the

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-01-31 Thread vijay.bvp
Summarizing: 1) The static data set, read from Parquet files in HDFS as a DataFrame, has an initial parallelism of 90 (based on the number of input files). 2) The static DataFrame is converted to an RDD, and the RDD has a parallelism of 18; this was not expected. dataframe.rdd is lazily evaluated, so there must be some operation

[Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-01-30 Thread LongVehicle
Hello everyone, We are running Spark Streaming jobs in Spark 2.1 in cluster mode in YARN. We have an RDD (3GB) that we periodically (every 30min) refresh by reading from HDFS. Namely, we create a DataFrame df using sqlContext.read.parquet, and then we create RDD rdd = df.as[T].rdd. The