Summarizing:

1) The static data set, read from Parquet files in HDFS as a DataFrame, has an initial parallelism of 90 (based on the number of input files/splits); a quick sketch of checking this is below.
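A minimal sketch of checking that initial parallelism, assuming a hypothetical HDFS path "hdfs:///data/static" in place of yours:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("partition-check").getOrCreate()
  // Read the static data set; the initial partition count follows the
  // Parquet file/split layout, e.g. 90 in your case.
  val staticDf = spark.read.parquet("hdfs:///data/static")
  println(staticDf.rdd.getNumPartitions)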
2) The static DataFrame is converted to an RDD, and the RDD has a parallelism of 18. This is not expected: dataframe.rdd is lazily evaluated, so there must be some operation you are performing that triggered the drop from 90 to 18. That would be an operation that breaks the stage or requires a shuffle, such as groupByKey, reduceByKey, repartition, or coalesce. If you are using coalesce, note that its second parameter, shuffle, defaults to false, which means upstream parallelism is not preserved (see the first sketch appended below).

3) You have a DStream from a Kafka source with 9 partitions, and this is joined with the static data set above? When joining, have you tried setting numPartitions, the optional parameter for specifying the number of partitions required (second sketch below)?

4) Your batch interval is 15 seconds, but you are caching the static data set for 30 minutes. What exactly do you mean by caching for 30 minutes? Note that when you cache data, under memory pressure there is a chance that cached blocks are evicted and the partitioning is not preserved (third sketch below).

It would be useful to provide Spark UI screenshots for one complete batch, the DAG, and other details.

Thanks,
Vijay
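For point 2), a sketch of the coalesce behaviour, reusing the hypothetical staticDf from above:

  // coalesce(n) defaults to shuffle = false: Spark narrows the upstream
  // stage as well, so earlier work may also run with only 18 tasks.
  val narrowed = staticDf.rdd.coalesce(18)
  // Passing shuffle = true inserts a shuffle boundary instead, so the
  // upstream stage keeps its original 90 partitions.
  val shuffled = staticDf.rdd.coalesce(18, shuffle = true)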
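For point 3), a sketch of forcing the partition count on the join; kafkaStream and staticPairRdd are hypothetical keyed (K, V) handles standing in for your DStream and static data set:

  // Join each Kafka micro-batch against the static RDD, explicitly
  // requesting the number of output partitions via numPartitions.
  val joined = kafkaStream.transform { batchRdd =>
    batchRdd.join(staticPairRdd, numPartitions = 90)
  }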
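For point 4), a sketch of making the cached static set more resilient to eviction; MEMORY_AND_DISK is just one choice of storage level, not necessarily the right one for your job:

  import org.apache.spark.storage.StorageLevel

  // With the default MEMORY_ONLY level, blocks can be evicted under
  // memory pressure and partitions recomputed; MEMORY_AND_DISK spills
  // to disk instead. Re-check the partition count after caching.
  staticPairRdd.persist(StorageLevel.MEMORY_AND_DISK)
  println(staticPairRdd.getNumPartitions)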