Summarizing

1) The static data set is read from Parquet files in HDFS as a DataFrame and
has an initial parallelism of 90 (based on the number of input files/splits).
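
As a quick sanity check on where the 90 comes from, something like the
following prints the partition count right after the read (a minimal sketch;
the HDFS path is a placeholder for yours):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-check").getOrCreate()
    // the Parquet reader normally creates roughly one partition per input split/file
    val staticDf = spark.read.parquet("hdfs:///path/to/static/")  // placeholder path
    println(staticDf.rdd.getNumPartitions)                        // should print ~90 here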

2) The static data set DataFrame is converted to an RDD, and that RDD has a
parallelism of 18, which you did not expect.
dataframe.rdd is lazily evaluated, so there must be some operation you were
doing that triggered the drop from 90 to 18. This would be an operation that
breaks the stage / requires a shuffle, such as groupBy, reduceByKey,
repartition or coalesce.
If you are using coalesce, its second parameter, shuffle, defaults to false,
which means the upstream parallelism is not preserved.
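
For example, if a coalesce is in the path, this is the difference I mean
(illustrative names only):

    // with the default shuffle = false, coalesce collapses the upstream stage
    // down to 18 tasks as well; shuffle = true keeps the upstream stage at its
    // original parallelism and only merges partitions across a shuffle boundary
    val narrowed = staticDf.rdd.coalesce(18)                  // shuffle defaults to false
    val shuffled = staticDf.rdd.coalesce(18, shuffle = true)  // same as repartition(18)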

3) You have a DStream from a Kafka source with 9 partitions, and this is
joined with the static data set above? When joining, have you tried setting
numPartitions, the optional parameter that specifies the number of partitions
required for the join output?
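
A rough sketch of what I mean, assuming both sides are already keyed as
(K, V) pairs (kafkaStream and staticPairRdd are stand-ins for your variables):

    // join each micro-batch with the static RDD, fixing the number of output
    // partitions explicitly instead of inheriting 9 or 18
    val joined = kafkaStream.transform { batchRdd =>
      batchRdd.join(staticPairRdd, 90)   // numPartitions for the join result
    }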

4) Your batch interval is 15 seconds, but you are caching the static data set
for 30 minutes. What exactly do you mean by caching it for 30 minutes?
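
If you mean refreshing the cached static data every 30 minutes, I would
expect something roughly like this (purely a guess at your setup; spark is
the SparkSession, and the path and interval are placeholders):

    var staticDf = spark.read.parquet("hdfs:///path/to/static/").cache()
    var lastLoad = System.currentTimeMillis()
    val reloadIntervalMs = 30 * 60 * 1000L

    // called at the start of each 15-second batch; reloads and re-caches the
    // static DataFrame once the 30-minute interval has elapsed
    def currentStatic(): org.apache.spark.sql.DataFrame = {
      if (System.currentTimeMillis() - lastLoad > reloadIntervalMs) {
        staticDf.unpersist()
        staticDf = spark.read.parquet("hdfs:///path/to/static/").cache()
        lastLoad = System.currentTimeMillis()
      }
      staticDf
    }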

Note that when you cache data, under memory pressure there is a chance that
cached partitions get evicted and the partitioning is not preserved when they
are recomputed.
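
One way to reduce that risk (just a sketch, not a prescription) is to let
evicted partitions spill to disk instead of being dropped and recomputed:

    import org.apache.spark.storage.StorageLevel

    staticDf.persist(StorageLevel.MEMORY_AND_DISK)
    staticDf.count()   // action to materialize the cache eagerly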

It would be useful if you could provide Spark UI screenshots for one complete
batch, including the DAG and other details.

thanks
Vijay


