We are using the createDirectStream API to receive messages from a topic with 48 partitions, and I set `--num-executors 48` and `--executor-cores 1` in spark-submit.
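For reference, the submit command looks roughly like this (the main class and jar names are placeholders, not our real ones):

```shell
# One executor per Kafka partition: 48 executors with 1 core each.
# --class and the jar name below are hypothetical.
spark-submit \
  --class com.example.StreamingJob \
  --num-executors 48 \
  --executor-cores 1 \
  streaming-job.jar
```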
All 48 partitions are received in parallel, and the corresponding RDDs inside foreachRDD are also processed in parallel. But when I join the transformed DataFrame `jsonDF` (code below) with another DataFrame, the per-partition tasks no longer run in parallel: there is much more shuffle read/write, and the number of executors running tasks drops below the number of partitions while the join executes. How can I make sure the join runs on all executors? Any help is appreciated.

```scala
import sqlContext.implicits._  // needed for .toDF on an RDD

kstream.foreachRDD { rdd =>
  val jsonDF = sqlContext.read.json(rdd).toDF
  .....
  ...
  val metaDF = ssc.sparkContext.textFile("file").toDF
  val join = jsonDF.join(metaDF)
  join.map(....).count
}
```