Apologies for the long answer. Understanding partitioning at each stage of the RDD graph/lineage is important for efficient parallelism and load balancing. This applies to any source, streaming or static. You have a tricky situation here: one Kafka source with 9 partitions and a static data set with 90 partitions.
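To see why the partition count relative to the total core count matters, here is a small sketch in plain Scala (no Spark dependency). `waves` is a hypothetical helper, not a Spark API; it just models Spark's behavior of running one task per partition, filling the available cores in successive waves:

```scala
object TaskWaves {
  // Model of Spark scheduling: one task per partition, at most
  // `totalCores` tasks running at once, so tasks run in "waves".
  // Returns the number of tasks in each wave.
  def waves(numPartitions: Int, totalCores: Int): Seq[Int] = {
    val fullWaves = Seq.fill(numPartitions / totalCores)(totalCores)
    val remainder = numPartitions % totalCores
    if (remainder != 0) fullWaves :+ remainder else fullWaves
  }

  def main(args: Array[String]): Unit = {
    // 90 partitions on 4 executors x 5 cores = 20 cores:
    // the last wave uses only half the cluster.
    println(waves(90, 20).mkString("+")) // 20+20+20+20+10
    // 90 partitions on 6 executors x 5 cores = 30 cores:
    // every wave keeps all cores busy.
    println(waves(90, 30).mkString("+")) // 30+30+30
  }
}
```

The uneven last wave (10 tasks on 20 cores) is the load-imbalance to avoid, which is why a partition count that is a multiple of the total core count is preferable.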
Before joining these two, try to make the number of partitions equal for both RDDs: you can either repartition the Kafka source to 90 partitions, coalesce the flat-file RDD to 9 partitions, or pick something midway between 9 and 90.

In general, the number of tasks that can run in parallel equals the total number of cores available to the Spark job (number of executors * cores per executor). As an example, if the flat file has 90 partitions and you set 4 executors with 5 cores each, for a total of 20 cores, the tasks get scheduled as 20+20+20+20+10. As you can see, in the last wave only 10 tasks run even though you have 20 cores. Compare this with 6 executors with 5 cores each, for a total of 30 cores: then it would be 30+30+30. Ideally, the number of partitions for each RDD in the graph/lineage should be a multiple of the total number of cores available to the Spark job.

In terms of data locality, prefer process-local over node-local over rack-local. As an example, 5 executors with 4 cores each and 4 executors with 5 cores each both give 20 cores in total, but with 4 executors there is less shuffling and more process-local/node-local work. You need to look at the RDD graph for this, e.g. the DataFrame df = sqlContext.read.parquet(...) and its RDD rdd = df.as[T].rdd.

On your final question: you should be able to tune the static RDD without an external store by carefully looking at each batch's RDD lineage during those 30 minutes before the RDD gets refreshed again. If you would like to use an external system, Apache Ignite is something you can use as a cache.

thanks
Vijay