How a Spark job reads a data source depends on the underlying source system and on the job configuration, in particular the number of executors and the number of cores per executor:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets
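To make that concrete, here is a minimal sketch (assuming Scala; the HDFS path and object name are made up for illustration, not from this thread). The executor settings cap concurrency, e.g. on YARN, spark-submit --num-executors 4 --executor-cores 2 gives 8 concurrent task slots, while the number of input tasks is driven by the source's split layout:

import org.apache.spark.sql.SparkSession

object ReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-sketch").getOrCreate()
    val sc = spark.sparkContext

    // For an HDFS source, Spark creates roughly one input partition per
    // HDFS block (128 MB by default); minPartitions is only a lower
    // bound, never a cap, so a large file can still yield more tasks.
    val lines = sc.textFile("hdfs:///data/events.txt", minPartitions = 8)
    println(s"input partitions = ${lines.getNumPartitions}")

    spark.stop()
  }
}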
About shuffle operations:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
https://stackoverflow.com/questions/32210011/spark-difference-between-shuffle-write-shuffle-spill-memory-shuffle-spill
https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs/29012187#29012187

This one has a great explanation of how the shuffle works and how stages are split into tasks:
https://stackoverflow.com/questions/37528047/how-are-stages-split-into-tasks-in-spark

========
A sample of code and job configuration, the DAG, and the underlying source (HDFS or other) would help explain.

thanks
VP
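Since the quoted question asks for a sample: below is a minimal sketch of a job that shuffles (assuming Scala; the data and partition counts are illustrative). Running it and opening the stages tab of the Spark UI shows two stages, split at the reduceByKey:

import org.apache.spark.sql.SparkSession

object ShuffleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shuffle-sketch").getOrCreate()
    val sc = spark.sparkContext

    // 3 input partitions -> 3 tasks in the first (map) stage.
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)

    // map is a narrow dependency, so it stays in the same stage.
    // reduceByKey needs all values for a key in one place, so Spark
    // inserts a shuffle: the map-side tasks write shuffle files and a
    // second stage of 4 reduce tasks reads them back.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _, numPartitions = 4)

    counts.collect().foreach(println)
    spark.stop()
  }
}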