How a Spark job reads a datasource depends on the underlying source system and
on the job configuration, in particular the number of executors and the number
of cores per executor.
https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets
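For example, a rough sketch (the path, resource numbers and minPartitions value
are made up for illustration): the number of input partitions comes from the
source (HDFS block size, splittability) and what the job asks for, while the
executor/core settings only decide how many of those partitions run at once.

import org.apache.spark.sql.SparkSession

object ReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-example")
      .config("spark.executor.instances", "4")  // hypothetical resources
      .config("spark.executor.cores", "2")
      .getOrCreate()
    val sc = spark.sparkContext

    // For a splittable file on HDFS, Spark creates roughly one partition per
    // HDFS block (128 MB by default); minPartitions can only raise that number.
    val lines = sc.textFile("hdfs:///data/events.log", minPartitions = 8)
    println(s"input partitions = ${lines.getNumPartitions}")

    spark.stop()
  }
}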

About shuffle operations:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
https://stackoverflow.com/questions/32210011/spark-difference-between-shuffle-write-shuffle-spill-memory-shuffle-spill
https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs/29012187#29012187
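A small sketch of a shuffle boundary (assuming sc is an existing SparkContext,
e.g. from spark-shell; the data and partition counts are illustrative).
reduceByKey repartitions the data by key: each map task writes shuffle files
(shuffle write), and if the map-side buffers don't fit in memory they spill to
disk (shuffle spill).

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)

val counts = words
  .map(w => (w, 1))
  .reduceByKey(_ + _, numPartitions = 2)  // shuffle: 3 map tasks -> 2 reduce tasks

counts.collect().foreach(println)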

This has a great explanation of how stages are split into tasks and how the shuffle works:
https://stackoverflow.com/questions/37528047/how-are-stages-split-into-tasks-in-spark
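You can see the stage boundaries yourself with toDebugString (again assuming sc
is an existing SparkContext and the HDFS path is hypothetical). Each shuffle
(reduceByKey, sortByKey, ...) starts a new stage, and the number of tasks in a
stage equals the number of partitions of the RDD in that stage.

val rdd = sc.textFile("hdfs:///data/events.log")  // stage 0: read + map side
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)                             // shuffle -> new stage
  .sortByKey()                                    // another shuffle -> another stage

println(rdd.toDebugString)  // prints the lineage with shuffle boundaries
rdd.count()                 // triggers the job; stages/tasks show up in the Spark UI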

========
A sample of the code and job configuration, the DAG, and the underlying source
(HDFS or other) would help explain.

thanks
VP



