I'm trying to find complete documentation on the internal architecture of Apache Spark, but so far I haven't found any.
For example, I'm trying to understand the following. Assume we have a 1 TB text file on HDFS (3 nodes in the cluster, replication factor 1). The file will be split into 128 MB blocks, and each block will be stored on only one node. We run Spark workers on these nodes. I know that Spark tries to process data on the same node where it is stored in HDFS (to avoid network I/O).

Suppose, for example, that I want to do a word count over this 1 TB text file. My questions are:

1. Will Spark load a block (128 MB) into RAM, count its words, drop it from memory, and then repeat this sequentially for each block? What happens if there is not enough available RAM?
2. When will Spark use non-local data from HDFS?
3. What if I need a more complex task, where the results of each iteration on each worker have to be transferred to all the other workers (shuffling?). Do I need to write them to HDFS myself and then read them back? For example, I can't understand how K-means clustering or gradient descent work on Spark.

I would appreciate any link to an Apache Spark architecture guide.

--
Best regards,
Vitalii Duk
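P.S. To make the word count concrete, this is roughly the job I have in mind (a minimal sketch; the application name and HDFS paths are made up):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // As far as I understand, each 128 MB HDFS block
    // becomes one partition of this RDD.
    val lines = sc.textFile("hdfs:///data/big.txt")

    val counts = lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))     // pair every word with a count of 1
      .reduceByKey(_ + _)         // shuffle: sum the counts per word

    counts.saveAsTextFile("hdfs:///data/wordcount-out")
    sc.stop()
  }
}

My questions 1 and 2 are about what happens to those per-block partitions: whether they are loaded into RAM one at a time, and when a non-local block would be read over the network.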
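And to make question 3 concrete, here is the kind of iterative job I mean: a sketch of 1-D gradient descent (least squares), assuming an RDD of (label, feature) pairs already exists. The function name, the learning rate, and the data layout are all made up; whether the reduce() below is really how results move between iterations is exactly what I'm asking:

import org.apache.spark.rdd.RDD

// points: (y, x) pairs; w: the single model weight, kept on the driver.
def train(points: RDD[(Double, Double)], iterations: Int): Double = {
  var w = 0.0
  for (_ <- 1 to iterations) {
    // Every worker computes a partial gradient over its local partitions;
    // reduce() then sends the partial sums back to the driver over the
    // network, not through HDFS (if I understand correctly).
    val gradient = points
      .map { case (y, x) => (w * x - y) * x }   // d/dw of (w*x - y)^2 / 2
      .reduce(_ + _)
    // The driver updates w; the new value travels to the workers inside
    // the closure of the next iteration's map().
    w -= 0.01 * gradient
  }
  w
}

If this is roughly right, I assume one would also call points.cache() so the input is not re-read from HDFS on every iteration, but again, this is the part I'm unsure about.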