Hi,

I would like to verify or correct my understanding of Spark at the conceptual level. As I understand it, ignoring streaming mode for a minute, Spark takes some input data (which can be an HDFS file), lets your code transform the data, and ultimately dispatches the computation over that data across its cluster, thus acting as a distributed computing platform. It typically splits the dataset into partitions across the cluster machines, so that each machine participating in the cluster performs the computation on its own subset of the data.
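For concreteness, here is a minimal sketch of the mental model I'm describing (the HDFS path and app name are made up; standard RDD API as far as I can tell):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on older versions

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCountSketch"))

        // Hypothetical path: load a distributed file as an RDD of lines.
        val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

        // These transformations are what I mean by "your code transforms the data";
        // they run in parallel, one task per partition, on the cluster's executors.
        val counts = lines
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // One partition per input split (roughly one per HDFS block, I believe).
        println(s"partitions: ${lines.partitions.length}")
        counts.take(10).foreach(println)

        sc.stop()
      }
    }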
Is that the case? In case I nailed it, how then does it handle a distributed HDFS file? Does it:

(1) pull the whole file to/through one Spark server and partition it from there across the cluster,
(2) partition the HDFS file across the cluster without such a bottleneck, somehow intelligently letting each Spark server pull some of the data directly from HDFS, or
(3) rely entirely on Spark being installed on each "HDFS server", with every node using the HDFS file chunks it holds locally, so that no input data is transported at all?

(See the P.S. below for how I'd try to inspect this.)

Many thanks!
Matan
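P.S. In case it sharpens the third question: I imagine one could check where Spark *wants* to run each partition with something like the below (preferredLocations is part of the RDD API; the path is again made up). If the hosts it prints are the DataNodes holding each block, that would suggest the scheduler uses HDFS block locations for locality rather than funneling the data through one machine:

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalityCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LocalityCheck"))
        val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

        // For each partition, list the hosts Spark would prefer to run its task on.
        // For an HDFS-backed RDD I'd expect these to be the DataNodes holding the block.
        lines.partitions.foreach { p =>
          println(s"partition ${p.index} -> ${lines.preferredLocations(p).mkString(", ")}")
        }

        sc.stop()
      }
    }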