Hi, 

I would like to verify or correct my understanding of Spark at the
conceptual level. 
As I understand it, ignoring streaming mode for a minute, Spark takes some
input data (which can be an HDFS file), lets your code describe transformations
over that data, and ultimately dispatches the computation across its cluster,
thus acting as a distributed computing platform. It presumably splits the
dataset across the cluster machines, so that each machine participating in the
Spark cluster performs the computation on its own subset of the data.
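
For concreteness, this is the kind of (non-streaming) job I have in mind. It is
only a minimal sketch; the paths, app name and object name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount-sketch")
    val sc = new SparkContext(conf)

    // Read an HDFS file; Spark turns it into a partitioned, distributed dataset (an RDD)
    val lines = sc.textFile("hdfs:///path/to/input.txt")

    // My code describes the transformation over the data...
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // ...and Spark runs it across the cluster, each worker on its own partitions
    counts.saveAsTextFile("hdfs:///path/to/output")

    sc.stop()
  }
}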

Is that the case?
If I nailed it, how does Spark then handle a distributed HDFS file? I can see
three possibilities:

1. It pulls the whole file to/through one Spark server and partitions it from
   there across the cluster.
2. It partitions the HDFS file across the cluster without such a bottleneck,
   somehow intelligently letting each Spark server pull part of the data
   directly from HDFS.
3. It relies on Spark being installed on each "HDFS server" (datanode) and
   just uses that server's local HDFS file chunks, without transporting any
   input HDFS data at all.
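
In case it helps frame the question, here is roughly how I would try to poke at
this from the spark-shell (again just a sketch, with a placeholder path):

// Read an HDFS file into an RDD; the path is a placeholder
val lines = sc.textFile("hdfs:///path/to/input.txt")

// How many partitions did Spark create for it?
println(lines.partitions.length)

// Where would Spark prefer to compute each partition?
// (Presumably the datanodes holding the corresponding HDFS block?)
lines.partitions.foreach { p =>
  println(s"partition ${p.index}: " + lines.preferredLocations(p).mkString(", "))
}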

Many thanks! 
Matan


