I read in the RDD paper
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf):

"For example, an RDD representing an HDFS file has a partition for each block
of the file and knows which machines each block is on"

And on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html:

"To minimize global bandwidth consumption and read latency, HDFS tries to
satisfy a read request from a replica that is closest to the reader. If
there exists a replica on the same rack as the reader node, then that
replica is preferred to satisfy the read request"
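
To make the quoted rule concrete, here is a toy sketch of "pick the closest
replica". It is only my reading of the sentence above, not HDFS's actual
code: the Node class, the distance() metric (0 = same node, 1 = same rack,
2 = other rack) and the node/rack names are all assumptions made up for
illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    """Hypothetical datanode, identified by its name and its rack."""
    name: str
    rack: str


def distance(reader: Node, replica: Node) -> int:
    """Toy metric (assumption): 0 = same node, 1 = same rack, 2 = other rack."""
    if reader.name == replica.name:
        return 0
    if reader.rack == replica.rack:
        return 1
    return 2


def pick_replica(reader: Node, replicas: List[Node]) -> Node:
    """Return the replica closest to the reader under the toy metric."""
    return min(replicas, key=lambda r: distance(reader, r))


# Hypothetical cluster: one block with replicas on dn1, dn3 and dn4.
dn1 = Node("dn1", "rackA")
dn2 = Node("dn2", "rackA")
dn3 = Node("dn3", "rackB")
dn4 = Node("dn4", "rackB")
replicas = [dn1, dn3, dn4]

for reader in (dn1, dn2, dn3):
    chosen = pick_replica(reader, replicas)
    print(reader.name, "reads from", chosen.name,
          "at distance", distance(reader, chosen))
```

In this toy model, readers on dn1 and dn3 each find a replica on their own
node (distance 0), while a reader on dn2 has no local copy and has to fetch
the block over the network from dn1 on the same rack (distance 1).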


If I need a block to be used on two datanodes, will it use the replica too,
or will it require network communication?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-creation-on-HDFS-tp3894.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
