Hi,

I have a question: assume an RDD is stored across multiple nodes and one of
the nodes fails, so a partition is lost. As I understand it, when that
partition is needed again, Spark uses the lineage information to recompute
that partition alone.
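To make my mental model concrete, here is a toy Python sketch (purely illustrative, not Spark code) of what I understand lineage-based recomputation to mean: a partition is a deterministic function of its source block plus the recorded chain of transformations, so if the partition is lost but the source block and lineage survive, only that partition needs recomputing.

```python
# Toy model of lineage-based recomputation (NOT Spark's implementation).
# "source_blocks" stands in for the original input blocks (e.g. in HDFS);
# "lineage" stands in for the ordered transformations that built the RDD.

source_blocks = {0: [1, 2, 3], 1: [4, 5, 6]}

# Hypothetical lineage: map(x * 2) followed by map(x + 1).
lineage = [lambda x: x * 2, lambda x: x + 1]

def compute_partition(part_id):
    """Derive one partition by replaying the lineage over its source block."""
    data = source_blocks[part_id]
    for transform in lineage:
        data = [transform(x) for x in data]
    return data

# Materialize all partitions, then simulate losing partition 1.
partitions = {i: compute_partition(i) for i in source_blocks}
del partitions[1]

# Because the source block and the lineage still exist,
# only the lost partition is recomputed; partition 0 is untouched.
partitions[1] = compute_partition(1)
```

My questions below are essentially about where `source_blocks` comes from in real Spark after a crash.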

1) How does Spark get the source data (the original data, before any
transformations were applied) if it was lost in the crash? Is it our
responsibility to restore the source data before the lineage can be used?
After all, only the lineage is stored on the other nodes.

2) Suppose the underlying HDFS uses a replication factor of 3. We know that
Spark doesn't replicate RDDs. When a partition is lost, can Spark read
another replica of the original data from HDFS and regenerate the required
partition using the lineage?

3) Does it make any difference to Spark whether HDFS replicates its blocks
more than once?

Can someone please enlighten me on these fundamentals?

Thank you




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-recomputing-lost-partition-of-rdd-tp21535.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
