Hi, I have a question. Assume that an RDD is stored across multiple nodes and one of the nodes fails, so a partition is lost. My understanding is that when the node comes back, it uses the lineage held by its neighbours and recomputes that lost partition alone.
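To make my mental model concrete, here is a toy sketch of what I think happens. This is NOT Spark's actual code; the names (`hdfs_blocks`, `lineage`, `compute_partition`) are invented for illustration, and it just models the idea that a lineage plus a surviving source replica is enough to rebuild one partition:

```python
# Toy model of RDD lineage recovery -- NOT Spark's real implementation.
# All names here are invented for illustration.

# "HDFS" source: each block is replicated, so a copy survives a node failure.
hdfs_blocks = {
    0: [1, 2, 3],   # block 0 (a replica survives on another datanode)
    1: [4, 5, 6],   # block 1
}

# Lineage: the chain of transformations that produced the RDD from the
# source, recorded once and replayable on any node.
lineage = [lambda x: x * 10, lambda x: x + 1]

def compute_partition(block_id):
    """Re-derive one partition by replaying the lineage over its source block."""
    data = hdfs_blocks[block_id]
    for f in lineage:
        data = [f(x) for x in data]
    return data

# Initially all partitions are materialized.
partitions = {i: compute_partition(i) for i in hdfs_blocks}

# A node crash loses partition 1 ...
del partitions[1]

# ... but the lineage plus a surviving HDFS replica lets us rebuild
# exactly that partition, without touching partition 0.
partitions[1] = compute_partition(1)
print(partitions[1])  # [41, 51, 61]
```

My questions below are essentially about whether the `hdfs_blocks` part of this picture (a surviving copy of the source data) is really available to Spark, and whose job it is to provide it.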
1) How does Spark get the source data (the original data, before any transformations were applied) if it was lost in the crash? Is it our responsibility to restore the source data before the lineage can be used? Only the lineage is stored on the other nodes.

2) Suppose the underlying HDFS uses a replication factor of 3. We know that Spark does not replicate RDDs. When a partition is lost, can Spark use a surviving replica of the original data stored in HDFS and regenerate the required partition using the lineage from the other nodes?

3) Does it make any difference to Spark if HDFS replicates its blocks more than once?

Can someone please enlighten me on these fundamentals? Thank you.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-recomputing-lost-partition-of-rdd-tp21535.html Sent from the Apache Spark User List mailing list archive at Nabble.com.