I think there are a number of misconceptions here. It is not necessary
for the original node to come back in order to recreate the lost
partition, and the lineage is not retrieved from neighboring nodes. The
source data is retrieved in the same way it was the first time the
partition was computed. The caller does not need to do anything; Spark
does the recomputation. The point is that the creation of the partition
is deterministic and so can be replayed anywhere.

Spark *can* replicate RDDs, optionally. Resilience of data stored on
HDFS is up to HDFS and is transparent to Spark. Spark will use data
locality information to try to schedule work next to the data,
regardless of the replication factor. More replication potentially
gives the scheduler more placement options, I suppose, since the data
is available on more nodes.
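
For concreteness, here is a minimal sketch (the HDFS path and app name
are made up) of the idea: the partitions of `counts` are defined
entirely by a deterministic chain of transformations over the input, so
a lost partition can be rebuilt anywhere by re-reading the needed input
blocks and replaying the chain; replication of the RDD itself is opt-in
via persist():

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch"))

// Source data lives in HDFS; block replication there is HDFS's concern.
val lines = sc.textFile("hdfs:///data/input.txt")

// Deterministic transformations: if a partition of `counts` is lost,
// Spark re-reads the required input blocks and replays these steps.
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)

// Optional replication of the RDD itself: keep two in-memory copies of
// each partition, so a lost copy can be served from the replica instead
// of being recomputed.
counts.persist(StorageLevel.MEMORY_ONLY_2)

counts.count()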

On Fri, Feb 6, 2015 at 9:47 AM, Kartheek.R <kartheek.m...@gmail.com> wrote:
> Hi,
>
> I have this doubt: Assume that an rdd is stored across multiple nodes and
> one of the nodes fails. So, a partition is lost. Now, I know that when this
> node is back, it uses the lineage from its neighbours and recomputes that
> partition alone.
>
> 1) How does it get the source data (original data before applying any
> transformations) that is lost during the crash? Is it our responsibility to
> get back the source data before using the lineage? We have only lineage
> stored on other nodes.
>
> 2) Suppose the underlying HDFS deploys replication factor = 3. We know that
> Spark doesn't replicate RDDs. When a partition is lost, is there a
> possibility to use the second copy of the original data stored in HDFS and
> generate the required partition using lineage from other nodes?
>
> 3) Does it make any difference to Spark if HDFS replicates its blocks more
> than once?
>
> Can someone please enlighten me on these fundamentals?
>
> Thank you
>
