[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135841#comment-14135841 ]

Hari Shreedharan commented on SPARK-3129:
-----------------------------------------

As long as at least one executor holding each block is available when the 
driver comes back up - that is pretty much it. Since each block is replicated, 
data is lost only if all 3 executors holding a block fail.

A WAL would be necessary to recover data that has not been pushed as blocks 
yet (look at the store(Any) method). Adding a persistent WAL is going to hurt 
performance, especially if the WAL has to be durable (you'd need to do an 
hflush if the WAL is on HDFS, or its equivalent on any other system). So you'd 
be paying the cost of persisting the data every time a block is created, 
whereas in this case you pay only at startup and on driver restarts. Even the 
amount of data transferred is very small, since it is just metadata. If the 
WAL is not durable, then there is no guarantee it would be recoverable. If the 
WAL is somehow local to each executor, you'd still have to send all the block 
info to the driver when it comes back up.
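To make the durability cost concrete, here is a toy write-ahead log (not Spark's code - a generic Python sketch with hypothetical names) where every append is flushed and fsync'd, the local-filesystem analogue of the per-record hflush cost described above:

```python
import os
import tempfile

class DurableWAL:
    """Toy WAL: every append is flushed and fsync'd, so each record
    pays a synchronous-durability cost before store() can return."""

    def __init__(self, path):
        self._f = open(path, "a+b")

    def append(self, record: bytes):
        # length-prefixed record, forced to stable storage immediately
        self._f.write(len(record).to_bytes(4, "big") + record)
        self._f.flush()
        os.fsync(self._f.fileno())  # the expensive part, paid per block

    def replay(self):
        # on recovery, re-read every record written before the crash
        self._f.seek(0)
        while True:
            header = self._f.read(4)
            if len(header) < 4:
                break
            yield self._f.read(int.from_bytes(header, "big"))

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = DurableWAL(path)
wal.append(b"block-1")
wal.append(b"block-2")
records = [r.decode() for r in wal.replay()]
print(records)  # → ['block-1', 'block-2']
```

The fsync on every append is what makes the WAL recoverable, and it is exactly the cost the block-metadata approach avoids by paying only at driver restart.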

TD and I had discussed the WAL approach and felt it was actually more complex 
and might affect performance more than this one. In this case, all the 
building blocks are already there (since we already know how to get block 
infos from executors which hold on to the blocks). We just need to add Akka 
messages to ask the executors to re-send block metadata.
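The recovery flow above can be sketched as follows - a plain-Python simulation (the Executor, Driver, and report_blocks names are illustrative, not Spark's actual Akka messages) of the restarted driver asking each live executor for its block metadata and rebuilding the block-location map:

```python
class Executor:
    """Stand-in for an executor holding replicated blocks."""

    def __init__(self, executor_id, blocks):
        self.executor_id = executor_id
        self.blocks = blocks  # block_id -> metadata

    def report_blocks(self):
        # answers the driver's "re-send block metadata" request
        return dict(self.blocks)

class Driver:
    """Stand-in for a restarted driver with no block state."""

    def __init__(self):
        self.block_locations = {}  # block_id -> set of executor ids

    def recover(self, executors):
        # only metadata crosses the wire, and only at restart
        for ex in executors:
            for block_id in ex.report_blocks():
                self.block_locations.setdefault(block_id, set()).add(ex.executor_id)
        return self.block_locations

# each block replicated on 2 executors; any one survivor per block suffices
e1 = Executor("exec-1", {"b1": "meta", "b2": "meta"})
e2 = Executor("exec-2", {"b2": "meta", "b3": "meta"})
e3 = Executor("exec-3", {"b1": "meta", "b3": "meta"})
locations = Driver().recover([e1, e2, e3])
print(sorted(locations))  # → ['b1', 'b2', 'b3']
```

Dropping any single executor from the recover() call still leaves every block locatable, which is the replication argument from the first paragraph.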

> Prevent data loss in Spark Streaming
> ------------------------------------
>
>                 Key: SPARK-3129
>                 URL: https://issues.apache.org/jira/browse/SPARK-3129
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>         Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf
>
>
> Spark Streaming can lose small amounts of data when the driver goes down - 
> and the sending system cannot re-send the data (or the data has already 
> expired on the sender side). The document attached has more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
