[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14135841#comment-14135841 ]
Hari Shreedharan commented on SPARK-3129: ----------------------------------------- As long as at least one executor containing every block needs to be available when the driver comes back up - that is pretty much it. Since each block is replicated unless all 3 executors holding a block fails, it will not lose data. A WAL would be necessary to recover data which have not been pushed as blocks yet (look at the store(Any) method). Adding a persistent WAL is going to hit performance, especially if the WAL has to be durable (you'd need to do an hflush if the WAL is on HDFS, or its equivalent on any other system). So you'd be paying for persisting the data when each block is created, whereas in this case, you are paying only at startup and driver restarts. Even the amount of data transferred is very less, since it is just metadata. If the WAL is not durable, then there is no guarantee it would be recoverable. If the WAL is local to each executor somehow, you'd still have to send all the block info to the driver when it comes back up. TD and I had discussed the WAL approach and felt it is actually more complex and might affect performance more than this one. In this case, all the building blocks are already there (since we already know how to get block infos from executors which hold on to the blocks). We just need to add Akka messages to ask the executors to re-send block metadata. > Prevent data loss in Spark Streaming > ------------------------------------ > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature > Reporter: Hari Shreedharan > Assignee: Hari Shreedharan > Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf > > > Spark Streaming can small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org