[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138032#comment-14138032 ]
Hari Shreedharan commented on SPARK-3129:
-----------------------------------------

Thanks, Matei, for the background. I had considered some of the factors (like executors always talking to the latest ones), but I was not aware of the distinct RDD ids etc. TD and I discussed this offline and we agreed that the WAL would probably be the best way to go.

I am planning to do some benchmarking of appending data to a 5-node HDFS cluster on EC2 today. Considering that HBase does use a WAL on HDFS, my expectation is that the perf should be reasonable. I will post the benchmark application on GitHub and post a link here. I will then run the application, see how it goes, and post that here as well.

> Prevent data loss in Spark Streaming
> ------------------------------------
>
>                 Key: SPARK-3129
>                 URL: https://issues.apache.org/jira/browse/SPARK-3129
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>         Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf
>
>
> Spark Streaming can lose small amounts of data when the driver goes down - and the
> sending system cannot re-send the data (or the data has already expired on
> the sender side). The document attached has more details.