[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138032#comment-14138032
 ] 

Hari Shreedharan commented on SPARK-3129:
-----------------------------------------

Thanks Matei for the background. I had considered some of the factors (like 
executors always talking to the latest ones) - but I was not aware of the 
distinct RDD ids etc. 

TD and I discussed this offline and we agreed that the WAL would probably be 
the best way to go. I am planning to do some benchmarking of appending data to 
a 5-node HDFS cluster on EC2 today. Considering that HBase does use a WAL on 
HDFS, my expectation is that the perf should be reasonable.

I will post the benchmark application on GitHub and link to it here. Once I 
have run it, I will post the results here as well.

> Prevent data loss in Spark Streaming
> ------------------------------------
>
>                 Key: SPARK-3129
>                 URL: https://issues.apache.org/jira/browse/SPARK-3129
>             Project: Spark
>          Issue Type: New Feature
>          Components: Streaming
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>         Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf
>
>
> Spark Streaming can lose small amounts of data when the driver goes down - and the 
> sending system cannot re-send the data (or the data has already expired on 
> the sender side). The document attached has more details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)