[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105086#comment-14105086
 ] 

Saisai Shao commented on SPARK-3129:
------------------------------------

Hi Hari, I have some high-level questions about this:

1. The design doc says: "Once the RDD is generated, the RDD is checkpointed to 
HDFS - at which point it is fully recoverable". Do you checkpoint only the 
RDD's metadata, or the data as well? I think checkpointing the RDD on every 
batch is rather expensive if the batch duration is quite short.
2. If we keep the executors alive when the driver dies, do the receivers keep 
receiving data from the external source? If so, I see two potential problems. 
First, memory usage will keep growing, since no data is being consumed. 
Second, when the driver comes back, how do we balance processing priority? 
The old data has to be processed first, which delays the processing of newly 
arriving data and causes problems if the resulting latency exceeds the batch 
duration.
3. In some scenarios we need to combine a DStream with an RDD (for example, 
joining real-time data with a historical log). Normally that RDD is cached in 
BlockManager (BM) memory, so to recover the processing I think we also need to 
recover this RDD's metadata, not only the streaming data.
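
To make point 2 a bit more concrete, here is a rough back-of-the-envelope 
sketch. It is a toy model in plain Python, not Spark code: the function name, 
its parameters and all the numbers are assumptions I made up for illustration. 
It assumes the receivers keep buffering during the outage and that batches are 
processed strictly in arrival order once the driver is back.

```python
# Toy model, not Spark code: every name and number here is a made-up assumption.

def simulate_recovery(batch_interval, process_time, downtime,
                      new_batches, ingest_mb_per_s):
    """Model receivers that keep buffering while the driver is down.

    All times are in seconds. The driver restarts at t = 0, one batch is cut
    per batch_interval, and batches are processed strictly in arrival order.
    Returns (buffered_mb, backlog_batches, delays), where delays[i] is how
    long the (i+1)-th batch that arrives after the restart waits to start.
    """
    # Data piled up in receiver memory during the outage.
    buffered_mb = ingest_mb_per_s * downtime
    # Batches queued up during the outage that must run first.
    backlog_batches = downtime // batch_interval
    # Time at which the old backlog has been fully processed.
    clock = backlog_batches * process_time

    delays = []
    for k in range(1, new_batches + 1):
        arrival = k * batch_interval
        start = max(clock, arrival)  # wait for earlier batches to finish
        delays.append(start - arrival)
        clock = start + process_time
    return buffered_mb, backlog_batches, delays


if __name__ == "__main__":
    # 2 s batches, 60 s outage, 5 MB/s ingest (hypothetical values).
    for process_time in (1, 2):
        mb, backlog, delays = simulate_recovery(2, process_time, 60, 40, 5)
        print(process_time, mb, backlog, delays[0], delays[-1])
```

In this toy model the delay of new batches only shrinks when a batch is 
processed faster than the batch interval. If processing takes the whole 
interval, the delay caused by the backlog never goes away.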

There are probably many other details to think through, since driver HA is 
quite complex. Please correct me if I have misunderstood anything. Thanks a 
lot.


> Prevent data loss in Spark Streaming
> ------------------------------------
>
>                 Key: SPARK-3129
>                 URL: https://issues.apache.org/jira/browse/SPARK-3129
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>         Attachments: StreamingPreventDataLoss.pdf
>
>
> Spark Streaming can lose small amounts of data when the driver goes down - and the 
> sending system cannot re-send the data (or the data has already expired on 
> the sender side). The document attached has more details.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
