[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105086#comment-14105086 ]
Saisai Shao commented on SPARK-3129:
------------------------------------

Hi Hari,

I have some high-level questions about this:

1. In the design doc, you mention "Once the RDD is generated, the RDD is checkpointed to HDFS - at which point it is fully recoverable". Do you checkpoint only the RDD's metadata, or the data as well? I think checkpointing an RDD in every batch is rather expensive when the batch duration is short.

2. If we keep executors alive when the driver dies, do the receivers still need to keep receiving data from the external source? If so, I see two potential problems. First, memory usage will keep growing, since no data is being consumed. Second, when the driver comes back, how do we balance processing priority? The old data needs to be processed first, which delays the processing of newly arriving data and causes problems if the resulting latency exceeds the batch duration.

3. In some scenarios we need to combine a DStream with an RDD (for example, joining real-time data with a history log). Normally that RDD is cached in the BlockManager's (BM's) memory, so I think we need to recover this RDD's metadata as well, not only the streaming data, if we want to resume processing.

There are probably many other details to think about, because driver HA is quite complex. Please correct me if I have misunderstood something.

Thanks a lot.

> Prevent data loss in Spark Streaming
> ------------------------------------
>
>                 Key: SPARK-3129
>                 URL: https://issues.apache.org/jira/browse/SPARK-3129
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>         Attachments: StreamingPreventDataLoss.pdf
>
>
> Spark Streaming can lose small amounts of data when the driver goes down - and the
> sending system cannot re-send the data (or the data has already expired on
> the sender side). The document attached has more details.