[ 
https://issues.apache.org/jira/browse/SPARK-38605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509394#comment-17509394
 ] 

Jungtaek Lim commented on SPARK-38605:
--------------------------------------

I don't have a strong opinion on this. Being resilient to a single failure 
sounds great, but we still need to ensure that the behavior remains atomic 
across multiple attempts. The number of attempts and the proper interval 
between them are things we need to think through. (Even if we make them 
configurable, reasonable default values are needed.)
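
To make the two knobs concrete, here is a rough sketch of a bounded retry 
with a fixed interval. It is written in Python only for brevity and is not 
Spark code: `retry_file_op`, `max_attempts`, and `interval_sec` are 
hypothetical names, and the defaults (3 attempts, 0.1 s) are placeholders 
rather than proposed values. It assumes the wrapped operation is safe to 
repeat, which is exactly the atomicity concern above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_file_op(
    op: Callable[[], T],
    max_attempts: int = 3,        # hypothetical default, not a Spark config
    interval_sec: float = 0.1,    # hypothetical default, not a Spark config
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `op`, retrying on OSError up to `max_attempts` times in total.

    Assumes `op` is idempotent: a failed attempt must not leave partial
    state that a later attempt could observe.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except OSError:
            # Out of attempts: surface the last failure to the caller.
            if attempt == max_attempts:
                raise
            # Wait a fixed interval before the next attempt.
            sleep(interval_sec)
    raise AssertionError("unreachable")
```

A caller would wrap a single read-style operation, for example 
`retry_file_op(lambda: open(path, "rb").read())`. Whether write-style 
operations can be wrapped the same way depends on whether they stay atomic 
when repeated.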

> Retrying on file manager operation in HDFSMetadataLog
> -----------------------------------------------------
>
>                 Key: SPARK-38605
>                 URL: https://issues.apache.org/jira/browse/SPARK-38605
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: L. C. Hsieh
>            Priority: Major
>
> Currently HDFSMetadataLog uses CheckpointFileManager to perform file 
> operations such as opening a metadata file. These operations are easily 
> affected by network blips, which cause the streaming query to fail. Although 
> we can restart the streaming query, recovery takes more time.
> Such file operations should be made resilient to these situations by retrying.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

