[ https://issues.apache.org/jira/browse/SPARK-40039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578189#comment-17578189 ]
Attila Zsolt Piros commented on SPARK-40039:
--------------------------------------------

I am working on this.

> Introducing checkpoint file manager based on Hadoop's Abortable interface
> -------------------------------------------------------------------------
>
>                 Key: SPARK-40039
>                 URL: https://issues.apache.org/jira/browse/SPARK-40039
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Attila Zsolt Piros
>            Assignee: Attila Zsolt Piros
>            Priority: Major
>
> Currently on S3 the checkpoint file manager (called
> FileContextBasedCheckpointFileManager) is based on rename. So when a file is
> opened for an atomic stream, a temporary file is used instead, and when the
> stream is committed the temporary file is renamed.
> But on S3 a rename is a file copy, so it has serious performance
> implications.
> Hadoop 3, however, introduces a new interface called *Abortable*, and
> *S3AFileSystem* has this capability, which is implemented on top of S3's
> multipart upload. So when the file is committed a POST is sent
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html])
> and when it is aborted a DELETE is sent
> ([https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html])
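
To make the difference between the two commit strategies described in the issue concrete, here is a minimal toy model in Java. It is only a sketch under stated assumptions: `ToyStore`, `AtomicStream`, `RenameBasedStream` and `AbortableStream` are hypothetical names invented for illustration, not Spark or Hadoop classes, and the in-memory map merely imitates an object store where a rename costs a full copy. It does not call S3 or the real *Abortable* interface.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are Spark or Hadoop classes.
final class CheckpointSketch {

    // In-memory stand-in for an object store where rename is copy + delete.
    static final class ToyStore {
        final Map<String, byte[]> objects = new HashMap<>();
        long bytesCopied = 0; // bytes moved because of rename-as-copy

        void put(String key, byte[] data) {
            objects.put(key, data.clone());
        }

        void delete(String key) {
            objects.remove(key);
        }

        void rename(String from, String to) {
            byte[] data = objects.get(from);
            if (data == null) {
                throw new IllegalStateException("no such object: " + from);
            }
            bytesCopied += data.length;    // the whole object is copied
            objects.put(to, data.clone());
            objects.remove(from);
        }
    }

    // Hypothetical atomic stream: close() commits, cancel() discards.
    interface AtomicStream {
        void write(byte[] data);
        void close();
        void cancel();
    }

    // Strategy 1: write to a temporary key, rename on commit.
    static final class RenameBasedStream implements AtomicStream {
        private final ToyStore store;
        private final String finalKey;
        private final String tempKey;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        RenameBasedStream(ToyStore store, String finalKey) {
            this.store = store;
            this.finalKey = finalKey;
            this.tempKey = finalKey + ".tmp";
            store.put(tempKey, new byte[0]); // temporary file exists from the start
        }

        public void write(byte[] data) {
            buffer.write(data, 0, data.length);
            store.put(tempKey, buffer.toByteArray());
        }

        public void close() {
            store.rename(tempKey, finalKey);
        }

        public void cancel() {
            store.delete(tempKey);
        }
    }

    // Strategy 2: keep data in a pending upload, then complete or abort it.
    static final class AbortableStream implements AtomicStream {
        private final ToyStore store;
        private final String finalKey;
        private final ByteArrayOutputStream pendingUpload = new ByteArrayOutputStream();
        private boolean aborted = false;

        AbortableStream(ToyStore store, String finalKey) {
            this.store = store;
            this.finalKey = finalKey;
        }

        public void write(byte[] data) {
            pendingUpload.write(data, 0, data.length);
        }

        public void close() {
            if (!aborted) {
                store.put(finalKey, pendingUpload.toByteArray()); // "complete upload"
            }
        }

        public void cancel() {
            aborted = true;        // "abort upload": nothing ever becomes visible
            pendingUpload.reset();
        }
    }

    public static void main(String[] args) {
        byte[] payload = "offset-log-entry".getBytes(StandardCharsets.UTF_8);

        ToyStore renameStore = new ToyStore();
        AtomicStream viaRename = new RenameBasedStream(renameStore, "offsets/0");
        viaRename.write(payload);
        viaRename.close();

        ToyStore abortableStore = new ToyStore();
        AtomicStream viaAbortable = new AbortableStream(abortableStore, "offsets/0");
        viaAbortable.write(payload);
        viaAbortable.close();

        System.out.println("rename-based bytes copied: " + renameStore.bytesCopied);
        System.out.println("abortable bytes copied:    " + abortableStore.bytesCopied);
    }
}
```

In this model the rename-based stream copies every committed byte a second time, while the abortable stream makes the object visible only on commit and leaves nothing behind on cancel. That is the property the issue hopes to gain from multipart upload.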