[jira] [Created] (SPARK-47008) Spark to support S3 Express One Zone Storage

Steve Loughran (Jira) Thu, 08 Feb 2024 06:46:05 -0800

Steve Loughran created SPARK-47008:
--------------------------------------

             Summary: Spark to support S3 Express One Zone Storage
                 Key: SPARK-47008
                 URL: https://issues.apache.org/jira/browse/SPARK-47008
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 3.5.1
            Reporter: Steve Loughran



Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.

Most of this is transparent. However, one aspect which can surface as an issue
is that these stores report prefixes in a listing when there are pending
uploads, *even when there are no files underneath*.

This leads to a situation where a listStatus of a path returns a list of file
status entries which appears to contain one or more directories, but a
listStatus on one of those directories raises a FileNotFoundException: there
is nothing there.

HADOOP-18996 handles this in all of the Hadoop code, including FileInputFormat.

A filesystem can now be probed for inconsistent directory listings through
{{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}.

If true, then treewalking code SHOULD NOT report a failure if, when walking 
into a subdirectory, a list/getFileStatus on that directory raises a 
FileNotFoundException.
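
As an illustration of the rule above, here is a minimal, self-contained sketch
of a tolerant tree walk. It does not use the Hadoop API: {{ListingStore}},
{{InMemoryStore}} and {{sampleStore}} are hypothetical stand-ins invented for
this example, and {{hasInconsistentListings()}} plays the role of the
hasPathCapability probe. The point is only the shape of the catch block: a
FileNotFoundException on a subdirectory is swallowed when, and only when, the
store declares that its listings may be inconsistent.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TolerantTreeWalk {

  /** Hypothetical stand-in for a filesystem client; not a Hadoop interface. */
  interface ListingStore {
    /** Stand-in for the hasPathCapability() probe described above. */
    boolean hasInconsistentListings();

    /** Returns child paths; a trailing "/" marks a directory entry. */
    List<String> list(String dir) throws FileNotFoundException;
  }

  /** In-memory store: a directory that was never registered fails to list. */
  static final class InMemoryStore implements ListingStore {
    private final Map<String, List<String>> dirs = new TreeMap<>();
    private final boolean inconsistent;

    InMemoryStore(boolean inconsistent) {
      this.inconsistent = inconsistent;
    }

    InMemoryStore dir(String path, String... children) {
      dirs.put(path, List.of(children));
      return this;
    }

    @Override
    public boolean hasInconsistentListings() {
      return inconsistent;
    }

    @Override
    public List<String> list(String dir) throws FileNotFoundException {
      List<String> children = dirs.get(dir);
      if (children == null) {
        throw new FileNotFoundException(dir);
      }
      return children;
    }
  }

  /** Collects all files under dir, skipping phantom directories if allowed. */
  static List<String> listLeafFiles(ListingStore store, String dir)
      throws FileNotFoundException {
    List<String> files = new ArrayList<>();
    for (String child : store.list(dir)) {
      if (!child.endsWith("/")) {
        files.add(child);
        continue;
      }
      try {
        files.addAll(listLeafFiles(store, child));
      } catch (FileNotFoundException e) {
        // Swallow the failure only when the store declares that its
        // listings may contain directories which do not exist.
        if (!store.hasInconsistentListings()) {
          throw e;
        }
      }
    }
    return files;
  }

  /** Builds a tree whose root listing names a directory that cannot be listed. */
  static InMemoryStore sampleStore(boolean inconsistent) {
    return new InMemoryStore(inconsistent)
        .dir("/data/", "/data/a.orc", "/data/pending/", "/data/part/")
        .dir("/data/part/", "/data/part/b.orc");
  }

  public static void main(String[] args) throws FileNotFoundException {
    // prints [/data/a.orc, /data/part/b.orc]
    System.out.println(listLeafFiles(sampleStore(true), "/data/"));
  }
}
```

With {{sampleStore(false)}} the same walk fails with a FileNotFoundException
for /data/pending/, which is the behaviour a consistent store should keep. A
missing root directory is still reported as a failure in both cases.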

Although most of this is handled in the Hadoop code, there are some places
where treewalking is done inside Spark. These need to be identified and made
resilient to failure when recursing down the tree:

* SparkHadoopUtil list methods, especially listLeafStatuses used by OrcFileOperator
* org.apache.spark.util.Utils#fetchHcfsFile

{{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist
here, or the logic can be replicated. Using the Hadoop implementation would be
better from a maintenance perspective.
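
If the logic were replicated rather than reused, it might look roughly like
the sketch below. This is an assumption-laden illustration, not the Hadoop
source: {{CapabilityProbe}} is a hypothetical stand-in for the filesystem, the
method signature is a guess based on the description in this issue, and the
choice to rethrow the original exception when the probe itself fails is an
assumption. The real method should be checked against the Hadoop javadoc
before relying on any of these details.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class MissingDirectoryPolicy {

  /** Hypothetical stand-in for the capability probe; not a Hadoop type. */
  interface CapabilityProbe {
    boolean hasPathCapability(String path, String capability) throws IOException;
  }

  /** Capability name quoted in this issue. */
  static final String DIRECTORY_LISTING_INCONSISTENT =
      "fs.capability.directory.listing.inconsistent";

  /**
   * Returns normally if the missing directory may be ignored;
   * otherwise rethrows the exception it was given.
   */
  static void maybeIgnoreMissingDirectory(
      CapabilityProbe fs, String path, FileNotFoundException e)
      throws FileNotFoundException {
    boolean inconsistent;
    try {
      inconsistent = fs.hasPathCapability(path, DIRECTORY_LISTING_INCONSISTENT);
    } catch (IOException probeFailure) {
      // Assumption: if the probe fails, the original failure stays primary.
      e.addSuppressed(probeFailure);
      throw e;
    }
    if (!inconsistent) {
      throw e;
    }
    // Otherwise the directory was a phantom entry: nothing to do.
  }

  public static void main(String[] args) {
    FileNotFoundException missing = new FileNotFoundException("/data/pending");
    CapabilityProbe inconsistentStore = (path, capability) -> true;
    CapabilityProbe consistentStore = (path, capability) -> false;
    try {
      maybeIgnoreMissingDirectory(inconsistentStore, "/data/pending", missing);
      System.out.println("ignored on inconsistent store");
      maybeIgnoreMissingDirectory(consistentStore, "/data/pending", missing);
    } catch (FileNotFoundException e) {
      System.out.println("rethrown on consistent store: " + e.getMessage());
    }
  }
}
```

A caller would invoke this from the catch block around the recursive listing
call, so that the decision to ignore or propagate lives in one place.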




