[ 
https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47008:
----------------------------------
    Affects Version/s: 4.0.0
                           (was: 3.5.1)

> Spark to support S3 Express One Zone Storage
> --------------------------------------------
>
>                 Key: SPARK-47008
>                 URL: https://issues.apache.org/jira/browse/SPARK-47008
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Steve Loughran
>            Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an 
> issue is that these stores report prefixes in a listing when there are 
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of file 
> status entries which appears to contain one or more directories, but a 
> listStatus on one of those directories raises a FileNotFoundException: there 
> is nothing there.
> HADOOP-18996 handles this in all of the hadoop code, including FileInputFormat.
> A filesystem can now be probed for inconsistent directory listings through 
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}
> If true, then treewalking code SHOULD NOT report a failure if, when walking 
> into a subdirectory, a list/getFileStatus on that directory raises a 
> FileNotFoundException.
> Although most of this is handled in the hadoop code, there are some places 
> where treewalking is done inside spark. These need to be identified and made 
> resilient to failure on the recursion down the tree:
> * SparkHadoopUtil list methods,
> * especially listLeafStatuses used by OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist 
> here, or the logic can be replicated. Using the hadoop implementation would 
> be better from a maintenance perspective.
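
The recursion rule described in the issue can be sketched as a small self-contained model. This is not Spark or Hadoop code: the names `Store`, `MapStore`, `demoStore` and `listLeafFiles` are hypothetical, and `listingMayBeInconsistent()` only stands in for the `hasPathCapability` probe quoted above. The sketch assumes that a missing subdirectory is skipped only when the store declares inconsistent listings, and that a missing root still fails.

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TolerantTreeWalk {

    /** Hypothetical stand-in for the few filesystem operations the walk needs. */
    interface Store {
        /** Models hasPathCapability(path, "fs.capability.directory.listing.inconsistent"). */
        boolean listingMayBeInconsistent();

        List<String> list(String dir) throws FileNotFoundException;

        boolean isDirectory(String path);
    }

    /** In-memory store: phantom entries look like directories but cannot be listed. */
    static final class MapStore implements Store {
        private final Map<String, List<String>> dirs;
        private final Set<String> phantoms;
        private final boolean inconsistent;

        MapStore(Map<String, List<String>> dirs, Set<String> phantoms, boolean inconsistent) {
            this.dirs = dirs;
            this.phantoms = phantoms;
            this.inconsistent = inconsistent;
        }

        @Override
        public boolean listingMayBeInconsistent() {
            return inconsistent;
        }

        @Override
        public List<String> list(String dir) throws FileNotFoundException {
            List<String> children = dirs.get(dir);
            if (children == null) {
                throw new FileNotFoundException(dir);
            }
            return children;
        }

        @Override
        public boolean isDirectory(String path) {
            return dirs.containsKey(path) || phantoms.contains(path);
        }
    }

    /** Collects leaf files under dir, skipping phantom subdirectories when permitted. */
    static List<String> listLeafFiles(Store store, String dir) throws FileNotFoundException {
        List<String> leaves = new ArrayList<>();
        // A missing root is not tolerated: this call may throw to the caller.
        for (String child : store.list(dir)) {
            if (!store.isDirectory(child)) {
                leaves.add(child);
                continue;
            }
            try {
                leaves.addAll(listLeafFiles(store, child));
            } catch (FileNotFoundException e) {
                // Only swallow the failure when the store declares inconsistent listings.
                if (!store.listingMayBeInconsistent()) {
                    throw e;
                }
            }
        }
        return leaves;
    }

    /** Demo tree in which "/data/pending" is listed as a directory but does not exist. */
    static Store demoStore(boolean inconsistent) {
        return new MapStore(
            Map.of(
                "/data", List.of("/data/a.orc", "/data/part", "/data/pending"),
                "/data/part", List.of("/data/part/b.orc")),
            Set.of("/data/pending"),
            inconsistent);
    }

    public static void main(String[] args) throws FileNotFoundException {
        System.out.println(listLeafFiles(demoStore(true), "/data"));
    }
}
```

With `demoStore(false)` the same walk fails with a FileNotFoundException on `/data/pending`, which is the failure mode the issue asks the Spark treewalking code to avoid on stores that report inconsistent listings.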



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

