[ https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-47008:
----------------------------------
    Affects Version/s: 4.0.0
                       (was: 3.5.1)

> Spark to support S3 Express One Zone Storage
> --------------------------------------------
>
>                 Key: SPARK-47008
>                 URL: https://issues.apache.org/jira/browse/SPARK-47008
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Steve Loughran
>            Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an
> issue is that these stores report prefixes in a listing when there are
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of
> file status entries which appears to contain one or more directories, but a
> listStatus on such a path raises a FileNotFoundException: there is nothing
> there.
> HADOOP-18996 handles this in all of the Hadoop code, including
> FileInputFormat. A filesystem can now be probed for inconsistent directory
> listings through
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}.
> If true, then treewalking code SHOULD NOT report a failure if, when walking
> into a subdirectory, a list/getFileStatus on that directory raises a
> FileNotFoundException.
> Although most of this is handled in the Hadoop code, there are some places
> where treewalking is done inside Spark. These need to be identified and
> made resilient to failure on the recurse down the tree:
> * SparkHadoopUtil list methods, especially listLeafStatuses used by
> OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist
> here, or the logic can be replicated.
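A minimal Scala sketch of the tolerant treewalk the description calls for, assuming Hadoop 3.4.0 on the classpath. The object and method names below are illustrative only (the method name mirrors the SparkHadoopUtil listLeafStatuses mentioned above, but this is not Spark's actual implementation):

```scala
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object TolerantTreeWalk {
  // Capability string from HADOOP-18996.
  val DirectoryListingInconsistent = "fs.capability.directory.listing.inconsistent"

  /**
   * Recursively collect leaf file statuses. If the store admits to
   * inconsistent directory listings (e.g. S3 Express One Zone reporting
   * prefixes for pending uploads), a FileNotFoundException raised while
   * recursing into a listed "directory" is treated as an empty directory
   * rather than a failure; otherwise it propagates as before.
   */
  def listLeafStatuses(fs: FileSystem, root: Path): Seq[FileStatus] = {
    val tolerant = fs.hasPathCapability(root, DirectoryListingInconsistent)

    def recurse(p: Path): Seq[FileStatus] = {
      val entries =
        try fs.listStatus(p).toSeq
        catch {
          // Guard: only swallow the FNFE when the store is known to be
          // inconsistent; otherwise the exception falls through unchanged.
          case _: FileNotFoundException if tolerant => Seq.empty
        }
      entries.flatMap { st =>
        if (st.isDirectory) recurse(st.getPath) else Seq(st)
      }
    }

    recurse(root)
  }
}
```

As the report notes, delegating to {{FileUtil.maybeIgnoreMissingDirectory()}} instead of replicating this catch-and-skip logic would keep Spark aligned with future Hadoop fixes.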
> Using the Hadoop implementation would be better from a maintenance
> perspective.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org