[ https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844968#comment-17844968 ]
Leo Timofeyev edited comment on SPARK-47008 at 5/12/24 9:07 PM:
----------------------------------------------------------------

Hey [~ste...@apache.org] and [~dongjoon]

What do you think about something like this?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = {
  def recurse(status: FileStatus): Seq[FileStatus] = {
    // True for stores that may list a directory which has nothing underneath it.
    val fsHasPathCapability =
      fs.hasPathCapability(status.getPath, SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
    val statusResult = Try {
      fs.listStatus(status.getPath)
    }
    statusResult match {
      case Failure(e) =>
        // Ignore a missing directory only when the store declares inconsistent listings.
        if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
          Seq.empty[FileStatus]
        } else throw e
      case Success(sr) =>
        val (directories, leaves) = sr.partition(_.isDirectory)
        (leaves ++ directories.flatMap(f => listLeafStatuses(fs, f))).toImmutableArraySeq
    }
  }
  if (baseStatus.isDirectory) recurse(baseStatus) else Seq(baseStatus)
}{code}
I have a couple of unit tests for these changes. Should I go ahead with a PR?


was (Author: JIRAUSER303957):
Hey [~ste...@apache.org]

What do you think about something like this?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = {
  def recurse(status: FileStatus): Seq[FileStatus] = {
    // True for stores that may list a directory which has nothing underneath it.
    val fsHasPathCapability =
      fs.hasPathCapability(status.getPath, SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
    val statusResult = Try {
      fs.listStatus(status.getPath)
    }
    statusResult match {
      case Failure(e) =>
        // Ignore a missing directory only when the store declares inconsistent listings.
        if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
          Seq.empty[FileStatus]
        } else throw e
      case Success(sr) =>
        val (directories, leaves) = sr.partition(_.isDirectory)
        (leaves ++ directories.flatMap(f => listLeafStatuses(fs, f))).toImmutableArraySeq
    }
  }
  if (baseStatus.isDirectory) recurse(baseStatus) else Seq(baseStatus)
}{code}
I have a couple of unit tests for these changes.

[~dongjoon] Should I go ahead with a PR?

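For anyone who wants to try the tolerant tree walk without a real S3 Express bucket, here is a minimal, self-contained sketch of the same idea. It is not Spark or Hadoop code: {{Entry}}, {{Store}}, {{FakeStore}} and {{listLeaves}} are hypothetical stand-ins for {{FileStatus}}, {{FileSystem}} and {{listLeafStatuses}}, and the {{listingInconsistent}} flag is assumed to play the role of the path capability probe. A directory that appears in a listing but is missing from the map behaves like a prefix reported only because of a pending upload.

```scala
import java.io.FileNotFoundException
import scala.util.{Failure, Success, Try}

object TolerantWalkSketch {
  // Hypothetical stand-in for a Hadoop FileStatus: a path plus a directory flag.
  final case class Entry(path: String, isDirectory: Boolean)

  // Hypothetical stand-in for the two FileSystem operations the walk needs.
  trait Store {
    // Assumed to mirror the "directory listing inconsistent" capability probe.
    def listingInconsistent: Boolean
    // Throws FileNotFoundException when nothing exists at `dir`.
    def list(dir: String): Seq[Entry]
  }

  // In-memory store: a directory that is listed by its parent but has no key
  // of its own in `children` models a prefix with nothing underneath it.
  final class FakeStore(
      children: Map[String, Seq[Entry]],
      val listingInconsistent: Boolean) extends Store {
    def list(dir: String): Seq[Entry] =
      children.getOrElse(dir, throw new FileNotFoundException(dir))
  }

  // Collect all non-directory entries under `base`, skipping missing
  // directories only when the store declares inconsistent listings.
  def listLeaves(store: Store, base: Entry): Seq[Entry] =
    if (!base.isDirectory) Seq(base)
    else Try(store.list(base.path)) match {
      case Failure(_: FileNotFoundException) if store.listingInconsistent =>
        Seq.empty
      case Failure(e) =>
        throw e
      case Success(entries) =>
        val (directories, leaves) = entries.partition(_.isDirectory)
        leaves ++ directories.flatMap(d => listLeaves(store, d))
    }
}
```

With the flag set, a phantom subdirectory simply contributes no leaves; with the flag unset, the FileNotFoundException still propagates, so ordinary filesystems keep their current failure behaviour.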
> Spark to support S3 Express One Zone Storage
> --------------------------------------------
>
>                 Key: SPARK-47008
>                 URL: https://issues.apache.org/jira/browse/SPARK-47008
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Steve Loughran
>            Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an
> issue is that these stores report prefixes in a listing when there are
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of file
> status entries which appears to contain one or more directories, but a
> listStatus on one of those directories raises a FileNotFoundException: there
> is nothing there.
> HADOOP-18996 handles this in all of the Hadoop code, including FileInputFormat.
> A filesystem can now be probed for inconsistent directory listings through
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}
> If true, then treewalking code SHOULD NOT report a failure if, when walking
> into a subdirectory, a list/getFileStatus on that directory raises a
> FileNotFoundException.
> Although most of this is handled in the Hadoop code, there are some places
> where treewalking is done inside Spark. These need to be identified and made
> resilient to failure on the recursion down the tree:
> * SparkHadoopUtil list methods, especially listLeafStatuses used by OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist
> here, or the logic can be replicated.
> Using the Hadoop implementation would be better from a maintenance perspective.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org