[jira] [Comment Edited] (SPARK-47008) Spark to support S3 Express One Zone Storage

Leo Timofeyev (Jira) Sun, 12 May 2024 14:08:05 -0700


    [ https://issues.apache.org/jira/browse/SPARK-47008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844968#comment-17844968 ]


Leo Timofeyev edited comment on SPARK-47008 at 5/12/24 9:07 PM:
----------------------------------------------------------------

Hey [~ste...@apache.org] and [~dongjoon] 

What do you think about something like this?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = {
  def recurse(status: FileStatus): Seq[FileStatus] = {
    val fsHasPathCapability =
      fs.hasPathCapability(status.getPath, SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
    val statusResult = Try {
      fs.listStatus(status.getPath)
    }
    statusResult match {
      case Failure(e) =>
        if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
          Seq.empty[FileStatus]
        } else {
          throw e
        }
      case Success(sr) =>
        val (directories, leaves) = sr.partition(_.isDirectory)
        (leaves ++ directories.flatMap(f => listLeafStatuses(fs, f))).toImmutableArraySeq
    }
  }

  if (baseStatus.isDirectory) recurse(baseStatus) else Seq(baseStatus)
}{code}
I have a couple of unit tests for this change. Should I go ahead with a PR?


was (Author: JIRAUSER303957):
Hey [~ste...@apache.org] 

What do you think about something like this?
{code:java}
def listLeafStatuses(fs: FileSystem, baseStatus: FileStatus): Seq[FileStatus] = {
  def recurse(status: FileStatus): Seq[FileStatus] = {
    val fsHasPathCapability =
      fs.hasPathCapability(status.getPath, SparkHadoopUtil.DIRECTORY_LISTING_INCONSISTENT)
    val statusResult = Try {
      fs.listStatus(status.getPath)
    }
    statusResult match {
      case Failure(e) =>
        if (e.isInstanceOf[FileNotFoundException] && fsHasPathCapability) {
          Seq.empty[FileStatus]
        } else {
          throw e
        }
      case Success(sr) =>
        val (directories, leaves) = sr.partition(_.isDirectory)
        (leaves ++ directories.flatMap(f => listLeafStatuses(fs, f))).toImmutableArraySeq
    }
  }

  if (baseStatus.isDirectory) recurse(baseStatus) else Seq(baseStatus)
}{code}
I have a couple of unit tests for this change. [~dongjoon] Should I go ahead 
with a PR?

> Spark to support S3 Express One Zone Storage
> --------------------------------------------
>
>                 Key: SPARK-47008
>                 URL: https://issues.apache.org/jira/browse/SPARK-47008
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Steve Loughran
>            Priority: Major
>
> Hadoop 3.4.0 adds support for AWS S3 Express One Zone Storage.
> Most of this is transparent. However, one aspect which can surface as an 
> issue is that these stores report prefixes in a listing when there are 
> pending uploads, *even when there are no files underneath*.
> This leads to a situation where a listStatus of a path returns a list of file 
> status entries which appears to contain one or more directories, but a 
> listStatus on one of those directories raises a FileNotFoundException: there 
> is nothing there.
> HADOOP-18996 handles this in all of the Hadoop code, including FileInputFormat.
> A filesystem can now be probed for inconsistent directory listings through 
> {{fs.hasPathCapability(path, "fs.capability.directory.listing.inconsistent")}}
> If true, then treewalking code SHOULD NOT report a failure if, when walking 
> into a subdirectory, a list/getFileStatus on that directory raises a 
> FileNotFoundException.
> Most of this is handled in the Hadoop code, but there are some places 
> where treewalking is done inside Spark. These need to be identified and made 
> resilient to failure on the recursion down the tree:
> * SparkHadoopUtil list methods, 
> * especially listLeafStatuses used by OrcFileOperator
> * org.apache.spark.util.Utils#fetchHcfsFile
> {{org.apache.hadoop.fs.FileUtil.maybeIgnoreMissingDirectory()}} can assist 
> here, or the logic can be replicated. Using the Hadoop implementation would 
> be better from a maintenance perspective.
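
For illustration only, the behaviour described in the issue can be modelled without Hadoop. The sketch below is in Scala, matching the snippet in the comment. `Entry`, `ToyFs`, `PhantomFs` and `TolerantWalk` are hypothetical stand-ins invented for this example, not Hadoop or Spark APIs, and `listingMayBeInconsistent` merely plays the role of the `hasPathCapability` probe. It is a minimal sketch under those assumptions, not the proposed patch.

```scala
import java.io.FileNotFoundException

// Hypothetical stand-in for Hadoop's FileStatus; not the real class.
final case class Entry(path: String, isDirectory: Boolean)

// Hypothetical stand-in for the two FileSystem calls the treewalk needs.
trait ToyFs {
  // Plays the role of
  // hasPathCapability(path, "fs.capability.directory.listing.inconsistent").
  def listingMayBeInconsistent(path: String): Boolean

  // Lists direct children; throws FileNotFoundException if nothing is there.
  def listStatus(path: String): Seq[Entry]
}

object TolerantWalk {
  // Returns every non-directory entry at or below `base`.
  def listLeaves(fs: ToyFs, base: Entry): Seq[Entry] = {
    if (!base.isDirectory) {
      Seq(base)
    } else {
      val children =
        try fs.listStatus(base.path)
        catch {
          // Swallow the miss only when the store declares that its listings
          // may be inconsistent; any other exception still propagates.
          case _: FileNotFoundException if fs.listingMayBeInconsistent(base.path) =>
            Seq.empty[Entry]
        }
      val (dirs, leaves) = children.partition(_.isDirectory)
      leaves ++ dirs.flatMap(d => listLeaves(fs, d))
    }
  }
}

// Toy store: listing "/t" reports a prefix "/t/pending" that has nothing
// underneath it, so listing "/t/pending" itself fails.
final class PhantomFs(inconsistent: Boolean) extends ToyFs {
  private val tree: Map[String, Seq[Entry]] = Map(
    "/t" -> Seq(
      Entry("/t/a.orc", isDirectory = false),
      Entry("/t/pending", isDirectory = true)
    )
  )

  def listingMayBeInconsistent(path: String): Boolean = inconsistent

  def listStatus(path: String): Seq[Entry] =
    tree.getOrElse(path, throw new FileNotFoundException(path))
}
```

With `new PhantomFs(inconsistent = true)`, a walk from `Entry("/t", isDirectory = true)` yields only `/t/a.orc`; with `inconsistent = false` the FileNotFoundException propagates as before. Catching only FileNotFoundException, and consulting the capability only after a miss, keeps the common path free of the extra probe. Like the snippet in the comment, this sketch also tolerates a missing base directory when the capability is reported.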



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
