[jira] [Commented] (HADOOP-18599) Expose `listStatus(Path path, String startFrom)` on `AzureBlobFileSystem`

Steve Loughran (Jira) Thu, 19 Jan 2023 09:06:07 -0800


    [ 
https://issues.apache.org/jira/browse/HADOOP-18599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678817#comment-17678817
 ]


Steve Loughran commented on HADOOP-18599:
-----------------------------------------

All public FS APIs need to go into hadoop-common, with:
* designs which can be implemented in other filesystems (e.g. s3)
* a strict specification to define that behaviour
* a set of contract tests derived from that specification to verify that all 
implementations match the spec
* ideally, implementations for more than one store, to show the design is flexible.
There is also an implicit commitment to maintain the API indefinitely, which you 
would probably expect even for an abfs-only change.

If this scares you off, it is with good reason: it's really hard to get this 
stuff in. If one were to be added, it should be a builder API and return a 
RemoteIterator<>.

Now, before you start on that: why don't you use listStatusIterator() instead? 
*The abfs implementation and the s3a one return the results a page at a time, while 
asynchronously prefetching the next page*. You only need to block for the first 
page of results and can then process it while the next one is retrieved for you.

Isn't that what you wanted?
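
For illustration only, here is a minimal JDK-only sketch of the "page at a time, prefetch the next page" pattern described above. It is not the abfs or s3a implementation; `PrefetchingListing` and `slice` are hypothetical names, and the in-memory list stands in for a remote store. Against Hadoop itself the caller's loop would instead be over the `RemoteIterator<FileStatus>` returned by `FileSystem.listStatusIterator(Path)`, whose `hasNext()` and `next()` can throw `IOException`.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/** Hypothetical model of a paged listing with prefetch; not Hadoop code. */
final class PrefetchingListing<T> implements Iterator<T> {
    // Fetches page N; an empty list means "no more pages" (an assumption of this sketch).
    private final IntFunction<List<T>> pageFetcher;
    private Iterator<T> current;
    private CompletableFuture<List<T>> next;
    private int nextPageIndex;

    PrefetchingListing(IntFunction<List<T>> pageFetcher) {
        this.pageFetcher = pageFetcher;
        // Block only for the first page...
        this.current = pageFetcher.apply(0).iterator();
        // ...and start fetching the second page in the background.
        this.nextPageIndex = 1;
        this.next = fetchAsync(nextPageIndex);
    }

    private CompletableFuture<List<T>> fetchAsync(int index) {
        return CompletableFuture.supplyAsync(() -> pageFetcher.apply(index));
    }

    @Override
    public boolean hasNext() {
        while (!current.hasNext()) {
            if (next == null) {
                return false;               // listing already exhausted
            }
            List<T> page = next.join();     // waits only if the prefetch is still running
            if (page.isEmpty()) {
                next = null;
                return false;
            }
            current = page.iterator();
            nextPageIndex++;
            next = fetchAsync(nextPageIndex); // prefetch while the caller processes this page
        }
        return true;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    /** Stand-in for a remote "list page" call: returns page `index` of `all`. */
    static <T> List<T> slice(List<T> all, int pageSize, int index) {
        int from = Math.min(index * pageSize, all.size());
        int to = Math.min(from + pageSize, all.size());
        return all.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> names = List.of("a", "b", "c", "d", "e");
        Iterator<String> it = new PrefetchingListing<>(i -> slice(names, 2, i));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

The caller sees a plain iterator and never handles pages directly, so the listing of page N+1 overlaps with the processing of page N.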



> Expose `listStatus(Path path, String startFrom)` on `AzureBlobFileSystem`
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-18599
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18599
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/azure
>    Affects Versions: 3.3.2, 3.3.4
>            Reporter: Thomas Newton
>            Priority: Minor
>
> When working with Azure blob storage, listing operations can often be quite 
> slow, even on storage accounts with the hierarchical namespace. 
> This can be mitigated by listing only a specific subset of directories using 
> a function like 
> [https://hadoop.apache.org/docs/r3.3.4/api/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.html#listStatus-org.apache.hadoop.fs.Path-java.lang.String-org.apache.hadoop.fs.azurebfs.utils.TracingContext-]
> which accepts a `startFrom` argument and lists all files in order starting 
> from there.
> I'm wondering if we could add a method to the `AzureBlobFileSystem`, 
> something like:
> ```
> public FileStatus[] listStatus(final Path f, final String startFrom) throws IOException
> ```
> This exposes functionality that already exists on the underlying 
> `AzureBlobFileSystemStore`. My understanding from reading a bit of the code 
> is that users should mainly be dealing with `AzureBlobFileSystem`s, and 
> `AzureBlobFileSystem` seems easier to use to me, hence the benefit of exposing 
> it on the `AzureBlobFileSystem`.
>  
> I'm very unfamiliar with Java, but I'm told that keeping strictly to 
> interfaces is strongly preferred. However, I can see some examples already on 
> `AzureBlobFileSystem` that do not belong to any interface (e.g. `breakLease`), 
> so I'm hoping it's acceptable to add a method like the one I described only for the 
> one `FileSystem` implementation.
>  
> The specific motivation for this is to unblock 
> [https://github.com/delta-io/delta/issues/1568]
> I would be willing to contribute this if maintainers think the plan is 
> reasonable. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

