[ 
https://issues.apache.org/jira/browse/HADOOP-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771817#comment-16771817
 ] 

Steve Loughran commented on HADOOP-16077:
-----------------------------------------

If you call {{FileSystem.listFiles(path, recursive)}}, you get a 
{{RemoteIterator<LocatedFileStatus>}}; each LocatedFileStatus contains an array 
of BlockLocation entries, which are meant to carry the block locations and 
storage types.

This is the best API for a recursive file listing, as it offers:

* on HDFS: results delivered in incremental batches, which reduces marshalling 
and the time the NN is locked
* on object stores: the option of switching to more efficient path enumeration 
over treewalks. S3A does this & delivers O(files/1000) listings irrespective of 
the directory tree depth
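
To make the O(files/1000) point concrete, here is a minimal sketch of flat, paged prefix listing against a toy in-memory key store. It is not the S3A code: the class {{FlatListingSketch}}, its methods, and the page size of 1000 are all assumptions made for illustration. It compares the number of simulated LIST calls for a flat listing with a lower bound for a treewalk that issues one listing per directory.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of an object store: a sorted set of keys, listed in pages. Hypothetical, not S3A. */
class FlatListingSketch {
    static final int PAGE_SIZE = 1000; // assumed page limit, chosen to match the files/1000 figure

    final NavigableSet<String> keys = new TreeSet<>();
    int requests = 0; // number of simulated LIST calls issued so far

    /** One simulated LIST call: up to PAGE_SIZE keys under prefix, strictly after startAfter. */
    List<String> listPage(String prefix, String startAfter) {
        requests++;
        List<String> page = new ArrayList<>();
        boolean first = (startAfter == null);
        for (String k : keys.tailSet(first ? prefix : startAfter, first)) {
            // Keys sharing a prefix are contiguous in sorted order, so stop at the first miss.
            if (!k.startsWith(prefix) || page.size() == PAGE_SIZE) {
                break;
            }
            page.add(k);
        }
        return page;
    }

    /** Flat recursive listing: page through every key under the prefix, ignoring tree depth. */
    List<String> listFilesFlat(String prefix) {
        List<String> all = new ArrayList<>();
        String marker = null;
        while (true) {
            List<String> page = listPage(prefix, marker);
            all.addAll(page);
            if (page.size() < PAGE_SIZE) {
                return all;
            }
            marker = page.get(page.size() - 1);
        }
    }

    /** Lower bound on treewalk cost: one listing call per distinct directory under the prefix. */
    int treewalkRequests(String prefix) {
        Set<String> dirs = new HashSet<>();
        for (String k : keys) {
            if (!k.startsWith(prefix)) {
                continue;
            }
            for (int i = k.indexOf('/', prefix.length()); i >= 0; i = k.indexOf('/', i + 1)) {
                dirs.add(k.substring(0, i + 1));
            }
        }
        return dirs.size() + 1; // +1 for the prefix directory itself
    }

    public static void main(String[] args) {
        FlatListingSketch store = new FlatListingSketch();
        for (int i = 0; i < 2500; i++) {
            store.keys.add("data/d" + i + "/x/y/part-" + i); // every file in its own deep directory
        }
        int files = store.listFilesFlat("data/").size();
        System.out.println(files + " files, flat LIST calls: " + store.requests
            + ", treewalk LIST calls (at least): " + store.treewalkRequests("data/"));
    }
}
```

In this model the flat listing cost depends only on the number of keys under the prefix, while the treewalk cost grows with the number of directories, which is why deep or wide trees hurt ls -R so much on object stores.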

Now, that's a bigger leap for ls -R than just listing the storage type, but 
it'd be great to expose that operation in general, because ls -R is so 
inefficient here.

Trouble is, of course, both Ls and LsR extend Command, which implements its 
treewalk recursively. Moving to a new iterator would be traumatic. Except 
maybe, just maybe, we could have it support both forms of list & recurse, 
with the new one as an option to switch to; if you ask for storage levels, 
you must explicitly ask for the new recurse option.

Maybe a separate "deepLs" command would be the strategy.

Have a look at {{S3AUtils.applyLocatedFiles()}} if you want to see some fun 
with closures and iterating over a list of LocatedFileStatus entries. That 
could all be promoted into {{org.apache.hadoop.util.LambdaUtils}} or the new 
{{org.apache.hadoop.fs.impl}} package.
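
For readers who haven't seen that pattern, here is a minimal sketch of applying a closure to every entry of an iterator whose calls may throw IOException. It is not the S3A code: {{RemoteIter}}, {{IOConsumer}}, {{applyToAll}} and {{wrap}} are hypothetical stand-ins written so the example runs without Hadoop on the classpath. In real code the iterator would be the {{RemoteIterator<LocatedFileStatus>}} returned by {{FileSystem.listFiles(path, true)}}.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RemoteIterationSketch {
    /** Hypothetical stand-in for Hadoop's RemoteIterator: both calls may throw IOException. */
    interface RemoteIter<T> {
        boolean hasNext() throws IOException;
        T next() throws IOException;
    }

    /** A closure that is allowed to throw IOException, unlike java.util.function.Consumer. */
    @FunctionalInterface
    interface IOConsumer<T> {
        void accept(T t) throws IOException;
    }

    /** Apply the closure to every remaining entry; returns the number of entries visited. */
    static <T> long applyToAll(RemoteIter<T> it, IOConsumer<? super T> action)
            throws IOException {
        long count = 0;
        while (it.hasNext()) {
            action.accept(it.next());
            count++;
        }
        return count;
    }

    /** Wrap an in-memory iterator so the sketch can run without a filesystem. */
    static <T> RemoteIter<T> wrap(Iterator<T> inner) {
        return new RemoteIter<T>() {
            @Override
            public boolean hasNext() {
                return inner.hasNext();
            }

            @Override
            public T next() {
                return inner.next();
            }
        };
    }

    public static void main(String[] args) throws IOException {
        List<String> seen = new ArrayList<>();
        // The strings stand in for LocatedFileStatus entries from a recursive listing.
        long n = applyToAll(wrap(Arrays.asList("a/part-0", "a/b/part-1").iterator()), seen::add);
        System.out.println(n + " entries: " + seen);
    }
}
```

A helper of this shape is what would let an ls-style command consume a recursive listing entry by entry, printing each one as it arrives rather than building the whole tree first.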


BTW: I'm thinking that we could have the object stores expose the archive 
status of files in the storage type, so things like AWS Glacier storage would 
be visible. Being able to list that here would be ideal.

> Add an option in ls command to include storage policy
> -----------------------------------------------------
>
>                 Key: HADOOP-16077
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16077
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: common
>    Affects Versions: 3.3.0
>            Reporter: Ayush Saxena
>            Assignee: Ayush Saxena
>            Priority: Major
>         Attachments: HADOOP-16077-01.patch, HADOOP-16077-02.patch, 
> HADOOP-16077-03.patch, HADOOP-16077-04.patch, HADOOP-16077-05.patch, 
> HADOOP-16077-06.patch, HADOOP-16077-07.patch, HADOOP-16077-08.patch, 
> HADOOP-16077-09.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
