[ 
https://issues.apache.org/jira/browse/HADOOP-13208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15421944#comment-15421944
 ] 

Chris Nauroth commented on HADOOP-13208:
----------------------------------------

[~ste...@apache.org], thank you for the patch.  I'm still working through some 
of the logic in detail, but this looks great overall.  I have a few questions 
and comments so far.

The control flow across the various iterators is a bit hard to follow.  I think 
this is an unavoidable consequence of the optimized logic, but I'd like to 
suggest some things that might make it easier to follow.

# I notice that all call sites first contruct an {{ObjectListingIterator}} and 
then immediately pass it into construction of a {{FileStatusListingIterator}}.  
Since the two classes cannot be used independently in a meaningful way, I 
wonder if it would be better to combine all of the logic into a single class.  
If not combined, then perhaps it would be helpful to have a 
{{createFileStatusListingIterator}} helper method to encapsulate the 2-step 
construction, and all call sites could call that.
# Could the first listing call be pushed into the {{ObjectListingIterator}} 
constructor?  That would have the benefit of the iterator fully encapsulating 
all of the calls so that each call site doesn't need to bootstrap it with the 
first listing call.  If the constructor also accepted a {{recursive}} flag, 
then it could take care of deciding whether or not to pass a "/" delimiter: 
another detail that call sites wouldn't need to worry about getting right.
# It appears that any reference to a {{FileStatusGenerator}} is always going to 
point to an instance of a single implementation: {{GenerateS3AFileStatus}}.  Is 
there a reason for the extra indirection, or can the logic of 
{{GenerateS3AFileStatus}} be inlined to the call sites?  (Maybe the indirection 
sets up something helpful in a subsequent patch that I haven't reviewed yet?)
# I could see moving the new inner classes to top-level package-private 
classes, assuming that wouldn't force relaxing visibility on {{S3AFileSystem}} 
internals too much.

{code}
   * This implementation is optimized for S3, which can do a bulk listing
   * off all entries under a path in one single operation. Thus there is
   * no need to recursively walk the directory tree.
{code}

This comment on {{listLocatedStatus}} confused me, because this method does not 
recursively traverse the sub-tree.  As I understand it, this patch does not 
reduce the number of S3 calls during a {{listLocatedStatus}} operation, 
assuming that the caller fully exhausts the returned iterator.  However, it's 
still a good change, because if the caller breaks out of the iteration early, 
then they don't pay the full cost of an eager {{listStatus}} fetch (multiple S3 
calls) that the base class {{FileSystem#listLocatedStatus}} would do.

Do I understand correctly, or did I miss something recursive about this 
operation?

A few nitpicks:

{code}
   * @return the fully qualified path including URI schema and bucket name.
{code}

Should be "URI scheme"?

{code}
   * Make this protected method public so that {@link S3AGlobber can access it}.
...
   * Override superclass to use the new {@code S3AGlobber}.
{code}

Since the globber stuff is going to be done in a different patch, I suggest 
omitting these changes from this patch.

{code}
   * This essentially a nested and wrapped set of iterators, with some
{code}

Should be "is essentially"?

{code}
    builder.append("size=").append(summary.getSize()).append(" ");
    return builder.toString();
{code}

The last {{append}} of a trailing space looks unnecessary.

{code}
      result = new ArrayList<>(1);
{code}

I agree with Aaron's earlier comment about returning an array directly here and 
only allocating an {{ArrayList}} on the is-directory code path.



> S3A listFiles(recursive=true) to do a bulk listObjects instead of walking the 
> pseudo-tree of directories
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13208
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13208
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HADOOP-13208-branch-2-001.patch, 
> HADOOP-13208-branch-2-007.patch, HADOOP-13208-branch-2-008.patch, 
> HADOOP-13208-branch-2-009.patch, HADOOP-13208-branch-2-010.patch, 
> HADOOP-13208-branch-2-011.patch, HADOOP-13208-branch-2-012.patch, 
> HADOOP-13208-branch-2-017.patch, HADOOP-13208-branch-2-018.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A major cost in split calculation against object stores turns out be listing 
> the directory tree itself. That's because against S3, it takes S3A two HEADs 
> and two lists to list the content of any directory path (2 HEADs + 1 list for 
> getFileStatus(); the next list to query the contents).
> Listing a directory could be improved slightly by combining the final two 
> listings. However, a listing of a directory tree will still be 
> O(directories). In contrast, a recursive {{listFiles()}} operation should be 
> implementable by a bulk listing of all descendant paths; one List operation 
> per thousand descendants. 
> As the result of this call is an iterator, the ongoing listing can be 
> implemented within the iterator itself



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to