[ 
https://issues.apache.org/jira/browse/HADOOP-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558129#action_12558129
 ] 

Doug Cutting commented on HADOOP-2566:
--------------------------------------

Globbing is implemented on top of listPaths() which is implemented on top of 
listStatus().  The primitive globbing API should not throw away that status 
information.  It should keep it so that glob clients which need it do not have 
to call getStatus() for each file that matches.  Currently the cache of 
FileStatus hides the cost of these getStatus() calls, but that cache will break 
things once files and their status can change.  So we need globStatus() before 
we can remove the cache.

FileInputFormat, for example, uses globPaths() to list files matching the input 
specification, then it uses getStatus() on each matching path when building 
splits.  This must change to call globStatus() before the cache is removed.
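
To make the cost difference concrete, here is a minimal, self-contained sketch. It does not use the real Hadoop classes: ToyFs, Status, sampleFs() and the two totalVia...() helpers are hypothetical stand-ins, the round-trip counter is an assumption about where the cost lies, and a regular expression is used in place of real glob syntax. It only illustrates why a client that needs status should get it from the glob call itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class GlobStatusSketch {
    /** Hypothetical stand-in for FileStatus: a path plus its length. */
    static final class Status {
        final String path;
        final long length;
        Status(String path, long length) { this.path = path; this.length = length; }
    }

    /** Toy in-memory file system that counts simulated round trips. */
    static final class ToyFs {
        private final Map<String, Long> files = new LinkedHashMap<>();
        int roundTrips = 0;

        void add(String path, long length) { files.put(path, length); }

        /** One round trip returns the status of every entry. */
        List<Status> listStatus() {
            roundTrips++;
            List<Status> out = new ArrayList<>();
            for (Map.Entry<String, Long> e : files.entrySet()) {
                out.add(new Status(e.getKey(), e.getValue()));
            }
            return out;
        }

        /** One round trip per path: the cost that a status cache would hide. */
        Status getStatus(String path) {
            roundTrips++;
            return new Status(path, files.get(path));
        }

        /** Keeps the status gathered while matching. */
        List<Status> globStatus(String regex) {
            Pattern p = Pattern.compile(regex);
            List<Status> out = new ArrayList<>();
            for (Status s : listStatus()) {
                if (p.matcher(s.path).matches()) out.add(s);
            }
            return out;
        }

        /** Matches the same way, then throws the status away. */
        List<String> globPaths(String regex) {
            List<String> out = new ArrayList<>();
            for (Status s : globStatus(regex)) out.add(s.path);
            return out;
        }
    }

    /** Old pattern: glob for paths, then ask for status path by path. */
    static long totalViaPaths(ToyFs fs, String regex) {
        long total = 0;
        for (String path : fs.globPaths(regex)) total += fs.getStatus(path).length;
        return total;
    }

    /** New pattern: the glob result already carries the status. */
    static long totalViaStatus(ToyFs fs, String regex) {
        long total = 0;
        for (Status s : fs.globStatus(regex)) total += s.length;
        return total;
    }

    static ToyFs sampleFs() {
        ToyFs fs = new ToyFs();
        fs.add("/in/part-0", 10);
        fs.add("/in/part-1", 20);
        fs.add("/in/part-2", 30);
        fs.add("/in/_logs", 5);
        return fs;
    }

    public static void main(String[] args) {
        ToyFs a = sampleFs();
        long viaPaths = totalViaPaths(a, "/in/part-.*");
        System.out.println(viaPaths + " bytes, " + a.roundTrips + " round trips");
        ToyFs b = sampleFs();
        long viaStatus = totalViaStatus(b, "/in/part-.*");
        System.out.println(viaStatus + " bytes, " + b.roundTrips + " round trips");
    }
}
```

In this toy model the path-based variant costs one round trip for the listing plus one per matching file, while the status-based variant costs only the listing, and both compute the same total.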

Long-term, globPaths() and listPaths() may still be useful as utility 
methods implemented in terms of globStatus() and listStatus().  But most 
current users of these will see their performance suffer once the cache is 
removed, so we should deprecate them now.  That gives fair warning and 
strongly encourages folks to stop using them before the cache is removed.
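
One possible shape for that arrangement is sketched below, with the status-returning call as the primitive and the path-returning call as a deprecated convenience wrapper over it. SketchFs, Status and fixedFs() are hypothetical names for illustration only, not the actual FileSystem API or its signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeprecatedGlobSketch {
    /** Hypothetical stand-in for FileStatus. */
    static final class Status {
        final String path;
        final long length;
        Status(String path, long length) { this.path = path; this.length = length; }
    }

    /** Hypothetical file-system facade; not Hadoop's FileSystem. */
    abstract static class SketchFs {
        /** Primitive: every match comes back with its status attached. */
        abstract List<Status> globStatus(String pattern);

        /**
         * Utility built on the primitive; it discards the status.
         * @deprecated callers that need status should use globStatus().
         */
        @Deprecated
        List<String> globPaths(String pattern) {
            List<String> paths = new ArrayList<>();
            for (Status s : globStatus(pattern)) paths.add(s.path);
            return paths;
        }
    }

    /** Toy implementation that ignores the pattern and returns fixed entries. */
    static SketchFs fixedFs() {
        return new SketchFs() {
            @Override
            List<Status> globStatus(String pattern) {
                return Arrays.asList(new Status("/in/part-0", 10),
                                     new Status("/in/part-1", 20));
            }
        };
    }

    public static void main(String[] args) {
        // Still works, but the annotation warns callers outside this class.
        System.out.println(fixedFs().globPaths("/in/part-*"));
    }
}
```

Because the wrapper only delegates, it stays correct however globStatus() is implemented; the deprecation marker just steers callers who need status toward the primitive.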


> need FileSystem#globStatus method
> ---------------------------------
>
>                 Key: HADOOP-2566
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2566
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Doug Cutting
>            Assignee: Hairong Kuang
>             Fix For: 0.16.0
>
>
> To remove the cache of FileStatus in DFSPath (HADOOP-2565) without hurting 
> performance, we must use file enumeration APIs that return FileStatus[] 
> rather than Path[].  Currently we have FileSystem#globPaths(), but that 
> method should be deprecated and replaced with a FileSystem#globStatus().
> We need to deprecate FileSystem#globPaths() in 0.16 in order to remove the 
> cache in 0.17.
