[ 
https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515986#comment-14515986
 ] 

Sanjay Radia commented on HADOOP-9984:
--------------------------------------

The following proposal on symlinks is based on discussions with Jason, Nathan 
and Daryn a few months ago. 
The recent disabling of symlinks (HDFS-11852) has prompted me to finally get 
this comment out.

Symlinks are a very frequently requested feature; they ran into trouble mostly 
because the original listStatus was not well designed.
This issue has been heavily discussed and we have gone back and forth.  The 
proposal below is basically Jason Lowe's proposal as mostly described in 
 
https://issues.apache.org/jira/browse/HADOOP-9912?focusedCommentId=13772002&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13772002

An additional issue concerns cross-namespace links; it should be discussed in 
a separate comment.
Further, Colin has raised a Hive concern in an email thread that I will also 
cover in a separate comment.


Summary of proposal:

* 1) The existing listStatus() API will follow symlinks to maintain 
compatibility for isDir(), and will throw an exception if it cannot.
* 2) Add a new listStatus2() API that does the right thing (i.e., does not 
follow symlinks).
* 3) Change all other libraries such as glob, cli and tools to use the new 
API listStatus2.
* 4) Deprecate the existing listStatus.

Details:
* 1) For the current API: listStatus()  returns  FileStatus[].
** a) listStatus will follow the symlinks. If any of the symlinks cannot be 
followed (i.e., no permissions or dangling), then listStatus throws an 
exception.
** b) The child names returned in FileStatus[] are those of the symlinks and 
NOT those of the targets.
** c) Everything else in FileStatus\[i] (filesize, isDir, owner, perms, etc.) 
needs to come from the resolved target of the symlink. E.g. 
FileStatus\[i].isDir will return the status of the symlink target. If it 
can't resolve a symlink then we must throw an error, since we can neither 
return partial results nor indicate per FileStatus entry that an error 
occurred. (Note it would have been much nicer for isDir to throw the 
exception, but that is not possible since it does not declare any exception, 
and the only other option is a runtime exception, which is bad.)

* 2) Create a new API: listStatus2() (a better name? listDir) that returns 
FileStatusExtended[]
** a) This returns the raw list with symlinks *not* followed.
** b) FileStatusExtended has a method called getFileType() that returns an 
enum. Optionally it could also have methods called isDir(), isFile() and 
isSymlink().

* 3) Fix all internal utilities and libraries (ls, glob, distcp)  to do the 
correct thing using API 1 or 2 as needed.

* 4) Deprecate the existing listStatus() API.
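
To make the difference between the two APIs concrete, here is a minimal, 
self-contained sketch in plain Java. It is NOT Hadoop code: ToyDirectory, the 
FileType enum values, getLinkTarget() and the shape of FileStatusExtended are 
all assumptions made up for illustration (the proposal only names 
listStatus(), listStatus2(), FileStatusExtended and getFileType()). It also 
resolves only one level of symlink, which a real implementation would not be 
limited to.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ListStatusSketch {
    // Hypothetical enum; the proposal only says getFileType() returns "an enum".
    enum FileType { FILE, DIRECTORY, SYMLINK }

    // Hypothetical stand-in for the proposed FileStatusExtended.
    static final class FileStatusExtended {
        private final String name;
        private final FileType type;
        private final String linkTarget; // null unless type == SYMLINK

        FileStatusExtended(String name, FileType type, String linkTarget) {
            this.name = name;
            this.type = type;
            this.linkTarget = linkTarget;
        }

        String getName() { return name; }
        FileType getFileType() { return type; }
        String getLinkTarget() { return linkTarget; }
        boolean isDir() { return type == FileType.DIRECTORY; }
        boolean isFile() { return type == FileType.FILE; }
        boolean isSymlink() { return type == FileType.SYMLINK; }
    }

    // Toy in-memory directory used only to illustrate the two behaviours.
    static final class ToyDirectory {
        private final List<FileStatusExtended> entries = new ArrayList<>();
        // Paths that a symlink can successfully resolve to, with their type.
        private final Map<String, FileType> resolvable = new HashMap<>();

        void add(String name, FileType type) {
            entries.add(new FileStatusExtended(name, type, null));
        }

        void addSymlink(String name, String target) {
            entries.add(new FileStatusExtended(name, FileType.SYMLINK, target));
        }

        void addTarget(String path, FileType type) {
            resolvable.put(path, type);
        }

        // New-style listing: raw entries, symlinks are NOT followed.
        List<FileStatusExtended> listStatus2() {
            return new ArrayList<>(entries);
        }

        // Legacy-style listing: keeps each child's name but reports the
        // resolved target's type; all-or-nothing if any link cannot resolve.
        List<FileStatusExtended> listStatus() throws IOException {
            List<FileStatusExtended> out = new ArrayList<>();
            for (FileStatusExtended e : entries) {
                if (!e.isSymlink()) {
                    out.add(e);
                    continue;
                }
                FileType resolved = resolvable.get(e.getLinkTarget());
                if (resolved == null) {
                    throw new FileNotFoundException("Cannot resolve symlink "
                        + e.getName() + " -> " + e.getLinkTarget());
                }
                out.add(new FileStatusExtended(e.getName(), resolved, null));
            }
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        ToyDirectory dir = new ToyDirectory();
        dir.add("data.txt", FileType.FILE);
        dir.addSymlink("logs", "/var/logs");
        dir.addTarget("/var/logs", FileType.DIRECTORY);

        for (FileStatusExtended s : dir.listStatus2()) {
            System.out.println("raw:      " + s.getName() + " " + s.getFileType());
        }
        for (FileStatusExtended s : dir.listStatus()) {
            System.out.println("resolved: " + s.getName() + " " + s.getFileType());
        }
    }
}
```

In this sketch a single dangling link makes the legacy-style listStatus() 
fail for the whole directory, while listStatus2() still returns every entry 
and lets the caller decide what to do with each symlink.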

The reasoning behind the above proposal (Jason Lowe's words): 

As discussed in HADOOP-9912, listStatus is effectively a combination of 
readdir() and stat() from POSIX.  readdir() does not follow symlinks but stat() 
does.  That means we need to return the original names in the child directory, 
i.e.: what readdir() does, but the FileStatus information returned by listStatus 
needs to be what the symlink points to except for the name part, i.e.: what 
stat() does.
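
The readdir()/stat() split can be observed on a local file system with 
java.nio.file. This is only an analogy and does not use the Hadoop FileSystem 
API; the class name, the describe() helper and the temporary paths are 
invented for the example, and it assumes the platform allows creating 
symbolic links.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class ReaddirVsStat {
    // With NOFOLLOW_LINKS this behaves like lstat(); without it, like stat().
    static String describe(Path p, LinkOption... opts) throws IOException {
        BasicFileAttributes a =
            Files.readAttributes(p, BasicFileAttributes.class, opts);
        if (a.isSymbolicLink()) {
            return "symlink";
        }
        return a.isDirectory() ? "dir" : "file";
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("symlink-demo");
        Path target = Files.createDirectory(dir.resolve("real"));
        Files.createSymbolicLink(dir.resolve("link"), target);

        // Like readdir(): the stream yields the child names as stored in the
        // directory, so the symlink shows up under its own name.
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                System.out.println(child.getFileName()
                    + " unresolved=" + describe(child, LinkOption.NOFOLLOW_LINKS)
                    + " resolved=" + describe(child));
            }
        }
    }
}
```

The entry named "link" keeps its own name in the listing, but its resolved 
attributes are those of the directory it points to, which is the combination 
the existing listStatus() has to reproduce.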

And yes, throwing an exception for bad (dangling) symlinks is severe, but it 
seems like the lesser of evils.  We don't know what the application will do if 
we expose the raw symlink to it or hide it, which are basically our only 
choices if we don't throw.  Either approach could lead to silent data loss or 
other badness because we don't know what the app is going to do.  That's why 
we'd deprecate the original API because it doesn't allow us to return errors 
for individual entries in the listStatus results -- it's all or nothing.


> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by 
> default
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-9984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9984
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs
>    Affects Versions: 2.1.0-beta
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch, 
> HADOOP-9984.005.patch, HADOOP-9984.007.patch, HADOOP-9984.009.patch, 
> HADOOP-9984.010.patch, HADOOP-9984.011.patch, HADOOP-9984.012.patch, 
> HADOOP-9984.013.patch, HADOOP-9984.014.patch, HADOOP-9984.015.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that 
> many existing HDFS clients would be broken by listStatus and globStatus 
> returning symlinks.  One example is applications that assume that 
> !FileStatus#isFile implies that the inode is a directory.  As we discussed in 
> HADOOP-9972 and HADOOP-9912, we should default these APIs to returning 
> resolved paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
