[ https://issues.apache.org/jira/browse/HADOOP-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515986#comment-14515986 ]
Sanjay Radia commented on HADOOP-9984:
--------------------------------------

The following proposal on symlinks is based on discussions with Jason, Nathan and Daryn a few months ago. The recent disabling of symlinks (HDFS-11852) has prompted me to finally post this comment. Symlinks are a very frequently requested feature; they ran into trouble mostly because the original listStatus was not well designed. This issue has been heavily discussed and we have gone back and forth. The proposal below is basically Jason Lowe's proposal, mostly as described in
https://issues.apache.org/jira/browse/HADOOP-9912?focusedCommentId=13772002&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13772002

An additional issue concerns cross-namespace links, which should be discussed in a separate comment. Further, Colin has raised a Hive concern in an email thread that I will also cover in a separate comment.

Summary of proposal:
* 1) The existing listStatus() API will follow symlinks to maintain compatibility for isDir(), and will throw an exception if it cannot.
* 2) Add a new listStatus2() API that does the right thing (i.e. does not follow symlinks).
* 3) Change all other libraries such as glob, cli and tools to use the new listStatus2 API.
* 4) Deprecate the existing listStatus.

Details:
* 1) For the current API: listStatus() returns FileStatus[].
** a) listStatus will follow symlinks. If any of the symlinks cannot be followed (i.e. no permissions or dangling), then listStatus throws an exception.
** b) The child names in the returned FileStatus[] are those of the symlinks and NOT those of their targets.
** c) Everything else in FileStatus\[i] (filesize, isDir, owner, perms, etc.) needs to come from the resolved target of the symlink. E.g. FileStatus\[i].isDir will return the status of the symlink target. If a symlink cannot be resolved then we must throw an error, since we can neither return partial results nor indicate per FileStatus entry that an error occurred.
(Note: it would have been much nicer for isDir to throw the exception, but that is not possible since it does not declare any exception, and the only other option is a runtime exception, which is bad.)
* 2) Create a new API: listStatus2() (a better name? listDir) that returns FileStatusExtended[].
** a) This returns the raw list with symlinks *not* followed.
** b) FileStatusExtended has a method called getFileType() that returns an enum. Optionally it could have methods called isDir(), isFile() and isSymlink().
* 3) Fix all internal utilities and libraries (ls, glob, distcp) to do the correct thing using API 1 or 2 as needed.
* 4) Deprecate the existing listStatus() API.

The reasoning behind the above proposal (Jason Lowe's words):

As discussed in HADOOP-9912, listStatus is effectively a combination of readdir() and stat() from POSIX. readdir() does not follow symlinks but stat() does. That means we need to return the original names in the child directory, i.e.: what readdir() does, but the FileStatus information returned by listStatus needs to be what the symlink points to except for the name part, i.e.: what stat() does.

And yes, throwing an exception for bad (dangling) symlinks is severe, but it seems like the lesser of evils. We don't know what the application will do if we expose the raw symlink to it or hide it, which are basically our only choices if we don't throw. Either approach could lead to silent data loss or other badness because we don't know what the app is going to do. That's why we'd deprecate the original API because it doesn't allow us to return errors for individual entries in the listStatus results -- it's all or nothing.
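
To make the two listing behaviours concrete, here is a minimal sketch that uses java.nio.file on a local filesystem as an analogy. It is not the Hadoop API and not a proposed implementation: SymlinkListingSketch, Entry, FileType, listFollowing and listRaw are hypothetical names invented for illustration. listFollowing mimics the proposed behaviour of the existing listStatus() (names from the directory, attributes from the resolved target, all-or-nothing failure on a dangling link), and listRaw mimics the proposed listStatus2() (symlinks not followed, type exposed as an enum).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

/** Local-filesystem analogy only; none of these names are Hadoop APIs. */
public class SymlinkListingSketch {

    /** Hypothetical counterpart of the enum that getFileType() would return. */
    public enum FileType { FILE, DIRECTORY, SYMLINK, OTHER }

    /** Hypothetical, simplified stand-in for one entry of a listing. */
    public static final class Entry {
        public final String name;   // name as stored in the parent directory
        public final FileType type; // type of the target (following) or of the link itself (raw)
        public final long size;

        Entry(String name, FileType type, long size) {
            this.name = name;
            this.type = type;
            this.size = size;
        }
    }

    /** Like the proposed old listStatus(): readdir() names, stat() attributes. */
    public static List<Entry> listFollowing(Path dir) throws IOException {
        List<Entry> result = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                // Follows symlinks by default. A dangling link raises an IOException,
                // so the whole listing fails instead of returning partial results.
                BasicFileAttributes attrs =
                        Files.readAttributes(child, BasicFileAttributes.class);
                result.add(new Entry(child.getFileName().toString(), typeOf(attrs), attrs.size()));
            }
        }
        return result;
    }

    /** Like the proposed listStatus2(): readdir() names, lstat() attributes. */
    public static List<Entry> listRaw(Path dir) throws IOException {
        List<Entry> result = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(dir)) {
            for (Path child : children) {
                // NOFOLLOW_LINKS describes the link itself, so dangling links are listed too.
                BasicFileAttributes attrs = Files.readAttributes(
                        child, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
                result.add(new Entry(child.getFileName().toString(), typeOf(attrs), attrs.size()));
            }
        }
        return result;
    }

    private static FileType typeOf(BasicFileAttributes attrs) {
        if (attrs.isSymbolicLink()) return FileType.SYMLINK;
        if (attrs.isDirectory()) return FileType.DIRECTORY;
        if (attrs.isRegularFile()) return FileType.FILE;
        return FileType.OTHER;
    }
}
```

In this analogy, a caller that only checks for "directory or not" keeps working with listFollowing because it never sees a symlink type, while tools such as ls or glob would use listRaw and decide per entry whether and how to resolve a link.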

> FileSystem#globStatus and FileSystem#listStatus should resolve symlinks by default
> ----------------------------------------------------------------------------------
>
> Key: HADOOP-9984
> URL: https://issues.apache.org/jira/browse/HADOOP-9984
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs
> Affects Versions: 2.1.0-beta
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Priority: Critical
> Attachments: HADOOP-9984.001.patch, HADOOP-9984.003.patch,
> HADOOP-9984.005.patch, HADOOP-9984.007.patch, HADOOP-9984.009.patch,
> HADOOP-9984.010.patch, HADOOP-9984.011.patch, HADOOP-9984.012.patch,
> HADOOP-9984.013.patch, HADOOP-9984.014.patch, HADOOP-9984.015.patch
>
>
> During the process of adding symlink support to FileSystem, we realized that
> many existing HDFS clients would be broken by listStatus and globStatus
> returning symlinks. One example is applications that assume that
> !FileStatus#isFile implies that the inode is a directory. As we discussed in
> HADOOP-9972 and HADOOP-9912, we should default these APIs to returning
> resolved paths.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)