[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-10-04 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786814#comment-13786814
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

I agree.  We should dupe this to HADOOP-9984 and close.  Any objections?

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs
> Affects Versions: 2.1.0-beta
> Reporter: Jason Lowe
> Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch, new-hdfs.txt, new-local.txt, 
> old-hdfs.txt, old-local.txt
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-10-04 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786335#comment-13786335
 ] 

Daryn Sharp commented on HADOOP-9912:
-

I'm not sure this patch makes sense anymore.  It's what originally kickstarted 
the symlink discussion before we did a deeper dive.

If globStatus is resolving symlinks along the way, it's never going to see a 
link so it doesn't need to know if a link is a link to a directory.  Or am I 
missing something?



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-10-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785811#comment-13785811
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

I believe this should be fixed by HADOOP-9984.  Let's wait for that patch to 
land and then triage.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773484#comment-13773484
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

By the way, as far as I know, HADOOP-9984 is the only one of the JIRAs in this 
area that needs to get into branch-2.1-beta.  HADOOP-9981 is not in 2.1-beta.

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 2.1.0-beta
> Reporter: Jason Lowe
> Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch, new-hdfs.txt, new-local.txt, 
> old-hdfs.txt, old-local.txt
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-20 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773302#comment-13773302
 ] 

Arun C Murthy commented on HADOOP-9912:
---

Thanks [~cmccabe].

How about breaking HADOOP-9972 into two independent pieces:

# Fix globStatus/listStatus to resolve symlinks and throw an exception when they can't.
# Add the new APIs being discussed in HADOOP-9972.

This way #1 can be expedited and we can unblock hadoop-2.2 (GA). #2 can come in 
hadoop-2.3. In hadoop-2.2 we can put appropriate notices that symlinks aren't 
yet ready for primetime.

Thoughts? 



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773395#comment-13773395
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

Sure.  We can change the default behavior in a separate JIRA.  I filed 
HADOOP-9984 to change the default behavior of FileSystem#globStatus and 
FileSystem#listStatus so that they resolve symlinks.

bq. Plus multiple exceptions are now erroneously being reported as FNF, etc. 
Ex. I know of at least AccessControlException (I think Colin fixed?) and 
StandbyException but I think we encountered more.

This was already fixed in HADOOP-9929.

bq. The new implementation is causing increased RPC load by listing parent 
directories when no patterns are present in the component.

We're discussing this (plus a fix) on HADOOP-9981.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-20 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773377#comment-13773377
 ] 

Daryn Sharp commented on HADOOP-9912:
-

We also need to revert most of HADOOP-9817, or at least change the new 
{{Globber}} class to behave identically to how the code behaved before it was 
split out of FileSystem (apart from resolving symlinks).

The new implementation is causing increased RPC load by listing parent 
directories when no patterns are present in the component.  We've had apps go 
OOM, or run extremely slowly, because parent dirs had lots of items.
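
To illustrate the RPC concern, here is a minimal sketch of a glob expansion that lists a directory only when the path component contains a wildcard and otherwise looks up the single literal child. GlobSketch, its in-memory tree, and the two counters are hypothetical stand-ins; this is not the Hadoop Globber code, and only a whole-component "*" is supported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GlobSketch {
    /** Hypothetical in-memory tree: directory path -> child names. */
    private final Map<String, List<String>> children = new TreeMap<>();
    int listCalls = 0; // counts simulated directory-listing calls
    int statCalls = 0; // counts simulated single-path lookups

    void mkdir(String dir, String... names) {
        children.put(dir, List.of(names));
    }

    private List<String> list(String dir) {
        listCalls++;
        return children.getOrDefault(dir, List.of());
    }

    private boolean exists(String parent, String name) {
        statCalls++;
        return children.getOrDefault(parent, List.of()).contains(name);
    }

    /** Expands an absolute pattern whose only supported wildcard is a whole-component "*". */
    List<String> glob(String pattern) {
        List<String> candidates = new ArrayList<>(List.of(""));
        for (String component : pattern.substring(1).split("/")) {
            List<String> next = new ArrayList<>();
            for (String parent : candidates) {
                String dir = parent.isEmpty() ? "/" : parent;
                if (component.equals("*")) {
                    // A listing is needed only when the component is a wildcard.
                    for (String name : list(dir)) {
                        next.add(parent + "/" + name);
                    }
                } else if (exists(dir, component)) {
                    // Literal component: look up one path instead of listing the parent.
                    next.add(parent + "/" + component);
                }
            }
            candidates = next;
        }
        return candidates;
    }

    public static void main(String[] args) {
        GlobSketch fs = new GlobSketch();
        fs.mkdir("/", "a");
        fs.mkdir("/a", "b");
        fs.mkdir("/a/b", "x", "y");
        System.out.println(fs.glob("/a/b/*"));
        System.out.println("listings=" + fs.listCalls + " lookups=" + fs.statCalls);
    }
}
```

With this shape, a large parent directory costs one lookup rather than a full listing whenever the component under it is a literal name.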

Plus multiple exceptions are now erroneously being reported as FNF 
(FileNotFoundException).  For example, I know of at least AccessControlException 
(I think Colin fixed that?) and StandbyException, but I think we encountered more.
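
As a rough sketch of the behavior being asked for here, the lookup below treats only a genuine "file not found" as "no match" and lets every other failure propagate unchanged. StatusLookupSketch and its Fetcher callback are hypothetical names for illustration, and a plain IOException stands in for errors such as AccessControlException or StandbyException.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

final class StatusLookupSketch {
    /** Hypothetical lookup callback standing in for a remote status call. */
    interface Fetcher {
        String fetch(String path) throws IOException;
    }

    /** Returns null only when the path is really absent; other failures propagate. */
    static String statusOrNull(Fetcher fetcher, String path) throws IOException {
        try {
            return fetcher.fetch(path);
        } catch (FileNotFoundException e) {
            // Genuinely missing: safe to treat as "no match".
            return null;
        }
        // Any other IOException (permissions, standby, ...) is not caught above.
    }

    public static void main(String[] args) throws IOException {
        Fetcher missing = p -> { throw new FileNotFoundException(p); };
        Fetcher denied = p -> { throw new IOException("permission denied: " + p); };
        System.out.println(statusOrNull(missing, "/no/such/path"));
        try {
            statusOrNull(denied, "/protected");
        } catch (IOException e) {
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```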



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-20 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773250#comment-13773250
 ] 

Arun C Murthy commented on HADOOP-9912:
---

In any case, any thoughts on how far we are from coming to a resolution? Does 
everyone agree with [~jlowe]'s proposal? How quickly can we resolve it? Or 
should we revert HADOOP-9418 and move on?



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-20 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773246#comment-13773246
 ] 

Arun C Murthy commented on HADOOP-9912:
---

Excellent points [~jlowe], I hereby withdraw my crazy motion. :)



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-20 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773310#comment-13773310
 ] 

Suresh Srinivas commented on HADOOP-9912:
-

+1 for splitting this into two parts. 

As a high priority, let's fix globStatus/listStatus as proposed. Newer and 
cleaner APIs should be done in a separate JIRA. In my opinion that work is lower 
priority and is not a blocker for 2.X GA.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-20 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773271#comment-13773271
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

+1 for Jason's proposal.  I'd like to do this as part of HADOOP-9972 and would 
appreciate feedback there.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-20 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773027#comment-13773027
 ] 

Jason Lowe commented on HADOOP-9912:


bq. Another crazy thought I'd like to throw out - what if we just returned 
false for isDir if we cannot resolve the symlink rather than throw an exception?

This sounds equivalent to the earlier proposal where "bad" symlinks are 
returned as the raw symlink.  isDir() and isFile() both return false for 
symlinks, and old clients are not aware of isFile() since it was added with 
symlink support.

An old client of listStatus will interpret the link as a file since isDir() is 
false, but we don't know if that's the proper thing to do since we don't know 
the client's intent.  If a directory walker is concerned about directories and 
not files at some point in the traversal, it could end up silently skipping a 
"bad" symlink when it should have failed.  Examples: a symlink to a directory in 
a remote filesystem that is temporarily unavailable, a symlink to a directory in 
a permission-protected tree, or a symlink intended to point to a directory whose 
target was mistyped when the link was created.

I'm not sure how common that case really is in practice.  Our recent proposal 
is trying to err on the side of caution so we don't accidentally drop data when 
we should have failed.  It does mean some scenarios for old clients will fail 
when they should have succeeded despite "bad" symlinks, but it seems better to 
report a failure that can be corrected (e.g., fix the "bad" symlink and re-run 
the app) than to potentially skip desired inputs.
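
The silent-skip risk can be sketched as follows. This is a minimal illustration assuming a simplified entry type; LegacyWalkerSketch, Entry, and Kind are hypothetical and are not Hadoop's FileStatus API. The walker models an old client that only knows isDir().

```java
import java.util.ArrayList;
import java.util.List;

final class LegacyWalkerSketch {
    enum Kind { FILE, DIRECTORY, SYMLINK }

    /** Hypothetical stand-in for a listing entry; this is not Hadoop's FileStatus. */
    static final class Entry {
        final String path;
        final Kind kind;

        Entry(String path, Kind kind) {
            this.path = path;
            this.kind = kind;
        }

        // As described above, a raw symlink answers false to both questions.
        boolean isDir() { return kind == Kind.DIRECTORY; }
        boolean isFile() { return kind == Kind.FILE; }
    }

    /** Old-style walker: it only asks isDir() and treats everything else as a file. */
    static List<String> directoriesToDescend(List<Entry> listing) {
        List<String> dirs = new ArrayList<>();
        for (Entry e : listing) {
            if (e.isDir()) {
                dirs.add(e.path);
            }
            // A raw symlink falls through here and no error is raised.
        }
        return dirs;
    }

    public static void main(String[] args) {
        List<Entry> listing = List.of(
            new Entry("/data/a", Kind.DIRECTORY),
            new Entry("/data/link-to-dir", Kind.SYMLINK), // unresolved "bad" link
            new Entry("/data/part-0", Kind.FILE));
        // The symlinked directory is missing from the result.
        System.out.println(directoriesToDescend(listing));
    }
}
```

The walker never learns that /data/link-to-dir was meant to be a directory, which is the "skipped input" case the proposal tries to turn into a visible failure.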



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-19 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772592#comment-13772592
 ] 

Arun C Murthy commented on HADOOP-9912:
---

Another option is to revert HADOOP-9418 in branch-2.1 so it doesn't make it to 
hadoop-2.2 either. We can add everything back once we tie out the details, so 
symlinks are *officially* supported in hadoop-2.3 (with all sorts of ugly caveats).

I agree this isn't ideal, but it seems like HADOOP-9418 came in too late anyway 
for branch-2.1.

Thoughts?



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-19 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772589#comment-13772589
 ] 

Arun C Murthy commented on HADOOP-9912:
---

If we do agree on following symlinks by default, what does that imply for the 
beta series? Is it something we can get done quickly? I'm trying to suss out the 
effort involved and what impact it has on hadoop-2.x GA. Thanks.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-19 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772588#comment-13772588
 ] 

Arun C Murthy commented on HADOOP-9912:
---

bq. At this point we think the best choice for the existing listStatus API is 
to follow symlinks when it can but throw an exception when it cannot. 

Makes sense to me. 

I agree it's unfortunate that "bad" symlinks could cause programs to blow up, 
but also that this is significantly better than the alternative of silently 
missing them since apps might follow the hadoop-1.x semantics of isDir.



Another crazy thought I'd like to throw out - what if we just returned false 
for isDir if we cannot resolve the symlink rather than throw an exception? 

This has the benefit of treating the unresolved symlink as a 'file' which means 
apps can still move them etc. while not blowing them up.

Is it too crazy? :)



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-19 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772002#comment-13772002
 ] 

Jason Lowe commented on HADOOP-9912:


Thanks [~eli] for the excellent writeup and thanks [~cmccabe] for proposing a 
design of new APIs.

Regarding the issue of what to do with the existing APIs, I had an offline 
discussion with [~daryn], [~kihwal], and [~nroberts].  At this point we think 
the best choice for the existing listStatus API is to follow symlinks when it 
can but throw an exception when it cannot.  It's a conservative approach where 
we're basically saying we have no idea how the caller will react if it sees the 
raw symlink since it never exposed one in previous versions.  Therefore we 
aren't assuming it will end well if we start exposing symlinks in the results 
and change the previously-assumed semantics of isDir() on those results.

Yes, this does mean that "bad" symlinks (dangling, pointing within a protected 
directory tree, referencing another filesystem that is currently unavailable) 
could cause programs to blow up when they encounter them in a directory.  
However we think that is preferable to the possible alternative of silently 
missing data because the app misinterpreted the raw symlink and skipped it when 
it should have caused an error.  Ideally the existing globStatus API would use 
a symlink-aware listStatus API internally so it could avoid erroring-out on 
symlinks that would be filtered out by the specified pattern/filter.
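
A minimal sketch of the "follow symlinks when possible, throw when not" rule, assuming a toy in-memory namespace. ResolvingListerSketch and its string encoding of entries are hypothetical and only illustrate the control flow; they are not the listStatus implementation.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ResolvingListerSketch {
    /** Hypothetical namespace: path -> "dir", "file", or "link:<target path>". */
    private final Map<String, String> namespace;

    ResolvingListerSketch(Map<String, String> namespace) {
        this.namespace = namespace;
    }

    /** Follows links until a non-link entry is found; fails loudly otherwise. */
    String resolveType(String path) throws IOException {
        String current = path;
        for (int hops = 0; hops < 32; hops++) { // bound the chain to avoid loops
            String value = namespace.get(current);
            if (value == null) {
                throw new IOException(
                    "Cannot resolve " + path + ": " + current + " does not exist");
            }
            if (!value.startsWith("link:")) {
                return value;
            }
            current = value.substring("link:".length());
        }
        throw new IOException("Too many levels of symlinks: " + path);
    }

    /** Returns "path=type" for each child, with symlinks already resolved. */
    List<String> listStatus(List<String> children) throws IOException {
        List<String> out = new ArrayList<>();
        for (String child : children) {
            out.add(child + "=" + resolveType(child));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        ResolvingListerSketch fs = new ResolvingListerSketch(Map.of(
            "/d/real", "dir",
            "/d/good", "link:/d/real",
            "/d/dangling", "link:/d/missing"));
        System.out.println(fs.listStatus(List.of("/d/real", "/d/good")));
        try {
            fs.listStatus(List.of("/d/dangling"));
        } catch (IOException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

The caller never sees a raw link: a good link is reported with its target's type, and a dangling one aborts the whole listing instead of being skipped.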



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-17 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769869#comment-13769869
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

I posted a proposed API based on our WebEx discussion at 
https://issues.apache.org/jira/browse/HADOOP-9972

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch, new-hdfs.txt, new-local.txt, 
> old-hdfs.txt, old-local.txt
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-09 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762149#comment-13762149
 ] 

Eli Collins commented on HADOOP-9912:
-

Jason, Daryn, Kihwal, Colin, Andrew and I discussed this last Friday. Here are 
my notes:

There are two types of clients and three behaviors the APIs could reasonably 
support:
# Clients that are symlink-aware and want to see status objects for links, not 
have them auto-resolved. An example is the shell (it should list the link) or 
distcp (so it can optionally follow or not follow symlinks).
# Clients that are not symlink-aware (i.e., most existing programs). This case is 
further broken down into:
## Clients that want symlink resolution exceptions exposed. E.g., suppose user X 
moves a directory D and replaces it with a symlink S to that directory, but 
accidentally changes the permissions on D so that user Y can no longer access D 
via S. If user Y regularly recursively copies X's parent directory for backup, 
then the copy should now fail; otherwise Y has no indication that the backup no 
longer includes the data it needs. 
## Clients that want symlink resolution exceptions swallowed. E.g., suppose a job 
uses a /*/D glob path and there is a symlink /S that is either dangling or 
points somewhere the client doesn't have permission to access. Should the job 
start failing because a root-level symlink was introduced that the user can list 
but not resolve?  It seems like some clients would want an option to swallow such 
resolution failures. This is arguably weaker than the previous example, since if 
you want /*/D you might also have reasonably meant to access whatever /S/D 
referred to, in which case you'd want the job to fail.
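
The three behaviors above could be modeled as an explicit policy passed to a listing call. The sketch below is only an illustration under assumed names: LinkPolicySketch, LinkPolicy, and the string encoding of entries are hypothetical and do not correspond to any API proposed in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LinkPolicySketch {
    /** Hypothetical names for the three behaviors in the notes above. */
    enum LinkPolicy { RETURN_RAW_LINK, RESOLVE_OR_FAIL, RESOLVE_OR_SKIP }

    /**
     * entries maps a child name to "file", "dir", "link-ok:<type>" (a resolvable
     * link whose target has the given type) or "link-bad" (dangling or inaccessible).
     */
    static List<String> list(Map<String, String> entries, LinkPolicy policy) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            String name = e.getKey();
            String value = e.getValue();
            if (!value.startsWith("link-")) {
                out.add(name + ":" + value);            // plain file or directory
            } else if (policy == LinkPolicy.RETURN_RAW_LINK) {
                out.add(name + ":symlink");             // symlink-aware client
            } else if (value.startsWith("link-ok:")) {
                out.add(name + ":" + value.substring("link-ok:".length()));
            } else if (policy == LinkPolicy.RESOLVE_OR_FAIL) {
                throw new IllegalStateException("cannot resolve symlink " + name);
            }
            // RESOLVE_OR_SKIP: an unresolvable link is silently dropped.
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("D", "dir");
        entries.put("S", "link-bad");
        System.out.println(list(entries, LinkPolicy.RETURN_RAW_LINK));
        System.out.println(list(entries, LinkPolicy.RESOLVE_OR_SKIP));
    }
}
```

Making the policy an explicit argument keeps the default-behavior question separate from the question of which behaviors the API can express.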

Also..

- FileSystem and FileContext should be consistent 
- We need to make a call as to whether symlinks for local FileSystem are:
-- Just for exposing symlinks in the underlying local file system
-- Supporting HDFS style symlinks (eg URIs that can span file systems)
-- I originally introduced them in HADOOP-6421 to create/expose symlinks in 
the local file system (and for testing purposes)
- The easiest way to fix the Pig breakage in the near term while we figure this 
out is to revert HADOOP-9877 

So the next steps are:
- Articulate an API that supports all three usage patterns. It should cover 
all APIs that return FileStatus objects, not just listStatus. I volunteered to 
write up a strawman proposal.
- Figure out which behavior should be the default. We need to finish figuring 
out the compatibility implications of the proposal, all options are 
incompatible at some level but we should favor the one that breaks 
compatibility the least for most existing programs (which do not use symlinks).




[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761048#comment-13761048
 ] 

Hudson commented on HADOOP-9912:


FAILURE: Integrated in Hadoop-Mapreduce-trunk #1541 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1541/])
Revert HADOOP-9877 because of breakage reported in HADOOP-9912 (wang: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1520713)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/FileContextMainOperationsBaseTest.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFsShellReturnCode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/TestHDFSFileContextMainOperations.java




[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761033#comment-13761033
 ] 

Hudson commented on HADOOP-9912:


SUCCESS: Integrated in Hadoop-Hdfs-trunk #1515 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1515/])
Revert HADOOP-9877 because of breakage reported in HADOOP-9912 (wang: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1520713)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/FileContextMainOperationsBaseTest.java
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFsShellReturnCode.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/TestHDFSFileContextMainOperations.java




[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760992#comment-13760992
 ] 

Hudson commented on HADOOP-9912:


SUCCESS: Integrated in Hadoop-Yarn-trunk #325 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/325/])
Revert HADOOP-9877 because of breakage reported in HADOOP-9912 (wang: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1520713)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/FileContextMainOperationsBaseTest.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFsShellReturnCode.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/TestHDFSFileContextMainOperations.java




[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-06 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760735#comment-13760735
 ] 

Andrew Wang commented on HADOOP-9912:
-

Above Hudson spam is because I reverted HADOOP-9877 temporarily while we sort 
out these globStatus/listStatus semantics issues.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760729#comment-13760729
 ] 

Hudson commented on HADOOP-9912:


SUCCESS: Integrated in Hadoop-trunk-Commit #4382 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4382/])
Revert HADOOP-9877 because of breakage reported in HADOOP-9912 (wang: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1520713)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/FileContextMainOperationsBaseTest.java
* 
/hadoop/common/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestFsShellReturnCode.java
* 
/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/TestHDFSFileContextMainOperations.java




[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-06 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760425#comment-13760425
 ] 

Andrew Wang commented on HADOOP-9912:
-

Hi all,

I'm all in favor of a compatibility mode for listStatus and not breaking 
existing programs, but I'm not sure there actually is a compatible solution. 
Specifically, there are three cases I'm wondering about:

- Symlink loops. If we're auto-resolving, does our directory walker infinite 
loop?
- Dangling symlinks. What happens when we hit one of these? An exception? Prune 
it from the results?
- Symlink to another FileSystem. An HDFS symlink could link to another HDFS, or 
the local filesystem, or theoretically any implementing filesystem (e.g. S3, 
Swift). Would you really want to walk across filesystems transparently?

Please prove me wrong :)



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-06 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760476#comment-13760476
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

HDFS symlink support in FileContext has been in many official releases of 
Hadoop 2.  "I doubt they're being used" is not really a good reason to break 
the existing behavior.  That's why we have been trying to keep as close as 
possible to the FileContext behavior in our port of symlinks to FileSystem.

As Andrew mentioned, it is not always possible to resolve symlinks.  There can 
be infinite symlink loops, or dangling symlinks.  "Just make them go away, I 
don't want to see them at all" is not a viable strategy.  Symlinks exist and 
their semantics are different than directories or files.

There may be an occasional program that needs a tiny change to be compatible 
with symlinks.  I think this is likely to be extremely rare, since 
getFileStatus resolves symlinks fully, and symlinks are mostly transparent to 
the application.  If system administrators don't want to change anything, they 
don't have to-- they can just continue not using symlinks on their clusters.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760496#comment-13760496
 ] 

Jason Lowe commented on HADOOP-9912:


bq. There may be an occasional program that needs a tiny change to be 
compatible with symlinks. I think this is likely to be extremely rare, since 
getFileStatus resolves symlinks fully, and symlinks are mostly transparent to 
the application.

For symlinks to files, I agree most programs will "just work."  However for 
symlinks to directories, getFileStatus isn't applicable since directory walkers 
are going to rely on the status returned from listStatus rather than doing 
another getFileStatus on each of the results from listStatus.  That's why 
Pig/MapReduce break, and I suspect many other walkers would as well.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760486#comment-13760486
 ] 

Jason Lowe commented on HADOOP-9912:


bq. I sent the calendar invite out to everyone who filled in the Doodle poll. 
[...] Let me know if you didn't receive an invite.

Thanks for organizing it, Andrew.  For some reason I have yet to see the 
invitation.  Could you please try sending it to me again?

As for the three cases:

bq. Symlink loops. If we're auto-resolving, does our directory walker infinite 
loop?

Yes, the directory walker would loop forever.  This is the same as any simple 
directory walker on other filesystems.  The tradeoff is this: either all walkers 
work, unmodified, in the common case where there is no loop, or they break even 
in the common case and must be updated for symlink detection, with no guarantee 
that they will bother to do the bookkeeping for loop detection.

bq. Dangling symlinks. What happens when we hit one of these? An exception? 
Prune it from the results?

That case is covered in the proposal above.  If a symlink cannot be resolved 
then it would be returned as a symlink in the results.

bq. Symlink to another FileSystem. An HDFS symlink could link to another HDFS, 
or the local filesystem, or theoretically any implementing filesystem (e.g. S3, 
Swift). Would you really want to walk across filesystems transparently?

Yes, it would traverse to the other filesystem, just as it does on other 
filesystems (e.g.: local filesystems on Linux).  Isn't that the whole point of 
the symlink, otherwise why is it there?  I understand there will be classes of 
tools that will need to be symlink aware and not follow them in certain 
situations, but I think users would expect a symlink to be followed by most 
tools when they set it up that way.
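
The loop-detection bookkeeping mentioned above can be sketched roughly as 
follows.  This is only an illustration on the local filesystem using 
java.nio.file, not Hadoop's FileSystem API, and the class and method names 
(LoopSafeWalker, countFiles, demo) are hypothetical.  It assumes a platform 
that allows creating symbolic links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

class LoopSafeWalker {
    /** Counts regular files under dir, following symlinks but entering each real directory once. */
    static int countFiles(Path dir, Set<Path> visited) {
        try {
            // toRealPath() resolves every symlink, so it identifies the real directory.
            if (!visited.add(dir.toRealPath())) {
                return 0; // already walked: reached again through a loop or a second link
            }
            int count = 0;
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path entry : entries) {
                    if (Files.isDirectory(entry)) {          // follows symlinks
                        count += countFiles(entry, visited);
                    } else if (Files.isRegularFile(entry)) { // follows symlinks
                        count++;
                    }
                    // Dangling links match neither test and are skipped.
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds root/part-0 and root/sub/back -> root (a loop), then walks root. */
    static int demo() {
        try {
            Path root = Files.createTempDirectory("loop-walker");
            Files.createFile(root.resolve("part-0"));
            Path sub = Files.createDirectory(root.resolve("sub"));
            Files.createSymbolicLink(sub.resolve("back"), root);
            return countFiles(root, new HashSet<>());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("files found: " + demo());
    }
}
```

The visited set is the extra state a simple walker does not keep.  Without it, 
the walk above would keep descending through sub/back until the platform 
refuses to resolve the path.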



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-06 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760241#comment-13760241
 ] 

Jason Lowe commented on HADOOP-9912:


Thanks for the behavior matrix, Colin.  I think the issue of 
compatible/incompatible is about *expectations* of the FileSystem listStatus 
API.  FileSystem hasn't supported symlinks until very recently, and as a result 
I doubt many, if any, symlinks were being used in HDFS.  It required custom 
Java code to manipulate them and nothing written with FileSystem would work 
with them.

I am under the impression that we want symlinks to "just work" for the majority 
of existing applications.  If that's the case then we need to avoid exposing 
raw symlinks as results from the existing FileSystem APIs as callers aren't 
expecting to deal with them.  A directory walker is the classic case of this: 
it expects isDir() to tell it when to traverse subdirectories, and symlinks to 
directories break that assumption.

A proposal to keep the existing FileSystem users working with symlinks in HDFS:

- listStatus resolves symlinks when possible.  If the symlink cannot be 
resolved (e.g.: dangling, permission-restricted target path, etc.) it will 
return the status of the symlink since it cannot stat the symlink target.
- A separate API, either an overload of listStatus with an extra flag to 
control symlink resolution or a separate listLinkStatus, can be used for 
callers that always want the symlink status and not the status of the symlink 
target.  I would not expect the majority of existing listStatus callers to want 
to see symlinks and have to resolve them.  This is akin to the 
getFileStatus/getFileLinkStatus pairing.  Existing callers of getFileStatus 
never expected symlinks so that's why it always follows them and a new API was 
added to examine the symlink itself rather than adding a new status API to 
always follow the symlink.

For me it's all about what callers are expecting FileSystem's listStatus 
semantics to be.  I believe that existing callers are *not* expecting symlinks 
to be returned since FileSystem never supported them in the past and I doubt 
they were being used in HDFS in general.  Most callers are expecting listStatus 
to be a readdir and stat, and stat follows symlinks.  If listStatus does not 
resolve symlinks then it breaks existing Pig and MapReduce code, and I believe 
that's an indication it will break a lot more code out there.  The code that 
breaks can be updated to understand symlinks, but I believe in practice that 
means symlinks to directories will be fragile for a long time.  Each tool that 
encounters them will have to be updated to check for them and behave 
accordingly.
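
As a rough sketch of the first bullet of the proposal (resolve the symlink when 
possible, otherwise return the status of the link itself), the following uses 
java.nio.file on the local filesystem rather than Hadoop's FileSystem.  The 
class and method names (ResolvingListing, list, describe, demo) are 
hypothetical, and the example assumes a platform that allows creating symbolic 
links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Map;
import java.util.TreeMap;

class ResolvingListing {
    /** Maps each entry name in dir to "dir", "file", or "symlink" (unresolvable links only). */
    static Map<String, String> list(Path dir) {
        Map<String, String> result = new TreeMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                BasicFileAttributes status;
                try {
                    // Preferred: status of the link target (stat semantics).
                    status = Files.readAttributes(entry, BasicFileAttributes.class);
                } catch (IOException unresolvable) {
                    // Dangling link, loop, or unreadable target: report the link itself (lstat).
                    status = Files.readAttributes(
                        entry, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
                }
                result.put(entry.getFileName().toString(), describe(status));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    static String describe(BasicFileAttributes status) {
        if (status.isSymbolicLink()) {
            return "symlink";
        }
        return status.isDirectory() ? "dir" : "file";
    }

    /** Lists a directory holding a plain file, a link to a directory, and a dangling link. */
    static Map<String, String> demo() {
        try {
            Path base = Files.createTempDirectory("resolving-listing");
            Path target = Files.createDirectory(base.resolve("target"));
            Path root = Files.createDirectory(base.resolve("root"));
            Files.createFile(root.resolve("plain"));
            Files.createSymbolicLink(root.resolve("good"), target);
            Files.createSymbolicLink(root.resolve("dangling"), base.resolve("missing"));
            return list(root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A caller that only checks for directories sees "good" as a directory and needs 
no change.  Only the dangling link surfaces as a symlink.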



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-06 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760282#comment-13760282
 ] 

Kihwal Lee commented on HADOOP-9912:


bq. on HDFS, the behavior of listStatus has always been not resolving symlinks. 
Ever since Eli added the feature a few years ago. It has never dereferenced.

Perhaps a bit of context might help.  Yes, that was the design decision made in 
HDFS-245 and HADOOP-6421, for accessing HDFS through *FileContext*. It was an 
incompatible change and users were expected to update their code to be 
symlink-aware as they migrate from FileSystem to FileContext.

The migration has been slow and we have finally decided to support symlinks in 
FileSystem. A number of people worked hard to implement something equivalent to 
the symlink feature in FileContext.  The change was obviously semantically 
incompatible.

The impact of an incompatible change in FileSystem now is quite different from 
that of making one in FileContext back in HADOOP-6421/HDFS-245.  So we cannot 
simply say this is the right way because it is consistent with what we did 
years ago in a completely different context.

Whatever decision we make about symlinks, we should not overlook the fact that 
the situation is different now.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-06 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760020#comment-13760020
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

Hi all.  There's been some confusion about old and new behavior, so I created a 
test program so you can see for yourself.  Then I ran it on a very old branch-2 
derived Hadoop (CDH 4.0.0, in fact), as well as current trunk.  The old branch 
is from June 2012.

The summary:
* on HDFS, the behavior of listStatus *has always been* not resolving symlinks. 
 Ever since Eli added the feature a few years ago.  It has never dereferenced.
* on HDFS, {{globStatus}} was previously inconsistent about whether symlinks 
that were the last path component were dereferenced-- sometimes they were, 
other times not.  Now they are never dereferenced.
* on LocalFileSystem, {{getFileLinkStatus}} previously did the exact same thing 
as {{getFileStatus}} (oops!)  This *bug* was fixed, and now calling 
{{getFileLinkStatus}} on a symlink allows you to identify it as a symlink.
* {{LocalFS}} had a bug then, which it apparently still has, where 
{{listStatus}} doesn't list dangling symlinks at all.  Hopefully we'll be able 
to fix this bug without people asking for the old broken behavior.
* There are some other irregularities in {{LocalFS}}.  In general symlink 
support seems very poor in LocalFS.

The test program is up on GitHub at https://github.com/cmccabe/HADOOP-9912_test



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-05 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759958#comment-13759958
 ] 

Suresh Srinivas commented on HADOOP-9912:
-

As I indicated (on Doodle), I cannot make the 2PM meeting. As I said, 
incompatible behavior in existing Java APIs is not acceptable. Consider adding 
newer APIs if you think the current behavior can be improved upon.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-05 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759949#comment-13759949
 ] 

Andrew Wang commented on HADOOP-9912:
-

I settled on 2PM since it'll get Eli and Jason in the same call (both seem to 
have the strongest opinions on this), but I sent the calendar invite out to 
everyone who filled in the Doodle poll. Minutes will be posted to this JIRA 
afterwards, and we can always do another call if we really need to. Let me know 
if you didn't receive an invite.

It also would be nice if Jason or Daryn could post a summary with their 
proposal (as Suresh is requesting), since it'll let us go into the call with a 
clear objective.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-05 Thread Suresh Srinivas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759835#comment-13759835
 ] 

Suresh Srinivas commented on HADOOP-9912:
-

Given the amount of discussion in this thread and other related threads, it 
takes significant time to catch up. It would be great if someone could 
summarize the old behavior and the new, changed behavior before tomorrow's 
meeting (or we can spend the time in the meeting tomorrow). If the current 
behavior is not desirable, let's introduce a new API with the right behavior 
and leave the old API compatible.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-05 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759464#comment-13759464
 ] 

Andrew Wang commented on HADOOP-9912:
-

Hi everyone, sorry for the late notice, but I just created a Doodle poll to 
settle on a time. Either before 11 or after 2 seems to work best for Eli, 
Colin, and me, so please add your availability as well. I'll email the call-in 
details once I get a good number of responses (e.g. from Binglin, Daryn, and 
Jason).

http://doodle.com/xihfagzb9azfpc5r



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-05 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759057#comment-13759057
 ] 

Jason Lowe commented on HADOOP-9912:


+1 for a Friday webex.

The main issue with listStatus not following symlinks is that all directory 
walkers written for FileSystem must make the change suggested for Pig above.  
It's not just a Pig issue, it's an issue for any code written for FileSystem 
that expects to walk a directory tree.  listStatus is not just readdir, it's 
readdir *and* stat.  It's the stat that's messing things up, because it's 
acting like an lstat when most code is expecting stat.
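
A minimal sketch of the stat/lstat distinction, using java.nio.file on the 
local filesystem rather than Hadoop's FileSystem (an assumption made so that 
the example is self-contained).  The class and method names (StatVersusLstat, 
probe) are hypothetical, and the example assumes a platform that allows 
creating symbolic links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

class StatVersusLstat {
    /** Returns {stat says directory, lstat says directory, lstat says symlink} for a link to a directory. */
    static boolean[] probe() {
        try {
            Path base = Files.createTempDirectory("stat-vs-lstat");
            Path target = Files.createDirectory(base.resolve("data"));
            Path link = Files.createSymbolicLink(base.resolve("link"), target);

            // stat-like read: follows the link and describes the target.
            BasicFileAttributes stat = Files.readAttributes(link, BasicFileAttributes.class);
            // lstat-like read: describes the link itself.
            BasicFileAttributes lstat = Files.readAttributes(
                link, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);

            return new boolean[] {
                stat.isDirectory(), lstat.isDirectory(), lstat.isSymbolicLink()
            };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = probe();
        System.out.println("stat dir=" + r[0] + ", lstat dir=" + r[1] + ", lstat symlink=" + r[2]);
    }
}
```

A walker that recurses only when the status says "directory" descends into the 
link with the first read and silently skips it with the second.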



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-04 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758653#comment-13758653
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

Agree with Binglin that we don't want to follow symlinks in listStatus.  In 
general, I'd like to avoid changing HDFS or LocalFileSystem semantics at all.  
I also would like to avoid having LocalFileSystem semantics diverge from HDFS.

Does a webex on Friday (Sep 6) sound good?  If so, I will set up a dial-in 
number and post it here.

My suggestion for Pig would be to simply follow symlinks in Pig whenever you 
find them.  That is, if {{FileStatus#isSymlink}} is true, call 
{{FileSystem#getFileStatus}} on the path to get the target.  That should work 
on both the new and old Hadoop, and require no other changes.  (Unless there's 
something I'm missing here, which is possible...)
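
The suggestion above can be sketched roughly as follows.  To stay 
self-contained, this uses java.nio.file on the local filesystem instead of 
Hadoop's FileSystem: the NOFOLLOW_LINKS read stands in for a listStatus entry 
for which {{FileStatus#isSymlink}} is true, and the second read stands in for 
{{FileSystem#getFileStatus}}.  The class and method names (SymlinkAwareWalker, 
listFiles, demo) are hypothetical, and the sketch does not handle dangling 
links or loops.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

class SymlinkAwareWalker {
    /** Collects the regular files under dir, following symlinks to directories. */
    static List<Path> listFiles(Path dir) {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                // Link-level status, as a non-resolving listing would return it.
                BasicFileAttributes status = Files.readAttributes(
                    entry, BasicFileAttributes.class, LinkOption.NOFOLLOW_LINKS);
                if (status.isSymbolicLink()) {
                    // Re-stat through the link to learn what the target is.
                    status = Files.readAttributes(entry, BasicFileAttributes.class);
                }
                if (status.isDirectory()) {
                    files.addAll(listFiles(entry));
                } else if (status.isRegularFile()) {
                    files.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    /** Builds base/target/part-0 and base/root/link -> base/target, then walks base/root. */
    static int demo() {
        try {
            Path base = Files.createTempDirectory("symlink-walker");
            Path target = Files.createDirectory(base.resolve("target"));
            Files.createFile(target.resolve("part-0"));
            Path root = Files.createDirectory(base.resolve("root"));
            Files.createSymbolicLink(root.resolve("link"), target);
            return listFiles(root).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("files found: " + demo());
    }
}
```

The only change from a naive walker is the extra status lookup for entries 
that are symlinks, so the cost is one additional call per link.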



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757455#comment-13757455
 ] 

Binglin Chang commented on HADOOP-9912:
---

I grepped for all usages in the Hadoop code base. Here are some listStatus 
usages we need to consider:
  FileSystem.getContentSummary uses listStatus to traverse a directory; I think 
it should not follow symlinks.
  DistCp uses listStatus to traverse the source directory; it should not follow 
symlinks either.
  FileUtil.copy copies a directory recursively; not sure.
These should also be taken into account when we consider semantics and 
compatibility.




[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757361#comment-13757361
 ] 

Eli Collins commented on HADOOP-9912:
-

Webex sounds good to me too.

bq. Is it unreasonable to have listStatus resolve symlinks and provide a 
separate API or flag for symlink-aware clients?

IMO listStatus is equivalent to readdir and should therefore not resolve paths 
(lists each entry as either file/dir/link). If users need an API that lists the 
statuses in a directory and resolves each, we (or they) can write a helper 
function that does the same thing but resolves links. This would not be less 
optimal in terms of performance since links are resolved by the client, and 
it's not clear if good semantics exist (do you fail if a link fails to resolve? 
do dangling links stay links and everything else is resolved?) in which case 
it's good to not have this behavior as part of the core API.

If we change FileSystem#listStatus to resolve links then we need to change 
FileContext#listStatus as well and that has supported but not resolved links 
for several releases. And does the iterable version of listStatus resolve links 
by default now too? Clearly FileSystem has more compatibility concerns than 
FileContext but I don't see an option where we preserve compatibility. We're 
balancing compatibility against friendly semantics (would a typical caller 
expect that they need to pass a flag to listStatus to prevent it from resolving 
links?) and while I agree we should help the transition by providing an API 
it's not clear to me it should be the default, and if we do provide a helper 
that's not the default would it be easier for frameworks like Pig to just 
update the relevant code to check the FileStatus? They'll need to do this 
anyway if they have assumptions like  HADOOP-6585 and it seems like they might 
want to do something different for links to directories than links to files in 
which case one helper might not work for everyone.

I agree with Andrew that we don't want to set the symlink bit for a non-symlink 
(resolved) FileStatus as that would definitely break/confuse some things.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756964#comment-13756964
 ] 

Daryn Sharp commented on HADOOP-9912:
-

bq.  The shell needs to be able to "see" symlinks, not just the files they 
point to

This is why I proposed we return the resolved stat with the symlink bit set.  
Symlink-aware code, like the shell's ls, can call getFileLinkStatus if it needs 
to.  Like users, the rest of the shell's commands probably won't care about 
symlinks either.

Adding a flag for {{globStatus}} will fix the specific case observed by pig, 
but the same issue still applies for any user code that calls {{listStatus}}.  
We probably need both to have a flag.





[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757071#comment-13757071
 ] 

Jason Lowe commented on HADOOP-9912:


+1 for a WebEx if we can reach a consensus faster that way.

Does anyone mind if we reopen and revert HADOOP-9877 while we determine what to 
do long term with globStatus/listStatus?  Replicated joins for Pig are dead in 
the water right now while we are hashing this out.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756985#comment-13756985
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

Would it be faster to hash this out on a WebEx?  I could organize one on 
Thursday afternoon or Friday.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757015#comment-13757015
 ] 

Andrew Wang commented on HADOOP-9912:
-

I don't mind Daryn's proposal of a {{resolveLink}} flag defaulting to true for 
{{globStatus}} and {{listStatus}}. It's a little gross since (as Eli noted) 
Hadoop 2 GA APIs should all be symlink-aware by default (and we should be 
allowed to break compat here), but compatibility is compatibility.

However, I don't want the {{isSymlink}} bit being set for a resolved 
{{FileStatus}}, since isSymlink/isDir/isFile should be exclusive properties. 
There isn't much you can do with the knowledge a FileStatus was reached through 
a symlink. Symlink-aware clients will just use {{resolveLink=false}} and walk 
it properly themselves.
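
A rough sketch of how such a flag could behave, again in Python on the local filesystem as an analogy rather than the actual Hadoop signature. The names {{kind_of}}, {{list_status}}, and {{resolve_links}} are hypothetical. Each entry is reported as exactly one kind, so the link/dir/file properties stay exclusive.

```python
import os
import stat


def kind_of(mode):
    """Map a stat mode to exactly one of 'link', 'dir', 'file', 'other'."""
    if stat.S_ISLNK(mode):
        return "link"
    if stat.S_ISDIR(mode):
        return "dir"
    if stat.S_ISREG(mode):
        return "file"
    return "other"


def list_status(directory, resolve_links=True):
    """List (name, kind) pairs for the entries of directory.

    With resolve_links=True (the compatible default) a link is reported as
    whatever it points to. With resolve_links=False a link is reported as
    'link' and the caller walks it itself.
    """
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        st = os.stat(path) if resolve_links else os.lstat(path)
        result.append((name, kind_of(st.st_mode)))
    return result
```

Dangling links are deliberately not handled here; with {{resolve_links=True}} they would raise.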



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756877#comment-13756877
 ] 

Jason Lowe commented on HADOOP-9912:


Thanks for chiming in, Eli.  I understand that there are going to be times when 
we can't avoid exposing symlinks to older clients, but we should try to avoid 
that when reasonable to do so.  I believe the common use-case for symlinks will 
be links within the same filesystem, and if listStatus proceeds to resolve 
symlinks that it can resolve then existing directory walkers should work as-is. 
 Using a symlink in HDFS to another directory in the same filesystem should 
"just work," but that's not going to be the case if listStatus behaves as it 
does today.

Is it unreasonable to have listStatus resolve symlinks and provide a separate 
API or flag for symlink-aware clients?  Understandably listStatus will still 
have to expose symlinks that cannot be resolved (e.g.: dangling links or links 
to permission-restricted areas), but that seems preferable to breaking most of 
the directory walking code built for FileSystem.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Colin Patrick McCabe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756934#comment-13756934
 ] 

Colin Patrick McCabe commented on HADOOP-9912:
--

The shell needs to be able to "see" symlinks, not just the files they point to. 
 However, we don't want to break Pig and other programs that are relying on the 
old behavior.

Jason wrote:
bq. Is it unreasonable to have listStatus resolve symlinks and provide a 
separate API or flag for symlink-aware clients?

I think you meant to say {{globStatus}}?  If so, I agree... let's add a flag to 
{{globStatus}}, {{resolveLinks}}, which controls whether symlinks are fully 
resolved.  We can default it to {{true}}, for maximum compatibility.

Daryn wrote:
bq. ... calls in Globber are trapping IOException and returning null. This 
unexpectedly causes file not found exceptions.

This is a separate topic.  Check out HADOOP-9929.  The solution to this is 
going to be complicated, since we probably want to throw an exception only 
when listing a single file rather than a glob.  (Imagine if your ls /* threw an 
exception because you lacked permission for one directory in / out of 1000... 
not good.)
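
One possible shape for that policy, sketched over a hypothetical in-memory tree rather than the real {{Globber}}: a literal path surfaces the error, while a wildcard skips matches it cannot read. {{TREE}}, {{list_dir}}, {{ls}}, and {{PermissionDenied}} are made up for the illustration, and only single-component patterns under "/" are supported.

```python
import fnmatch


class PermissionDenied(Exception):
    pass


# Hypothetical tree: directory path -> child names, or None if unreadable.
TREE = {
    "/": ["a", "b", "secret"],
    "/a": ["x"],
    "/b": ["y"],
    "/secret": None,
}


def list_dir(path):
    children = TREE[path]
    if children is None:
        raise PermissionDenied(path)
    return children


def has_wildcard(pattern):
    return any(ch in pattern for ch in "*?[")


def ls(pattern):
    """List the children of every top-level directory matching pattern."""
    if not has_wildcard(pattern):
        # A literal path was named explicitly, so let the error propagate.
        return {pattern: list_dir(pattern)}
    listing = {}
    for name in list_dir("/"):
        if fnmatch.fnmatchcase(name, pattern.lstrip("/")):
            path = "/" + name
            try:
                listing[path] = list_dir(path)
            except PermissionDenied:
                continue  # one unreadable match should not fail the whole glob
    return listing
```

With this tree, {{ls("/*")}} returns the listings for /a and /b only, while {{ls("/secret")}} raises.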



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Eli Collins (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756808#comment-13756808
 ] 

Eli Collins commented on HADOOP-9912:
-

bq. While I agree that from a purity perspective that returning the link status 
is arguably correct, in practice it's likely to break a lot of code if they try 
to use symlinks which will impede the use of symlinks.

It's not just a purity perspective. While we've attempted to make symlinks 
mostly transparent to users they are not going to be completely transparent. 
For example, in HADOOP-6585 we added isFile because some clients assume !isDir 
means the file status is a file, which was a valid assumption that's now 
broken. So while we tried to avoid this, using symlinks does introduce some 
incompatibilities that other frameworks and users need to be aware of that we 
are just not going to be able to hack around.

The challenge here is that for some users auto-resolving is not the right 
behavior, and the clients can't easily undo it (you might have hopped file 
systems). In which case you want to not resolve and modify clients to be 
symlink aware. The challenge here of course is keeping new programs working on 
older systems, which is the idea behind backporting symlinks to FileSystem - 
all v2 GA APIs should support symlinks.

listStatus is going to be inconsistent across HDFS and local file system 
because the local file system doesn't really implement symlinks (just passes 
through to the underlying file system).



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-09-03 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756725#comment-13756725
 ] 

Daryn Sharp commented on HADOOP-9912:
-

This {{Globber}} re-write is problematic, and the optimization we are talking 
about is very broken regardless of whether we call file or link status.  Both 
of these calls in {{Globber}} are trapping {{IOException}} and returning null.  
This unexpectedly causes file not found exceptions.

Let's say I have "/dir/noperms/file".  If I do a ls on /dir/noperms, I get a 
permission denied.  If I do ls on /dir/noperms/file I get no such file or 
directory.

If I do a ls on a standby, the {{StandbyException}} is trapped, and I get no 
such file or directory.
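
The failure mode can be shown with a small model. This is a sketch with hypothetical names ({{get_status}}, {{PermissionDenied}}), not the actual {{Globber}} code: catching every exception turns a permission error into "not there", whereas catching only the not-found case keeps the real cause visible.

```python
class PermissionDenied(Exception):
    pass


def get_status(path):
    """Hypothetical lookup: nothing below /dir/noperms can be traversed."""
    if path.startswith("/dir/noperms/"):
        raise PermissionDenied(path)
    if path in ("/dir", "/dir/noperms"):
        return {"path": path, "kind": "dir"}
    raise FileNotFoundError(path)


def status_or_none_swallowing(path):
    """The problematic pattern: every failure collapses into 'not there'."""
    try:
        return get_status(path)
    except Exception:
        return None


def status_or_none(path):
    """Only a genuinely missing path maps to None; other errors propagate."""
    try:
        return get_status(path)
    except FileNotFoundError:
        return None
```

In this model, {{status_or_none_swallowing("/dir/noperms/file")}} yields None, which a caller would report as no such file, while {{status_or_none}} raises {{PermissionDenied}} for the same path.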





[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-30 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754807#comment-13754807
 ] 

Binglin Chang commented on HADOOP-9912:
---

bq. If the NN resolves links it allows the user to unexpectedly access stuff 
outside the mount point.

Your comment reminds me of another potential issue with permission checking.
How should listStatus handle permission denied when some (but not all) of its 
entries are denied? getFileStatus doesn't have this issue because it either 
succeeds or fails as a whole.

I checked and found that it simply skips the symlink, which I don't think is 
right.

{code}
decster:~/hadoop> ll test/
total 8
drwxr-xr-x  2 decster  staff   68 Aug 30 23:29 aa
drwx------  3 root     staff  102 Aug 30 23:30 bb
lrwxr-xr-x  1 decster  staff    5 Aug 30 23:32 cc -> bb/cc
decster:~/hadoop> ll test/cc
lrwxr-xr-x  1 decster  staff  5 Aug 30 23:32 test/cc -> bb/cc
decster:~/hadoop> ll test/cc/bb
ls: test/cc/bb: Permission denied
decster:~/hadoop> bin/hadoop fs -ls file:///Users/decster/hadoop/test/
13/08/30 23:38:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwxr-xr-x   - decster staff         68 2013-08-30 23:29 file:///Users/decster/hadoop/test/aa
drwx------   - root    staff        102 2013-08-30 23:30 file:///Users/decster/hadoop/test/bb
{code}




[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-30 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754889#comment-13754889
 ] 

Binglin Chang commented on HADOOP-9912:
---

bq. Jason makes an excellent point about how the crux of the problem is 
listStatus combines both readdir + stat

In this case we really should follow Linux/BSD practice (which prevents most 
of the problematic behavior): implement a readdir primitive in FS, and make 
listStatus a non-primitive implemented on top of it. Currently there is no 
readdir primitive, so listStatus is treated and used as readdir because it is 
the only option.




[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-30 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754884#comment-13754884
 ] 

Daryn Sharp commented on HADOOP-9912:
-

To further clarify, the resolved link status would still have the isLink bit 
set in addition to the bit for whatever it resolved to.  This would allow 
FsShell's ls to check isLink, and issue getFileLinkStatus if necessary.  It's a 
smaller evil than inconveniencing all user code.







[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-30 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754874#comment-13754874
 ] 

Daryn Sharp commented on HADOOP-9912:
-

Yes, it's becoming clear this is a very complicated issue...  While I agree that 
from a purity perspective that returning the link status is arguably correct, 
in practice it's likely to break a lot of code if they try to use symlinks 
which will impede the use of symlinks.

The use case where user code cares if a path is symlink is probably nearly 
non-existent.  {{FsShell}} ls cares, but I question if it's reasonable to 
require additional overhead for the vast majority of use cases.  Ie. everyone 
that wants to do file/dir tests on directory contents will have to check 
{{isLink}}, issue another stat, and then re-test.

Jason and I spoke offline, and we have another proposal.  Return the resolved 
status if possible, else return the link status.  This makes sense for a 
dangling symlink because it's nothing but a symlink.  The permission denied 
scenario is a bit more ambiguous, but it still makes sense when the link target 
has path components after a no-permission path component, because you again 
don't know what it is.
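
A minimal sketch of that fallback rule, using local POSIX calls in Python as a stand-in for the Hadoop API. The function name and the returned pair are hypothetical: try the target first, and describe the link itself only when the target cannot be reached.

```python
import os


def status_preferring_target(path):
    """Return (stat_result, resolved).

    resolved is True when the target's status could be read, and False when
    the link's own status is returned instead.
    """
    try:
        return os.stat(path), True       # follows symlinks
    except OSError:
        # Dangling link, or an unreadable component on the way to the target:
        # fall back to describing the link itself.
        return os.lstat(path), False
```

If the path does not exist at all, {{os.lstat}} still raises, so a missing entry is not mistaken for a link.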



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-30 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754716#comment-13754716
 ] 

Daryn Sharp commented on HADOOP-9912:
-

There is of course the related problem that having the NN resolve any links is 
wrong.  Viewfs is essentially a chrooted mount.  If the NN resolves links it 
allows the user to unexpectedly access stuff outside the mount point.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-30 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754713#comment-13754713
 ] 

Daryn Sharp commented on HADOOP-9912:
-

Jason makes an excellent point about how the crux of the problem is 
{{listStatus}} combines both {{readdir}} + {{stat}}.

There may be another alternative to a new API call.  I haven't thought it 
through, but perhaps a middle-ground to address compatibility and to better 
handle symlinks is to return the resolved link's file status but to set the 
{{isLink}} bit in the status.  This allows the extremely few use cases that 
care about whether something is a symlink to call {{getFileLinkStatus}}.  It's 
an extra RPC, but probably 99% of the time the user doesn't care that something 
is a symlink.  The only "common" case is probably {{FSShell}} ls.
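
To make the idea concrete, here is a small sketch in Python on the local filesystem, with a hypothetical {{Status}} record standing in for {{FileStatus}}. The dir/file fields describe the resolved target, and a separate flag records that the path itself is a link.

```python
import os
import stat
from dataclasses import dataclass


@dataclass
class Status:
    is_dir: bool
    is_file: bool
    is_link: bool  # True if the path itself is a symlink, whatever it resolves to


def resolved_status_with_link_bit(path):
    link_st = os.lstat(path)     # describes the path itself
    target_st = os.stat(path)    # describes the target; raises for a dangling link
    return Status(
        is_dir=stat.S_ISDIR(target_st.st_mode),
        is_file=stat.S_ISREG(target_st.st_mode),
        is_link=stat.S_ISLNK(link_st.st_mode),
    )
```

Note that this makes {{is_link}} and {{is_dir}} true at the same time for a link to a directory, which is exactly the non-exclusive combination the proposal implies.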



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-30 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754695#comment-13754695
 ] 

Jason Lowe commented on HADOOP-9912:


bq. If you look at readdir as an example, it does not automatically dereference 
by default. Neither does ls, unless you use the -L flag on Linux. I think 
that's the expected default behavior, showing the actual contents of the 
directory. It's possible to build a directory walking program via the current 
listStatus, it just requires dereferencing any links to see if the target is a 
directory. This appears to be what ls -R does.

Thanks for the rationale, Andrew.  However I don't believe {{ls}} is a good 
example.  {{ls -l}} is symlink-aware and therefore expecting to find them.  If 
you strace it, you'll notice it's using {{getdents}}, {{lstat}}, and 
{{readlink}}.  We can't really look to POSIX for an equivalent, since 
listStatus is a combination of readdir *and* stat. The equivalent directory 
walker for POSIX calls readdir and then stat on each dir entry (not lstat, 
since it's not symlink-aware or wants to follow symlinks) to determine if each 
entry is another directory (because for POSIX, the type of directory entry is 
not included with the dirent).
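
The difference can be sketched with Python's thin wrappers over the same calls: {{os.listdir}} plays the role of readdir, and the walker picks {{os.stat}} or {{os.lstat}} per entry. The function and its {{follow_links}} parameter are hypothetical, and link cycles are not guarded against.

```python
import os
import stat


def walk_dirs(root, follow_links=True):
    """Yield every directory under root, classifying entries via stat or lstat."""
    status = os.stat if follow_links else os.lstat
    for name in sorted(os.listdir(root)):        # readdir: names only
        path = os.path.join(root, name)
        try:
            mode = status(path).st_mode          # stat follows links, lstat does not
        except OSError:
            continue                             # e.g. a dangling link
        if stat.S_ISDIR(mode):
            yield path
            yield from walk_dirs(path, follow_links)
```

With {{follow_links=True}} a symlink to a directory is descended into like any other directory; with {{follow_links=False}} the same entry is classified as a link and skipped, which is the behavior that surprises code written before symlinks existed.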

If listStatus is a combination of readdir and lstat then it breaks existing 
code that is not symlink-aware and expects isDir/isDirectory to return true for 
directories and isFile() to return true for files.  Lots of code has been 
written for FileSystem, and since FileSystem did not support symlinks until 
very recently, all of that code is not symlink-aware.  To make listStatus 
expose symlinks to those callers is going to be problematic, just as it is for 
Pig here.  That's why there are symlink-aware forms of stat calls so that code 
that desires to be aware of symlinks can detect them, and older code or code 
that just wants to follow them calls the original forms.

The proposed fix handles the issue for Pig with a local filesystem, but someone 
who uses Pig against an input directory that happens to be a symlink in HDFS is 
going to have the same issue.  My apologies if I'm missing something, but the 
more I think about it, the more I'm convinced that listStatus returning 
symlinks is not correct.  It's going to break existing code since almost all of 
that code is not expecting symlinks.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-30 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754540#comment-13754540
 ] 

Binglin Chang commented on HADOOP-9912:
---

Another reason for adding a new API is that listLinkStatus (for both RLFS and 
HDFS) is really useful in many cases. As Andrew said, the readdir API 
(Linux/BSD/Mac) follows the same rule, and FsShell ls can list symlinks 
(HDFS-4019) like the system ls command.


> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754436#comment-13754436
 ] 

Binglin Chang commented on HADOOP-9912:
---

Thanks Andrew for the explanation and proposal. Here is some additional input 
that I hope helps: 
 
bq. why did the behavior of globStatus and symlinks change with HADOOP-9877 
which appears to be a snapshot-related JIRA.

The reason is that listStatus can't return hidden directories, so 
getFileStatus/getFileLinkStatus had to be added to get the status of a hidden 
directory. Before this change globStatus used only listStatus, so there was no 
consistency issue within a single FS; the problem appeared once 
getFileStatus/getFileLinkStatus was added. 

Basically, hidden directories require getFileStatus/getFileLinkStatus and 
wildcards require listStatus, but:
In RLFS, listStatus resolves symlinks
In HDFS, listStatus doesn't resolve symlinks
If we use only one API, there is no inconsistency within one FS, but there is 
inconsistency across FSes. So the problem existed before HADOOP-9877 but was 
less serious; with HADOOP-9877, I broke consistency within RLFS in order to 
keep consistency in HDFS... Sorry for not realizing this problem earlier.

@[~andrew.wang]:
About the proposal, I think it is better to leave listStatus compatible (in 
both HDFS and RLFS, for cross-FS symlinks) and add a new listLinkStatus API. I 
guess symlink support (in both RLFS and HDFS) is not widely adopted yet, so 
adding the new feature in a new API makes sense.
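
To illustrate the proposal above, here is a minimal sketch of a resolving 
listStatus next to a non-resolving listLinkStatus. It runs against a toy 
in-memory namespace, not against Hadoop: the class name, the NODES and 
LINK_TARGETS maps, and both method bodies are assumptions made up for the 
example, and listLinkStatus is only the name suggested in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ListLinkStatusSketch {
    enum Kind { FILE, DIRECTORY, SYMLINK }

    // Toy namespace (an assumption, not Hadoop): path -> kind, link -> target.
    static final Map<String, Kind> NODES = new TreeMap<>();
    static final Map<String, String> LINK_TARGETS = new TreeMap<>();
    static {
        NODES.put("/in", Kind.DIRECTORY);
        NODES.put("/in/data", Kind.SYMLINK);
        NODES.put("/in/part-0", Kind.FILE);
        NODES.put("/store", Kind.DIRECTORY);
        LINK_TARGETS.put("/in/data", "/store");
    }

    /** Compatible call: a link entry reports the kind of its target. */
    static List<String> listStatus(String dir) {
        return list(dir, true);
    }

    /** Hypothetical new call: a link entry is reported as a link. */
    static List<String> listLinkStatus(String dir) {
        return list(dir, false);
    }

    private static List<String> list(String dir, boolean resolve) {
        List<String> out = new ArrayList<>();
        String prefix = dir + "/";
        for (Map.Entry<String, Kind> e : NODES.entrySet()) {
            String path = e.getKey();
            boolean directChild = path.startsWith(prefix)
                    && path.indexOf('/', prefix.length()) < 0;
            if (!directChild) {
                continue;
            }
            Kind kind = e.getValue();
            if (resolve && kind == Kind.SYMLINK) {
                String target = LINK_TARGETS.get(path);
                Kind targetKind = target == null ? null : NODES.get(target);
                if (targetKind != null) {
                    kind = targetKind; // a dangling link stays a SYMLINK
                }
            }
            out.add(path + ":" + kind); // the entry keeps the link's own path
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("listStatus:     " + listStatus("/in"));
        System.out.println("listLinkStatus: " + listLinkStatus("/in"));
    }
}
```

With two calls, callers that are not symlink-aware keep seeing a directory for 
/in/data, while callers that care about links can ask for them explicitly.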




> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754180#comment-13754180
 ] 

Andrew Wang commented on HADOOP-9912:
-

Daryn, Jason, thanks for the input:

bq. I need to look at this further but if DFS.listStatus isn't resolving, then 
we've got to think hard about the semantics of symlinks. 99% of the time, the 
user expects a symlink to be transparent.
bq. Aren't there separate calls if one wants to know the true details of a link 
rather than what the link references?

If you look at {{readdir}} as an example, it does not automatically dereference 
by default. Neither does {{ls}}, unless you use the {{-L}} flag on Linux. I 
think that's the expected default behavior, showing the actual contents of the 
directory. It's possible to build a directory walking program via the current 
{{listStatus}}, it just requires dereferencing any links to see if the target 
is a directory. This appears to be what {{ls -R}} does.

I think my proposal to fix RLFS still makes sense (let RLFS be inconsistent and 
compatible), and then we can think about adding a {{ls -L}} style convenience 
flag or a new call for auto-deref of listing and glob results.
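
As a rough illustration of the directory-walking point, the sketch below uses 
plain java.nio.file on the local filesystem rather than the Hadoop FileSystem 
API, so it is only an analogy. The class name, the demo() helper, and the 
temporary directory layout are assumptions for the example. The listing never 
dereferences entries; whether a link to a directory is traversed depends on 
whether the type check follows links.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class DerefWalkSketch {

    /** Collects regular files below dir as '/'-separated paths relative to root. */
    static void walk(Path root, Path dir, boolean deref, Set<Path> seen, Set<String> out)
            throws IOException {
        // Guard against symlink cycles by remembering each directory's real path.
        if (!seen.add(dir.toRealPath())) {
            return;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                // The listing itself does not dereference; the type check decides.
                boolean isDir = deref
                        ? Files.isDirectory(entry)
                        : Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS);
                boolean isFile = deref
                        ? Files.isRegularFile(entry)
                        : Files.isRegularFile(entry, LinkOption.NOFOLLOW_LINKS);
                if (isDir) {
                    walk(root, entry, deref, seen, out);
                } else if (isFile) {
                    out.add(root.relativize(entry).toString()
                            .replace(File.separatorChar, '/'));
                }
                // Without deref, a link to a directory is neither and is skipped.
            }
        }
    }

    /** Builds root/part-0 and root/link -> target/part-1, then walks root. */
    static List<String> demo(boolean deref) {
        try {
            Path root = Files.createTempDirectory("walk-root");
            Path target = Files.createTempDirectory("walk-target");
            Files.createFile(root.resolve("part-0"));
            Files.createFile(target.resolve("part-1"));
            Files.createSymbolicLink(root.resolve("link"), target);
            Set<String> out = new TreeSet<>();
            walk(root, root, deref, new HashSet<>(), out);
            return new ArrayList<>(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("no deref: " + demo(false));
        System.out.println("deref:    " + demo(true));
    }
}
```

The sketch needs a platform that allows creating symlinks, and it leaves its 
temporary directories behind.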

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754139#comment-13754139
 ] 

Jason Lowe commented on HADOOP-9912:


bq. In HDFS, listStatus only transparently resolves symlinks in the input path. 
It doesn't resolve the results of the listing, and this is the correct behavior.

Isn't that going to break clients who are not symlink-aware?  That means we 
can't have a tree of files with a symlink to a directory in it.  A 
symlink-unaware tree walker client will not realize that the symlink is 
actually pointing to a directory and should be traversed since the file status 
will say it's not a directory. That's what's happening with Pig now.  Aren't 
there separate calls if one wants to know the true details of a link rather 
than what the link references?

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754132#comment-13754132
 ] 

Daryn Sharp commented on HADOOP-9912:
-

I need to look at this further but if {{DFS.listStatus}} isn't resolving, then 
we've got to think hard about the semantics of symlinks.  99% of the time, the 
user expects a symlink to be transparent.

Lots of code uses {{listStatus}} or {{globStatus}} and expects to perform 
file/dir checks.  Now that code will be required to check if the path is a 
symlink, if yes, re-stat.  This will greatly inhibit the use of symlinks which 
is why I think a new api is required.  

Either way we go, we can't have the inconsistency I cited for how globbing is 
now returning different results based on whether the symlink was matched by a 
static or globbed path component.  It must always be a resolved status or an 
unresolved status.

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Andrew Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754115#comment-13754115
 ] 

Andrew Wang commented on HADOOP-9912:
-

Let's be constructive and figure out the right fix. Jason, thanks for the 
attached test case, that helped me understand the issue.

bq. listStatus resolves symlinks. globStatus is supposed to be equivalent to 
listStatus with wildcard support...Symlinks should be transparent to users 
unless they specifically want to know if a path is a symlink.

In HDFS, {{listStatus}} only transparently resolves symlinks in the input path. 
It doesn't resolve the results of the listing, and this is the correct 
behavior. {{globStatus}} behaves the same way, in that it returns FileStatuses 
for Paths that match the glob, and it doesn't resolve these results. You can 
(and should) see symlinks returned by listStatus and globStatus in HDFS.

I also wouldn't say {{globStatus}} is equivalent to {{listStatus}}, since it 
doesn't list directories. If you want listStatus with matching, you can use 
{{listStatus(Path, PathFilter)}}.

In RLFS there is automatic symlink resolution, so {{listStatus}} results are 
resolved, and it seems like Pig depends on this behavior. Because of 
HADOOP-9877, {{globStatus}} went from always calling {{listStatus}} to calling 
{{getFileLinkStatus}} for non-wildcard glob components. Thus, when passed a 
{{Path}} that's a symlink, {{globStatus}} says it's a symlink.

bq. Why does .snapshot support require a getFileLinkStatus? Does getFileStatus 
not work for a .snapshot directory?

It does work, but it's incorrect. globStatus is not supposed to return resolved 
statuses. It's unfortunate that RLFS has been auto-resolving all this time, but 
since apps apparently depend on it, all we can do is embrace it.

How about this: we add a fixup step that, for symlink results on a 
LocalFileSystem, resolves them (but still keeping the link path). This means no 
more symlinks in RLFS {{globStatus}} results. It's a bit obnoxious to do 
(globStatus could symlink through HDFS to a link on a local filesystem), but it 
seems like a reasonable solution.
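
The fixup step could look roughly like the sketch below. It works on a toy 
in-memory namespace, not on Hadoop's FileStatus or Globber: the Status class, 
the maps, the hop limit of 32, and the demo() helper are all assumptions for 
the example, and the sketch skips the check that a result lives on a 
LocalFileSystem.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GlobFixupSketch {
    enum Kind { FILE, DIRECTORY, SYMLINK }

    /** Toy stand-in for a file status: a path plus what the path is. */
    static final class Status {
        final String path;
        final Kind kind;
        Status(String path, Kind kind) {
            this.path = path;
            this.kind = kind;
        }
        @Override public String toString() {
            return path + ":" + kind;
        }
    }

    // Toy namespace (an assumption, not Hadoop): path -> kind, link -> target.
    static final Map<String, Kind> NODES = new HashMap<>();
    static final Map<String, String> LINK_TARGETS = new HashMap<>();
    static {
        NODES.put("/in/link", Kind.SYMLINK);
        NODES.put("/in/part-0", Kind.FILE);
        NODES.put("/mid", Kind.SYMLINK);
        NODES.put("/real", Kind.DIRECTORY);
        LINK_TARGETS.put("/in/link", "/mid"); // a two-hop chain
        LINK_TARGETS.put("/mid", "/real");
    }

    /** Replaces symlink results with the target's kind, keeping the link path. */
    static List<Status> fixup(List<Status> results) {
        List<Status> fixed = new ArrayList<>();
        for (Status s : results) {
            String cursor = s.path;
            Kind kind = s.kind;
            // Follow the chain with a hop limit so a link cycle cannot loop forever.
            for (int hops = 0; kind == Kind.SYMLINK && hops < 32; hops++) {
                cursor = LINK_TARGETS.get(cursor);
                kind = cursor == null ? null : NODES.get(cursor);
            }
            // Dangling or cyclic links are passed through unchanged.
            boolean resolved = kind != null && kind != Kind.SYMLINK;
            fixed.add(resolved ? new Status(s.path, kind) : s);
        }
        return fixed;
    }

    /** Runs the fixup over a small glob result containing one link. */
    static List<Status> demo() {
        return fixup(Arrays.asList(
                new Status("/in/link", Kind.SYMLINK),
                new Status("/in/part-0", Kind.FILE)));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```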

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754053#comment-13754053
 ] 

Jason Lowe commented on HADOOP-9912:


bq. This issue is not related to .snapshot support, this issue is caused by add 
symlink support to HDFS and LocalFileSystem but not handle consistency well.

If this has nothing to do with snapshot support, then why did the behavior of 
globStatus and symlinks change with HADOOP-9877 which appears to be a 
snapshot-related JIRA?

listStatus needs to follow symlinks, even in the HDFS case, otherwise symlinks 
are not very useful.  If symlinks never auto-resolve, then every client will 
have to be symlink-aware and manually resolve the link for the symlink feature 
to be useful in practice.

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753707#comment-13753707
 ] 

Binglin Chang commented on HADOOP-9912:
---

Just checked again: 
In LocalFileSystem, listStatus resolves symlinks.
In HDFS, listStatus does not resolve symlinks.
I did find this conflict when I was working on HADOOP-9877; I followed the HDFS 
convention and used getFileLinkStatus.

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753679#comment-13753679
 ] 

Binglin Chang commented on HADOOP-9912:
---

This issue is not related to .snapshot support; it is caused by adding 
symlink support to HDFS and LocalFileSystem without handling consistency well.

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753666#comment-13753666
 ] 

Binglin Chang commented on HADOOP-9912:
---

@Daryn I am confused. I originally used getFileStatus and later changed it; please see 
[this 
comment|https://issues.apache.org/jira/browse/HADOOP-9877?focusedCommentId=13741497&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13741497]

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-29 Thread Daryn Sharp (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753620#comment-13753620
 ] 

Daryn Sharp commented on HADOOP-9912:
-

bq. The intended behavior of Globber.glob (which calls listStatus) is to return 
symlink rather than symlink target I believe

bq. I guess for a long time, pig is using this behavior(listStatus return 
symlink target rather than symlink), I am afraid this behavior is wrong and is 
inconsistent with HDFS. 

Wrong. Wrong. Wrong.  {{listStatus}} resolves symlinks.  {{globStatus}} is 
supposed to be equivalent to {{listStatus}} with wildcard support.  All 
existing code depends on these semantics, and rightly so.  Symlinks should be 
transparent to users unless they specifically want to know if a path is a 
symlink.  That's why there is a counterpart to {{getFileStatus}} called 
{{getFileLinkStatus}} which does not resolve symlinks.

HADOOP-9877 fundamentally broke the semantics of {{globStatus}} based on 
whether the last path component is a glob or static.  The result is:
* /path/symlink - the static component "symlink" results in a file status of 
the symlink, breaking isFile/isDir/etc
* /path/sym*link - the glob component "sym*link" returns the file status of the 
resolved link, working as expected

{{globStatus}} _must_ consistently return resolved paths.  The semantics 
altered by HADOOP-9877 will break lots of code.  I'm pretty sure that includes 
{{FsShell}}.  We cannot break long-standing semantics just for snapshots.

Why does .snapshot support require a {{getFileLinkStatus}}?  Does 
{{getFileStatus}} not work for a .snapshot directory?
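
The static-versus-glob inconsistency described above can be reproduced with a 
small model. This is a toy in-memory namespace, not Hadoop's Globber: the 
class name, the maps, and globLast (which globs only the last path component 
and treats '*' as the only wildcard) are assumptions for the example. A 
static component is looked up with a non-resolving call, while a wildcard 
component goes through a resolving listing, as reported in this thread for 
the local filesystem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class GlobInconsistencySketch {
    enum Kind { FILE, DIRECTORY, SYMLINK }

    // Toy namespace (an assumption, not Hadoop): path -> kind, link -> target.
    static final Map<String, Kind> NODES = new TreeMap<>();
    static final Map<String, String> LINK_TARGETS = new TreeMap<>();
    static {
        NODES.put("/path", Kind.DIRECTORY);
        NODES.put("/path/symlink", Kind.SYMLINK);
        NODES.put("/target", Kind.DIRECTORY);
        LINK_TARGETS.put("/path/symlink", "/target");
    }

    /** Non-resolving lookup: a link is reported as a link. */
    static Kind linkStatus(String path) {
        return NODES.get(path);
    }

    /** Resolving lookup: a link is reported as the kind of its target. */
    static Kind status(String path) {
        Kind kind = NODES.get(path);
        if (kind != Kind.SYMLINK) {
            return kind;
        }
        String target = LINK_TARGETS.get(path);
        return target == null ? null : NODES.get(target);
    }

    /** Globs the last component of parent/pattern and returns "path:KIND" entries. */
    static List<String> globLast(String parent, String pattern) {
        List<String> out = new ArrayList<>();
        if (!pattern.contains("*")) {
            // Static component: direct lookup with the non-resolving call.
            Kind kind = linkStatus(parent + "/" + pattern);
            if (kind != null) {
                out.add(parent + "/" + pattern + ":" + kind);
            }
            return out;
        }
        // Wildcard component: list the parent with resolved entries, then match names.
        Pattern regex = Pattern.compile(Pattern.quote(pattern).replace("*", "\\E.*\\Q"));
        String prefix = parent + "/";
        for (String path : NODES.keySet()) {
            boolean directChild = path.startsWith(prefix)
                    && path.indexOf('/', prefix.length()) < 0;
            if (directChild && regex.matcher(path.substring(prefix.length())).matches()) {
                Kind kind = status(path);
                if (kind != null) {
                    out.add(path + ":" + kind);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(globLast("/path", "symlink"));
        System.out.println(globLast("/path", "sym*link"));
    }
}
```

Both patterns match the same entry, yet the reported kind differs, which is 
why a caller's isFile/isDir checks can break depending on how the path was 
written.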

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-28 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753223#comment-13753223
 ] 

Binglin Chang commented on HADOOP-9912:
---

After analyzing the code, here is what happens:

1. I believe the intended behavior of Globber.glob (which calls listStatus) is 
to return the symlink rather than the symlink target. glob and listStatus for 
HDFS follow this rule, so glob and listStatus for RawLocalFileSystem should 
follow it as well. But because Java lacks symlink support, listStatus for 
RawLocalFileSystem lists symlink targets (o.a.h.fs.Stat only fixes 
getFileStatus and getFileLinkStatus, not listStatus). 

2. I guess Pig has relied on this behavior (listStatus returning the symlink 
target rather than the symlink) for a long time. I am afraid this behavior is 
wrong and inconsistent with HDFS. I guess our only choices are to keep the old 
behavior (leaving listStatus inconsistent across FileSystems), adopt the new 
behavior, or perhaps add a new interface: listLinkStatus vs. listStatus...

3. The test succeeds on Mac because o.a.h.fs.Stat currently doesn't support 
Mac, and the old implementation doesn't support symlinks very well.

 

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-28 Thread Binglin Chang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753083#comment-13753083
 ] 

Binglin Chang commented on HADOOP-9912:
---

Weird, the test passed on my laptop (on Mac OS X). I will look into it more.

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
> Attachments: HADOOP-9912-testcase.patch
>
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.



[jira] [Commented] (HADOOP-9912) globStatus of a symlink to a directory does not report symlink as a directory

2013-08-28 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752848#comment-13752848
 ] 

Rohini Palaniswamy commented on HADOOP-9912:


Replicated joins in Pig are broken by this. We make 
FileSystem.listStatus() calls in Pig and also use FileInputFormat, which calls 
listStatus as well. 

> globStatus of a symlink to a directory does not report symlink as a directory
> -
>
> Key: HADOOP-9912
> URL: https://issues.apache.org/jira/browse/HADOOP-9912
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: fs
>Affects Versions: 2.3.0
>Reporter: Jason Lowe
>Priority: Blocker
>
> globStatus for a path that is a symlink to a directory used to report the 
> resulting FileStatus as a directory but recently this has changed.
