Re: DISCUSS: is the order in FS.listStatus() required to be sorted?

2015-06-16 Thread Colin McCabe
On Tue, Jun 16, 2015 at 3:02 AM, Steve Loughran ste...@hortonworks.com wrote:

 On 15 Jun 2015, at 21:22, Colin P. McCabe cmcc...@apache.org wrote:

 One possibility is that we could randomize the order of returned
 results in HDFS (at least within a given batch of results returned
 from the NN).  This is similar to how the Go programming language
 randomizes the order of iteration over hash table keys, to avoid code
 being written which relies on a specific implementation-defined
 ordering.

 Regardless of whether we do that, though, there is a bunch of code
 even in Hadoop common that doesn't properly deal with unsorted
 listStatus / globStatus... such as hadoop fs -ls

 something we could make an option for tests...be fun to see what happens. I 
 wouldn't inflict it on production, as people would only hate us for breaking 
 things. Again

Well, we do inflict it on production.  LocalFileSystem has always
returned unsorted results.  And most stuff that works with HDFS is
capable of running against LocalFileSystem.

Colin


Re: DISCUSS: is the order in FS.listStatus() required to be sorted?

2015-06-16 Thread Steve Loughran

 On 15 Jun 2015, at 21:22, Colin P. McCabe cmcc...@apache.org wrote:
 
 One possibility is that we could randomize the order of returned
 results in HDFS (at least within a given batch of results returned
 from the NN).  This is similar to how the Go programming language
 randomizes the order of iteration over hash table keys, to avoid code
 being written which relies on a specific implementation-defined
 ordering.
 
 Regardless of whether we do that, though, there is a bunch of code
 even in Hadoop common that doesn't properly deal with unsorted
 listStatus / globStatus... such as hadoop fs -ls

something we could make an option for tests...be fun to see what happens. I 
wouldn't inflict it on production, as people would only hate us for breaking 
things. Again


Re: DISCUSS: is the order in FS.listStatus() required to be sorted?

2015-06-15 Thread Colin P. McCabe
On Mon, Jun 1, 2015 at 3:21 AM, Steve Loughran ste...@hortonworks.com wrote:

 HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches the 
 FS javadoc and contract tests to say the order you get things back from a 
 listStatus() isn't guaranteed to be alphanumerically sorted

 That's one of those assumptions which we all have, but which, when you think 
 about it, doesn't have to be guaranteed.

 I'm going to commit the patch with the updated docs. Before I do that, does 
 anyone have any objection -that is, is there some fundamental constraint 
 which requires it to come back sorted? Such as the FS APIs and other apps 
 which do expect that sorting, and which are going to break if the rules 
 change? If so, they may need to be looked at.

 -Steve

We had a discussion about this on HADOOP-10798.  Although HDFS always
returns listStatus results in alphabetically sorted order as a side
effect of its implementation, the local filesystem does not return
things in alphabetically sorted order.

I think it's fine in principle to specify that listStatus returns
things in undefined order.  After all, as Allen mentioned, this is
what POSIX does.  I do think that in practice, this will result in a
lot of HDFS-only code getting written where there is a hidden
assumption that listStatus, globStatus, etc. sort their responses.
This might make portability more difficult.

I'm not sure if there is a good way around this problem.  Requiring
results to be returned in sorted order would be really harmful to
performance for things like Ceph and Lustre-- we'd essentially be
forcing a ton of client-side buffering and a sort.  But having HDFS do
sorted order and other FSes not do it would certainly make portability
more difficult.

One possibility is that we could randomize the order of returned
results in HDFS (at least within a given batch of results returned
from the NN).  This is similar to how the Go programming language
randomizes the order of iteration over hash table keys, to avoid code
being written which relies on a specific implementation-defined
ordering.

Regardless of whether we do that, though, there is a bunch of code
even in Hadoop common that doesn't properly deal with unsorted
listStatus / globStatus... such as hadoop fs -ls

cheers,
Colin
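
A minimal sketch of the randomization idea above, assuming a batch of
listing results is already held in memory as a list. The class and method
names (ShuffledListing, shuffleBatch) are hypothetical and are not part of
Hadoop; the sketch only shows the shuffle itself, not where the NN or the
client would apply it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Hadoop: shuffles one batch of listing
// results so that callers cannot come to depend on a sorted order.
final class ShuffledListing {
    private ShuffledListing() {}

    // Returns a new list holding the same entries in a random order.
    // The input batch itself is left untouched.
    static <T> List<T> shuffleBatch(List<T> batch, Random rng) {
        List<T> copy = new ArrayList<>(batch);
        Collections.shuffle(copy, rng);
        return copy;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("part-00000", "part-00001", "part-00002");
        // The printed order varies from run to run.
        System.out.println(shuffleBatch(batch, new Random()));
    }
}
```

Passing the Random in explicitly lets a test fix the seed and reproduce a
failure caused by a particular ordering.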


Re: DISCUSS: is the order in FS.listStatus() required to be sorted?

2015-06-01 Thread Vinayakumar B
I think the patch only updates the docs for now; it does not change any
code in a way that would affect existing usage.

Sorting depends on the underlying implementation.

The Linux *ls* implementation returns an alphanumerically sorted listing by
default (the current implementation may have taken its default sort from
there; I am just guessing), but it has options to sort on other
attributes.

Java's *File.listFiles()* javadoc specifies as follows: *There is no
guarantee that the name strings in the resulting array will appear in any
specific order; they are not, in particular, guaranteed to appear in
alphabetical order.*
So the current change is in line with Java's own file API at least.

So IMO, it's fine to commit the javadoc update.

-Vinay

On Mon, Jun 1, 2015 at 3:51 PM, Steve Loughran ste...@hortonworks.com
wrote:


 HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches
 the FS javadoc and contract tests to say the order you get things back
 from a listStatus() isn't guaranteed to be alphanumerically sorted

 That's one of those assumptions which we all have, but which, when you
 think about it, doesn't have to be guaranteed.

 I'm going to commit the patch with the updated docs. Before I do that,
 does anyone have any objection -that is, is there some fundamental
 constraint which requires it to come back sorted? Such as the FS APIs and
 other apps which do expect that sorting, and which are going to break if
 the rules change? If so, they may need to be looked at.

 -Steve
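
Given the File.listFiles() javadoc quoted in the reply above, a caller
that needs a stable order has to sort the result itself. The following is
a minimal sketch using only java.io.File; the class and method names
(SortedLocalListing, listSorted) are made up for illustration.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper: lists a local directory and sorts the entries by
// name, because File.listFiles() makes no promise about ordering.
final class SortedLocalListing {
    private SortedLocalListing() {}

    static File[] listSorted(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            // listFiles() returns null if dir is not a directory
            // or an I/O error occurs.
            return new File[0];
        }
        // String natural order (by UTF-16 code unit), not locale collation.
        Arrays.sort(entries, Comparator.comparing(File::getName));
        return entries;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File f : listSorted(dir)) {
            System.out.println(f.getName());
        }
    }
}
```

The sort happens on the caller's side, so code written this way does not
depend on which filesystem produced the listing.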



DISCUSS: is the order in FS.listStatus() required to be sorted?

2015-06-01 Thread Steve Loughran

HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches the 
FS javadoc and contract tests to say the order you get things back from a 
listStatus() isn't guaranteed to be alphanumerically sorted

That's one of those assumptions which we all have, but which, when you think 
about it, doesn't have to be guaranteed.

I'm going to commit the patch with the updated docs. Before I do that, does 
anyone have any objection -that is, is there some fundamental constraint which 
requires it to come back sorted? Such as the FS APIs and other apps which do 
expect that sorting, and which are going to break if the rules change? If so, 
they may need to be looked at.

-Steve
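
For tests that currently compare a listing against an expected array
position by position, an order-insensitive comparison is one way to stop
depending on sort order. This is only a sketch of the idea, not the code
in the HADOOP-12009 patch; ListingAssertions and sameEntries are
hypothetical names, and plain strings stand in for file status objects.

```java
import java.util.Arrays;

// Hypothetical test helper: compares two listings by content only.
final class ListingAssertions {
    private ListingAssertions() {}

    // True when both arrays hold the same names (duplicates included),
    // regardless of the order in which they were returned.
    static boolean sameEntries(String[] expected, String[] actual) {
        if (expected.length != actual.length) {
            return false;
        }
        // Sort copies so the caller's arrays are not reordered.
        String[] e = expected.clone();
        String[] a = actual.clone();
        Arrays.sort(e);
        Arrays.sort(a);
        return Arrays.equals(e, a);
    }

    public static void main(String[] args) {
        String[] expected = {"a", "b", "c"};
        String[] actual = {"c", "a", "b"};
        System.out.println(sameEntries(expected, actual));
    }
}
```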


Re: DISCUSS: is the order in FS.listStatus() required to be sorted?

2015-06-01 Thread Allen Wittenauer

The POSIX spec for readdir 
(http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html) doesn’t 
spell out a sort order, so it should be assumed that the ordering isn’t 
guaranteed.

Chris Siebenmann has written a few relevant blog posts on the topic 
that might be of interest here:

* https://utcc.utoronto.ca/~cks/space/blog/unix/ReaddirHistory
* https://utcc.utoronto.ca/~cks/space/blog/unix/ReaddirOrder

So I think it’s OK to break the _API_ here ...

** HOWEVER **

POSIX ls 
(http://pubs.opengroup.org/onlinepubs/95399/utilities/ls.html) DOES require 
its output be sorted.  So breaking the sort order of 'hadoop fs -ls' would be 
*extremely* bad.  We need to make sure that doesn’t change.

On Jun 1, 2015, at 4:11 AM, Vinayakumar B vinayakum...@apache.org wrote:

 I think the patch only updates the docs for now; it does not change any
 code in a way that would affect existing usage.
 
 Sorting depends on the underlying implementation.
 
 The Linux *ls* implementation returns an alphanumerically sorted listing by
 default (the current implementation may have taken its default sort from
 there; I am just guessing), but it has options to sort on other
 attributes.
 
 Java's *File.listFiles()* javadoc specifies as follows: *There is no
 guarantee that the name strings in the resulting array will appear in any
 specific order; they are not, in particular, guaranteed to appear in
 alphabetical order.*
 So the current change is in line with Java's own file API at least.
 
 So IMO, it's fine to commit the javadoc update.
 
 -Vinay
 
 On Mon, Jun 1, 2015 at 3:51 PM, Steve Loughran ste...@hortonworks.com
 wrote:
 
 
 HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches
 the FS javadoc and contract tests to say the order you get things back
 from a listStatus() isn't guaranteed to be alphanumerically sorted
 
 That's one of those assumptions which we all have, but which, when you
 think about it, doesn't have to be guaranteed.
 
 I'm going to commit the patch with the updated docs. Before I do that,
 does anyone have any objection -that is, is there some fundamental
 constraint which requires it to come back sorted? Such as the FS APIs and
 other apps which do expect that sorting, and which are going to break if
 the rules change? If so, they may need to be looked at.
 
 -Steve