Re: DISCUSS: is the order in FS.listStatus() required to be sorted?
On Tue, Jun 16, 2015 at 3:02 AM, Steve Loughran ste...@hortonworks.com wrote:
> On 15 Jun 2015, at 21:22, Colin P. McCabe cmcc...@apache.org wrote:
>> One possibility is that we could randomize the order of returned results in HDFS (at least within a given batch of results returned from the NN).
>> This is similar to how the Go programming language randomizes the order of iteration over hash table keys, to avoid code being written which relies on a specific implementation-defined ordering.
>> Regardless of whether we do that, though, there is a bunch of code even in Hadoop common that doesn't properly deal with unsorted listStatus / globStatus... such as hadoop fs -ls
>
> something we could make an option for tests...be fun to see what happens. I wouldn't inflict it on production, as people would only hate us for breaking things. Again

Well, we do inflict it on production. LocalFileSystem has always returned unsorted results. And most stuff that works with HDFS is capable of running against LocalFileSystem.

Colin
Re: DISCUSS: is the order in FS.listStatus() required to be sorted?
On 15 Jun 2015, at 21:22, Colin P. McCabe cmcc...@apache.org wrote:
> One possibility is that we could randomize the order of returned results in HDFS (at least within a given batch of results returned from the NN).
> This is similar to how the Go programming language randomizes the order of iteration over hash table keys, to avoid code being written which relies on a specific implementation-defined ordering.
> Regardless of whether we do that, though, there is a bunch of code even in Hadoop common that doesn't properly deal with unsorted listStatus / globStatus... such as hadoop fs -ls

something we could make an option for tests...be fun to see what happens. I wouldn't inflict it on production, as people would only hate us for breaking things. Again
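To make the test-only idea concrete, here is a minimal sketch of shuffling one batch of listing results before handing it to the caller, so that code with a hidden ordering assumption fails in tests rather than on another filesystem. The class name ShuffledListing and the helper shuffleBatch are hypothetical and not part of Hadoop; the sketch uses plain strings in place of FileStatus entries and assumes nothing about how the NN batches results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not Hadoop code.
public class ShuffledListing {

    /** Returns a copy of one batch of results in random order; the input list is not modified. */
    static <T> List<T> shuffleBatch(List<T> batch, Random random) {
        List<T> copy = new ArrayList<>(batch);
        Collections.shuffle(copy, random);
        return copy;
    }

    public static void main(String[] args) {
        // Stand-in for one batch of names that a sorted listing would return.
        List<String> batch = Arrays.asList("part-00000", "part-00001", "part-00002", "part-00003");

        List<String> shuffled = shuffleBatch(batch, new Random());
        System.out.println("as returned:         " + shuffled);

        // A caller that needs a particular order sorts explicitly.
        List<String> sorted = new ArrayList<>(shuffled);
        Collections.sort(sorted);
        System.out.println("after explicit sort: " + sorted);
    }
}
```

Passing a seeded Random would make a failing test reproducible, which matters if this is only ever switched on in test runs.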
Re: DISCUSS: is the order in FS.listStatus() required to be sorted?
On Mon, Jun 1, 2015 at 3:21 AM, Steve Loughran ste...@hortonworks.com wrote:
> HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches the FS javadoc and contract tests to say the order you get things back from a listStatus() isn't guaranteed to be alphanumerically sorted.
> That's one of those assumptions which we all have, but which, when you think about it, doesn't have to be guaranteed.
> I'm going to commit the patch with the updated docs. Before I do that, does anyone have any objection -that is, is there some fundamental constraint which requires it to come back sorted? Such as the FS APIs and other apps which do expect that sorting, and which are going to break if the rules change? If so, they may need to be looked at.
> -Steve

We had a discussion about this on HADOOP-10798. Although HDFS always returns listStatus results in alphabetically sorted order because of implementation issues, the local filesystem does not return things in alphabetically sorted order.

I think it's fine in principle to specify that listStatus returns things in undefined order. After all, as Allen mentioned, this is what POSIX does. I do think that in practice, this will result in a lot of HDFS-only code getting written where there is a hidden assumption that listStatus, globStatus, etc. sort their responses. This might make portability more difficult.

I'm not sure if there is a good way around this problem. Requiring results to be returned in sorted order would be really harmful to performance for things like Ceph and Lustre-- we'd essentially be forcing a ton of client-side buffering and a sort. But having HDFS do sorted order and other FSes not do it would certainly make portability more difficult.

One possibility is that we could randomize the order of returned results in HDFS (at least within a given batch of results returned from the NN). This is similar to how the Go programming language randomizes the order of iteration over hash table keys, to avoid code being written which relies on a specific implementation-defined ordering.

Regardless of whether we do that, though, there is a bunch of code even in Hadoop common that doesn't properly deal with unsorted listStatus / globStatus... such as hadoop fs -ls

cheers,
Colin
Re: DISCUSS: is the order in FS.listStatus() required to be sorted?
I think the patch just updates the doc as of now, not changing any code to affect the existing usage.

Sorting depends on the underlying implementations. The Linux ls implementation returns alphanumerically sorted output by default (the current implementation might have taken its default sorting from there, just guessing...), but it has other options to sort on different attributes.

Java's File.listFiles() javadoc specifies as follows:

"There is no guarantee that the name strings in the resulting array will appear in any specific order; they are not, in particular, guaranteed to appear in alphabetical order."

So the current change is in line with Java's FileSystem API at least. So IMO, it's fine to commit the javadoc update.

-Vinay

On Mon, Jun 1, 2015 at 3:51 PM, Steve Loughran ste...@hortonworks.com wrote:
> HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches the FS javadoc and contract tests to say the order you get things back from a listStatus() isn't guaranteed to be alphanumerically sorted.
> That's one of those assumptions which we all have, but which, when you think about it, doesn't have to be guaranteed.
> I'm going to commit the patch with the updated docs. Before I do that, does anyone have any objection -that is, is there some fundamental constraint which requires it to come back sorted? Such as the FS APIs and other apps which do expect that sorting, and which are going to break if the rules change? If so, they may need to be looked at.
> -Steve
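Since File.listFiles() makes no ordering promise, a caller that needs a particular order has to impose it. Below is a minimal sketch of that pattern using only java.io.File; the class name SortedListing and the helper listSorted are made up for illustration. The same approach (list, then sort with an explicit comparator) is presumably what code calling listStatus() would need, though that is an assumption here rather than something the patch prescribes.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration only; not Hadoop code.
public class SortedListing {

    /** Lists the children of dir and sorts them by name instead of trusting the returned order. */
    static File[] listSorted(File dir) throws IOException {
        File[] children = dir.listFiles();
        if (children == null) {
            // listFiles() returns null if dir is not a directory or an I/O error occurs.
            throw new IOException("Cannot list " + dir);
        }
        Arrays.sort(children, Comparator.comparing(File::getName));
        return children;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("listing-demo").toFile();
        for (String name : new String[] {"c.txt", "a.txt", "b.txt"}) {
            new File(dir, name).createNewFile();
        }
        // Prints a.txt, b.txt, c.txt regardless of the order the OS returns.
        for (File f : listSorted(dir)) {
            System.out.println(f.getName());
            f.delete();
        }
        dir.delete();
    }
}
```

The sort happens on the caller's side, so filesystems that cannot cheaply return sorted results only pay for it when a caller actually asks.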
DISCUSS: is the order in FS.listStatus() required to be sorted?
HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches the FS javadoc and contract tests to say the order you get things back from a listStatus() isn't guaranteed to be alphanumerically sorted.

That's one of those assumptions which we all have, but which, when you think about it, doesn't have to be guaranteed.

I'm going to commit the patch with the updated docs. Before I do that, does anyone have any objection -that is, is there some fundamental constraint which requires it to come back sorted? Such as the FS APIs and other apps which do expect that sorting, and which are going to break if the rules change? If so, they may need to be looked at.

-Steve
Re: DISCUSS: is the order in FS.listStatus() required to be sorted?
The POSIX spec for readdir (http://pubs.opengroup.org/onlinepubs/009695399/functions/readdir.html) doesn’t spell out a sort order, so it should be assumed that the ordering isn’t guaranteed. Chris Siebenmann has written a few relevant blog posts on the topic that might be of interest here:

* https://utcc.utoronto.ca/~cks/space/blog/unix/ReaddirHistory
* https://utcc.utoronto.ca/~cks/space/blog/unix/ReaddirOrder

So I think it’s OK to break the _API_ here ...

** HOWEVER **

POSIX ls (http://pubs.opengroup.org/onlinepubs/95399/utilities/ls.html) DOES require its output be sorted. So breaking the sort order of 'hadoop fs -ls' would be *extremely* bad. We need to make sure that doesn’t change.

On Jun 1, 2015, at 4:11 AM, Vinayakumar B vinayakum...@apache.org wrote:
> I think the patch just updates the doc as of now, not changing any code to affect the existing usage.
> Sorting depends on the underlying implementations. The Linux ls implementation returns alphanumerically sorted output by default (the current implementation might have taken its default sorting from there, just guessing...), but it has other options to sort on different attributes.
> Java's File.listFiles() javadoc specifies as follows:
> "There is no guarantee that the name strings in the resulting array will appear in any specific order; they are not, in particular, guaranteed to appear in alphabetical order."
> So the current change is in line with Java's FileSystem API at least. So IMO, it's fine to commit the javadoc update.
> -Vinay
>
> On Mon, Jun 1, 2015 at 3:51 PM, Steve Loughran ste...@hortonworks.com wrote:
>> HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches the FS javadoc and contract tests to say the order you get things back from a listStatus() isn't guaranteed to be alphanumerically sorted.
>> That's one of those assumptions which we all have, but which, when you think about it, doesn't have to be guaranteed.
>> I'm going to commit the patch with the updated docs. Before I do that, does anyone have any objection -that is, is there some fundamental constraint which requires it to come back sorted? Such as the FS APIs and other apps which do expect that sorting, and which are going to break if the rules change? If so, they may need to be looked at.
>> -Steve