On Mon, Jun 1, 2015 at 3:21 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches the
> FS javadoc and contract tests to say "the order you get things back from a
> listStatus() isn't guaranteed to be alphanumerically sorted"
>
> That's one of those assumptions which we all have, but which, when you think
> about it, doesn't have to be guaranteed.
>
> I'm going to commit the patch with the updated docs. Before I do that, does
> anyone have any objection -that is, is there some fundamental constraint
> which requires it to come back sorted? Such as the FS APIs and other apps
> which do expect that sorting, and which are going to break if the rules
> change? If so, they may need to be looked at.
>
> -Steve

We had a discussion about this on HADOOP-10798. Although HDFS always returns listStatus results in alphabetically sorted order as a side effect of how it is implemented, the local filesystem does not.

I think it's fine in principle to specify that listStatus returns things in undefined order. After all, as Allen mentioned, this is what POSIX does. I do think that in practice, this will result in a lot of HDFS-only code getting written where there is a hidden assumption that listStatus, globStatus, etc. sort their responses. This might make portability more difficult. I'm not sure if there is a good way around this problem.

Requiring results to be returned in sorted order would be really harmful to performance for things like Ceph and Lustre: we'd essentially be forcing a ton of client-side buffering and a sort. But having HDFS return sorted results while other FSes do not would certainly make portability more difficult.

One possibility is that we could randomize the order of returned results in HDFS (at least within a given batch of results returned from the NN). This is similar to how the Go programming language randomizes the order of iteration over hash table keys, to avoid code being written which relies on a specific implementation-defined ordering.

Regardless of whether we do that, though, there is a bunch of code even in Hadoop common that doesn't properly deal with unsorted listStatus / globStatus results, such as "hadoop fs -ls".

cheers,
Colin
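
For illustration only, here is a minimal sketch of what "properly dealing with" an unsorted listing could look like on the caller side: collect the entries, then sort them explicitly instead of trusting the order the file system hands back. To keep it self-contained it uses java.nio rather than the Hadoop FileSystem API (DirectoryStream also makes no ordering promise); the class name SortedListing and the method listSorted are made up for this example and are not part of Hadoop.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortedListing {
    /**
     * Hypothetical helper: lists a directory and sorts the entries by file
     * name, so callers never depend on the order the file system returns.
     */
    public static List<Path> listSorted(Path dir) throws IOException {
        List<Path> entries = new ArrayList<>();
        // DirectoryStream yields entries in no specific order.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                entries.add(p);
            }
        }
        // Make the ordering explicit on the client side.
        entries.sort(Comparator.comparing(p -> p.getFileName().toString()));
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("listing-demo");
        // Create the files out of order on purpose.
        for (String name : new String[] {"part-2", "part-0", "part-1"}) {
            Files.createFile(dir.resolve(name));
        }
        for (Path p : listSorted(dir)) {
            System.out.println(p.getFileName());
        }
    }
}
```

With the Hadoop API the same idea would presumably mean sorting the FileStatus[] returned by listStatus (for example by getPath()) wherever the caller needs a stable order. That keeps the buffering and sorting cost with the callers that actually want ordering, rather than forcing it on every file system.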