On Mon, Jun 1, 2015 at 3:21 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>
> HADOOP-12009 (https://issues.apache.org/jira/browse/HADOOP-12009) patches the 
> FS javadoc and contract tests to say "the order you get things back from a 
> listStatus() isn't guaranteed to be alphanumerically sorted"
>
> That's one of those assumptions which we all have, but which, when you think 
> about it, doesn't have to be guaranteed.
>
> I'm going to commit the patch with the updated docs. Before I do that, does 
> anyone have any objection -that is, is there some fundamental constraint 
> which requires it to come back sorted? Such as the FS APIs and other apps 
> which do expect that sorting, and which are going to break if the rules 
> change? If so, they may need to be looked at.
>
> -Steve

We had a discussion about this on HADOOP-10798.  Although HDFS always
returns listStatus results in alphabetically sorted order as a side
effect of its implementation, the local filesystem does not return
things in alphabetically sorted order.

I think it's fine in principle to specify that listStatus returns
things in undefined order.  After all, as Allen mentioned, this is
what POSIX does.  I do think that in practice, this will result in a
lot of HDFS-only code getting written where there is a hidden
assumption that listStatus, globStatus, etc. sort their responses.
This might make portability more difficult.
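
To illustrate the portable alternative to that hidden assumption, here
is a minimal sketch in which the caller sorts a listing itself rather
than relying on the order the filesystem happens to return.  The class
and method names (SortedListing, sortedCopy) are hypothetical, and
plain strings stand in for the entries a listing call would return;
nothing here is Hadoop API.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Hadoop: shows a caller imposing its
// own ordering on a listing instead of assuming the filesystem sorted it.
final class SortedListing {
    // Returns a sorted copy of the given names; the input array is left
    // untouched, so the caller can still see the order it received.
    static String[] sortedCopy(String[] names) {
        String[] copy = Arrays.copyOf(names, names.length);
        Arrays.sort(copy); // natural String order
        return copy;
    }
}
```

Code that only needs "some" order can skip the sort entirely; code that
needs a stable order pays for it explicitly and behaves the same on
every filesystem.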

I'm not sure if there is a good way around this problem.  Requiring
results to be returned in sorted order would be really harmful to
performance for things like Ceph and Lustre, since we'd essentially be
forcing a ton of client-side buffering and a sort.  But having HDFS do
sorted order and other FSes not do it would certainly make portability
more difficult.

One possibility is that we could randomize the order of returned
results in HDFS (at least within a given batch of results returned
from the NN).  This is similar to how the Go programming language
randomizes the order of iteration over hash table keys, to avoid code
being written which relies on a specific implementation-defined
ordering.
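
A rough sketch of the per-batch randomization idea, under stated
assumptions: a generic list stands in for one batch of results, and
the names BatchShuffler and shuffledBatch are hypothetical.  This is
not how HDFS or the NN is implemented; it only shows the shape of the
technique.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch, not Hadoop code: shuffle the entries of a single
// batch so callers cannot come to depend on an incidental ordering.
final class BatchShuffler {
    // Returns a shuffled copy of the batch. Only the order within this
    // batch changes; the set of entries is the same and the input list
    // is not modified.
    static <T> List<T> shuffledBatch(List<T> batch, Random rng) {
        List<T> copy = new ArrayList<>(batch);
        Collections.shuffle(copy, rng);
        return copy;
    }
}
```

Shuffling only within a batch keeps the cost small, since no extra
buffering across batches is needed, while still breaking code that
silently assumes sorted output.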

Regardless of whether we do that, though, there is a bunch of code
even in Hadoop common that doesn't properly deal with unsorted
listStatus / globStatus results, such as "hadoop fs -ls".

cheers,
Colin
