Thanks a lot for the reply, Steve! If you don't see a way to fix this in Spark itself, then I will try to improve the docs.
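
A rough sketch of the treemap idea quoted below, in plain Python rather than against the Hadoop API, so every name in it is an assumption: `list_files_ordered` is a hypothetical helper, the local filesystem stands in for a Hadoop FileSystem, and `bisect.insort` stands in for a TreeMap (Python has no built-in one). Each path is put in its sorted position as the walk finds it, so no sort is needed at the end. Names starting with `_` or `.` are skipped, similar to what Hadoop's FileInputFormat does for hidden files.

```python
import bisect
import os
import tempfile


def list_files_ordered(root):
    """Walk `root` recursively, keeping the found file paths sorted at all times."""
    ordered = []    # invariant: sorted after every insertion
    stack = [root]  # directories still to visit
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:  # scandir yields entries in OS-dependent order
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif not entry.name.startswith(("_", ".")):
                    # Insert in sorted position instead of sorting at the end.
                    bisect.insort(ordered, entry.path)
    return ordered


def demo():
    """Create part files out of order and list them back in name order."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("part-00002", "part-00000", "_SUCCESS", "part-00001"):
            open(os.path.join(root, name), "w").close()
        return [os.path.basename(p) for p in list_files_ordered(root)]
```

This only works because the part file names are zero-padded, so string order matches partition order. The whole listing is still held in memory, so it does nothing for the scale issue with very large directory trees.
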
Antonin

On 06/05/2020 17:19, Steve Loughran wrote:
>
> On Tue, 7 Apr 2020 at 15:26, Antonin Delpeuch
> <li...@antonin.delpeuch.eu> wrote:
>
> > Hi,
> >
> > Sorry to dig out this thread but this bug is still present.
> >
> > The fix proposed in this thread (creating a new FileSystem
> > implementation which sorts listed files) was rejected, with the
> > suggestion that it is the FileInputFormat's responsibility to sort
> > the file names if preserving partition order is desired:
> > https://github.com/apache/spark/pull/4204
> >
> > Given that Spark RDDs are supposed to preserve the order of the
> > collections they represent, this would still deserve to be fixed in
> > Spark, I think. As a user, I expect that if I use saveAsTextFile and
> > then load the resulting file with sparkContext.textFile, I obtain a
> > dataset in the same order.
> >
> > Because Spark uses the FileInputFormats exposed by Hadoop, that would
> > mean either patching Hadoop for it to sort file names directly (which
> > is likely going to fail since Hadoop might not care about the
> > ordering in general),
>
> Don't see any guarantees in Hadoop about the order of listLocatedStatus
> -and for the local FS you get what the OS gives you.
>
> What isn't easy is to take an entire listing and sort it -not if it is
> potentially millions of entries. That issue is why the newer FS list
> APIs all return a RemoteIterator<>: incremental paging of values so
> reducing payload of single RPC messages between HDFS client & namenode
> (HDFS) or allowing for paged/incremental lists against object stores.
> You can't provide incremental pages of results *and sort those results
> at the same time*
>
> Which, given they're my problem, means I wouldn't be happy with adding
> "sort all listings" as a new restriction on FS semantics.
>
> > or create subclasses of all Hadoop formats used in Spark, adding the
> > required sorting to the listStatus method. This strikes me as less
> > elegant than implementing a new FileSystem as suggested by Reynold,
> > though.
>
> Again, you've got some scale issues to deal with -but as FileInputFormat
> builds a list it's already in trouble if you point it at a sufficiently
> large directory tree
>
> Best thing to do would be to add entries to a treemap during the
> recursive treewalk and then serve it up ordered from there -no need to
> do a sort @ the end.
>
> But: trying to subclass all Hadoop formats is itself troublesome. If you
> go that way: make it an optional interface. And/or talk to the mapreduce
> project about actually providing a base implementation
>
> > Another way to "fix" this would be to mention in the docs that order
> > is not preserved in this scenario, which could hopefully avoid bad
> > surprises to others (just like we already have a caveat about
> > nondeterminism of order after shuffles).
> >
> > I would be happy to try submitting a fix for this, if there is a
> > consensus around the correct course of action.
>
> Even if it's not the final desired goal, it's a correct description of
> the current state of the application ...

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org