Thanks a lot for the reply, Steve!

If you don't see a way to fix this in Spark itself, then I will try to
improve the docs.

Antonin

On 06/05/2020 17:19, Steve Loughran wrote:
> 
> 
> On Tue, 7 Apr 2020 at 15:26, Antonin Delpeuch <li...@antonin.delpeuch.eu
> <mailto:li...@antonin.delpeuch.eu>> wrote:
> 
>     Hi,
> 
>     Sorry to dig up this thread, but this bug is still present.
> 
>     The fix proposed in this thread (creating a new FileSystem implementation
>     which sorts listed files) was rejected, with the suggestion that it is the
>     FileInputFormat's responsibility to sort the file names if preserving
>     partition order is desired:
>     https://github.com/apache/spark/pull/4204
> 
>     Given that Spark RDDs are supposed to preserve the order of the collections
>     they represent, this would still deserve to be fixed in Spark, I think. As a
>     user, I expect that if I use saveAsTextFile and then load the resulting
>     files with sparkContext.textFile, I obtain a dataset in the same order.
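>
>     As a rough illustration (Scala; the path, sizes and app name below are
>     made up), the round trip I have in mind is:
>
>       import org.apache.spark.{SparkConf, SparkContext}
>
>       val conf = new SparkConf().setAppName("order-check").setMaster("local[2]")
>       val sc = new SparkContext(conf)
>
>       // Write an RDD whose element order is well defined.
>       val data = sc.parallelize(1 to 1000000, numSlices = 20)
>       data.saveAsTextFile("/tmp/order-check")   // part-00000 .. part-00019
>
>       // Reloading goes through a Hadoop FileInputFormat, so the partition
>       // order follows whatever order the part files are listed in, which is
>       // not guaranteed to match the order in which they were written.
>       val reloaded = sc.textFile("/tmp/order-check")
>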
> 
>     Because Spark uses the FileInputFormats exposed by Hadoop, that would mean
>     either patching Hadoop so that it sorts file names directly (which is likely
>     to be rejected, since Hadoop might not care about the ordering in general),
> 
> 
> I don't see any guarantees in Hadoop about the order of listLocatedStatus -
> and for the local FS, you get what the OS gives you.
> 
> What isn't easy is taking an entire listing and sorting it - not if it is
> potentially millions of entries. That issue is why the newer FS list APIs
> all return a RemoteIterator<>: incremental paging of values, which reduces
> the payload of individual RPC messages between the HDFS client and namenode,
> and allows for paged/incremental listings against object stores. You can't
> provide incremental pages of results *and sort those results at the same
> time*.
> 
> Which, given that the FS APIs are my problem, means I wouldn't be happy
> with adding "sort all listings" as a new restriction on FS semantics.
> 
>  
> 
>     or creating subclasses of all Hadoop formats used in Spark, adding the
>     required sorting to the listStatus method. This strikes me as less elegant
>     than implementing a new FileSystem as suggested by Reynold, though.
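>
>     As a hypothetical sketch of that subclassing approach (Scala, against the
>     old mapred API that sparkContext.textFile relies on; the class name is
>     invented):
>
>       import org.apache.hadoop.fs.FileStatus
>       import org.apache.hadoop.mapred.{JobConf, TextInputFormat}
>
>       // Sort the listing by path so that split (and hence partition) order
>       // is deterministic. Every format used this way would need a similar
>       // subclass, and it would have to be wired in via sc.hadoopFile rather
>       // than sc.textFile.
>       class SortedTextInputFormat extends TextInputFormat {
>         override protected def listStatus(job: JobConf): Array[FileStatus] =
>           super.listStatus(job).sortBy(_.getPath.toString)
>       }
>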
> 
> 
> Again, you've got some scale issues to deal with - but since FileInputFormat
> builds up a full list, it's already in trouble if you point it at a
> sufficiently large directory tree.
> 
> The best thing to do would be to add entries to a TreeMap during the
> recursive treewalk and then serve it up ordered from there - no need to do a
> sort at the end.
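>
> Something like the following, roughly (Scala; the method name is just
> illustrative, not an existing API). The map still holds every entry in
> memory, but there is no separate sort pass at the end:
>
>   import java.util.TreeMap
>   import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
>
>   // Collect entries into a TreeMap keyed by path during the recursive walk;
>   // iterating the map afterwards yields them already ordered.
>   def walkSorted(fs: FileSystem, root: Path): java.util.Collection[FileStatus] = {
>     val byPath = new TreeMap[String, FileStatus]()
>     def walk(dir: Path): Unit =
>       fs.listStatus(dir).foreach { status =>
>         if (status.isDirectory) walk(status.getPath)
>         else byPath.put(status.getPath.toString, status)
>       }
>     walk(root)
>     byPath.values()
>   }
>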
> 
> But trying to subclass all Hadoop formats is itself troublesome. If you go
> that way, make it an optional interface, and/or talk to the MapReduce project
> about actually providing a base implementation.
>  
> 
>     Another way to "fix" this would be to mention in the docs that order is not
>     preserved in this scenario, which could hopefully avoid bad surprises for
>     others (just like we already have a caveat about the nondeterminism of
>     ordering after shuffles).
> 
>     I would be happy to try submitting a fix for this, if there is a consensus
>     around the correct course of action.
> 
> Even if it's not the final desired goal, it's a correct description of
> the current state of the application ...


