GitHub user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9517#discussion_r44247117
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
    @@ -461,13 +461,29 @@ private[sql] class ParquetRelation(
               // You should enable this configuration only if you are very sure that for the parquet
               // part-files to read there are corresponding summary files containing correct schema.
     
    +          // As filed in SPARK-11500, the order of files to touch is a matter, which might affect
    +          // the ordering of the output columns. There are several things to mention here.
    +          //
    +          //  1. If mergeRespectSummaries config is false, then it merges schemas by reducing from
    +          //     the first part-file so that the columns of the first file show first.
    +          //
    +          //  2. If mergeRespectSummaries config is true, then there should be, at least,
    +          //     "_metadata"s for all given files. So, we can ensure the columns of the first file
    +          //     show first.
    +          //
    +          //  3. If shouldMergeSchemas is false, but when multiple files are given, there is
    +          //     no guarantee of the output order, since there might not be a summary file for the
    +          //     first file, which ends up putting ahead the columns of the other files. However,
    +          //     this should be okay since not enabling shouldMergeSchemas means (assumes) all the
    +          //     files have the same schemas.
    +
               val needMerged: Seq[FileStatus] =
                 if (mergeRespectSummaries) {
                   Seq()
                 } else {
                   dataStatuses
                 }
    -          (metadataStatuses ++ commonMetadataStatuses ++ needMerged).toSeq
    +          needMerged ++ metadataStatuses ++ commonMetadataStatuses
    --- End diff --
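
    To illustrate why the order of files matters in point 1 of the diff comment, here is a minimal sketch of merging schemas by reducing from the first file. It is not the Spark implementation: `Schema`, `merge`, and `mergeAll` are hypothetical names, and a schema is simplified to a plain sequence of column names.

```scala
object MergeOrder {
  // Hypothetical simplification: a schema is just an ordered list of column names.
  type Schema = Seq[String]

  // Keep every column of `left` in place, then append columns only `right` has.
  def merge(left: Schema, right: Schema): Schema =
    left ++ right.filterNot(left.contains)

  // Reduce from the first schema, so the first file's columns show first.
  // An empty input yields an empty schema.
  def mergeAll(schemas: Seq[Schema]): Schema =
    schemas.reduceLeftOption(merge).getOrElse(Seq.empty)

  def main(args: Array[String]): Unit = {
    val first: Schema = Seq("a", "b")
    val second: Schema = Seq("b", "c")
    // Same files, different order, different column order in the result.
    println(mergeAll(Seq(first, second)))
    println(mergeAll(Seq(second, first)))
  }
}
```

    Under these assumptions, swapping the two inputs changes which columns lead the merged result, which is the effect the change to the concatenation order addresses.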
    
    Yes, I think I should sort them.
    Judging from [this link](http://lucene.472066.n3.nabble.com/FileSystem-contract-of-listStatus-td3475540.html), it does not seem to be recommended to rely on the order of the listed files as returned, even though they look sorted.
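
    A rough sketch of what sorting could look like, not the actual patch. `FileEntry` is a hypothetical stand-in for Hadoop's `FileStatus` so the example runs without Hadoop on the classpath, and sorting by the path string is an assumption about a reasonable sort key.

```scala
// Hypothetical stand-in for org.apache.hadoop.fs.FileStatus; only the path is modeled.
final case class FileEntry(path: String)

object SortedListing {
  // Sort by path so the result does not depend on the order the listing returned.
  def sortByPath(entries: Seq[FileEntry]): Seq[FileEntry] =
    entries.sortBy(_.path)

  // Mirrors the concatenation order in the diff: data files first
  // (skipped when summaries are trusted), then the summary files.
  def filesToTouch(
      dataStatuses: Seq[FileEntry],
      metadataStatuses: Seq[FileEntry],
      commonMetadataStatuses: Seq[FileEntry],
      mergeRespectSummaries: Boolean): Seq[FileEntry] = {
    val needMerged =
      if (mergeRespectSummaries) Seq.empty[FileEntry] else sortByPath(dataStatuses)
    needMerged ++ sortByPath(metadataStatuses) ++ sortByPath(commonMetadataStatuses)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(FileEntry("part-r-00002.parquet"), FileEntry("part-r-00001.parquet"))
    val meta = Seq(FileEntry("_metadata"))
    filesToTouch(data, meta, Seq.empty, mergeRespectSummaries = false)
      .foreach(e => println(e.path))
  }
}
```

    With an explicit sort, the first file used for merging is determined by its name and not by whatever order the file system happened to return.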
 

