[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650817#comment-16650817 ]
Daniel Darabos commented on SPARK-20144:
----------------------------------------

Thanks, those are good questions.

# The global option is not great, but it's the simplest. The code is already controlled by two global options. ({{spark.sql.files.maxPartitionBytes}} and {{spark.sql.files.openCostInBytes}}.) Why not one more?
# I'm not sure what {{LOAD DATA INPATH}} does. (Sorry...) But sure, users can put random-name files in the directory and mess stuff up. Best protection against that is not putting random-name files in the directory. :D
# The whole problem is not Parquet-specific. It affects all file types. The {{part-00001}} naming comes from Hadoop's [FileOutputFormat|https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java#L270]. It's been like this forever and will never change. (I'd say it's more than a convention.)

> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Li Jin
> Priority: Major
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is
> when we read parquet files in 2.0.2, the ordering of rows in the resulting
> dataframe is not the same as the ordering of rows in the dataframe that the
> parquet file was reproduced with.
> This is because FileSourceStrategy.scala combines the parquet files into
> fewer partitions and also reordered them. This breaks our workflows because
> they assume the ordering of the data.
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with
> 2.1.
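
To illustrate the point about {{part-00001}} naming: a minimal sketch, in plain Python, of ordering output files by the numeric index in their name. This is not Spark or Hadoop code. The helper names ({{part_index}}, {{sort_part_files}}), the regular expression, and the example paths are all hypothetical, and the sketch assumes every file name starts with {{part-}} followed by digits.

```python
import re

# Hypothetical helper, not part of Spark or Hadoop.
# Assumes names that start with "part-" followed by digits, for example
# "part-00001" or "part-00001.parquet".
_PART_RE = re.compile(r"part-(\d+)")


def part_index(path):
    """Return the numeric index embedded in a part-file name."""
    name = path.rsplit("/", 1)[-1]  # keep only the file name
    match = _PART_RE.match(name)
    if match is None:
        # A random-name file has no position in the ordering.
        raise ValueError(f"not a part file: {path}")
    return int(match.group(1))


def sort_part_files(paths):
    """Return the paths sorted by part index (numeric, not lexicographic)."""
    return sorted(paths, key=part_index)


# Example paths are made up for illustration.
files = [
    "/data/out/part-00010.parquet",
    "/data/out/part-00002.parquet",
    "/data/out/part-00001.parquet",
]
print(sort_part_files(files))
```

A file that does not follow the naming raises {{ValueError}} in this sketch, which mirrors the point above that random-name files in the directory have no well-defined place in the order.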