[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650722#comment-16650722 ]
Daniel Darabos commented on SPARK-20144:
----------------------------------------

Yeah, I'm not too happy about the alphabetical ordering either. I thought I could simply not sort, and get the "original" order. But at the point where I made my change, the files are already in a jumbled order. Maybe it's the file system listing order, which could be anything.

99% of the time I'm just reading back a single partitioned Parquet file. In this case the alphabetical ordering is the right ordering. ({{part-00001}}, {{part-00002}}, ...) The rows of the resulting DataFrame will be in the same order as originally. So I think this issue is satisfied by the change. (The test also demonstrates this.)

The 1% case (for me) is when I'm reading back multiple Parquet files with a glob in a single {{spark.read.parquet("dir-\{0,5,10}")}} call. In this case it would be nice to respect the order given by the user ({{dir-0}}, {{dir-5}}, {{dir-10}}). My PR messes this up. ({{dir-0}}, {{dir-10}}, {{dir-5}}) But at least the partitions within each Parquet file will be contiguous. That's still an improvement.

> spark.read.parquet no long maintains ordering of the data
> ---------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>            Priority: Major
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is
> when we read parquet files in 2.0.2, the ordering of rows in the resulting
> dataframe is not the same as the ordering of rows in the dataframe that the
> parquet file was reproduced with.
> This is because FileSourceStrategy.scala combines the parquet files into
> fewer partitions and also reordered them. This breaks our workflows because
> they assume the ordering of the data.
> Is this considered a bug?
> Also FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1,
> so not sure if this is an issue with 2.1.
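
The ordering difference described in the comment above can be illustrated without Spark. This is only a minimal sketch of plain string sorting in Python, not the code from the PR; the variable names are hypothetical and the path names are the ones used in the comment.

```python
# Directory names in the order the user wrote them in the glob.
requested = ["dir-0", "dir-5", "dir-10"]

# Plain string sorting compares character by character, so "dir-10"
# lands before "dir-5" because "1" sorts before "5".
alphabetical = sorted(requested)
print(alphabetical)

# Part files within one Parquet directory are zero-padded, so string
# order and numeric order agree there.
parts = ["part-00010", "part-00002", "part-00001"]
sorted_parts = sorted(parts)
print(sorted_parts)
```

Zero-padded part names sort the same way alphabetically and numerically, which is why the single-directory case comes back in the original order while the unpadded {{dir-N}} names do not.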