[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16815755#comment-16815755 ]
David Greenberg commented on SPARK-20144:
-----------------------------------------

Hello, this issue is also a major one for me. Almost all of the data I work with has a natural sort order, and I store it in CSV, Parquet, and ORC. Unfortunately, some of my datasets are very large, so I waste a lot of compute time re-establishing order after loading those datasets out of storage, because Spark throws away ordering information at load and store time. I would really like to see a solution to this problem, as it is fairly expensive for our bottom line when using Spark.

> spark.read.parquet no longer maintains ordering of the data
> -----------------------------------------------------------
>
>                 Key: SPARK-20144
>                 URL: https://issues.apache.org/jira/browse/SPARK-20144
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Li Jin
>            Priority: Major
>
> Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is that when we read Parquet files in 2.0.2, the ordering of rows in the resulting DataFrame is not the same as the ordering of rows in the DataFrame that the Parquet file was produced from.
> This is because FileSourceStrategy.scala combines the Parquet files into fewer partitions and also reorders them. This breaks our workflows, because they assume a particular ordering of the data.
> Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec changed quite a bit from 2.0.2 to 2.1, so I am not sure whether this is an issue with 2.1 as well.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org