[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830889#comment-15830889 ]
Jason Moore commented on SPARK-17436:
-------------------------------------

Hi [~ran.h...@optimalplus.com], [~srowen],

How sure are we that this is no longer a problem? I've been testing some of my application code that relies on preserving ordering (for performance) on 2.1 (built from the head of branch-2.1), and it still seems that after sorting within partitions, saving to Parquet, then re-reading from Parquet, the rows are not correctly ordered.

{code}
val random = new scala.util.Random
val raw = sc.parallelize((1 to 1000000).map { _ => random.nextInt }).toDF("A").repartition(100)
raw.sortWithinPartitions("A").write.mode("overwrite").parquet("maybeSorted.parquet")
val maybeSorted = spark.read.parquet("maybeSorted.parquet")
val isSorted = maybeSorted.mapPartitions { rows =>
  // sliding(2) emits a single-element window for a one-row partition,
  // so guard the match instead of assuming exactly two elements.
  val partitionSorted = rows.map(_.getInt(0)).sliding(2).forall {
    case Seq(x, y) => x <= y
    case _         => true
  }
  Iterator(partitionSorted)
}.reduce(_ && _)
{code}

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>            Priority: Minor
>
> update
> ***************
> It seems that in the Spark 2.1 code, the sorting issue is resolved.
> The sorter does take the inner sort into account in the sorting key - but I
> think it would be faster to just insert the rows into a list in a hash map.
> ***************
> When using partitionBy, the data writer can sometimes mess up an ordered
> dataframe.
> The problem originates in
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it
> starts inserting rows into an UnsafeKVExternalSorter, then reads all the rows
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition
> key, and that can sometimes mess up the original sort (or secondary sort, if
> you will).
> I think the best way to fix it is to stop using a sorter and just put the
> rows in a map, keyed by partition key with an ArrayList as the value, then
> walk through all the keys and write each list out in its original order -
> this will probably be faster as there is no need for sorting.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
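The grouping approach proposed in the description could be sketched, Spark internals aside, as plain Scala. This is a hypothetical illustration, not Spark code: `GroupByPartitionKey`, its `group` signature, and the `(key, row)` pair representation are all assumptions made for the example. The point is that an insertion-ordered map preserves the original row order within each partition-key bucket, so no re-sort is needed:

```scala
import scala.collection.mutable

// Hypothetical sketch of the proposed fix: instead of feeding rows to a
// sorter keyed on the partition key (which loses the secondary order),
// bucket them in an insertion-ordered map so each bucket keeps the order
// in which its rows arrived.
object GroupByPartitionKey {
  def group[K, V](rows: Seq[(K, V)]): mutable.LinkedHashMap[K, mutable.ArrayBuffer[V]] = {
    val buckets = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[V]]
    for ((key, row) <- rows)
      buckets.getOrElseUpdate(key, mutable.ArrayBuffer.empty[V]) += row
    buckets
  }

  def main(args: Array[String]): Unit = {
    // Rows arrive already sorted within each key; grouping must not disturb that.
    val rows = Seq("a" -> 3, "b" -> 1, "a" -> 5, "b" -> 2)
    GroupByPartitionKey.group(rows).foreach { case (k, vs) =>
      println(s"$k -> ${vs.mkString(",")}")
    }
    // prints:
    // a -> 3,5
    // b -> 1,2
  }
}
```

Walking the map's keys and writing each `ArrayBuffer` in order then emits every file's rows in their original sequence, at the cost of holding the buckets in memory rather than spilling through a sorter.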