[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830889#comment-15830889 ]

Jason Moore commented on SPARK-17436:
-------------------------------------

Hi [~ran.h...@optimalplus.com], [~srowen],

How sure are we that this is not a problem anymore?  I've been testing some of 
my application code that relies on preserving ordering (for performance) on 2.1 
(built from the head of branch-2.1), and it still seems that after sorting 
within partitions, saving to Parquet, then re-reading from Parquet, the rows 
are not correctly ordered.

{code}
// Run in spark-shell (branch-2.1 build); spark.implicits._ is already in scope there.
val random = new scala.util.Random
val raw = sc.parallelize((1 to 1000000).map(_ => random.nextInt)).toDF("A").repartition(100)
raw.sortWithinPartitions("A").write.mode("overwrite").parquet("maybeSorted.parquet")
val maybeSorted = spark.read.parquet("maybeSorted.parquet")
val isSorted = maybeSorted.mapPartitions { rows =>
  val sorted = rows.map(_.getInt(0)).sliding(2).forall {
    case Seq(x, y) => x <= y
    case _         => true  // a partition with fewer than two rows is trivially sorted
  }
  Iterator(sorted)
}.reduce(_ && _)
{code}
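
On my branch-2.1 build, {{isSorted}} comes back {{false}} after the Parquet 
round trip, which is what prompted the question above.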

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>            Priority: Minor
>
> Update
> ***************
> It seems that in the Spark 2.1 code, the sorting issue is resolved.
> The sorter does take the inner sort into account in the sorting key - but I 
> think it would be faster to just insert the rows into lists in a hash map.
> ***************
> When using partitionBy, the data writer can sometimes mess up an ordered 
> DataFrame.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (configurable), it 
> starts inserting rows into an UnsafeKVExternalSorter; it then reads all the 
> rows back from the sorter and writes them to the corresponding files.
> The problem is that the sorter sorts the rows by the partition key, and 
> that can sometimes mess up the original sort (or secondary sort, if you 
> will).
> I think the best way to fix it is to stop using a sorter, and instead put 
> the rows in a map keyed by partition key with an ArrayList as the value, 
> then walk through all the keys and write the rows in their original order - 
> this will probably be faster, as there is no need for sorting.
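
For illustration, the map-based fix suggested in the description might look 
roughly like the sketch below. This is a minimal sketch under that suggestion; 
{{writeBuckets}} and its signature are hypothetical, not the actual 
{{DynamicPartitionWriterContainer.writeRows}} code.

{code}
import scala.collection.mutable

// Sketch of the suggested fix: bucket rows by partition key while preserving
// their arrival order, then write each bucket out. Hypothetical helper only.
def writeBuckets[K, R](rows: Iterator[(K, R)])(write: (K, R) => Unit): Unit = {
  // LinkedHashMap preserves first-seen key order; each ArrayBuffer preserves
  // the order rows arrived in, so a prior sortWithinPartitions survives.
  val buckets = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[R]]
  for ((key, row) <- rows)
    buckets.getOrElseUpdate(key, mutable.ArrayBuffer.empty[R]) += row
  for ((key, bucket) <- buckets; row <- bucket)
    write(key, row)
}
{code}

One caveat: unlike UnsafeKVExternalSorter, a plain in-memory map cannot spill 
to disk, so this trades memory for the avoided sort.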


