Ran Haim created SPARK-17436:
--------------------------------

             Summary: dataframe.write sometimes does not keep sorting
                 Key: SPARK-17436
                 URL: https://issues.apache.org/jira/browse/SPARK-17436
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2.0.0, 1.6.2, 1.6.1
            Reporter: Ran Haim


When using partitionBy, the data writer can sometimes scramble the ordering of an already-sorted dataframe.
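
A minimal sketch (Spark 2.0 API) of how this can surface. The column names, partition counts, and output path are illustrative, and whether the fallback actually triggers depends on the (configurable) open-file threshold:

{code:scala}
import org.apache.spark.sql.SparkSession

object Spark17436Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-17436 repro sketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Many distinct partition values push the dynamic partition writer
    // past its open-file limit, forcing the sorter-based fallback.
    val df = spark.range(0L, 1000000L)
      .select(($"id" % 5000).as("part"), $"id".as("ts"))
      .sortWithinPartitions($"ts") // the ordering we expect in each file

    // After the fallback, rows inside each output file are no longer
    // guaranteed to be ordered by ts, because the sorter keys only on "part".
    df.write.partitionBy("part").parquet("/tmp/spark-17436-repro")

    spark.stop()
  }
}
{code}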

The problem originates in 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
In the writeRows method, when too many output files are open (the threshold 
is configurable), it starts inserting rows into an UnsafeKVExternalSorter, 
then reads all the rows back from the sorter and writes them to the 
corresponding files. The problem is that the sorter sorts the rows by the 
partition key only, and that can destroy the original ordering (a secondary 
sort, if you will).

I think the best way to fix this is to stop using a sorter and instead put 
the rows in a map, keyed by the partition key, with an ArrayList of rows as 
the value; then walk through the keys and write each key's rows in their 
original order (see the sketch below). This will probably also be faster, 
since no sorting is needed.
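
A rough sketch of that approach, with simplified stand-ins for the real 
writer-container types (the method name, type parameters, and write callback 
are hypothetical, not the actual DynamicPartitionWriterContainer internals):

{code:scala}
import scala.collection.mutable

// Hypothetical replacement for the sorter-based fallback: buffer rows in a
// map keyed by partition key, preserving arrival order, then flush per key.
// K and R stand in for the real partition-key and row types (UnsafeRow /
// InternalRow in the actual writer container).
def writeRowsBuffered[K, R](
    rows: Iterator[(K, R)],
    write: (K, R) => Unit): Unit = {
  // LinkedHashMap keeps partition keys in first-seen order; each
  // ArrayBuffer keeps its rows exactly in the order they arrived.
  val buffered = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[R]]
  for ((key, row) <- rows) {
    buffered.getOrElseUpdate(key, mutable.ArrayBuffer.empty[R]) += row
  }
  // Walk the keys and write each group's rows untouched: there is no
  // sort pass, so the caller's original ordering survives.
  for ((key, groupRows) <- buffered; row <- groupRows) {
    write(key, row)
  }
}
{code}

One caveat with this sketch: unlike UnsafeKVExternalSorter, a plain in-memory 
map cannot spill to disk, so a real implementation would still need some 
bound on buffered memory.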


