[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

Sean Owen (JIRA) Sat, 19 Nov 2016 03:37:12 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679106#comment-15679106
 ]


Sean Owen commented on SPARK-17436:
-----------------------------------

[~ran.h...@optimalplus.com] I'm not sure how to proceed on this. Above you say 
that sort-then-partition doesn't preserve the sort. That is correct: it does 
not. For example, if I sort people by name and then partition by age, there's 
no way in general for them to stay sorted by name, because younger people may 
have names that come alphabetically after those of older people. Here, though, 
you seem to be talking about partitioning and then sorting. That leaves the 
data sorted, though it may change the partitioning. Do you mean partitioning by 
the same value you sort on, or partitioning one way and sorting within 
partitions another way?
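
To make the name/age example concrete, here is a minimal sketch in plain Python 
(not Spark). The people and ages are made up for illustration, and the dict 
only stands in for partitioned output; it is not how Spark writes files.

```python
# Hypothetical data: (name, age) pairs, invented for this illustration.
people = [("Carol", 40), ("Bob", 25), ("Alice", 40), ("Dave", 25)]

# Step 1: sort everyone by name.
by_name = sorted(people, key=lambda p: p[0])

# Step 2: partition by age, appending rows in the order they arrive.
partitions = {}
for name, age in by_name:
    partitions.setdefault(age, []).append(name)

# Reading the partitions back one after another (youngest first here)
# gives a sequence that is no longer sorted by name overall.
read_back = [name for age in sorted(partitions) for name in partitions[age]]

print(read_back)                       # names, partition by partition
print(read_back == sorted(read_back))  # is the global name order intact?
for age, names in sorted(partitions.items()):
    print(age, names, names == sorted(names))  # order inside one partition
```

In this toy version the global order by name is lost, but each partition is 
still sorted by name, because rows are appended to a partition in arrival 
order. Whether that per-partition order is what the report is about is exactly 
the open question.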

If you're not able to open a PR, can you give a short example illustrating the 
issue?

> dataframe.write sometimes does not keep sorting
> -----------------------------------------------
>
>                 Key: SPARK-17436
>                 URL: https://issues.apache.org/jira/browse/SPARK-17436
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Ran Haim
>
> When writing with partitionBy, the data writer can sometimes mess up the 
> order of a sorted dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are open (the limit is 
> configurable), it starts inserting rows into an UnsafeKVExternalSorter, then 
> reads all the rows back from the sorter and writes them to the corresponding 
> files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter and instead put the 
> rows in a map, with the partition key as the key and an ArrayList of rows as 
> the value, and then walk through all the keys and write the rows in their 
> original order. This will probably be faster too, as there is no need for 
> ordering.
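
The map-based idea in the description can be sketched in plain Python. This is 
only an illustration of the proposal, not Spark code: `group_by_partition` and 
the sample rows are hypothetical names and data, and a Python list plays the 
role of the ArrayList.

```python
from collections import OrderedDict


def group_by_partition(rows, partition_key):
    """Bucket rows by partition key, keeping arrival order inside each bucket."""
    buckets = OrderedDict()
    for row in rows:
        buckets.setdefault(partition_key(row), []).append(row)
    return buckets


# Hypothetical rows: (partition value, sequence number). The input is already
# ordered by sequence number, which stands in for the "original sort".
rows = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("c", 5)]

buckets = group_by_partition(rows, partition_key=lambda r: r[0])

# Walk the keys and "write" each bucket; rows come out in their input order.
for key, bucket in buckets.items():
    print(key, bucket)
```

No comparison of rows takes place, so the order within each partition is 
whatever the input order was. One trade-off to keep in mind: a map like this 
holds every buffered row in memory, whereas an external sorter can spill to 
disk.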



