[jira] [Commented] (SPARK-15798) Secondary sort in Dataset/DataFrame

Melitta Dragaschnig (Jira) Thu, 05 Mar 2020 02:39:27 -0800


    [ https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051990#comment-17051990 ]


Melitta Dragaschnig commented on SPARK-15798:
---------------------------------------------

Hi all,


I am a frequent user of tresata's spark-sorted library (thank you [~koert]!), 
which I rely on for secondary sort (for large groups, in order to avoid 
memory issues), so I tried to find out whether there are plans to merge this 
useful functionality into the core library.

This Jira issue has been closed as Incomplete and no Fix Version is given, so 
my conclusion is that the core library does not currently provide secondary 
sort and that it is advisable to continue using spark-sorted for the time 
being. Is my assumption correct?

Also, any pointers on how to stay informed about the development of this 
topic would be greatly appreciated!

> Secondary sort in Dataset/DataFrame
> -----------------------------------
>
>                 Key: SPARK-15798
>                 URL: https://issues.apache.org/jira/browse/SPARK-15798
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: koert kuipers
>            Priority: Major
>              Labels: bulk-closed
>
> Secondary sort for Spark RDDs was discussed in 
> https://issues.apache.org/jira/browse/SPARK-3655
> Since the RDD API allows for easy extensions outside the core library, this 
> was implemented separately here:
> https://github.com/tresata/spark-sorted
> However, it seems to me that with Dataset, implementing such a feature in a 
> third-party library is not really an option.
> Dataset already has methods that suggest a secondary sort is present, such as 
> in KeyValueGroupedDataset:
> {noformat}
> def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]
> {noformat}
> This operation pushes all the data to the reducer, something you would only 
> want to do if you need the elements in a particular order.
> How about adding sortBy methods to KeyValueGroupedDataset and 
> RelationalGroupedDataset as the API?
> {noformat}
> dataFrame.groupBy("a").sortBy("b").fold(...)
> {noformat}
> (Yes, I know RelationalGroupedDataset doesn't have a fold yet... but it 
> should :))
> {noformat}
> dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
> {noformat}
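
To make the semantics proposed in the description above concrete, here is a minimal local sketch using plain Scala collections (Scala 2.13+). It is not Spark code and not the spark-sorted API: the helper name groupSortFlatMap and the sample data are hypothetical, chosen only for illustration. It groups rows by a key, orders each group by a secondary field, and hands each group to a function as an Iterator, in the same shape as flatMapGroups.

```scala
// Local illustration only: plain Scala collections, no Spark involved.
object SecondarySortSketch {

  // Hypothetical helper: group `rows` by `key`, order each group by `sortKey`,
  // then pass every group to `f` as an Iterator, similar in shape to flatMapGroups.
  // The order in which the groups themselves are visited is unspecified.
  def groupSortFlatMap[V, K, S: Ordering, U](rows: Seq[V])(key: V => K)(sortKey: V => S)(
      f: (K, Iterator[V]) => IterableOnce[U]): Seq[U] =
    rows
      .groupBy(key)
      .toSeq
      .flatMap { case (k, group) => f(k, group.sortBy(sortKey).iterator) }

  def main(args: Array[String]): Unit = {
    // (user, action, timestamp): made-up sample data.
    val events = Seq(
      ("alice", "logout", 30L),
      ("bob", "login", 5L),
      ("alice", "login", 10L),
      ("alice", "click", 20L),
      ("bob", "logout", 15L)
    )

    // Per user, emit the actions in timestamp order.
    val paths = groupSortFlatMap(events)(_._1)(_._3) { (user, sorted) =>
      Iterator.single(user -> sorted.map(_._2).mkString(" -> "))
    }

    paths.sortBy(_._1).foreach(println)
  }
}
```

Note that this sketch materializes and sorts every group in memory, which is exactly what a real secondary sort is meant to avoid for large groups. In a distributed implementation the ordering would be produced during the shuffle, so each group could be streamed through the iterator.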



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

