[ https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051990#comment-17051990 ]
Melitta Dragaschnig commented on SPARK-15798:
---------------------------------------------

Hi all,

I am a frequent user of tresata's spark-sorted library (thank you [~koert]!) to get the Secondary Sort functionality (for large groups, in order to avoid memory issues), so I tried to figure out whether there are plans to merge this useful functionality into the core library.

After checking the progression of this Jira issue and seeing that it's marked as Incomplete and has been closed with no Fix Version given, my conclusion was that it is presently not provided by the core library, and that it's advisable to continue using spark-sorted for the time being.

Is my assumption correct?
Also, any further information on additional ways to stay informed about the development of this topic would be greatly appreciated!

> Secondary sort in Dataset/DataFrame
> -----------------------------------
>
>                 Key: SPARK-15798
>                 URL: https://issues.apache.org/jira/browse/SPARK-15798
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: koert kuipers
>            Priority: Major
>              Labels: bulk-closed
>
> Secondary sort for Spark RDDs was discussed in
> https://issues.apache.org/jira/browse/SPARK-3655
> Since the RDD API allows for easy extensions outside the core library this
> was implemented separately here:
> https://github.com/tresata/spark-sorted
> However it seems to me that with Dataset an implementation in a 3rd party
> library of such a feature is not really an option.
> Dataset already has methods that suggest a secondary sort is present, such as
> in KeyValueGroupedDataset:
> {noformat}
> def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]
> {noformat}
> This operation pushes all the data to the reducer, something you only would
> want to do if you need the elements in a particular order.
> How about as an API sortBy methods in KeyValueGroupedDataset and
> RelationalGroupedDataset?
> {noformat}
> dataFrame.groupBy("a").sortBy("b").fold(...)
> {noformat}
> (yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should :))
> {noformat}
> dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
> {noformat}
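
For readers unfamiliar with the term, the secondary sort discussed above can be approximated with operations that already exist on Dataset. The following is a minimal sketch, not the API proposed in the issue and not how spark-sorted is implemented. It assumes the standard `repartition`, `sortWithinPartitions` and `mapPartitions` methods; the `Event` type, the `firstPerKey` helper and the object name are hypothetical and exist only for illustration. Rows are co-located by key, each partition is sorted by (key, ts), and the sorted partition is then streamed so that no group has to be held in memory at once.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.functions.col

object SecondarySortSketch {
  // Hypothetical record type, used only for this illustration.
  final case class Event(key: String, ts: Long, value: Double)

  // Explicit encoder so the sketch does not depend on spark.implicits._
  val eventEncoder: Encoder[Event] = Encoders.product[Event]

  // Streams over rows that are already sorted by (key, ts) and keeps the
  // first row of each key. Only the previous key is remembered, so a large
  // group is never materialized in memory.
  def firstPerKey(rows: Iterator[Event]): Iterator[Event] = {
    var previousKey: Option[String] = None
    rows.filter { e =>
      val isFirst = !previousKey.contains(e.key)
      previousKey = Some(e.key)
      isFirst
    }
  }

  // Assumption: hash-partitioning by key plus a per-partition sort stands in
  // for the sortBy step suggested in the issue.
  def earliestEvents(events: Dataset[Event]): Dataset[Event] =
    events
      .repartition(col("key"))                     // all rows of a key share a partition
      .sortWithinPartitions(col("key"), col("ts")) // primary key first, then secondary key
      .mapPartitions((rows: Iterator[Event]) => firstPerKey(rows))(eventEncoder)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("secondary-sort-sketch")
      .getOrCreate()

    val events = spark.createDataset(Seq(
      Event("a", 3L, 1.0),
      Event("b", 1L, 5.0),
      Event("a", 1L, 2.0),
      Event("a", 2L, 4.0),
      Event("b", 2L, 6.0)
    ))(eventEncoder)

    // One row per key: the event with the smallest ts.
    earliestEvents(events).collect().sortBy(_.key).foreach(println)

    spark.stop()
  }
}
```

The streaming logic lives in a plain function over `Iterator[Event]` so it can be checked without a cluster. Whether this pattern is an adequate substitute for spark-sorted depends on the workload, since the whole partition is still sorted even when only a few groups are large.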