I would love this feature On Thu, 22 Dec 2016, 18:45 assaf.mendelson, <assaf.mendel...@rsa.com> wrote:
> It seems that this aggregation is for dataset operations only. I would > have hoped to be able to do dataframe aggregation. Something along the line > of: sort_df(df).agg(my_agg_func) > > > > In any case, note that this kind of sorting is less efficient than the > sorting done in window functions for example. Specifically here what is > happening is that first the data is shuffled and then the entire partition > is sorted. It is possible to do it another way (although I have no idea how > to do it in spark without writing a UDAF which is probably very > inefficient). The other way would be to collect everything by key in each > partition, sort within the key (which would be a lot faster since there are > fewer elements) and then merge the results. > > > > I was hoping to find something like: Efficient sortByKey to work with… > > > > *From:* Koert Kuipers [via Apache Spark Developers List] > [mailto:ml-node+[hidden > email] <http:///user/SendEmail.jtp?type=node&node=20334&i=0>] > *Sent:* Thursday, December 22, 2016 7:14 AM > *To:* Mendelson, Assaf > *Subject:* Re: Aggregating over sorted data > > > > it can also be done with repartition + sortWithinPartitions + > mapPartitions. > > perhaps not as convenient but it does not rely on undocumented behavior. > > i used this approach in spark-sorted. see here: > > > https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala > > On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh <[hidden email] > <http:///user/SendEmail.jtp?type=node&node=20332&i=0>> wrote: > > > I agreed that to make sure this work, you might need to know the Spark > internal implementation for APIs such as `groupBy`. > > But without any more changes to current Spark implementation, I think this > is the one possible way to achieve the required function to aggregate on > sorted data per key. > > > > > > ----- > Liang-Chi Hsieh | @viirya > Spark Technology Center > http://www.spark.tc/ > -- > View this message in context: > http://apache-spark-developers-list.1001551.n3.nabble.com/Aggregating-over-sorted-data-tp19999p20331.html > > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > --------------------------------------------------------------------- > > To unsubscribe e-mail: [hidden email] > <http:///user/SendEmail.jtp?type=node&node=20332&i=1> > > > > > ------------------------------ > > *If you reply to this email, your message will be added to the discussion > below:* > > > http://apache-spark-developers-list.1001551.n3.nabble.com/Aggregating-over-sorted-data-tp19999p20332.html > > To start a new topic under Apache Spark Developers List, email [hidden > email] <http:///user/SendEmail.jtp?type=node&node=20334&i=1> > To unsubscribe from Apache Spark Developers List, click here. > NAML > <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml> > > ------------------------------ > View this message in context: RE: Aggregating over sorted data > <http://apache-spark-developers-list.1001551.n3.nabble.com/Aggregating-over-sorted-data-tp19999p20334.html> > Sent from the Apache Spark Developers List mailing list archive > <http://apache-spark-developers-list.1001551.n3.nabble.com/> at > Nabble.com. >