Re: Aggregating over sorted data

trsell Thu, 22 Dec 2016 06:07:59 -0800

I would love this feature

On Thu, 22 Dec 2016, 18:45 assaf.mendelson, <assaf.mendel...@rsa.com> wrote:


> It seems that this aggregation is for Dataset operations only. I would
> have hoped to be able to do DataFrame aggregation, something along the lines
> of: sort_df(df).agg(my_agg_func)
>
>
>
> In any case, note that this kind of sorting is less efficient than the
> sorting done in window functions, for example. Specifically, what happens
> here is that the data is first shuffled and then the entire partition is
> sorted. It is possible to do it another way (although I have no idea how
> to do it in Spark without writing a UDAF, which is probably very
> inefficient). The other way would be to collect everything by key in each
> partition, sort within each key (which would be a lot faster since each
> sort covers fewer elements) and then merge the results.
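
To make the alternative above concrete, here is a minimal plain-Python sketch of one possible reading of it. It is not Spark code and not an existing Spark API: partitions are modelled as ordinary lists of (key, value) pairs, and the helper names sort_within_keys and merge_sorted_groups are hypothetical.

```python
from collections import defaultdict
from heapq import merge


def sort_within_keys(partition):
    """Group one partition's (key, value) records by key and sort each group."""
    groups = defaultdict(list)
    for key, value in partition:
        groups[key].append(value)
    # Each sort only touches the values of a single key.
    return {key: sorted(values) for key, values in groups.items()}


def merge_sorted_groups(per_partition_groups):
    """Merge the already-sorted per-key runs coming from all partitions."""
    runs = defaultdict(list)
    for groups in per_partition_groups:
        for key, values in groups.items():
            runs[key].append(values)
    # heapq.merge combines sorted runs without sorting them again.
    return {key: list(merge(*sorted_runs)) for key, sorted_runs in runs.items()}


# Two toy "partitions" (an assumption for illustration only).
partitions = [
    [("a", 3), ("b", 9), ("a", 1)],
    [("b", 2), ("a", 2), ("b", 7)],
]
merged = merge_sorted_groups(sort_within_keys(p) for p in partitions)
print(merged)
```

Whether this beats sorting a whole partition in practice depends on how many distinct keys a partition holds and on memory pressure, which this sketch does not model.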
>
>
>
> I was hoping to find something like: Efficient sortByKey to work with…
>
>
>
> *From:* Koert Kuipers [via Apache Spark Developers List]
> *Sent:* Thursday, December 22, 2016 7:14 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: Aggregating over sorted data
>
>
>
> it can also be done with repartition + sortWithinPartitions +
> mapPartitions.
>
> perhaps not as convenient but it does not rely on undocumented behavior.
>
> i used this approach in spark-sorted. see here:
>
>
> https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala
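
As a rough illustration of the repartition + sortWithinPartitions + mapPartitions idea, the following plain-Python model mimics the three steps on in-memory lists. It is only a sketch under stated assumptions: it does not call Spark, it is not the spark-sorted implementation, and the function names repartition, sort_within_partition and aggregate_partition are hypothetical stand-ins for the corresponding steps.

```python
from itertools import groupby
from operator import itemgetter


def repartition(records, num_partitions):
    """Send every record with the same key to the same partition (hash partitioning)."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[hash(record[0]) % num_partitions].append(record)
    return partitions


def sort_within_partition(partition):
    """Sort one partition by key, then by value; no ordering across partitions."""
    return sorted(partition)


def aggregate_partition(partition, agg):
    """Stream through a sorted partition; consecutive equal keys form one group."""
    for key, group in groupby(partition, key=itemgetter(0)):
        yield key, agg(value for _, value in group)


records = [("a", 3), ("b", 9), ("a", 1), ("b", 2), ("a", 2)]
result = {}
for part in repartition(records, 2):
    # `list` stands in for an order-sensitive aggregation function.
    result.update(aggregate_partition(sort_within_partition(part), list))
print(result)
```

Because each key lives in exactly one partition and each partition is sorted before it is scanned, the aggregation function sees the values of a key in order without relying on how a grouping operator happens to be implemented.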
>
> On Wed, Dec 21, 2016 at 9:44 PM, Liang-Chi Hsieh wrote:
>
>
> I agree that to make sure this works, you might need to know Spark's
> internal implementation of APIs such as `groupBy`.
>
> But without any further changes to the current Spark implementation, I
> think this is one possible way to achieve the required functionality of
> aggregating over sorted data per key.
>
>
>
>
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
