You didn't specify which API, but in PySpark you could do:

import pyspark.sql.functions as F

df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')).alias('DETAILS')).show()

+---+------------+
| ID|     DETAILS|
+---+------------+
|  1|[A1, A2, A3]|
|  3|        [C1]|
|  2|        [B1]|
+---+------------+

Note this sorts the collected values alphabetically, which happens to match the PART order in your sample data. If you want to sort by PART itself I think you'll need a UDF.
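That said, one UDF-free possibility (a sketch, untested) is to collect (PART, DETAILS) structs and sort those, since sort_array orders an array of structs by its first field, then pull the DETAILS field back out:

import pyspark.sql.functions as F

(df.groupBy('ID')
   # structs compare field by field, so sorting the array orders entries by PART
   .agg(F.sort_array(F.collect_list(F.struct('PART', 'DETAILS'))).alias('s'))
   # accessing a field on an array of structs yields an array of that field
   .select('ID', F.col('s.DETAILS').alias('DETAILS'))
   .show())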
On Wed, Aug 22, 2018 at 4:12 PM, Jean Georges Perrin <j...@jgp.net> wrote:
> How do you do it now?
>
> You could use a withColumn("newDetails", <concatenation of details_1, details_2...>)
>
> jg
>
> > On Aug 22, 2018, at 16:04, msbreuer <msbre...@gmail.com> wrote:
> >
> > A dataframe with the following contents is given:
> >
> > ID  PART  DETAILS
> > 1   1     A1
> > 1   2     A2
> > 1   3     A3
> > 2   1     B1
> > 3   1     C1
> >
> > The target format should be as follows:
> >
> > ID  DETAILS
> > 1   A1+A2+A3
> > 2   B1
> > 3   C1
> >
> > Note, the order of A1-A3 is important.
> >
> > Currently I am using this alternative:
> >
> > ID  DETAIL_1  DETAIL_2  DETAIL_3
> > 1   A1        A2        A3
> > 2   B1
> > 3   C1
> >
> > What would be the best method to do such a transformation on a large dataset?
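P.S. For the exact target format quoted above (DETAILS joined into a single string like A1+A2+A3), the sorted array can be flattened with concat_ws; a minimal sketch building on the struct idea from earlier in this message:

import pyspark.sql.functions as F

(df.groupBy('ID')
   .agg(F.sort_array(F.collect_list(F.struct('PART', 'DETAILS'))).alias('s'))
   # concat_ws joins the elements of an array<string> with the given separator
   .select('ID', F.concat_ws('+', F.col('s.DETAILS')).alias('DETAILS'))
   .show())

This should yield A1+A2+A3 for ID 1 and the single values for IDs 2 and 3.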