You didn't specify which API, but in PySpark you could do:

import pyspark.sql.functions as F

df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')).alias('DETAILS')).show()

+---+------------+
| ID|     DETAILS|
+---+------------+
|  1|[A1, A2, A3]|
|  3|        [C1]|
|  2|        [B1]|
+---+------------+

Note this sorts the collected values alphabetically, which happens to match the PART order in your sample data. If you want to sort by PART itself I think you'll need a UDF.
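That said, one UDF-free possibility (a sketch, untested) is to collect (PART, DETAILS) structs and sort those, since sort_array orders an array of structs by its first field, then pull the DETAILS field back out:

import pyspark.sql.functions as F

(df.groupBy('ID')
   # structs compare field by field, so sorting the array orders entries by PART
   .agg(F.sort_array(F.collect_list(F.struct('PART', 'DETAILS'))).alias('s'))
   # accessing a field on an array of structs yields an array of that field
   .select('ID', F.col('s.DETAILS').alias('DETAILS'))
   .show())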
On Wed, Aug 22, 2018 at 4:12 PM, Jean Georges Perrin <j...@jgp.net> wrote:
> How do you do it now?
>
> You could use a withColumn("newDetails", <concatenation of details_1, details_2...>)
>
> jg
>
> > On Aug 22, 2018, at 16:04, msbreuer <msbre...@gmail.com> wrote:
> >
> > A dataframe with the following contents is given:
> >
> > ID  PART  DETAILS
> > 1   1     A1
> > 1   2     A2
> > 1   3     A3
> > 2   1     B1
> > 3   1     C1
> >
> > The target format should be as follows:
> >
> > ID  DETAILS
> > 1   A1+A2+A3
> > 2   B1
> > 3   C1
> >
> > Note, the order of A1-A3 is important.
> >
> > Currently I am using this alternative:
> >
> > ID  DETAIL_1  DETAIL_2  DETAIL_3
> > 1   A1        A2        A3
> > 2   B1
> > 3   C1
> >
> > What would be the best method to do such a transformation on a large dataset?
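P.S. For the exact target format quoted above (DETAILS joined into a single string like A1+A2+A3), the sorted array can be flattened with concat_ws; a minimal sketch building on the struct idea from earlier in this message:

import pyspark.sql.functions as F

(df.groupBy('ID')
   .agg(F.sort_array(F.collect_list(F.struct('PART', 'DETAILS'))).alias('s'))
   # concat_ws joins the elements of an array<string> with the given separator
   .select('ID', F.concat_ws('+', F.col('s.DETAILS')).alias('DETAILS'))
   .show())

This should yield A1+A2+A3 for ID 1 and the single values for IDs 2 and 3.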