You may consider writing all your data to a NoSQL datastore such as HBase, using the user ID as the row key.
There is also a SQL solution using MAX with CASE expressions, finally UNIONing the results, but that may be expensive.

On Tue, 16 May 2017 at 12:13 am, Didac Gil <didacgil9...@gmail.com> wrote:

> Or maybe you could also check using collect_list from the SQL functions:
>
> val compacter = Data1.groupBy("UserID")
>   .agg(org.apache.spark.sql.functions.collect_list("feature").as("ListOfFeatures"))
>
>> On 15 May 2017, at 15:15, Jone Zhang <joyoungzh...@gmail.com> wrote:
>>
>> For example:
>>
>> Data1 (has 1 billion records)
>> user_id1 feature1
>> user_id1 feature2
>>
>> Data2 (has 1 billion records)
>> user_id1 feature3
>>
>> Data3 (has 1 billion records)
>> user_id1 feature4
>> user_id1 feature5
>> ...
>> user_id1 feature100
>>
>> I want to get the result as follows:
>> user_id1 feature1 feature2 feature3 feature4 feature5 ... feature100
>>
>> Is there a more efficient way than a join?
>>
>> Thanks!
>
> Didac Gil de la Iglesia
> PhD in Computer Science
> didacg...@gmail.com
> Spain: +34 696 285 544
> Sweden: +46 (0)730229737
> Skype: didac.gil.de.la.iglesia

--
Best Regards,
Ayan Guha
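[Editor's note] The union-then-collect idea discussed in the thread (union all three datasets, group by user ID, and collect each user's features into one list, as Spark's collect_list does) can be sketched outside Spark as a plain Python simulation of the same grouping semantics. The function name and the sample data are illustrative, not from the original thread:

```python
# Plain-Python sketch of: union(Data1, Data2, Data3)
#                          .groupBy(user_id)
#                          .agg(collect_list(feature))
# Names and sample data are illustrative only.
from collections import defaultdict

def collect_features(*datasets):
    """Union all (user_id, feature) records, then collect the
    features of each user into a single list."""
    collected = defaultdict(list)
    for records in datasets:          # "union" of the input datasets
        for user_id, feature in records:
            collected[user_id].append(feature)  # "collect_list"
    return dict(collected)

data1 = [("user_id1", "feature1"), ("user_id1", "feature2")]
data2 = [("user_id1", "feature3")]
data3 = [("user_id1", "feature4"), ("user_id1", "feature5")]

result = collect_features(data1, data2, data3)
```

Note this avoids a join entirely: because every dataset shares the same key column, a union followed by a single group-by aggregation does the work, at the cost of one shuffle rather than several join shuffles.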