Or maybe you could also check collect_list from the SQL functions:

    val compacter = Data1.groupBy("UserID")
      .agg(org.apache.spark.sql.functions.collect_list("feature").as("ListOfFeatures"))
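A minimal end-to-end sketch of that idea, assuming the three inputs are DataFrames named data1, data2, and data3 with columns (user_id, feature) as in the example records quoted below (those names are assumptions, not from the original post): union the inputs first, then aggregate once per user, which replaces the chain of joins with a single groupBy.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

    // Assumed inputs: data1, data2, data3 each with columns (user_id, feature).
    // Note: union requires the three DataFrames to share the same schema.
    val all: DataFrame = data1.union(data2).union(data3)

    // Collect all features per user in one aggregation instead of joining.
    val compacted = all
      .groupBy("user_id")
      .agg(collect_list("feature").as("features"))

    // Optional: flatten the array into the "user_id feature1 ... feature100" layout.
    val flat = compacted.select(col("user_id"),
      concat_ws(" ", col("features")).as("features"))

Keep in mind that collect_list makes no ordering guarantee within the array, so if the feature order matters you would need to sort the array (or carry an ordering column) yourself.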
> On 15 May 2017, at 15:15, Jone Zhang <joyoungzh...@gmail.com> wrote:
>
> For example
>
> Data1 (has 1 billion records)
> user_id1 feature1
> user_id1 feature2
>
> Data2 (has 1 billion records)
> user_id1 feature3
>
> Data3 (has 1 billion records)
> user_id1 feature4
> user_id1 feature5
> ...
> user_id1 feature100
>
> I want to get the result as follows:
> user_id1 feature1 feature2 feature3 feature4 feature5...feature100
>
> Is there a more efficient way besides a join?
>
> Thanks!

Didac Gil de la Iglesia
PhD in Computer Science
didacg...@gmail.com
Spain: +34 696 285 544
Sweden: +46 (0)730229737
Skype: didac.gil.de.la.iglesia