I guess that if your user_id field is the key, you could use the updateStateByKey function.
I did not test it, but it could be something along these lines: def yourCombineFunction(input: Seq[(String)],accumulatedInput: Option[(String)] = { val state = accumulatedInput.getOrElse((“”)) //In case the current Key was not found before, the features list is empty val feature = input._1 //We get the feature value of this new entry val newFeature = state._1 +” “+feature Some((newFeature)) //The new accumulated value for the features is returned } val updatedData = Data1.updateStateByKey(yourCombineFunction) //This would “iterate” among all the entries in your Dataset and, for each row, will update the “accumulatedFeatures” Good luck > On 15 May 2017, at 15:15, Jone Zhang <joyoungzh...@gmail.com> wrote: > > For example > Data1(has 1 billion records) > user_id1 feature1 > user_id1 feature2 > > Data2(has 1 billion records) > user_id1 feature3 > > Data3(has 1 billion records) > user_id1 feature4 > user_id1 feature5 > ... > user_id1 feature100 > > I want to get the result as follow > user_id1 feature1 feature2 feature3 feature4 feature5...feature100 > > Is there a more efficient way except join? > > Thanks! Didac Gil de la Iglesia PhD in Computer Science didacg...@gmail.com Spain: +34 696 285 544 Sweden: +46 (0)730229737 Skype: didac.gil.de.la.iglesia
signature.asc
Description: Message signed with OpenPGP