You may consider writing all your data to a NoSQL datastore such as HBase,
using the user id as the row key.
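
A minimal sketch of that idea, assuming the standard HBase client API (the
table name "user_features" and column family "f" are placeholders of mine):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// One Put per (user_id, feature) pair; the user id is the row key, so all
// features for a user land in the same row.
val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("user_features"))
table.put(new Put(Bytes.toBytes("user_id1"))
  .addColumn(Bytes.toBytes("f"), Bytes.toBytes("feature1"), Bytes.toBytes("1")))
table.close()
conn.close()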

There is also a SQL solution using MAX over an inner CASE, finally UNION-ing
the results, but that may be expensive.
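
One reading of that, assuming the three datasets are registered as temp views
data1/data2/data3 and each source holds at most one feature per user (which
Data1 and Data3 below do not satisfy, so take this as a sketch of the shape
only):

val wide = spark.sql("""
  SELECT user_id,
         MAX(CASE WHEN src = 1 THEN feature END) AS feature_d1,
         MAX(CASE WHEN src = 2 THEN feature END) AS feature_d2,
         MAX(CASE WHEN src = 3 THEN feature END) AS feature_d3
  FROM (
    SELECT user_id, feature, 1 AS src FROM data1
    UNION ALL
    SELECT user_id, feature, 2 AS src FROM data2
    UNION ALL
    SELECT user_id, feature, 3 AS src FROM data3
  ) t
  GROUP BY user_id
""")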
On Tue, 16 May 2017 at 12:13 am, Didac Gil <didacgil9...@gmail.com> wrote:

> Or maybe you could also check using the collect_list from the SQL functions
>
> val compacter = Data1.groupBy("UserID")
>   .agg(org.apache.spark.sql.functions.collect_list("feature").as("ListOfFeatures"))
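>
> A possible extension to all three sources, flattening the list into a single
> string (the variable names are mine; union, collect_list and concat_ws are
> standard Spark SQL API):
>
> import org.apache.spark.sql.functions.{collect_list, concat_ws}
>
> // All three sources share the (UserID, feature) schema, so union them
> // first and aggregate once.
> val all = Data1.union(Data2).union(Data3)
> val compacted = all.groupBy("UserID")
>   .agg(concat_ws(" ", collect_list("feature")).as("features"))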
>
>
>
> On 15 May 2017, at 15:15, Jone Zhang <joyoungzh...@gmail.com> wrote:
>
> For example:
> Data1 (1 billion records)
> user_id1  feature1
> user_id1  feature2
>
> Data2 (1 billion records)
> user_id1  feature3
>
> Data3 (1 billion records)
> user_id1  feature4
> user_id1  feature5
> ...
> user_id1  feature100
>
> I want to get the result as follows:
> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
>
> Is there a more efficient way than a join?
>
> Thanks!
>
> Didac Gil de la Iglesia
> PhD in Computer Science
> didacg...@gmail.com
> Spain:     +34 696 285 544
> Sweden: +46 (0)730229737
> Skype: didac.gil.de.la.iglesia
>
--
Best Regards,
Ayan Guha
