Here is a similar, though not identical, approach I took to a comparable problem. I had two data files in different formats, and their different columns needed to become different features. I wanted to feed them into Spark's:
You may consider writing all your data to a NoSQL datastore such as HBase,
using the user id as the row key.
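Not from the original mail, but to illustrate the idea: a plain Scala map can stand in for the HBase table, with every write against the same row key accumulating columns under that key (the `put` helper and the sample values are hypothetical, not the HBase client API).

```scala
// Sketch: emulate "one row per user id" writes with a mutable map.
// In HBase, each Put against the same row key adds columns to that row;
// here Map[userId -> Map[column -> value]] plays the role of the table.
import scala.collection.mutable

val table = mutable.Map.empty[String, mutable.Map[String, String]]

// put(rowKey, column, value): stands in for an HBase Put (hypothetical helper)
def put(userId: String, column: String, value: String): Unit = {
  val row = table.getOrElseUpdate(userId, mutable.Map.empty[String, String])
  row(column) = value
}

// Writes coming from the three separate datasets all land in the same row
put("user_id1", "feature1", "v1")
put("user_id1", "feature2", "v2")
put("user_id1", "feature3", "v3")

println(table("user_id1").size) // 3 columns gathered under one key
```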
There is a SQL solution using MAX with a CASE expression, finally UNIONing the
results, but that may be expensive.
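To make the MAX/CASE/UNION idea concrete, here is the same pattern simulated on in-memory collections (the values are made up; in SQL this would be a UNION ALL of the three tables followed by MAX(CASE WHEN name = 'featureN' THEN value END) per column, grouped by user_id):

```scala
// Union the three (user_id, feature_name, value) tables, then pivot:
// for each feature column, take MAX over the values whose name matches.
val data1 = Seq(("user_id1", "feature1", "a"), ("user_id1", "feature2", "b"))
val data2 = Seq(("user_id1", "feature3", "c"))
val data3 = Seq(("user_id1", "feature4", "d"))

val unioned = data1 ++ data2 ++ data3              // UNION ALL
val featureNames = Seq("feature1", "feature2", "feature3", "feature4")

val pivoted: Map[String, Seq[String]] =
  unioned.groupBy(_._1).map { case (user, recs) =>
    user -> featureNames.map { f =>
      // MAX(CASE WHEN feature_name = f THEN value END)
      val vs = recs.collect { case (_, n, v) if n == f => v }
      if (vs.isEmpty) "" else vs.max
    }
  }

println(pivoted("user_id1")) // List(a, b, c, d)
```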
On Tue, 16 May 2017 at 12:13 am, Didac Gil wrote:
Or maybe you could also check using collect_list from the SQL functions:
val compacter = Data1.groupBy("UserID")
  .agg(org.apache.spark.sql.functions.collect_list("feature").as("ListOfFeatures"))
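For anyone who wants to see what that produces without spinning up Spark, the groupBy + collect_list step can be mimicked on plain Scala collections (the sample rows below are hypothetical):

```scala
// Mimic Data1.groupBy("UserID").agg(collect_list("feature")) in memory:
val rows = Seq(
  ("user_id1", "feature1"),
  ("user_id1", "feature2"),
  ("user_id2", "feature3")
)

val compacted: Map[String, List[String]] =
  rows.groupBy(_._1)                                    // groupBy("UserID")
      .map { case (u, fs) => u -> fs.map(_._2).toList } // collect_list("feature")

println(compacted("user_id1")) // List(feature1, feature2)
```

Note that in real Spark, collect_list gives no ordering guarantee after a shuffle, so the order of features inside each list should not be relied on.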
> On 15 May 2017, at 15:15, Jone Zhang wrote:
I guess that if your user_id field is the key, you could use the
updateStateByKey function.
I did not test it, but it could be something along these lines:
def yourCombineFunction(input: Seq[String], accumulatedInput: Option[String]): Option[String] = {
  val state = accumulatedInput.getOrElse("")
  Some((state +: input).mkString(" ").trim)
}
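Here is a self-contained version of that idea, folding successive micro-batches of features into the per-key state the way updateStateByKey would (the batch contents are hypothetical):

```scala
// The combine step: append this batch's features to the accumulated state.
def combine(input: Seq[String], state: Option[String]): Option[String] =
  Some((state.getOrElse("") +: input).mkString(" ").trim)

// Two hypothetical micro-batches of features for one user id:
val batches = Seq(Seq("feature1", "feature2"), Seq("feature3"))

// updateStateByKey applies the function once per batch; foldLeft emulates that.
val finalState = batches.foldLeft(Option.empty[String])((st, b) => combine(b, st))

println(finalState) // Some(feature1 feature2 feature3)
```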
For example
Data1 (has 1 billion records)
user_id1 feature1
user_id1 feature2
Data2 (has 1 billion records)
user_id1 feature3
Data3 (has 1 billion records)
user_id1 feature4
user_id1 feature5
...
user_id1 feature100
I want to get the result as follows:
user_id1 feature1 feature2 feature3