Seems to me that the obvious candidate is loading both master and delta, doing a 
join or cogroup on the user key, then writing out the new master.

Through some clever sharding and key management you might squeeze out some 
efficiency gains, but I’d start there if your numbers are in the hundreds of 
thousands… it should run in under a minute with the right resources…
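
Roughly, a minimal (untested) sketch of that, assuming an 'id' key field, a 
'profiles' map column, and the Databricks spark-avro package for reading and 
writing; adjust the names and paths to your actual schema:

import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext

case class User(id: String, profiles: Map[String, String])

def mergeMaster(sqlContext: SQLContext,
                masterPath: String,
                deltaPath: String,
                outputPath: String): Unit = {
  import sqlContext.implicits._

  // Read an avro file and key each record by its user id.
  def load(path: String) =
    sqlContext.read.avro(path).rdd.map { row =>
      val id = row.getAs[String]("id")
      val profiles =
        row.getAs[scala.collection.Map[String, String]]("profiles").toMap
      (id, User(id, profiles))
    }

  val master = load(masterPath)
  val delta  = load(deltaPath)

  // cogroup keeps unmatched master records, picks up brand-new delta records,
  // and merges the profiles maps where both sides match; delta entries win on
  // conflicting profile keys because they are folded in last.
  val merged = master.cogroup(delta).map { case (id, (masterUsers, deltaUsers)) =>
    val profiles = (masterUsers ++ deltaUsers)
      .map(_.profiles)
      .foldLeft(Map.empty[String, String])(_ ++ _)
    User(id, profiles)
  }

  // Write the merged result back out as the new master.
  merged.toDF().write.avro(outputPath)
}

A plain join works too, but cogroup saves you from special-casing users that 
only exist on one side.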

-adrian

From: TEST ONE
Date: Tuesday, September 29, 2015 at 3:00 AM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Merging two avro RDD/DataFrames


I have a daily update of modified users (~100s) output as avro from ETL. I need 
to find the corresponding records in a master avro file (~100,000s) and merge 
them. The merge operation involves merging a ‘profiles’ Map<String,String> 
between the matching records.


What would be the recommended pattern to handle record merging with Spark?


Thanks,

kc
