Seems to me the obvious candidate is loading both master and delta, doing a
join or cogroup, then writing out the new master.
Through some clever sharding and key management you might achieve some
efficiency gains, but I’d say start here if your numbers are in the hundreds of
thousands… it should run in under a minute with the right resources…
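To make the idea concrete, here is a minimal plain-Python sketch of the join-and-merge pattern (in PySpark this would be roughly `master_rdd.cogroup(delta_rdd)` followed by a `mapValues` that merges the maps). The record layout `(user_id, profiles dict)` and the rule that delta values win on key conflicts are assumptions for illustration, not something stated in the thread.

```python
def merge_profiles(master_profiles, delta_profiles):
    """Merge two profiles maps; entries from the daily delta override the master."""
    merged = dict(master_profiles)
    merged.update(delta_profiles)
    return merged

def merge_master_with_delta(master, delta):
    """Emulate a cogroup on user_id: every master record is kept, and records
    that also appear in the delta get their profiles map merged."""
    delta_by_id = dict(delta)  # delta is small (~100s), so an in-memory index is fine here
    new_master = []
    for user_id, profiles in master:
        if user_id in delta_by_id:
            new_master.append((user_id, merge_profiles(profiles, delta_by_id[user_id])))
        else:
            new_master.append((user_id, profiles))
    return new_master
```

For example:

```python
master = [("u1", {"plan": "free"}), ("u2", {"plan": "pro"})]
delta  = [("u1", {"plan": "pro", "lang": "en"})]
merge_master_with_delta(master, delta)
# → [("u1", {"plan": "pro", "lang": "en"}), ("u2", {"plan": "pro"})]
```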
-adrian
From: TEST ONE
Date: Tuesday, September 29, 2015 at 3:00 AM
To: "user@spark.apache.org"
Subject: Merging two avro RDD/DataFrames
I have a daily update of modified users (~100s) output as Avro from ETL. I’d
need to find and merge them with the corresponding existing members in a
master Avro file (~100,000s). The merge operation involves merging a
‘profiles’ Map<String,String> between the matching records.
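For one matching pair of records, the map merge might look like the following (a hypothetical illustration in Python dicts; the rule that the daily update wins on key conflicts is an assumption):

```python
master_profiles = {"city": "SF", "plan": "free"}   # from the master file
update_profiles = {"plan": "pro", "lang": "en"}    # from the daily delta
merged = {**master_profiles, **update_profiles}    # update keys override master keys
# merged == {"city": "SF", "plan": "pro", "lang": "en"}
```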
What would be the recommended pattern to handle record merging with Spark?
Thanks,
kc