Hey,

The very first run:

glossary:

delta_df := the dataframe of changes for the current run / execution.

def deduplicate:
apply a window function and group-by so that only one row per primary /
deduplicating key remains (rough sketch below).
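
A rough sketch of what I do there, assuming a single primary key column
"id" and an "updated_at" column to pick the latest duplicate (both column
names are placeholders):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, row_number}

  // keep only the latest row per primary / deduplicating key
  def deduplicate(deltaDf: DataFrame, keyCol: String, orderCol: String): DataFrame = {
    val w = Window.partitionBy(col(keyCol)).orderBy(col(orderCol).desc)
    deltaDf
      .withColumn("rn", row_number().over(w))
      .filter(col("rn") === 1)
      .drop("rn")
  }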

def partitionDataframe(delta_df):
get the unique values of the partition column of that dataframe and return
an array of dataframes, each containing only the rows for one such value.
This gives the above dataframe partitioned by, say, a date column, gender
column, age-group column, etc. (rough sketch below).
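
Roughly like this (the partition column name is whatever I cut by; note
that the distinct values are collected to the driver):

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col

  // one dataframe per distinct value of the partition column
  def partitionDataframe(deltaDf: DataFrame, partCol: String): Array[DataFrame] = {
    val keys = deltaDf.select(partCol).distinct().collect().map(_.get(0))
    keys.map(k => deltaDf.filter(col(partCol) === k))
  }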

0. deduplicate(delta_df) : delta_df [ with all rows unique on the primary /
deduplicating key column ]
1. partitionDataframe(delta_df) : Array[delta_df(1 to # partitions)]
2. write each partitioned dataframe to the corresponding parent HDFS path +
partition dir (rough sketch below)
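
Using the two sketches above, the first run looks roughly like this (the
parent path and the "id" / "updated_at" / "date" columns are placeholders):

  val parentPath = "hdfs:///data/mytable"   // placeholder
  val deduped = deduplicate(delta_df, "id", "updated_at")
  partitionDataframe(deduped, "date").foreach { p =>
    val partKey = p.select("date").first().get(0).toString
    p.write.mode("overwrite").format("parquet").save(s"$parentPath/date=$partKey")
  }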

subsequent runs :

for each partition:
0. partitionDataframe(delta_df) : Array[delta_df(1 to # partitions)]
1. load df(p) from the previous HDFS location of that partition
2. filter the above df(p), where p is the partition no., so that only keys
not present in delta_df(p) of the current run remain, i.e. get
df(p)[primary column] not in delta_df(p); done via a basic "! in" UDF.
3. delta_df(p).unionAll(filtered df above).
4. persist the output of 3 via df.write.mode(...).format(...) (rough sketch
below).
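
Concretely, per partition that is roughly the following (again "id",
"date" and the parent path are placeholders; I express the "keys not in
delta" filter with isin over the collected delta keys rather than a UDF,
and write to a temp dir first because the partition dir being replaced is
still being read lazily, swapping the dirs afterwards):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder().getOrCreate()
  val parentPath = "hdfs:///data/mytable"   // placeholder

  partitionDataframe(delta_df, "date").foreach { deltaP =>
    val partKey  = deltaP.select("date").first().get(0).toString
    val partPath = s"$parentPath/date=$partKey"

    // 1. load the previously written data for this partition
    val oldDf = spark.read.format("parquet").load(partPath)

    // 2. keep only old rows whose primary key is NOT in the current delta
    val deltaKeys = deltaP.select("id").distinct().collect().map(_.get(0))
    val survivors = oldDf.filter(!col("id").isin(deltaKeys: _*))

    // 3. union the delta with the surviving old rows
    //    (assumes old and new data share the same column order)
    val merged = deltaP.unionAll(survivors)

    // 4. persist to a temp dir, to be swapped in afterwards
    merged.write.mode("overwrite").format("parquet").save(partPath + "_tmp")
  }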

Is this the right way of doing the upserts partition-wise? All in all it is
taking 2 hours to insert / upsert 500K records in Parquet format in an HDFS
location where each location gets mapped to one partition.

My Spark conf specs are:

YARN cluster mode, single node.
spark.executor.memory 8g
spark.rpc.netty.dispatcher.numThreads 2

Thanks,
Sumit
