Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
Thanks Sumit, please post back how your test with HBase goes.

On Fri, Jul 29, 2016 at 8:06 PM, Sumit Khanna wrote:
> Hey Ayan,
>
> A. Create a table TGT1 as (select key,info from delta UNION ALL select
> key,info from TGT where key not in (select key from SRC)). Rename

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
Hey Ayan,

A. Create a table TGT1 as (select key,info from delta UNION ALL select key,info from TGT where key not in (select key from SRC)). Rename TGT1 to TGT. NOT IN can also be rewritten in other variations using an outer join.

B. Assuming SRC and TGT have a timestamp, B.1. Select latest records
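A minimal PySpark sketch of the outer-join variation of NOT IN mentioned in option A above. The table names TGT and SRC come from the thread; the session and dataframe names, and the assumption that both tables have key and info columns, are illustrative rather than from the original mails.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()

    tgt_df = spark.table("TGT")  # existing target: key, info
    src_df = spark.table("SRC")  # delta / source: key, info

    # Mark every key present in SRC, then keep the TGT rows that got no
    # mark -- the left-outer-join equivalent of NOT IN (SELECT key FROM SRC).
    src_keys = src_df.select("key").distinct().withColumn("in_src", F.lit(True))

    unchanged = (
        tgt_df.join(src_keys, "key", "left_outer")
              .filter(F.col("in_src").isNull())
              .drop("in_src")
    )

    # New target = fresh delta rows + untouched target rows.
    new_tgt = src_df.select("key", "info").unionAll(unchanged.select("key", "info"))

On Spark 2.0+ the unchanged rows can be computed more directly with an anti join: tgt_df.join(src_df, "key", "leftanti").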

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread ayan guha
This is a classic case of a Hadoop vs. DWH implementation.

Source (delta table): SRC. Target: TGT.
Requirement: pure upsert, i.e. just keep the latest information for each key.

Options:
A. Create a table TGT1 as (select key,info from delta UNION ALL select key,info from TGT where key not in
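A hedged sketch of option A expressed directly through Spark SQL. This assumes Spark 2.0+ (for the NOT IN subquery) with Hive support enabled (for CREATE TABLE AS and the rename); table and column names are the illustrative ones from the thread.

    # Build the new target from the delta plus the untouched old rows.
    spark.sql("""
        CREATE TABLE TGT1 AS
        SELECT key, info FROM SRC
        UNION ALL
        SELECT key, info FROM TGT
        WHERE key NOT IN (SELECT key FROM SRC)
    """)

    # Swap the tables so downstream readers keep using the name TGT.
    spark.sql("DROP TABLE TGT")
    spark.sql("ALTER TABLE TGT1 RENAME TO TGT")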

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
Just a note: I had the delta_df keys for the filter (the NOT INTERSECTION UDF) broadcast to all the worker nodes, which I think is an efficient enough move. Thanks,

On Fri, Jul 29, 2016 at 12:19 PM, Sumit Khanna wrote:
> Hey,
>
> the very first run:
>
> glossary:
>
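A rough sketch of the broadcast trick Sumit describes: collect the delta keys on the driver, broadcast the set to every worker node, and filter the target with a membership UDF (the "NOT INTERSECTION"). The names spark, delta_df, and tgt_df are illustrative, and this only pays off when the delta key set fits comfortably in executor memory.

    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    # Broadcast the (assumed small) set of changed keys to all workers.
    delta_keys = spark.sparkContext.broadcast(
        {r["key"] for r in delta_df.select("key").distinct().collect()}
    )

    # Keep only target rows whose key is absent from the delta.
    not_in_delta = F.udf(lambda k: k not in delta_keys.value, BooleanType())
    unchanged_df = tgt_df.filter(not_in_delta(F.col("key")))

    # Upserted target: changed rows from the delta + untouched old rows
    # (assumes both dataframes share the same schema).
    new_tgt_df = delta_df.unionAll(unchanged_df)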

correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
Hey,

the very first run:

glossary:

delta_df := the dataframe of changes for the current run / execution.
def deduplicate: apply a windowing function and group by.
def partitionDataframe(delta_df): get the unique keys of that dataframe and then return an array of dataframes, each containing just that very same
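A small self-contained sketch of the deduplicate step from the glossary: keep only the latest record per key using a window function. The column names key, info, and updated_at are assumptions for illustration.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

    # Hypothetical delta of this run: two versions of key 1, one of key 2.
    delta_df = spark.createDataFrame(
        [(1, "a", 10), (1, "b", 20), (2, "c", 5)],
        ["key", "info", "updated_at"],
    )

    # Newest row first within each key, then keep row number 1 only.
    w = Window.partitionBy("key").orderBy(F.col("updated_at").desc())

    deduped_df = (
        delta_df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn")
    )

    deduped_df.show()  # key 1 keeps info "b", key 2 keeps "c"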