Thanks Sumit, please post back how your test with HBase goes.
On Fri, Jul 29, 2016 at 8:06 PM, Sumit Khanna wrote:
Hey Ayan,
A. Create a table TGT1 as (select key,info from delta UNION ALL select
key,info from TGT where key not in (select key from SRC)). Rename TGT1 to
TGT. The NOT IN can also be written in other ways, e.g. using an outer join.
B. Assuming SRC and TGT have a timestamp,
B.1. Select latest records
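Option A can be tried end to end with a small sketch. This one uses Python's built-in sqlite3 module rather than Hive or Spark SQL, and the sample rows are made up for illustration; it assumes SRC is the delta table, as stated in the thread. It also checks that the outer-join form keeps the same target rows as the NOT IN form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical sample data: TGT is the current target, SRC is the delta.
cur.execute("CREATE TABLE TGT (key TEXT, info TEXT)")
cur.execute("CREATE TABLE SRC (key TEXT, info TEXT)")
cur.executemany("INSERT INTO TGT VALUES (?, ?)", [("k1", "old1"), ("k2", "old2")])
cur.executemany("INSERT INTO SRC VALUES (?, ?)", [("k2", "new2"), ("k3", "new3")])

# Target rows that survive the upsert, written with NOT IN ...
not_in_rows = cur.execute("""
    SELECT key, info FROM TGT
    WHERE key NOT IN (SELECT key FROM SRC)
    ORDER BY key
""").fetchall()

# ... and the same rows written as an outer join (anti-join).
outer_join_rows = cur.execute("""
    SELECT t.key, t.info FROM TGT t
    LEFT OUTER JOIN SRC s ON t.key = s.key
    WHERE s.key IS NULL
    ORDER BY t.key
""").fetchall()

# Option A: all delta rows, plus the target rows whose key is not in the delta.
cur.execute("""
    CREATE TABLE TGT1 AS
    SELECT key, info FROM SRC
    UNION ALL
    SELECT key, info FROM TGT WHERE key NOT IN (SELECT key FROM SRC)
""")

# Swap the rebuilt table in place of the old target.
cur.execute("DROP TABLE TGT")
cur.execute("ALTER TABLE TGT1 RENAME TO TGT")

upserted = cur.execute("SELECT key, info FROM TGT ORDER BY key").fetchall()
print(upserted)
```

One caveat with the NOT IN form: if the subquery can return a NULL key, NOT IN matches no rows at all, so the outer-join variant is the safer choice when keys are nullable.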
This is a classic case of comparing a Hadoop implementation with a DWH implementation.
Source (Delta table): SRC. Target: TGT
Requirement: Pure upsert, i.e. just keep the latest information for each key.
Options:
A. Create a table TGT1 as (select key,info from delta UNION ALL select
key,info from TGT where key not in
Just a note: I broadcast the delta_df keys used by the filter (the NOT
INTERSECTION udf) to all the worker nodes, which I think is an efficient
enough move.
Thanks,
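The logic of that key filter can be sketched in plain Python, without Spark. The row layout, sample values, and the helper name not_in_delta are hypothetical and are not taken from the actual job. In Spark, the key set would typically be shipped to the executors with SparkContext.broadcast and read through .value inside the udf.

```python
# Hypothetical rows as (key, info) tuples; in Spark these would be DataFrames.
target_rows = [("k1", "old1"), ("k2", "old2")]
delta_rows = [("k2", "new2"), ("k3", "new3")]

# The small, read-only key set that would be broadcast to every worker.
delta_keys = frozenset(key for key, _ in delta_rows)


def not_in_delta(row):
    """Filter predicate: keep a target row only if the delta does not replace it."""
    return row[0] not in delta_keys


# Upsert = surviving target rows plus all delta rows.
merged = [row for row in target_rows if not_in_delta(row)] + delta_rows
print(sorted(merged))
```

This approach only pays off while the delta key set is small enough to fit in memory on each worker; for a large delta, a join-based anti-join is likely the better fit.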
On Fri, Jul 29, 2016 at 12:19 PM, Sumit Khanna
wrote:
Hey,
the very first run :
glossary :
delta_df := current run / execution changes dataframe.
def deduplicate :
apply windowing function and group by
def partitionDataframe(delta_df) :
get unique keys of that data frame and then return an array of data frames
each containing just that very same