Why don't you simply copy whole of delta data (Table A) into a stage table
(temp table in your case) and insert depending on a *WHERE NOT EXISTS* check
on primary key/composite key which already exists in the table B?

That's faster and does the reconciliation job smoothly enough.

Others, any better input?

On Wed 29 May, 2019, 10:50 PM Tomasz Krol, <patric...@gmail.com> wrote:

> Hey Guys,
>
> I am wondering what would be your approach to following scenario:
>
> I have two tables - one (Table A) is relatively small (e.g 50GB) and
> second one (Table B) much bigger (e.g. 3TB). Both are parquet tables.
>
>  I want to ADD all records from Table A to Table B which dont exist in
> Table B yet. I use only one field (e.g. key) to check existence for
> specific record.
>
> Then I want to UPDATE (by values from Table A) all records in Table B
> which also exist in Table A. To determine if specific record exist I use
> also the same "key" field.
>
> To achieve above I run following sql queries:
>
> 1. Find existing records and insert into temp table
>
> insert into temp_table select a.cols from Table A a left semi join Table B
> b on a.key = b.key
>
> 2. Find new records and insert them into temp table
>
> insert into temp_table select a.cols from Table A a left anti join Table B
> b on a.key = b.key
>
> 3. Find existing records in Table B which dont exist in   Table A
>
> insert into temp_table select b.cols from Table B b left anti join Table A
> a a.key = b. key
>
> In that way I built Table B updated with records from Table A.
> However, the problem here is the step 3, because I am inserting almost 3
> TB of data that takes obviously some time.
> I was trying different approaches but no luck.
>
> I am wondering whats your ideas how can we perform this scenario
> efficiently in Spark?
>
> Cheers
>
> Tom
> --
> Tomasz Krol
> patric...@gmail.com
>

Reply via email to