Hey Aakash,

That will work for records which don't exist yet in the target table. What
about records which have to be updated?

As I mentioned, I want to do an upsert. That means I want to insert the
records that don't exist yet and update those which already exist.
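
In SQL terms, what I am after is the semantics of a MERGE statement,
roughly like the sketch below. This is only to illustrate the intent,
as plain parquet tables don't support MERGE (the "update set * / insert *"
shorthand below is Delta Lake style syntax):

merge into table_b b
using table_a a
on b.key = a.key
when matched then update set *     -- overwrite existing records with Table A's values
when not matched then insert *     -- add the brand new records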

Thanks

Tom

On Wed 29 May 2019 at 18:39, Aakash Basu <aakash.spark....@gmail.com> wrote:

> Why don't you simply copy the whole delta data set (Table A) into a stage
> table (the temp table in your case) and insert with a *WHERE NOT EXISTS* check
> on the primary key/composite key against Table B?
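>
> Something like this (a rough sketch; table and key names assumed):
>
> insert into table_b
> select a.*
> from temp_table a
> where not exists (select 1 from table_b b where b.key = a.key)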
>
> That's faster and does the reconciliation job smoothly enough.
>
> Others, any better input?
>
> On Wed 29 May, 2019, 10:50 PM Tomasz Krol, <patric...@gmail.com> wrote:
>
>> Hey Guys,
>>
>> I am wondering what your approach would be to the following scenario:
>>
>> I have two tables: one (Table A) is relatively small (e.g. 50 GB) and the
>> second one (Table B) much bigger (e.g. 3 TB). Both are parquet tables.
>>
>> I want to ADD all records from Table A to Table B which don't exist in
>> Table B yet. I use a single field (e.g. key) to check whether a specific
>> record exists.
>>
>> Then I want to UPDATE (with the values from Table A) all records in Table
>> B which also exist in Table A, using the same "key" field to determine
>> whether a specific record exists.
>>
>> To achieve the above I run the following SQL queries:
>>
>> 1. Find the records from Table A which already exist in Table B (their
>> updated values) and insert them into the temp table
>>
>> insert into temp_table
>> select a.* from table_a a
>> left semi join table_b b on a.key = b.key
>>
>> 2. Find the new records (in Table A but not yet in Table B) and insert
>> them into the temp table
>>
>> insert into temp_table
>> select a.* from table_a a
>> left anti join table_b b on a.key = b.key
>>
>> 3. Find the records in Table B which don't exist in Table A (the
>> untouched ones) and insert them into the temp table
>>
>> insert into temp_table
>> select b.* from table_b b
>> left anti join table_a a on a.key = b.key
>>
>> In that way I build an updated Table B with the records from Table A.
>> However, the problem here is step 3, because it copies almost 3 TB of
>> data, which obviously takes some time.
>> I have tried different approaches, but no luck.
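>>
>> Even a single-pass version with a full outer join (a sketch; the column
>> list is abbreviated) would still rewrite the whole 3 TB:
>>
>> insert overwrite table temp_table
>> select coalesce(a.key, b.key) as key,
>>        coalesce(a.col1, b.col1) as col1  -- Table A's value wins; note it
>>                                          -- falls back to B if A's column is null
>> from table_a a
>> full outer join table_b b on a.key = b.key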
>>
>> I am wondering what your ideas are: how can we perform this scenario
>> efficiently in Spark?
>>
>> Cheers
>>
>> Tom
>> --
>> Tomasz Krol
>> patric...@gmail.com
>>
--
Tomasz Krol
patric...@gmail.com
