Why don't you simply copy the whole of the delta data (Table A) into a staging table (the temp table in your case) and insert based on a *WHERE NOT EXISTS* check on the primary/composite key that already exists in Table B?
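A rough sketch of that staging approach (table and column names here are illustrative, not from the thread; Spark SQL has supported NOT EXISTS subqueries since 2.0, and a left anti join is an equivalent formulation):

```sql
-- Stage the delta (Table A) first, using Table B's schema.
CREATE TABLE stage_a AS SELECT * FROM table_a;

-- Insert only the rows whose key is not already present in Table B.
INSERT INTO table_b
SELECT s.*
FROM stage_a s
WHERE NOT EXISTS (
  SELECT 1 FROM table_b b WHERE b.key = s.key
);
```

Note this only covers the insert half; since plain parquet tables can't be updated in place, the UPDATE half would still need a rewrite of the affected data.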
That's faster and does the reconciliation job smoothly enough. Anyone else have better input?

On Wed 29 May, 2019, 10:50 PM Tomasz Krol, <patric...@gmail.com> wrote:

> Hey Guys,
>
> I am wondering what your approach would be to the following scenario:
>
> I have two tables - one (Table A) is relatively small (e.g. 50 GB) and
> the second one (Table B) is much bigger (e.g. 3 TB). Both are parquet tables.
>
> I want to ADD all records from Table A to Table B which don't exist in
> Table B yet. I use only one field (e.g. key) to check whether a specific
> record exists.
>
> Then I want to UPDATE (with values from Table A) all records in Table B
> which also exist in Table A. To determine whether a specific record exists
> I use the same "key" field.
>
> To achieve the above I run the following SQL queries:
>
> 1. Find records that already exist in Table B and insert them into a temp table
>
> insert into temp_table select a.cols from Table A a left semi join Table B
> b on a.key = b.key
>
> 2. Find new records and insert them into the temp table
>
> insert into temp_table select a.cols from Table A a left anti join Table B
> b on a.key = b.key
>
> 3. Find existing records in Table B which don't exist in Table A
>
> insert into temp_table select b.cols from Table B b left anti join Table A
> a on a.key = b.key
>
> In this way I build Table B updated with records from Table A.
> However, the problem here is step 3, because I am inserting almost 3
> TB of data, which obviously takes some time.
> I have tried different approaches but no luck.
>
> I am wondering what your ideas are for performing this scenario
> efficiently in Spark?
>
> Cheers
>
> Tom
> --
> Tomasz Krol
> patric...@gmail.com
>
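One more thought: on a table format that supports upserts, such as Delta Lake, the three queries above collapse into a single MERGE, which avoids rewriting the untouched 3 TB in step 3. A sketch, assuming Table B has been converted to Delta and using the same illustrative names:

```sql
MERGE INTO table_b AS b
USING table_a AS a
ON b.key = a.key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```

The trade-off is converting the parquet tables to Delta first, but after that only the files containing matched keys get rewritten rather than the whole table.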