Hey Aakash,

That will work for records which don't exist yet in the target table. What about records which have to be updated?
As I mentioned, I want to do an upsert. That means I want to add records that don't exist yet and update those which already exist.

Thanks
Tom

On Wed, 29 May 2019 at 18:39, Aakash Basu <aakash.spark....@gmail.com> wrote:

> Why don't you simply copy the whole of the delta data (Table A) into a stage
> table (temp table in your case) and insert depending on a *WHERE NOT EXISTS*
> check on the primary key/composite key which already exists in Table B?
>
> That's faster and does the reconciliation job smoothly enough.
>
> Others, any better input?
>
> On Wed 29 May, 2019, 10:50 PM Tomasz Krol, <patric...@gmail.com> wrote:
>
>> Hey Guys,
>>
>> I am wondering what your approach to the following scenario would be:
>>
>> I have two tables - one (Table A) is relatively small (e.g. 50 GB) and
>> the second one (Table B) is much bigger (e.g. 3 TB). Both are Parquet
>> tables.
>>
>> I want to ADD all records from Table A to Table B which don't exist in
>> Table B yet. I use only one field (e.g. "key") to check the existence of
>> a specific record.
>>
>> Then I want to UPDATE (with values from Table A) all records in Table B
>> which also exist in Table A. To determine whether a specific record
>> exists, I use the same "key" field.
>>
>> To achieve the above, I run the following SQL queries:
>>
>> 1. Find existing records and insert them into a temp table:
>>
>> insert into temp_table select a.cols from table_a a left semi join
>> table_b b on a.key = b.key
>>
>> 2. Find new records and insert them into the temp table:
>>
>> insert into temp_table select a.cols from table_a a left anti join
>> table_b b on a.key = b.key
>>
>> 3. Find existing records in Table B which don't exist in Table A:
>>
>> insert into temp_table select b.cols from table_b b left anti join
>> table_a a on a.key = b.key
>>
>> In that way I build Table B updated with records from Table A.
>> However, the problem here is step 3, because I am inserting almost 3 TB
>> of data, which obviously takes some time.
>> I was trying different approaches, but no luck.
>>
>> I am wondering what your ideas are: how can we perform this scenario
>> efficiently in Spark?
>>
>> Cheers
>>
>> Tom
>> --
>> Tomasz Krol
>> patric...@gmail.com
>

--
Tomasz Krol
patric...@gmail.com
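[Editor's note] The three-step rebuild described in the quoted message (semi join for updates, anti join for inserts, reverse anti join to carry over untouched rows) can be sketched in plain Python, with dicts keyed by the join key standing in for the two Parquet tables. This is only an illustration of the join logic, not the author's Spark code; all names are made up for the example.

```python
def upsert_rebuild(table_a, table_b):
    """Rebuild Table B with upserts from Table A (delta).

    table_a, table_b: dicts mapping the join key to a row payload.
    Returns a new dict representing the rebuilt target table.
    """
    temp = {}
    # Step 1 (left semi join): rows of A whose key exists in B -> updated rows
    temp.update({k: v for k, v in table_a.items() if k in table_b})
    # Step 2 (left anti join): rows of A whose key is absent from B -> new rows
    temp.update({k: v for k, v in table_a.items() if k not in table_b})
    # Step 3 (reverse anti join): rows of B not touched by A -> carried over.
    # This is the expensive step in the thread: it rewrites almost all of B.
    temp.update({k: v for k, v in table_b.items() if k not in table_a})
    return temp

delta = {1: "a1", 2: "a2"}      # Table A: key 2 updates B, key 1 is new
target = {2: "b2", 3: "b3"}     # Table B: key 3 is untouched by the delta
print(sorted(upsert_rebuild(delta, target).items()))
# [(1, 'a1'), (2, 'a2'), (3, 'b3')]
```

Note that steps 1 and 2 together simply emit every row of A, which is why the cost of this pattern is dominated by step 3: the full copy of the unmatched portion of the large table.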