Re: Upsert for hive tables

2019-06-04 Thread tkrol
Hi Magnus, Yes, I was thinking also about partitioning approach. And I think this is the best solution in this type of scenario. Also my scenario is relevant to your last paragraph, the dates which are coming are very random. I can get updated from 2012 and from 2019. Therefore, this strategy

Re: Upsert for hive tables

2019-05-30 Thread Magnus Nilsson
Since parquet don't support updates you have to backfill your dataset. If that is your regular scenario you should partition your parquet files so backfilling becomes easier. As the data is structured now you have to update everything just to upsert quite a small amount of changed data. Look at

Re: Upsert for hive tables

2019-05-30 Thread Tomasz Krol
Unfortunately, dont have timestamps in those tables:( Only key on which I can check existence of specific record. But even with the timestamp how would you make the update.? When I say update I mean to overwrite existing record. For example you have following in table A key| field1 | field2 1

Re: Upsert for hive tables

2019-05-29 Thread Aakash Basu
Don't you have a date/timestamp to handle updates? So, you're talking about CDC? If you've Datestamp you can check if that/those key(s) exists, if exists then check if timestamp matches, if that matches, then ignore, if that doesn't then update. On Thu 30 May, 2019, 7:11 AM Genieliu, wrote: >

Re: Upsert for hive tables

2019-05-29 Thread Genieliu
Isn't step1 and step2 producing the copy of Table A? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Upsert for hive tables

2019-05-29 Thread Aakash Basu
Why don't you simply copy whole of delta data (Table A) into a stage table (temp table in your case) and insert depending on a *WHERE NOT EXISTS* check on primary key/composite key which already exists in the table B? That's faster and does the reconciliation job smoothly enough. Others, any

Re: Upsert for hive tables

2019-05-29 Thread Tomasz Krol
Hey Aakash, That will work for records which dont exist yet in the target table. What about records which have to be updated ? As I mentioned, I want to do an upsert. That means, I want to add not existing records and update those which already exist. Thanks Tom On Wed 29 May 2019 at 18:39,

Upsert for hive tables

2019-05-29 Thread Tomasz Krol
Hey Guys, I am wondering what would be your approach to following scenario: I have two tables - one (Table A) is relatively small (e.g 50GB) and second one (Table B) much bigger (e.g. 3TB). Both are parquet tables. I want to ADD all records from Table A to Table B which dont exist in Table B