Re: [Discussion] Optimize the Update Performance

2020-05-14 Thread Akash r
Hi march, thanks for suggesting improvements on update. I have gone through the paper for some highlights, and here are a few points with my understanding; we can work on them and discuss more. 1. Since we are talking about updating the existing file instead of writing a new carbon data file, which is the current lo…

Re: [Discussion] Optimize the Update Performance

2020-05-13 Thread Ajantha Bhat
Hi!, Update is still using the converter step with bad record handling. In the update-by-dataframe scenario there is no need for bad record handling; only for the update-by-value case do we need to keep it. This can give a significant improvement, as we already observed in the insert flow. I tried once to send it to the new inse…
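To make the point concrete, here is a minimal sketch of what skipping the converter could look like, assuming the update flow can tell a dataframe source apart from an update-by-value source. All names below (UpdateSource, prepareRowsForWrite, convertWithBadRecordHandling) are illustrative, not actual CarbonData APIs:

    // Sketch only: bypass the bad-record converter for dataframe updates,
    // keep it for the update-by-value path. Names are hypothetical.
    object UpdateConverterSketch {
      type Row = Seq[Any]

      sealed trait UpdateSource
      case class DataFrameSource(rows: Iterator[Row]) extends UpdateSource
      case class ValueExpressionSource(rows: Iterator[Row]) extends UpdateSource

      // Placeholder for the real converter step with bad record handling.
      def convertWithBadRecordHandling(row: Row): Row = row

      def prepareRowsForWrite(source: UpdateSource): Iterator[Row] = source match {
        // Dataframe input is already typed and validated upstream,
        // so the converter / bad-record step can be skipped.
        case DataFrameSource(rows) => rows
        // Literal values from the UPDATE statement may still be malformed,
        // so bad record handling stays on this path.
        case ValueExpressionSource(rows) => rows.map(convertWithBadRecordHandling)
      }
    }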

Re: [Discussion] Optimize the Update Performance

2020-05-13 Thread haomarch
I have several ideas to optimize the update performance: 1. Reduce the storage size of tupleId: the tupleId is too long, leading to heavy shuffle IO overhead when joining the change table with the target table. 2. Avoid converting String to UTF8String in the row processing. Before writing rows into delta…
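For point 1, here is a rough sketch of the kind of compaction that is possible, assuming the tupleId components (segment, block, blocklet, page, rowId) fit into fixed-width fields; the widths below are made up for illustration and are not CarbonData's actual limits:

    // Illustrative only: pack the tupleId into one Long instead of
    // shuffling a long delimited String during the change/target join.
    object CompactTupleId {
      // Example layout: 16 bits segment | 16 bits block | 8 bits blocklet
      //                 | 8 bits page   | 16 bits rowId = 64 bits total.
      def encode(segment: Int, block: Int, blocklet: Int, page: Int, rowId: Int): Long =
        (segment.toLong << 48) | (block.toLong << 32) |
        (blocklet.toLong << 24) | (page.toLong << 16) | rowId.toLong

      def decode(id: Long): (Int, Int, Int, Int, Int) = (
        ((id >>> 48) & 0xFFFF).toInt,
        ((id >>> 32) & 0xFFFF).toInt,
        ((id >>> 24) & 0xFF).toInt,
        ((id >>> 16) & 0xFF).toInt,
        (id & 0xFFFF).toInt
      )
    }

An 8-byte key like this shuffles far less data than a long string tupleId; point 2 is similar in spirit, keeping values in their binary form end to end so that no per-row String to UTF8String conversion is needed.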

[Discussion] Optimize the Update Performance

2020-05-12 Thread haomarch
There is an interesting paper, "L-Store: A Real-time OLTP and OLAP System", which uses a creative way to improve update performance. The idea is: *1. Store the updated column value in the tail page*. When updating any column of a record, a new tail page is created and appended to the page dictionar…
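In case it helps the discussion, here is a small sketch of how I read the tail-page idea (conceptual only, not L-Store or CarbonData source code): the base page is never rewritten, an update appends a tail record with only the changed columns, and a read merges the base values with the newest tail entries:

    object LStoreTailPageSketch {
      case class BaseRecord(rid: Long, columns: Map[String, Any])
      case class TailRecord(rid: Long, updatedColumns: Map[String, Any])

      class Page {
        private val base = scala.collection.mutable.Map[Long, BaseRecord]()
        // Tail entries are append-only; the newest entry comes last.
        private val tail = scala.collection.mutable.ArrayBuffer[TailRecord]()

        def insert(record: BaseRecord): Unit = base(record.rid) = record

        // An update never rewrites the base page; it only appends a tail record
        // holding the changed columns.
        def update(rid: Long, changedColumns: Map[String, Any]): Unit =
          tail += TailRecord(rid, changedColumns)

        // A read merges the base record with its tail updates in append order,
        // so the latest value of each column wins.
        def read(rid: Long): Option[Map[String, Any]] =
          base.get(rid).map { b =>
            tail.filter(_.rid == rid)
                .foldLeft(b.columns)((cols, t) => cols ++ t.updatedColumns)
          }
      }
    }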