Re: [Discussion] Optimize the Update Performance

Akash r Thu, 14 May 2020 12:33:29 -0700

Hi march,

Thanks for suggesting an improvement to update.
I have gone through the paper for some highlights, and here are a few points
based on my understanding; we can work on them and discuss more.

1. Since we are talking about updating the existing file instead of writing a
new carbon data file, which is the current logic, how are we going to handle
concurrent queries?
Are we going to report "update in progress" for the affected splits, or what
exactly is the thought behind this scenario?
Can you please explain?

2. When you say tail page, you said it will be appended to the base column
page. So I have the points below:
    a. In the tail page we will store the rowId of the base page.
    b. We may need to update the base column page, perhaps in its metadata,
       to record the presence of the tail page and of the updated/latest
       record in the tail page.
    c. Writing the new value in the tail page is basically writing a new
       column page as a tail page with the new data, row id, etc., and
       maintaining either the base page info in the tail page or the tail
       page info in the base page, as said in the previous point.
    d. Doing this will lead to rewriting almost all metadata at page,
       blocklet and block level. For example, if local dictionary is
       enabled, we have to generate local dictionary entries for the new
       values, and since we store the local dictionary at blocklet level, it
       has to be updated in the blocklet. The same goes for the min/max
       metadata, etc.

Do you already have points to explain these, or any doc?

3. If the operation is just a delete, do we need to write a tail page, or can
we just mark the row id of the base page as invisible?
If we write the tail page, we should store some default value, like the one we
store for null values, to indicate deleted data.

4. I think for us it will be cumulative update by default, right? Because we
just write the tail page for the columns to be updated, with the base rowIDs.
For the next consecutive update, we need to add a new record to the tail page
with the updated info for the base page.

5. What about the scenario where writing this tail page adds more data to the
existing block, so that it crosses the carbon block size and we have to create
a new block? Basically all sizes will increase.
I haven't got a clear picture of this; can you please explain in more detail?
I find a bit of a dilemma here with respect to carbondata.
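
To make points 2 to 4 easier to discuss, here is a rough Python sketch of how I
picture the read path for one column page. It is only my assumption of the
design, not CarbonData or L-Store code: `TailRecord`, `DELETED` and
`read_column` are made-up names, and I assume tail records are appended in
update order so that the last record for a rowId wins.

```python
from dataclasses import dataclass
from typing import Any, List

# Hypothetical marker for a deleted row (point 3); not a CarbonData constant.
DELETED = object()


@dataclass
class TailRecord:
    row_id: int  # rowId of the record in the base page (point 2a)
    value: Any   # new column value, or DELETED for a delete


def read_column(base_page: List[Any], tail_records: List[TailRecord]) -> List[Any]:
    """Merge one base column page with its tail records.

    Tail records are assumed to be in update order, so a later record for
    the same row_id overrides an earlier one. Deleted rows are dropped.
    The base page itself is never modified.
    """
    latest = {rec.row_id: rec.value for rec in tail_records}
    merged = []
    for row_id, base_value in enumerate(base_page):
        value = latest.get(row_id, base_value)
        if value is not DELETED:
            merged.append(value)
    return merged


base = [10, 20, 30, 40]
tail = [TailRecord(1, 21), TailRecord(3, DELETED), TailRecord(1, 22)]
result = read_column(base, tail)
print(result)
```

In this sketch row 1 is updated twice and row 3 is deleted, while the base page
stays untouched. What it does not answer is where the tail records live and
how the page, blocklet and block metadata is kept in sync, which is the real
question in points 2 and 5.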

I have one more suggestion: how about, instead of appending the tail page to
the existing column page, we write it as a separate page outside the block
file?

I think many doubts will be cleared if we have a low-level design for it and
we do a POC. And is it really going to increase the update speed, or are we
targeting just the scan speed?

Correct me if I'm wrong in my understanding of any of the above points.

Thanks

Regards,
Akash

On Wed, May 13, 2020 at 7:31 AM haomarch <marchp...@126.com> wrote:

> There is an interesting paper, "L-Store: A Real-time OLTP and OLAP System",
> which uses a creative way to improve update performance.
>
> The idea is:
> *1. Store the updated column value in the tail page*.
> When any column of a record is updated, a new tail page is created and
> appended to the page dictionary.
> Only the updated column value is stored in the tail page. Compared with
> the current implementation in carbondata, in which we write the whole row
> even if only a few columns are updated, L-Store's way effectively avoids
> write amplification.
> In the tail page, the rowid and updatedcolumnid are also stored together
> with the updated columnvalue;
> based on the updatedcolumnid, the row data can be obtained by reading the
> base page and the tail pages during query processing.
> *2. Incremental update in the tail page.*
> Assume that we update 2 columns, 1 column per update. There are two ways to
> store the updated columns in the tail page:
>
>  2.1: Non-incremental Update:
>         basepage  <updatecolumn1, v1> <updatecolumn2, v2> <updatecolumn3, v3>
>         tailpage1 <updatecolumn1, v1'>
>         tailpage2 <updatecolumn2, v2'>
>
>  2.2: Incremental Update:
>         basepage  <updatecolumn1, v1> <updatecolumn2, v2> <updatecolumn3, v3>
>         tailpage1 <updatecolumn1, v1'>
>         tailpage2 <updatecolumn1, v1'> <updatecolumn2, v2'>
>
> Non-incremental Update stores only the column value changed by this
> update,
> which has lower write amplification but worse query performance.
> Incremental Update stores the column value of this update together with
> the updated column values of previous updates, which has higher write
> amplification but better query performance.
>
> We should study the work of L-Store and optimize the update performance; it
> will improve carbondata's competitiveness.
>
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
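
The two layouts in the quoted mail (2.1 and 2.2) can be compared with a small
Python sketch. This is only an illustration under my own assumptions: a tail
record is modelled as a dict from column name to value for a single row, and
`write_tail` and `read_row` are made-up helper names, not L-Store or
CarbonData APIs.

```python
base_row = {"updatecolumn1": "v1", "updatecolumn2": "v2", "updatecolumn3": "v3"}
# Two updates, one column per update, as in the quoted example.
updates = [{"updatecolumn1": "v1'"}, {"updatecolumn2": "v2'"}]


def write_tail(updates, incremental):
    """Build the list of tail records for one row."""
    tail = []
    for upd in updates:
        record = dict(upd)
        if incremental and tail:
            # Carry forward every value from the previous tail record.
            record = {**tail[-1], **upd}
        tail.append(record)
    return tail


def read_row(base_row, tail, incremental):
    """Return the latest row and the number of tail records that were read."""
    row = dict(base_row)
    # Incremental: the newest record already holds all updated columns.
    # Non-incremental: every tail record has to be replayed in order.
    records = tail[-1:] if incremental else tail
    for rec in records:
        row.update(rec)
    return row, len(records)


non_inc = write_tail(updates, incremental=False)
inc = write_tail(updates, incremental=True)

non_inc_row, non_inc_reads = read_row(base_row, non_inc, incremental=False)
inc_row, inc_reads = read_row(base_row, inc, incremental=True)

non_inc_written = sum(len(rec) for rec in non_inc)
inc_written = sum(len(rec) for rec in inc)
print(non_inc_written, non_inc_reads, inc_written, inc_reads)
```

Both layouts give the same final row. In this toy case the non-incremental
layout writes fewer values but reads more tail records, and the incremental
layout does the opposite, which is the trade-off the quoted mail describes.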
