Re: [Discussion] Update feature enhancement

2020-09-13 Thread Venkata Gollamudi
Hi David,
+1

Initially, when the segments concept was introduced, a segment was viewed
as a folder that is added incrementally over time, so data retention
use-cases like "delete segments before a given date" were thought of. In
that model, if updated records were written into a new segment, old
records would become new records and the retention model would not work on
that data. So updated records were written to the same segment folder.

But later the partition concept was introduced, which is a cleaner way to
implement retention; deleting by a time column is an even better method.
So inserting the updated records into a new segment makes sense.
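
For illustration, the three retention styles read roughly as follows in
Spark SQL (a sketch only; "sales", "dt", and "event_time" are hypothetical
names, and spark is an existing SparkSession with CarbonData configured):

    // 1. Segment-based retention: drop whole segments loaded before a date.
    spark.sql(
      "DELETE FROM TABLE sales WHERE SEGMENT.STARTTIME BEFORE '2020-01-01 00:00:00'")

    // 2. Partition-based retention: drop an entire time partition.
    spark.sql("ALTER TABLE sales DROP PARTITION (dt='2019-12')")

    // 3. Row-level retention by a time column, independent of segment layout.
    spark.sql("DELETE FROM sales WHERE event_time < '2020-01-01 00:00:00'")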

The only disadvantage could be in later supporting the single-column data
update/replace feature that Likun mentioned previously.

So, to generalize, the update feature can insert the updated records into
a new segment. The logic to reload indexes when segments are updated can
remain; however, when no data is inserted into old segments, reloading
their indexes should be avoided, as sketched below.
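
A rough sketch of that guard (the names here are hypothetical helpers, not
the actual CarbonData API):

    // Reload indexes (SI, min-max, MV) only for segments whose data files
    // actually changed; skip old segments that received no new data.
    def reloadIndexesIfNeeded(allSegmentIds: Seq[String],
                              segmentsWithNewData: Set[String],
                              reloadIndexes: String => Unit): Unit = {
      allSegmentIds.foreach { segmentId =>
        if (segmentsWithNewData.contains(segmentId)) {
          reloadIndexes(segmentId)
        }
        // else: no data was inserted into this old segment, so its
        // existing indexes are still valid and the reload is skipped.
      }
    }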

An increasing number of segments need not block this from going ahead;
segment growth is a problem in any case and needs to be solved using
compaction, either horizontal or vertical. Likewise, optimizing the
segment file storage, whether file-based or DB-based (embedded or
external), for very large deployments needs to be solved independently.
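
For reference, segment growth is already addressed by the existing
compaction and clean-files commands; roughly (again via a SparkSession
named spark, with a hypothetical table name):

    // Merge many small segments into fewer, larger ones.
    spark.sql("ALTER TABLE sales COMPACT 'MINOR'")
    spark.sql("ALTER TABLE sales COMPACT 'MAJOR'")

    // Physically remove the compacted (now stale) segment folders.
    spark.sql("CLEAN FILES FOR TABLE sales")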

Regards,
Ramana

On Sat, Sep 5, 2020 at 7:58 AM Ajantha Bhat  wrote:

> Hi David. Thanks for proposing this.
>
> *+1 from my side.*
>
> I have seen users with a table of 200K segments stored in the cloud.
> It will be really slow to reload indexes like SI, min-max, and MV for all
> the segments where an update happened.
>
> So, it is good to write the updated records as a new segment and load
> only the new segment's indexes (try to reuse the existing flow:
> UpdateTableModel.loadAsNewSegment = true).
>
> The user can then compact the segments to avoid the many new segments
> created by updates, and I guess we can also move the compacted segments
> to the table status history to avoid more entries in table status.
>
> Thanks,
> Ajantha
>
>
>
> On Fri, Sep 4, 2020 at 1:48 PM David CaiQiang 
> wrote:
>
> > Hi Akash,
> >
> > 3. An update operation contains an insert operation. The update
> > operation will handle this issue the same way the insert operation
> > does.
> >
> >
> >
> > -
> > Best Regards
> > David Cai
> >
>


Re: [Discussion] Segment management enhance

2020-09-13 Thread Venkata Gollamudi
Hi David,

In the current design of the data load operation, isolation and
consistency are achieved with the following steps (a minimal sketch of the
lock/read/update/release pattern follows the list):
1. The DataLoad operation checks whether it can execute concurrently with
any other operation, such as update/delete (most operations can be
allowed, even multiple parallel data loads).
2. Once the operation is allowed, a lock is acquired on the tablestatus
file to create a segment entry, which the operation then continues to
load.
3. During the load operation, a timestamp (long) is used as the
transaction id for the complete operation. This transaction id uniquely
identifies the operation.
4. When the data load is complete, the operation is committed to the table
status by taking the lock, reading and updating the file, and releasing
the lock.
5. When the data load fails, the operation is not committed; the data can
be cleaned up in the failure flow or, in case of abrupt process failures,
cleaned up later.
6. Temporary data tagged with the transaction id is never read and should
never be discovered by any reader (e.g., it should never be found via file
listing and used). All valid data should be read only through committed
transaction references.
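
To make steps 2-4 concrete, here is a minimal, self-contained sketch of
that pattern (the paths, the JSON-line status format, and the lock-file
approach are simplified stand-ins, not the actual CarbonData
implementation):

    import java.nio.channels.FileChannel
    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths, StandardOpenOption}

    // Step 3: a timestamp serves as the transaction id for the operation.
    val transactionId: Long = System.currentTimeMillis()

    // Step 4: commit the finished load under an exclusive lock.
    // Assumes the tablestatus file already exists.
    val statusPath = Paths.get("/store/db/t1/Metadata/tablestatus")
    val lockChannel = FileChannel.open(
      Paths.get("/store/db/t1/Metadata/tablestatus.lock"),
      StandardOpenOption.CREATE, StandardOpenOption.WRITE)
    val lock = lockChannel.lock()              // take the lock
    try {
      // read & update: append an entry marking this transaction committed
      val current =
        new String(Files.readAllBytes(statusPath), StandardCharsets.UTF_8)
      val entry = s"""{"transactionId": $transactionId, "status": "Success"}"""
      Files.write(statusPath,
        (current + entry + "\n").getBytes(StandardCharsets.UTF_8))
    } finally {
      lock.release()                           // release the lock
      lockChannel.close()
    }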

This method of isolating and committing an operation applies to all
operations: data load, insert, update, and delete.
The segment ID does not have much significance in the above flow; a
sequence number is currently used just for convenience.
A file lock to ensure the atomic commit is a must and cannot be avoided,
even if we support complete optimistic concurrency control.

So replacing the segment ID with a UUID will not solve the concurrency,
data correctness, cleanup, or stale-data-reading issues, and it cannot
replace locking either. There might be some other problem into which you
need to dive deeper. There might be code that does not follow the steps
mentioned above (e.g., discovering files via file listing without
filtering by the required transaction id), and that might be causing the
issues you have mentioned.
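
As an illustration of that last point, compare the two discovery styles
(a sketch; the file-naming scheme and the committed-id set are simplified
assumptions, and a real reader would resolve them from the tablestatus
file):

    import java.nio.file.{Files, Paths}
    import scala.collection.JavaConverters._

    // Problematic: discover data by listing files; this also picks up
    // temporary files written by in-flight or failed transactions.
    def discoverByListing(segmentDir: String): Seq[String] =
      Files.list(Paths.get(segmentDir)).iterator().asScala
        .map(_.toString).toSeq

    // Correct: keep only files whose transaction id has been committed.
    // Assumes a hypothetical naming scheme: part-<n>_<transactionId>.carbondata
    def discoverCommitted(segmentDir: String,
                          committedTransactionIds: Set[Long]): Seq[String] =
      discoverByListing(segmentDir).filter { path =>
        val idPart = path.split('_').last.takeWhile(_.isDigit)
        idPart.nonEmpty && committedTransactionIds.contains(idPart.toLong)
      }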

Regards,
Ramana

On Sat, Sep 5, 2020 at 8:31 AM Ajantha Bhat  wrote:

> Hi David,
>
> a) Recently we tested huge concurrent loads and compactions but never
> faced the issue of two loads using the same segment id (because of the
> table status lock in recordNewLoadMetadata), so I am not sure whether we
> really need to move to UUIDs.
>
> b) As for the other segment interfaces, we have to refactor them; that
> has been pending for a long time. The refactoring should be done such
> that we can support TIME TRAVEL. I have to analyze this further. If
> somebody has already done some analysis, they can use this thread to
> discuss refactoring the segment interface.
>
> Thanks,
> Ajantha
>
> On Fri, Sep 4, 2020 at 1:11 PM Kunal Kapoor 
> wrote:
>
> > Hi David,
> > Then we had better keep a mapping from the segment UUID to a virtual
> > segment number in the table status file as well. Any API through which
> > the user can get the segment details should return the virtual segment
> > id instead of the UUID.
> >
> > On Fri, Sep 4, 2020 at 12:59 PM David CaiQiang 
> > wrote:
> >
> > > Hi Kunal,
> > >
> > >    1. The user uses the SQL API or other interfaces. This UUID is a
> > > transaction id, and we already store the timestamp and other
> > > information in the segment metadata.
> > >    This transaction id can be used in the loading/compaction/update
> > > operations. We can append this id to the log if needed.
> > >    A git commit id is likewise a long unique id (a hash), so we can
> > > consider something similar. What information do you want to get from
> > > the folder name?
> > >
> > >    2. It is easy to fix the show segments command's issue. Maybe we
> > > can sort segments by timestamp and UUID to generate an index id. The
> > > user can continue to use that id in other commands.
> > >
> > >
> > >
> > > -
> > > Best Regards
> > > David Cai
> > >
> >
>