Re: [Discussion] Update feature enhancement

2020-11-04 Thread David CaiQiang
PR #3999 has already implemented this enhancement; please take note.

PR URL: https://github.com/apache/carbondata/pull/3999



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: [Discussion] Update feature enhancement

2020-09-14 Thread Ravindra Pesala
+1
Partition table loading already uses a new segment to write the update delta
data.

It is better to make this consistent across all cases; creating a new segment
simplifies the design.



Thanks & Regards,
Ravi


Re: [Discussion] Update feature enhancement

2020-09-13 Thread Venkata Gollamudi
Hi David,
+1

Initially, when the segment concept was introduced, a segment was viewed as a
folder added incrementally over time, so that data-retention use cases like
"delete segments before a given date" could be supported. In that case, if
updated records were written into a new segment, old records would become
new records and the retention model would not work on that data. So updated
records were written to the same segment folder.

But later, with the introduction of partitions, partitioning became a cleaner
way to implement retention, and deleting by a time column is an even better
method.
So inserting the updated records into a new segment makes sense.

The only disadvantage may be in later supporting the single-column
update/replace feature that Likun mentioned previously.

So, to generalize: the update feature can insert the updated records into a
new segment. The logic to reload indexes when segments are updated can remain,
but when no data is inserted into old segments, reloading their indexes
should be avoided.

An increasing number of segments need not block this from going ahead; a
growing segment count is a problem anyway and needs to be solved by
compaction, either horizontal or vertical. Likewise, optimizing segment-file
storage, whether file-based or DB-based (embedded or external), for very
large deployments needs to be solved independently.

Regards,
Ramana



Re: [Discussion] Update feature enhancement

2020-09-04 Thread Ajantha Bhat
Hi David. Thanks for proposing this.

*+1 from my side.*

I have seen users with tables of 200K segments stored in the cloud.
It will be really slow to reload the indexes (SI, min-max, MV) of every
segment where an update happened.

So it is good to write the updated rows as a new segment
and load only that segment's indexes (try to reuse the existing
UpdateTableModel.loadAsNewSegment = true flow).

Then the user can compact the segments to avoid the many new segments created
by updates, and I guess we could also move the compacted segments' entries
into the table status history to avoid growing the table status file.

Thanks,
Ajantha
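Ajantha's table-status suggestion could look roughly like this (a hypothetical
Python sketch; the entry format and helper names are illustrative, not
CarbonData's actual metadata layout):

```python
# Hypothetical sketch: after compaction, entries for the merged-away
# segments move into a "history" list so the active table status file
# stays small. The dict layout here is illustrative only.

def compact(status, history, segment_ids):
    """Merge the given segments into one and archive their old entries."""
    merged = {"id": "%s.1" % segment_ids[0], "status": "Success"}
    for entry in status:
        if entry["id"] in segment_ids:
            entry["status"] = "Compacted"
            history.append(entry)   # archived; no longer scanned by queries
    # drop the compacted entries from the active status, add the merged one
    status[:] = [e for e in status if e["status"] != "Compacted"]
    status.append(merged)

status = [{"id": str(i), "status": "Success"} for i in range(4)]
history = []
compact(status, history, ["0", "1", "2", "3"])
# status now holds only the merged segment; history holds the 4 old entries
```

The point of the split is that readers only ever scan the small active list,
while the history file keeps the audit trail of compacted segments.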





Re: [Discussion] Update feature enhancement

2020-09-04 Thread David CaiQiang
Hi Akash,

3. An update operation contains an insert operation, so the update will
handle this issue the same way the insert flow does.



-
Best Regards
David Cai


Re: [Discussion] Update feature enhancement

2020-09-04 Thread akashrn5
Hi David,

1. Yeah, as I already said, it will come into the picture in the delete case,
since an update is (delete + insert).
2. Yes, we will load the single merged index file into the cache, which can
be a little better than the existing behavior.
3. I didn't really get a complete answer: when exactly do you plan to compact
those segments, and how will you handle the growing number of entries in the
table status file?

Thanks





Re: [Discussion] Update feature enhancement

2020-09-04 Thread David CaiQiang
Hi Akash,

1. The update operation still produces "delete delta" files, the same as
before, so horizontal compaction is still needed.

2. Loading one merged carbonindex file is fast and does not impact query
performance (a customer has faced this issue).

3. As with insert/loading, update can trigger compaction to avoid small
segments.
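A rough sketch of the delete-delta idea in point 1 (hypothetical Python; the
structures are illustrative, not CarbonData's on-disk format): the update
marks the old rows deleted via a per-segment delta, and the new row versions
go into a fresh segment.

```python
# Hypothetical sketch of update = delete delta + new segment.
segments = {0: ["a", "b", "c"]}
delete_delta = {0: set()}   # segment id -> offsets of deleted rows

def update(row_locations, new_rows):
    # row_locations: (segment_id, offset) pairs of the rows being replaced
    for seg_id, offset in row_locations:
        delete_delta.setdefault(seg_id, set()).add(offset)
    new_id = max(segments) + 1
    segments[new_id] = new_rows      # one new segment, one index load
    delete_delta[new_id] = set()

def scan():
    # a read skips any row marked in the delete delta
    return [row for seg_id, rows in segments.items()
            for i, row in enumerate(rows)
            if i not in delete_delta.get(seg_id, set())]

update([(0, 1)], ["B"])   # replace row "b" with "B"
# scan() -> ["a", "c", "B"]
```

Because the old segment's data files are never rewritten, its indexes stay
valid, which is exactly why only the new segment's index needs loading.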



-
Best Regards
David Cai


Re: [Discussion] Update feature enhancement

2020-09-04 Thread akashrn5
Hi David,

Please check the points below.

One advantage we get here is that when we insert as a new segment, it takes
the new insert flow without the converter step, which is faster.

But here are some concerns:

1. When you write a new segment for each update, horizontal compaction of
update deltas no longer makes sense, as it won't happen with this idea. With
this solution, horizontal compaction makes sense only in the delete case.
2. You said we avoid reloading the indexes. We avoid reloading the indexes of
the complete original segments on which the update happened, but we still
need to load the index of the newly added segment that holds the updated
rows, right?
3. When you keep adding new segments, we end up with a large number of
segments; if we don't compact them, that is one problem, and the ever-growing
number of entries in (and size of) the metadata (table status) is another.

So how are you going to handle these cases?
Correct me if I'm wrong in my understanding.

Regards,
Akash





[Discussion] Update feature enhancement

2020-09-02 Thread David CaiQiang
[Background]
Currently, the update feature inserts the updated rows into the old segments
where the data being updated resides.
At the end, it needs to reload the indexes of the affected segments.

[Motivation]
If many segments are updated, it takes a long time to reload their indexes
again.
So I suggest writing the updated rows into a new segment instead.
That will not touch the indexes of the old segments and avoids reloading
them.
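The difference between the two strategies can be shown with a toy model
(hypothetical Python, not CarbonData code; the classes and counters are
illustrative only):

```python
# Toy model contrasting the two update strategies: in-place update reloads
# an index per touched segment, while writing the updated rows to a new
# segment loads only that one segment's index.

class Table:
    def __init__(self):
        self.segments = []      # each segment is a list of rows
        self.index_loads = 0    # how many index (re)loads were performed

    def load(self, rows):
        """Insert rows as a fresh segment and build its index once."""
        self.segments.append(list(rows))
        self.index_loads += 1

    def update_in_place(self, touched, updated_rows):
        # current behavior: rewrite each touched segment, then reload
        # that segment's index
        for seg_id in touched:
            self.segments[seg_id].extend(updated_rows)
            self.index_loads += 1

    def update_as_new_segment(self, touched, updated_rows):
        # proposed behavior: old segments stay untouched (delete deltas
        # not modeled); updated rows land in one new segment
        self.load(updated_rows)

old, new = Table(), Table()
for t in (old, new):
    for _ in range(100):
        t.load(["row"])
    t.index_loads = 0           # count only the work done by the update

old.update_in_place(range(100), ["row2"])
new.update_as_new_segment(range(100), ["row2"])
print(old.index_loads, new.index_loads)  # 100 vs 1
```

With 100 touched segments the in-place path pays 100 index reloads, the
new-segment path pays exactly one, which is the core of the proposal.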



-
Best Regards
David Cai