Re: Clean files enhancement

2021-03-01 Thread Vikram Ahuja
Hi all,
Please find attached the document that lists all the changes made as part of
the Clean Files enhancement. The changes were made in two phases.

Regards
Vikram


Re: Clean files enhancement

2020-10-13 Thread Vikram Ahuja
Hi all,
Please find attached the design document.
Please provide suggestions or feedback.

Vikram Ahuja



Re: Clean files enhancement

2020-09-28 Thread Vikram Ahuja
Thanks for the suggestion Ravi.

We can add an option to the clean files command that decides whether to do a
dry run:
clean files on table t1 options('dry_run' = true) --> This will only show
the segments that would be removed; it will not clean or delete those
segments or any other data.

By default, dry_run will be false, and users can enable it when they want
to use it.
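
As a rough illustration of the dry-run idea (not the actual CarbonData implementation), the sketch below models segments in memory. The `Segment` class, the `CLEANABLE` set and the `clean_files` function are hypothetical names introduced only for this example; the set of cleanable statuses is an assumption.

```python
from dataclasses import dataclass

# Assumption: only these statuses are cleaned in this sketch.
CLEANABLE = {"MARKED_FOR_DELETE", "COMPACTED"}


@dataclass
class Segment:
    segment_id: str
    status: str


def clean_files(segments, dry_run=False):
    """Return (to_remove, remaining) for a list of segments.

    With dry_run=True the list is only inspected; nothing is modified.
    With dry_run=False the cleanable segments are dropped from the list.
    """
    to_remove = [s for s in segments if s.status in CLEANABLE]
    remaining = [s for s in segments if s.status not in CLEANABLE]
    if not dry_run:
        segments[:] = remaining  # the "actual run" drops cleanable segments
    return to_remove, remaining


if __name__ == "__main__":
    table = [Segment("0", "SUCCESS"), Segment("1", "COMPACTED")]
    would_remove, would_keep = clean_files(table, dry_run=True)
    print("would remove:", [s.segment_id for s in would_remove])
    print("would keep:", [s.segment_id for s in would_keep])
```

Returning the same two lists from both modes lets the user compare the dry-run report with the result of the actual run.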

Rgds,
Vikram



Re: Clean files enhancement

2020-09-28 Thread Akash r
+1 for Ravi's comment. It's better, cleaner and safer.

Regards,
Akash R Nilugal



Re: Clean files enhancement

2020-09-28 Thread Kunal Kapoor
+1 for Ravi's comment.
It is better to show what would be deleted or moved to trash.


Regards,
Kunal Kapoor



Re: Clean files enhancement

2020-09-24 Thread Ravindra Pesala
Hi Vikram,

+1

It is good to remove the automatic cleanup.
But I am still worried about the clean files command when it is executed by
the user as well. We need to enhance the clean files command with a dry run
that prints which segments are going to be deleted and which will be left. If
the user is OK with the dry-run result, they can go for the actual run.

Regards,
Ravindra.


-- 
Thanks & Regards,
Ravi


Re: Clean files enhancement

2020-09-21 Thread Vikram Ahuja
Hi Ravi and David,

1. All automatic data cleaning in load/insert/compact/delete will be
removed, so cleaning will only happen when the clean files command is
called.

2. We will only move data to the trash when we clean data that is in the
IN PROGRESS state. Compacted/Marked For Delete segments will not be moved
to the trash; they will be deleted directly. The user will only be able
to recover In Progress segments, if they want to. @Ravi -> Is this okay
for trash usage? It would be used only for in-progress segments.

3. No trash management will be implemented; data will ONLY BE REMOVED
from the trash folder when the clean files command is called, and it is
removed immediately at that point. There will be no time to live; the data
stays in the trash folder until the user triggers the clean files command.
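
The three rules above can be sketched roughly as follows. This is only an illustration on a local directory: the `.trash` folder name, the `clean_files` function and the status strings are assumptions made for the example, and emptying the trash before moving new segments into it is one possible ordering, not something the thread specifies.

```python
import shutil
import tempfile
from pathlib import Path

TRASH_DIR_NAME = ".trash"  # hypothetical name for the trash folder


def clean_files(table_dir, statuses):
    """Apply the three rules to a table directory.

    statuses maps a segment folder name to its status string.
    Returns a dict of segment name -> action taken.
    """
    table_dir = Path(table_dir)
    trash = table_dir / TRASH_DIR_NAME
    # Rule 3: no time to live; the trash is emptied only by this command.
    if trash.exists():
        shutil.rmtree(trash)
    actions = {}
    for name, status in statuses.items():
        seg = table_dir / name
        if not seg.exists():
            continue
        if status in ("COMPACTED", "MARKED_FOR_DELETE"):
            shutil.rmtree(seg)  # Rule 2: deleted directly, not recoverable
            actions[name] = "deleted"
        elif status == "INSERT_IN_PROGRESS":
            trash.mkdir(exist_ok=True)
            shutil.move(str(seg), str(trash / name))  # kept until the next run
            actions[name] = "trashed"
    return actions


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    for name in ("Segment_0", "Segment_1", "Segment_2"):
        (root / name).mkdir()
    print(clean_files(root, {
        "Segment_0": "SUCCESS",
        "Segment_1": "COMPACTED",
        "Segment_2": "INSERT_IN_PROGRESS",
    }))
```

With this ordering, an in-progress segment moved to the trash stays recoverable until the next clean files call.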

Let me know if you have any questions.

Vikram Ahuja



Re: Clean files enhancement

2020-09-18 Thread David CaiQiang
Agree with Ravindra.

1. Stop all automatic data cleaning in load/insert/compact/update/delete...

2. When the clean files command cleans in-progress or uncertain data, we can
move it to the data trash.
This can prevent useful data from being deleted by mistake; we have already
hit this issue in some scenarios.
Other cases (for example, cleaning mark_for_delete/compacted segments) should
not use the data trash folder; clean the data directly.

3. No need for data trash management; I suggest keeping it simple.
It will be enough for the clean files command to support emptying the trash
immediately.



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Clean files enhancement

2020-09-17 Thread Ravindra Pesala
-1

I don’t see any reason why we should use trash. How does it change the
behaviour?
1. Are you still going with automatic clean up?
If yes, then you are adding extra time to move the data to trash (for the S3
file system).
2. Even if you move the data and keep the time to live as 3 days in trash,
what if the user realises that the data is wrong or lost after that time period?

Regards,
Ravindra



-- 
Thanks & Regards,
Ravi


Re: Clean files enhancement

2020-09-17 Thread Vikram Ahuja
Hi all,
After all the suggestions, the trash folder mechanism in CarbonData will be
implemented in 2 phases.
Phase 1:
1. Create a generic trash folder at table level. Trash folders will be
hidden/invisible (like .trash or .recyclebin). The trash folder will be
stored in the table dir.
2. If we delete any file/folder from a table, it will be moved to the trash
folder of that table (the call for adding to trash will be
added in FileFactory's delete APIs).
3. A trash manager will be created, which will keep track of all the files
that have been deleted and moved to the trash, and will also record the
time of each deletion. All of the trash manager's APIs will be called from
the FileFactory class.
4. On the clean files command, the trash folders will be cleared if the expiry
time has been reached. Each file moved to the trash will have an expiration
time associated with it.
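
A minimal sketch of the phase 1 steps, assuming a local file system and a JSON manifest for the deletion times. The `TrashManager` class, its method names, the `.trash` folder name and the manifest file are all hypothetical; the thread does not define this API, and name collisions inside the trash are ignored here.

```python
import json
import shutil
import tempfile
import time
from pathlib import Path


class TrashManager:
    """Hypothetical per-table trash with a manifest of deletion times."""

    def __init__(self, table_dir, ttl_seconds):
        self.trash = Path(table_dir) / ".trash"  # hidden folder in the table dir
        self.manifest = self.trash / "manifest.json"
        self.ttl = ttl_seconds

    def _load(self):
        if self.manifest.exists():
            return json.loads(self.manifest.read_text())
        return {}

    def _save(self, entries):
        self.trash.mkdir(parents=True, exist_ok=True)
        self.manifest.write_text(json.dumps(entries))

    def move_to_trash(self, path, now=None):
        """Used in place of a physical delete; records when the move happened."""
        path = Path(path)
        entries = self._load()
        self.trash.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(self.trash / path.name))
        entries[path.name] = time.time() if now is None else now
        self._save(entries)

    def clean_expired(self, now=None):
        """Called by clean files; removes only entries older than the TTL."""
        now = time.time() if now is None else now
        entries = self._load()
        removed = []
        for name, deleted_at in list(entries.items()):
            if now - deleted_at >= self.ttl:
                target = self.trash / name
                if target.is_dir():
                    shutil.rmtree(target)
                elif target.exists():
                    target.unlink()
                del entries[name]
                removed.append(name)
        self._save(entries)
        return removed


if __name__ == "__main__":
    table = Path(tempfile.mkdtemp())
    (table / "part-0.carbondata").write_text("data")
    manager = TrashManager(table, ttl_seconds=3 * 24 * 3600)
    manager.move_to_trash(table / "part-0.carbondata")
    print(manager.clean_expired())  # nothing has expired yet
```

Passing `now` explicitly keeps the expiry logic easy to exercise without waiting for real time to pass.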

Phase 2: More enhancements are planned for phase 2; they will be implemented
after phase 1 is completed. The plan for phase 2 development and
changes will be posted in this mail thread.


Thanks
Vikram Ahuja




Re: Clean files enhancement

2020-09-15 Thread PickUpOldDriver
Hi Vikram,

I agree to build a trash folder, +1.

Currently, the data loading/compaction/update/merge flows have automatic
file-cleaning actions, but they are written separately. Most of them are
aimed at deleting stale segments (MARKED_FOR_DELETE/COMPACTED), and they
rely on the accuracy of the table status. If you could build a general clean
files function, it could replace the current automatic
deletion of stale folders.

Besides, having a trash folder handled by CarbonData will be good; we can
find the deleted segments through this API.

And I think we should also consider the INSERT_IN_PROGRESS and
INSERT_OVERWRITE_IN_PROGRESS statuses.






Re: Clean files enhancement

2020-09-15 Thread Ravindra Pesala
+1 to Vishal's proposal.
It is not safe to clean up automatically without ensuring data
integrity. Let’s enhance the clean command to do a sanity check before
removing data. Deleting data should be an administrative task, not
an automatic feature of the framework. The user can call it when they need
to delete the data.

Regards,
Ravindra.

--
Thanks & Regards,
Ravi


Re: Clean files enhancement

2020-09-15 Thread akashrn5
Hi David,

1. We cannot remove the clean-up code from all commands, because if
we do not clean the stale files after a failure, there can be issues of
wrong data or extra data.

What I think is that we are calling APIs that do, say, X amount of
work, when we may need only Y amount of clean-up (X > Y). So
what we can do is refactor properly, so that each command deletes or cleans
only the files or folders specific to that command and does not call the
general or common clean-up APIs, which create problems for us.

2. Yes, I agree that there is no need to clean up in-progress segments in
commands.

Regards,
AKash R Nilugal





Re: Clean files enhancement

2020-09-15 Thread Kumar Vishal
Hi Vikram,
Whether the data is moved to Trash or kept inside the FACT/Part0/ folder does
not really matter; after the configurable time it will be deleted anyway.
Moving to Trash will add extra IO and time during data loading.
Everything will work fine if tablestatus gives the correct status. Do not
delete the data physically in automatic clean files; just clean the table
status, with a proper backup.

For physical deletion, let the user call the clean command. It will first
run a sanity check, such as getting the count before deletion, then move the
segment to be deleted to some other folder [TRASH] and run the count again.
If both counts match, delete the data. Otherwise, in case of any mismatch,
move the data back from TRASH. We need to enhance the current clean
command to work this way.
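
The count-move-count check can be sketched as below. This is only a toy model on a local directory: `count_rows` stands in for a real count query, and the function names, the `.trash` folder and the "visible segments" list are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path


def count_rows(table_dir, visible_segments):
    """Stand-in for a count query: one row per line in each visible segment."""
    total = 0
    for name in visible_segments:
        seg = Path(table_dir) / name
        if seg.is_dir():
            for f in seg.iterdir():
                if f.is_file():
                    total += len(f.read_text().splitlines())
    return total


def safe_clean_segment(table_dir, segment, visible_segments):
    """Move a segment to trash, re-check the count, then delete or restore.

    Returns True if the segment was deleted, False if it was restored.
    """
    table_dir = Path(table_dir)
    trash = table_dir / ".trash"
    before = count_rows(table_dir, visible_segments)
    trash.mkdir(exist_ok=True)
    shutil.move(str(table_dir / segment), str(trash / segment))
    after = count_rows(table_dir, visible_segments)
    if before == after:
        shutil.rmtree(trash / segment)  # counts match: safe to delete
        return True
    shutil.move(str(trash / segment), str(table_dir / segment))  # mismatch
    return False


if __name__ == "__main__":
    table = Path(tempfile.mkdtemp())
    for name, rows in (("Segment_0", "a\nb\n"), ("Segment_1", "c\n")):
        (table / name).mkdir()
        (table / name / "data.txt").write_text(rows)
    # Segment_1 is stale (not visible), so removing it leaves the count unchanged.
    print(safe_clean_segment(table, "Segment_1", ["Segment_0"]))
```

The move into the trash is what makes the check reversible: the data is only deleted physically after the second count confirms nothing visible was lost.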

-Regards
Kumar Vishal





Re: Clean files enhancement

2020-09-15 Thread David CaiQiang
1. Cleaning in_progressing segments is very dangerous; please remove this
part from the code. Only after the user explicitly runs the clean files
command with the option "clean_in_progressing"="true" should we check the
segment lock and clean the segment.

2. If the status of a segment is mark_for_delete/compacted, we can delete
the segment directly without backup.

3. Remove the code that cleans stale data and partial data from the
loading/compaction/update/delete features and so on. It is better to use a
UUID as the segment folder name and let cleaning stale data be an optional
operation. If we don't clean stale data, the table can still work fine.

5. The trash folder can be under the table path. Each table has a separate
trash folder. If we clean uncertain data, we can use the trash folder to store
the data, with a separate folder for each transaction.



-
Best Regards
David Cai


Re: Clean files enhancement

2020-09-15 Thread haomarch
+1 for this feature.

1. Providing better reliability, especially data integrity, is our top
priority. I believe the trash helps a lot when problems happen.
2. It's tough to recover data on S3 in a big data environment (too many files
and too much data); recovery costs a lot of time and confidence. We expect
to recover the data ourselves rather than leave it to the user. A trash will
be helpful.





Re: Clean files enhancement

2020-09-15 Thread Ajantha Bhat
Hi Vikram, thanks for proposing this.

a) If the file system is HDFS, *HDFS already supports trash.*
When data is deleted in HDFS, it is moved to trash instead of being
permanently deleted (the trash interval can also be configured with
*fs.trash.interval*).
b) If the file system is object storage like s3a or OBS, *it supports
bucket versioning*. The user should configure it to be able to go back to the
previous snapshot.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/undelete-objects.html

*So, basically, this functionality has to be provided by the underlying file
system, not by the CarbonData layer.*
Keeping a trash folder with many configurations and checking the age
of the trash folder can work,
but it makes the system complex and adds the overhead of maintaining
this functionality.

Based on this,
*-1 from my side for this feature*. You can wait for other people's
opinions on this before concluding.

Thanks,
Ajantha





Clean files enhancement

2020-09-10 Thread vikramahuja1001
Hi all,
This mail is regarding enhancing the clean files command.
Current behaviour: Currently, when clean files is called, the segments which
are MARKED_FOR_DELETE or COMPACTED are deleted, and their entries are
removed from the tablestatus file, the Fact folder and the metadata/segments
folder.

Proposed enhancement: The idea is to create a
trash folder (like a Recycle Bin, with 777 permissions) which can be stored in
the /tmp folder (or a user-defined folder; a new property will be exposed).
Whenever a segment is cleaned, the necessary carbondata files (no other files)
can be copied to this folder. The RecycleBin folder can have a folder for
each table, with a name like DBName_TableName. We can keep the carbondata files
here for 3 days (or as long as the user wants; a carbon property will be
exposed for this). They can be deleted if they have not been modified for 3
days, or for the period set by the property. We can maintain a thread which
checks the age of the files and deletes the necessary carbondata files from
the trash folder.
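
A rough sketch of the per-table trash layout and the age check, assuming a local file system and file modification times as the age signal. The function names and the `RETENTION_SECONDS` constant are hypothetical; the background thread mentioned above would simply call `purge_aged_files` periodically and is left out here.

```python
import os
import tempfile
import time
from pathlib import Path

RETENTION_SECONDS = 3 * 24 * 60 * 60  # the 3-day default from the proposal


def trash_dir_for(trash_root, db_name, table_name):
    """Per-table folder inside the trash root, named DBName_TableName."""
    return Path(trash_root) / f"{db_name}_{table_name}"


def purge_aged_files(trash_root, retention_seconds=RETENTION_SECONDS, now=None):
    """Delete files under trash_root not modified within the retention period."""
    now = time.time() if now is None else now
    purged = []
    for path in sorted(Path(trash_root).rglob("*")):
        if path.is_file() and now - path.stat().st_mtime >= retention_seconds:
            path.unlink()
            purged.append(path.name)
    return purged


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    table_trash = trash_dir_for(root, "default", "t1")
    table_trash.mkdir()
    aged = table_trash / "part-0.carbondata"
    aged.write_text("data")
    four_days_ago = time.time() - 4 * 24 * 60 * 60
    os.utime(aged, (four_days_ago, four_days_ago))  # pretend it is 4 days old
    print(purge_aged_files(root))
```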

Apart from that, INSERT_IN_PROGRESS segments will be cleaned
too, but the code will try to get a segment lock before cleaning an
INSERT_IN_PROGRESS segment. If the code is able to acquire the segment
lock, i.e., it is a stale folder, it can be cleaned. If the code is not able
to acquire the segment lock, that means a load or some other
operation is in progress; in that case the INSERT_IN_PROGRESS segment will
not be cleaned.
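
The lock check can be illustrated with a non-blocking acquire. In this sketch an in-process `threading.Lock` per segment stands in for the real segment lock; `SegmentLocks` and `try_clean_in_progress` are hypothetical names, and `delete_segment` is whatever callback performs the clean-up.

```python
import threading


class SegmentLocks:
    """Hypothetical in-process stand-in for per-segment locks."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def lock_for(self, segment_id):
        # Create the lock on first use; always return the same lock per segment.
        with self._guard:
            return self._locks.setdefault(segment_id, threading.Lock())


def try_clean_in_progress(segment_id, locks, delete_segment):
    """Clean an INSERT_IN_PROGRESS segment only if its lock can be taken."""
    lock = locks.lock_for(segment_id)
    if not lock.acquire(blocking=False):
        return False  # a load or another operation still holds the lock
    try:
        delete_segment(segment_id)  # lock acquired: the segment is stale
        return True
    finally:
        lock.release()


if __name__ == "__main__":
    locks = SegmentLocks()
    cleaned = []
    locks.lock_for("3").acquire()  # simulate a load that is still running
    print(try_clean_in_progress("3", locks, cleaned.append))
    print(try_clean_in_progress("4", locks, cleaned.append))
```

The non-blocking acquire matters here: clean files should skip a busy segment rather than wait for the load to finish.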

Please provide input and suggestions for this enhancement idea.

Thanks
Vikram Ahuja


