Re: Clean files enhancement

2020-09-15 Thread PickUpOldDriver
Hi Vikram,

I agree with building a trash folder, +1.

Currently, the data loading/compaction/update/merge flows have automatic
file-cleaning actions, but they are written separately. Most of them are
aimed at deleting stale segments (MARKED_FOR_DELETE/COMPACTED), and they
rely on the table status being accurate. If you could build a general clean
file function, it could be applied to substitute the current automatic
deletion of stale folders.

Besides, having a trash folder handled by CarbonData will be good; we can
find the deleted segments through this API.

And I think we should also consider the status of INSERT_IN_PROGRESS &
INSERT_OVERWRITE_IN_PROGRESS segments.




--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: CarbonData File Deletion Hotfix

2020-09-15 Thread PickUpOldDriver
Hello March, 

I agree with taking a hotfix for data deletion in the loading and compaction
flows, +1.

Deleting INSERT_IN_PROGRESS and INSERT_OVERWRITE_IN_PROGRESS segments is a
dangerous activity, so these two kinds of segments should not be
automatically deleted.

As for MARKED_FOR_DELETE and COMPACTED status segments, these are stale
segments, but we can keep them in the file system until the user/admin calls
the clean file action manually, since the deletion requires the table status
to be accurate.

So my opinion is to first remove all the automatic cleaning steps from the
loading/compaction flows to protect the data from being deleted
accidentally.



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Clean files enhancement

2020-09-15 Thread Ravindra Pesala
+1 for Vishal's proposal.
It is not safe to clean the data automatically without ensuring data
integrity. Let's enhance the clean command to do a sanity check before
removing anything. Deleting data should be administrative work, not an
automatic framework feature. The user can call it when he needs to delete
the data.

Regards,
Ravindra.

On Tue, 15 Sep 2020 at 10:50 PM, akashrn5  wrote:

> Hi David,
>
> 1. We cannot remove the clean-up code from all commands, because if we do
> not clean the stale files after a failure, there can be issues of wrong
> data or extra data.
>
> What I think is, we are calling APIs which do, say, X amount of work, but
> we may need only Y amount of clean-up (X > Y). So what we can do is
> refactor in a proper way, to delete or clean only the required files or
> folders specific to that command and not call the general or common
> clean-up APIs, which create problems for us.
>
> 2. Yes, I agree that there is no need to clean up in-progress segments in
> these commands.
>
> Regards,
> Akash R Nilugal
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

--
Thanks & Regards,
Ravi


Re: Clean files enhancement

2020-09-15 Thread akashrn5
Hi David,

1. We cannot remove the clean-up code from all commands, because if we do
not clean the stale files after a failure, there can be issues of wrong data
or extra data.

What I think is, we are calling APIs which do, say, X amount of work, but we
may need only Y amount of clean-up (X > Y). So what we can do is refactor in
a proper way, to delete or clean only the required files or folders specific
to that command and not call the general or common clean-up APIs, which
create problems for us.

2. Yes, I agree that there is no need to clean up in-progress segments in
these commands.

Regards,
Akash R Nilugal



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Clean files enhancement

2020-09-15 Thread Kumar Vishal
Hi Vikram,
Moving to trash versus keeping it inside the FACT/Part0/ folder does not
really matter; either way, after a configurable time it will be deleted.
Moving to trash will add extra IO and time during data loading.
Everything will work fine if tablestatus is giving the correct status. Do
not delete the data physically in automatic clean files; just clean the
table status with a proper backup.

For physical deletion, let the user call the clean command, which will first
run a sanity check, such as getting the count before deletion, then move the
segments to be deleted to some other folder [TRASH] and run the count again.
If both counts match, delete the data; otherwise, move the data back from
TRASH. We need to enhance the current clean command along these lines.
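
A minimal sketch of that verify-then-delete flow, assuming a hypothetical
trash layout and a countTableRows supplier (none of these names come from
the actual CarbonData code base):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.function.LongSupplier;

// Sketch of the clean-command enhancement described above: count, move the
// segment into TRASH, count again, and only then delete physically.
public final class SafeSegmentCleaner {

    public static void cleanSegment(Path segmentDir, Path trashDir,
                                    LongSupplier countTableRows) throws IOException {
        long before = countTableRows.getAsLong();      // sanity count before deletion

        Files.createDirectories(trashDir);
        Path inTrash = trashDir.resolve(segmentDir.getFileName());
        Files.move(segmentDir, inTrash);               // move the segment to TRASH first

        long after = countTableRows.getAsLong();       // run the count again
        if (before == after) {
            deleteRecursively(inTrash);                // counts match: safe to delete
        } else {
            Files.move(inTrash, segmentDir);           // mismatch: restore from TRASH
        }
    }

    private static void deleteRecursively(Path dir) throws IOException {
        try (var files = Files.walk(dir)) {
            files.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }
}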

-Regards
Kumar Vishal



On Tue, Sep 15, 2020 at 8:50 PM David CaiQiang  wrote:

> 1. Cleaning the in-progress segment is very dangerous; please remove this
> part from the code. Only after the user explicitly uses the clean files
> command with the option "clean_in_progressing"="true" should we check the
> segment lock and clean the segment.
>
> 2. If the status of a segment is mark_for_delete/compacted, we can delete
> the segment directly without backup.
>
> 3. Remove the code which cleans stale data and partial data from the
> loading/compaction/update/delete features and so on. It is better to use a
> UUID as the segment folder name, and to let cleaning stale data be an
> optional operation. If we don't clean stale data, the table can still work
> fine.
>
> 5. The trash folder can be under the table path, so that each table has a
> separate trash folder. If we clean uncertain data, we can use the trash
> folder to store it, with a separate subfolder for each transaction.
>
>
>
> -
> Best Regards
> David Cai
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>


Re: Clean files enhancement

2020-09-15 Thread David CaiQiang
1. Cleaning the in-progress segment is very dangerous; please remove this
part from the code. Only after the user explicitly uses the clean files
command with the option "clean_in_progressing"="true" should we check the
segment lock and clean the segment (a sketch follows this list).

2. If the status of a segment is mark_for_delete/compacted, we can delete
the segment directly without backup.

3. Remove the code which cleans stale data and partial data from the
loading/compaction/update/delete features and so on. It is better to use a
UUID as the segment folder name, and to let cleaning stale data be an
optional operation. If we don't clean stale data, the table can still work
fine.

5. The trash folder can be under the table path, so that each table has a
separate trash folder. If we clean uncertain data, we can use the trash
folder to store it, with a separate subfolder for each transaction.
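
A minimal sketch of the lock-based check from point 1. The option name comes
from this proposal; the helper methods are hypothetical stubs, not
CarbonData APIs:

import java.util.List;

// Sketch of the proposed opt-in cleaning: in-progress segments are touched
// only when the user passed "clean_in_progressing"="true", and only if the
// segment lock can be acquired (i.e. no load is still running).
public final class InProgressCleaner {

    public static void clean(List<String> inProgressSegments, boolean cleanInProgressing) {
        if (!cleanInProgressing) {
            return;                              // default: never touch in-progress segments
        }
        for (String segmentId : inProgressSegments) {
            if (tryLockSegment(segmentId)) {     // lock acquired => load died, segment is stale
                deleteSegment(segmentId);
            }                                    // lock held => a load is running, skip it
        }
    }

    private static boolean tryLockSegment(String segmentId) { return false; } // stub
    private static void deleteSegment(String segmentId) { }                   // stub
}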



-
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Clean files enhancement

2020-09-15 Thread haomarch
+1 for this feature.

1. Providing better reliability, especially data integrity, is our top
priority. I believe the trash helps a lot when problems happen.
2. It's tough to recover data on S3 in a big-data environment (too many
files and too much data); recovery is very costly in time and in confidence.
We would rather recover the data ourselves than leave it to the user. A
trash will be helpful.



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Clean files enhancement

2020-09-15 Thread Ajantha Bhat
Hi Vikram, thanks for proposing this.

a) If the file system is HDFS, *HDFS already supports trash*: when data is
deleted in HDFS, it is moved to the trash instead of being permanently
deleted (the retention can be configured via *fs.trash.interval*).
b) If the file system is an object store like S3A or OBS, *it supports
bucket versioning*. The user should configure it to be able to go back to a
previous snapshot.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/undelete-objects.html
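
For illustration, the Hadoop client already exposes this trash behaviour
programmatically. A minimal sketch (Trash.moveToAppropriateTrash is a real
Hadoop API; the path and interval here are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

// Sketch: deleting through the file-system trash instead of a
// CarbonData-level trash folder. With fs.trash.interval > 0, the file lands
// in the user's .Trash directory and HDFS expunges it after the interval.
public final class FsTrashExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("fs.trash.interval", 4320);         // keep deleted files 3 days (minutes)

        FileSystem fs = FileSystem.get(conf);
        Path segment = new Path("/warehouse/db/table/Fact/Part0/Segment_1"); // placeholder
        Trash.moveToAppropriateTrash(fs, segment, conf); // move to trash, not permanent delete
    }
}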

*So, basically, this functionality has to be at the underlying file system,
not at the CarbonData layer.* Keeping a trash folder with many
configurations and checking the aging of the trash folder can work, but it
makes the system complex and adds the overhead of maintaining this
functionality.

Based on this,
*-1 from my side for this feature*. You can wait for other people's opinions
before concluding.

Thanks,
Ajantha



On Thu, Sep 10, 2020 at 4:20 PM vikramahuja1001 
wrote:

> Hi all,
> This mail is regarding enhancing the clean files command.
> Current behaviour: currently, when clean files is called, the segments
> which are MARKED_FOR_DELETE or COMPACTED are deleted and their entries are
> removed from the tablestatus file, the Fact folder, and the
> metadata/segments folder.
>
> Enhancement idea: create a trash folder (like a recycle bin, with 777
> permissions) which can be stored in the /tmp folder (or a user-defined
> folder; a new property will be exposed). Whenever a segment is cleaned,
> the necessary carbondata files (no other files) are copied to this folder.
> The recycle-bin folder can have a folder for each table, named like
> DBName_TableName. We can keep the carbondata files there for 3 days (or as
> long as the user wants; a carbon property will be exposed for this). They
> can be deleted if they have not been modified for 3 days, or as per the
> property. We can maintain a thread which checks the aging time and deletes
> the expired carbondata files from the trash folder.
>
> Apart from that, INSERT_IN_PROGRESS segments will be cleaned too, but we
> will try to acquire a segment lock before cleaning them. If the code is
> able to acquire the segment lock, i.e., it is a stale folder, it can be
> cleaned. If the code is not able to acquire the segment lock, that means a
> load or some other operation is in progress, and in that case the
> INSERT_IN_PROGRESS segment will not be cleaned.
>
> Please provide input and suggestions for this enhancement idea.
>
> Thanks
> Vikram Ahuja
>
>
>
> --
> Sent from:
> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>
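
For reference, a minimal sketch of the aging thread from Vikram's quoted
proposal. The trash layout (one DBName_TableName folder per table), the
3-day retention, and the scheduling interval are all assumptions from the
proposal, not existing CarbonData code:

import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the proposed aging thread: periodically scan the trash folder
// and delete carbondata files not modified for the retention period.
public final class TrashAgingThread {
    private static final long RETENTION_MILLIS = TimeUnit.DAYS.toMillis(3);

    public static void start(File trashRoot) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            File[] tableDirs = trashRoot.listFiles();            // e.g. default_mytable/
            if (tableDirs == null) return;
            for (File tableDir : tableDirs) {
                File[] files = tableDir.listFiles();
                if (files == null) continue;
                for (File f : files) {
                    long age = System.currentTimeMillis() - f.lastModified();
                    if (age > RETENTION_MILLIS) {
                        f.delete();                              // expired: remove from trash
                    }
                }
            }
        }, 1, 1, TimeUnit.HOURS);
    }
}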


CarbonData File Deletion Hotfix

2020-09-15 Thread haomarch
Background

Currently, in data management scenarios (data loading, segment compaction,
etc.) there are several data deletion actions. These actions are dangerous
because they are written in different places, and some corner cases can
cause data to be deleted accidentally.

Current Data Deletion in the Data Loading Process

First, an introduction to the current data loading process:

1. Delete Stale Segments

This method deletes the segments which are not consistent with the table
status.

In the loading flow, this method scans all the segments, adds the original
segments (like Segment_1, whose name part[1] does not contain ".") to a
staleSegments list, and then deletes the segments in that list.
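
A hedged illustration of that filter, simplified from the description above
(the real flow also checks the segments against the tablestatus file; this
helper is not the actual CarbonData code):

import java.util.ArrayList;
import java.util.List;

// Sketch of stale-segment detection: a folder name like "Segment_1" is
// split on "_"; merged/compacted segments such as "Segment_1.1" contain "."
// in part[1] and are skipped.
static List<String> findStaleSegments(List<String> segmentFolderNames) {
    List<String> staleSegments = new ArrayList<>();
    for (String name : segmentFolderNames) {
        String[] parts = name.split("_");
        if (parts.length == 2 && !parts[1].contains(".")) {
            staleSegments.add(name);   // original segment: candidate for deletion
        }
    }
    return staleSegments;
}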



2. Delete Invalid Segments

There are 3 steps in Delete Invalid Segments:

(1) Delete Expired Locks

This method deletes the expired locks (older than 48 hours).
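
A sketch of that expiry check, assuming locks are plain files whose
modification time marks the last activity (a hypothetical helper, not the
actual lock implementation):

import java.io.File;
import java.util.concurrent.TimeUnit;

// Sketch of step (1): a lock file counts as expired when it has not been
// modified for more than 48 hours, the threshold mentioned above.
static boolean isLockExpired(File lockFile) {
    long age = System.currentTimeMillis() - lockFile.lastModified();
    return age > TimeUnit.HOURS.toMillis(48);
}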

(2) Check whether the data needs to be deleted, and move segments to the
proper place

In the current design, it scans and removes segments with 4 statuses
(MARKED_FOR_DELETE, COMPACTED, INSERT_IN_PROGRESS,
INSERT_OVERWRITE_IN_PROGRESS). When this deletion method is reached from the
loading flow, it scans the segments; if a segment meets the requirements for
deletion and invisibleSegmentCnt > invisibleSegmentPreserveCnt, the segment
is added to the history file and then deleted.
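
A sketch of that preserve-count rule; SegmentStatus and the ordering
assumption (oldest first) are simplified stand-ins for the real metadata
classes, mirroring the invisibleSegmentPreserveCnt mentioned above:

import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Sketch of step (2): segments in one of the four statuses above are
// "invisible" candidates; only once their count exceeds the preserve count
// are the oldest ones moved to the history file for later deletion.
enum SegmentStatus {
    SUCCESS, MARKED_FOR_DELETE, COMPACTED, INSERT_IN_PROGRESS, INSERT_OVERWRITE_IN_PROGRESS
}

class InvalidSegmentCollector {
    static final Set<SegmentStatus> REMOVABLE = EnumSet.of(
        SegmentStatus.MARKED_FOR_DELETE, SegmentStatus.COMPACTED,
        SegmentStatus.INSERT_IN_PROGRESS, SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS);

    // statuses: one entry per segment, ordered oldest first
    static List<Integer> indexesToMoveToHistory(List<SegmentStatus> statuses, int preserveCnt) {
        List<Integer> invisible = new ArrayList<>();
        for (int i = 0; i < statuses.size(); i++) {
            if (REMOVABLE.contains(statuses.get(i))) {
                invisible.add(i);
            }
        }
        int excess = invisible.size() - preserveCnt;   // invisibleSegmentCnt > preserveCnt?
        return excess > 0 ? invisible.subList(0, excess) : List.of();
    }
}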

(3) Delete Invalid Data

In the final step, it deletes the data files whose entries were moved to the
history file.

3. Delete Temporary Files

By default, during the loading process, CarbonData writes to temp files
first and copies them to the target path at the end of the loading. This
method deletes those temp files.




Data Deletion Hotfix in the Loading Process

By analysing the deletion actions during the loading process, we are going
to make some modifications to the loading-flow deletion to keep data from
being deleted by accident.

There are a few steps to fix the problem:

(1) Replace the stale-cleaning function with CleanFile actions.

(2) Ignore the segments whose status is INSERT_IN_PROGRESS or
INSERT_OVERWRITE_IN_PROGRESS, because the loading process might take a long
time in a highly concurrent situation. These two kinds of segments are left
to be deleted by the CleanFiles command. Besides, there will be a recycle
bin to store the deleted files temporarily, so users can find their deleted
segments in the recycle bin.
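
Reusing the simplified SegmentStatus enum from the sketch above, the
loading-flow filter of step (2) is essentially (simplified names, not the
actual patch):

// Sketch of hotfix step (2): the automatic cleanup in the loading flow
// skips in-progress segments entirely and leaves them to an explicit
// CleanFiles run.
static boolean safeToAutoDelete(SegmentStatus status) {
    return status != SegmentStatus.INSERT_IN_PROGRESS
        && status != SegmentStatus.INSERT_OVERWRITE_IN_PROGRESS;
}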


--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/