Hi Everyone.
Please find the design of the refactored segment interfaces in the attached
document. You can also check the same V3 version attached to the JIRA [
https://issues.apache.org/jira/browse/CARBONDATA-2827]

It is based on some recent discussions and the earlier discussion from 2018
[
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Refactor-Segment-Management-Interface-td58926.html
]

*Note:*
1) As the pre-aggregate feature is no longer present and MV and SI support
incremental loading, the earlier problem of committing all child table
statuses at once may no longer apply, so the interfaces for that have been
removed.
2) All of this will be developed in a new module called *carbondata-acid*,
and the other modules that need it will depend on it.
3) Once this is implemented, we can discuss the design of time travel on
top of it [transaction manager implementation and writing multiple table
status files with versioning].
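
A minimal sketch of the versioning idea, assuming a versioned tablestatus
naming scheme (the file names and methods below are hypothetical, not part
of the attached design): each commit writes a new tablestatus file tagged
with its version, and the transaction manager resolves which version to
read for a given timestamp.

    // Hypothetical sketch only: versioned tablestatus files for time travel.
    public final class TableStatusVersions {

      // e.g. tablestatus_1, tablestatus_2, ... one file per committed version
      public static String versionedFileName(long version) {
        return "tablestatus_" + version;
      }

      // Pick the newest committed version not later than the requested
      // read timestamp (versions map to their commit timestamps).
      public static long resolveVersion(java.util.Map<Long, Long> commitTimeByVersion,
          long asOfTimestamp) {
        long resolved = -1L;
        for (java.util.Map.Entry<Long, Long> e : commitTimeByVersion.entrySet()) {
          if (e.getValue() <= asOfTimestamp && e.getKey() > resolved) {
            resolved = e.getKey();
          }
        }
        return resolved; // -1 means no version existed at that time
      }
    }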

Please go through it and give your inputs.

Thanks,
Ajantha

On Mon, Oct 19, 2020 at 9:43 AM David CaiQiang <david.c...@gmail.com> wrote:

> Before starting to refactor the segment interface, I have listed the
> segment-related features as follows.
>
> [table related]
> 1. get lock for table
>    lock for tablestatus
>    lock for updatedTablestatus
> 2. get lastModifiedTime of table
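>
> A minimal Java sketch of what this table-level contract could look like
> (interface and method names are illustrative assumptions, not a settled
> API); returning AutoCloseable lets callers release the lock with
> try-with-resources:
>
>     // Illustrative sketch only, not the proposed interface.
>     public interface TableLockProvider {
>       // lock guarding the tablestatus file
>       AutoCloseable lockTableStatus(String tableId) throws java.io.IOException;
>       // lock guarding the updatedTablestatus file
>       AutoCloseable lockUpdatedTableStatus(String tableId) throws java.io.IOException;
>       // lastModifiedTime of the table
>       long lastModifiedTime(String tableId) throws java.io.IOException;
>     }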
>
> [segment related]
> 1. segment datasource
>    datasource: file format, other datasource
>    fileformat: carbon, parquet, orc, csv, ...
>    catalog type: segment, external segment
> 2. data load ETL (load/insert/add_external_segment/insert_stage)
>    write segment for batch loading
>    add external segment by using an external folder path for a
> mixed-file-format table
>    append streaming segment for Spark structured streaming
>    insert_stage for the Flink writer
> 3. data query
>    segment properties and schema
>    segment level index cache and pruning
>    cache/refresh block/blocklet index cache if needed by segment
>    read segments to a dataframe/rdd
> 4. segment management
>    new segment id for loading/insert/add_external_segment/insert_stage
>    create global segment identifier
>    show[history]/delete segment
> 5. stats
>    collect dataSize and indexSize of the segment
>    lastModifiedTime, start/end time, update start/end time
>    fileFormat
>    status
> 6. segment level lock for supporting concurrent operations
> 7. get tablestatus storage factory (see the sketch after this list)
>    storage solution 1): use file system by default
>    storage solution 2): use hive metastore or db
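>
> For item 7, a minimal sketch of a pluggable tablestatus storage (all
> names are illustrative assumptions; a hive metastore/db backend would
> implement the same interface):
>
>     // Illustrative sketch only: tablestatus storage behind one interface.
>     public final class TableStatusStorage {
>
>       public interface Store {
>         String readRaw() throws java.io.IOException;   // raw tablestatus content
>         void writeRaw(String content) throws java.io.IOException;
>       }
>
>       // storage solution 1): use file system by default
>       public static Store fileSystemStore(java.nio.file.Path tableStatusPath) {
>         return new Store() {
>           public String readRaw() throws java.io.IOException {
>             return new String(java.nio.file.Files.readAllBytes(tableStatusPath),
>                 java.nio.charset.StandardCharsets.UTF_8);
>           }
>           public void writeRaw(String content) throws java.io.IOException {
>             java.nio.file.Files.write(tableStatusPath,
>                 content.getBytes(java.nio.charset.StandardCharsets.UTF_8));
>           }
>         };
>       }
>     }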
>
> [table status related]:
> 1. record new LoadMetadataDetails
>  loading/insert/compaction start/end
>  add external segment start/end
>  insert stage
>
> 2. update LoadMetadataDetails
>   compaction
>   update/delete
>   drop partition
>   delete segment
>
> 3. read LoadMetadataDetails
>   list all/valid/invalid segment
>
> 4. backup and history
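>
> A simplified sketch of these tablestatus transitions (the real
> LoadMetadataDetails carries more fields; the names below are my
> assumptions):
>
>     // Illustrative sketch only: lifecycle of one tablestatus entry.
>     public class LoadEntry {
>       String segmentId;
>       String status;      // e.g. IN_PROGRESS, SUCCESS, MARKED_FOR_DELETE
>       long startTime;
>       long endTime;
>
>       // 1. record: load/insert/compaction start writes an in-progress entry
>       static LoadEntry start(String segmentId, long now) {
>         LoadEntry e = new LoadEntry();
>         e.segmentId = segmentId;
>         e.status = "IN_PROGRESS";
>         e.startTime = now;
>         return e;
>       }
>
>       // 2. update: the same entry is finalized on operation end, or
>       //    re-marked by compaction/update/delete/drop partition
>       void markSuccess(long now) {
>         this.status = "SUCCESS";
>         this.endTime = now;
>       }
>     }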
>
> [segment file related]
> 1. write new segment file
>   generate segment file name
>      it is better to use a new timestamp to generate the segment file
> name for each write, to avoid overwriting a segment file with the same
> name (see the sketch after this list)
>    write segment file
>    merge temp segment file
> 2. read segment file
>    readIndexFiles
>    readIndexMergeFiles
>    getPartitionSpec
> 3. update segment file
>    update
>    merge index
>    drop partition
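>
> A one-method sketch of the timestamp-based naming suggested in item 1
> (the exact format is an assumption):
>
>     // Illustrative sketch only: a fresh timestamp per write means a retried
>     // operation never overwrites an earlier file of the same segment.
>     public final class SegmentFileNames {
>       public static String newSegmentFileName(String segmentId) {
>         return segmentId + "_" + System.currentTimeMillis() + ".segment";
>       }
>     }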
>
> [clean files related]
> 1. clean stale files for a successful segment operation
>    data deletion should be delayed for a period of time (maybe the query
> timeout interval) rather than deleting files immediately (except for drop
> table/partition and force clean files); see the sketch below
>    include data file, index file, segment file, tablestatus file
>    impact operation: mergeIndex
> 2. clean stale files for failed segment operation immediately
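>
> A minimal sketch of that deletion rule (the retention window and method
> names are assumptions):
>
>     // Illustrative sketch only: stale files from successful operations are
>     // kept for a retention window; failed-operation leftovers go at once.
>     public final class CleanFilesPolicy {
>       private final long retentionMillis; // e.g. the query timeout interval
>
>       public CleanFilesPolicy(long retentionMillis) {
>         this.retentionMillis = retentionMillis;
>       }
>
>       public boolean canDelete(long fileLastModifiedMillis,
>           boolean fromFailedOperation, long nowMillis) {
>         if (fromFailedOperation) {
>           return true;
>         }
>         return nowMillis - fileLastModifiedMillis > retentionMillis;
>       }
>     }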
>
>
> -----
> Best Regards
> David Cai
