[Background]
1. In some scenarios, two loading/compaction jobs may write data to the same
segment. This mixes their data within one segment and can leave some
features unable to work correctly afterwards.
2. Loading/compaction/update/delete operations need to clean stale data
before execution. Cleaning stale data is a high-risk operation: if it hits
an exception, it may delete valid data. If the system doesn't clean stale
data, in some scenarios the stale data is added to a new merged index file
and becomes queryable.
3. Loading/compaction takes a long time, so in some scenarios the lock is
also held for a long time.
[Motivation & Goal]
We should avoid data confusion and the risk of cleaning stale data. Maybe we
can use a UUID as the segment id to avoid these problems. We might even be
able to do loading/compaction without the segment/compaction lock.
[Modification]
1. segment id
Use a UUID as the segment id instead of the unique numeric value.
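
A minimal sketch of the idea, in Python for brevity. The helper names
new_segment_id and segment_dir and the path /warehouse/t1 are made up for
illustration and are not CarbonData APIs. It only shows that each job can
pick its own id independently, so two jobs do not end up in the same folder.

```python
import uuid
from pathlib import PurePosixPath


def new_segment_id() -> str:
    """Return a random (version 4) UUID string to use as a segment id.

    Hypothetical helper: the proposal does not say which UUID version to use.
    """
    return str(uuid.uuid4())


def segment_dir(table_path: str, segment_id: str) -> PurePosixPath:
    """Segment data folder placed directly under the table folder."""
    return PurePosixPath(table_path) / segment_id


# Two jobs starting at the same time each generate their own id, without
# first reading a shared counter, so their target folders differ.
job_a = segment_dir("/warehouse/t1", new_segment_id())
job_b = segment_dir("/warehouse/t1", new_segment_id())
```

With a numeric id, both jobs would have to agree on "the next number", which
is where the shared state and the lock come from.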
2. segment layout
a) Move the segment data folder into the table folder.
b) Move the carbonindexmerge file into the Metadata/segments folder.
tableFolder
  UUID1
    |_xxx.carbondata
    |_xxx.carbonindex
  UUID2
  Metadata
    |_segments
      |_UUID1_timestamp1.segment (segment index summary)
      |_UUID1_timestamp1.carbonindexmerge (segment index detail)
    |_schema
    |_tablestatus
  LockFiles
partitionTableFolder
  partkey=value1
    |_xxx.carbondata
    |_xxx.carbonindex
  partkey=value2
  Metadata
    |_segments
      |_UUID1_timestamp1.segment (segment index summary)
      |_partkey=value1
        |_UUID1_timestamp1.carbonindexmerge (segment index detail)
      |_partkey=value2
    |_schema
    |_tablestatus
  LockFiles
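
The layout above could be captured by two small path helpers. This is only
a sketch: the function names are hypothetical, the timestamp format is not
specified in the proposal, and it assumes that for a partitioned table the
carbonindexmerge file sits in a per-partition folder under
Metadata/segments.

```python
from pathlib import PurePosixPath
from typing import Optional


def segment_summary_path(table_path: str, segment_id: str,
                         timestamp: str) -> PurePosixPath:
    """Path of the .segment file (segment index summary)."""
    return (PurePosixPath(table_path) / "Metadata" / "segments"
            / f"{segment_id}_{timestamp}.segment")


def index_merge_path(table_path: str, segment_id: str, timestamp: str,
                     partition: Optional[str] = None) -> PurePosixPath:
    """Path of the .carbonindexmerge file (segment index detail).

    For a partitioned table, pass the partition folder name, for example
    "partkey=value1" (assumed nesting, see the note above).
    """
    base = PurePosixPath(table_path) / "Metadata" / "segments"
    if partition is not None:
        base = base / partition
    return base / f"{segment_id}_{timestamp}.carbonindexmerge"


plain = index_merge_path("tableFolder", "UUID1", "timestamp1")
partitioned = index_merge_path("partitionTableFolder", "UUID1", "timestamp1",
                               partition="partkey=value1")
summary = segment_summary_path("tableFolder", "UUID1", "timestamp1")
```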
3. segment management
Extract a segment interface that supports open/close, read/write, and a
segment-level index pruning API.
The segment should support multiple data source types: file formats (carbon,
parquet, orc...), HBase...
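
One possible shape for such an interface, sketched in Python. Everything
here is an assumption for discussion: the class names, the method
signatures, and the min/max pruning rule are illustrative, and
InMemorySegment only stands in for a real carbon/parquet/orc/HBase back
end.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, Any]


class Segment(ABC):
    """Hypothetical segment interface: open/close, read/write, pruning."""

    @abstractmethod
    def open(self) -> None: ...

    @abstractmethod
    def close(self) -> None: ...

    @abstractmethod
    def write(self, rows: Iterable[Row]) -> None: ...

    @abstractmethod
    def read(self) -> Iterator[Row]: ...

    @abstractmethod
    def prune(self, column: str, value: Any) -> bool:
        """Return True if the segment can be skipped for column == value."""


class InMemorySegment(Segment):
    """Toy data source that keeps rows and a min/max index in memory."""

    def __init__(self, segment_id: str) -> None:
        self.segment_id = segment_id
        self._rows: List[Row] = []
        self._min_max: Dict[str, Tuple[Any, Any]] = {}
        self._is_open = False

    def open(self) -> None:
        self._is_open = True

    def close(self) -> None:
        self._is_open = False

    def _check_open(self) -> None:
        if not self._is_open:
            raise RuntimeError("segment is closed")

    def write(self, rows: Iterable[Row]) -> None:
        self._check_open()
        for row in rows:
            self._rows.append(row)
            # Maintain a per-column min/max as the segment-level index.
            for col, val in row.items():
                lo, hi = self._min_max.get(col, (val, val))
                self._min_max[col] = (min(lo, val), max(hi, val))

    def read(self) -> Iterator[Row]:
        self._check_open()
        return iter(list(self._rows))

    def prune(self, column: str, value: Any) -> bool:
        if column not in self._min_max:
            return True  # no row has this column, nothing can match
        lo, hi = self._min_max[column]
        return not (lo <= value <= hi)


seg = InMemorySegment("UUID1")
seg.open()
seg.write([{"id": 3}, {"id": 7}])
skip_10 = seg.prune("id", 10)  # 10 is outside [3, 7]
skip_5 = seg.prune("id", 5)    # 5 is inside [3, 7], so it must be scanned
rows = list(seg.read())
seg.close()
```

A query planner would call prune on every segment first and read only the
ones that cannot be skipped, whatever the underlying data source is.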
4. clean stale data
Cleaning stale data will become an optional operation.
-----
Best Regards
David Cai
--
Sent from:
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/