[Background]
1. In some scenarios, two loading/compaction jobs may write data to the same segment. This leaves the segment's data in a confused state, and some features stop working correctly afterwards.
2. Loading/compaction/update/delete operations need to clean stale data before they execute. Cleaning stale data is a high-risk operation: if an exception occurs, it can delete valid data. If the system does not clean stale data, in some scenarios that data is added into a new merged index file and becomes queryable.
3. Loading/compaction takes a long time, and in some scenarios the lock is held for a long time as well.
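
To make item 1 concrete, here is a minimal Java sketch of how two jobs could end up with the same numeric segment id, next to a randomly generated UUID id. The class name, the method names, and the "largest existing id + 1" allocation rule are all assumptions made for illustration. This mail does not describe how numeric ids are actually allocated, so this is not CarbonData code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.UUID;

// Hypothetical illustration only; none of these names exist in CarbonData.
public class SegmentIdSketch {

    // Assumed numeric scheme: next id = largest existing id + 1.
    // The result depends on shared state (the list of existing ids).
    static String nextNumericId(List<String> existingIds) {
        int max = -1;
        for (String id : existingIds) {
            max = Math.max(max, Integer.parseInt(id));
        }
        return String.valueOf(max + 1);
    }

    // UUID scheme: the id is generated without reading any shared state.
    static String nextUuidId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Both jobs read the same snapshot before either one commits.
        List<String> snapshot = Arrays.asList("0", "1", "2");
        String jobA = nextNumericId(snapshot);
        String jobB = nextNumericId(snapshot);
        System.out.println("numeric ids equal: " + jobA.equals(jobB));

        // Each job generates its own id independently.
        String uuidA = nextUuidId();
        String uuidB = nextUuidId();
        System.out.println("uuid ids equal: " + uuidA.equals(uuidB));
    }
}
```

In this sketch the numeric scheme is only safe if reading the snapshot and committing the new id happen under one lock, which is the long-held lock from item 3. The UUID scheme has no such dependency.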

[Motivation & Goal]
We should avoid data confusion and the risk of cleaning stale data. We could use a UUID as the segment id to avoid these problems. We might even be able to do loading/compaction without the segment/compaction lock.

[Modification]
1. segment id
Use a UUID as the segment id instead of the unique numeric value.

2. segment layout
a) move the segment data folder into the table folder
b) move the carbonindexmerge file into the Metadata/segments folder

tableFolder
  UUID1
   |_xxx.carbondata
   |_xxx.carbonindex
  UUID2
  Metadata
   |_segments
     |_UUID1_timestamp1.segment (segment index summary)
     |_UUID1_timestamp1.carbonindexmerge (segment index detail)
   |_schema
   |_tablestatus
  LockFiles

partitionTableFolder
  partkey=value1
   |_xxx.carbondata
   |_xxx.carbonindex
  partkey=value2
  Metadata
   |_segments
     |_UUID1_timestamp1.segment (segment index summary)
     |_partkey=value1
       |_UUID1_timestamp1.carbonindexmerge (segment index detail)
     |_partkey=value2
   |_schema
   |_tablestatus
  LockFiles

3. segment management
Extract a segment interface that supports open/close, read/write, and a segment-level index pruning API. The segment should support multiple data source types: file formats (carbon, parquet, orc...), HBase...

4. clean stale data
It will become an optional operation.

-----
Best Regards
David Cai
--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/