Cao, Lionel created CARBONDATA-910:
--------------------------------------

             Summary: Implement Partition feature
                 Key: CARBONDATA-910
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-910
             Project: CarbonData
          Issue Type: New Feature
          Components: core, data-load, data-query
            Reporter: Cao, Lionel
            Assignee: Cao, Lionel


Why need partition table
Partition table provide an option to divide table into some smaller pieces. 
With partition table:
      1. Data could be better managed, organized and stored. 
      2. We can avoid full table scan in some scenario and improve query 
performance. (partition column in filter, 
      multiple partition tables join in the same partition column etc.)

Partitioning design
Range Partitioning           
       range partitioning maps data to partitions according to the range of 
partition column values, operator '<' defines non-inclusive upper bound of 
current partition.
List Partitioning
       list partitioning allows you map data to partitions with specific value 
list
Hash Partitioning
       hash partitioning maps data to partitions with hash algorithm and put 
them to the given number of partitions
Composite Partitioning(2 levels at most for now)
       Range-Range, Range-List, Range-Hash, List-Range, List-List, List-Hash, 
Hash-Range, Hash-List, Hash-Hash

DDL-Create 
Create table sales(
     itemid long, 
     logdate datetime, 
     customerid int
     ...
     ...)
[partition by range logdate(...)]
[subpartition by list area(...)]
Stored By 'carbondata'
[tblproperties(...)];

range partition: 
     partition by range logdate(<  '2016-01-01', < '2017-01-01', < 
'2017-02-01', < '2017-03-01', < '2099-01-01')
list partition:
     partition by list area('Asia', 'Europe', 'North America', 'Africa', 
'Oceania')
hash partition:
     partition by hash(itemid, 9) 
composite partition:
     partition by range logdate(<  '2016- -01', < '2017-01-01', < '2017-02-01', 
< '2017-03-01', < '2099-01-01')
     subpartition by list area('Asia', 'Europe', 'North America', 'Africa', 
'Oceania')

DDL-Rebuild, Add
Alter table sales rebuild partition by (range|list|hash)(...);
Alter table salse add partition (< '2018-01-01');    #only support range 
partitioning, list partitioning
Alter table salse add partition ('South America');

#Note: No delete operation for partition, please use rebuild. 
If need delete data, use delete statement, but the definition of partition will 
not be deleted.

Partition Table Data Store
[Option One]
Use the current design, keep partition folder out of segments
Fact
   |___Part0
   |          |___Segment_0
   |                         |___ *******-[bucketId]-.carbondata
   |                         |___ *******-[bucketId]-.carbondata
   |          |___Segment_1
   |          ...
   |___Part1
   |          |___Segment_0
   |          |___Segment_1
   |...

[Option Two]
remove partition folder, add partition id into file name and build btree in 
driver side.
Fact
   |___Segment_0
   |                  |___ *******-[bucketId]-[partitionId].carbondata
   |                  |___ *******-[bucketId]-[partitionId].carbondata
   |___Segment_1
   |___Segment_2
   ...


Pros & Cons: 
Option one would be faster to locate target files
Option two need to store more metadata of folders

Partition Table MetaData Store
partitioni info should be stored in file footer/index file and load into memory 
before user query.

Relationship with Bucket
Bucket should be lower level of partition.

Partition Table Query

Example:
Select * from sales
where logdate <= date '2016-12-01';

User should remember to add a partition filter when write SQL on a partition 
table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to