[jira] [Commented] (KYLIN-3221) Some improvements for lookup table

2018-05-23 Thread Shaofeng SHI (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488475#comment-16488475
 ] 

Shaofeng SHI commented on KYLIN-3221:
-

Hi Gang and Julian, I built a binary package with PR 139 and built a cube, but 
found that no global snapshot was taken. I then tried to re-create the cube; 
when I click the "+" button for the snapshot, there is a JS error, which I 
think blocks creating it:

 

scripts.min.20180524113855.js:10159 TypeError: Cannot read property 'length' of undefined
 at $$childScopeClass.$scope.addSnapshot (scripts.min.20180524113855.js:44805)
 at scripts.min.20180524113855.js:10955
 at callback (scripts.min.20180524113855.js:19286)
 at $$childScopeClass.$eval (scripts.min.20180524113855.js:12863)
 at $$childScopeClass.$apply (scripts.min.20180524113855.js:12961)
 at HTMLButtonElement.<anonymous> (scripts.min.20180524113855.js:19291)
 at HTMLButtonElement.dispatch (scripts.min.20180524113855.js:3)
 at HTMLButtonElement.r.handle (scripts.min.20180524113855.js:3)

 

Please also check the screenshot.


[jira] [Commented] (KYLIN-3221) Some improvements for lookup table

2018-05-09 Thread Ma Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469954#comment-16469954
 ] 

Ma Gang commented on KYLIN-3221:


A pull request for the change has been sent: 
[https://github.com/apache/kylin/pull/139]

> Some improvements for lookup table 
> ---
>
> Key: KYLIN-3221
> URL: https://issues.apache.org/jira/browse/KYLIN-3221
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine, Metadata, Query Engine
> Reporter: Ma Gang
> Assignee: Ma Gang
> Priority: Major
>
> There are two limitations in the current lookup table design:
>  # Lookup table size is limited: the table snapshot must be cached in the 
> Kylin server, so a snapshot table that is too large will break the server.
>  # Lookup table snapshot references are stored in every segment of the cube, 
> so a global snapshot table cannot be supported. With a global snapshot table, 
> an update to the lookup table takes effect for all segments.
> To resolve these limitations, we have decided to make some improvements to 
> the existing lookup table design. Below is the initial document; any comments 
> and suggestions are welcome.
> h2. Metadata
> A new property will be added to CubeDesc to describe how lookup tables are 
> snapshotted; it can be defined during cube design:
> |@JsonProperty("snapshot_table_desc_list")
>  private List<SnapshotTableDesc> snapshotTableDescList = Collections.emptyList();|
> SnapshotTableDesc defines how a table is stored and whether it is global. 
> Two store types are currently supported:
>  # "metaStore": the table snapshot is stored in the metadata store. This is 
> the same as the current design and is the default option.
>  # "hbaseStore": the table snapshot is stored in an additional HBase table.
> |@JsonProperty("table_name")
>  private String tableName;
>   
>  @JsonProperty("store_type")
>  private String snapshotStorageType = "metaStore";
>   
>  @JsonProperty("local_cache_enable")
>  private boolean enableLocalCache = true;
>   
>  @JsonProperty("global")
>  private boolean global = false;|
>  
> Add a 'snapshots' property to CubeInstance to store the snapshot resource 
> path of each table whose snapshot is set to global in the cube design:
> |@JsonProperty("snapshots")
>  private Map<String, String> snapshots; // tableName -> tableResourcePath mapping|
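
One way to read the two storage locations described above: a table marked global resolves its snapshot path through the cube-level map, and any other table resolves it through the map kept on the segment. The sketch below illustrates that rule only; `SnapshotResolver`, `resolve`, and the per-segment map parameter are hypothetical names for illustration and are not taken from the pull request.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Kylin's API. */
final class SnapshotResolver {
    private SnapshotResolver() {}

    /**
     * Returns the snapshot resource path for a lookup table, if one is recorded.
     * Global tables are looked up in the cube-level map, all others in the
     * map that belongs to the segment being queried.
     */
    static Optional<String> resolve(String tableName,
                                    boolean global,
                                    Map<String, String> cubeSnapshots,
                                    Map<String, String> segmentSnapshots) {
        Map<String, String> source = global ? cubeSnapshots : segmentSnapshots;
        return Optional.ofNullable(source.get(tableName));
    }
}
```

With this rule, refreshing a global table only has to update one entry on the cube, which is why the change becomes visible to every segment at once.
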
>  
> Add a new meta model, ExtTableSnapshot, to describe the extended table 
> snapshot information. The information is stored under a new metastore path, 
> /ext_table_snapshot/\{tableName}/\{uuid}.snapshot, and the metadata includes 
> the following fields:
> |@JsonProperty("tableName")
>  private String tableName;
>   
>  @JsonProperty("signature")
>  private TableSignature signature;
>   
>  @JsonProperty("storage_location_identifier")
>  private String storageLocationIdentifier;
>   
>  @JsonProperty("key_columns")
>  private String[] keyColumns;  // the key columns of the table
>   
>  @JsonProperty("storage_type")
>  private String storageType;
>   
>  @JsonProperty("size")
>  private long size;
>   
>  @JsonProperty("row_cnt")
>  private long rowCnt;|
>  
> Add a new section to the 'Advanced Setting' tab of cube design, where the 
> user can set the table snapshot properties for each table. By default, a 
> snapshot is segment level and is stored in the metadata store.
> h2. Build
> If the user specifies the 'hbaseStore' storage type for any lookup table, a 
> MapReduce job converts the Hive source table to HFiles, which are then bulk 
> loaded into an HTable. This adds two job steps for lookup table 
> materialization.
> h2. HBase Lookup Table Schema
> All data is stored as raw values.
> Suppose the lookup table has the primary keys key1 and key2. The rowkey will 
> be:
> ||2 bytes||2 bytes||len1 bytes||2 bytes||len2 bytes||
> |shard|key1 value length (len1)|key1 value|key2 value length (len2)|key2 value|
> The first 2 bytes hold the shard number, so the HBase table can be pre-split. 
> The shard size is configurable through the Kylin property 
> "kylin.snapshot.ext.shard-mb"; the default size is 500 MB.
> There is one column family, c, with multiple columns; each column name is the 
> index of the column in the table definition.
> |c|
> |1|2|...|
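
The rowkey layout above can be sketched as a small encoder: a 2-byte shard prefix followed, for each key column, by a 2-byte length and the value bytes. This is a minimal sketch under stated assumptions, not the code from the pull request: the class and method names are hypothetical, key values are assumed to be UTF-8 strings, lengths are written big-endian, and the shard is assumed to come from a hash of the key values modulo the shard count. The description does not specify any of these details.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical encoder for illustration; not part of Kylin's API. */
final class LookupRowKey {
    private LookupRowKey() {}

    /** Builds: shard (2 bytes), then for each key a 2-byte length and the raw value bytes. */
    static byte[] encode(short shard, String... keyValues) {
        byte[][] encoded = new byte[keyValues.length][];
        int total = 2; // shard prefix
        for (int i = 0; i < keyValues.length; i++) {
            // Assumption: key values are encoded as UTF-8.
            encoded[i] = keyValues[i].getBytes(StandardCharsets.UTF_8);
            if (encoded[i].length > Short.MAX_VALUE) {
                throw new IllegalArgumentException("key value too long for a 2-byte length");
            }
            total += 2 + encoded[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total); // ByteBuffer is big-endian by default
        buf.putShort(shard);
        for (byte[] value : encoded) {
            buf.putShort((short) value.length);
            buf.put(value);
        }
        return buf.array();
    }

    /** Assumption: the shard is a hash of the key values modulo the number of shards. */
    static short shardOf(int shardCount, String... keyValues) {
        if (shardCount < 1 || shardCount > Short.MAX_VALUE) {
            throw new IllegalArgumentException("shardCount must fit in a positive short");
        }
        return (short) Math.floorMod(Arrays.hashCode(keyValues), shardCount);
    }
}
```

Because the shard number leads the rowkey, rows with the same shard are contiguous, which is what allows the table to be pre-split on shard boundaries.
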
>  
> h2. Query
> For a key lookup query, directly call the HBase get API to fetch the entire 
> row by key (the local cache is used instead if it is enabled).
> For queries that need to fetch keys by derived columns, iterate over all rows 
> to get the related keys (the local cache is used instead if it is enabled).
> For queries that hit only the lookup table, iterate over all rows and let 
> Calcite do the aggregation and filtering.

[jira] [Commented] (KYLIN-3221) Some improvements for lookup table

2018-04-25 Thread Ma Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453408#comment-16453408
 ] 

Ma Gang commented on KYLIN-3221:


A new 'Lookup Refresh' action button has also been added for each cube. 
Clicking the button opens a dialog in which the user chooses which lookup 
table to refresh; if the table is not set to global, the user can also choose 
some or all of the segments whose snapshot needs to be refreshed. The user can 
then click 'submit' to submit a new job that builds the table snapshot 
independently.


[jira] [Commented] (KYLIN-3221) Some improvements for lookup table

2018-04-25 Thread Ma Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453381#comment-16453381
 ] 

Ma Gang commented on KYLIN-3221:


According to a performance test, when a big lookup table is saved in HBase, 
fetching many keys from the HTable takes a long time: in the test, getting 
about 110k random keys from the HTable took about 200s.

So I added a local LRU disk cache in the query server to improve lookup join 
performance. Currently I use RocksDB as the disk cache; it is configurable and 
can be disabled in the cube configuration. If the local cache is enabled, 
another step is added to the build job to warm up the cache. In the test, the 
RocksDB cache performed well: on my VM with 4 cores, 10 GB RAM, and an HDD, it 
took about 4 seconds to randomly get 110k keys, and scan performance was also 
very good.
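
The caching idea can be illustrated with a small read-through LRU cache. This is only a sketch of the pattern, not the RocksDB-based implementation described above: an in-memory `LinkedHashMap` stands in for the disk cache, a `Function` stands in for the HBase get, and the class name `LruLookupCache` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical read-through LRU cache for illustration; not Kylin's implementation. */
final class LruLookupCache<K, V> {
    private final Map<K, V> cache;
    private final Function<K, V> remoteLookup; // stands in for the slow HBase get
    private int remoteCalls = 0;

    LruLookupCache(final int capacity, Function<K, V> remoteLookup) {
        this.remoteLookup = remoteLookup;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict the least recently used entry
            }
        };
    }

    /** Returns the cached value, falling back to the remote lookup on a miss. */
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = remoteLookup.apply(key);
            remoteCalls++;
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    /** Number of times the remote lookup was invoked. */
    int remoteCalls() {
        return remoteCalls;
    }
}
```

A warm-up step like the one mentioned above would amount to calling `get` for the expected keys during the build, so that queries afterwards mostly hit the local cache.
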


[jira] [Commented] (KYLIN-3221) Some improvements for lookup table

2018-04-02 Thread liyang (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423395#comment-16423395
 ] 

liyang commented on KYLIN-3221:
---

+1 Nice proposal. The global snapshot table can be one way to support slowly 
changing dimension tables.


[jira] [Commented] (KYLIN-3221) Some improvements for lookup table

2018-02-04 Thread Billy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/KYLIN-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351822#comment-16351822
 ] 

Billy Liu commented on KYLIN-3221:
--

+1

> Some improvements for lookup table 
> ---
>
> Key: KYLIN-3221
> URL: https://issues.apache.org/jira/browse/KYLIN-3221
> Project: Kylin
> Issue Type: Improvement
> Components: Job Engine, Metadata, Query Engine
> Reporter: Ma Gang
> Assignee: Ma Gang
> Priority: Major
>
> There are two limitations in the current lookup table design:
>  # Lookup table size is limited: the table snapshot must be cached in the 
> Kylin server, so a snapshot table that is too large will break the server.
>  # Lookup table snapshot references are stored in every segment of the cube, 
> so a global snapshot table cannot be supported. With a global snapshot table, 
> an update to the lookup table takes effect for all segments.
> To resolve these limitations, we have decided to make some improvements to 
> the existing lookup table design. Below is the initial document; any comments 
> and suggestions are welcome.
> h2. Metadata
> A new property will be added to CubeDesc to describe how lookup tables are 
> snapshotted; it can be defined during cube design:
> |@JsonProperty("snapshot_table_desc_list")
> private List<SnapshotTableDesc> snapshotTableDescList = Collections.emptyList();|
> SnapshotTableDesc defines how a table is stored and whether it is global. 
> Two store types are currently supported:
>  # "metaStore": the table snapshot is stored in the metadata store. This is 
> the same as the current design and is the default option.
>  # "hbaseStore": the table snapshot is stored in an additional HBase table.
> |@JsonProperty("table_name")
> private String tableName;
>  
> @JsonProperty("store_type")
> private String snapshotStorageType = "metaStore";
>  
> @JsonProperty("global")
> private boolean global = false;|
>  
> Add a 'snapshots' property to CubeInstance to store the snapshot resource 
> path of each table whose snapshot is set to global in the cube design:
> |@JsonProperty("snapshots")
> private Map<String, String> snapshots; // tableName -> tableResourcePath mapping|
>  
> Add a new meta model, ExtTableSnapshot, to describe the extended table 
> snapshot information. The information is stored under a new metastore path, 
> /ext_table_snapshot/\{tableName}/\{uuid}.snapshot, and the metadata includes 
> the following fields:
> |@JsonProperty("tableName")
> private String tableName;
>  
> @JsonProperty("signature")
> private TableSignature signature;
>  
> @JsonProperty("storage_location_identifier")
> private String storageLocationIdentifier;
>  
> @JsonProperty("size")
> private long size;
>  
> @JsonProperty("row_cnt")
> private long rowCnt;|
>  
> Add a new section to the 'Advanced Setting' tab of cube design, where the 
> user can set the table snapshot properties for each table. By default, a 
> snapshot is segment level and is stored in the metadata store.
> h2. Build
> If the user specifies the 'hbaseStore' storage type for any lookup table, a 
> MapReduce job converts the Hive source table to HFiles, which are then bulk 
> loaded into an HTable. This adds two job steps for lookup table 
> materialization.
> h2. HBase Lookup Table Schema
> All data is stored as raw values.
> Suppose the lookup table has the primary keys key1 and key2. The rowkey will 
> be:
> ||2 bytes||len1 bytes||2 bytes||len2 bytes||
> |key1 value length (len1)|key1 value|key2 value length (len2)|key2 value|
>  
> There is one column family, c, with multiple columns; each column name is the 
> index of the column in the table definition.
> |c|
> |1|2|...|
>  
> h2. Query
> For a key lookup query, directly call the HBase get API to fetch the entire 
> row by key.
> For queries that need to fetch keys by derived columns, iterate over all rows 
> to get the related keys.
> For queries that hit only the lookup table, iterate over all rows and let 
> Calcite do the aggregation and filtering.
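
The difference between the first two query paths can be sketched with an in-memory table: a key lookup is a single point read, while a filter on a derived (non-key) column has to scan every row. The `TreeMap` below merely stands in for the HBase table, and `InMemoryLookupTable` and its methods are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the HBase lookup table; for illustration only. */
final class InMemoryLookupTable {
    // Sorted by key, loosely mirroring how HBase orders rows by rowkey.
    private final TreeMap<String, String[]> rows = new TreeMap<>();

    void put(String key, String... columns) {
        rows.put(key, columns);
    }

    /** Key lookup: one point read returns the entire row, or null if absent. */
    String[] getRow(String key) {
        return rows.get(key);
    }

    /** Derived-column filter: iterates over all rows to collect the matching keys. */
    List<String> keysWhere(int columnIndex, String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String[]> row : rows.entrySet()) {
            if (value.equals(row.getValue()[columnIndex])) {
                keys.add(row.getKey());
            }
        }
        return keys;
    }
}
```

The full scan in `keysWhere` is the cost that the coprocessor and secondary index items under "Future" are meant to reduce.
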
> h2. Management
> For each lookup table, the admin can view how many snapshots it has in Kylin, 
> along with each snapshot's type and size and which cubes/segments reference 
> it. Snapshot tables that have no references can be deleted.
> h2. Cleanup
> When cleaning up the metadata store, snapshots stored in HBase must also be 
> removed. The metadata store needs to be cleaned up periodically by a cron job.
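
The rule that unreferenced snapshots can be deleted amounts to a set difference between all known snapshot paths and the paths still referenced by cubes or segments. A minimal sketch, assuming both collections have already been gathered; `SnapshotCleanup` and `unreferenced` are hypothetical names, not Kylin's cleanup job.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not Kylin's cleanup job. */
final class SnapshotCleanup {
    private SnapshotCleanup() {}

    /** Returns the snapshot paths that no cube or segment references any more. */
    static Set<String> unreferenced(Collection<String> allSnapshotPaths,
                                    Collection<String> referencedPaths) {
        Set<String> result = new TreeSet<>(allSnapshotPaths);
        result.removeAll(referencedPaths);
        return result;
    }
}
```

For snapshots of the 'hbaseStore' type, each returned path would also imply dropping the corresponding HBase table, as the Cleanup section notes.
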
> h2. Future
>  # Add a coprocessor for the lookup table to improve the performance of 
> lookup table queries and of queries that filter by derived columns.
>  # Add secondary index support for external snapshot tables.


