[ https://issues.apache.org/jira/browse/KYLIN-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423395#comment-16423395 ]
liyang commented on KYLIN-3221:
-------------------------------

+1 Nice proposal. The global snapshot table can be one way to support slowly changing dimension tables.

> Some improvements for lookup table
> -----------------------------------
>
>                 Key: KYLIN-3221
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3221
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Job Engine, Metadata, Query Engine
>            Reporter: Ma Gang
>            Assignee: Ma Gang
>            Priority: Major
>
> There are two limitations in the current lookup table design:
> # Lookup table size is limited: the table snapshot must be cached in the
> Kylin server, so a snapshot that is too large will break the server.
> # Lookup table snapshot references are stored in every segment of the cube,
> so a global snapshot table cannot be supported. With a global snapshot table,
> an update to the lookup table takes effect for all segments.
> To resolve these limitations, we plan some improvements to the
> existing lookup table design. Below is the initial document; any comments and
> suggestions are welcome.
> h2. Metadata
> A new property will be added to CubeDesc to describe how lookup tables are
> snapshotted. It can be defined during cube design:
> |@JsonProperty("snapshot_table_desc_list")
> private List<SnapshotTableDesc> snapshotTableDescList = Collections.emptyList();|
> SnapshotTableDesc defines how a table is stored and whether it is global or
> not. Currently two store types are supported:
> # "metaStore": the table snapshot is stored in the metadata store, the
> same as in the current design. This is the default option.
> # "hbaseStore": the table snapshot is stored in an additional HBase table.
> |@JsonProperty("table_name")
> private String tableName;
>
> @JsonProperty("store_type")
> private String snapshotStorageType = "metaStore";
>
> @JsonProperty("global")
> private boolean global = false;|
>
> Add a 'snapshots' property to CubeInstance to store the snapshot resource
> path for each table whose snapshot is set to global in the cube design:
> |@JsonProperty("snapshots")
> private Map<String, String> snapshots; // tableName -> tableResourcePath mapping|
>
> Add a new meta model, ExtTableSnapshot, to describe the extended table
> snapshot information. It is stored under a new metastore path,
> /ext_table_snapshot/\{tableName}/\{uuid}.snapshot, and the metadata includes
> the following info:
> |@JsonProperty("tableName")
> private String tableName;
>
> @JsonProperty("signature")
> private TableSignature signature;
>
> @JsonProperty("storage_location_identifier")
> private String storageLocationIdentifier;
>
> @JsonProperty("key_columns")
> private String[] keyColumns; // the key columns of the table
>
> @JsonProperty("storage_type")
> private String storageType;
>
> @JsonProperty("size")
> private long size;
>
> @JsonProperty("row_cnt")
> private long rowCnt;|
>
> Add a new section to the 'Advanced Setting' tab of cube design, where the
> user can set table snapshot properties for each table. By default, a snapshot
> is segment level and stored in the metadata store.
> h2. Build
> If the user specifies the 'hbaseStore' storage type for any lookup table, a
> MapReduce job converts the Hive source table to HFiles, which are then bulk
> loaded into an HTable. This adds two job steps for lookup table
> materialization.
> h2. HBase Lookup Table Schema
> All data is stored as raw values.
> Suppose the lookup table has primary keys key1, key2; the rowkey will be:
> ||2 bytes||2 bytes||len1 bytes||2 bytes||len2 bytes||
> |shard|key1 value length (len1)|key1 value|key2 value length (len2)|key2 value|
> The first 2 bytes are the shard number. The HBase table can be pre-split; the
> shard size is configurable through the Kylin property
> "kylin.snapshot.ext.shard-mb", with a default size of 500MB.
> There is one column family, c, with multiple columns; each column name is the
> index of the column in the table definition:
> |c|
> |1|2|...|
>
> h2. Query
> For key lookup queries, directly call the HBase get API to fetch the entire
> row by key.
> For queries that need to fetch keys by derived columns, iterate over all rows
> to get the related keys.
> For queries that hit only the lookup table, iterate over all rows and let
> Calcite do the aggregation and filtering.
> h2. Management
> For each lookup table, an admin can view how many snapshots it has in Kylin,
> each snapshot's type/size information, and which cubes/segments reference the
> snapshot. Snapshot tables that have no references can be deleted.
> h2. Cleanup
> When cleaning up the metadata store, snapshots stored in HBase must also be
> removed. The metadata store needs to be cleaned up periodically by a cron job.
> h2. Future
> # Add a coprocessor for the lookup table to improve the performance of lookup
> table queries and of queries that filter by derived columns.
> # Add secondary index support for external snapshot tables.
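
For illustration, the rowkey layout in the "HBase Lookup Table Schema" section (2-byte shard, then a 2-byte length plus the raw value for each key) could be encoded roughly as below. This is only a sketch under stated assumptions, not Kylin's implementation: the class and method names are hypothetical, key values are assumed to be UTF-8 strings, and the shard is assumed to be a hash of the key bytes modulo a shard count, since the proposal does not say how the shard number is derived.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed rowkey layout; names are not from Kylin.
class LookupRowKeySketch {

    // Builds [2-byte shard][2-byte len1][key1][2-byte len2][key2]...
    static byte[] encodeRowKey(int shardCount, String... keyValues) {
        byte[][] raw = new byte[keyValues.length][];
        int bodyLen = 0;
        for (int i = 0; i < keyValues.length; i++) {
            // Assumption: key values are serialized as UTF-8 bytes.
            raw[i] = keyValues[i].getBytes(StandardCharsets.UTF_8);
            if (raw[i].length > Short.MAX_VALUE) {
                throw new IllegalArgumentException("key value too long for a 2-byte length");
            }
            bodyLen += 2 + raw[i].length;
        }
        ByteBuffer body = ByteBuffer.allocate(bodyLen);
        for (byte[] value : raw) {
            body.putShort((short) value.length); // 2-byte length prefix
            body.put(value);                     // raw key value
        }
        // Assumption: shard = hash of the length-prefixed key bytes mod shardCount.
        int shard = Math.floorMod(Arrays.hashCode(body.array()), shardCount);
        return ByteBuffer.allocate(2 + bodyLen)
                .putShort((short) shard)
                .put(body.array())
                .array();
    }

    // Reads the key values back, skipping the 2-byte shard prefix.
    static List<String> decodeKeys(byte[] rowKey) {
        ByteBuffer buf = ByteBuffer.wrap(rowKey);
        buf.getShort(); // shard number, not needed to recover the keys
        List<String> keys = new ArrayList<>();
        while (buf.hasRemaining()) {
            byte[] value = new byte[buf.getShort()];
            buf.get(value);
            keys.add(new String(value, StandardCharsets.UTF_8));
        }
        return keys;
    }

    public static void main(String[] args) {
        byte[] rowKey = encodeRowKey(4, "US", "2018");
        System.out.println(Arrays.toString(rowKey));
        System.out.println(decodeKeys(rowKey));
    }
}
```

Because the shard prefix is computed from the key bytes alone, a point lookup can rebuild the full rowkey from the key values and issue a single get, which matches the key lookup path described in the Query section.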