[jira] [Commented] (KYLIN-2012) more robust approach to hive schema changes

Dayue Gao (JIRA) Thu, 13 Oct 2016 02:08:26 -0700

    [ 
https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571341#comment-15571341
 ]


Dayue Gao commented on KYLIN-2012:
----------------------------------

I found that even after KYLIN-1985, we can only allow users to append columns 
to a lookup table. The reasons are:
* LookupTable uses ColumnDesc's zero-based index to find key columns in 
SnapshotTable. If users insert or drop a column in the middle of the hive 
table, the indexes of ColumnDesc are no longer aligned with hive.
* If users drop a trailing unused column of a lookup table, queries can fail 
with ArrayIndexOutOfBoundsException at LookupStringTable#convertRow. That's 
because the number of columns in SnapshotTable is larger than 
length(LookupStringTable.colIsDateTime).
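
To make the two failure modes concrete, here is a minimal sketch. It is not 
Kylin code: the class StaleIndexDemo and both methods are hypothetical 
stand-ins, and the loop in convertRow is only my assumption about the shape of 
the real conversion (iterate over the snapshot row, index into per-column 
flags built from the reloaded metadata).

```java
// Hypothetical, simplified model of the two problems; not Kylin's actual classes.
class StaleIndexDemo {

    // Reads a key column by a positional index. The result is only correct
    // if the index was computed against the same column order as the row.
    static String readByIndex(String[] row, int zeroBasedIndex) {
        return row[zeroBasedIndex];
    }

    // Assumed shape of a per-column conversion: the loop bound comes from the
    // snapshot row, while the flags array is sized from the reloaded metadata.
    static String[] convertRow(String[] snapshotRow, boolean[] colIsDateTime) {
        String[] out = new String[snapshotRow.length];
        for (int i = 0; i < snapshotRow.length; i++) {
            // Throws ArrayIndexOutOfBoundsException when the snapshot row has
            // more columns than the flags array.
            out[i] = colIsDateTime[i] ? "datetime:" + snapshotRow[i] : snapshotRow[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Row laid out as (id, name, country).
        String[] row = {"1", "Alice", "CN"};

        // After inserting a column between id and name, "name" moves from
        // index 1 to index 2, so the new index reads the wrong field.
        System.out.println(readByIndex(row, 1)); // name under the old layout
        System.out.println(readByIndex(row, 2)); // country, not name

        // After dropping the trailing column, only two flags remain.
        try {
            convertRow(row, new boolean[] {false, false});
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("convertRow failed: " + e.getMessage());
        }
    }
}
```

Appending a column is the only change that leaves both the existing indexes 
and the array lengths compatible in this model, which matches what I observed.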

> more robust approach to hive schema changes
> -------------------------------------------
>
>                 Key: KYLIN-2012
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2012
>             Project: Kylin
>          Issue Type: Bug
>          Components: Metadata, REST Service, Web 
>    Affects Versions: v1.5.3
>            Reporter: Dayue Gao
>            Assignee: Dayue Gao
>             Fix For: v1.6.0
>
>
> Our users occasionally want to change their existing cubes, such as 
> adding/renaming/removing a dimension. Some of these changes require 
> modifications to the source hive table. When a user changes the table schema 
> and reloads its metadata in Kylin, several issues can happen depending on 
> what was changed.
> I did some schema change tests based on 1.5.3. The results after reloading 
> the table are listed below:
> || type of changes || fact table || lookup table ||
> | *minor* | both query and build still work | query can fail or return wrong answer |
> | *major* | fail to load related cube | fail to load related cube |
> {{minor}} changes are those that don't change columns used in cubes, such 
> as inserting/appending a new column or removing/changing an unused column.
> {{major}} changes are the opposite, such as removing, renaming, or changing 
> the type of a used column.
> As the table shows, reloading a changed table is problematic in certain 
> cases. KYLIN-1536 reports a similar problem.
> So what can we do to support this kind of iterative development process (load 
> -> define cube -> build -> reload -> change cube -> rebuild)?
> My first thought is to simply detect and prohibit reloading a used table. The 
> user should be able to know which cube is preventing the reload, and could 
> then drop and recreate the cube after reloading. However, defining a cube is 
> not an easy task (consider editing 100 measures). Forcing users to recreate 
> their cubes over and over again will certainly not make them happy.
> A better idea is to allow a cube to be editable even if it's broken because 
> some columns changed after reloading. A broken cube can't be built or 
> queried; it can only be edited or dropped. In fact, there is a cube status 
> called {{RealizationStatusEnum.DESCBROKEN}} in the code, but it was never 
> used. We should take advantage of it.
> An enabled cube shouldn't allow schema changes, otherwise an unintentional 
> reload could make it unavailable. Similarly, a disabled but unpurged cube 
> shouldn't allow schema changes since it still has data in it.
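
The rules proposed in the description above could be sketched roughly as 
follows. This is only an illustration under my own assumptions: 
ReloadPolicySketch, its Status and Action enums, and the segmentCount 
parameter are hypothetical names, not Kylin's API. Only the DESCBROKEN status 
name is taken from the existing RealizationStatusEnum.

```java
// Hypothetical sketch of the proposed rules; names are illustrative, not Kylin's API.
class ReloadPolicySketch {

    // Simplified stand-in for the cube status; DESCBROKEN mirrors the
    // existing but unused RealizationStatusEnum.DESCBROKEN.
    enum Status { READY, DISABLED, DESCBROKEN }

    enum Action { QUERY, BUILD, EDIT, DROP }

    // A schema change is rejected for an enabled cube, and for a disabled
    // cube that still holds data (segmentCount > 0 means "not purged").
    static boolean allowsSchemaChange(Status status, int segmentCount) {
        if (status == Status.READY) {
            return false;
        }
        if (status == Status.DISABLED && segmentCount > 0) {
            return false;
        }
        return true;
    }

    // A broken cube can only be edited or dropped. Other states are out of
    // scope for this sketch and are deliberately not restricted here.
    static boolean isPermitted(Status status, Action action) {
        if (status == Status.DESCBROKEN) {
            return action == Action.EDIT || action == Action.DROP;
        }
        return true;
    }
}
```

With such a check, an unintentional reload cannot break an enabled cube, and a 
cube that does become broken stays editable instead of failing to load.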



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
