[ 
https://issues.apache.org/jira/browse/KYLIN-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096073#comment-15096073
 ] 

hongbin ma commented on KYLIN-1313:
-----------------------------------

hi

I'm currently working on it on the 2.x-staging branch. I'm not sure whether 
we'll backport it to the 1.x versions. For future reference I'll call the new 
feature "heavy deriving", in contrast with the previous lightweight deriving, 
which only requires looking up the deriving relation in a lookup table 
snapshot.

Before heavy deriving is released, our suggestion was not to include the 
item_url dimension in the cube, and instead to use an external KV system to 
extend each result record from (X,Y,item_id) to (X,Y,item_id,item_url). We 
soon realized that this asks too much of normal users, and that it would be 
best to provide a one-stop solution.

The implementation of heavy deriving is non-trivial, because we cannot keep 
the item_id => item_url mapping in memory. It is also not a good idea to save 
the mapping in an external KV store, as that would degrade the performance of 
cube building and querying. Our current plan is to make a trade-off between 
functionality and performance by introducing a basic assumption: derived 
columns will not participate in any kind of filter. In the item_id => item_url 
case, the user CANNOT specify a filter on the item_url dimension. The only 
thing the heavy derived dimension item_url enables is that, when your final 
result contains item_id, you can retrieve item_url along with it; that's all. 
(Of course there is a hidden assumption here: item_url is uniquely determined 
by item_id, because it is derived from it!)
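
The hidden assumption can be checked on the source data before a column is 
declared as heavy derived. The following is only a minimal sketch of such a 
check; the class DerivingCheckSketch and its method are hypothetical names 
made up for illustration and are not part of Kylin.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Kylin.
final class DerivingCheckSketch {

    /**
     * Returns true if every host value (row[0], e.g. item_id) maps to
     * exactly one derived value (row[1], e.g. item_url) across all rows.
     */
    static boolean isUniquelyDetermined(List<String[]> rows) {
        Map<String, String> seen = new HashMap<>();
        for (String[] row : rows) {
            String hostValue = row[0];
            String derivedValue = row[1];
            // putIfAbsent returns the earlier value if the host was seen before
            String previous = seen.putIfAbsent(hostValue, derivedValue);
            if (previous != null && !previous.equals(derivedValue)) {
                return false; // same host value, two different derived values
            }
        }
        return true;
    }
}
```

If the check fails for a column pair, the pair does not qualify for deriving 
at all, heavy or light.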

With the assumption(s) in mind, we can save item_url as a special measure. 
Take a cuboid with two dimensions, dt and item_id, as an example. A tuple in 
the cuboid would follow this pattern:

Key: 2015-1-1,4234324
Value: Metric1,Metric2,http://items.ebay.com/4234324

where http://items.ebay.com/4234324 is the item_url for item_id 4234324.
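
Storing item_url as a measure means it has to survive aggregation like any 
other measure. One way this could work, given that item_url is uniquely 
determined by item_id, is an aggregator that simply keeps the single value it 
has seen. The sketch below is an assumption about how such a measure might 
behave, not the actual implementation; HeavyDerivedAggregatorSketch is a 
hypothetical class and does not implement Kylin's measure API.

```java
// Hypothetical sketch; does NOT implement Kylin's real aggregator interface.
final class HeavyDerivedAggregatorSketch {

    private String value; // the single derived value seen so far, or null

    /** Folds the derived value of one input row into the aggregate. */
    void aggregate(String incoming) {
        if (incoming == null) {
            return; // nothing to record for this row
        }
        if (value == null) {
            value = incoming; // first value seen for this cuboid tuple
            return;
        }
        if (!value.equals(incoming)) {
            // The uniqueness assumption was violated by the input data.
            throw new IllegalStateException(
                "conflicting derived values: " + value + " vs " + incoming);
        }
    }

    /** Returns the derived value to store in the tuple, or null if none. */
    String getState() {
        return value;
    }
}
```

Failing loudly on a conflict is a design choice of this sketch; an 
implementation could equally keep the first value and ignore the rest.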

At query time, we'll use another IDerivedColumnFiller to retrieve the item_url 
value from the cuboid tuple and, if item_url is requested, return both item_id 
and item_url to the user.
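
The query-time step can be pictured roughly as follows. This is a simplified 
sketch: HeavyDerivedFillerSketch and its method signature are hypothetical 
and do not reflect the real IDerivedColumnFiller interface; the sketch only 
shows the idea of copying the stored measure into the result row when the 
derived column is requested.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names and signatures are NOT Kylin's actual API.
final class HeavyDerivedFillerSketch {

    /**
     * Builds a result row from one cuboid tuple.
     * dimNames/dimValues come from the tuple key; derivedValue is the
     * special measure stored in the tuple value.
     */
    static Map<String, String> toResultRow(String[] dimNames, String[] dimValues,
                                           String derivedName, String derivedValue,
                                           boolean derivedRequested) {
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < dimNames.length; i++) {
            row.put(dimNames[i], dimValues[i]);
        }
        if (derivedRequested) {
            // No lookup is needed: the value travels with the tuple itself.
            row.put(derivedName, derivedValue);
        }
        return row;
    }
}
```

For the example tuple above, calling toResultRow with dimensions (dt, item_id), 
values (2015-1-1, 4234324) and the stored URL yields a row that contains dt, 
item_id and item_url.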

To avoid unnecessary duplication, we'll append the item_url measure only to 
cuboids that have item_id as a dimension. In fact, we can configure the cuboid 
whitelist (https://issues.apache.org/jira/browse/KYLIN-242) properly to make 
sure that only one copy of item_url exists across all of the cuboids.
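
Assuming a cuboid is identified by a bitmask over the row key columns, the 
placement rule reduces to a single bit test. The sketch below uses a made-up 
bit assignment (dt, seller_id, item_id) and a hypothetical class name; it is 
meant only to illustrate the rule.

```java
// Hypothetical sketch; the bit layout below is an illustrative assumption.
final class HeavyDerivedPlacementSketch {

    // Assumed bit positions of the row key columns in a cuboid id.
    static final long DT = 0b100L;
    static final long SELLER_ID = 0b010L;
    static final long ITEM_ID = 0b001L;

    /** True if the cuboid contains the host column and should carry the measure. */
    static boolean carriesDerivedMeasure(long cuboidId, long hostColumnMask) {
        return (cuboidId & hostColumnMask) != 0;
    }
}
```

With this layout, the cuboid (dt, item_id) carries item_url, while the cuboid 
(dt, seller_id) does not.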

Please let me know whether this design will solve your problem. It's open for 
discussion.

> Enable deriving dimensions on non PK/FK
> ---------------------------------------
>
>                 Key: KYLIN-1313
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1313
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: hongbin ma
>            Assignee: hongbin ma
>
> Currently a derived column has to be a column on a lookup table, and its 
> host column has to be a PK/FK (this is also a problem when the lookup table 
> grows very large). Sometimes columns on the fact table exhibit a deriving 
> relationship too. Here's an example fact table:
> (dt date, seller_id bigint, seller_name varchar(100), item_id bigint, 
> item_url varchar(1000), count decimal, price decimal)
> seller_name is uniquely determined by seller_id, and item_url is uniquely 
> determined by item_id. Users do not expect to filter on columns like 
> seller_name or item_url; they just want to retrieve them when they group or 
> filter on other dimensions such as seller_id, item_id, or dt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
