[ https://issues.apache.org/jira/browse/KUDU-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16750708#comment-16750708 ]

Andrew Wong commented on KUDU-2671:
-----------------------------------

This seems pretty useful for many time-series ingest use cases; you can imagine 
a time-series table of customer transactions being set up with the expectation 
that at certain times of the year (e.g. holidays, big sales), there will be 
significantly more rows written than at others. As you suggest, Kudu's write 
throughput can be bottlenecked by the number of tablets being written to at 
once. Given that the hash schema is currently fixed per table, Kudu operators 
today would need to create separate, "denser" tables during these peak times 
and somehow union the tables together to get the same semantics, which is 
operationally unpleasant.

[~yangz] you mentioned you've been using this feature for a while, do you have 
in-progress work that you could share?

> Change hash number for range partitioning
> -----------------------------------------
>
>                 Key: KUDU-2671
>                 URL: https://issues.apache.org/jira/browse/KUDU-2671
>             Project: Kudu
>          Issue Type: Improvement
>          Components: client, java, master, server
>    Affects Versions: 1.8.0
>            Reporter: yangz
>            Priority: Major
>             Fix For: 1.8.0
>
>         Attachments: 屏幕快照 2019-01-24 下午12.03.41.png
>
>
> For our usage, Kudu's schema design isn't flexible enough.
> We create our tables with day-range partitions, such as dt='20181112', like a 
> Hive table.
> But our data size varies a lot from day to day: one day it may be 50 GB, 
> while another day it may be 500 GB. This makes it hard to choose a hash 
> schema. If the hash number is too big, it is wasteful in most cases; if too 
> small, there is a performance problem on days with a large amount of data.
>  
> So we suggest a solution: change the hash number based on a table's 
> historical data.
> For example:
>  # we create the schema with an estimated hash number.
>  # we collect the data size per day range.
>  # we create each new day-range partition with a hash number derived from 
> the collected sizes.
> We have used this feature for half a year, and it works well. We hope this 
> feature will be useful to the community. The solution may not be complete; 
> please help us make it better.
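The sizing step in the proposal above can be sketched as follows. This is an illustrative assumption of how step 3 might derive a per-partition hash number from collected daily sizes; the names and thresholds (TARGET_GB_PER_BUCKET, MIN_BUCKETS, MAX_BUCKETS) are hypothetical tuning knobs, not anything from Kudu itself:

```python
import math

# Assumed tuning knobs for this sketch, not Kudu constants:
TARGET_GB_PER_BUCKET = 25  # desired data volume per hash bucket, in GB
MIN_BUCKETS = 2            # floor so small days still get some parallelism
MAX_BUCKETS = 64           # ceiling to bound tablet count

def buckets_for_day(observed_gb: float) -> int:
    """Derive a hash-bucket count for a new day-range partition from the
    data volume observed on comparable past days."""
    raw = math.ceil(observed_gb / TARGET_GB_PER_BUCKET)
    return max(MIN_BUCKETS, min(MAX_BUCKETS, raw))
```

With these assumed values, a 50 GB day would get 2 hash buckets and a 500 GB day would get 20, matching the reporter's example of daily volume varying by an order of magnitude.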



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
