[ https://issues.apache.org/jira/browse/KUDU-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448407#comment-17448407 ]

ASF subversion and git services commented on KUDU-2671:
-------------------------------------------------------

Commit 6998193e69eeda497f912d1d806470c95b591ad4 in kudu's branch 
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=6998193 ]

KUDU-2671 number of per-range hash dimensions should be fixed for now

As it turned out, updating the client's metacache, the system catalog's
logic, and the partition pruner to accommodate partition keys with a
variable-size hash part is a substantial effort in itself.  However, we
can still deliver the most frequently requested functionality of
changing the number of hash buckets per range partition by restricting
the size of the hash part of a partition key, requiring it to be the
same across all the range partitions in a table.

So, this patch adds a new restriction for per-range custom hash schemas:
the number of hash dimensions must be the same for all the ranges in
a table.  Since the absence of hash bucketing is equivalent to having
zero hash dimensions for a table's range, this means it's not possible
for a particular range to have no hash bucketing at all if the rest of
the ranges in the table have non-trivial hash schemas.
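To make the restriction concrete, below is a minimal, self-contained
sketch of the invariant check (a hypothetical helper, not the actual
catalog manager code): it only looks at how many hash dimensions each
range declares, with zero meaning no hash bucketing for that range.

// Hypothetical sketch of the invariant introduced by this patch; the
// function and its input are illustrative, not Kudu's actual code.
#include <cstddef>
#include <iostream>
#include <vector>

// Each entry is the number of hash dimensions declared by one range
// partition of a table; 0 means the range has no hash bucketing.
bool SameHashDimensionCount(const std::vector<size_t>& dims_per_range) {
  for (size_t d : dims_per_range) {
    if (d != dims_per_range.front()) {
      return false;  // a range with a different number of hash dimensions
    }
  }
  return true;
}

int main() {
  // All ranges have two hash dimensions: accepted.
  std::cout << SameHashDimensionCount({2, 2, 2}) << std::endl;  // prints 1
  // One range has no hash bucketing while the others are bucketed: rejected.
  std::cout << SameHashDimensionCount({1, 1, 0}) << std::endl;  // prints 0
  return 0;
}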

With the introduced restriction, it's still possible to change the other
parameters that define a hash schema per range in any hash dimension
(see the sketch after this list):
  * the number of hash buckets (NOTE: the number of hash buckets must be
    greater than or equal to two)
  * the set of columns for the hash bucketing
  * the seed for the hash function
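For illustration only (hypothetical types, not the actual client API),
the sketch below shows two ranges that differ in bucket counts, hashed
columns, and seeds while both declare the same number of hash
dimensions, so they would pass the check sketched above.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical description of one hash dimension of a range's hash schema.
struct HashDimension {
  std::vector<std::string> columns;  // columns fed into the hash function
  int32_t num_buckets;               // must be greater than or equal to two
  uint32_t seed;                     // seed for the hash function
};

int main() {
  // Range [2023-01-01, 2023-02-01): two hash dimensions.
  std::vector<HashDimension> january = {
    {{"host"},          2, 0},  // dimension 0: 2 buckets on (host)
    {{"metric", "tag"}, 4, 0},  // dimension 1: 4 buckets on (metric, tag)
  };
  // Range [2023-02-01, 2023-03-01): also two hash dimensions, but with
  // different bucket counts, column sets, and seeds; still allowed.
  std::vector<HashDimension> february = {
    {{"host"},   8, 1},
    {{"metric"}, 4, 42},
  };
  std::cout << january.size() << " == " << february.size() << std::endl;
  return 0;
}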

As a part of this changelist, a few test scenarios are now disabled:
those are to be re-enabled once the rest of the code in the system
catalog, the client metacache, and the partition pruner is able to
handle a varying number of hash dimensions.  In addition, new test
scenarios have been added to verify that the invariant of the same
number of hash dimensions across all the range partitions is properly
enforced on the server side when creating a table.

Also, I updated the comparison operator for PartitionKey: since the
number of hash dimensions no longer varies across per-range hash
schemas, it's no longer necessary to concatenate the hash and the range
parts to provide the legacy ordering of partition keys for the edge
cases that arose with a varying number of hash dimensions in a range's
hash schema.  The implementation of the PartitionKey comparison operator
might change if we switch to a single string under the hood, but at this
point I decided to keep the parts separate.  For the sake of keeping the
code future-proof and easier to review, I'm considering switching to
string views (or Slice) for the range_key() and hash_key() methods in a
follow-up changelist, regardless of how the serialized partition key is
represented under the hood in PartitionKey.
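A hedged sketch of that idea follows (a hypothetical stand-in, not the
actual PartitionKey implementation): with a fixed number of hash
dimensions per table, comparing the hash part first and the range part
second yields the same ordering as the old concatenated representation.

#include <iostream>
#include <string>
#include <tuple>

// Hypothetical stand-in for PartitionKey: the encoded hash part and the
// encoded range part are kept as two separate strings under the hood.
struct PartitionKeySketch {
  std::string hash_key;   // encoded hash-bucket components
  std::string range_key;  // encoded range components

  // Lexicographic comparison of (hash part, range part); no concatenation
  // is needed because the hash part has the same size for every range
  // partition of the table.
  bool operator<(const PartitionKeySketch& other) const {
    return std::tie(hash_key, range_key) <
           std::tie(other.hash_key, other.range_key);
  }
};

int main() {
  PartitionKeySketch a{"h1", "2023-01-01"};
  PartitionKeySketch b{"h2", "2023-01-01"};
  std::cout << (a < b) << std::endl;  // prints 1: a's hash part sorts first
  return 0;
}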

Change-Id: Ic884fa556462b85c64d77385a521d9077d33c7c1
Reviewed-on: http://gerrit.cloudera.org:8080/18045
Tested-by: Alexey Serbin <aser...@cloudera.com>
Reviewed-by: Andrew Wong <aw...@cloudera.com>


> Change hash number for range partitioning
> -----------------------------------------
>
>                 Key: KUDU-2671
>                 URL: https://issues.apache.org/jira/browse/KUDU-2671
>             Project: Kudu
>          Issue Type: Improvement
>          Components: client, java, master, server
>    Affects Versions: 1.8.0
>            Reporter: yangz
>            Assignee: Mahesh Reddy
>            Priority: Major
>              Labels: feature, roadmap-candidate, scalability
>         Attachments: 屏幕快照 2019-01-24 下午12.03.41.png
>
>
> For our usage, the Kudu schema design isn't flexible enough.
> We create our tables with day-range partitions such as dt='20181112', like 
> Hive tables. But our data size changes a lot from day to day: one day it 
> will be 50 GB, while another day it will be 500 GB. In this case, it is 
> hard to choose the hash schema. If the hash number is too big, in most 
> cases it is wasteful; if it is too small, there is a performance problem 
> when the amount of data is large.
>  
> So we suggest a solution where we can change the hash number based on a 
> table's historical data.
> For example:
>  # we create the schema with an estimated value.
>  # we collect the data size for each day range.
>  # we create the new day-range partition based on the collected daily size.
> We have used this feature for half a year, and it works well. We hope this 
> feature will be useful to the community. Maybe the solution isn't complete. 
> Please help us make it better.
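As a rough illustration of the workflow described above (an entirely
hypothetical sizing helper, not part of Kudu): the hash bucket count for
the next day-range partition could be derived from the data size
observed on recent comparable days, e.g. one bucket per ~25 GiB with a
minimum of two.

#include <cstdint>
#include <iostream>

// Hypothetical sizing helper: derive the hash bucket count for a new
// day-range partition from the bytes ingested on a recent comparable day.
int32_t BucketsForNewRange(int64_t observed_bytes_per_day) {
  constexpr int64_t kBytesPerBucket = 25LL * 1024 * 1024 * 1024;  // ~25 GiB
  const int64_t buckets =
      (observed_bytes_per_day + kBytesPerBucket - 1) / kBytesPerBucket;
  return buckets < 2 ? 2 : static_cast<int32_t>(buckets);  // at least 2 buckets
}

int main() {
  std::cout << BucketsForNewRange(50LL << 30) << std::endl;   // 50 GiB day: 2
  std::cout << BucketsForNewRange(500LL << 30) << std::endl;  // 500 GiB day: 20
  return 0;
}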


