Hi Boris,

The two examples you gave are exactly equivalent; the relative ordering of hash levels has no effect on query performance, hotspotting, or anything else.

Given that 40% of your queries don't specify a customer_id, it does make sense to use hash(shop_id), hash(customer_id) instead of combining them in a single hash level as hash(shop_id, customer_id), since a single combined level can only be pruned when a query specifies both columns. The trade-off is that the hotspotting resistance isn't as good. If the shop_id and customer_id columns aren't skewed to begin with, that's not a concern, though.
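To make the pruning difference concrete, here is a toy Python sketch. It is not Kudu's actual implementation: the bucket counts, the `bucket` helper, and the use of Python's built-in `hash` are all assumptions made purely for illustration. It counts how many tablets a scan would have to touch under the two-level scheme versus a single combined hash level.

```python
from itertools import product

SHOP_BUCKETS = 4      # assumed bucket count for hash(shop_id)
CUSTOMER_BUCKETS = 8  # assumed bucket count for hash(customer_id)


def bucket(value, num_buckets):
    # Toy stand-in for the real hash function (assumption, not Kudu's hash).
    return hash(value) % num_buckets


def tablets_two_level(shop_id=None, customer_id=None):
    """Tablets to scan with hash(shop_id), hash(customer_id).

    Each level is pruned independently: an equality predicate on a
    column narrows that level to one bucket, otherwise all of that
    level's buckets remain candidates.
    """
    shop = ([bucket(shop_id, SHOP_BUCKETS)]
            if shop_id is not None else range(SHOP_BUCKETS))
    cust = ([bucket(customer_id, CUSTOMER_BUCKETS)]
            if customer_id is not None else range(CUSTOMER_BUCKETS))
    return set(product(shop, cust))


def tablets_combined(shop_id=None, customer_id=None):
    """Tablets to scan with a single level hash(shop_id, customer_id).

    The bucket can only be computed when both columns are known, so a
    predicate on shop_id alone cannot prune anything.
    """
    total = SHOP_BUCKETS * CUSTOMER_BUCKETS
    if shop_id is not None and customer_id is not None:
        return {bucket((shop_id, customer_id), total)}
    return set(range(total))


both = tablets_two_level(shop_id=17, customer_id=42)
shop_only = tablets_two_level(shop_id=17)
combined_shop_only = tablets_combined(shop_id=17)

print(len(both), len(shop_only), len(combined_shop_only))  # 1 8 32
```

With these made-up bucket counts, a shop_id-only query touches 8 of 32 tablets under the two-level scheme but all 32 under the combined hash. Swapping the order of the two levels only swaps the positions within each (shop_bucket, customer_bucket) pair, so the number of tablets scanned stays the same.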
- Dan

On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> Hi guys,
>
> Read this doc
> https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning
> and I have a question on this particular statement
> "Scans on multilevel partitioned tables can take advantage of partition
> pruning on any of the levels independently"
>
> Does it mean, that both strategies below would be equivalent in terms of
> performance (i.e. minimum scans)
>
> partition by hash(shop_id), hash(customer_id)
> vs.
> partition by hash(customer_id), hash(shop_id)
>
> 60% of the queries are using both shop_id and customer_id but 40% of
> queries need to pull all customers for a specific shop_id. And almost never
> by customer_id alone (customer_id is not unique across shops and is
> assigned per shop).
>
> At the same time, if I partition by customer_id first, partitions will be
> distributed more evenly.
>
> Thanks!
> Boris