Hi Boris,

The two examples you gave are exactly equivalent; the relative ordering of
hash levels has no effect on query performance, hotspotting, or anything
else.  Given that 40% of your queries don't specify a customer_id, it does
make sense to use hash(shop_id), hash(customer_id) instead of combining
them in a single hash level as hash(shop_id, customer_id): with separate
levels, a scan that filters only on shop_id can still be pruned on the
shop_id level.  The trade-off is that the hotspotting resistance isn't as
good.  If the shop_id and customer_id columns aren't skewed to begin with,
that's not a concern, though.
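
To make the pruning difference concrete, here is a rough Python sketch that
counts how many partitions a scan has to touch under each scheme.  It is only
an illustration: zlib.crc32 is a stand-in for Kudu's real hash function, the
bucket counts (4 x 4 vs. 16) are made up, and the function names are
hypothetical, not part of any Kudu API.

```python
import zlib


def bucket(values, num_buckets):
    """Map a tuple of column values to a bucket (stand-in for Kudu's hash)."""
    key = "|".join(str(v) for v in values).encode("utf-8")
    return zlib.crc32(key) % num_buckets


def two_level_partitions(shop_id, customer_id, shop_buckets, cust_buckets):
    """Partitions touched under hash(shop_id), hash(customer_id).

    Passing None for a column means the query has no equality predicate
    on it, so every bucket of that level must be scanned.
    """
    if shop_id is not None:
        shop_bs = [bucket((shop_id,), shop_buckets)]
    else:
        shop_bs = range(shop_buckets)
    if customer_id is not None:
        cust_bs = [bucket((customer_id,), cust_buckets)]
    else:
        cust_bs = range(cust_buckets)
    return {(s, c) for s in shop_bs for c in cust_bs}


def single_level_partitions(shop_id, customer_id, num_buckets):
    """Partitions touched under a single level hash(shop_id, customer_id).

    The hash can only be computed when both columns are known; otherwise
    no pruning is possible and all buckets are scanned.
    """
    if shop_id is not None and customer_id is not None:
        return {bucket((shop_id, customer_id), num_buckets)}
    return set(range(num_buckets))


# Both schemes have 16 partitions in total.
print(len(two_level_partitions(42, 7, 4, 4)))      # both columns given
print(len(two_level_partitions(42, None, 4, 4)))   # shop_id only
print(len(single_level_partitions(42, 7, 16)))     # both columns given
print(len(single_level_partitions(42, None, 16)))  # shop_id only
```

With both columns specified, either scheme narrows the scan to a single
partition.  With only shop_id, the two-level scheme still prunes down to one
shop_id bucket (4 of 16 partitions here), while the combined hash has to scan
all 16.  Swapping the order of the two levels only relabels the (s, c) pairs,
so the counts stay the same.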

- Dan

On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> Hi guys,
> Read this doc
> https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning
> and I have a question on this particular statement
> "Scans on multilevel partitioned tables can take advantage of partition
> pruning on any of the levels independently"
>
> Does it mean, that both strategies below would be equivalent in terms of
> performance (i.e. minimum scans)
>
> partition by hash(shop_id), hash(customer_id)
> vs.
> partition by hash(customer_id), hash(shop_id)
>
> 60% of the queries are using both shop_id and customer_id but 40% of
> queries need to pull all customers for a specific shop_id. And almost never
> by customer_id alone (customer_id is not unique across shops and is
> assigned per shop).
>
> At the same time, if I partition by customer_id first, partitions will be
> distributed more evenly.
>
> Thanks!
> Boris
>
>
>
>
