Re: A question about a potential bug in Druid Joins

Jason Chen Fri, 25 Jun 2021 07:26:25 -0700

Hey Gian,

Thank you very much for your reply.

> "but I wonder if it might still be better-performing to do the + 0. Long 
> typed keys might be faster than string typed keys."

When we are building our models, we didn’t consider the types of columns from 
the performance aspect. We just used the original types from the Kafka schema 
(the datasource is ingested from Kafka). Usually for dimension columns, we can 
store the fields as `string`, `long`, `double`, `float`, or `arrays of 
strings`. For metrics, we will store the columns as `float` or `long`. In a 
query, we will do aggregation or filtering to dimensions. Are you referring to 
different performance of different types when doing group by or filtering, e.g. 
group by / filtering on `long` typed keys performs better than `string` typed 
keys?

Apart from group by / filtering, we will also do Druid Joins on various types 
of keys. We have long-to-long matching, string-to-long matching, and 
long-to-string matching. Will the performance be differentiated when joining 
with different types on the left and right side? Will the join perform better 
if both sides are long or string?

We also tried to `CAST` the `long` key in the left-hand-side to `varchar` in 
query time (in a post-aggregation expression, before doing join) to avoid 
triggering the bug, and it works. Compared with the `api_client_id + 0` 
solution, will the `CAST` solution perform better or worse?

Here is the query with `CAST`?
> quote_type
> > quote_type
> > WITH DIM AS (
> >   SELECT api_client_id, title
> >   FROM api_clients_inline_dimension_20210619 AS API_CLIENTS
> > ),
> > FACTS AS (
> >   SELECT CAST(api_client_id as VARCHAR) as "c_api_client_id", COUNT(*) as 
> > api_client_count
> >   FROM inline_data AS ORDERS
> >   WHERE ORDERS.__time >= TIMESTAMP '2021-06-10 00:00:00' AND ORDERS.__time 
> > < TIMESTAMP '2021-06-18 00:00:00' AND ORDERS.shop_id = 25248974
> >   GROUP BY 1
> > )
> > SELECT DIM.title, FACTS.c_api_client_id, FACTS.api_client_count
> > FROM FACTS
> > LEFT JOIN DIM ON FACTS.c_api_client_id = DIM.api_client_id

The native query for `api_client_id + 0`: 
https://gist.github.com/drinkbeer/b9010a4e4c95ba470e25fe9ec38a4a9d
The native query for `CAST`: 
https://gist.github.com/drinkbeer/13c13e934937fd6d1e6427677f08fcd2

Regards,
Jianbin Chen
On Jun 24, 2021, 8:01 PM -0400, dev@druid.apache.org, wrote:
>
> but I wonder if it might still be
> better-performing to do the + 0. Long typed keys might be faster than
> string typed keys.

Re: A question about a potential bug in Druid Joins

Reply via email to