At work we have two very large tables stored in Kudu (since we need to be able
to perform row updates efficiently).
One of them (called table33 in the example SQL below) contains two columns
called 'join_column1' and 'join_column2' that are used to join it with the
other table (called table14 below).
A typical query would look like this one:
SELECT STRAIGHT_JOIN
d.pk_date1_column,
d.columnA,
d.columnB,
d.columnC
FROM schema.table14 d
JOIN (
SELECT join_column1, join_column2 FROM schema.table33 dt
WHERE dt.pk_column = 1234567890
AND dt.pk_date2_column >= date_trunc('second',to_timestamp('06-10-2019
18:13:48', 'MM-dd-yyyy HH:mm:ss'))
AND dt.pk_date2_column <=
seconds_add(date_trunc('second',to_timestamp('06-10-2019 18:13:48', 'MM-dd-yyyy
HH:mm:ss')),1)
AND dt.column3 = 'SOME_VALUE_FOR_COLUMN_3'
AND dt.column4 = 'SOME_VALUE_FOR_COLUMN_4') AS dtdt
WHERE dtdt.join_column1 = d.join_column1
AND dtdt.join_column2 = d.join_column2
AND d.pk_date1_column >=
to_date(date_trunc('month',months_sub(to_timestamp('06-10-2019 18:13:48',
'MM-dd-yyyy HH:mm:ss'), 3)))
AND d.pk_date1_column <=
to_date(date_trunc('month',months_add(to_timestamp('06-10-2019 18:13:48',
'MM-dd-yyyy HH:mm:ss'), 6)))
ORDER BY d.pk_date1_column, d.date2_column;
The idea here is to use a very narrow subselect (using the columns 'pk_column',
'column3', and 'column4' from table33) to just select very few rows (a dozen or
so), and then join with the other table (table14) to return the values of
'columnA', 'columnB', and 'columnC' for these few rows - if we 'knew' the
values of the join columns for these two rows in advance, the whole join could
be easily be replaced by an IN-list.
Some other useful information:
- the cardinality of table14 is about 14 billion rows
- the cardinality of table33 is about 33 billion rows
- the purpose of the 'STRAIGHT_JOIN' hint is to avoid to have the Impala
optimizer reorder the join (and hence get rid of any 'COMPUTE STATS' discussion)
- the SQL query above is roughly modeled after the example here:
https://impala.apache.org/docs/build/html/topics/impala_perf_joins.html
(section: 'Overriding Join Reordering with STRAIGHT_JOIN')
- the version of Impala we are using is 2.12 and the version of Kudu we are
using is 1.7.0 (this is what comes with our vendor distribution)
Running EXPLAIN on the example query above shows the following plan:
PLAN-ROOT SINK
|
05:MERGING-EXCHANGE [UNPARTITIONED]
| order by: pk_date1_column ASC, date2_column ASC
|
03:SORT
| order by: pk_date1_column ASC, date2_column ASC
|
02:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: d.join_column2 = join_column2, d.join_column1 = join_column1
| runtime filters: RF002 <- join_column2, RF003 <- join_column1
|
|--04:EXCHANGE [BROADCAST]
| |
| 01:SCAN KUDU [schema.table33 dt]
| kudu predicates: dt.pk_column = 1234567890, dt.column3 =
'SOME_VALUE_FOR_COLUMN_3', dt.pk_date2_column <= TIMESTAMP '2019-06-10
18:13:49', dt.pk_date2_column >= TIMESTAMP '2019-06-10 18:13:48', dt.column4 =
'SOME_VALUE_FOR_COLUMN_4'
|
00:SCAN KUDU [schema.table14 d]
kudu predicates: d.pk_date1_column <= TIMESTAMP '2019-12-01 00:00:00',
d.pk_date1_column >= TIMESTAMP '2019-03-01 00:00:00'
runtime filters: RF002 -> d.join_column2, RF003 -> d.join_column1
As you can see above, step '01:SCAN KUDU' uses the kudu predicates to filter
down the results from table33 (and when the query runs, the query profile shows
it takes only a few ms) and returns a dozen or so rows, however step '00:SCAN
KUDU' doesn't use any of the results from the other part of the join and goes
through billions of rows in table14 (taking about 10 minutes or more on our
cluster) to finally return the columns requested in the main select.
On the other hand, if we manually run the very narrow subselect, cut&paste the
join_column values for the few returned rows and replace the join with an
IN-list, Impala takes advantage of this information and returns the requested
columns in sub-second time.
This (joining two very large table to extract only a few rows) seems a very
common use case for us and we were surprised to see this significant
performance impact when using Impala on Kudu tables.
Going through the Impala JIRA we found a couple of interesting pieces of
information:
- a comment by Todd Lipcon
(https://issues.apache.org/jira/browse/IMPALA-4252?focusedCommentId=15921058&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15921058)
back in October 2016, suggesting the use of IN-lists, but it doesn't seem to
have had any reply
- a JIRA task about pushing Bloom filters to Kudu scanners
(https://issues.apache.org/jira/browse/IMPALA-3741), created in June 2016, but
it looks like it is still unresolved.
I was wondering if any of you had to face a similar issue joining two very
large Kudu tables, and if you found an efficient way with Impala to run this
kind of query (short of having to re-design those two tables, which would be a
non-trivial task for us, given the table sizes, and given the number of other
code changes it would require in all the other applications that populate and
query those tables).
We could also tell all our end users that they can no longer have this kind of
two table joins (which they have been used to for years in our current RDBMS),
and that instead they have to run two selects and then add some code for the
IN-list logic, but we were hoping first to be able to find an elegant SQL-only
solution to this problem.
Thanks in advance,
Franco Venturi