Joining two very large Kudu tables in Impala

Franco VENTURI Sat, 28 Sep 2019 13:46:44 -0700

At work we have two very large tables stored in Kudu (since we need to be able 
to perform row updates efficiently).
One of them (called table33 in the example SQL below) contains two columns 
called 'join_column1' and 'join_column2' that are used to join it with the 
other table (called table14 below).



A typical query would look like this one:


SELECT STRAIGHT_JOIN
       d.pk_date1_column,
       d.columnA,
       d.columnB,
       d.columnC
  FROM schema.table14 d
  JOIN (
SELECT join_column1, join_column2 FROM schema.table33 dt
WHERE dt.pk_column = 1234567890
   AND dt.pk_date2_column >= date_trunc('second',to_timestamp('06-10-2019 
18:13:48', 'MM-dd-yyyy HH:mm:ss'))
   AND dt.pk_date2_column <= 
seconds_add(date_trunc('second',to_timestamp('06-10-2019 18:13:48', 'MM-dd-yyyy 
HH:mm:ss')),1)
   AND dt.column3 = 'SOME_VALUE_FOR_COLUMN_3'
   AND dt.column4 = 'SOME_VALUE_FOR_COLUMN_4') AS dtdt
WHERE dtdt.join_column1 = d.join_column1
   AND dtdt.join_column2 = d.join_column2
   AND d.pk_date1_column >= 
to_date(date_trunc('month',months_sub(to_timestamp('06-10-2019 18:13:48', 
'MM-dd-yyyy HH:mm:ss'), 3)))
   AND d.pk_date1_column <= 
to_date(date_trunc('month',months_add(to_timestamp('06-10-2019 18:13:48', 
'MM-dd-yyyy HH:mm:ss'), 6)))
ORDER BY d.pk_date1_column, d.date2_column;


The idea here is to use a very narrow subselect (using the columns 'pk_column', 
'column3', and 'column4' from table33) to just select very few rows (a dozen or 
so), and then join with the other table (table14) to return the values of 
'columnA', 'columnB', and 'columnC' for these few rows - if we 'knew' the 
values of the join columns for these two rows in advance, the whole join could 
be easily be replaced by an IN-list.


Some other useful information:
  - the cardinality of table14 is about 14 billion rows
  - the cardinality of table33 is about 33 billion rows
  - the purpose of the 'STRAIGHT_JOIN' hint is to avoid to have the Impala 
optimizer reorder the join (and hence get rid of any 'COMPUTE STATS' discussion)
  - the SQL query above is roughly modeled after the example here: 
https://impala.apache.org/docs/build/html/topics/impala_perf_joins.html 
(section: 'Overriding Join Reordering with STRAIGHT_JOIN')
  - the version of Impala we are using is 2.12 and the version of Kudu we are 
using is 1.7.0 (this is what comes with our vendor distribution)


Running EXPLAIN on the example query above shows the following plan:


PLAN-ROOT SINK
|
05:MERGING-EXCHANGE [UNPARTITIONED]
|  order by: pk_date1_column ASC, date2_column ASC
|
03:SORT
|  order by: pk_date1_column ASC, date2_column ASC
|
02:HASH JOIN [INNER JOIN, BROADCAST]
|  hash predicates: d.join_column2 = join_column2, d.join_column1 = join_column1
|  runtime filters: RF002 <- join_column2, RF003 <- join_column1
|
|--04:EXCHANGE [BROADCAST]
|  |
|  01:SCAN KUDU [schema.table33 dt]
|     kudu predicates: dt.pk_column = 1234567890, dt.column3 = 
'SOME_VALUE_FOR_COLUMN_3', dt.pk_date2_column <= TIMESTAMP '2019-06-10 
18:13:49', dt.pk_date2_column >= TIMESTAMP '2019-06-10 18:13:48', dt.column4 = 
'SOME_VALUE_FOR_COLUMN_4'
|
00:SCAN KUDU [schema.table14 d]
   kudu predicates: d.pk_date1_column <= TIMESTAMP '2019-12-01 00:00:00', 
d.pk_date1_column >= TIMESTAMP '2019-03-01 00:00:00'
   runtime filters: RF002 -> d.join_column2, RF003 -> d.join_column1


As you can see above, step '01:SCAN KUDU' uses the kudu predicates to filter 
down the results from table33 (and when the query runs, the query profile shows 
it takes only a few ms) and returns a dozen or so rows, however step '00:SCAN 
KUDU' doesn't use any of the results from the other part of the join and goes 
through billions of rows in table14 (taking about 10 minutes or more on our 
cluster) to finally return the columns requested in the main select.


On the other hand, if we manually run the very narrow subselect, cut&paste the 
join_column values for the few returned rows and replace the join with an 
IN-list, Impala takes advantage of this information and returns the requested 
columns in sub-second time.


This (joining two very large table to extract only a few rows) seems a very 
common use case for us and we were surprised to see this significant 
performance impact when using Impala on Kudu tables.


Going through the Impala JIRA we found a couple of interesting pieces of 
information:
  - a comment by Todd Lipcon 
(https://issues.apache.org/jira/browse/IMPALA-4252?focusedCommentId=15921058&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15921058)
 back in October 2016, suggesting the use of IN-lists, but it doesn't seem to 
have had any reply
  - a JIRA task about pushing Bloom filters to Kudu scanners 
(https://issues.apache.org/jira/browse/IMPALA-3741), created in June 2016, but 
it looks like it is still unresolved.


I was wondering if any of you had to face a similar issue joining two very 
large Kudu tables, and if you found an efficient way with Impala to run this 
kind of query (short of having to re-design those two tables, which would be a 
non-trivial task for us, given the table sizes, and given the number of other 
code changes it would require in all the other applications that populate and 
query those tables).


We could also tell all our end users that they can no longer have this kind of 
two table joins (which they have been used to for years in our current RDBMS), 
and that instead they have to run two selects and then add some code for the 
IN-list logic, but we were hoping first to be able to find an elegant SQL-only 
solution to this problem.


Thanks in advance,
Franco Venturi

Joining two very large Kudu tables in Impala

Reply via email to