[ 
https://issues.apache.org/jira/browse/IMPALA-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955678#comment-16955678
 ] 

Mauricio Aristizabal commented on IMPALA-3741:
----------------------------------------------

This just became a huge blocker for Kudu adoption for us, and I'm worried that 
this ticket hasn't had any movement in over a year.

We just wanted to move our aggregation/cube tables to Kudu and have them be 
true aggregations: one record per combination of dimension columns, updated 
in place as the measures increase (as opposed to additive aggregations in 
Parquet, which we routinely re-aggregate/compact, a process that is very 
resource intensive and hard to manage).

The problem is that to update a 1 billion record table with just 10K arriving 
changes, the min-max filter in the join between the target agg table and the 
table with the updates is pretty useless. The updates will typically be for 
just a handful of the dimensions, but their values are not nicely consecutive 
or even close; they are all over the place. So the scan ends up reading most 
of the big table, and it gets slower and slower as the table grows.
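
To illustrate why scattered keys defeat a min-max filter, here is a rough 
Python sketch. It is only a toy model under stated assumptions, not how Impala 
or Kudu implement runtime filters: the BloomFilter class, the compare_filters 
helper, the table size, and the filter sizing are all hypothetical and scaled 
down. It compares the fraction of target-table keys that pass a [min, max] 
range check with the fraction that pass a Bloom filter built from the same 
randomly scattered update keys.

```python
import hashlib
import random


class BloomFilter:
    """Toy Bloom filter (hypothetical; not Impala's or Kudu's implementation)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive all probe positions from one digest via double hashing.
        digest = hashlib.blake2b(str(key).encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force a nonzero step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def compare_filters(table_size=200_000, num_updates=100, seed=42):
    """Return (min-max pass fraction, Bloom pass fraction) for scattered keys."""
    rng = random.Random(seed)
    # Assumption: update keys are spread uniformly over the table's key space.
    update_keys = rng.sample(range(table_size), num_updates)

    lo, hi = min(update_keys), max(update_keys)
    bloom = BloomFilter(num_bits=num_updates * 16, num_hashes=4)
    for key in update_keys:
        bloom.add(key)

    # Count how many target-table keys each filter would let through the scan.
    minmax_pass = sum(1 for key in range(table_size) if lo <= key <= hi)
    bloom_pass = sum(1 for key in range(table_size) if bloom.might_contain(key))
    return minmax_pass / table_size, bloom_pass / table_size


if __name__ == "__main__":
    minmax_frac, bloom_frac = compare_filters()
    print(f"min-max filter passes {minmax_frac:.1%} of the table")
    print(f"bloom filter passes   {bloom_frac:.1%} of the table")
```

With uniformly scattered keys, the min-max range should cover nearly the whole 
key space, while the Bloom filter should pass only the matching rows plus a 
small share of false positives.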

So we'll have to hold off on adopting Kudu until this (and support in Kudu) is 
added, or until we switch ETL to programmatically mutate the records 
individually with the Kudu Java client (perhaps in a Spark RDD).

Please prioritize this, otherwise Kudu is good only for end-user queries with 
highly selective filters and joins, and doesn't really support ETL or 
large-scale analysis via SQL.

> Push bloom filters to Kudu scanners
> -----------------------------------
>
>                 Key: IMPALA-3741
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3741
>             Project: IMPALA
>          Issue Type: Task
>          Components: Backend
>    Affects Versions: Kudu_Impala
>            Reporter: Matthew Jacobs
>            Priority: Major
>              Labels: kudu, performance
>
> Impala relies on bloom filters to reduce the number of rows coming out of 
> the scan node for selective joins. 
> Queries get up to a 20x speedup; not having bloom filter support in Kudu will 
> create a big performance gap between Parquet and Kudu.
> https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/util/bloom-filter.h


