[ https://issues.apache.org/jira/browse/IMPALA-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955678#comment-16955678 ]
Mauricio Aristizabal commented on IMPALA-3741:
----------------------------------------------

This just became a huge blocker for Kudu adoption for us, and I'm worried that this ticket hasn't had any movement in over a year.

We just wanted to move our aggregation/cube tables to Kudu and have them be true aggregations: one record per combination of dimension columns, updated in place as the measures increase (as opposed to additive aggregations in Parquet, which we routinely re-aggregate/compact, and which is very resource intensive and hard to manage).

The problem is that to update a 1 billion record table with just 10K arriving changes, the min-max filter in the join between the target agg table and the table with the updates is pretty useless. Yes, the updates will typically be for just a handful of the dimensions, but their values are not nicely consecutive or even close together; they are all over the place. So the join ends up scanning most of the big table, and it gets slower and slower as the table grows.

So we'll have to hold off on adopting Kudu until this (and support in Kudu) is added, or until we switch ETL to programmatically mutate the records individually with the Kudu Java client (perhaps in a Spark RDD).

Please prioritize this. Otherwise Kudu is good only for end-user queries with highly selective filters and joins, and doesn't really support ETL or large-scale analysis via SQL.

> Push bloom filters to Kudu scanners
> -----------------------------------
>
>                 Key: IMPALA-3741
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3741
>             Project: IMPALA
>          Issue Type: Task
>          Components: Backend
>   Affects Versions: Kudu_Impala
>            Reporter: Matthew Jacobs
>            Priority: Major
>              Labels: kudu, performance
>
> Impala relies on bloom filters to reduce number of rows from coming out of
> the scan node for selective joins.
> Queries get up to 20x speedup, not having bloom filter support in Kudu will
> create a big performance gap between Parquet and Kudu.
> https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/util/bloom-filter.h
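
To illustrate the selectivity argument in the comment, here is a minimal sketch comparing how many rows of a large table survive a min-max filter versus a bloom filter when the update keys are scattered. Everything in it is an assumption made for illustration: the table is a plain range of integer keys, the table and update sizes are scaled down, and `BloomFilter` and `compare_filters` are hypothetical names. The bloom filter is a textbook double-hashing one, not the block-based implementation in Impala's bloom-filter.h, and nothing here calls Impala or Kudu.

```python
import hashlib
import random


class BloomFilter:
    """Textbook bloom filter (illustrative only, not Impala's implementation)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from one digest via double hashing.
        digest = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def compare_filters(table_size=100_000, num_updates=100, seed=42):
    """Return (num_updates, rows passing min-max, rows passing bloom)."""
    rng = random.Random(seed)
    # Scattered update keys: drawn uniformly from the whole key space.
    update_keys = rng.sample(range(table_size), num_updates)

    # Min-max filter: keeps every key between the smallest and largest update key.
    lo, hi = min(update_keys), max(update_keys)
    minmax_rows = sum(1 for k in range(table_size) if lo <= k <= hi)

    # Bloom filter: keeps keys that are probably in the update set.
    bloom = BloomFilter(num_bits=num_updates * 16, num_hashes=4)
    for k in update_keys:
        bloom.add(k)
    bloom_rows = sum(1 for k in range(table_size) if bloom.might_contain(k))

    return num_updates, minmax_rows, bloom_rows


if __name__ == "__main__":
    updates, minmax_rows, bloom_rows = compare_filters()
    print(f"update keys:          {updates}")
    print(f"rows passing min-max: {minmax_rows}")
    print(f"rows passing bloom:   {bloom_rows}")
```

With scattered keys, the min-max range spans nearly the whole key space, so almost every row passes. The bloom filter passes the matching rows plus a small fraction of false positives, which is the gap the comment describes.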