[ https://issues.apache.org/jira/browse/IMPALA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619827#comment-16619827 ]
Paul Rogers edited comment on IMPALA-7560 at 9/19/18 1:34 AM:
--------------------------------------------------------------

FWIW, it turns out that Apache Drill did a similar analysis to work out rules based on the classic defaults plus some reasoning about probability: DRILL-5254

For Drill, since only the "classic" estimates (not stats) are available, the probabilities don't work out because of the conditional probability implied when a user selects one operator vs. another. But, the math reasoning might be used for this ticket if we do have stats to work with.

was (Author: paul.rogers):
Turns out that Apache Drill did a similar analysis to work out rules based on the classic defaults plus some reasoning about probability: DRILL-5254

For Drill, since only the "classic" estimates (not stats) are available, the probabilities don't work out because of the conditional probability of a user using one operator vs. another. But, the math reasoning might be used here if we do have stats to work with.

> Better selectivity estimate for != (not equals) binary predicate
> ----------------------------------------------------------------
>
>                 Key: IMPALA-7560
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7560
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: bharath v
>            Priority: Major
>
> Currently we use the default selectivity estimate for any binary predicate
> with op other than EQ / NOT_DISTINCT.
> {noformat}
> // Determine selectivity
> // TODO: Compute selectivity for nested predicates.
> // TODO: Improve estimation using histograms.
> Reference<SlotRef> slotRefRef = new Reference<SlotRef>();
> if ((op_ == Operator.EQ || op_ == Operator.NOT_DISTINCT)
>     && isSingleColumnPredicate(slotRefRef, null)) {
>   long distinctValues = slotRefRef.getRef().getNumDistinctValues();
>   if (distinctValues > 0) {
>     selectivity_ = 1.0 / distinctValues;
>     selectivity_ = Math.max(0, Math.min(1, selectivity_));
>   }
> }
> {noformat}
> This can give very conservative estimates. For example (3 rows estimated
> vs. 20 actual):
> {noformat}
> [localhost:21000] tpch> select * from nation where n_regionkey != 1;
> [localhost:21000] tpch> summary;
> +--------------+--------+----------+----------+-------+------------+-----------+---------------+-------------+
> | Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail      |
> +--------------+--------+----------+----------+-------+------------+-----------+---------------+-------------+
> | 00:SCAN HDFS | 1      | 3.32ms   | 3.32ms   | 20    | 3          | 143.00 KB | 16.00 MB      | tpch.nation |
> +--------------+--------+----------+----------+-------+------------+-----------+---------------+-------------+
> [localhost:21000] tpch>
> {noformat}
> Ideally we could have inverted the selectivity to 4/5 (= 1 - 1/5), which
> would give a better estimate.
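
The inversion suggested in the issue description can be sketched roughly as follows. This is only an illustration, not Impala's code: the class `SelectivitySketch`, the `Op` enum, the `estimate` method, and the `DEFAULT_SELECTIVITY` value are all hypothetical names chosen for the example. It also assumes values are uniformly distributed across the distinct values and ignores NULLs, which a real estimate would likely need to account for.

```java
// Hypothetical sketch, not Impala's actual implementation.
// All names and the fallback value below are assumptions for illustration.
final class SelectivitySketch {
  enum Op { EQ, NOT_DISTINCT, NE, OTHER }

  // Assumed placeholder used when no usable column stats exist.
  static final double DEFAULT_SELECTIVITY = 0.1;

  // Estimates the fraction of rows that pass "col <op> constant",
  // given the column's number of distinct values (NDV).
  static double estimate(Op op, long distinctValues) {
    if (distinctValues <= 0) return DEFAULT_SELECTIVITY;
    // Equality selectivity under a uniform-distribution assumption.
    double eq = Math.max(0.0, Math.min(1.0, 1.0 / distinctValues));
    switch (op) {
      case EQ:
      case NOT_DISTINCT:
        return eq;
      case NE:
        // "Not equals" is the complement of "equals" (NULLs ignored here).
        return 1.0 - eq;
      default:
        return DEFAULT_SELECTIVITY;
    }
  }

  public static void main(String[] args) {
    // TPC-H nation has 25 rows; n_regionkey has 5 distinct values.
    long rows = 25;
    double sel = estimate(Op.NE, 5);
    System.out.println(Math.round(rows * sel)); // prints 20
  }
}
```

With 5 distinct values the sketch yields a selectivity of 4/5, so the estimate for the 25-row nation table becomes 20 rows, matching the actual row count in the summary output above.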