[jira] [Work started] (IMPALA-8039) Incorrect selectivity estimate for not-equals predicate
[ https://issues.apache.org/jira/browse/IMPALA-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on IMPALA-8039 started by Paul Rogers.
---
> Incorrect selectivity estimate for not-equals predicate
> ---
>
> Key: IMPALA-8039
> URL: https://issues.apache.org/jira/browse/IMPALA-8039
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> Suppose we write a query that uses the not-equals predicate:
> {code:sql}
> select *
> from functional.alltypestiny
> where id != 10
> {code}
> How many rows will we get? Let's reason it out. Suppose we do this:
> {code:sql}
> select *
> from functional.alltypestiny
> where id = 10
> {code}
> We know that {{id}} is unique and the table has 8 rows. So, in the second
> query, we'll get only one row: where {{id = 10}}. Using this, we can see that
> the first query will return all the rows that the second one did not, that is
> {{8 - 1 = 7}}.
> Let's see what the planner says:
> {noformat}
> PLAN-ROOT SINK
> | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> 00:SCAN HDFS [functional.alltypestiny]
>    partitions=4/4 files=4 size=460B
>    predicates: id != CAST(10 AS INT)
>    tuple-ids=0 row-size=89B cardinality=1
> {noformat}
> So, the planner says that both equality and inequality give the same number
> of rows. Clearly, this is wrong. It is, in fact, a symptom of the fact that
> Impala does not attempt to calculate selectivity for predicates other than
> equality (IMPALA-7601).
> The correct selectivity estimate for inequality is:
> {noformat}
> sel(c != x) = 1 - 1/ndv(c)
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
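The selectivity formula quoted in IMPALA-8039 can be checked against the 8-row example with a few lines of arithmetic. The following is a minimal sketch, not Impala's planner code: the function names are hypothetical, and it assumes values are uniformly distributed and that NULLs can be ignored.

```python
def equals_selectivity(ndv: int) -> float:
    """sel(c = x) = 1/ndv(c), assuming uniformly distributed values."""
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    return 1.0 / ndv


def not_equals_selectivity(ndv: int) -> float:
    """sel(c != x) = 1 - 1/ndv(c), the formula given in the issue."""
    return 1.0 - equals_selectivity(ndv)


def estimate_cardinality(row_count: int, selectivity: float) -> int:
    """Scale the input row count by the predicate's selectivity."""
    return round(row_count * selectivity)


# The example from the issue: 8 rows, and id is unique, so ndv(id) = 8.
rows = 8
ndv_id = 8
eq_rows = estimate_cardinality(rows, equals_selectivity(ndv_id))
ne_rows = estimate_cardinality(rows, not_equals_selectivity(ndv_id))
print(f"id = 10  -> estimated rows: {eq_rows}")
print(f"id != 10 -> estimated rows: {ne_rows}")
```

Under these assumptions the equality estimate is 1 row and the not-equals estimate is 7 rows, matching the {{8 - 1 = 7}} reasoning in the issue rather than the cardinality=1 shown in the plan.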
[jira] [Commented] (IMPALA-6590) Disable expr rewrites and codegen for VALUES() statements
[ https://issues.apache.org/jira/browse/IMPALA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764269#comment-16764269 ]

Paul Rogers commented on IMPALA-6590:
-

We are working on steps to improve expression rewrites. The goal is to merge rewrites into expression analysis to achieve a number of benefits, including avoiding the costly pattern-matching steps currently used in rewrites. This will also allow better type propagation and avoid the need to analyze, reset and re-analyze. See IMPALA-8041 for some of the changes.

> Disable expr rewrites and codegen for VALUES() statements
> -
>
> Key: IMPALA-6590
> URL: https://issues.apache.org/jira/browse/IMPALA-6590
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
> Reporter: Alexander Behm
> Priority: Major
> Labels: perf, planner, ramp-up, regression
>
> The analysis of statements with big VALUES clauses like INSERT INTO
> VALUES is slow due to expression rewrites like constant folding. The
> performance of such statements has regressed since the introduction of expr
> rewrites and constant folding in IMPALA-1788.
> We should skip expr rewrites for VALUES altogether since it mostly provides
> no benefit but can have a large overhead due to evaluation of expressions in
> the backend (constant folding). These expressions are ultimately evaluated
> and materialized in the backend anyway, so there's no point in folding them
> during analysis.
> Similarly, there is no point in doing codegen for these exprs in the backend
> union node.
> *Workaround*
> {code}
> SET ENABLE_EXPR_REWRITES=FALSE;
> SET DISABLE_CODEGEN=TRUE;
> {code}
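For anyone who wants to try the workaround from IMPALA-6590 on a large VALUES statement, the following is a minimal sketch that only builds the SQL text. The helper names are hypothetical and the column expressions are arbitrary placeholders; the two SET statements are the ones listed in the issue. The resulting script is an assumption about how one might feed it to a client, for example through a file passed to impala-shell.

```python
# Query options listed under "Workaround" in the issue description.
WORKAROUND_OPTIONS = [
    "SET ENABLE_EXPR_REWRITES=FALSE;",
    "SET DISABLE_CODEGEN=TRUE;",
]


def build_values_statement(num_columns: int) -> str:
    """Build a single-row VALUES statement with num_columns cast expressions
    followed by a trailing NULL column (placeholder expressions only)."""
    exprs = [f"cast({i} as string)" for i in range(1, num_columns + 1)]
    exprs.append("NULL")
    return "VALUES (" + ", ".join(exprs) + ");"


def build_script(num_columns: int, apply_workaround: bool = True) -> str:
    """Return a SQL script, optionally prefixed with the workaround options."""
    lines = list(WORKAROUND_OPTIONS) if apply_workaround else []
    lines.append(build_values_statement(num_columns))
    return "\n".join(lines)


if __name__ == "__main__":
    # Print a small script; larger column counts work the same way.
    print(build_script(3))
```

Generating the script with and without `apply_workaround` gives two otherwise identical inputs, which makes it easier to compare analysis times for the same statement.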
[jira] [Commented] (IMPALA-6590) Disable expr rewrites and codegen for VALUES() statements
[ https://issues.apache.org/jira/browse/IMPALA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764266#comment-16764266 ]

Philip Zeyliger commented on IMPALA-6590:
-

For purposes of reproduction, the following shows how non-linear we are in the number of columns in the VALUES statement:
{code}
$ for i in 256 512 1024 2048 4096 8192 16384 32768; do (echo 'VALUES ('; for x in $(seq $i); do echo "cast($x as string),"; done; echo "NULL); profile;") | time impala-shell.sh -f /dev/stdin |& grep Analysis; done
- Analysis finished: 35.027ms (34.359ms)
- Analysis finished: 76.808ms (75.678ms)
- Analysis finished: 188.936ms (186.829ms)
- Analysis finished: 499.325ms (494.968ms)
- Analysis finished: 1s606ms (1s598ms)
- Analysis finished: 6s663ms (6s647ms)
- Analysis finished: 29s844ms (29s812ms)
- Analysis finished: 2m37s (2m37s)
{code}
My ad-hoc jstacking suggests that, in addition to the calls into native code (made serially, and so possibly incurring a lot of JNI overhead), there is an issue in the stack below. Looking at the source, SelectStmt.java:291 is in a loop over every expression in the statement, and it ends up inserting each one into a List. So, the number of {{equals()}} calls is quadratic.
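One way to quantify the growth in the analysis timings reported above is to look at how the time changes each time the column count doubles. The sketch below is illustrative only: it hard-codes the first timing from each output line, converts them to milliseconds (the last value, 2m37s, has no sub-second precision), and computes log2 of each successive ratio. A value near 1 would indicate linear growth and a value near 2 quadratic growth.

```python
import math

# Column counts and "Analysis finished" timings from the comment above,
# converted to milliseconds.
sizes = [256, 512, 1024, 2048, 4096, 8192, 16384, 32768]
analysis_ms = [35.027, 76.808, 188.936, 499.325,
               1606.0, 6663.0, 29844.0, 157000.0]


def doubling_exponents(times):
    """For inputs that double at each step, log2(t[i+1] / t[i]) is the
    local growth exponent: about 1 for linear, about 2 for quadratic."""
    return [math.log2(later / earlier)
            for earlier, later in zip(times, times[1:])]


exponents = doubling_exponents(analysis_ms)
for n, exponent in zip(sizes[1:], exponents):
    print(f"{n:>6} columns: local growth exponent ~ {exponent:.2f}")
```

With these numbers every step has an exponent above 1, and the last three steps are above 2, which is at least consistent with quadratic (or worse) behavior at large column counts.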
{code}
"Thread-50" #70 prio=5 os_prio=0 tid=0x0b471000 nid=0x10cc runnable [0x7ff90190a000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.impala.analysis.SlotRef.localEquals(SlotRef.java:193)
	at org.apache.impala.analysis.SlotRef$1.matches(SlotRef.java:206)
	at org.apache.impala.analysis.Expr.matches(Expr.java:841)
	at org.apache.impala.analysis.Expr.equals(Expr.java:865)
	at org.apache.impala.analysis.ExprSubstitutionMap.get(ExprSubstitutionMap.java:67)
	at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyzeSelectClause(SelectStmt.java:291)
	at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:223)
	at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:207)
	at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:200)
	at org.apache.impala.analysis.UnionStmt$UnionOperand.analyze(UnionStmt.java:88)
	at org.apache.impala.analysis.UnionStmt.analyzeOperands(UnionStmt.java:280)
	at org.apache.impala.analysis.UnionStmt.analyze(UnionStmt.java:219)
	at org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:448)
	at org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:418)
	at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:1282)
	at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1249)
	at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1219)
	at org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:168)
{code}
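The stack trace above shows a map lookup (ExprSubstitutionMap.get) ending in Expr.equals, which fits the observation that the number of {{equals()}} calls is quadratic. The following is a minimal sketch of that pattern, not Impala's actual code: the classes are hypothetical stand-ins, and the sketch assumes a list-backed map whose lookup scans every stored entry.

```python
class CountingExpr:
    """Hypothetical stand-in for an expression; counts equality checks."""
    comparisons = 0

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: object) -> bool:
        CountingExpr.comparisons += 1
        return isinstance(other, CountingExpr) and self.name == other.name


class ListBackedMap:
    """Assumed shape of a list-backed substitution map: get() is a linear scan."""

    def __init__(self) -> None:
        self._keys = []
        self._values = []

    def get(self, key):
        for stored_key, value in zip(self._keys, self._values):
            if stored_key == key:  # one equality check per stored entry
                return value
        return None

    def put(self, key, value) -> None:
        self._keys.append(key)
        self._values.append(value)


def count_comparisons(num_exprs: int) -> int:
    """Look up and then insert num_exprs distinct expressions, returning the
    total number of equality checks performed."""
    CountingExpr.comparisons = 0
    smap = ListBackedMap()
    for i in range(num_exprs):
        expr = CountingExpr(f"col{i}")
        if smap.get(expr) is None:
            smap.put(expr, expr)
    return CountingExpr.comparisons


if __name__ == "__main__":
    for n in (256, 512, 1024):
        print(n, count_comparisons(n))
```

For n distinct expressions this performs n(n-1)/2 equality checks, so doubling the number of expressions roughly quadruples the work. A hash-based lookup would typically avoid the scan, at the cost of requiring a hash function consistent with equality.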
[jira] [Commented] (IMPALA-8178) Tests failing with “Memory is likely oversubscribed” on EC filesystem
[ https://issues.apache.org/jira/browse/IMPALA-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16764264#comment-16764264 ]

Andrew Sherman commented on IMPALA-8178:

Another failure occurred where the tests that failed were:
* query_test.test_decimal_queries.TestDecimalExprs.test_exprs
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges
* query_test.test_scanners_fuzz.TestScannersFuzzing.test_fuzz_uncompressed_parquet
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: avro/none]
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: rc/none]
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: seq/snap/block]
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: seq/gzip/block]

so the only test that failed in both cases was query_test.test_scanners.TestTpchScanRangeLengths

> Tests failing with “Memory is likely oversubscribed” on EC filesystem
> -
>
> Key: IMPALA-8178
> URL: https://issues.apache.org/jira/browse/IMPALA-8178
> Project: IMPALA
> Issue Type: Bug
> Reporter: Andrew Sherman
> Assignee: Andrew Sherman
> Priority: Major
>
> In tests run against an Erasure Coding filesystem, multiple tests failed with
> memory allocation errors.
> In total 10 tests failed:
> * query_test.test_scanners.TestParquet.test_decimal_encodings
> * query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges
> * query_test.test_exprs.TestExprs.test_exprs [enable_expr_rewrites: 0]
> * query_test.test_exprs.TestExprs.test_exprs [enable_expr_rewrites: 1]
> * query_test.test_hbase_queries.TestHBaseQueries.test_hbase_scan_node
> * query_test.test_scanners.TestParquet.test_def_levels
> * query_test.test_scanners.TestTextSplitDelimiters.test_text_split_across_buffers_delimiter
> * query_test.test_hbase_queries.TestHBaseQueries.test_hbase_filters
> * query_test.test_hbase_queries.TestHBaseQueries.test_hbase_inline_views
> * query_test.test_hbase_queries.TestHBaseQueries.test_hbase_top_n
> The first failure looked like this on the client side:
> {quote}
> F query_test/test_scanners.py::TestParquet::()::test_decimal_encodings[protocol: beeswax | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': True, 'abort_on_error': 1, 'debug_action': '-1:OPEN:SET_DENY_RESERVATION_PROBABILITY@0.5', 'exec_single_node_rows_threshold': 0} | table_format: parquet/none]
> query_test/test_scanners.py:717: in test_decimal_encodings
>     self.run_test_case('QueryTest/parquet-decimal-formats', vector, unique_database)
> common/impala_test_suite.py:472: in run_test_case
>     result = self.__execute_query(target_impalad_client, query, user=user)
> common/impala_test_suite.py:699: in __execute_query
>     return impalad_client.execute(query, user=user)
> common/impala_connection.py:174: in execute
>     return self.__beeswax_client.execute(sql_stmt, user=user)
> beeswax/impala_beeswax.py:183: in execute
>     handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:360: in __execute_query
>     self.wait_for_finished(handle)
> beeswax/impala_beeswax.py:381: in wait_for_finished
>     raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> E    Query aborted:ExecQueryFInstances rpc query_id=6e44c3c949a31be2:f973c7ff failed: Failed to get minimum memory reservation of 8.00 KB on daemon xxx.com:22001 for query 6e44c3c949a31be2:f973c7ff due to following error: Memory limit exceeded: Could not allocate memory while trying to increase reservation.
> E   Query(6e44c3c949a31be2:f973c7ff) could not allocate 8.00 KB without exceeding limit.
> E   Error occurred on backend xxx.com:22001
> E   Memory left in process limit: 1.19 GB
> E   Query(6e44c3c949a31be2:f973c7ff): Reservation=0 ReservationLimit=9.60 GB OtherMemory=0 Total=0 Peak=0
> E   Memory is likely oversubscribed. Reducing query concurrency or configuring admission control may help avoid this error.
> {quote}
> On the server side log:
> {quote}
> I0207 18:25:19.329311 5562 impala-server.cc:1063] 6e44c3c949a31be2:f973c7ff] Registered query query_id=6e44c3c949a31be2:f973c7ff session_id=93497065f69e9d01:8a3bd06faff3da5
> I0207 18:25:19.329434 5562 Frontend.java:1242] 6e44c3c949a31be2:f973c7ff] Analyzing query: select score from decimal_stored_as_int32
> I0207 18:25:19.329583 5562 FeSupport.java:285] 6e44c3c949a31be2:f973c7ff] Requesting prioritized load of table(s): test_decimal_enc