[jira] [Work started] (IMPALA-8039) Incorrect selectivity estimate for not-equals predicate

2019-02-09 Thread Paul Rogers (JIRA)


 [ 
https://issues.apache.org/jira/browse/IMPALA-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-8039 started by Paul Rogers.
---
> Incorrect selectivity estimate for not-equals predicate
> ---
>
> Key: IMPALA-8039
> URL: https://issues.apache.org/jira/browse/IMPALA-8039
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> Suppose we write a query that uses the not-equals predicate:
> {code:sql}
> select *
> from functional.alltypestiny
> where id != 10
> {code}
> How many rows will we get? Let's reason it out. Suppose we do this:
> {code:sql}
> select *
> from functional.alltypestiny
> where id = 10
> {code}
> We know that {{id}} is unique and the table has 8 rows. So, in the second 
> query, we'll get only one row: where {{id = 10}}. Using this, we can see that 
> the first query will return all the rows that the second one did not, that is 
> {{8 - 1 = 7}}.
> Let's see what the planner says:
> {noformat}
> PLAN-ROOT SINK
> |  mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> 00:SCAN HDFS [functional.alltypestiny]
>    partitions=4/4 files=4 size=460B
>    predicates: id != CAST(10 AS INT)
>    tuple-ids=0 row-size=89B cardinality=1
> {noformat}
> So, the planner says that equality and inequality predicates both return the 
> same number of rows. Clearly, this is wrong. It is, in fact, a symptom of the 
> fact that Impala does not attempt to calculate selectivity for anything other 
> than equality (IMPALA-7601).
> The correct selectivity estimate for inequality is:
> {noformat}
> sel(c != x) = 1 - 1/ndv(c)
> {noformat}
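
As a worked check of the formula above, here is a minimal Python sketch applied to the example table. The function names are hypothetical and are not Impala's planner API. It assumes ndv(id) equals the row count (8), since the description says the column is unique, and it assumes values are uniformly distributed.

```python
def not_equals_selectivity(ndv: int) -> float:
    """Selectivity of `c != x`, assuming uniformly distributed values: 1 - 1/ndv(c)."""
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    return 1.0 - 1.0 / ndv


def estimated_cardinality(row_count: int, selectivity: float) -> int:
    """Scale the input row count by the selectivity and round to whole rows."""
    return round(row_count * selectivity)


# functional.alltypestiny: 8 rows, id unique, so ndv(id) is taken to be 8.
print(estimated_cardinality(8, not_equals_selectivity(8)))
```

For 8 rows and ndv = 8 this gives 7, matching the {{8 - 1 = 7}} reasoning in the description rather than the cardinality=1 the planner reports.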



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-6590) Disable expr rewrites and codegen for VALUES() statements

2019-02-09 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764269#comment-16764269
 ] 

Paul Rogers commented on IMPALA-6590:
-

We are working on steps to improve expression rewrites. The goal is to merge 
rewrites into expression analysis to achieve a number of benefits, including 
avoiding the costly pattern-matching steps currently used in rewrites. This 
will also allow better type propagation and avoid the need to analyze, reset, 
and re-analyze expressions. See IMPALA-8041 for some of the changes.

> Disable expr rewrites and codegen for VALUES() statements
> -
>
> Key: IMPALA-6590
> URL: https://issues.apache.org/jira/browse/IMPALA-6590
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.8.0, Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
> Reporter: Alexander Behm
> Priority: Major
> Labels: perf, planner, ramp-up, regression
>
> The analysis of statements with big VALUES clauses like INSERT INTO  
> VALUES is slow due to expression rewrites like constant folding. The 
> performance of such statements has regressed since the introduction of expr 
> rewrites and constant folding in IMPALA-1788.
> We should skip expr rewrites for VALUES altogether since it mostly provides 
> no benefit but can have a large overhead due to evaluation of expressions in 
> the backend (constant folding). These expressions are ultimately evaluated 
> and materialized in the backend anyway, so there's no point in folding them 
> during analysis.
> Similarly, there is no point in doing codegen for these exprs in the backend 
> union node.
> *Workaround*
> {code}
> SET ENABLE_EXPR_REWRITES=FALSE;
> SET DISABLE_CODEGEN=TRUE;
> {code}






[jira] [Commented] (IMPALA-6590) Disable expr rewrites and codegen for VALUES() statements

2019-02-09 Thread Philip Zeyliger (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764266#comment-16764266
 ] 

Philip Zeyliger commented on IMPALA-6590:
-

For purposes of reproduction, the following shows how non-linear analysis 
time is in the number of columns in the VALUES statement:
{code}
$ for i in 256 512 1024 2048 4096 8192 16384 32768; do (echo 'VALUES ('; for x in $(seq $i); do echo "cast($x as string),"; done; echo "NULL); profile;") | time impala-shell.sh -f /dev/stdin |& grep Analysis; done
   - Analysis finished: 35.027ms (34.359ms)
   - Analysis finished: 76.808ms (75.678ms)
   - Analysis finished: 188.936ms (186.829ms)
   - Analysis finished: 499.325ms (494.968ms)
   - Analysis finished: 1s606ms (1s598ms)
   - Analysis finished: 6s663ms (6s647ms)
   - Analysis finished: 29s844ms (29s812ms)
   - Analysis finished: 2m37s (2m37s)
{code}
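
To put a number on the non-linearity, a small Python sketch can compute the local growth exponent between consecutive runs. The timings are transcribed from the output above (the last one is only reported to the second, so 2m37s is taken as 157000 ms), and the column counts ignore the extra trailing NULL. The helper name is made up for illustration.

```python
import math

# (number of cast() columns, analysis time in ms), transcribed from the run above
timings = [
    (256, 35.027), (512, 76.808), (1024, 188.936), (2048, 499.325),
    (4096, 1606.0), (8192, 6663.0), (16384, 29844.0), (32768, 157000.0),
]


def growth_exponents(samples):
    """Return k for each consecutive pair, where time grows roughly as n**k."""
    exponents = []
    for (n0, t0), (n1, t1) in zip(samples, samples[1:]):
        exponents.append(math.log(t1 / t0) / math.log(n1 / n0))
    return exponents


for (n, _), k in zip(timings[1:], growth_exponents(timings)):
    print(f"{n:>6} columns: local exponent {k:.2f}")
```

A linear cost would give exponents near 1 and a quadratic one near 2. Here the exponent climbs with every doubling and is above 2 for the three largest inputs.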

My ad-hoc jstacking suggests that the stack below is a problem, in addition 
to the calls into the native code (made serially, and so possibly incurring a 
lot of JNI overhead). Looking at the source, SelectStmt.java:291 is in a loop 
over every expression in the statement, and it ends up inserting each one into 
a List. So, the number of {{equals()}} calls is quadratic.

{code}
"Thread-50" #70 prio=5 os_prio=0 tid=0x0b471000 nid=0x10cc runnable 
[0x7ff90190a000]
   java.lang.Thread.State: RUNNABLE
    at org.apache.impala.analysis.SlotRef.localEquals(SlotRef.java:193)
    at org.apache.impala.analysis.SlotRef$1.matches(SlotRef.java:206)
    at org.apache.impala.analysis.Expr.matches(Expr.java:841)
    at org.apache.impala.analysis.Expr.equals(Expr.java:865)
    at org.apache.impala.analysis.ExprSubstitutionMap.get(ExprSubstitutionMap.java:67)
    at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyzeSelectClause(SelectStmt.java:291)
    at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.analyze(SelectStmt.java:223)
    at org.apache.impala.analysis.SelectStmt$SelectAnalyzer.access$100(SelectStmt.java:207)
    at org.apache.impala.analysis.SelectStmt.analyze(SelectStmt.java:200)
    at org.apache.impala.analysis.UnionStmt$UnionOperand.analyze(UnionStmt.java:88)
    at org.apache.impala.analysis.UnionStmt.analyzeOperands(UnionStmt.java:280)
    at org.apache.impala.analysis.UnionStmt.analyze(UnionStmt.java:219)
    at org.apache.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:448)
    at org.apache.impala.analysis.AnalysisContext.analyzeAndAuthorize(AnalysisContext.java:418)
    at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:1282)
    at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1249)
    at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1219)
    at org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:168)
{code}
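
The quadratic behaviour described above can be modelled with a short Python sketch. This is only an illustration of the pattern (a lookup that scans a list with an equality check, called once per expression), not Impala's actual ExprSubstitutionMap code. All class and function names here are hypothetical.

```python
class CountingExpr:
    """Stand-in for an expression that counts how often it is compared."""
    comparisons = 0

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        CountingExpr.comparisons += 1
        return isinstance(other, CountingExpr) and self.name == other.name

    def __hash__(self):
        return hash(self.name)


class ListSubstitutionMap:
    """Lookup by linear scan over a list, one equality check per entry."""

    def __init__(self):
        self._lhs, self._rhs = [], []

    def get(self, expr):
        for i, candidate in enumerate(self._lhs):
            if candidate == expr:
                return self._rhs[i]
        return None

    def put(self, lhs, rhs):
        self._lhs.append(lhs)
        self._rhs.append(rhs)


def comparisons_for(num_exprs):
    """Look up, then insert, num_exprs distinct expressions; return equality-check count."""
    CountingExpr.comparisons = 0
    smap = ListSubstitutionMap()
    for i in range(num_exprs):
        expr = CountingExpr(f"col{i}")
        if smap.get(expr) is None:  # scans every entry inserted so far
            smap.put(expr, expr)
    return CountingExpr.comparisons


for n in (256, 512, 1024):
    print(n, comparisons_for(n))
```

With n distinct expressions the scan performs n(n-1)/2 comparisons in this model, so doubling the column count roughly quadruples the work. A hash-based lookup would avoid the scan, assuming the expressions have a suitable hash.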


[jira] [Commented] (IMPALA-8178) Tests failing with “Memory is likely oversubscribed” on EC filesystem

2019-02-09 Thread Andrew Sherman (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-8178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16764264#comment-16764264
 ] 

Andrew Sherman commented on IMPALA-8178:


Another failure occurred where the tests that failed were:
* query_test.test_decimal_queries.TestDecimalExprs.test_exprs
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges
* query_test.test_scanners_fuzz.TestScannersFuzzing.test_fuzz_uncompressed_parquet
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: avro/none]
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: rc/none]
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: seq/snap/block]
* query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: seq/gzip/block]

So the only test that failed in both cases was 
query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges.
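
The overlap between the two runs can be checked mechanically. The following Python sketch uses the test names from this comment and from the issue description, and strips the bracketed parameter suffix before comparing. The helper name is hypothetical.

```python
import re


def base_name(test_id: str) -> str:
    """Drop a trailing '[...]' parameter suffix and any surrounding whitespace."""
    return re.sub(r"\s*\[.*\]\s*$", "", test_id).strip()


# Failures listed in the issue description (first run).
first_run = [
    "query_test.test_scanners.TestParquet.test_decimal_encodings",
    "query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges",
    "query_test.test_exprs.TestExprs.test_exprs [enable_expr_rewrites: 0]",
    "query_test.test_exprs.TestExprs.test_exprs [enable_expr_rewrites: 1]",
    "query_test.test_hbase_queries.TestHBaseQueries.test_hbase_scan_node",
    "query_test.test_scanners.TestParquet.test_def_levels",
    "query_test.test_scanners.TestTextSplitDelimiters.test_text_split_across_buffers_delimiter",
    "query_test.test_hbase_queries.TestHBaseQueries.test_hbase_filters",
    "query_test.test_hbase_queries.TestHBaseQueries.test_hbase_inline_views",
    "query_test.test_hbase_queries.TestHBaseQueries.test_hbase_top_n",
]

# Failures listed in this comment (second run).
second_run = [
    "query_test.test_decimal_queries.TestDecimalExprs.test_exprs",
    "query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges",
    "query_test.test_scanners_fuzz.TestScannersFuzzing.test_fuzz_uncompressed_parquet",
    "query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: avro/none]",
    "query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: rc/none]",
    "query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: seq/snap/block]",
    "query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges[table_format: seq/gzip/block]",
]

common = {base_name(t) for t in first_run} & {base_name(t) for t in second_run}
print(sorted(common))
```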

> Tests failing with “Memory is likely oversubscribed” on EC filesystem
> -
>
> Key: IMPALA-8178
> URL: https://issues.apache.org/jira/browse/IMPALA-8178
> Project: IMPALA
> Issue Type: Bug
> Reporter: Andrew Sherman
> Assignee: Andrew Sherman
> Priority: Major
>
> In tests run against an Erasure Coding filesystem, multiple tests failed with 
> memory allocation errors.
> In total 10 tests failed:
>  * query_test.test_scanners.TestParquet.test_decimal_encodings
>  * query_test.test_scanners.TestTpchScanRangeLengths.test_tpch_scan_ranges
>  * query_test.test_exprs.TestExprs.test_exprs [enable_expr_rewrites: 0]
>  * query_test.test_exprs.TestExprs.test_exprs [enable_expr_rewrites: 1]
>  * query_test.test_hbase_queries.TestHBaseQueries.test_hbase_scan_node
>  * query_test.test_scanners.TestParquet.test_def_levels
>  * query_test.test_scanners.TestTextSplitDelimiters.test_text_split_across_buffers_delimiter
>  * query_test.test_hbase_queries.TestHBaseQueries.test_hbase_filters
>  * query_test.test_hbase_queries.TestHBaseQueries.test_hbase_inline_views
>  * query_test.test_hbase_queries.TestHBaseQueries.test_hbase_top_n
> The first failure looked like this on the client side:
> {quote}
> F query_test/test_scanners.py::TestParquet::()::test_decimal_encodings[protocol: beeswax | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': True, 'abort_on_error': 1, 'debug_action': '-1:OPEN:SET_DENY_RESERVATION_PROBABILITY@0.5', 'exec_single_node_rows_threshold': 0} | table_format: parquet/none]
>  query_test/test_scanners.py:717: in test_decimal_encodings
>  self.run_test_case('QueryTest/parquet-decimal-formats', vector, unique_database)
>  common/impala_test_suite.py:472: in run_test_case
>  result = self.__execute_query(target_impalad_client, query, user=user)
>  common/impala_test_suite.py:699: in __execute_query
>  return impalad_client.execute(query, user=user)
>  common/impala_connection.py:174: in execute
>  return self.__beeswax_client.execute(sql_stmt, user=user)
>  beeswax/impala_beeswax.py:183: in execute
>  handle = self.__execute_query(query_string.strip(), user=user)
>  beeswax/impala_beeswax.py:360: in __execute_query
>  self.wait_for_finished(handle)
>  beeswax/impala_beeswax.py:381: in wait_for_finished
>  raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
>  E   ImpalaBeeswaxException: ImpalaBeeswaxException:
>  E   Query aborted:ExecQueryFInstances rpc query_id=6e44c3c949a31be2:f973c7ff failed: Failed to get minimum memory reservation of 8.00 KB on daemon xxx.com:22001 for query 6e44c3c949a31be2:f973c7ff due to following error: Memory limit exceeded: Could not allocate memory while trying to increase reservation.
>  E   Query(6e44c3c949a31be2:f973c7ff) could not allocate 8.00 KB without exceeding limit.
>  E   Error occurred on backend xxx.com:22001
>  E   Memory left in process limit: 1.19 GB
>  E   Query(6e44c3c949a31be2:f973c7ff): Reservation=0 ReservationLimit=9.60 GB OtherMemory=0 Total=0 Peak=0
>  E   Memory is likely oversubscribed. Reducing query concurrency or configuring admission control may help avoid this error.
> {quote}
> On the server side log:
> {quote}
> I0207 18:25:19.329311  5562 impala-server.cc:1063] 6e44c3c949a31be2:f973c7ff] Registered query query_id=6e44c3c949a31be2:f973c7ff session_id=93497065f69e9d01:8a3bd06faff3da5
> I0207 18:25:19.329434  5562 Frontend.java:1242] 6e44c3c949a31be2:f973c7ff] Analyzing query: select score from decimal_stored_as_int32
> I0207 18:25:19.329583  5562 FeSupport.java:285] 6e44c3c949a31be2:f973c7ff] Requesting prioritized load of table(s): 
>