[jira] [Created] (IMPALA-12383) Aggregation with num_nodes=1 and limit returns too many rows
Michael Smith created IMPALA-12383:
--
Summary: Aggregation with num_nodes=1 and limit returns too many rows
Key: IMPALA-12383
URL: https://issues.apache.org/jira/browse/IMPALA-12383
Project: IMPALA
Issue Type: Bug
Components: Backend, Frontend
Affects Versions: Impala 4.1.0
Reporter: Michael Smith

With {{set num_nodes=1}} to select SingleNodePlanner, aggregations return too many rows:
{code}
> select distinct l_orderkey from tpch.lineitem limit 10;
...
Fetched 16 row(s) in 0.12s
> select ss_cdemo_sk from tpcds.store_sales group by ss_cdemo_sk limit 3;
...
Fetched 7 row(s) in 0.14s
{code}
This looks like it's caused by changes in IMPALA-2581, which attempts to push down limits to pre-aggregation. In SingleNodePlanner, there is no pre-aggregation, which the patch appears to have failed to account for.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (IMPALA-12382) Coordinator could schedule fragments on gracefully shutdown executors
Abhishek Rawat created IMPALA-12382:
---
Summary: Coordinator could schedule fragments on gracefully shutdown executors
Key: IMPALA-12382
URL: https://issues.apache.org/jira/browse/IMPALA-12382
Project: IMPALA
Issue Type: Improvement
Reporter: Abhishek Rawat

Statestore does failure detection based on consecutive heartbeat failures. By default, a subscriber is marked as failed after 10 missed heartbeats (statestore_max_missed_heartbeats) sent at 1-second intervals (statestore_heartbeat_frequency_ms). In practice, detection can take much longer than 10 seconds overall, especially if statestore is busy or each failed heartbeat has to wait out the RPC timeout. In the following example it took 50 seconds for failure detection:
{code:java}
I0817 12:32:06.824721 86 statestore.cc:1157] Unable to send heartbeat message to subscriber impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010, received error: RPC Error: Client for 10.80.199.159:23000 hit an unexpected exception: No more data to read., type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala18THeartbeatResponseE, send: done
I0817 12:32:06.824741 86 failure-detector.cc:91] 1 consecutive heartbeats failed for 'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'. State is OK
.
.
.
I0817 12:32:56.800251 83 statestore.cc:1157] Unable to send heartbeat message to subscriber impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010, received error: RPC Error: Client for 10.80.199.159:23000 hit an unexpected exception: No more data to read., type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala18THeartbeatResponseE, send: done
I0817 12:32:56.800267 83 failure-detector.cc:91] 10 consecutive heartbeats failed for 'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'. State is FAILED
I0817 12:32:56.800276 83 statestore.cc:1168] Subscriber 'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010' has failed, disconnected or re-registered (last known registration ID: c84bf70f03acda2b:b34a812c5e96e687)
{code}
As a result, there is a window while statestore is still determining node failure during which the coordinator might schedule fragments on the affected executor(s). The exec RPC will fail, and if transparent query retries are enabled, the coordinator will immediately retry the query and it will fail again.

Ideally, in such situations the coordinator should be notified sooner about a failed executor. Statestore could send a priority topic update to the coordinator when it enters failure detection logic. This should reduce the chances of the coordinator scheduling a query fragment on a failed executor.

The other option would be to tune the heartbeat frequency and interval parameters. But it's hard to find a configuration that works for all cases. So while the default values are reasonable, under certain conditions they can be unreasonable, as seen in the above example.

It might make sense to handle specially the case where executors are shut down gracefully: in that case statestore shouldn't do failure detection and should instead fail these executors immediately.
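As a rough check on the 50-second figure in IMPALA-12382, the two failure-detector timestamps quoted in the log can be compared directly. The sketch below is only arithmetic on those two timestamps; the helper name `seconds_between` and the constants are made up for this illustration and are not part of Impala.

```python
from datetime import datetime

# Timestamps copied from the failure-detector.cc lines in the quoted log.
FIRST_FAILURE = "12:32:06.824741"   # "1 consecutive heartbeats failed"
TENTH_FAILURE = "12:32:56.800267"   # "10 consecutive heartbeats failed"

MAX_MISSED_HEARTBEATS = 10   # default statestore_max_missed_heartbeats
HEARTBEAT_INTERVAL_S = 1.0   # default statestore_heartbeat_frequency_ms, in seconds


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two same-day HH:MM:SS.ffffff timestamps."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Ten failures are separated by nine gaps.
gaps = MAX_MISSED_HEARTBEATS - 1
elapsed_s = seconds_between(FIRST_FAILURE, TENTH_FAILURE)
avg_gap_s = elapsed_s / gaps
naive_s = gaps * HEARTBEAT_INTERVAL_S

print(f"observed: {elapsed_s:.1f}s between 1st and 10th failure")
print(f"average gap between failed heartbeats: {avg_gap_s:.2f}s")
print(f"expected with 1s spacing: {naive_s:.0f}s")
```

In this log the failed heartbeats were spaced several seconds apart rather than one second apart, which is why the detection window stretched well past what the defaults suggest.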
[jira] [Resolved] (IMPALA-12372) Only use -Wno-deprecated / -Wno-deprecated-declaration for OpenSSL3
[ https://issues.apache.org/jira/browse/IMPALA-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-12372.
Fix Version/s: Impala 4.3.0
Resolution: Fixed

> Only use -Wno-deprecated / -Wno-deprecated-declaration for OpenSSL3
> ---
>
> Key: IMPALA-12372
> URL: https://issues.apache.org/jira/browse/IMPALA-12372
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 4.3.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Major
> Fix For: Impala 4.3.0
>
>
> As part of supporting Redhat 9 / Ubuntu 22, those platforms use OpenSSL3 and
> compilation will produce warnings that fail our build (due to -Werror). The
> original change turned off those deprecation warnings for all platforms.
> This is overly broad. We should try to turn off those warnings only for
> platforms that use OpenSSL3. Otherwise, we are blind to other locations that
> are using deprecated functions. This came up when investigating using
> googletest 1.12.1 (which deprecated some calls we use).
[jira] [Created] (IMPALA-12381) Add jdbc related properties to JDBC data source object
Wenzhe Zhou created IMPALA-12381:

Summary: Add jdbc related properties to JDBC data source object
Key: IMPALA-12381
URL: https://issues.apache.org/jira/browse/IMPALA-12381
Project: IMPALA
Issue Type: Sub-task
Components: Backend, Frontend
Reporter: Wenzhe Zhou

Currently, JDBC-related properties are specified as table properties when creating a table, as below:

CREATE TABLE alltypes_jdbc_datasource (
  id INT,
  name STRING)
PRODUCED BY DATA SOURCE JdbcDataSource (
'{"database.type":"POSTGRES", "jdbc.url":"jdbc:postgresql://localhost:5432/functional", "jdbc.driver":"org.postgresql.Driver", "dbcp.username":"hiveuser", "dbcp.password":"password", "table":"alltypes"}');

It would be more convenient to move the JDBC-related properties to the data source object, as below, so that users don't need to specify them for each table:

CREATE DATA SOURCE JdbcDataSource
LOCATION '/test-warehouse/data-sources/jdbc-data-source.jar'
CLASS 'org.apache.impala.extdatasource.jdbc.JdbcDataSource'
DATABASE-TYPE 'POSTGRES'
JDBC-URL 'jdbc:postgresql://localhost:5432/functional'
JDBC-DRIVER 'org.postgresql.Driver'
DBCP-USERNAME 'hiveuser'
DBCP-PASSWORD 'password'
API_VERSION 'V1';
[jira] [Created] (IMPALA-12380) Securing dbcp.password for JDBC external data source
Wenzhe Zhou created IMPALA-12380:

Summary: Securing dbcp.password for JDBC external data source
Key: IMPALA-12380
URL: https://issues.apache.org/jira/browse/IMPALA-12380
Project: IMPALA
Issue Type: Sub-task
Reporter: Wenzhe Zhou

In the first patch for the JDBC external data source (https://gerrit.cloudera.org/#/c/17842/), "dbcp.password" is provided as clear text in the table property. We should allow users to store the password in a Java keystore file on HDFS and restrict access to the keystore file to authorized users.
[jira] [Created] (IMPALA-12379) Detect available jdbc drivers without restarting Impala
Wenzhe Zhou created IMPALA-12379:

Summary: Detect available jdbc drivers without restarting Impala
Key: IMPALA-12379
URL: https://issues.apache.org/jira/browse/IMPALA-12379
Project: IMPALA
Issue Type: Sub-task
Components: Frontend
Reporter: Wenzhe Zhou

The JDBC external data source should be able to detect any JDBC driver jars in the classpath (including mysql, postgres, impala, oracle, etc.) without restarting Impala.
[jira] [Created] (IMPALA-12378) Auto Ship JDBC external data source
Wenzhe Zhou created IMPALA-12378:

Summary: Auto Ship JDBC external data source
Key: IMPALA-12378
URL: https://issues.apache.org/jira/browse/IMPALA-12378
Project: IMPALA
Issue Type: Sub-task
Components: Frontend, Infrastructure
Reporter: Wenzhe Zhou

The JDBC external data source library should be shipped automatically with the Impala binaries so that users don't need to add the jar file manually. The JDBC driver jars, however, are still provided by the user.
[jira] [Created] (IMPALA-12377) Improve 'select count(*)' for external data source
Wenzhe Zhou created IMPALA-12377:

Summary: Improve 'select count(*)' for external data source
Key: IMPALA-12377
URL: https://issues.apache.org/jira/browse/IMPALA-12377
Project: IMPALA
Issue Type: Sub-task
Components: Backend, Frontend
Reporter: Wenzhe Zhou

The code that handles 'select count(*)' in the backend function DataSourceScanNode::GetNext() is not efficient. Even when no column data is returned from the external data source, it still tries to materialize rows and adds them to the RowBatch one by one, up to the row count. It also calls GetNextInputBatch() multiple times (count / batch_size), and each GetNextInputBatch() call invokes a JNI function.
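To make the cost described in IMPALA-12377 concrete, here is a toy model comparing row-by-row emission with a bulk update of the batch size. It is a sketch under stated assumptions, not Impala code: the function names are hypothetical, a "batch" is reduced to its row count, and "steps" simply counts loop iterations. It covers only the row materialization cost, not the repeated JNI calls.

```python
def emit_row_by_row(num_rows: int, batch_capacity: int):
    """Models adding count(*) rows one at a time; returns (batch sizes, steps)."""
    batches, steps = [], 0
    remaining = num_rows
    while remaining > 0:
        size = 0
        while size < batch_capacity and remaining > 0:
            size += 1        # one "materialize and add row" step per row
            remaining -= 1
            steps += 1
        batches.append(size)
    return batches, steps


def emit_in_bulk(num_rows: int, batch_capacity: int):
    """Models setting each batch's row count in one step (hypothetical fix)."""
    batches, steps = [], 0
    remaining = num_rows
    while remaining > 0:
        size = min(remaining, batch_capacity)
        remaining -= size
        steps += 1           # one step per batch, independent of its size
        batches.append(size)
    return batches, steps


slow_batches, slow_steps = emit_row_by_row(2500, 1024)
fast_batches, fast_steps = emit_in_bulk(2500, 1024)
print(slow_batches, slow_steps)
print(fast_batches, fast_steps)
```

Both variants hand the same batch sizes to the caller; the difference is only how much work is done per batch when no column data needs to be materialized.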
[jira] [Created] (IMPALA-12376) DataSourceScanNode drop some returned rows if FLAGS_data_source_batch_size is greater than default value
Wenzhe Zhou created IMPALA-12376:

Summary: DataSourceScanNode drop some returned rows if FLAGS_data_source_batch_size is greater than default value
Key: IMPALA-12376
URL: https://issues.apache.org/jira/browse/IMPALA-12376
Project: IMPALA
Issue Type: Sub-task
Components: Backend
Reporter: Wenzhe Zhou
Assignee: Wenzhe Zhou

The backend DataSourceScanNode (be/src/exec/data-source-scan-node.cc) does not handle eos properly in DataSourceScanNode::GetNext(). Rows returned from the external data source can be dropped if FLAGS_data_source_batch_size is set to a value greater than the default of 1024. In the following code:

if (row_batch->AtCapacity() || input_batch_->eos || ReachedLimit()) {
  *eos = input_batch_->eos || ReachedLimit();

eos can be set to true while some rows in the input batch are still unprocessed, if row_batch->AtCapacity() returns true.
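The eos problem in IMPALA-12376 can be reproduced with a small simulation. This is a toy model, not the Impala implementation: `InputBatch`, `scan_all`, and the `wait_for_consumed` switch are hypothetical names, limits are ignored, the output row batch capacity is assumed to be 1024, and the "fix" shown (report eos only once the current input batch is fully consumed) is just one plausible approach.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InputBatch:
    rows: List[int]
    eos: bool  # True on the last batch returned by the data source


def scan_all(input_batches, row_batch_capacity, wait_for_consumed):
    """Simulates repeated GetNext() calls; returns every row the caller sees.

    wait_for_consumed=False mimics the quoted condition (eos is copied
    straight from the input batch). wait_for_consumed=True reports eos only
    after the current input batch has been fully consumed.
    Assumes the final input batch has eos=True.
    """
    seen = []
    batches = iter(input_batches)
    current = next(batches)
    pos = 0  # index of the next unprocessed row in `current`
    while True:
        row_batch = []  # one output batch per simulated GetNext() call
        while True:
            while pos < len(current.rows) and len(row_batch) < row_batch_capacity:
                row_batch.append(current.rows[pos])
                pos += 1
            at_capacity = len(row_batch) >= row_batch_capacity
            if at_capacity or current.eos:
                consumed = pos >= len(current.rows)
                eos = current.eos and (consumed or not wait_for_consumed)
                break
            current, pos = next(batches), 0  # fetch the next input batch
        seen.extend(row_batch)
        if eos:
            return seen


rows = list(range(2000))
# Data source batch size 2000 (> 1024): a single input batch that carries eos.
big = [InputBatch(rows, eos=True)]
# Data source batch size 1024: eos only arrives with the second input batch.
default = [InputBatch(rows[:1024], eos=False), InputBatch(rows[1024:], eos=True)]

buggy_big = scan_all(big, 1024, wait_for_consumed=False)
fixed_big = scan_all(big, 1024, wait_for_consumed=True)
buggy_default = scan_all(default, 1024, wait_for_consumed=False)
print(len(buggy_big), len(fixed_big), len(buggy_default))
```

In this model the rows are lost only when a single input batch holds more rows than the output batch can take and also carries eos, which matches the report that the default batch size does not trigger the problem.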
[jira] [Created] (IMPALA-12375) DataSource objects are not persistent
Wenzhe Zhou created IMPALA-12375:

Summary: DataSource objects are not persistent
Key: IMPALA-12375
URL: https://issues.apache.org/jira/browse/IMPALA-12375
Project: IMPALA
Issue Type: Sub-task
Components: Backend, Frontend
Reporter: Wenzhe Zhou

DataSource objects created with "CREATE DATA SOURCE" statements are not persistent. They no longer appear in "show data sources" after the mini-cluster is restarted.
[jira] [Created] (IMPALA-12374) Explore optimizing re2 usage for leading / trailing ".*"
Joe McDonnell created IMPALA-12374:
--
Summary: Explore optimizing re2 usage for leading / trailing ".*"
Key: IMPALA-12374
URL: https://issues.apache.org/jira/browse/IMPALA-12374
Project: IMPALA
Issue Type: Improvement
Components: Backend
Affects Versions: Impala 4.3.0
Reporter: Joe McDonnell

Abseil has some recommendations about efficiently using re2 here: [https://abseil.io/fast/21]

One recommendation it has is to avoid leading / trailing .* for FullMatch():
{noformat}
Using RE2::FullMatch() with leading or trailing .* is an antipattern. Instead, change it to RE2::PartialMatch() and remove the .*. RE2::PartialMatch() performs an unanchored search, so it is also necessary to anchor the regular expression (i.e. with ^ or $) to indicate that it must match at the start or end of the string.
{noformat}
For our slow path LIKE evaluation, we convert the LIKE to a regular expression and use FullMatch(). Our code to generate the regular expression will use leading/trailing .* and FullMatch for patterns like '%a%b%'. We could try detecting these cases and switching to PartialMatch with anchors. See the link for more details about how this works.
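The rewrite suggested in IMPALA-12374 can be illustrated with Python's `re` module, where `re.fullmatch` and `re.search` play the roles of FullMatch and PartialMatch. This is a sketch of the idea only: Python's `re` is not RE2, the function names are hypothetical, LIKE escape characters are not handled, and `\A` / `\Z` are used as anchors instead of `^` / `$` so that the end anchor cannot match before a trailing newline.

```python
import re


def like_body_to_regex(body: str) -> str:
    """Translate LIKE text to a regex: % -> .*, _ -> ., everything else literal."""
    parts = []
    for ch in body:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like_fullmatch(pattern: str, s: str) -> bool:
    """Baseline: full match against the whole translated pattern."""
    return re.fullmatch(like_body_to_regex(pattern), s, re.DOTALL) is not None


def like_partialmatch(pattern: str, s: str) -> bool:
    """Drop leading/trailing % and anchor only the sides that had none."""
    leading = pattern.startswith("%")
    trailing = pattern.endswith("%")
    regex = like_body_to_regex(pattern.strip("%"))
    if not leading:
        regex = r"\A" + regex   # must match at the start of the string
    if not trailing:
        regex = regex + r"\Z"   # must match at the very end of the string
    return re.search(regex, s, re.DOTALL) is not None


print(like_fullmatch("%a%b%", "xaybz"), like_partialmatch("%a%b%", "xaybz"))
print(like_fullmatch("%a%b%", "ba"), like_partialmatch("%a%b%", "ba"))
```

For '%a%b%' the partial-match form searches for `a.*b` with no anchors at all, so the engine no longer has to account for the leading and trailing `.*`.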
[jira] [Resolved] (IMPALA-11195) Disable SSL session renegotiation
[ https://issues.apache.org/jira/browse/IMPALA-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltán Borók-Nagy resolved IMPALA-11195.
Resolution: Fixed

Thanks Michael for fixing the webserver ssl bugs.

> Disable SSL session renegotiation
> -
>
> Key: IMPALA-11195
> URL: https://issues.apache.org/jira/browse/IMPALA-11195
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Zoltán Borók-Nagy
> Assignee: Zoltán Borók-Nagy
> Priority: Major
> Fix For: Impala 4.3.0
>
>
> SSL renegotiation has had a couple of CVEs in the past. We should figure out
> how to disable it.
> Kudu disabled SSL renegotiation in KUDU-1926, so we can do something similar.