[jira] [Created] (IMPALA-12383) Aggregation with num_nodes=1 and limit returns too many rows

2023-08-17 Thread Michael Smith (Jira)
Michael Smith created IMPALA-12383:
--

 Summary: Aggregation with num_nodes=1 and limit returns too many 
rows
 Key: IMPALA-12383
 URL: https://issues.apache.org/jira/browse/IMPALA-12383
 Project: IMPALA
  Issue Type: Bug
  Components: Backend, Frontend
Affects Versions: Impala 4.1.0
Reporter: Michael Smith


With {{set num_nodes=1}} to select SingleNodePlanner, aggregations return too 
many rows:
{code}
> select distinct l_orderkey from tpch.lineitem limit 10;
...
Fetched 16 row(s) in 0.12s
> select ss_cdemo_sk from tpcds.store_sales group by ss_cdemo_sk limit 3;
...
Fetched 7 row(s) in 0.14s
{code}

This looks like it's caused by changes in IMPALA-2581, which attempts to push 
down limits to pre-aggregation. In SingleNodePlanner, there is no 
pre-aggregation, which the patch appears to have failed to account for.
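
The suspected cause can be illustrated with a toy model. This is only a sketch under stated assumptions, not Impala's planner code: the function names are hypothetical, and "aggregation" is reduced to collecting distinct keys. It shows that a limit pushed into a pre-aggregation is only an early-exit optimization, so whichever aggregation produces the final output still has to enforce the limit itself.

```python
from typing import Iterable, List, Optional


def aggregate_distinct(rows: Iterable[int], limit: Optional[int] = None) -> List[int]:
    """Collect distinct keys in first-seen order, stopping early at `limit` groups."""
    seen = {}
    for key in rows:
        if key not in seen:
            seen[key] = True
            if limit is not None and len(seen) >= limit:
                break
    return list(seen)


def distributed_plan(partitions: List[List[int]], limit: int) -> List[int]:
    """Hypothetical two-phase plan: limited pre-aggregation, then a merge."""
    # Each pre-aggregation may stop after `limit` groups (the pushed-down limit).
    partial = [aggregate_distinct(p, limit) for p in partitions]
    merged = aggregate_distinct(k for p in partial for k in p)
    # The merge step still enforces the limit on the final output.
    return merged[:limit]


def single_node_plan(rows: List[int], limit: int) -> List[int]:
    """Hypothetical single-phase plan: the only aggregation must enforce the limit."""
    return aggregate_distinct(rows)[:limit]
```

In this model, dropping the final `[:limit]` from `single_node_plan` would return every distinct key, which is the kind of over-return described in this report.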



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (IMPALA-12382) Coordinator could schedule fragments on gracefully shutdown executors

2023-08-17 Thread Abhishek Rawat (Jira)
Abhishek Rawat created IMPALA-12382:
---

 Summary: Coordinator could schedule fragments on gracefully 
shutdown executors
 Key: IMPALA-12382
 URL: https://issues.apache.org/jira/browse/IMPALA-12382
 Project: IMPALA
  Issue Type: Improvement
Reporter: Abhishek Rawat


The statestore detects failures based on consecutive heartbeat failures. By 
default this is configured as 10 missed heartbeats 
(statestore_max_missed_heartbeats) at 1-second intervals 
(statestore_heartbeat_frequency_ms). Detection can, however, take much longer 
than 10 seconds overall, especially if the statestore is busy, and because of 
the RPC timeout duration.

In the following example it took 50 seconds for failure detection:
{code:java}
I0817 12:32:06.824721    86 statestore.cc:1157] Unable to send heartbeat 
message to subscriber 
impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010,
 received error: RPC Error: Client for 10.80.199.159:23000 hit an unexpected 
exception: No more data to read., type: 
N6apache6thrift9transport19TTransportExceptionE, rpc: 
N6impala18THeartbeatResponseE, send: done
I0817 12:32:06.824741    86 failure-detector.cc:91] 1 consecutive heartbeats 
failed for 
'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'.
 State is OK
.
.
.
I0817 12:32:56.800251    83 statestore.cc:1157] Unable to send heartbeat 
message to subscriber 
impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010,
 received error: RPC Error: Client for 10.80.199.159:23000 hit an unexpected 
exception: No more data to read., type: 
N6apache6thrift9transport19TTransportExceptionE, rpc: 
N6impala18THeartbeatResponseE, send: done 
I0817 12:32:56.800267    83 failure-detector.cc:91] 10 consecutive heartbeats 
failed for 
'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'.
 State is FAILED
I0817 12:32:56.800276    83 statestore.cc:1168] Subscriber 
'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'
 has failed, disconnected or re-registered (last known registration ID: 
c84bf70f03acda2b:b34a812c5e96e687){code}
As a result, there is a window while the statestore is still determining node 
failure during which the coordinator might schedule fragments on the affected 
executor(s). The exec RPC will fail and, if transparent query retries are 
enabled, the coordinator will immediately retry the query, which will fail again.
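
The size of that window can be read off the two timestamps in the log excerpt. The short calculation below is only a sketch: it uses the timestamps of the 1st and 10th consecutive failure from the log and the default values quoted in this report, and the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps of the 1st and 10th consecutive heartbeat failure in the log.
FIRST_FAILURE = "12:32:06.824741"
TENTH_FAILURE = "12:32:56.800267"

# Defaults quoted in the description.
MAX_MISSED_HEARTBEATS = 10   # statestore_max_missed_heartbeats
HEARTBEAT_INTERVAL_S = 1.0   # statestore_heartbeat_frequency_ms = 1000


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two same-day HH:MM:SS.ffffff timestamps."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


elapsed = seconds_between(FIRST_FAILURE, TENTH_FAILURE)
gaps = MAX_MISSED_HEARTBEATS - 1          # 9 gaps between 10 failures
avg_gap = elapsed / gaps                  # observed time per failed attempt
nominal = gaps * HEARTBEAT_INTERVAL_S     # what the 1-second interval suggests

print(f"observed window: {elapsed:.1f}s, nominal: {nominal:.1f}s")
print(f"average gap between failed heartbeats: {avg_gap:.2f}s")
```

The observed gap between failed attempts is several times the configured 1-second interval, which is consistent with the point above that RPC time and statestore load, not just the interval, determine how long detection takes.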

Ideally, in such situations the coordinator should be notified sooner about a 
failed executor. The statestore could send a priority topic update to the 
coordinator when it enters its failure detection logic. This should reduce the 
chance of the coordinator scheduling query fragments on a failed executor.

An alternative would be to tune the heartbeat frequency and interval 
parameters, but it is hard to find a configuration that works for all cases. 
So while the default values are reasonable, under certain conditions they can 
be unreasonable, as seen in the example above.

It might make sense to handle specially the case where executors are shut down 
gracefully: in that case the statestore shouldn't do failure detection and 
should instead fail these executors immediately.
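
A minimal sketch of that idea follows. It is not the statestore's failure detector: the class and method names are hypothetical, and the only behavior taken from the report is that a subscriber is marked FAILED once the consecutive-failure count reaches the configured maximum (10 in the log above). The graceful-shutdown path simply bypasses the counting.

```python
class FailureDetector:
    """Toy missed-heartbeat detector with a graceful-shutdown shortcut."""

    def __init__(self, max_missed: int = 10) -> None:
        self.max_missed = max_missed
        self.missed = {}      # subscriber id -> consecutive failed heartbeats
        self.failed = set()   # subscribers currently considered failed

    def report_heartbeat(self, subscriber: str, ok: bool) -> str:
        """Record one heartbeat result and return the resulting state."""
        if ok:
            self.missed[subscriber] = 0
            self.failed.discard(subscriber)
            return "OK"
        self.missed[subscriber] = self.missed.get(subscriber, 0) + 1
        if self.missed[subscriber] >= self.max_missed:
            self.failed.add(subscriber)
            return "FAILED"
        return "OK"

    def report_graceful_shutdown(self, subscriber: str) -> str:
        """Hypothetical shortcut: fail the subscriber without counting heartbeats."""
        self.failed.add(subscriber)
        return "FAILED"
```

With this shape, an executor that announces its shutdown is excluded immediately, while unexpected failures still go through the usual missed-heartbeat threshold.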





[jira] [Resolved] (IMPALA-12372) Only use -Wno-deprecated / -Wno-deprecated-declaration for OpenSSL3

2023-08-17 Thread Joe McDonnell (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-12372.

Fix Version/s: Impala 4.3.0
   Resolution: Fixed

> Only use -Wno-deprecated / -Wno-deprecated-declaration for OpenSSL3
> ---
>
> Key: IMPALA-12372
> URL: https://issues.apache.org/jira/browse/IMPALA-12372
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 4.3.0
>Reporter: Joe McDonnell
>Assignee: Joe McDonnell
>Priority: Major
> Fix For: Impala 4.3.0
>
>
> As part of supporting Redhat 9 / Ubuntu 22, those platforms use OpenSSL3 and 
> compilation will produce warnings that fail our build (due to -Werror). The 
> original change turned off those deprecation warnings for all platforms.
> This is overly broad. We should try to turn off those warnings only for 
> platforms that use OpenSSL3. Otherwise, we are blind to other locations that 
> are using deprecated functions. This came up when investigating using 
> googletest 1.12.1 (which deprecated some calls we use).





[jira] [Created] (IMPALA-12381) Add jdbc related properties to JDBC data source object

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12381:


 Summary: Add jdbc related properties to JDBC data source object
 Key: IMPALA-12381
 URL: https://issues.apache.org/jira/browse/IMPALA-12381
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend, Frontend
Reporter: Wenzhe Zhou


Currently, JDBC-related properties are specified as table properties when 
creating a table, as below:

CREATE TABLE alltypes_jdbc_datasource (
 id INT, name STRING)
PRODUCED BY DATA SOURCE JdbcDataSource (
'{"database.type":"POSTGRES",
"jdbc.url":"jdbc:postgresql://localhost:5432/functional",
"jdbc.driver":"org.postgresql.Driver",
"dbcp.username":"hiveuser",
"dbcp.password":"password",
"table":"alltypes"}');

It would be more convenient to move the JDBC-related properties to the data 
source object, as below, so that users don't need to specify those properties 
for each table.

CREATE DATA SOURCE JdbcDataSource
LOCATION '/test-warehouse/data-sources/jdbc-data-source.jar'
CLASS 'org.apache.impala.extdatasource.jdbc.JdbcDataSource'
DATABASE-TYPE 'POSTGRES'
JDBC-URL 'jdbc:postgresql://localhost:5432/functional'
JDBC-DRIVER 'org.postgresql.Driver'
DBCP-USERNAME 'hiveuser'
DBCP-PASSWORD 'password'
API_VERSION 'V1';  
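
One way to picture the proposal is as a merge of two property sets. The sketch below is only an illustration: the function name is hypothetical, and the rule that table-level properties override data-source-level ones is an assumption, since this issue does not specify precedence. The property keys and values are the ones from the example above.

```python
import json
from typing import Dict


def effective_properties(data_source_props: Dict[str, str],
                         table_props_json: str) -> Dict[str, str]:
    """Combine data-source-level properties with a table's JSON properties.

    Assumption: a key set on the table overrides the data source's value.
    """
    merged = dict(data_source_props)
    merged.update(json.loads(table_props_json))
    return merged


# Properties that would live on the data source object under the proposal.
data_source = {
    "database.type": "POSTGRES",
    "jdbc.url": "jdbc:postgresql://localhost:5432/functional",
    "jdbc.driver": "org.postgresql.Driver",
    "dbcp.username": "hiveuser",
    "dbcp.password": "password",
}

# The table then only needs to name the remote table.
props = effective_properties(data_source, '{"table":"alltypes"}')
```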





[jira] [Created] (IMPALA-12380) Securing dbcp.password for JDBC external data source

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12380:


 Summary: Securing dbcp.password for JDBC external data source
 Key: IMPALA-12380
 URL: https://issues.apache.org/jira/browse/IMPALA-12380
 Project: IMPALA
  Issue Type: Sub-task
Reporter: Wenzhe Zhou


In the first patch for the JDBC external data source 
(https://gerrit.cloudera.org/#/c/17842/), 
"dbcp.password" is provided as clear text in the table property. We should 
allow users to store the password in a Java keystore file on HDFS and restrict 
access to the keystore file to authorized users.





[jira] [Created] (IMPALA-12379) Detect available jdbc drivers without restarting Impala

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12379:


 Summary: Detect available jdbc drivers without restarting Impala
 Key: IMPALA-12379
 URL: https://issues.apache.org/jira/browse/IMPALA-12379
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend
Reporter: Wenzhe Zhou


The JDBC external data source should be able to detect any JDBC driver jars in 
the classpath (including mysql, postgres, impala, oracle, etc.) without 
restarting.





[jira] [Created] (IMPALA-12378) Auto Ship JDBC external data source

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12378:


 Summary: Auto Ship JDBC external data source
 Key: IMPALA-12378
 URL: https://issues.apache.org/jira/browse/IMPALA-12378
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend, Infrastructure
Reporter: Wenzhe Zhou


The JDBC external data source library should be shipped automatically in the 
Impala binaries so that users don't need to add the jar file manually. The JDBC 
driver jars, however, are still provided by the user.





[jira] [Created] (IMPALA-12377) Improve 'select count(*)' for external data source

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12377:


 Summary: Improve 'select count(*)' for external data source
 Key: IMPALA-12377
 URL: https://issues.apache.org/jira/browse/IMPALA-12377
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend, Frontend
Reporter: Wenzhe Zhou


The code that handles 'select count(*)' in the backend function 
DataSourceScanNode::GetNext() is not efficient. Even when no column data is 
returned from the external data source, it still tries to materialize rows and 
adds rows to the RowBatch one by one, up to the row count. It also calls 
GetNextInputBatch() multiple times (count / batch_size), and each 
GetNextInputBatch() call invokes a JNI function.
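
The cost described above can be sketched with a small model. This is not the DataSourceScanNode code: the function names are hypothetical, and the "fast" variant is just one possible shape for an optimization, assuming a batch for count(*) only needs to carry a row count.

```python
import math


def jni_calls_needed(row_count: int, batch_size: int) -> int:
    """Number of GetNextInputBatch()-style calls when fetching batch by batch."""
    return math.ceil(row_count / batch_size)


def fill_row_batch_slow(remaining: int, capacity: int) -> int:
    """Add empty rows one at a time, as the current code is described to do."""
    added = 0
    while added < capacity and added < remaining:
        added += 1  # stands in for materializing and committing one row
    return added


def fill_row_batch_fast(remaining: int, capacity: int) -> int:
    """Hypothetical alternative: advance the row count in a single step."""
    return min(remaining, capacity)
```

Both fill functions return the same number of rows per batch; the difference is that the slow one does work proportional to the row count, while the fast one does constant work per batch.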





[jira] [Created] (IMPALA-12376) DataSourceScanNode drop some returned rows if FLAGS_data_source_batch_size is greater than default value

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12376:


 Summary: DataSourceScanNode drop some returned rows if 
FLAGS_data_source_batch_size is greater than default value
 Key: IMPALA-12376
 URL: https://issues.apache.org/jira/browse/IMPALA-12376
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Wenzhe Zhou
Assignee: Wenzhe Zhou


The backend DataSourceScanNode (be/src/exec/data-source-scan-node.cc) does not 
handle eos properly in DataSourceScanNode::GetNext(). Rows returned from the 
external data source can be dropped if FLAGS_data_source_batch_size is set to 
a value greater than the default of 1024.

In the following code: 
  if (row_batch->AtCapacity() || input_batch_->eos || ReachedLimit()) {
    *eos = input_batch_->eos || ReachedLimit();
eos can be set to true while some rows in the input batch are still 
unprocessed if row_batch->AtCapacity() returns true. 
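
The effect can be reproduced with a small model of the loop. This is a sketch under stated assumptions, not the Impala code or its eventual fix: the types and function names are hypothetical, ReachedLimit() is left out, and the "fixed" variant simply reports eos only once the final input batch has been fully consumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InputBatch:
    rows: List[int]
    eos: bool  # True when the data source has no further batches


def get_next(batch: InputBatch, start: int, capacity: int,
             fixed: bool) -> Tuple[List[int], int, bool]:
    """Copy rows from batch.rows[start:] into an output batch of <= capacity rows.

    Returns (output_rows, next_start, eos).
    """
    end = min(start + capacity, len(batch.rows))
    out = batch.rows[start:end]
    exhausted = end == len(batch.rows)
    if fixed:
        eos = batch.eos and exhausted
    else:
        # Mirrors "*eos = input_batch_->eos" with the limit check omitted.
        eos = batch.eos
    return out, end, eos


def scan(batches: List[InputBatch], capacity: int, fixed: bool) -> List[int]:
    """Drain all input batches through get_next() until eos is reported."""
    fetched: List[int] = []
    for batch in batches:
        start = 0
        while True:
            out, start, eos = get_next(batch, start, capacity, fixed)
            fetched.extend(out)
            if eos:
                return fetched
            if start == len(batch.rows):
                break  # input batch consumed; move on to the next one
    return fetched
```

With an input batch no larger than the output capacity, both variants return the same rows. The difference only appears when the final input batch holds more rows than one output batch, which matches the condition on FLAGS_data_source_batch_size described above.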





[jira] [Created] (IMPALA-12375) DataSource objects are not persistent

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12375:


 Summary: DataSource objects are not persistent
 Key: IMPALA-12375
 URL: https://issues.apache.org/jira/browse/IMPALA-12375
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend, Frontend
Reporter: Wenzhe Zhou


DataSource objects created with "CREATE DATA SOURCE" statements are not 
persistent. They are not shown in "show data sources" after the mini-cluster 
is restarted.  






[jira] [Created] (IMPALA-12374) Explore optimizing re2 usage for leading / trailing ".*"

2023-08-17 Thread Joe McDonnell (Jira)
Joe McDonnell created IMPALA-12374:
--

 Summary: Explore optimizing re2 usage for leading / trailing ".*"
 Key: IMPALA-12374
 URL: https://issues.apache.org/jira/browse/IMPALA-12374
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Affects Versions: Impala 4.3.0
Reporter: Joe McDonnell


Abseil has some recommendations about efficiently using re2 here: 
[https://abseil.io/fast/21]

One recommendation it has is to avoid leading / trailing .* for FullMatch():
{noformat}
Using RE2::FullMatch() with leading or trailing .* is an antipattern. Instead, 
change it to RE2::PartialMatch() and remove the .*. RE2::PartialMatch() 
performs an unanchored search, so it is also necessary to anchor the regular 
expression (i.e. with ^ or $) to indicate that it must match at the start or 
end of the string.{noformat}
For our slow path LIKE evaluation, we convert the LIKE to a regular expression 
and use FullMatch(). Our code to generate the regular expression will use 
leading/trailing .* and FullMatch for patterns like '%a%b%'. We could try 
detecting these cases and switching to PartialMatch with anchors. See the link 
for more details about how this works.
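
The rewrite can be sketched as follows. This is an illustration only: it uses Python's `re` module as a stand-in for RE2 (`re.fullmatch` for FullMatch, `re.search` for PartialMatch), the function names are hypothetical, and LIKE escape characters are ignored. `\A` and `\Z` are used as the anchors because Python's `$` also matches before a trailing newline.

```python
import re


def like_to_full_regex(like: str) -> str:
    """Translate a LIKE pattern to a regex meant for full-string matching."""
    return "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c) for c in like
    )


def like_to_partial_regex(like: str) -> str:
    """Translate a LIKE pattern to a regex meant for unanchored search.

    Leading/trailing '%' are dropped; an anchor is added on any side that
    did not have one, so the result accepts the same strings.
    """
    leading = like.startswith("%")
    trailing = like.endswith("%")
    body = like_to_full_regex(like.strip("%"))
    return ("" if leading else r"\A") + body + ("" if trailing else r"\Z")


def like_full(like: str, s: str) -> bool:
    """Current approach in this sketch: full match with leading/trailing .*"""
    return re.fullmatch(like_to_full_regex(like), s, re.DOTALL) is not None


def like_partial(like: str, s: str) -> bool:
    """Proposed approach in this sketch: unanchored search plus explicit anchors."""
    return re.search(like_to_partial_regex(like), s, re.DOTALL) is not None
```

For the pattern '%a%b%', `like_to_full_regex` produces `.*a.*b.*` while `like_to_partial_regex` produces `a.*b`; both helpers accept the same strings, and any performance difference would need to be measured with RE2 itself.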





[jira] [Resolved] (IMPALA-11195) Disable SSL session renegotiation

2023-08-17 Thread Zoltán Borók-Nagy (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy resolved IMPALA-11195.

Resolution: Fixed

Thanks Michael for fixing the webserver ssl bugs.

> Disable SSL session renegotiation
> -
>
> Key: IMPALA-11195
> URL: https://issues.apache.org/jira/browse/IMPALA-11195
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Zoltán Borók-Nagy
>Assignee: Zoltán Borók-Nagy
>Priority: Major
> Fix For: Impala 4.3.0
>
>
> SSL renegotiation has had a couple of CVEs in the past. We should figure out 
> how to disable it.
> Kudu disabled SSL renegotiation in KUDU-1926, so we can do something similar.


