[jira] [Updated] (IMPALA-7319) Investigate Clang Tidy Diff

2023-08-17 Thread Joe McDonnell (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell updated IMPALA-7319:
--
Fix Version/s: (was: Not Applicable)

> Investigate Clang Tidy Diff
> ---
>
> Key: IMPALA-7319
> URL: https://issues.apache.org/jira/browse/IMPALA-7319
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 3.1.0
>Reporter: Joe McDonnell
>Assignee: Joe McDonnell
>Priority: Minor
>
> Clang has a script clang-tidy-diff.py that can run clang tidy on a diff. This 
> is substantially faster than the normal run-clang-tidy.py, because it 
> compiles and analyzes only the changed files. This might also allow a more 
> graceful way to incorporate new clang tidy checks. Kudu has implemented this 
> functionality in their project. See 
> [build-support/clang_tidy_gerrit.py|https://github.com/apache/kudu/blob/master/build-support/clang_tidy_gerrit.py]
>  
> While this is faster, it is possible to have a code change that introduces a 
> clang tidy issue in code that didn't change, so clang tidy on a diff might 
> miss some issues.
> We should evaluate whether this is something worth incorporating into Impala. 
> It could be a good way for a developer to do a quick check before upload to 
> Gerrit.
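For context, the step that makes clang-tidy-diff.py fast is mapping a unified diff to per-file changed line ranges, which it then hands to clang-tidy's -line-filter option so only touched lines are analyzed. A minimal, hypothetical Python sketch of that extraction step (illustrative only, not Impala or LLVM code):

```python
import re

# Matches unified-diff hunk headers like "@@ -10,0 +11,3 @@" and captures the
# start line and line count on the NEW side of the diff.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_line_ranges(diff_text):
    """Map each file in a unified diff to its changed (start, end) line ranges."""
    ranges, current = {}, None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            current = line[4:].removeprefix("b/")   # strip git's "b/" prefix
        elif current is not None:
            m = HUNK_RE.match(line)
            if m:
                start = int(m.group(1))
                count = int(m.group(2) or "1")
                if count > 0:   # pure deletions touch no new-side lines
                    ranges.setdefault(current, []).append((start, start + count - 1))
    return ranges

diff = """\
--- a/be/src/exec/scanner.cc
+++ b/be/src/exec/scanner.cc
@@ -10,0 +11,3 @@
+int x = 0;
+int y = 1;
+int z = 2;
"""
print(changed_line_ranges(diff))    # {'be/src/exec/scanner.cc': [(11, 13)]}
```

A wrapper would then run something like `git diff -U0 origin/master | clang-tidy-diff.py -p1` so only the files appearing in these ranges are compiled and analyzed.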



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Reopened] (IMPALA-7319) Investigate Clang Tidy Diff

2023-08-17 Thread Joe McDonnell (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell reopened IMPALA-7319:
---
  Assignee: Joe McDonnell

> Investigate Clang Tidy Diff
> ---
>
> Key: IMPALA-7319
> URL: https://issues.apache.org/jira/browse/IMPALA-7319
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 3.1.0
>Reporter: Joe McDonnell
>Assignee: Joe McDonnell
>Priority: Minor
> Fix For: Not Applicable
>
>
> Clang has a script clang-tidy-diff.py that can run clang tidy on a diff. This 
> is substantially faster than the normal run-clang-tidy.py, because it 
> compiles and analyzes only the changed files. This might also allow a more 
> graceful way to incorporate new clang tidy checks. Kudu has implemented this 
> functionality in their project. See 
> [build-support/clang_tidy_gerrit.py|https://github.com/apache/kudu/blob/master/build-support/clang_tidy_gerrit.py]
>  
> While this is faster, it is possible to have a code change that introduces a 
> clang tidy issue in code that didn't change, so clang tidy on a diff might 
> miss some issues.
> We should evaluate whether this is something worth incorporating into Impala. 
> It could be a good way for a developer to do a quick check before upload to 
> Gerrit.






[jira] [Commented] (IMPALA-12383) Aggregation with num_nodes=1 and limit returns too many rows

2023-08-17 Thread Michael Smith (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755731#comment-17755731
 ] 

Michael Smith commented on IMPALA-12383:


[~sql_forever] [~liuyao] my best take on how to address this is to mark 
FINALIZE aggregators that are known to have a pre-aggregation step as having 
one, and to apply the limit pushdown only to those. If there is no 
pre-aggregation, the pushdown doesn't make sense.

> Aggregation with num_nodes=1 and limit returns too many rows
> 
>
> Key: IMPALA-12383
> URL: https://issues.apache.org/jira/browse/IMPALA-12383
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend, Frontend
>Affects Versions: Impala 4.1.0
>Reporter: Michael Smith
>Priority: Major
>
> With {{set num_nodes=1}} to select SingleNodePlanner, aggregations return too 
> many rows:
> {code}
> > select distinct l_orderkey from tpch.lineitem limit 10;
> ...
> Fetched 16 row(s) in 0.12s
> > select ss_cdemo_sk from tpcds.store_sales group by ss_cdemo_sk limit 3;
> ...
> Fetched 7 row(s) in 0.14s
> {code}
> This looks like it's caused by changes in IMPALA-2581, which attempts to push 
> down limits to pre-aggregation. In SingleNodePlanner, there is no 
> pre-aggregation, which the patch appears to have failed to account for.
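A toy Python model of the distinction (assumed, simplified semantics; not Impala's actual operators) shows why a limit is only safe on a pre-aggregation when a finalizing aggregation re-deduplicates and re-applies it above:

```python
def streaming_preagg(rows, limit=None):
    """Toy streaming pre-aggregation: dedup is best-effort, so a key can be
    emitted more than once. A pushed-down limit stops after `limit` OUTPUT
    rows -- it does not guarantee `limit` distinct keys."""
    seen, out = set(), []
    for i, key in enumerate(rows):
        evicted = i % 4 == 3                # simulate hash-table eviction
        if key not in seen or evicted:
            seen.add(key)
            out.append(key)
            if limit is not None and len(out) >= limit:
                break
    return out

def finalize_agg(rows, limit):
    """FINALIZE aggregation: full deduplication, then the query's LIMIT."""
    return list(dict.fromkeys(rows))[:limit]

rows = [k for k in range(100) for _ in range(3)]    # 100 keys, 3 copies each

# Correct single-node plan: one full aggregation, then LIMIT.
ok = finalize_agg(rows, limit=10)
assert ok == list(range(10))

# Buggy single-node plan after the pushdown: the only aggregation behaves
# like a limited pre-agg, so duplicate keys leak into the result.
bad = streaming_preagg(rows, limit=10)
print(len(bad), len(set(bad)))              # 10 rows, but fewer distinct keys
```

In a distributed plan the pre-agg's duplicates are harmless because the FINALIZE step above it cleans them up; with no FINALIZE re-check, the limit semantics break as in the queries quoted above.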






[jira] [Created] (IMPALA-12383) Aggregation with num_nodes=1 and limit returns too many rows

2023-08-17 Thread Michael Smith (Jira)
Michael Smith created IMPALA-12383:
--

 Summary: Aggregation with num_nodes=1 and limit returns too many 
rows
 Key: IMPALA-12383
 URL: https://issues.apache.org/jira/browse/IMPALA-12383
 Project: IMPALA
  Issue Type: Bug
  Components: Backend, Frontend
Affects Versions: Impala 4.1.0
Reporter: Michael Smith


With {{set num_nodes=1}} to select SingleNodePlanner, aggregations return too 
many rows:
{code}
> select distinct l_orderkey from tpch.lineitem limit 10;
...
Fetched 16 row(s) in 0.12s
> select ss_cdemo_sk from tpcds.store_sales group by ss_cdemo_sk limit 3;
...
Fetched 7 row(s) in 0.14s
{code}

This looks like it's caused by changes in IMPALA-2581, which attempts to push 
down limits to pre-aggregation. In SingleNodePlanner, there is no 
pre-aggregation, which the patch appears to have failed to account for.








[jira] [Commented] (IMPALA-7131) Support external data sources in local catalog mode

2023-08-17 Thread Wenzhe Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755730#comment-17755730
 ] 

Wenzhe Zhou commented on IMPALA-7131:
-

getDataSource() and getDataSources() are defined in the FeCatalog interface, but 
they are not implemented by the LocalCatalog class, and the cache of data source 
objects is not defined in the MetaProvider class. 
addDataSource() and removeDataSource() are not defined in the FeCatalog interface.

> Support external data sources in local catalog mode
> ---
>
> Key: IMPALA-7131
> URL: https://issues.apache.org/jira/browse/IMPALA-7131
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Catalog, Frontend
>Reporter: Todd Lipcon
>Assignee: Wenzhe Zhou
>Priority: Minor
>  Labels: catalog-v2
>
> Currently it seems that external data sources are not persisted except in 
> memory on the catalogd. This means that it will be somewhat more difficult to 
> support this feature in the design of impalad without a catalogd.
> This JIRA is to eventually figure out a way to support this feature -- either 
> by supporting in-memory on a per-impalad basis, or perhaps by figuring out a 
> way to register them persistently in a file system directory, etc.






[jira] [Updated] (IMPALA-12375) DataSource objects are not persistent

2023-08-17 Thread Wenzhe Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenzhe Zhou updated IMPALA-12375:
-
Description: 
DataSource objects which are created with "CREATE DATA SOURCE" statements are 
not persistent. The objects are not shown in "show data sources" after the 
catalog server is restarted.


  was:
DataSource objects which are created with "CREATE DATA SOURCE" statements are 
not persistent. The objects are not shown in "show data sources" after the 
mini-cluster is restarted.



> DataSource objects are not persistent
> 
>
> Key: IMPALA-12375
> URL: https://issues.apache.org/jira/browse/IMPALA-12375
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Catalog, Frontend
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> DataSource objects which are created with "CREATE DATA SOURCE" statements are 
> not persistent. The objects are not shown in "show data sources" after the 
> catalog server is restarted.






[jira] [Comment Edited] (IMPALA-12375) DataSource objects are not persistent

2023-08-17 Thread Wenzhe Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755727#comment-17755727
 ] 

Wenzhe Zhou edited comment on IMPALA-12375 at 8/17/23 11:14 PM:


Data source objects are saved in an in-memory 
[cache|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L93-L94]
 in Catalog. They are [NOT persisted to the 
metastore|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L261-L267]
 by original design. All data source properties are stored as [table 
properties|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java#L41-L48]
 (persisted in the metastore) so that the DataSource catalog objects are not 
needed in order to scan data source tables.

Hive doesn't have a data source object. All properties are specified as table 
properties when creating a table with [JDBC 
storage|https://cwiki.apache.org/confluence/display/Hive/JDBC+Storage+Handler].

Since data source objects are not persistent, they are missing after the 
catalog server is restarted. If CatalogD HA is enabled, data source objects are 
created only on the active catalogd, not on the standby catalogd. The missing 
data source objects don't affect existing data source tables, but we need to 
recreate a data source object before creating a new data source table.

To make data source objects persistent, we need to add new APIs in HMS to 
support these Impala-specific objects.
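The lifecycle described above can be sketched with a toy model (class and property names are illustrative approximations, not Impala's real code): only table metadata survives a restart, so scans keep working while the DataSource objects vanish.

```python
class Metastore:
    """Stands in for HMS: only table metadata is durable."""
    def __init__(self):
        self.tables = {}            # survives catalogd restarts

class Catalog:
    def __init__(self, metastore):
        self.metastore = metastore
        self.data_sources = {}      # in-memory only, lost on restart

    def create_data_source(self, name, location, class_name):
        self.data_sources[name] = {"location": location, "class": class_name}

    def create_table(self, tbl, data_source):
        ds = self.data_sources[data_source]
        # All data source properties are copied into table properties and
        # persisted, so the table no longer depends on the DataSource object.
        self.metastore.tables[tbl] = {"data_source_name": data_source, **ds}

    def show_data_sources(self):
        return sorted(self.data_sources)

hms = Metastore()
cat = Catalog(hms)
cat.create_data_source("jdbc_ds", "/path/to/jar", "JdbcDataSource")
cat.create_table("t1", "jdbc_ds")

cat = Catalog(hms)                  # "restart": new catalogd, same HMS
print(cat.show_data_sources())      # [] -- the DataSource object is gone
print("t1" in cat.metastore.tables) # True -- the table is still scannable
```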


was (Author: wzhou):
Data source objects are saved in-memory 
[cache|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L93-L94]
 in Catalog.  They are NOT [persisted to the 
metastore|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L261-L267]
 by original design.  All data source properties are stored as [table 
properties|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java#L41-L48]
 (persisted in the metastore) so that the DataSource catalog objects are not 
needed in order to scan data source tables.

Hive don't have data source object. All properties are specified as table 
properties when creating table with[ JDBC 
storage|https://cwiki.apache.org/confluence/display/Hive/JDBC+Storage+Handler].

Since data source objects are not persistent, they are missing when catalog 
server is restarted. If CatalogD HA is enabled, data source objects are created 
only on active catalogd, not on standby catalogd.  The missing data source 
objects don't affect existing data source tables. We need to recreate data 
source object before creating new data source table.

To make data source object persistent,  we need to add new APIs in HMS to 
support this Impala specific objects. 

> DataSource objects are not persistent
> 
>
> Key: IMPALA-12375
> URL: https://issues.apache.org/jira/browse/IMPALA-12375
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Catalog, Frontend
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> DataSource objects which are created with "CREATE DATA SOURCE" statements are 
> not persistent. The objects are not shown in "show data sources" after the 
> mini-cluster is restarted.






[jira] [Comment Edited] (IMPALA-12375) DataSource objects are not persistent

2023-08-17 Thread Wenzhe Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755727#comment-17755727
 ] 

Wenzhe Zhou edited comment on IMPALA-12375 at 8/17/23 11:12 PM:


Data source objects are saved in an in-memory 
[cache|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L93-L94]
 in Catalog. They are [NOT persisted to the 
metastore|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L261-L267]
 by original design. All data source properties are stored as [table 
properties|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java#L41-L48]
 (persisted in the metastore) so that the DataSource catalog objects are not 
needed in order to scan data source tables.

Hive doesn't have a data source object. All properties are specified as table 
properties when creating a table with [JDBC 
storage|https://cwiki.apache.org/confluence/display/Hive/JDBC+Storage+Handler].

Since data source objects are not persistent, they are missing after the 
catalog server is restarted. If CatalogD HA is enabled, data source objects are 
created only on the active catalogd, not on the standby catalogd. The missing 
data source objects don't affect existing data source tables, but we need to 
recreate a data source object before creating a new data source table.

To make data source objects persistent, we need to add new APIs in HMS to 
support these Impala-specific objects.


was (Author: wzhou):
Data source objects are saved in-memory 
[cache|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L93-L94]
 in Catalog.  They are NOT [persisted to the 
metastore|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L261-L267]
 by original design.  All data source properties are stored as [table 
properties|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java#L41-L48]
 (persisted in the metastore) so that the DataSource catalog objects are not 
needed in order to scan data source tables.

Hive don't have data source object. All properties are specified as table 
properties when creating table with JDBC storage.

To make data source object persistent,  we need to add new APIs in HMS to 
support this Impala specific objects. 

> DataSource objects are not persistent
> 
>
> Key: IMPALA-12375
> URL: https://issues.apache.org/jira/browse/IMPALA-12375
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Catalog, Frontend
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> DataSource objects which are created with "CREATE DATA SOURCE" statements are 
> not persistent. The objects are not shown in "show data sources" after the 
> mini-cluster is restarted.






[jira] [Comment Edited] (IMPALA-12375) DataSource objects are not persistent

2023-08-17 Thread Wenzhe Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755727#comment-17755727
 ] 

Wenzhe Zhou edited comment on IMPALA-12375 at 8/17/23 11:04 PM:


Data source objects are saved in an in-memory 
[cache|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L93-L94]
 in Catalog. They are [NOT persisted to the 
metastore|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L261-L267]
 by original design. All data source properties are stored as [table 
properties|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java#L41-L48]
 (persisted in the metastore) so that the DataSource catalog objects are not 
needed in order to scan data source tables.

Hive doesn't have a data source object. All properties are specified as table 
properties when creating a table with JDBC storage.

To make data source objects persistent, we need to add new APIs in HMS to 
support these Impala-specific objects.


was (Author: wzhou):
Data source objects are saved in-memory 
[cache|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L93-L94]
 in Catalog.  They are [persisted to the 
metastore|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L261-L267]
 by original design.  All data source properties are stored as [table 
properties|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java#L41-L48]
 (persisted in the metastore) so that the DataSource catalog objects are not 
needed in order to scan data source tables.

Hive don't have data source object. All properties are specified as table 
properties when creating table with JDBC storage.

To make data source object persistent,  we need to add new APIs in HMS to 
support this Impala specific objects. 

> DataSource objects are not persistent
> 
>
> Key: IMPALA-12375
> URL: https://issues.apache.org/jira/browse/IMPALA-12375
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Catalog, Frontend
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> DataSource objects which are created with "CREATE DATA SOURCE" statements are 
> not persistent. The objects are not shown in "show data sources" after the 
> mini-cluster is restarted.






[jira] [Commented] (IMPALA-12375) DataSource objects are not persistent

2023-08-17 Thread Wenzhe Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755727#comment-17755727
 ] 

Wenzhe Zhou commented on IMPALA-12375:
--

Data source objects are saved in an in-memory 
[cache|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L93-L94]
 in Catalog. They are [NOT persisted to the 
metastore|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/Catalog.java#L261-L267]
 by original design. All data source properties are stored as [table 
properties|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/DataSourceTable.java#L41-L48]
 (persisted in the metastore) so that the DataSource catalog objects are not 
needed in order to scan data source tables.

Hive doesn't have a data source object. All properties are specified as table 
properties when creating a table with JDBC storage.

To make data source objects persistent, we need to add new APIs in HMS to 
support these Impala-specific objects.

> DataSource objects are not persistent
> 
>
> Key: IMPALA-12375
> URL: https://issues.apache.org/jira/browse/IMPALA-12375
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Catalog, Frontend
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> DataSource objects which are created with "CREATE DATA SOURCE" statements are 
> not persistent. The objects are not shown in "show data sources" after the 
> mini-cluster is restarted.






[jira] [Created] (IMPALA-12382) Coordinator could schedule fragments on gracefully shutdown executors

2023-08-17 Thread Abhishek Rawat (Jira)
Abhishek Rawat created IMPALA-12382:
---

 Summary: Coordinator could schedule fragments on gracefully 
shutdown executors
 Key: IMPALA-12382
 URL: https://issues.apache.org/jira/browse/IMPALA-12382
 Project: IMPALA
  Issue Type: Improvement
Reporter: Abhishek Rawat


The statestore does failure detection based on consecutive heartbeat failures. 
By default the threshold is 10 missed heartbeats 
(statestore_max_missed_heartbeats) at 1-second intervals 
(statestore_heartbeat_frequency_ms). Detection can nevertheless take much 
longer than 10 seconds overall, especially if the statestore is busy, because 
each failed attempt also waits out the RPC timeout.

In the following example it took 50 seconds for failure detection:
{code:java}
I0817 12:32:06.824721    86 statestore.cc:1157] Unable to send heartbeat 
message to subscriber 
impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010,
 received error: RPC Error: Client for 10.80.199.159:23000 hit an unexpected 
exception: No more data to read., type: 
N6apache6thrift9transport19TTransportExceptionE, rpc: 
N6impala18THeartbeatResponseE, send: done
I0817 12:32:06.824741    86 failure-detector.cc:91] 1 consecutive heartbeats 
failed for 
'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'.
 State is OK
.
.
.
I0817 12:32:56.800251    83 statestore.cc:1157] Unable to send heartbeat 
message to subscriber 
impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010,
 received error: RPC Error: Client for 10.80.199.159:23000 hit an unexpected 
exception: No more data to read., type: 
N6apache6thrift9transport19TTransportExceptionE, rpc: 
N6impala18THeartbeatResponseE, send: done 
I0817 12:32:56.800267    83 failure-detector.cc:91] 10 consecutive heartbeats 
failed for 
'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'.
 State is FAILED
I0817 12:32:56.800276    83 statestore.cc:1168] Subscriber 
'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'
 has failed, disconnected or re-registered (last known registration ID: 
c84bf70f03acda2b:b34a812c5e96e687){code}
As a result, there is a window, while the statestore is still determining node 
failure, during which the coordinator might schedule fragments on the affected 
executor(s). The Exec RPC will fail, and if transparent query retry is enabled, 
the coordinator will immediately retry the query, which will fail again.

Ideally in such situations the coordinator should be notified sooner about a 
failed executor. The statestore could send a priority topic update to the 
coordinator when it enters its failure detection logic. This should reduce the 
chances of the coordinator scheduling query fragments on a failed executor.

An alternative would be to tune the heartbeat frequency and interval 
parameters, but it is hard to find a configuration that works for all cases: 
the default values are reasonable in general, yet under certain conditions they 
behave poorly, as in the example above.

It might also make sense to handle specially the case where executors are shut 
down gracefully; in that case the statestore shouldn't go through failure 
detection at all and should instead mark those executors as failed immediately.
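The timing described above can be sketched as follows. This is a simplification of consecutive-heartbeat failure detection, not Impala's implementation, and the RPC timeout value is a hypothetical illustration chosen to reproduce the 50-second detection seen in the log:

```python
STATESTORE_MAX_MISSED_HEARTBEATS = 10    # default, per the description
STATESTORE_HEARTBEAT_FREQUENCY_MS = 1000 # default, per the description
RPC_TIMEOUT_MS = 4000                    # hypothetical per-heartbeat RPC timeout

class FailureDetector:
    """Counts consecutive heartbeat failures; FAILED once the cap is hit."""
    def __init__(self, max_missed):
        self.max_missed = max_missed
        self.consecutive_failures = 0

    def heartbeat(self, ok):
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return "FAILED" if self.consecutive_failures >= self.max_missed else "OK"

det = FailureDetector(STATESTORE_MAX_MISSED_HEARTBEATS)
elapsed_ms = 0
state = "OK"
while state == "OK":
    # Each failed attempt costs the heartbeat interval plus the RPC timeout,
    # which is how detection stretches well past the nominal 10 seconds.
    elapsed_ms += STATESTORE_HEARTBEAT_FREQUENCY_MS + RPC_TIMEOUT_MS
    state = det.heartbeat(ok=False)
print(state, elapsed_ms / 1000.0)        # FAILED after 50.0 s in this model
```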








[jira] [Resolved] (IMPALA-12372) Only use -Wno-deprecated / -Wno-deprecated-declaration for OpenSSL3

2023-08-17 Thread Joe McDonnell (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-12372.

Fix Version/s: Impala 4.3.0
   Resolution: Fixed

> Only use -Wno-deprecated / -Wno-deprecated-declaration for OpenSSL3
> ---
>
> Key: IMPALA-12372
> URL: https://issues.apache.org/jira/browse/IMPALA-12372
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 4.3.0
>Reporter: Joe McDonnell
>Assignee: Joe McDonnell
>Priority: Major
> Fix For: Impala 4.3.0
>
>
> As part of supporting Redhat 9 / Ubuntu 22, those platforms use OpenSSL3 and 
> compilation will produce warnings that fail our build (due to -Werror). The 
> original change turned off those deprecation warnings for all platforms.
> This is overly broad. We should try to turn off those warnings only for 
> platforms that use OpenSSL3. Otherwise, we are blind to other locations that 
> are using deprecated functions. This came up when investigating using 
> googletest 1.12.1 (which deprecated some calls we use).
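A platform-conditional guard along these lines could scope the suppression. This is a hedged CMake sketch under assumed variable placement, not the actual Impala CMakeLists change:

```cmake
# Suppress deprecation warnings only when building against OpenSSL 3, so
# platforms on older OpenSSL still surface deprecated-API use under -Werror.
find_package(OpenSSL REQUIRED)
if(OPENSSL_VERSION VERSION_GREATER_EQUAL "3.0.0")
  add_compile_options(-Wno-deprecated-declarations)
endif()
```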








[jira] [Commented] (IMPALA-11877) Add support for DELETE statements for Iceberg tables

2023-08-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-11877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755701#comment-17755701
 ] 

ASF subversion and git services commented on IMPALA-11877:
--

Commit 12276c79f9975dc63322138ea56290434a49221d in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=12276c79f ]

IMPALA-12335: [DOCS] Add documentation about the DELETE statement

IMPALA-11877 added support for the DELETE statement for Iceberg
tables. This patch documents this feature.

Change-Id: If111a7ecd20bda2d4928332ef2ccd905814cb203
Reviewed-on: http://gerrit.cloudera.org:8080/20361
Reviewed-by: Zoltan Borok-Nagy 
Tested-by: Impala Public Jenkins 


> Add support for DELETE statements for Iceberg tables
> 
>
> Key: IMPALA-11877
> URL: https://issues.apache.org/jira/browse/IMPALA-11877
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend, Frontend
>Reporter: Zoltán Borók-Nagy
>Assignee: Zoltán Borók-Nagy
>Priority: Major
>  Labels: impala-iceberg
> Fix For: Impala 4.3.0
>
>
> Add support for DELETE statements for Iceberg tables.
> We can do it based on the following design doc: 
> https://docs.google.com/document/d/1GuRiJ3jjqkwINsSCKYaWwcfXHzbMrsd3WEMDOB11Xqw/edit#heading=h.5bmfhbmb4qdk
> Limitations:
> * only support merge-on-read
> * only write position delete files



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12372) Only use -Wno-deprecated / -Wno-deprecated-declaration for OpenSSL3

2023-08-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755702#comment-17755702
 ] 

ASF subversion and git services commented on IMPALA-12372:
--

Commit 5d0a2f01a52f2660acc1f0f4b3214ca6ecfa66ce in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5d0a2f01a ]

IMPALA-12372: Only use -Wno-deprecated-declaration for OpenSSL3

Redhat 9 and Ubuntu 22.04 both use OpenSSL3, which deprecated
several APIs that we use. To support those platforms, we added
the -Wno-deprecated-declaration to the build. Historically, the
Impala build has also specified -Wno-deprecated due to
use of deprecated headers in gutils. These flags limit our
ability to notice use of deprecated code in other parts of the
code.

The code in gutils no longer requires -Wno-deprecated, so
this removes it completely. Additionally, this limits the
-Wno-deprecated-declaration flag to machines using
OpenSSL 3.

Reenabling deprecation warnings also reenables Clang Tidy's
clang-diagnostic-deprecated enforcement. This is currently
broken, so this turns off clang-diagnostic-deprecated
until it can be addressed properly.

Testing:
 - Ran build-all-options on Ubuntu 22 and Ubuntu 16
 - Ran a Rocky 9.2 build

Change-Id: I1b36450d084f342eeab5dac2272580ab6b0c988b
Reviewed-on: http://gerrit.cloudera.org:8080/20369
Reviewed-by: Laszlo Gaal 
Reviewed-by: Zoltan Borok-Nagy 
Tested-by: Joe McDonnell 


> Only use -Wno-deprecated / -Wno-deprecated-declaration for OpenSSL3
> ---
>
> Key: IMPALA-12372
> URL: https://issues.apache.org/jira/browse/IMPALA-12372
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 4.3.0
>Reporter: Joe McDonnell
>Assignee: Joe McDonnell
>Priority: Major
>
> As part of supporting Redhat 9 / Ubuntu 22, those platforms use OpenSSL3 and 
> compilation will produce warnings that fail our build (due to -Werror). The 
> original change turned off those deprecation warnings for all platforms.
> This is overly broad. We should try to turn off those warnings only for 
> platforms that use OpenSSL3. Otherwise, we are blind to other locations that 
> are using deprecated functions. This came up when investigating using 
> googletest 1.12.1 (which deprecated some calls we use).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12335) Document Iceberg DELETE

2023-08-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755700#comment-17755700
 ] 

ASF subversion and git services commented on IMPALA-12335:
--

Commit 12276c79f9975dc63322138ea56290434a49221d in impala's branch 
refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=12276c79f ]

IMPALA-12335: [DOCS] Add documentation about the DELETE statement

IMPALA-11877 added support for the DELETE statement for Iceberg
tables. This patch documents this feature.

Change-Id: If111a7ecd20bda2d4928332ef2ccd905814cb203
Reviewed-on: http://gerrit.cloudera.org:8080/20361
Reviewed-by: Zoltan Borok-Nagy 
Tested-by: Impala Public Jenkins 


> Document Iceberg DELETE
> ---
>
> Key: IMPALA-12335
> URL: https://issues.apache.org/jira/browse/IMPALA-12335
> Project: IMPALA
>  Issue Type: Documentation
>Reporter: Zoltán Borók-Nagy
>Assignee: Zoltán Borók-Nagy
>Priority: Major
>
> Document DELETE support for Iceberg V2 tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12381) Add jdbc related properties to JDBC data source object

2023-08-17 Thread Wenzhe Zhou (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755696#comment-17755696
 ] 

Wenzhe Zhou commented on IMPALA-12381:
--

In IMPALA-12378, the JDBC external data source library will be shipped in the 
Impala package, so we won't need to specify the location and class for the JDBC 
data source.
 

> Add jdbc related properties to JDBC data source object
> --
>
> Key: IMPALA-12381
> URL: https://issues.apache.org/jira/browse/IMPALA-12381
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Frontend
>Reporter: Wenzhe Zhou
>Priority: Major
>
> Currently, JDBC-related properties are specified as table properties when 
> creating a table, as below:
> CREATE TABLE alltypes_jdbc_datasource (
>  id INT, name STRING)
> PRODUCED BY DATA SOURCE JdbcDataSource (
> '{"database.type":"POSTGRES",
> "jdbc.url":"jdbc:postgresql://localhost:5432/functional",
> "jdbc.driver":"org.postgresql.Driver",
> "dbcp.username":"hiveuser",
> "dbcp.password":"password",
> "table":"alltypes"}');
> It's more convenient to move the JDBC-related properties to the data source 
> object, as below, so that users don't need to specify those properties for each table.
> CREATE DATA SOURCE JdbcDataSource
> LOCATION '/test-warehouse/data-sources/jdbc-data-source.jar'
> CLASS 'org.apache.impala.extdatasource.jdbc.JdbcDataSource'
> DATABASE-TYPE 'POSTGRES'
> JDBC-URL 'jdbc:postgresql://localhost:5432/functional'
> JDBC-DRIVER 'org.postgresql.Driver'
> DBCP-USERNAME 'hiveuser'
> DBCP-PASSWORD 'password'
> API_VERSION 'V1';  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-12381) Add jdbc related properties to JDBC data source object

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12381:


 Summary: Add jdbc related properties to JDBC data source object
 Key: IMPALA-12381
 URL: https://issues.apache.org/jira/browse/IMPALA-12381
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend, Frontend
Reporter: Wenzhe Zhou


Currently, JDBC-related properties are specified as table properties when 
creating a table, as below:

CREATE TABLE alltypes_jdbc_datasource (
 id INT, name STRING)
PRODUCED BY DATA SOURCE JdbcDataSource (
'{"database.type":"POSTGRES",
"jdbc.url":"jdbc:postgresql://localhost:5432/functional",
"jdbc.driver":"org.postgresql.Driver",
"dbcp.username":"hiveuser",
"dbcp.password":"password",
"table":"alltypes"}');

It's more convenient to move the JDBC-related properties to the data source object, as 
below, so that users don't need to specify those properties for each table.

CREATE DATA SOURCE JdbcDataSource
LOCATION '/test-warehouse/data-sources/jdbc-data-source.jar'
CLASS 'org.apache.impala.extdatasource.jdbc.JdbcDataSource'
DATABASE-TYPE 'POSTGRES'
JDBC-URL 'jdbc:postgresql://localhost:5432/functional'
JDBC-DRIVER 'org.postgresql.Driver'
DBCP-USERNAME 'hiveuser'
DBCP-PASSWORD 'password'
API_VERSION 'V1';  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org





[jira] [Created] (IMPALA-12380) Securing dbcp.password for JDBC external data source

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12380:


 Summary: Securing dbcp.password for JDBC external data source
 Key: IMPALA-12380
 URL: https://issues.apache.org/jira/browse/IMPALA-12380
 Project: IMPALA
  Issue Type: Sub-task
Reporter: Wenzhe Zhou


In the first patch of the JDBC external data source 
(https://gerrit.cloudera.org/#/c/17842/), 
"dbcp.password" is provided as clear text in the table properties. We should 
allow users to store the password in a Java keystore file on HDFS and restrict 
access to the keystore file to authorized users.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)





[jira] [Updated] (IMPALA-12377) Improve 'select count(*)' performance for external data source

2023-08-17 Thread Wenzhe Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenzhe Zhou updated IMPALA-12377:
-
Summary: Improve 'select count(*)' performance for external data source  
(was: Improve 'select count(*)' for external data source)

> Improve 'select count(*)' performance for external data source
> --
>
> Key: IMPALA-12377
> URL: https://issues.apache.org/jira/browse/IMPALA-12377
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Frontend
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> The code to handle 'select count(*)' in the backend function 
> DataSourceScanNode::GetNext() is not efficient. Even when no column 
> data is returned from the external data source, it still tries to materialize rows and 
> add them to the RowBatch one by one, up to the row count. It also calls 
> GetNextInputBatch() multiple times (count / batch_size), and each 
> GetNextInputBatch() call invokes a JNI function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-12379) Detect available jdbc drivers without restarting Impala

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12379:


 Summary: Detect available jdbc drivers without restarting Impala
 Key: IMPALA-12379
 URL: https://issues.apache.org/jira/browse/IMPALA-12379
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend
Reporter: Wenzhe Zhou


The JDBC external data source should be able to detect any JDBC driver jars in the 
classpath (including MySQL, Postgres, Impala, Oracle, etc.) without restarting.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org





[jira] [Created] (IMPALA-12378) Auto Ship JDBC external data source

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12378:


 Summary: Auto Ship JDBC external data source
 Key: IMPALA-12378
 URL: https://issues.apache.org/jira/browse/IMPALA-12378
 Project: IMPALA
  Issue Type: Sub-task
  Components: Frontend, Infrastructure
Reporter: Wenzhe Zhou


The JDBC external data source library should be shipped automatically in the 
Impala binaries so that users don't need to add the jar file manually. JDBC 
driver jars, however, are still provided by the user.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org





[jira] [Assigned] (IMPALA-12375) DataSource objects are not persistent

2023-08-17 Thread Wenzhe Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenzhe Zhou reassigned IMPALA-12375:


Assignee: Wenzhe Zhou

> DataSource objects are not persistent
> 
>
> Key: IMPALA-12375
> URL: https://issues.apache.org/jira/browse/IMPALA-12375
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Catalog, Frontend
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> DataSource objects created with "CREATE DATA SOURCE" statements are 
> not persistent. The objects are not shown by "show data sources" after the 
> mini-cluster is restarted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Assigned] (IMPALA-12377) Improve 'select count(*)' for external data source

2023-08-17 Thread Wenzhe Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenzhe Zhou reassigned IMPALA-12377:


Assignee: Wenzhe Zhou

> Improve 'select count(*)' for external data source
> --
>
> Key: IMPALA-12377
> URL: https://issues.apache.org/jira/browse/IMPALA-12377
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Frontend
>Reporter: Wenzhe Zhou
>Assignee: Wenzhe Zhou
>Priority: Major
>
> The code to handle 'select count(*)' in the backend function 
> DataSourceScanNode::GetNext() is not efficient. Even when no column 
> data is returned from the external data source, it still tries to materialize rows and 
> add them to the RowBatch one by one, up to the row count. It also calls 
> GetNextInputBatch() multiple times (count / batch_size), and each 
> GetNextInputBatch() call invokes a JNI function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-12377) Improve 'select count(*)' for external data source

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12377:


 Summary: Improve 'select count(*)' for external data source
 Key: IMPALA-12377
 URL: https://issues.apache.org/jira/browse/IMPALA-12377
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend, Frontend
Reporter: Wenzhe Zhou


The code to handle 'select count(*)' in the backend function 
DataSourceScanNode::GetNext() is not efficient. Even when no column data is 
returned from the external data source, it still tries to materialize rows and add 
them to the RowBatch one by one, up to the row count. It also calls 
GetNextInputBatch() multiple times (count / batch_size), and each 
GetNextInputBatch() call invokes a JNI function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12375) DataSource objects are not persistent

2023-08-17 Thread Wenzhe Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenzhe Zhou updated IMPALA-12375:
-
Component/s: Catalog

> DataSource objects are not persistent
> 
>
> Key: IMPALA-12375
> URL: https://issues.apache.org/jira/browse/IMPALA-12375
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Backend, Catalog, Frontend
>Reporter: Wenzhe Zhou
>Priority: Major
>
> DataSource objects created with "CREATE DATA SOURCE" statements are 
> not persistent. The objects are not shown by "show data sources" after the 
> mini-cluster is restarted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org





[jira] [Created] (IMPALA-12376) DataSourceScanNode drop some returned rows if FLAGS_data_source_batch_size is greater than default value

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12376:


 Summary: DataSourceScanNode drop some returned rows if 
FLAGS_data_source_batch_size is greater than default value
 Key: IMPALA-12376
 URL: https://issues.apache.org/jira/browse/IMPALA-12376
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend
Reporter: Wenzhe Zhou
Assignee: Wenzhe Zhou


Backend DataSourceScanNode (be/src/exec/data-source-scan-node.cc) does not 
handle eos properly in DataSourceScanNode::GetNext(). Rows returned from the 
external data source can be dropped if FLAGS_data_source_batch_size is set to a 
value greater than the default of 1024.

In the following code:
  if (row_batch->AtCapacity() || input_batch_->eos || ReachedLimit()) {
*eos = input_batch_->eos || ReachedLimit();
eos can be set to true while some rows in the input batch are still unprocessed, 
because row_batch->AtCapacity() can return true first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org





[jira] [Created] (IMPALA-12375) DataSource objects are not persistent

2023-08-17 Thread Wenzhe Zhou (Jira)
Wenzhe Zhou created IMPALA-12375:


 Summary: DataSource objects are not persistent
 Key: IMPALA-12375
 URL: https://issues.apache.org/jira/browse/IMPALA-12375
 Project: IMPALA
  Issue Type: Sub-task
  Components: Backend, Frontend
Reporter: Wenzhe Zhou


DataSource objects created with "CREATE DATA SOURCE" statements are 
not persistent. The objects are not shown by "show data sources" after the 
mini-cluster is restarted.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org





[jira] [Updated] (IMPALA-5741) External JDBC Read Support

2023-08-17 Thread Wenzhe Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenzhe Zhou updated IMPALA-5741:

Summary: External JDBC Read Support  (was: JDBC storage handler)

> External JDBC Read Support
> --
>
> Key: IMPALA-5741
> URL: https://issues.apache.org/jira/browse/IMPALA-5741
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Distributed Exec
>Reporter: Istvan Vajnorak
>Assignee: Wenzhe Zhou
>Priority: Major
>
> In Hive there is a generic JDBC storage handler that would be beneficial to 
> replicate in Impala. There are several workloads out there that could 
> make good use of it.
> The Hive version of the handler is tracked under:
> https://issues.apache.org/jira/browse/HIVE-1555
> Please evaluate the possibility of including this in the roadmap at some 
> point.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-12374) Explore optimizing re2 usage for leading / trailing ".*" when generating LIKE regex

2023-08-17 Thread Joe McDonnell (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell updated IMPALA-12374:
---
Summary: Explore optimizing re2 usage for leading / trailing ".*" when 
generating LIKE regex  (was: Explore optimizing re2 usage for leading / 
trailing ".*")

> Explore optimizing re2 usage for leading / trailing ".*" when generating LIKE 
> regex
> ---
>
> Key: IMPALA-12374
> URL: https://issues.apache.org/jira/browse/IMPALA-12374
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 4.3.0
>Reporter: Joe McDonnell
>Priority: Major
>
> Abseil has some recommendations about efficiently using re2 here: 
> [https://abseil.io/fast/21]
> One recommendation it has is to avoid leading / trailing .* for FullMatch():
> {noformat}
> Using RE2::FullMatch() with leading or trailing .* is an antipattern. 
> Instead, change it to RE2::PartialMatch() and remove the .*. 
> RE2::PartialMatch() performs an unanchored search, so it is also necessary to 
> anchor the regular expression (i.e. with ^ or $) to indicate that it must 
> match at the start or end of the string.{noformat}
> For our slow path LIKE evaluation, we convert the LIKE to a regular 
> expression and use FullMatch(). Our code to generate the regular expression 
> will use leading/trailing .* and FullMatch for patterns like '%a%b%'. We 
> could try detecting these cases and switching to PartialMatch with anchors. 
> See the link for more details about how this works.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-12373) Implement Small String Optimization for StringValue

2023-08-17 Thread Daniel Becker (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755610#comment-17755610
 ] 

Daniel Becker commented on IMPALA-12373:


Great idea!
Theoretically, reading the inactive member of a union is UB, so we can't use 
{{rep.small_rep.len}} in {{is_small()}}. We could use memcpy to get the last 
byte.
Although cppreference says
{code:java}
Many compilers implement, as a non-standard language extension, the ability to 
read inactive members of a union.
{code}
https://en.cppreference.com/w/cpp/language/union



> Implement Small String Optimization for StringValue
> ---
>
> Key: IMPALA-12373
> URL: https://issues.apache.org/jira/browse/IMPALA-12373
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Priority: Major
>  Labels: Performance
> Attachments: small_string.cpp
>
>
> Implement Small String Optimization for StringValue.
> Current memory layout of StringValue is:
> {noformat}
>   char* ptr;  // 8 byte
>   int len;// 4 byte
> {noformat}
> For small strings with size up to 8 we could store the string contents in the 
> bytes of the 'ptr'. Something like that:
> {noformat}
>   union {
> char* ptr;
> char small_buf[sizeof(ptr)];
>   };
>   int len;
> {noformat}
> Many C++ string implementations use the {{Small String Optimization}} to 
> speed up work with small strings. For example:
> {code:java}
> Microsoft STL, libstdc++, libc++, Boost, Folly.{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Created] (IMPALA-12374) Explore optimizing re2 usage for leading / trailing ".*"

2023-08-17 Thread Joe McDonnell (Jira)
Joe McDonnell created IMPALA-12374:
--

 Summary: Explore optimizing re2 usage for leading / trailing ".*"
 Key: IMPALA-12374
 URL: https://issues.apache.org/jira/browse/IMPALA-12374
 Project: IMPALA
  Issue Type: Improvement
  Components: Backend
Affects Versions: Impala 4.3.0
Reporter: Joe McDonnell


Abseil has some recommendations about efficiently using re2 here: 
[https://abseil.io/fast/21]

One recommendation it has is to avoid leading / trailing .* for FullMatch():
{noformat}
Using RE2::FullMatch() with leading or trailing .* is an antipattern. Instead, 
change it to RE2::PartialMatch() and remove the .*. RE2::PartialMatch() 
performs an unanchored search, so it is also necessary to anchor the regular 
expression (i.e. with ^ or $) to indicate that it must match at the start or 
end of the string.{noformat}
For our slow path LIKE evaluation, we convert the LIKE to a regular expression 
and use FullMatch(). Our code to generate the regular expression will use 
leading/trailing .* and FullMatch for patterns like '%a%b%'. We could try 
detecting these cases and switching to PartialMatch with anchors. See the link 
for more details about how this works.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)





[jira] [Updated] (IMPALA-12373) Implement Small String Optimization for StringValue

2023-08-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/IMPALA-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-12373:
---
Labels: Performance  (was: )

> Implement Small String Optimization for StringValue
> ---
>
> Key: IMPALA-12373
> URL: https://issues.apache.org/jira/browse/IMPALA-12373
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Priority: Major
>  Labels: Performance
> Attachments: small_string.cpp
>
>
> Implement Small String Optimization for StringValue.
> Current memory layout of StringValue is:
> {noformat}
>   char* ptr;  // 8 byte
>   int len;// 4 byte
> {noformat}
> For small strings with size up to 8 we could store the string contents in the 
> bytes of the 'ptr'. Something like that:
> {noformat}
>   union {
> char* ptr;
> char small_buf[sizeof(ptr)];
>   };
>   int len;
> {noformat}
> Many C++ string implementations use the {{Small String Optimization}} to 
> speed up work with small strings. For example:
> {code:java}
> Microsoft STL, libstdc++, libc++, Boost, Folly.{code}






[jira] [Commented] (IMPALA-12373) Implement Small String Optimization for StringValue

2023-08-17 Thread Jira


[ 
https://issues.apache.org/jira/browse/IMPALA-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1773#comment-1773
 ] 

Zoltán Borók-Nagy commented on IMPALA-12373:


I think we don't need NULL termination, so we can actually store 11 chars with 
libc++'s technique.

I uploaded a simple implementation that works on little-endian architectures: 
[^small_string.cpp]

It uses the following representations:
{noformat}
  static constexpr int SMALL_LIMIT = 11;

  struct SmallStringRep {
char buf[SMALL_LIMIT];
char len;
  };
  
  struct __attribute__((__packed__)) LongStringRep {
char* ptr;
unsigned int len;
  };

  static_assert(sizeof(SmallStringRep) == sizeof(LongStringRep));

  union {
SmallStringRep small_rep;
LongStringRep long_rep;
  } rep;
{noformat}
The small-string indicator bit is stored in the MSB of the last byte 
(small_rep.len). This works on little-endian architectures because that byte 
also holds the MSB of long_rep.len. On big-endian architectures we would still 
use the last byte of course, but we would have to use the LSB of small_rep.len 
(which would also be the LSB of long_rep.len).

We can spare one bit of the length because Impala puts a 2 GB hard limit on string length: 
[https://impala.apache.org/docs/build/html/topics/impala_string.html]
(Otherwise we could swap the order of ptr and len in LongStringRep and use the 
highest bit of the ptr, which is unused on 64-bit architectures.)

On little-endian architectures we can get the len with masking:
{noformat}
  bool is_small() {
    return rep.small_rep.len & 0b10000000;
  }

  int len() {
    if (is_small()) {
      return rep.small_rep.len & 0b01111111;
    } else {
      return rep.long_rep.len;
    }
  }
{noformat}
On big-endian architectures we would extract the len with a bit-shift instead.
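To illustrate the little-endian tagging trick, here is a toy model in Python 
that packs the two 12-byte representations with struct (the names make_small, 
make_long, etc. are illustrative only, not the proposed StringValue API):

```python
import struct

SMALL_LIMIT = 11

def make_small(s: bytes) -> bytes:
    # 11 content bytes + 1 tag/len byte; the MSB of the final byte marks "small".
    assert len(s) <= SMALL_LIMIT
    buf = s.ljust(SMALL_LIMIT, b"\x00")
    return buf + bytes([0b10000000 | len(s)])

def make_long(ptr: int, length: int) -> bytes:
    # Packed char* (8 bytes) + unsigned int len (4 bytes), little-endian.
    assert length < 2**31  # the 2 GB limit leaves the MSB of len free
    return struct.pack("<QI", ptr, length)

def is_small(rep: bytes) -> bool:
    # Byte 11 is the last byte of the struct; on little-endian layouts its
    # MSB is also the MSB of long_rep.len, so one check covers both cases.
    return bool(rep[11] & 0b10000000)

def length(rep: bytes) -> int:
    if is_small(rep):
        return rep[11] & 0b01111111
    return struct.unpack_from("<I", rep, 8)[0]

small = make_small(b"hello")
long_ = make_long(0xDEADBEEF, 1_000_000)
assert is_small(small) and not is_small(long_)
assert length(small) == 5 and length(long_) == 1_000_000
```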

> Implement Small String Optimization for StringValue
> ---
>
> Key: IMPALA-12373
> URL: https://issues.apache.org/jira/browse/IMPALA-12373
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Priority: Major
> Attachments: small_string.cpp
>
>
> Implement Small String Optimization for StringValue.
> Current memory layout of StringValue is:
> {noformat}
>   char* ptr;  // 8 byte
>   int len;// 4 byte
> {noformat}
> For small strings with size up to 8 we could store the string contents in the 
> bytes of the 'ptr'. Something like that:
> {noformat}
>   union {
> char* ptr;
> char small_buf[sizeof(ptr)];
>   };
>   int len;
> {noformat}
> Many C++ string implementations use the {{Small String Optimization}} to 
> speed up work with small strings. For example:
> {code:java}
> Microsoft STL, libstdc++, libc++, Boost, Folly.{code}






[jira] [Updated] (IMPALA-12373) Implement Small String Optimization for StringValue

2023-08-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/IMPALA-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-12373:
---
Attachment: small_string.cpp

> Implement Small String Optimization for StringValue
> ---
>
> Key: IMPALA-12373
> URL: https://issues.apache.org/jira/browse/IMPALA-12373
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Zoltán Borók-Nagy
>Priority: Major
> Attachments: small_string.cpp
>
>
> Implement Small String Optimization for StringValue.
> Current memory layout of StringValue is:
> {noformat}
>   char* ptr;  // 8 byte
>   int len;// 4 byte
> {noformat}
> For small strings with size up to 8 we could store the string contents in the 
> bytes of the 'ptr'. Something like that:
> {noformat}
>   union {
> char* ptr;
> char small_buf[sizeof(ptr)];
>   };
>   int len;
> {noformat}
> Many C++ string implementations use the {{Small String Optimization}} to 
> speed up work with small strings. For example:
> {code:java}
> Microsoft STL, libstdc++, libc++, Boost, Folly.{code}






[jira] [Commented] (IMPALA-12347) Cumulated floating point error in window functions

2023-08-17 Thread Daniel Becker (Jira)


[ 
https://issues.apache.org/jira/browse/IMPALA-12347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755492#comment-17755492
 ] 

Daniel Becker commented on IMPALA-12347:


Nice catch [~pranav.lodha] and [~stigahuang].

In addition to floating-point rounding errors, I presume this error could also 
happen with very big numbers near the maximum value of the floating-point type: 
the aggregate value may become INF, and we can never get back to finite values 
even if the subsequent numbers are small. One way to handle it would be to 
switch to the Hive pattern (at least temporarily) when we see infinite or NaN 
values.

We should note that switching to the Hive pattern is only useful for imprecise 
types, i.e. FLOAT and DOUBLE. For other types, such as integers, the operations 
are precise _except_ for signed overflow, which is undefined behaviour, so the 
Hive pattern wouldn't help there (once we've run into undefined behaviour, the 
behaviour of the whole program is undefined).

The new behaviour could be behind a query option because:
* if someone relies on the old behaviour they can get it back
* if someone needs correct (or more precise) results even for large windows 
they can have it in exchange for worse performance
* debugging the two patterns (Hive vs. old Impala) is easier if we can control 
it directly.
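The INF stickiness is easy to see with a toy sliding-window sum (plain Python, 
not Impala code):

```python
big = 1e308  # near the maximum finite double

# Incremental remove/update scheme (Impala-style): one running state.
running = 0.0
running += big               # row 1 enters the window
running += big               # row 2 enters -> the sum overflows to inf
running -= big               # row 1 leaves, but inf is sticky
assert running == float("inf")

# Recompute-from-scratch scheme (Hive-style): only the rows currently
# inside the window contribute, so the result stays finite.
window = [big]               # only row 2 remains in the window
assert sum(window) == big
```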

> Cumulated floating point error in window functions
> --
>
> Key: IMPALA-12347
> URL: https://issues.apache.org/jira/browse/IMPALA-12347
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Quanlong Huang
>Priority: Major
>
> In the development of IMPALA-11957, [~pranav.lodha] found that the following 
> query has a different result than Hive:
> {code:sql}
> select s_store_sk,
>   regr_slope(s_number_employees, s_floor_space)
>   over (partition by s_city order by s_store_sk
> rows between 1 preceding and 1 following)
> from tpcds.store;{code}
> The following query is simpler but can still reproduce the difference:
> {code:sql}
> select regr_slope(a, b) over (order by b rows between 1 preceding and 1 
> following)
> from (values (271 a, 6995995 b), (294, 9294113), (294, 9294113)) v;{code}
> The results in Hive (correct):
> {noformat}
> ++
> |  regr_slope_window_0   |
> ++
> | 1.0008189309687318E-5  |
> | 1.0008189309687323E-5  |
> | NULL   |
> ++ {noformat}
> The results in Impala (last line is wrong):
> {noformat}
> ++
> | regr_slope(a, b) OVER(...) |
> ++
> | 1.00081893097e-05  |
> | 1.00081893097e-05  |
> | 2.13623046875e-05  |
> ++{noformat}
> The last two points are the same, so the slope should be NULL.
> The difference is due to accumulated floating-point error in Impala. The 
> intermediate state of the regression functions consists of double values, 
> which accumulate more error as more computation is performed.
> In Impala, each analytic function has a remove method (remove_fn_) to deal 
> with expiring rows when sliding the window, and an update method to add new 
> rows. In Hive, however, analytic functions don't need remove methods: each 
> time it slides the window, Hive recalculates the analytic function by 
> iterating over all rows in the window from scratch:
> [https://github.com/apache/hive/blob/b9918becd96a52659c6a99b78cf5531c6800b1d3/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/BasePartitionEvaluator.java#L205-L207]
> [https://github.com/apache/hive/blob/b9918becd96a52659c6a99b78cf5531c6800b1d3/ql/src/java/org/apache/hadoop/hive/ql/udf/ptf/BasePartitionEvaluator.java#L230-L240]
> The implementation in Hive can achieve less floating-point computation error 
> since, for the value of each row, the computation happens only on rows inside 
> the window. However, in Impala, to get the value of each row, we need to 
> invoke the remove method to update the intermediate state, then invoke the 
> update method to add the current row. The intermediate state accumulates the 
> floating-point computation error.
> For evaluating analytic functions over small window sizes, maybe we should 
> switch to Hive's pattern to get higher precision.
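The difference between the two strategies described above can be sketched with 
a sliding-window sum (a toy model in plain Python, not the regr_slope code):

```python
# Values chosen so that adding a small value to a huge one loses low-order bits.
rows = [1e16, 1.0, 1.0, 1.0, 1.0]
WINDOW = 2  # current row and one preceding

# Impala-style: one running state; remove the expired row, then add the new one.
incremental = []
state = 0.0
for i, v in enumerate(rows):
    if i >= WINDOW:
        state -= rows[i - WINDOW]   # remove_fn_: expire the old row
    state += v                      # update: add the current row
    incremental.append(state)

# Hive-style: recompute from scratch over the rows inside the window.
from_scratch = [sum(rows[max(0, i - WINDOW + 1):i + 1])
                for i in range(len(rows))]

# 1e16 + 1.0 cannot be represented exactly, so after removing 1e16 the
# incremental state has drifted; the from-scratch value is exact.
assert from_scratch[2] == 2.0
assert incremental[2] != from_scratch[2]
```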






[jira] [Resolved] (IMPALA-11195) Disable SSL session renegotiation

2023-08-17 Thread Jira


 [ 
https://issues.apache.org/jira/browse/IMPALA-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy resolved IMPALA-11195.

Resolution: Fixed

Thanks, Michael, for fixing the webserver SSL bugs.

> Disable SSL session renegotiation
> -
>
> Key: IMPALA-11195
> URL: https://issues.apache.org/jira/browse/IMPALA-11195
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Zoltán Borók-Nagy
>Assignee: Zoltán Borók-Nagy
>Priority: Major
> Fix For: Impala 4.3.0
>
>
> SSL renegotiation has had a couple of CVEs in the past. We should figure out 
> how to disable it.
> Kudu disabled SSL renegotiation in KUDU-1926, so we can do something similar.





