[jira] [Comment Edited] (HIVE-27712) GenericUDAFNumericStatsEvaluator throws NPE

2023-12-05 Thread liang yu (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793539#comment-17793539
 ] 

liang yu edited comment on HIVE-27712 at 12/6/23 7:56 AM:
--

[~zhangbutao] 

Thanks for your comments.


I checked the patch. It deprecates the compute_stats function and uses other 
functions to get the same result, which introduces too many changes.
As I mentioned in my solution and description, this is just a bug in the 
compute_stats function that can be fixed easily. If we make so many new 
changes, we risk introducing more problems and bugs.


was (Author: JIRAUSER299608):
[~zhangbutao] 
I checked the patch. It deprecates the compute_stats function and uses other 
functions to get the same result, which introduces too many changes.
As I mentioned in my solution and description, this is just a bug in the 
compute_stats function that can be fixed easily. If we make so many new 
changes, we risk introducing more problems and bugs.

> GenericUDAFNumericStatsEvaluator throws NPE
> ---
>
> Key: HIVE-27712
> URL: https://issues.apache.org/jira/browse/HIVE-27712
> Project: Hive
>  Issue Type: Bug
>Reporter: liang yu
>Assignee: liang yu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2023-09-19-16-33-49-881.png
>
>
> Using Hadoop 3.3.4 and Hive 3.1.3.
> When I set the following configuration:
> {code:java}
> set hive.groupby.skewindata=true;
> set hive.map.aggr=true; {code}
> and execute a SQL query that involves both a GROUP BY and a JOIN, I get the 
> NullPointerException below:
>  
> !image-2023-09-19-16-33-49-881.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27712) GenericUDAFNumericStatsEvaluator throws NPE

2023-12-05 Thread liang yu (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793539#comment-17793539
 ] 

liang yu commented on HIVE-27712:
-

[~zhangbutao] 
I checked the patch. It deprecates the compute_stats function and uses other 
functions to get the same result, which introduces too many changes.
As I mentioned in my solution and description, this is just a bug in the 
compute_stats function that can be fixed easily. If we make so many new 
changes, we risk introducing more problems and bugs.



[jira] [Updated] (HIVE-27925) HiveConf: unify ConfVars enum and use underscore for better readability

2023-12-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-27925:
--
Labels: pull-request-available  (was: )

> HiveConf: unify ConfVars enum and use underscore for better readability 
> 
>
> Key: HIVE-27925
> URL: https://issues.apache.org/jira/browse/HIVE-27925
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Kokila N
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When I read something like 
> "[BASICSTATSTASKSMAXTHREADSFACTOR|https://github.com/apache/hive/blob/70f34e27349dccf5fabbfc6c63e63c7be0785360/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L753];
>  I feel someone in the world laughs out loud thinking of me struggling. I can 
> read it, but I hate it :) imagine what if we have vars like 
> [HIVE_MATERIALIZED_VIEW_ENABLE_AUTO_REWRITING_SUBQUERY_SQL|https://github.com/apache/hive/blob/70f34e27349dccf5fabbfc6c63e63c7be0785360/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1921]
>  without underscores...okay, let me help, it is: 
> HIVEMATERIALIZEDVIEWENABLEAUTOREWRITINGSUBQUERYSQL :D
> please let's fix this in 4.0.0
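
As a rough illustration of the naming convention requested above, the sketch below derives an underscore-separated constant name from a dotted property key. The class ConfVarNames and the helper toEnumName are hypothetical and are not part of HiveConf; the key is only sample input taken from a setting quoted elsewhere in this archive.

```java
import java.util.Locale;

// Hypothetical helper, not part of HiveConf: derives a readable,
// underscore-separated constant name from a dotted configuration key.
final class ConfVarNames {
    private ConfVarNames() {}

    // Upper-cases the key and replaces '.' and '-' separators with '_'.
    static String toEnumName(String propertyKey) {
        return propertyKey.toUpperCase(Locale.ROOT)
                .replace('.', '_')
                .replace('-', '_');
    }

    public static void main(String[] args) {
        // Sample input only; the resulting name is illustrative.
        System.out.println(toEnumName("hive.materializedview.rewriting"));
    }
}
```

Keeping one separator per word boundary of the key is what makes the constant readable at a glance, which is the point of the request.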





[jira] [Work started] (HIVE-27925) HiveConf: unify ConfVars enum and use underscore for better readability

2023-12-05 Thread Kokila N (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-27925 started by Kokila N.
---
> HiveConf: unify ConfVars enum and use underscore for better readability 
> 
>
> Key: HIVE-27925
> URL: https://issues.apache.org/jira/browse/HIVE-27925
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Kokila N
>Priority: Major
> Fix For: 4.0.0
>
>
> When I read something like 
> "[BASICSTATSTASKSMAXTHREADSFACTOR|https://github.com/apache/hive/blob/70f34e27349dccf5fabbfc6c63e63c7be0785360/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L753];
>  I feel someone in the world laughs out loud thinking of me struggling. I can 
> read it, but I hate it :) imagine what if we have vars like 
> [HIVE_MATERIALIZED_VIEW_ENABLE_AUTO_REWRITING_SUBQUERY_SQL|https://github.com/apache/hive/blob/70f34e27349dccf5fabbfc6c63e63c7be0785360/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L1921]
>  without underscores...okay, let me help, it is: 
> HIVEMATERIALIZEDVIEWENABLEAUTOREWRITINGSUBQUERYSQL :D
> please let's fix this in 4.0.0





[jira] [Work started] (HIVE-27556) Add Unit Test for KafkaStorageHandlerInfo

2023-12-05 Thread Kokila N (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-27556 started by Kokila N.
---
> Add Unit Test for KafkaStorageHandlerInfo
> -
>
> Key: HIVE-27556
> URL: https://issues.apache.org/jira/browse/HIVE-27556
> Project: Hive
>  Issue Type: Test
>  Components: kafka integration, StorageHandler
>Reporter: Kokila N
>Assignee: Kokila N
>Priority: Major
>  Labels: pull-request-available
>
> Adding unit tests for KafkaStorageHandlerInfo.





[jira] [Commented] (HIVE-27894) Enhance HMS Handler Logs for all 'get_partition' functions.

2023-12-05 Thread Chinna Rao Lalam (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793502#comment-17793502
 ] 

Chinna Rao Lalam commented on HIVE-27894:
-

Merged to master !! Thanks for the patch [~shivijha30] 

> Enhance HMS Handler Logs for all 'get_partition' functions.
> ---
>
> Key: HIVE-27894
> URL: https://issues.apache.org/jira/browse/HIVE-27894
> Project: Hive
>  Issue Type: Improvement
>Reporter: Shivangi Jha
>Assignee: Shivangi Jha
>Priority: Major
>  Labels: pull-request-available
>
> The HMSHandler 
> (standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java)
>  class encompasses various functions pertaining to partition information, yet 
> its current implementation lacks comprehensive logging of substantial 
> partition data. Enhancing this aspect would significantly contribute to 
> improved log readability and facilitate more effective debugging processes.





[jira] [Resolved] (HIVE-27894) Enhance HMS Handler Logs for all 'get_partition' functions.

2023-12-05 Thread Chinna Rao Lalam (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chinna Rao Lalam resolved HIVE-27894.
-
Fix Version/s: 4.1.0
   Resolution: Fixed

> Enhance HMS Handler Logs for all 'get_partition' functions.
> ---
>
> Key: HIVE-27894
> URL: https://issues.apache.org/jira/browse/HIVE-27894
> Project: Hive
>  Issue Type: Improvement
>Reporter: Shivangi Jha
>Assignee: Shivangi Jha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> The HMSHandler 
> (standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java)
>  class encompasses various functions pertaining to partition information, yet 
> its current implementation lacks comprehensive logging of substantial 
> partition data. Enhancing this aspect would significantly contribute to 
> improved log readability and facilitate more effective debugging processes.





[jira] [Commented] (HIVE-27894) Enhance HMS Handler Logs for all 'get_partition' functions.

2023-12-05 Thread Chinna Rao Lalam (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793500#comment-17793500
 ] 

Chinna Rao Lalam commented on HIVE-27894:
-

+1 LGTM



[jira] [Work started] (HIVE-27924) Incremental rebuild goes wrong when inserts and deletes overlap between the source tables

2023-12-05 Thread Krisztian Kasa (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-27924 started by Krisztian Kasa.
-
> Incremental rebuild goes wrong when inserts and deletes overlap between the 
> source tables
> -
>
> Key: HIVE-27924
> URL: https://issues.apache.org/jira/browse/HIVE-27924
> Project: Hive
>  Issue Type: Bug
>  Components: Materialized views
>Affects Versions: 4.0.0-beta-1
> Environment: * Docker version : 19.03.6
>  * Hive version : 4.0.0-beta-1
>  * Driver version : Hive JDBC (4.0.0-beta-1)
>  * Beeline version : 4.0.0-beta-1
>Reporter: Wenhao Li
>Assignee: Krisztian Kasa
>Priority: Critical
>  Labels: bug, hive, materializedviews
> Attachments: 截图.PNG, 截图1.PNG, 截图2.PNG, 截图3.PNG, 截图4.PNG, 截图5.PNG, 
> 截图6.PNG, 截图7.PNG, 截图8.PNG, 截图9.PNG
>
>
> h1. Summary
> The incremental rebuild plan and execution output are incorrect when one side 
> of the table join has inserted/deleted join keys that the other side has 
> deleted/inserted (note the order).
> The argument is that tuples that have never been present simultaneously 
> should not interact with one another, i.e., one's inserts should not join the 
> other's deletes.
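
The argument in the summary can be sketched with a toy model. This is not Hive's rebuild algorithm; the class and method names are hypothetical, and rows are reduced to name labels that all share join key a = 1, mirroring the data used in the reproduction steps. Deleted right-side rows are paired only with left rows that coexisted with them, and inserted left-side rows are paired only with right rows that still exist.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the summary's argument; not Hive's rebuild algorithm.
final class IncrementalJoinSketch {
    private IncrementalJoinSketch() {}

    // All rows share join key a = 1, so the join is a cross product of labels.
    static List<String> join(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(l + "|" + r);
            }
        }
        return out;
    }

    // Applies the deletes, then the inserts, to the old view contents.
    static List<String> rebuild(List<String> leftOld, List<String> leftInserted,
                                List<String> rightOld, List<String> rightDeleted) {
        List<String> rightNew = new ArrayList<>(rightOld);
        rightNew.removeAll(rightDeleted);

        List<String> view = join(leftOld, rightOld);
        // Deleted right rows pair only with left rows that coexisted with them.
        for (String gone : join(leftOld, rightDeleted)) {
            view.remove(gone);
        }
        // Inserted left rows pair only with right rows that still exist.
        view.addAll(join(leftInserted, rightNew));
        return view;
    }

    public static void main(String[] args) {
        List<String> view = rebuild(
                List.of("alfred", "charlie"), List.of("kevin"),
                List.of("bob", "bonnie"), List.of("bonnie"));
        System.out.println(view);
        // The pairing that should never be produced: an insert joined to a delete.
        System.out.println(join(List.of("kevin"), List.of("bonnie")));
    }
}
```

In this model the rebuilt view holds alfred|bob, charlie|bob and kevin|bob, the same as a full recomputation over the new table contents. The pair kevin|bonnie was never in the view, so treating it as a row to delete or insert would be wrong.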
> h1. Related Test Case
> The bug was discovered during replication of the test case:
> ??hive/ql/src/test/queries/clientpositive/materialized_view_create_rewrite_5.q??
> h1. Steps to Reproduce the Issue
>  # Configurations:
> {code:sql}
> SET hive.vectorized.execution.enabled=false;
> set hive.support.concurrency=true;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.strict.checks.cartesian.product=false;
> set hive.materializedview.rewriting=true;{code}
>  # 
> {code:sql}
> create table cmv_basetable_n6 (a int, b varchar(256), c decimal(10,2), d int) 
> stored as orc TBLPROPERTIES ('transactional'='true'); {code}
>  # 
> {code:sql}
> insert into cmv_basetable_n6 values
> (1, 'alfred', 10.30, 2),
> (1, 'charlie', 20.30, 2); {code}
>  # 
> {code:sql}
> create table cmv_basetable_2_n3 (a int, b varchar(256), c decimal(10,2), d 
> int) stored as orc TBLPROPERTIES ('transactional'='true'); {code}
>  # 
> {code:sql}
> insert into cmv_basetable_2_n3 values
> (1, 'bob', 30.30, 2),
> (1, 'bonnie', 40.30, 2);{code}
>  # 
> {code:sql}
> CREATE MATERIALIZED VIEW cmv_mat_view_n6 TBLPROPERTIES 
> ('transactional'='true') AS
> SELECT cmv_basetable_n6.a, cmv_basetable_2_n3.c
> FROM cmv_basetable_n6 JOIN cmv_basetable_2_n3 ON (cmv_basetable_n6.a = 
> cmv_basetable_2_n3.a)
> WHERE cmv_basetable_2_n3.c > 10.0;{code}
>  # 
> {code:sql}
> show tables; {code}
> !截图.PNG!
>  # Select tuples, including deleted rows and the virtual columns, from the MV 
> and source tables. We see that the MV is correctly built upon creation:
> {code:sql}
> SELECT ROW__IS__DELETED, ROW__ID, * FROM 
> cmv_mat_view_n6('acid.fetch.deleted.rows'='true');{code}
> !截图1.PNG!
>  # 
> {code:sql}
> SELECT ROW__IS__DELETED, ROW__ID, * FROM 
> cmv_basetable_n6('acid.fetch.deleted.rows'='true'); {code}
> !截图2.PNG!
>  # 
> {code:sql}
> SELECT ROW__IS__DELETED, ROW__ID, * FROM 
> cmv_basetable_2_n3('acid.fetch.deleted.rows'='true'); {code}
> !截图3.PNG!
>  # Now make an insert to the LHS and a delete to the RHS source table:
> {code:sql}
> insert into cmv_basetable_n6 values
> (1, 'kevin', 50.30, 2);
> DELETE FROM cmv_basetable_2_n3 WHERE b = 'bonnie';{code}
>  # Select again to see what happened:
> {code:sql}
> SELECT ROW__IS__DELETED, ROW__ID, * FROM 
> cmv_basetable_n6('acid.fetch.deleted.rows'='true'); {code}
> !截图4.PNG!
>  # 
> {code:sql}
> SELECT ROW__IS__DELETED, ROW__ID, * FROM 
> cmv_basetable_2_n3('acid.fetch.deleted.rows'='true'); {code}
> !截图5.PNG!
>  # Use {{EXPLAIN CBO}} to produce the incremental rebuild plan for the MV, 
> which is already incorrect:
> {code:sql}
> EXPLAIN CBO
> ALTER MATERIALIZED VIEW cmv_mat_view_n6 REBUILD; {code}
> !截图6.PNG!
>  # Rebuild MV and see (incorrect) results:
> {code:sql}
> ALTER MATERIALIZED VIEW cmv_mat_view_n6 REBUILD;
> SELECT ROW__IS__DELETED, ROW__ID, * FROM 
> cmv_mat_view_n6('acid.fetch.deleted.rows'='true');{code}
> !截图7.PNG!
>  # Run MV definition directly, which outputs incorrect results because the MV 
> is enabled for MV-based query rewrite, i.e., the following query will output 
> what's in the MV for the time being:
> {code:sql}
> SELECT cmv_basetable_n6.a, cmv_basetable_2_n3.c
> FROM cmv_basetable_n6 JOIN cmv_basetable_2_n3 ON (cmv_basetable_n6.a = 
> cmv_basetable_2_n3.a)
> WHERE cmv_basetable_2_n3.c > 10.0; {code}
> !截图8.PNG!
>  # Disable MV-based query rewrite for the MV and re-run the definition, which 
> should give the correct results:
> {code:sql}
> ALTER MATERIALIZED VIEW cmv_mat_view_n6 DISABLE REWRITE;
> SELECT 

[jira] [Assigned] (HIVE-27924) Incremental rebuild goes wrong when inserts and deletes overlap between the source tables

2023-12-05 Thread Krisztian Kasa (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Kasa reassigned HIVE-27924:
-

Assignee: Krisztian Kasa


[jira] [Commented] (HIVE-27226) FullOuterJoin with filter expressions is not computed correctly

2023-12-05 Thread Seonggon Namgung (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793209#comment-17793209
 ] 

Seonggon Namgung commented on HIVE-27226:
-

[~dkuzmenko] , I think that would not take much time; it seems that we can 
disable HIVE-18908 optimization by adding an extra condition in 
ConvertJoinMapJoin.getMapJoinConversion().

> FullOuterJoin with filter expressions is not computed correctly
> ---
>
> Key: HIVE-27226
> URL: https://issues.apache.org/jira/browse/HIVE-27226
> Project: Hive
>  Issue Type: Bug
>Reporter: Seonggon Namgung
>Priority: Major
>  Labels: hive-4.0.0-must
>
> I tested many OuterJoin queries as an extension of HIVE-27138, and I found 
> that Hive returns an incorrect result for a query containing FullOuterJoin 
> with filter expressions. In a nutshell, all JoinOperators that run on the Tez 
> engine return incorrect results for OuterJoin queries, and one of the reasons 
> for the incorrect computation comes from CommonJoinOperator, which is the base 
> of all JoinOperators. I attached the queries and configuration that I used at 
> the bottom of the document. I am still inspecting this problem, and I will 
> share an update once I find another cause. Any comments and opinions would 
> also be appreciated.
> First of all, I observed that current Hive ignores filter expressions 
> contained in MapJoinOperator. For example, the attached result of query1 
> shows that MapJoinOperator performs an inner join, not a full outer join. This 
> problem stems from the removal of filterMap. When converting JoinOperator to 
> MapJoinOperator, ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin() 
> removes the filterMap of MapJoinOperator. Because MapJoinOperator does not 
> evaluate filter expressions if filterMap is null, this change makes 
> MapJoinOperator ignore filter expressions, so it always joins tables 
> regardless of whether they satisfy the filter expressions. To solve this 
> problem, I disabled FullOuterMapJoinOptimization and applied the patch for 
> HIVE-27138, which prevents the NPE. (The patch is available at the following 
> link: LINK.) The rest of this document uses this modified Hive, but most of 
> the problems occur in current Hive, too.
> The second problem I found is that Hive returns the same left-null or 
> right-null rows multiple times when it uses MapJoinOperator or 
> CommonMergeJoinOperator. This is caused by the logic of the current 
> CommonJoinOperator. Both JoinOperators join tables in 2 steps. 
> First, they create RowContainers, each of which is a group of rows from one 
> table that share the same key. Second, they call 
> CommonJoinOperator#checkAndGenObject() with the created RowContainers. This 
> method checks the filterTag of each row in the RowContainers and forwards the 
> joined row if it meets all filter conditions. For OuterJoin, 
> checkAndGenObject() forwards non-matching rows if there is no matching row in 
> the RowContainer. The problem happens when there are multiple RowContainers 
> for the same key and table. For example, suppose that there are two left 
> RowContainers and one right RowContainer. If none of the rows in the two left 
> RowContainers satisfies the filter condition, then checkAndGenObject() will 
> forward a Left-Null row for each right row. Because checkAndGenObject() is 
> called with each left RowContainer, there will be two duplicated Left-Null 
> rows for every right row.
> In the case of MapJoinOperator, it always creates a singleton RowContainer for 
> the big table. Therefore, it always produces duplicated non-matching rows. 
> CommonMergeJoinOperator also creates multiple RowContainers for the big table, 
> whose size is hive.join.emit.interval. In the experiment below, I also set 
> hive.join.shortcut.unmatched.rows=false and hive.exec.reducers.max=1 to 
> disable the specialized algorithm for OuterJoin of 2 tables and to force 
> calling checkAndGenObject() before all rows with the same key are gathered. I 
> didn't observe this problem when using VectorMapJoinOperator, and I will 
> inspect VectorMapJoinOperator to see whether we can reproduce the problem 
> with it.
> I think the second problem is not limited to FullOuterJoin, but I couldn't 
> find such a query as of now. I will add it to this issue if I can 
> write a query that reproduces the second problem without FullOuterJoin.
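
The duplication described for the second problem (two left RowContainers, one right RowContainer, no left row passing the filter) can be modelled roughly as follows. This is a toy model under stated assumptions, not Hive's CommonJoinOperator: the class, the method names, and the integer filter threshold are all hypothetical, and unmatched left rows are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; class and method names are hypothetical, not Hive code.
final class LeftNullDuplicationSketch {
    private LeftNullDuplicationSketch() {}

    // Models one call for a single batch of left rows: forwards "left|right"
    // for every left row that passes the filter, or "NULL|right" for each
    // right row when no left row in this batch passes.
    static List<String> emitForBatch(List<Integer> leftBatch, List<String> rightRows, int minLeft) {
        List<Integer> passing = new ArrayList<>();
        for (int left : leftBatch) {
            if (left >= minLeft) {
                passing.add(left);
            }
        }
        List<String> out = new ArrayList<>();
        for (String right : rightRows) {
            if (passing.isEmpty()) {
                out.add("NULL|" + right);
            } else {
                for (int left : passing) {
                    out.add(left + "|" + right);
                }
            }
        }
        return out;
    }

    // Calls emitForBatch once per left batch, so the "no match" decision is
    // made per batch rather than once per key.
    static List<String> emitPerBatch(List<List<Integer>> leftBatches, List<String> rightRows, int minLeft) {
        List<String> out = new ArrayList<>();
        for (List<Integer> batch : leftBatches) {
            out.addAll(emitForBatch(batch, rightRows, minLeft));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> right = List.of("r1");
        // No left row passes the filter (minLeft = 10).
        System.out.println(emitPerBatch(List.of(List.of(1), List.of(2)), right, 10));
        System.out.println(emitForBatch(List.of(1, 2), right, 10));
    }
}
```

With the left rows split into two batches, the model forwards the Left-Null row for r1 twice. With the same rows in one batch it forwards it once, which is the result one would expect per key.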
> I also found that Hive returns a wrong result for query2 even when I used 
> VectorMapJoinOperator. I am still inspecting this problem and I will add an 
> update on it when I find the cause.
>  
> Experiment:
>  
> {code:java}
>  Configuration
> set hive.optimize.shared.work=false;
> -- Std MapJoin
> set hive.auto.convert.join=true;
> set hive.vectorized.execution.enabled=false;
> -- Vec MapJoin
> set hive.auto.convert.join=true;
> set 

[jira] [Resolved] (HIVE-27918) Iceberg: Push transforms for clustering during table writes

2023-12-05 Thread Sourabh Badhya (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Badhya resolved HIVE-27918.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Merged to master.
Thanks [~dkuzmenko] for the reviews.

> Iceberg: Push transforms for clustering during table writes
> ---
>
> Key: HIVE-27918
> URL: https://issues.apache.org/jira/browse/HIVE-27918
> Project: Hive
>  Issue Type: Improvement
>Reporter: Sourabh Badhya
>Assignee: Sourabh Badhya
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Currently, transformed columns (except for the bucket transform) are not 
> pushed / passed as clustering columns. This can lead to incorrect clustering 
> on such columns, which in turn can lead to non-performant writes.
> Hence, push transforms for clustering during table writes.
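
One way to see why the clustering key matters is the toy model below. It is not Hive or Iceberg code, and it is only an assumption about the kind of cost the description refers to. It assumes a table partitioned by an integer truncate transform and rows routed to writers by hashing a clustering key; the class, the method names, the width and the writer count are all hypothetical. It counts how many (writer, partition) pairs occur, that is, how many partition files the writers open in total.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only; names and parameters are hypothetical.
final class ClusteringSketch {
    private ClusteringSketch() {}

    // Integer truncate: the value minus its non-negative remainder.
    static int truncate(int width, int value) {
        return value - Math.floorMod(value, width);
    }

    // Counts distinct (writer, partition) pairs when rows are routed to
    // writers by hashing either the raw value or the transformed value.
    static int openFiles(int[] values, int width, int writers, boolean clusterOnTransform) {
        Set<String> pairs = new HashSet<>();
        for (int v : values) {
            int partition = truncate(width, v);
            int clusterKey = clusterOnTransform ? partition : v;
            int writer = Math.floorMod(Integer.hashCode(clusterKey), writers);
            pairs.add(writer + ":" + partition);
        }
        return pairs.size();
    }

    public static void main(String[] args) {
        int[] values = new int[20];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        System.out.println(openFiles(values, 10, 2, false)); // clustered on the raw column
        System.out.println(openFiles(values, 10, 2, true));  // clustered on the transform
    }
}
```

In this model, clustering on the raw column makes both writers touch both partitions (4 pairs), while clustering on the transformed value keeps each partition with a single writer (2 pairs).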


