[jira] [Commented] (HIVE-28490) SharedWorkOptimizer sometimes removes useful DPP sources.

2024-08-29 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877927#comment-17877927
 ] 

Sungwoo Park commented on HIVE-28490:
-

On the 10TB TPC-DS benchmark (tested with Hive 4 on MR3),

query 58, before: 72.9s, after: 9.6s
query 83, before: 18.3s, after: 14.6s

> SharedWorkOptimizer sometimes removes useful DPP sources.
> -
>
> Key: HIVE-28490
> URL: https://issues.apache.org/jira/browse/HIVE-28490
> Project: Hive
>  Issue Type: Improvement
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
> Attachments: 3.StopRemovingRetainableDPP.pptx
>
>
> The current SharedWorkOptimizer sometimes removes DPP sources that are not 
> invalidated. I found that findAscendantWorkOperators() returns a superset of 
> the ascendant operators, which causes incorrect DPP source removal.
> Please check out the attached slides for a detailed explanation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28489) Partitioning the input data of Grouping Set GroupBy operator

2024-08-29 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877925#comment-17877925
 ] 

Sungwoo Park commented on HIVE-28489:
-

On the 10TB TPC-DS benchmark (tested with Hive 4 on MR3),

query 18, before: 30.6s, after: 28.2s
query 22, before: 53.1s, after: 18.0s
query 67, before: 842.3s, after: 429.1s


> Partitioning the input data of Grouping Set GroupBy operator
> 
>
> Key: HIVE-28489
> URL: https://issues.apache.org/jira/browse/HIVE-28489
> Project: Hive
>  Issue Type: Improvement
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
> Attachments: 2.PartitionDataBeforeGroupingSet.pptx
>
>
> A GroupBy operator with grouping sets often emits too many rows, which becomes 
> the bottleneck of query execution. To reduce the number of output rows, this 
> JIRA proposes partitioning the input data of such a GroupBy operator.
> Please check out the attached slides for a detailed explanation.
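As a back-of-the-envelope illustration of the fan-out involved, here is a hedged sketch 
with made-up numbers. The row counts, the number of grouping sets, and the assumption 
that partitioning lets the operator aggregate on the full grouping key before expansion 
are illustrative only; the actual design is in the attached slides.

{code:java}
// Hypothetical numbers only: a GroupBy over G grouping sets duplicates each
// incoming row once per grouping set before aggregation.
public class GroupingSetFanout {
  public static void main(String[] args) {
    long inputRows = 1_000_000_000L;   // assumed rows reaching the operator
    long distinctKeys = 50_000_000L;   // assumed distinct full grouping keys
    int groupingSets = 4;              // e.g. GROUPING SETS ((a,b,c),(a,b),(a),())

    // Today: every input row is expanded for every grouping set.
    System.out.println(inputRows * groupingSets);     // 4,000,000,000 emitted rows

    // If the input is partitioned (and partially aggregated) on the full key
    // first, only the distinct keys get expanded.
    System.out.println(distinctKeys * groupingSets);  //   200,000,000 emitted rows
  }
}
{code}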



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28488) Merge adjacent union distinct

2024-08-29 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877924#comment-17877924
 ] 

Sungwoo Park commented on HIVE-28488:
-

On the 10TB TPC-DS benchmark (tested with Hive 4 on MR3),

query 49, before: 26.1s, after: 25.3s
query 75, before: 224.2s, after: 204.8s

> Merge adjacent union distinct
> -
>
> Key: HIVE-28488
> URL: https://issues.apache.org/jira/browse/HIVE-28488
> Project: Hive
>  Issue Type: Improvement
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
> Attachments: 1.MergeAdjacentUnionDistinct.pptx
>
>
> Current Hive compiles
> "SELECT * FROM TBL1 UNION SELECT * FROM TBL2 UNION SELECT * FROM TBL3"
> to
> {code:java}
> TS - GBY - RS
> TS - GBY - RS - GBY - RS
>            TS - GBY - RS - GBY {code}
> This can be optimized as follows:
> {code:java}
> TS - GBY - RS
> TS - GBY - RS
> TS - GBY - RS - GBY {code}
> Please check out the attached slides for a detailed explanation and feel free 
> to ask any questions or share suggestions. Also, it would be great if someone 
> could suggest a better location for this optimization (e.g. SemanticAnalyzer, 
> Calcite, etc.).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28484) SharedWorkOptimizer leaves residual unused operator tree that send DPP events to unknown operators

2024-08-29 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17877654#comment-17877654
 ] 

Sungwoo Park commented on HIVE-28484:
-

[~rameshkumar] Out of curiosity, did you find this bug by analyzing a query in 
production use, or a TPC-DS query?


> SharedWorkOptimizer leaves residual unused operator tree that send DPP events 
> to unknown operators
> --
>
> Key: HIVE-28484
> URL: https://issues.apache.org/jira/browse/HIVE-28484
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2, Physical Optimizer
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>
> Please see below the series of events:
>  
> {code:java}
> 2024-08-27 15:59:47,141 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Before SharedWorkOptimizer:
> TS[0]-SEL[2]-MAPJOIN[189]-MAPJOIN[194]-SEL[91]-FIL[92]-SEL[93]-LIM[94]-FS[95]
> TS[3]-FIL[123]-SEL[5]-RS[30]-MAPJOIN[185]-MAPJOIN[188]-SEL[38]-GBY[40]-RS[41]-GBY[42]-SEL[43]-RS[86]-MAPJOIN[189]
> TS[6]-FIL[124]-SEL[8]-MAPJOIN[185]
> TS[9]-FIL[126]-SEL[11]-MAPJOIN[187]-SEL[29]-GBY[34]-RS[36]-MAPJOIN[188]
> TS[12]-FIL[128]-SEL[14]-MAPJOIN[186]-GBY[22]-RS[23]-GBY[24]-SEL[25]-RS[27]-MAPJOIN[187]
>                                                                    
> -SEL[147]-GBY[148]-EVENT[149]
> TS[15]-FIL[129]-SEL[17]-RS[19]-MAPJOIN[186]
>                        -SEL[153]-GBY[154]-EVENT[155]
> TS[44]-FIL[130]-SEL[46]-RS[71]-MAPJOIN[190]-MAPJOIN[193]-SEL[79]-GBY[81]-RS[82]-GBY[83]-RS[89]-MAPJOIN[194]
> TS[47]-FIL[131]-SEL[49]-MAPJOIN[190]
> TS[50]-FIL[133]-SEL[52]-MAPJOIN[192]-SEL[70]-GBY[75]-RS[77]-MAPJOIN[193]
> TS[53]-FIL[135]-SEL[55]-MAPJOIN[191]-GBY[63]-RS[64]-GBY[65]-SEL[66]-RS[68]-MAPJOIN[192]
>                                                                    
> -SEL[171]-GBY[172]-EVENT[173]
> TS[56]-FIL[136]-SEL[58]-RS[60]-MAPJOIN[191]
>                        -SEL[177]-GBY[178]-EVENT[179]2024-08-27 15:59:47,141 
> DEBUG org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: 
> Thread-190]:DPP information stored in the cache: {TS[9]=[EVENT[149]], 
> TS[12]=[EVENT[155]], TS[53]=[EVENT[179]], TS[50]=[EVENT[173]]}2024-08-27 
> 15:59:47,142 DEBUG org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Merging subtree starting at TS[50] into subtree starting at TS[9]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: MAPJOIN[191]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: RS[68]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: GBY[65]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: RS[64]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: SEL[66]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: SEL[55]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: GBY[63]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: RS[60]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: EVENT[173]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache.hadoop.hive.ql.optimizer.SharedWorkOptimizer: 
> [51bcc513-a0e8-4b90-9108-bed2005f7f8c HiveServer2-Handler-Pool: Thread-190]: 
> Input operator removed: SEL[171]
> 2024-08-27 15:59:47,142 DEBUG 
> org.apache

[jira] [Updated] (HIVE-28480) Disable SMB on partition hash generator mismatch across join branches in previous RS

2024-08-27 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-28480:

Fix Version/s: 4.0.1

> Disable SMB on partition hash generator mismatch across join branches in 
> previous RS
> 
>
> Key: HIVE-28480
> URL: https://issues.apache.org/jira/browse/HIVE-28480
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: Himanshu Mishra
>Assignee: Himanshu Mishra
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.1.0, 4.0.1
>
>
> As SMB replaces the last RS op in each joining branch and the JOIN op with 
> MERGEJOIN, we need to ensure that the RSs before these RSs, in both branches, 
> partition using the same hash generator.
> The hash code generator differs based on ReducerTraits.UNIFORM, i.e. 
> [ReduceSinkOperator#computeMurmurHash() or 
> ReduceSinkOperator#computeHashCode()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L340-L344],
>  leading to different hash codes for the same value.
> Skip the SMB join in such cases.
> h3. Replication:
> Consider the following query, where the join gets converted to SMB. Auto reducer 
> parallelism is enabled, which ensures more than one reducer task.
>  
> {code:java}
> CREATE TABLE t_asj_18 (k STRING, v INT);
> INSERT INTO t_asj_18 values ('a', 10), ('a', 10);
> set hive.auto.convert.join=false;
> set hive.tez.auto.reducer.parallelism=true;
> EXPLAIN SELECT * FROM (
> SELECT k, COUNT(DISTINCT v), SUM(v)
> FROM t_asj_18 GROUP BY k
> ) a LEFT JOIN (
> SELECT k, COUNT(v)
> FROM t_asj_18 GROUP BY k
> ) b ON a.k = b.k; {code}
>  
>  
> Expected result is:
>  
> {code:java}
> a   1   20  a   2 {code}
> but on master branch, it results in
>  
>  
> {code:java}
> a   1   20  NULLNULL {code}
>  
>  
> Here, for COUNT(DISTINCT), the RS key is k, v while the partition key is still k. In 
> such a scenario, [reducer trait UNIFORM is not 
> set|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SetReducerParallelism.java#L99-L104].
>  The hash code for "a" from the 2nd subquery is generated using murmurHash 
> (270516725) while the 1st is generated using bucketHash (1086686554), resulting 
> in rows with key "a" reaching different reducer tasks.
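To make the mismatch concrete, here is a small illustrative sketch. It is not Hive code: 
the "hash mod number-of-reducers" partitioning and the reducer count of 2 are simplifying 
assumptions; only the two hash values for key "a" come from the description above.

{code:java}
// Illustration only: two different hash functions for the same key value can
// route that key's rows to different reducers.
public class ReducerRouting {
  static int reducerFor(int hashCode, int numReducers) {
    return (hashCode & Integer.MAX_VALUE) % numReducers;  // typical hash-partitioning
  }

  public static void main(String[] args) {
    int numReducers = 2;  // assumed, since auto reducer parallelism yields more than one reducer
    System.out.println(reducerFor(270516725, numReducers));   // 1 (branch using murmurHash)
    System.out.println(reducerFor(1086686554, numReducers));  // 0 (branch using bucketHash)
    // The "a" rows from the two branches land on different reducers, so the SMB join never matches them.
  }
}
{code}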



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28480) Disable SMB on partition hash generator mismatch across join branches in previous RS

2024-08-27 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-28480:

Fix Version/s: 4.1.0

> Disable SMB on partition hash generator mismatch across join branches in 
> previous RS
> 
>
> Key: HIVE-28480
> URL: https://issues.apache.org/jira/browse/HIVE-28480
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: Himanshu Mishra
>Assignee: Himanshu Mishra
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> As SMB replaces the last RS op in each joining branch and the JOIN op with 
> MERGEJOIN, we need to ensure that the RSs before these RSs, in both branches, 
> partition using the same hash generator.
> The hash code generator differs based on ReducerTraits.UNIFORM, i.e. 
> [ReduceSinkOperator#computeMurmurHash() or 
> ReduceSinkOperator#computeHashCode()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L340-L344],
>  leading to different hash codes for the same value.
> Skip the SMB join in such cases.
> h3. Replication:
> Consider the following query, where the join gets converted to SMB. Auto reducer 
> parallelism is enabled, which ensures more than one reducer task.
>  
> {code:java}
> CREATE TABLE t_asj_18 (k STRING, v INT);
> INSERT INTO t_asj_18 values ('a', 10), ('a', 10);
> set hive.auto.convert.join=false;
> set hive.tez.auto.reducer.parallelism=true;
> EXPLAIN SELECT * FROM (
> SELECT k, COUNT(DISTINCT v), SUM(v)
> FROM t_asj_18 GROUP BY k
> ) a LEFT JOIN (
> SELECT k, COUNT(v)
> FROM t_asj_18 GROUP BY k
> ) b ON a.k = b.k; {code}
>  
>  
> Expected result is:
>  
> {code:java}
> a   1   20  a   2 {code}
> but on master branch, it results in
>  
>  
> {code:java}
> a   1   20  NULLNULL {code}
>  
>  
> Here, for COUNT(DISTINCT), the RS key is k, v while the partition key is still k. In 
> such a scenario, [reducer trait UNIFORM is not 
> set|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SetReducerParallelism.java#L99-L104].
>  The hash code for "a" from the 2nd subquery is generated using murmurHash 
> (270516725) while the 1st is generated using bucketHash (1086686554), resulting 
> in rows with key "a" reaching different reducer tasks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28480) Disable SMB on partition hash generator mismatch across join branches in previous RS

2024-08-27 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17876949#comment-17876949
 ] 

Sungwoo Park commented on HIVE-28480:
-

As this is a (critical) correctness problem, I would like to suggest setting 
Priority to Critical.


> Disable SMB on partition hash generator mismatch across join branches in 
> previous RS
> 
>
> Key: HIVE-28480
> URL: https://issues.apache.org/jira/browse/HIVE-28480
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: Himanshu Mishra
>Assignee: Himanshu Mishra
>Priority: Major
>  Labels: pull-request-available
>
> As SMB replaces the last RS op in each joining branch and the JOIN op with 
> MERGEJOIN, we need to ensure that the RSs before these RSs, in both branches, 
> partition using the same hash generator.
> The hash code generator differs based on ReducerTraits.UNIFORM, i.e. 
> [ReduceSinkOperator#computeMurmurHash() or 
> ReduceSinkOperator#computeHashCode()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L340-L344],
>  leading to different hash codes for the same value.
> Skip the SMB join in such cases.
> h3. Replication:
> Consider the following query, where the join gets converted to SMB. Auto reducer 
> parallelism is enabled, which ensures more than one reducer task.
>  
> {code:java}
> CREATE TABLE t_asj_18 (k STRING, v INT);
> INSERT INTO t_asj_18 values ('a', 10), ('a', 10);
> set hive.auto.convert.join=false;
> set hive.tez.auto.reducer.parallelism=true;
> EXPLAIN SELECT * FROM (
> SELECT k, COUNT(DISTINCT v), SUM(v)
> FROM t_asj_18 GROUP BY k
> ) a LEFT JOIN (
> SELECT k, COUNT(v)
> FROM t_asj_18 GROUP BY k
> ) b ON a.k = b.k; {code}
>  
>  
> Expected result is:
>  
> {code:java}
> a   1   20  a   2 {code}
> but on master branch, it results in
>  
>  
> {code:java}
> a   1   20  NULLNULL {code}
>  
>  
> Here, for COUNT(DISTINCT), the RS key is k, v while the partition key is still k. In 
> such a scenario, [reducer trait UNIFORM is not 
> set|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SetReducerParallelism.java#L99-L104].
>  The hash code for "a" from the 2nd subquery is generated using murmurHash 
> (270516725) while the 1st is generated using bucketHash (1086686554), resulting 
> in rows with key "a" reaching different reducer tasks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HIVE-27985) Avoid duplicate files.

2024-06-19 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856201#comment-17856201
 ] 

Sungwoo Park edited comment on HIVE-27985 at 6/19/24 7:32 AM:
--

I agree with the direction proposed in this JIRA issue. In my opinion, Hive 
would be able to better handle the problem of duplicate files if it used the 
logic of sending/checking canCommit requests, which I think is actually the way 
that Tez applications are supposed to interact with Tez AM. By checking 
canCommit requests, Hive can ensure that only a single output commit attempt is 
made on the side of TezProcessor. I don't know why Hive chose not to exploit 
canCommit requests, but perhaps there was a good reason for this decision.

However, if you re-enable speculative execution, the original problem with 
removeTempOrDuplicateFiles() appears again. In the proposal of this JIRA issue, 
it may seem okay because a unique file name is chosen, but still there remains 
a correctness problem because some queries produce non-deterministic results 
(i.e., each time you run the same query, you may get different results). With 
several task attempts executed speculatively, you cannot really tell which is 
the final output.

So, my understanding (or my guess) is that Hive opted for a compromise: 1) do 
not allow speculative execution and always execute task attempts sequentially; 
2) pick the output by checking TaskAttemptID in removeTempOrDuplicateFiles().

In my opinion, the lack of support for speculative execution is a non-trivial 
problem for the stability of Hive. For example, a task attempt may hang for a 
few minutes because of TCP connection failures, thus delaying the progress of 
the entire DAG. This is where speculative execution can be particularly useful.

The solution in Hive-MR3 is to allow speculative execution and check the size 
of output files in removeTempOrDuplicateFiles() (i.e., use the old logic in 
Hive 3 rather than the new logic in Hive 4). Of course, this does not resolve 
the correctness issue in general, so we assume that query execution is 
deterministic. For cases where query execution is not necessarily 
deterministic, we provide a configuration flag that disables speculative 
execution.
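For reference, a minimal sketch of the commit handshake discussed here, assuming Tez's 
ProcessorContext.canCommit() call; the polling interval and the surrounding method are 
illustrative and are not Hive's actual TezProcessor implementation.

{code:java}
// Sketch only: poll canCommit() so the AM grants the commit to a single task
// attempt, then commit the output exactly once.
import org.apache.tez.runtime.api.ProcessorContext;

public class CanCommitSketch {
  static void commitOnce(ProcessorContext context) throws Exception {
    // Losing attempts keep polling until they are killed or the winning attempt fails.
    while (!context.canCommit()) {
      Thread.sleep(100);  // illustrative back-off interval
    }
    // Granted: rename this attempt's temporary output to the final file here,
    // e.g. via the output's committer.
  }
}
{code}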




was (Author: glapark):
I agree with the direction proposed in this JIRA issue. In my opinion, Hive 
would be able to better handle the problem of duplicate files if it used the 
logic of sending/checking canCommit requests, which I think is actually the way 
that Tez applications are supposed to interact with Tez AM. By checking 
canCommit requests, Hive can ensure that only a single output commit attempt is 
made on the side of TezProcessor. I don't know why Hive chose not to exploit 
canCommit requests, but perhaps there was a good reason for this decision.

However, if you re-enable speculative execution, the original problem with 
removeTempOrDuplicateFiles() appears again. In the proposal of this JIRA issue, 
it may seem okay because a unique file name is chosen, but still there remains 
a correctness problem because some queries produce non-deterministic results 
(i.e., each time you run the same query, you may get different results). With 
several task attempts executed speculatively, you cannot really tell which is 
the final output.

So, my understanding (or my guess) is that Hive opted for a compromise: 1) do 
not allow speculative execution and always execute task attempts sequentially; 
2) pick the output by checking TaskAttemptID in removeTempOrDuplicateFiles().

In my opinion, lack of support of speculative execution is a non-trivial 
problem for the stability of Hive. For example, a task attempt may hang for a 
few minutes because of TPC-DS connection failures, thus delaying the progress 
of the entire DAG. This is where speculative execution can be particularly 
useful.

The solution in Hive-MR3 is to allow speculative execution and check the size 
of output files in removeTempOrDuplicateFiles() (i.e., use the old logic in 
Hive 3 rather than the new logic in Hive 4). Of course, this violates the 
correctness issue in general, so we assume that query execution is 
deterministic. For the case that the query execution is not necessarily 
deterministic, we provide a configuration flag that disables speculative 
execution.



> Avoid duplicate files.
> --
>
> Key: HIVE-27985
> URL: https://issues.apache.org/jira/browse/HIVE-27985
> Project: Hive
>  Issue Type: Bug
>  Components: Tez
>Affects Versions: 4.0.0
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Major
> Attachments: how tez examples commit.png
>
>
> *1 Introduction*
> Hive on Tez occasionally produces duplicate files, especially when speculative 
> execution is enabled. Hive identifies and removes d

[jira] [Commented] (HIVE-27985) Avoid duplicate files.

2024-06-19 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856201#comment-17856201
 ] 

Sungwoo Park commented on HIVE-27985:
-

I agree with the direction proposed in this JIRA issue. In my opinion, Hive 
would be able to better handle the problem of duplicate files if it used the 
logic of sending/checking canCommit requests, which I think is actually the way 
that Tez applications are supposed to interact with Tez AM. By checking 
canCommit requests, Hive can ensure that only a single output commit attempt is 
made on the side of TezProcessor. I don't know why Hive chose not to exploit 
canCommit requests, but perhaps there was a good reason for this decision.

However, if you re-enable speculative execution, the original problem with 
removeTempOrDuplicateFiles() appears again. In the proposal of this JIRA issue, 
it may seem okay because a unique file name is chosen, but still there remains 
a correctness problem because some queries produce non-deterministic results 
(i.e., each time you run the same query, you may get different results). With 
several task attempts executed speculatively, you cannot really tell which is 
the final output.

So, my understanding (or my guess) is that Hive opted for a compromise: 1) do 
not allow speculative execution and always execute task attempts sequentially; 
2) pick the output by checking TaskAttemptID in removeTempOrDuplicateFiles().

In my opinion, the lack of support for speculative execution is a non-trivial 
problem for the stability of Hive. For example, a task attempt may hang for a 
few minutes because of TCP connection failures, thus delaying the progress 
of the entire DAG. This is where speculative execution can be particularly 
useful.

The solution in Hive-MR3 is to allow speculative execution and check the size 
of output files in removeTempOrDuplicateFiles() (i.e., use the old logic in 
Hive 3 rather than the new logic in Hive 4). Of course, this does not resolve 
the correctness issue in general, so we assume that query execution is 
deterministic. For cases where query execution is not necessarily 
deterministic, we provide a configuration flag that disables speculative 
execution.



> Avoid duplicate files.
> --
>
> Key: HIVE-27985
> URL: https://issues.apache.org/jira/browse/HIVE-27985
> Project: Hive
>  Issue Type: Bug
>  Components: Tez
>Affects Versions: 4.0.0
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Major
> Attachments: how tez examples commit.png
>
>
> *1 Introduction*
> Hive on Tez occasionally produces duplicate files, especially when speculative 
> execution is enabled. Hive identifies and removes duplicate files through 
> removeTempOrDuplicateFiles(). However, this logic often does not take effect. 
> For example, a killed task attempt may commit files during the execution of 
> this method, or the files under HIVE_UNION_SUBDIR_X are not recognized during 
> UNION ALL. There are many issues that try to solve these problems, mainly 
> focusing on how to identify duplicate files. *This issue mainly solves the 
> problem by avoiding the generation of duplicate files.*
> *2 How does Tez avoid duplicate files?*
> After testing, I found that the Hadoop MapReduce examples and Tez examples do not 
> have this problem. With a properly designed OutputCommitter, duplicate files 
> can be avoided. Let's analyze how Tez avoids duplicate files.
> {color:#172b4d} _Note: Compared with Tez, Hadoop MapReduce has one extra step, 
> commitPending, which is not critical, so we only analyze Tez._{color}
> !how tez examples commit.png|width=778,height=483!
>  
> Let's analyze these steps:
>  * (1) {*}process records{*}: Process records.
>  * (2) {*}send canCommit request{*}: After all records are processed, the task 
> calls canCommit remotely on the AM.
>  * (3) {*}update commitAttempt{*}: After the AM receives the canCommit request, 
> it checks whether another task attempt of the same task has already executed 
> canCommit. If no other task attempt executed canCommit first, it returns true; 
> otherwise it returns false. This ensures that only one task attempt commits 
> for each task.
>  * (4) {*}return canCommit response{*}: The task receives the AM's response. If 
> it is true, the task can commit. If it is false, another task attempt has 
> already been granted the commit and this one cannot commit; the task loops 
> back to (2) and keeps executing canCommit until it is killed or the other 
> task attempt fails.
>  * (5) {*}output.commit{*}: Execute the commit, specifically renaming the 
> generated temporary file to the final file.
>  * (6) {*}notify succeeded{*}: Although the task has produced the final file, 
> the AM still needs to be notified that its work is completed. Therefore, the AM 
> needs to be no

[jira] [Commented] (HIVE-26018) The result of UNIQUEJOIN on Hive on Tez is inconsistent with that of MR

2024-06-18 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855915#comment-17855915
 ] 

Sungwoo Park commented on HIVE-26018:
-

The correctness issue of unique join is fixed in the pull request. It would be 
great if someone could review the pull request.

> The result of UNIQUEJOIN on Hive on Tez is inconsistent with that of MR
> ---
>
> Key: HIVE-26018
> URL: https://issues.apache.org/jira/browse/HIVE-26018
> Project: Hive
>  Issue Type: Bug
>  Components: Tez
>Affects Versions: 3.1.0, 4.0.0
>Reporter: GuangMing Lu
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
>
> The result of UNIQUEJOIN on Hive on Tez is inconsistent with that of MR, and 
> the result is not correct, for example:
> CREATE TABLE T1_n1x(key STRING, val STRING) STORED AS orc;
> CREATE TABLE T2_n1x(key STRING, val STRING) STORED AS orc;
> insert into T1_n1x values('aaa', '111'),('bbb', '222'),('ccc', '333');
> insert into T2_n1x values('aaa', '111'),('ddd', '444'),('ccc', '333');
> SELECT a.key, b.key FROM UNIQUEJOIN PRESERVE T1_n1x a (a.key), PRESERVE  
> T2_n1x b (b.key);
> Hive on Tez result: wrong
> |a.key  |b.key  |
> |aaa    |aaa    |
> |bbb    |NULL  |
> |ccc    |ccc    |
> |NULL  |ddd    |
> +--+
> Hive on MR result: right
> |a.key  |b.key  |
> |aaa    |aaa    |
> |bbb    |NULL  |
> |ccc    |ccc    |
> +-+
> SELECT a.key, b.key FROM UNIQUEJOIN T1_n1x a (a.key), T2_n1x b (b.key);
> Hive on Tez result: wrong
> +---+
> |a.key  |b.key  |
> |aaa    |aaa    |
> |bbb    |NULL  |
> |ccc    |ccc    |
> |NULL  |ddd    |
> +-+
> Hive on MR result: right
> |a.key  |b.key  |
> |aaa    |aaa    |
> |ccc    |ccc    |
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28331) java.lang.StackOverflowError in Metastore with Derby when loading 2GB TPC-DS ORC data

2024-06-17 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-28331:
---

 Summary: java.lang.StackOverflowError in Metastore with Derby when 
loading 2GB TPC-DS ORC data
 Key: HIVE-28331
 URL: https://issues.apache.org/jira/browse/HIVE-28331
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0
Reporter: Sungwoo Park


Loading 2GB ORC TPC-DS partitioned tables fails with 
java.lang.StackOverflowError in Metastore with Derby.

How to reproduce:
1. Run Metastore with the Derby database.
2. Load 2GB ORC TPC-DS tables (from the 2GB text TPC-DS dataset).
3. Loading partitioned tables (such as catalog_returns) fails with 
java.lang.StackOverflowError in Metastore.


{code:java}
Caused by: java.lang.StackOverflowError
  at 
org.apache.derby.impl.sql.compile.SetOperatorNode.replaceOrForbidDefaults(Unknown
 Source) ~[derby-10.14.2.0.jar:?]
{code}

This error occurs when HiveServer2 tries to add partitions (over 1000 
partitions) at once in Hive.java.


{code:java}
getSynchronizedMSC().add_partitions(partitions.stream().map(Partition::getTPartition)
  .collect(Collectors.toList()));

{code}

This might be a problem with the new version of Derby because it was not 
observed in Hive 3 or when using MySQL as the backend database. I am not sure 
how to fix this issue (other than adding one partition at a time) without 
losing the rollback logic when the operation fails.
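One possible direction, sketched below with hypothetical helper names (this is not 
existing Hive code, and it trades the all-or-nothing rollback mentioned above for 
batch-level rollback), is to split the add_partitions call into smaller batches so 
Derby never has to compile one huge statement.

{code:java}
// Hypothetical sketch: add partitions in fixed-size batches instead of a
// single add_partitions() call over 1000+ partitions. Rollback then only
// covers the failing batch, not the whole set.
import java.util.List;

public class PartitionBatching {
  interface BatchSink<T> {
    void add(List<T> batch) throws Exception;  // e.g. wraps getSynchronizedMSC().add_partitions(batch)
  }

  static <T> void addInBatches(List<T> partitions, int batchSize, BatchSink<T> sink) throws Exception {
    for (int from = 0; from < partitions.size(); from += batchSize) {
      int to = Math.min(from + batchSize, partitions.size());
      sink.add(partitions.subList(from, to));  // smaller Thrift call, smaller Derby statement
    }
  }
}
{code}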



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-24207) LimitOperator can leverage ObjectCache to bail out quickly

2024-06-04 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851965#comment-17851965
 ] 

Sungwoo Park commented on HIVE-24207:
-

Seonggon created HIVE-28281 to report the problem in case 1.

For case 2, it's hard to reproduce the problem, but the bug seems obvious 
because two speculative task attempts are not supposed to update a common 
counter for the same LimitOperator.

> LimitOperator can leverage ObjectCache to bail out quickly
> --
>
> Key: HIVE-24207
> URL: https://issues.apache.org/jira/browse/HIVE-24207
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> {noformat}
> select  ss_sold_date_sk from store_sales, date_dim where date_dim.d_year in 
> (1998,1998+1,1998+2) and store_sales.ss_sold_date_sk = date_dim.d_date_sk 
> limit 100;
>  select distinct ss_sold_date_sk from store_sales, date_dim where 
> date_dim.d_year in (1998,1998+1,1998+2) and store_sales.ss_sold_date_sk = 
> date_dim.d_date_sk limit 100;
>  {noformat}
> Queries like the above generate a large number of map tasks. Currently they 
> don't bail out after generating enough data. 
> It would be good to make use of ObjectCache & retain the number of records 
> generated. LimitOperator/VectorLimitOperator can bail out for the later tasks 
> in the operator's init phase itself. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorLimitOperator.java#L57
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/LimitOperator.java#L58
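As a rough illustration of the idea described above, here is a sketch that uses a plain 
static map as a stand-in for Hive's ObjectCache, with invented key and limit parameters; 
it shows the technique rather than the actual patch.

{code:java}
// Sketch: tasks of the same vertex share a record counter keyed by the limit
// operator; later tasks can check it at init time and bail out immediately.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class SharedLimitCounter {
  private static final ConcurrentHashMap<String, AtomicLong> COUNTERS = new ConcurrentHashMap<>();

  static AtomicLong counterFor(String limitOperatorKey) {        // e.g. "<query id>_LIM_5" (illustrative key)
    return COUNTERS.computeIfAbsent(limitOperatorKey, k -> new AtomicLong());
  }

  // Called from the operator's init phase: true means enough rows were already produced.
  static boolean shouldBailOut(String limitOperatorKey, long limit) {
    return counterFor(limitOperatorKey).get() >= limit;
  }

  // Called as rows are forwarded.
  static void recordRows(String limitOperatorKey, long rows) {
    counterFor(limitOperatorKey).addAndGet(rows);
  }
}
{code}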



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26018) The result of UNIQUEJOIN on Hive on Tez is inconsistent with that of MR

2024-05-24 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849421#comment-17849421
 ] 

Sungwoo Park commented on HIVE-26018:
-

Currently uniquejoin.q passes because it uses the MapReduce execution engine. If 
the Tez execution engine is used, uniquejoin.q fails for the same reason described 
in this JIRA.

The difference in the outcome is due to different representations of empty rows 
in MapReduce and Tez. If there is no row for the given key,

1. MapReduce's JoinOperator: the storage is empty
2. Tez's MapJoinOperator/CommonMergeJoinOperator: the storage contains a dummy 
row.

Does anyone still use UNIQUEJOIN in production? This is a correctness issue, so 
we would like to investigate further if UNIQUEJOIN is still used.

cc. [~seonggon]

> The result of UNIQUEJOIN on Hive on Tez is inconsistent with that of MR
> ---
>
> Key: HIVE-26018
> URL: https://issues.apache.org/jira/browse/HIVE-26018
> Project: Hive
>  Issue Type: Bug
>  Components: Tez
>Affects Versions: 3.1.0, 4.0.0
>Reporter: GuangMing Lu
>Priority: Major
>
> The result of UNIQUEJOIN on Hive on Tez is inconsistent with that of MR, and 
> the result is not correct, for example:
> CREATE TABLE T1_n1x(key STRING, val STRING) STORED AS orc;
> CREATE TABLE T2_n1x(key STRING, val STRING) STORED AS orc;
> insert into T1_n1x values('aaa', '111'),('bbb', '222'),('ccc', '333');
> insert into T2_n1x values('aaa', '111'),('ddd', '444'),('ccc', '333');
> SELECT a.key, b.key FROM UNIQUEJOIN PRESERVE T1_n1x a (a.key), PRESERVE  
> T2_n1x b (b.key);
> Hive on Tez result: wrong
> |a.key  |b.key  |
> |aaa    |aaa    |
> |bbb    |NULL  |
> |ccc    |ccc    |
> |NULL  |ddd    |
> +--+
> Hive on MR result: right
> |a.key  |b.key  |
> |aaa    |aaa    |
> |bbb    |NULL  |
> |ccc    |ccc    |
> +-+
> SELECT a.key, b.key FROM UNIQUEJOIN T1_n1x a (a.key), T2_n1x b (b.key);
> Hive on Tez result: wrong
> +---+
> |a.key  |b.key  |
> |aaa    |aaa    |
> |bbb    |NULL  |
> |ccc    |ccc    |
> |NULL  |ddd    |
> +-+
> Hive on MR result: right
> |a.key  |b.key  |
> |aaa    |aaa    |
> |ccc    |ccc    |
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-24207) LimitOperator can leverage ObjectCache to bail out quickly

2024-05-21 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848432#comment-17848432
 ] 

Sungwoo Park commented on HIVE-24207:
-

[~abstractdog] Hi, I have a couple of questions about this optimization.

1. An operator tree can contain multiple LimitOperators in general. It seems 
that this optimization works only if the LimitOperator has a single child operator 
which is either an RS or a TerminalOperator. In other words, a vertex should 
contain at most one LimitOperator and it should be the last operator 
before emitting final records. Do you know if this property is guaranteed by the 
Hive compiler?

2. This optimization may not work if speculative execution is enabled or 
multiple task attempts are executed in the same LLAP daemon. Or, does this 
optimization assume no speculative execution?






> LimitOperator can leverage ObjectCache to bail out quickly
> --
>
> Key: HIVE-24207
> URL: https://issues.apache.org/jira/browse/HIVE-24207
> Project: Hive
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> {noformat}
> select  ss_sold_date_sk from store_sales, date_dim where date_dim.d_year in 
> (1998,1998+1,1998+2) and store_sales.ss_sold_date_sk = date_dim.d_date_sk 
> limit 100;
>  select distinct ss_sold_date_sk from store_sales, date_dim where 
> date_dim.d_year in (1998,1998+1,1998+2) and store_sales.ss_sold_date_sk = 
> date_dim.d_date_sk limit 100;
>  {noformat}
> Queries like the above generate a large number of map tasks. Currently they 
> don't bail out after generating enough data. 
> It would be good to make use of ObjectCache & retain the number of records 
> generated. LimitOperator/VectorLimitOperator can bail out for the later tasks 
> in the operator's init phase itself. 
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorLimitOperator.java#L57
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/LimitOperator.java#L58



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28221) iceberg-handler/dynamic_partition_pruning.q fails in GenericUDFIn.getDisplayString() if assertion check is enabled

2024-04-26 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-28221:
---

 Summary: iceberg-handler/dynamic_partition_pruning.q fails in 
GenericUDFIn.getDisplayString() if assertion check is enabled
 Key: HIVE-28221
 URL: https://issues.apache.org/jira/browse/HIVE-28221
 Project: Hive
  Issue Type: Bug
  Components: Iceberg integration
Affects Versions: 4.0.0
Reporter: Sungwoo Park


If assertion checking is enabled, 
ql/src/test/queries/clientpositive/dynamic_partition_pruning.q fails with an 
assertion failure in the following method of 
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFIn.java:

{code:java}
  public String getDisplayString(String[] children) {
    assert (children.length >= 2);
{code}

While dynamic_partition_pruning.q is running, getDisplayString() is called with 
children.length == 1.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28196) Preserve column stats when applying UDF upper/lower.

2024-04-17 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-28196:

Labels: hive-4.0.1-must performance pull-request-available  (was: 
pull-request-available)

> Preserve column stats when applying UDF upper/lower.
> 
>
> Key: HIVE-28196
> URL: https://issues.apache.org/jira/browse/HIVE-28196
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: hive-4.0.1-must, performance, pull-request-available
> Fix For: 4.1.0
>
>
> Current Hive re-estimates column stats (including avgColLen) when it 
> encounters a UDF.
> In the case of upper and lower, Hive sets avgColLen to 
> hive.stats.max.variable.length.
> But these UDFs do not change column stats, and the default value (100) is too 
> high for string-type key columns, on which upper/lower are usually applied.
> This patch keeps the input data's avgColLen after applying UDF upper/lower to 
> make a better query plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28196) Preserve column stats when applying UDF upper/lower.

2024-04-11 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836034#comment-17836034
 ] 

Sungwoo Park commented on HIVE-28196:
-

To give the context in which this JIRA is being investigated, we are trying to 
stabilize the performance of Hive 4.0.0 with the 10TB TPC-DS benchmark.

Compared with Hive 3.1.3, query 24-b is hit the hardest: from about 300 seconds 
in Hive 3.1.3 to about 750 seconds in Hive 4.0.0. We have observed that this 
noticeable slowdown is due to using the default value of avgColLen after 
applying UDF upper/lower.

After applying this patch, the running time decreases to about 300 seconds 
because the query plan correctly uses MapJoin instead of MergeJoin. This patch 
targets Hive 4.1.0, but ideally it should be merged into 4.0.1.
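A hedged, back-of-the-envelope illustration of why avgColLen matters for the MapJoin 
decision follows; all numbers below are made up, except that the fallback value of 100 
is the hive.stats.max.variable.length default mentioned in this issue.

{code:java}
// Hypothetical sizes only: Hive roughly estimates data size as rows x average
// column length and compares it against the small-table threshold for MapJoin.
public class AvgColLenEffect {
  public static void main(String[] args) {
    long rows = 10_000_000L;                    // assumed row count of the join input
    long mapJoinThreshold = 512L * 1024 * 1024; // assumed small-table threshold (bytes)

    long estimateWithDefault = rows * 100;      // avgColLen falls back to 100 after upper()/lower()
    long estimateWithActual  = rows * 16;       // avgColLen preserved from the input column (assumed)

    System.out.println(estimateWithDefault > mapJoinThreshold); // true  -> MapJoin rejected, MergeJoin used
    System.out.println(estimateWithActual  > mapJoinThreshold); // false -> MapJoin chosen
  }
}
{code}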


> Preserve column stats when applying UDF upper/lower.
> 
>
> Key: HIVE-28196
> URL: https://issues.apache.org/jira/browse/HIVE-28196
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Current Hive re-estimates column stats (including avgColLen) when it 
> encounters a UDF.
> In the case of upper and lower, Hive sets avgColLen to 
> hive.stats.max.variable.length.
> But these UDFs do not change column stats, and the default value (100) is too 
> high for string-type key columns, on which upper/lower are usually applied.
> This patch keeps the input data's avgColLen after applying UDF upper/lower to 
> make a better query plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28196) Preserve column stats when applying UDF upper/lower.

2024-04-11 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-28196:

Fix Version/s: 4.1.0

> Preserve column stats when applying UDF upper/lower.
> 
>
> Key: HIVE-28196
> URL: https://issues.apache.org/jira/browse/HIVE-28196
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Current Hive re-estimates column stats (including avgColLen) when it 
> encounters a UDF.
> In the case of upper and lower, Hive sets avgColLen to 
> hive.stats.max.variable.length.
> But these UDFs do not change column stats, and the default value (100) is too 
> high for string-type key columns, on which upper/lower are usually applied.
> This patch keeps the input data's avgColLen after applying UDF upper/lower to 
> make a better query plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28196) Preserve column stats when applying UDF upper/lower.

2024-04-10 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-28196:

Affects Version/s: 4.0.0

> Preserve column stats when applying UDF upper/lower.
> 
>
> Key: HIVE-28196
> URL: https://issues.apache.org/jira/browse/HIVE-28196
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
>
> Current Hive re-estimates column stats (including avgColLen) when it 
> encounters a UDF.
> In the case of upper and lower, Hive sets avgColLen to 
> hive.stats.max.variable.length.
> But these UDFs do not change column stats, and the default value (100) is too 
> high for string-type key columns, on which upper/lower are usually applied.
> This patch keeps the input data's avgColLen after applying UDF upper/lower to 
> make a better query plan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-28004) DELETE on ACID table failed with NoClassDefFoundError: com/sun/tools/javac/util/List

2024-01-16 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807188#comment-17807188
 ] 

Sungwoo Park commented on HIVE-28004:
-

A similar problem was reported in HIVE-27929, where we tested loading 1TB ORC 
TPC-DS data with table property transactional=true.

> DELETE on ACID table failed with NoClassDefFoundError: 
> com/sun/tools/javac/util/List
> 
>
> Key: HIVE-28004
> URL: https://issues.apache.org/jira/browse/HIVE-28004
> Project: Hive
>  Issue Type: Bug
>Reporter: Butao Zhang
>Priority: Major
>
> I am not sure if it is a bug or a usage question.
> Test on Hive master branch:
>  
> {code:java}
> set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.support.concurrency = true;
> create table testacid4(id int) stored as orc TBLPROPERTIES 
> ('transactional'='true');
> delete from testacid4 where id=110;
> {code}
>  
> *beeline console shows error:*
> {code:java}
> 0: jdbc:hive2://127.0.0.1:1/default> delete from testacid4 where id=110;
> INFO  : Compiling 
> command(queryId=hive_20240116180628_ec5ac4d8-473b-4b42-b0dd-eecebec71268): 
> delete from testacid4 where id=110
> INFO  : Semantic Analysis Completed (retrial = false)
> INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:row__id, 
> type:struct, comment:null)], 
> properties:null)
> INFO  : Completed compiling 
> command(queryId=hive_20240116180628_ec5ac4d8-473b-4b42-b0dd-eecebec71268); 
> Time taken: 3.554 seconds
> INFO  : Operation QUERY obtained 1 locks
> ERROR : FAILED: Hive Internal Error: 
> org.apache.hadoop.hive.ql.lockmgr.LockException(org.apache.thrift.TApplicationException:
>  Internal error processing get_latest_txnid_in_conflict)
> org.apache.hadoop.hive.ql.lockmgr.LockException: 
> org.apache.thrift.TApplicationException: Internal error processing 
> get_latest_txnid_in_conflict
>         at 
> org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.getLatestTxnIdInConflict(DbTxnManager.java:1055)
>         at 
> org.apache.hadoop.hive.ql.DriverTxnHandler.isValidTxnListState(DriverTxnHandler.java:435)
>         at org.apache.hadoop.hive.ql.Driver.validateTxnList(Driver.java:250)
>         at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:199)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
>         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
>         at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185)
>         at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:236)
>         at 
> org.apache.hive.service.cli.operation.SQLOperation.access$500(SQLOperation.java:90)
>         at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:336)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
>         at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:356)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.thrift.TApplicationException: Internal error processing 
> get_latest_txnid_in_conflict
>         at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
>         at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_latest_txnid_in_conflict(ThriftHiveMetastore.java:6404)
>         at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_latest_txnid_in_conflict(ThriftHiveMetastore.java:6391)
>         at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getLatestTxnIdInConflict(HiveMetaStoreClient.java:4421)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:213)
>         at com.sun.proxy.$Proxy32.getLatestTxnIdInConflict(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethod

[jira] [Commented] (HIVE-27929) Run TPC-DS queries and validate results correctness

2024-01-16 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807187#comment-17807187
 ] 

Sungwoo Park commented on HIVE-27929:
-

For loading TPC-DS 1TB ORC data with transactional=true, the same 
problem is reported in HIVE-28004. So, HIVE-28004 could be marked as a blocker.




> Run TPC-DS queries and validate results correctness
> ---
>
> Key: HIVE-27929
> URL: https://issues.apache.org/jira/browse/HIVE-27929
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Denys Kuzmenko
>Assignee: Simhadri Govindappa
>Priority: Major
>
> release branch: *branch-4.0*
> https://github.com/apache/hive/tree/branch-4.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27929) Run TPC-DS queries and validate results correctness

2024-01-15 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807038#comment-17807038
 ] 

Sungwoo Park commented on HIVE-27929:
-

[~dkuzmenko] 
1. It is a Tez change. If Hive jars appear before Tez jars in the classpath, I 
think this change is unnecessary.
2. I don't know because we did not analyze why the execution got stuck. The 
second attempt succeeded, so the configuration that we used was perhaps okay.
3. Yes, and we wanted to double-check that hive.auto.convert.anti.join=true is 
okay.
4. I am not sure if this is caused by the same error. Let me try again after 
HIVE-27996 is fixed.


> Run TPC-DS queries and validate results correctness
> ---
>
> Key: HIVE-27929
> URL: https://issues.apache.org/jira/browse/HIVE-27929
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Denys Kuzmenko
>Assignee: Simhadri Govindappa
>Priority: Major
>
> release branch: *branch-4.0*
> https://github.com/apache/hive/tree/branch-4.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27929) Run TPC-DS queries and validate results correctness

2024-01-12 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17806248#comment-17806248
 ] 

Sungwoo Park commented on HIVE-27929:
-

=== Build Tez 0.10.2 --> okay

pom.xml should be updated as follows. Without this update, Hive generates 
ClassNotFoundException if Tez jars appear before Hive jars in the classpath.

+<dependency>
+  <groupId>org.apache.commons</groupId>
+  <artifactId>commons-lang3</artifactId>
+  <version>3.12.0</version>
+</dependency>

=== Build Hive branch-4.0 --> okay

We use the latest commit as of Jan 12:

commit f355c82a5aa77ef1496b35c22b8ac9b84dfe1780
Author: seonggon 
Date:   Wed Jan 10 18:15:39 2024 +0900

HIVE-27988: Don't convert FullOuterJoin with filter to MapJoin (Seonggon 
Namgung, reviewed by Denys Kuzmenko)

=== Run Metastore and HiveServer2 --> okay

Use hive.execution.engine=tez and hive.execution.mode=container

=== Loading TPC-DS 1TB text data in external tables --> okay

Example:

create external table catalog_page(
  cp_catalog_page_sk        bigint
, cp_catalog_page_id        string
, cp_start_date_sk          bigint
, cp_end_date_sk            bigint
, cp_department             string
, cp_catalog_number         int
, cp_catalog_page_number    int
, cp_description            string
, cp_type                   string
)
row format delimited fields terminated by '|'
location 'hdfs://blue0:8020/tmp/tpcds-generate/1000/catalog_page';

=== Loading TPC-DS 1TB ORC data, transactional=false --> okay

Example:

create table catalog_page
stored as orc
TBLPROPERTIES('transactional'='false', 'transactional_properties'='default')
as select * from tpcds_text_1000_tez.catalog_page;

=== Run TPC-DS --> all okay with correct results

Query 64 gets stuck on the first attempt (killed after running for 6000 
seconds). It succeeds on the second attempt.

Some configuration keys in hive-site.xml that affected the correctness in 
earlier versions:

-- HIVE-26621
<property>
  <name>hive.optimize.shared.work.dppunion</name>
  <value>false</value>
</property>

<property>
  <name>hive.optimize.shared.work.dppunion.merge.eventops</name>
  <value>false</value>
</property>

<property>
  <name>hive.optimize.shared.work.downstream.merge</name>
  <value>false</value>
</property>

<property>
  <name>hive.optimize.shared.work.parallel.edge.support</name>
  <value>false</value>
</property>

<property>
  <name>hive.optimize.shared.work.merge.ts.schema</name>
  <value>false</value>
</property>

<property>
  <name>hive.optimize.cte.materialize.threshold</name>
  <value>-1</value>
</property>

<property>
  <name>hive.tez.bloom.filter.merge.threads</name>
  <value>0</value>
</property>

<property>
  <name>hive.auto.convert.anti.join</name>
  <value>false</value>
</property>


=== Test with hive.auto.convert.anti.join=true (HIVE-26659) --> okay

Query 16: correct
Query 69: correct
Query 94: correct

So, we can set hive.auto.convert.anti.join=true.

=== Test with hive.tez.bloom.filter.merge.threads=2/4 (HIVE-26655) --> okay

Query 17, hive.tez.bloom.filter.merge.threads=0: 173.713 seconds
Query 17, hive.tez.bloom.filter.merge.threads=2: 184.626 seconds
Query 17, hive.tez.bloom.filter.merge.threads=4: 184.726 seconds

=== Test with hive.optimize.shared.work.dppunion=true (HIVE-26621) --> okay

Query 2: correct

=== Loading TPC-DS 1TB ORC data, transactional=true --> fail

Error: Error while compiling statement: FAILED: Hive Internal Error: 
org.apache.hadoop.hive.ql.lockmgr.LockException(org.apache.thrift.TApplicationException:
 Internal error processing get_latest_txnid_in_conflict) (state=42000,code=13)

This error originates from Metastore:

24/01/13 03:32:33 ERROR thrift.ProcessFunction: Internal error processing 
get_latest_txnid_in_conflict
java.lang.RuntimeException: java.lang.NoClassDefFoundError: 
com/sun/tools/javac/util/List
  at 
org.apache.hadoop.hive.metastore.txn.TransactionalRetryProxy.lambda$invoke$6(TransactionalRetryProxy.java:182)
...
Caused by: java.lang.ClassNotFoundException: com.sun.tools.javac.util.List
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 35 more





> Run TPC-DS queries and validate results correctness
> ---
>
> Key: HIVE-27929
> URL: https://issues.apache.org/jira/browse/HIVE-27929
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Denys Kuzmenko
>Assignee: Simhadri Govindappa
>Priority: Major
>
> release branch: *branch-4.0*
> https://github.com/apache/hive/tree/branch-4.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HIVE-27929) Run TPC-DS queries and validate results correctness

2024-01-10 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805052#comment-17805052
 ] 

Sungwoo Park edited comment on HIVE-27929 at 1/10/24 10:18 AM:
---

Could someone suggest the Tez version or Tez commit that currently works with 
branch-4.0?

In pom.xml in branch branch-4.0, we find:
<tez.version>0.10.2</tez.version>

I tried to build Hive on Tez using Tez release 0.10.2, and it failed.

I instead tried the latest commit in Tez master branch (TEZ-4039: Addendum: Tez 
should inject dag id, query id into MDC), and I can build and run Metastore and 
HiveServer2.

However, I see some strange error while loading 1TB TPC-DS ORC (which did not 
occur when I tried Hive 4 last time). I don't think this is a problem related 
to Tez, but just want to make sure that I try a combination of Hive and TEZ 
that is known to work.

{code:java}
Caused by: org.apache.thrift.TApplicationException: Internal error processing 
get_latest_txnid_in_conflict
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
  at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_latest_txnid_in_conflict(ThriftHiveMetastore.java:6404)
{code}



was (Author: glapark):
Could someone suggest the Tez version or Tez commit that currently works with 
branch-4.0?

In pom.xml in branch branch-4.0, we find:
<tez.version>0.10.2</tez.version>

I tried to build Hive on Tez using Tez release 0.10.2, and it failed.

I instead tried the latest commit in Tez master branch (TEZ-4039: Addendum: Tez 
should inject dag id, query id into MDC), and I can build and run Metastore and 
HiveServer2.

However, I see some strange error while loading 1TB TPC-DS ORC (which did not 
occur when I tried Hive 4 last time). I don't think this is a problem related 
to Tez, but just want to make sure that I try a combination of Hive and TEZ 
that is know to work.

{code:java}
Caused by: org.apache.thrift.TApplicationException: Internal error processing 
get_latest_txnid_in_conflict
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
  at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_latest_txnid_in_conflict(ThriftHiveMetastore.java:6404)
{code}


> Run TPC-DS queries and validate results correctness
> ---
>
> Key: HIVE-27929
> URL: https://issues.apache.org/jira/browse/HIVE-27929
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Denys Kuzmenko
>Assignee: Simhadri Govindappa
>Priority: Major
>
> release branch: *branch-4.0*
> https://github.com/apache/hive/tree/branch-4.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27929) Run TPC-DS queries and validate results correctness

2024-01-10 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805052#comment-17805052
 ] 

Sungwoo Park commented on HIVE-27929:
-

Could someone suggest the Tez version or Tez commit that currently works with 
branch-4.0?

In pom.xml in branch branch-4.0, we find:
0.10.2

I tried to build Hive on Tez using Tez release 0.10.2, and it failed.

I instead tried the latest commit in Tez master branch (TEZ-4039: Addendum: Tez 
should inject dag id, query id into MDC), and I can build and run Metastore and 
HiveServer2.

However, I see some strange error while loading 1TB TPC-DS ORC (which did not 
occur when I tried Hive 4 last time). I don't think this is a problem related 
to Tez, but just want to make sure that I try a combination of Hive and TEZ 
that is known to work.

{code:java}
Caused by: org.apache.thrift.TApplicationException: Internal error processing 
get_latest_txnid_in_conflict
  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
  at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_latest_txnid_in_conflict(ThriftHiveMetastore.java:6404)
{code}


> Run TPC-DS queries and validate results correctness
> ---
>
> Key: HIVE-27929
> URL: https://issues.apache.org/jira/browse/HIVE-27929
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Denys Kuzmenko
>Assignee: Simhadri Govindappa
>Priority: Major
>
> release branch: *branch-4.0*
> https://github.com/apache/hive/tree/branch-4.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27952) Hive fails to create SslContextFactory when KeyStore has multiple certificates

2023-12-13 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27952:

Fix Version/s: 4.0.0

> Hive fails to create SslContextFactory when KeyStore has multiple certificates
> --
>
> Key: HIVE-27952
> URL: https://issues.apache.org/jira/browse/HIVE-27952
> Project: Hive
>  Issue Type: Bug
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> With Jetty 9.4.40, we should call SslContextFactory.Server(), instead of 
> SslContextFactory(), to create SslContextFactory. Otherwise we get the 
> following error when using a KeyStore with multiple certificates in it.
> {code:java}
> Caused by: java.lang.IllegalStateException: KeyStores with multiple 
> certificates are not supported on the base class 
> org.eclipse.jetty.util.ssl.SslContextFactory. (Use 
> org.eclipse.jetty.util.ssl.SslContextFactory$Server or 
> org.eclipse.jetty.util.ssl.SslContextFactory$Client instead) {code}
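> A minimal sketch of the suggested change (Jetty 9.4.x; the helper class and the 
> keyStorePath/keyStorePassword parameters are only placeholders for whatever the 
> caller passes in):
> {code:java}
> import org.eclipse.jetty.util.ssl.SslContextFactory;
> 
> // Sketch: build a server-side SslContextFactory so that a KeyStore holding
> // multiple certificates is accepted (the base class rejects it).
> public final class SslFactorySketch {
>   static SslContextFactory.Server newServerSslFactory(String keyStorePath,
>                                                       String keyStorePassword) {
>     // Before the fix, "new SslContextFactory()" used the base class and
>     // failed with the IllegalStateException shown above.
>     SslContextFactory.Server factory = new SslContextFactory.Server();
>     factory.setKeyStorePath(keyStorePath);
>     factory.setKeyStorePassword(keyStorePassword);
>     return factory;
>   }
> }
> {code}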



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27952) Hive fails to create SslContextFactory when KeyStore has multiple certificates

2023-12-13 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27952:

Affects Version/s: 4.0.0-beta-1

> Hive fails to create SslContextFactory when KeyStore has multiple certificates
> --
>
> Key: HIVE-27952
> URL: https://issues.apache.org/jira/browse/HIVE-27952
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-beta-1
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> With Jetty 9.4.40, we should call SslContextFactory.Server(), instead of 
> SslContextFactory(), to create SslContextFactory. Otherwise we get the 
> following error when using a KeyStore with multiple certificates in it.
> {code:java}
> Caused by: java.lang.IllegalStateException: KeyStores with multiple 
> certificates are not supported on the base class 
> org.eclipse.jetty.util.ssl.SslContextFactory. (Use 
> org.eclipse.jetty.util.ssl.SslContextFactory$Server or 
> org.eclipse.jetty.util.ssl.SslContextFactory$Client instead) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27948) Wrong results when using materialized views with non-deterministic/dynamic functions

2023-12-11 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27948:

Description: 
There are certain SQL functions that return different results across different 
executions. Usually we refer to these functions as non-deterministic or 
dynamic. Some examples are: UNIX_TIMESTAMP(), CURRENT_TIMESTAMP, CURRENT_DATE, 
etc.

When a materialized view definition contains such functions the queries that 
are using this view may return wrong results.

Consider the following scenario where we populate the employee table with 
timestamps representing the future. To make this easily reproducible in a 
self-contained test, the timestamps are only a few seconds apart.
{code:sql}
CREATE TABLE EMPS (ENAME STRING, BIRTH_EPOCH_SECS INT) STORED AS ORC 
TBLPROPERTIES ('transactional'='true');

INSERT INTO EMPS
VALUES ('Victor', UNIX_TIMESTAMP()),
   ('Alex', UNIX_TIMESTAMP() + 2),
   ('Bob', UNIX_TIMESTAMP() + 5),
   ('Alice', UNIX_TIMESTAMP() + 10);

CREATE MATERIALIZED VIEW v_emp AS SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= 
UNIX_TIMESTAMP();
{code}
When the materialized view is created it is populated with only the rows that 
match the timestamp at the given time.

To demonstrate the problem, run the following queries with view-based rewriting 
disabled and enabled.
{code:sql}
set hive.materializedview.rewriting.sql=false;
SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= UNIX_TIMESTAMP();
{code}
{noformat}
Victor  1702302786
Alex    1702302788
Bob 1702302791
{noformat}
{code:sql}
set hive.materializedview.rewriting.sql=true;
SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= UNIX_TIMESTAMP();
{code}
{noformat}
Victor  1702302786
Alex    1702302788
{noformat}
Naturally, the second query should return more rows than the first one since 
UNIX_TIMESTAMP is constantly growing. However, when view-based rewriting is in 
use, the second query will use the results from the materialized view, which are 
by now obsolete (missing the Bob entry).

  was:
{color:red}colored text{color}There are certain SQL functions that return 
different results across different executions. Usually we refer to these 
functions as non-deterministic or dynamic. Some examples are: UNIX_TIMESTAMP(), 
CURRENT_TIMESTAMP, CURRENT_DATE, etc.

When a materialized view definition contains such functions the queries that 
are using this view may return wrong results.

Consider the following scenario where we populate the employee table with 
timestamps representing the future. To make this easily reproducible in a 
self-contained test, the timestamps are only a few seconds apart.
{code:sql}
CREATE TABLE EMPS (ENAME STRING, BIRTH_EPOCH_SECS INT) STORED AS ORC 
TBLPROPERTIES ('transactional'='true');

INSERT INTO EMPS
VALUES ('Victor', UNIX_TIMESTAMP()),
   ('Alex', UNIX_TIMESTAMP() + 2),
   ('Bob', UNIX_TIMESTAMP() + 5),
   ('Alice', UNIX_TIMESTAMP() + 10);

CREATE MATERIALIZED VIEW v_emp AS SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= 
UNIX_TIMESTAMP();
{code}
When the materialized view is created it is populated with only the rows that 
match the timestamp at the given time.

To demonstrate the problem, run the following queries with view-based rewriting 
disabled and enabled.
{code:sql}
set hive.materializedview.rewriting.sql=false;
SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= UNIX_TIMESTAMP();
{code}
{noformat}
Victor  1702302786
Alex    1702302788
Bob 1702302791
{noformat}
{code:sql}
set hive.materializedview.rewriting.sql=true;
SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= UNIX_TIMESTAMP();
{code}
{noformat}
Victor  1702302786
Alex    1702302788
{noformat}
Naturally, the second query should return more rows than the first one since 
UNIX_TIMESTAMP is constantly growing. However, when view-based rewriting is in 
use, the second query will use the results from the materialized view, which are 
by now obsolete (missing the Bob entry).


> Wrong results when using materialized views with non-deterministic/dynamic 
> functions
> 
>
> Key: HIVE-27948
> URL: https://issues.apache.org/jira/browse/HIVE-27948
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: Stamatis Zampetakis
>Priority: Major
> Attachments: materialized_view_unix_timestamp.q
>
>
> There are certain SQL functions that return different results across 
> different executions. Usually we refer to these functions as 
> non-deterministic or dynamic. Some examples are: UNIX_TIMESTAMP(), 
> CURRENT_TIMESTAMP, CURRENT_DATE, etc.
> When a materialized view definition contains such functions the queries that 
> are using this view may return wrong results.
> Consider the following scenario where we populate the employee table with 
> timestamps represent

[jira] [Updated] (HIVE-27948) Wrong results when using materialized views with non-deterministic/dynamic functions

2023-12-11 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27948:

Description: 
{color:red}colored text{color}There are certain SQL functions that return 
different results across different executions. Usually we refer to these 
functions as non-deterministic or dynamic. Some examples are: UNIX_TIMESTAMP(), 
CURRENT_TIMESTAMP, CURRENT_DATE, etc.

When a materialized view definition contains such functions the queries that 
are using this view may return wrong results.

Consider the following scenario where we populate the employee table with 
timestamps representing the future. To make this easily reproducible in a 
self-contained test, the timestamps are only a few seconds apart.
{code:sql}
CREATE TABLE EMPS (ENAME STRING, BIRTH_EPOCH_SECS INT) STORED AS ORC 
TBLPROPERTIES ('transactional'='true');

INSERT INTO EMPS
VALUES ('Victor', UNIX_TIMESTAMP()),
   ('Alex', UNIX_TIMESTAMP() + 2),
   ('Bob', UNIX_TIMESTAMP() + 5),
   ('Alice', UNIX_TIMESTAMP() + 10);

CREATE MATERIALIZED VIEW v_emp AS SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= 
UNIX_TIMESTAMP();
{code}
When the materialized view is created it is populated with only the rows that 
match the timestamp at the given time.

To demonstrate the problem, run the following queries with view-based rewriting 
disabled and enabled.
{code:sql}
set hive.materializedview.rewriting.sql=false;
SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= UNIX_TIMESTAMP();
{code}
{noformat}
Victor  1702302786
Alex    1702302788
Bob 1702302791
{noformat}
{code:sql}
set hive.materializedview.rewriting.sql=true;
SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= UNIX_TIMESTAMP();
{code}
{noformat}
Victor  1702302786
Alex    1702302788
{noformat}
Naturally, the second query should return more rows than the first one since 
UNIX_TIMESTAMP is constantly growing. However, when view-based rewriting is in 
use, the second query will use the results from the materialized view, which are 
by now obsolete (missing the Bob entry).

  was:
There are certain SQL functions that return different results across different 
executions. Usually we refer to these functions as non-deterministic or 
dynamic. Some examples are: UNIX_TIMESTAMP(), CURRENT_TIMESTAMP, CURRENT_DATE, 
etc.

When a materialized view definition contains such functions the queries that 
are using this view may return wrong results.

Consider the following scenario where we populate the employee table with 
timestamps representing the future. To make this easily reproducible in a 
self-contained test, the timestamps are only a few seconds apart.
{code:sql}
CREATE TABLE EMPS (ENAME STRING, BIRTH_EPOCH_SECS INT) STORED AS ORC 
TBLPROPERTIES ('transactional'='true');

INSERT INTO EMPS
VALUES ('Victor', UNIX_TIMESTAMP()),
   ('Alex', UNIX_TIMESTAMP() + 2),
   ('Bob', UNIX_TIMESTAMP() + 5),
   ('Alice', UNIX_TIMESTAMP() + 10);

CREATE MATERIALIZED VIEW v_emp AS SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= 
UNIX_TIMESTAMP();
{code}
When the materialized view is created it is populated with only the rows that 
match the timestamp at the given time.

To demonstrate the problem, run the following queries with view-based rewriting 
disabled and enabled.
{code:sql}
set hive.materializedview.rewriting.sql=false;
SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= UNIX_TIMESTAMP();
{code}
{noformat}
Victor  1702302786
Alex    1702302788
Bob 1702302791
{noformat}
{code:sql}
set hive.materializedview.rewriting.sql=true;
SELECT * FROM EMPS WHERE BIRTH_EPOCH_SECS <= UNIX_TIMESTAMP();
{code}
{noformat}
Victor  1702302786
Alex    1702302788
{noformat}
Naturally, the second query should return more rows than the first one since 
UNIX_TIMESTAMP is constantly growing. However, when view-based rewriting is in 
use, the second query will use the results from the materialized view, which are 
by now obsolete (missing the Bob entry).


> Wrong results when using materialized views with non-deterministic/dynamic 
> functions
> 
>
> Key: HIVE-27948
> URL: https://issues.apache.org/jira/browse/HIVE-27948
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: Stamatis Zampetakis
>Priority: Major
> Attachments: materialized_view_unix_timestamp.q
>
>
> {color:red}colored text{color}There are certain SQL functions that return 
> different results across different executions. Usually we refer to these 
> functions as non-deterministic or dynamic. Some examples are: 
> UNIX_TIMESTAMP(), CURRENT_TIMESTAMP, CURRENT_DATE, etc.
> When a materialized view definition contains such functions the queries that 
> are using this view may return wrong results.
> Consider the following scenario where we populate the employee tabl

[jira] [Commented] (HIVE-27226) FullOuterJoin with filter expressions is not computed correctly

2023-11-23 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789199#comment-17789199
 ] 

Sungwoo Park commented on HIVE-27226:
-

[~dkuzmenko] I think this is a valid test case because Hive returns wrong 
results. (Seonggon, could you confirm this?) 

From my own testing, HIVE-18908 (of 2018) introduced this bug. For example, if 
you backport HIVE-18908 (along with HIVE-21288 and HIVE-21923), full outer 
joins run noticeably faster on the 10TB TPC-DS benchmark, but they start to return 
wrong values when used with filter expressions.

> FullOuterJoin with filter expressions is not computed correctly
> ---
>
> Key: HIVE-27226
> URL: https://issues.apache.org/jira/browse/HIVE-27226
> Project: Hive
>  Issue Type: Bug
>Reporter: Seonggon Namgung
>Priority: Major
>  Labels: hive-4.0.0-must
>
> I tested many OuterJoin queries as an extension of HIVE-27138, and I found 
> that Hive returns incorrect results for a query containing FullOuterJoin with 
> filter expressions. In a nutshell, all JoinOperators that run on the Tez engine 
> return incorrect results for OuterJoin queries, and one of the reasons for the 
> incorrect computation comes from CommonJoinOperator, which is the base of all 
> JoinOperators. I attached the queries and configuration that I used at the 
> bottom of the document. I am still inspecting these problems, and I will share 
> an update once I find another reason. Any comments and opinions 
> would be appreciated.
> First of all, I observed that current Hive ignores filter expressions 
> contained in MapJoinOperator. For example, the attached result of query1 
> shows that MapJoinOperator performs inner join, not full outer join. This 
> problem stems from the removal of the filterMap. When converting a JoinOperator to 
> a MapJoinOperator, ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin() 
> removes the filterMap of the MapJoinOperator. Because MapJoinOperator does not 
> evaluate filter expressions if the filterMap is null, this change makes 
> MapJoinOperator ignore filter expressions, and it always joins tables 
> regardless of whether they satisfy the filter expressions or not. To solve this 
> problem, I disable FullOuterMapJoinOptimization and apply the patch for 
> HIVE-27138, which prevents the NPE. (The patch is available at the following 
> link: LINK.) The rest of this document uses this modified Hive, but most of 
> the problems also occur with current Hive.
> The second problem I found is that Hive returns the same left-null or 
> right-null rows multiple times when it uses MapJoinOperator or 
> CommonMergeJoinOperator. This is caused by the logic of the current 
> CommonJoinOperator. Both of these JoinOperators join tables in 2 steps. 
> First, they create RowContainers, each of which is a group of rows from one 
> table that share the same key. Second, they call 
> CommonJoinOperator#checkAndGenObject() with the created RowContainers. This 
> method checks the filterTag of each row in the RowContainers and forwards a 
> joined row if all filter conditions are met. For OuterJoin, checkAndGenObject() 
> forwards non-matching rows if there is no matching row in the RowContainer. The 
> problem happens when there are multiple RowContainers for the same key and 
> table. For example, suppose that there are two left RowContainers and one 
> right RowContainer. If none of the rows in the two left RowContainers satisfies 
> the filter condition, then checkAndGenObject() will forward a Left-Null row for 
> each right row. Because checkAndGenObject() is called with each left 
> RowContainer, there will be two duplicated Left-Null rows for every right row.
> In the case of MapJoinOperator, it always creates a singleton RowContainer for 
> the big table. Therefore, it always produces duplicated non-matching rows. 
> CommonMergeJoinOperator also creates multiple RowContainers for the big table, 
> whose size is hive.join.emit.interval. In the experiment below, I also set 
> hive.join.shortcut.unmatched.rows=false and hive.exec.reducers.max=1 to 
> disable the specialized algorithm for OuterJoin of 2 tables and to force calling 
> checkAndGenObject() before all rows with the same keys are gathered. I didn't 
> observe this problem when using VectorMapJoinOperator, and I will inspect 
> VectorMapJoinOperator to see whether we can reproduce the problem with it.
> I think the second problem is not limited to FullOuterJoin, but I couldn't 
> find such a query as of now. This will also be added to this issue if I can 
> write a query that reproduces the second problem without FullOuterJoin.
> I also found that Hive returns a wrong result for query2 even when I used 
> VectorMapJoinOperator. I am still inspecting this problem and I will add an 
> update on it when I find the reason.
>  
> Experiment:
>  
> {code:java}
> ---

[jira] [Commented] (HIVE-27899) Killed speculative execution task attempt should not commit file

2023-11-22 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788971#comment-17788971
 ] 

Sungwoo Park commented on HIVE-27899:
-

Calling canCommit() may not be a complete solution. For example, can we have a 
bad scenario like this?

TaskAttempt#1 calls canCommit(), writes output, and then fails for some reason.
Later TaskAttempt#2 calls canCommit(), writes output, and then completes 
successfully.
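
A purely illustrative sketch of this race (the class and method names below are 
hypothetical and do not refer to actual Hive or Tez APIs; amGrantsCommit stands 
in for the canCommit() call):
{code:java}
// Hypothetical sketch: attempt #1 gets commit permission, writes, then fails;
// attempt #2 later gets permission too, writes, and succeeds. Without extra
// cleanup, both output files can survive, producing duplicate data.
public final class SpeculativeCommitRace {

  static void runAttempt(int attemptId, boolean failAfterWrite) {
    if (!amGrantsCommit(attemptId)) {
      return;                      // losing attempt: must not produce committed output
    }
    writeTaskOutput(attemptId);    // output lands in the task's directory
    if (failAfterWrite) {
      throw new IllegalStateException("attempt " + attemptId + " failed after writing");
    }
  }

  static boolean amGrantsCommit(int attemptId) { return true; } // stand-in for canCommit()
  static void writeTaskOutput(int attemptId) { /* stand-in for the file write */ }

  public static void main(String[] args) {
    try {
      runAttempt(1, true);         // TaskAttempt#1: permitted, writes, then fails
    } catch (IllegalStateException expected) {
      // the failed attempt's output file is still on disk
    }
    runAttempt(2, false);          // TaskAttempt#2: permitted, writes, succeeds
  }
}
{code}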


> Killed speculative execution task attempt should not commit file
> 
>
> Key: HIVE-27899
> URL: https://issues.apache.org/jira/browse/HIVE-27899
> Project: Hive
>  Issue Type: Bug
>  Components: Tez
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Major
> Attachments: reproduce_bug.md
>
>
> As I mentioned in HIVE-25561, when tez turns on speculative execution, the 
> data file produced by hive may be duplicated. I mentioned in HIVE-25561 that 
> if the speculatively executed task is killed, some data may be submitted 
> unexpectedly. However, after HIVE-25561, there is still a situation that has 
> not been solved. If two task attempts commit file at the same time, the 
> problem of duplicate data files may also occur. Although the probability of 
> this happening is very, very low, it does happen.
>  
> Why?
> There are two key steps:
> (1)FileSinkOperator::closeOp
> TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp 
> --> fsp.commit
> When the OP is closed, the process of closing the OP will be triggered, and 
> eventually the call to fsp.commit will be triggered.
> (2)removeTempOrDuplicateFiles
> (2.a)Firstly, listStatus the files in the temporary directory. 
> (2.b)Secondly, check whether there are multiple incorrect commits, and finally 
> move the correct results to the final directory.
> When speculative execution is enabled, when one attempt of a Task is 
> completed, other attempts will be killed. However, AM only sends the kill 
> event and does not ensure that all cleanup actions are completed, that is, 
> closeOp may be executed between 2.a and 2.b. Therefore, 
> removeTempOrDuplicateFiles will not delete the file generated by the killed 
> attempt.
> How?
> The problem is that both speculatively executed tasks commit the file. This 
> will not happen in the Tez examples because they will try canCommit, which 
> can guarantee that one and only one task attempt commits successfully. If one 
> task attempt executes canCommit successfully, the other one will be stuck by 
> canCommit until it receives a kill signal.
> detail see: 
> [https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/SimpleMRProcessor.java#L70]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26660) TPC-DS query 71 returns wrong results

2023-11-17 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787220#comment-17787220
 ] 

Sungwoo Park commented on HIVE-26660:
-

Actually HIVE-26986 was created to solve the problem described in this JIRA.

> TPC-DS query 71 returns wrong results
> -
>
> Key: HIVE-26660
> URL: https://issues.apache.org/jira/browse/HIVE-26660
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sungwoo Park
>Priority: Major
>
> TPC-DS query 71 returns wrong results when tested with 100GB dataset. The 
> query fails with an error:
>  
> Caused by: org.apache.hadoop.hive.common.NoDynamicValuesException: Value does 
> not exist in registry: RS_39_item_i_item_sk_min
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DynamicValueRegistryTez.getValue(DynamicValueRegistryTez.java:77)
>     at 
> org.apache.hadoop.hive.ql.plan.DynamicValue.getValue(DynamicValue.java:128)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.expressions.gen.FilterLongColumnBetweenDynamicValue.evaluate(FilterLongColumnBetweenDynamicValue.java:88)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.expressions.FilterExprAndExpr.evaluate(FilterExprAndExpr.java:42)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:125)
>     at 
> org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
>     at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:171)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:809)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:842)
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27858) OOM happens when selecting many columns and JOIN.

2023-11-07 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783905#comment-17783905
 ] 

Sungwoo Park commented on HIVE-27858:
-

Do you remember approximately how long it took to finish the query in Hive 2?


> OOM happens when selecting many columns and  JOIN.
> --
>
> Key: HIVE-27858
> URL: https://issues.apache.org/jira/browse/HIVE-27858
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 4.0.0-beta-1
>Reporter: Ryu Kobayashi
>Priority: Major
> Attachments: ddl.sql, query.sql
>
>
> OOM happens when executing [^query.sql] using a table in [^ddl.sql]. This 
> did not happen in Hive 2 previously.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27303) select query result is different when enable/disable mapjoin with UNION ALL

2023-08-23 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758371#comment-17758371
 ] 

Sungwoo Park commented on HIVE-27303:
-

Could someone update the commit log so that Seonggon gets credit for the pull 
request resolving this JIRA?

> select query result is different when enable/disable mapjoin with UNION ALL
> ---
>
> Key: HIVE-27303
> URL: https://issues.apache.org/jira/browse/HIVE-27303
> Project: Hive
>  Issue Type: Bug
>Reporter: Mahesh Raju Somalaraju
>Assignee: Mahesh Raju Somalaraju
>Priority: Major
>  Labels: pull-request-available
>
> select query result is different when enable/disable mapjoin with UNION ALL
> Below are the steps to reproduce.
> As per the query, when map join is disabled it should not return (duplicate) rows. 
> The same works fine with map.join=true.
> Expected result: empty rows.
> Problem: duplicate rows are returned.
> Steps:
> --
> SET hive.server2.tez.queue.access.check=true;
> SET tez.queue.name=default
> SET hive.query.results.cache.enabled=false;
> SET hive.fetch.task.conversion=none;
> SET hive.execution.engine=tez;
> SET hive.stats.autogather=true;
> SET hive.server2.enable.doAs=false;
> SET hive.auto.convert.join=false;
> drop table if exists hive1_tbl_data;
> drop table if exists hive2_tbl_data;
> drop table if exists hive3_tbl_data;
> drop table if exists hive4_tbl_data;
> CREATE EXTERNAL TABLE hive1_tbl_data (COLUMID string,COLUMN_FN 
> string,COLUMN_LN string,EMAIL string,COL_UPDATED_DATE timestamp, PK_COLUM 
> string) 
>  ROW FORMAT SERDE                                   
>    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  
>  STORED AS INPUTFORMAT                              
>    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  
>  OUTPUTFORMAT                                       
>    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
>  TBLPROPERTIES (                                    
>    'TRANSLATED_TO_EXTERNAL'='true',                 
>    'bucketing_version'='2',                         
>    'external.table.purge'='true',                   
>    'parquet.compression'='SNAPPY');
> CREATE EXTERNAL TABLE hive2_tbl_data (COLUMID string,COLUMN_FN 
> string,COLUMN_LN string,EMAIL string,COL_UPDATED_DATE timestamp, PK_COLUM 
> string) 
>  ROW FORMAT SERDE                                   
>    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  
>  STORED AS INPUTFORMAT                              
>    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  
>  OUTPUTFORMAT                                       
>    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
>  TBLPROPERTIES (                                    
>    'TRANSLATED_TO_EXTERNAL'='true',                 
>    'bucketing_version'='2',                         
>    'external.table.purge'='true',                   
>    'parquet.compression'='SNAPPY');
> CREATE EXTERNAL TABLE hive3_tbl_data (COLUMID string,COLUMN_FN 
> string,COLUMN_LN string,EMAIL string,COL_UPDATED_DATE timestamp, PK_COLUM 
> string) 
>  ROW FORMAT SERDE                                   
>    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  
>  STORED AS INPUTFORMAT                              
>    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  
>  OUTPUTFORMAT                                       
>    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
>  TBLPROPERTIES (                                    
>    'TRANSLATED_TO_EXTERNAL'='true',                 
>    'bucketing_version'='2',                         
>    'external.table.purge'='true',                   
>    'parquet.compression'='SNAPPY');
>    CREATE EXTERNAL TABLE hive4_tbl_data (COLUMID string,COLUMN_FN 
> string,COLUMN_LN string,EMAIL string,COL_UPDATED_DATE timestamp, PK_COLUM 
> string) 
>  ROW FORMAT SERDE                                   
>    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'  
>  STORED AS INPUTFORMAT                              
>    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'  
>  OUTPUTFORMAT                                       
>    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
>  TBLPROPERTIES (                                    
>    'TRANSLATED_TO_EXTERNAL'='true',                 
>    'bucketing_version'='2',                         
>    'external.table.purge'='true',                   
>    'parquet.compression'='SNAPPY');
>  
> insert into table hive1_tbl_data select 
> '1','john','doe','j...@hotmail.com','2014-01-01 12:01:02','4000-1';
> insert into table hive1_tbl_data select 
> '2','john','doe','j...@hotmail.com','20

[jira] [Assigned] (HIVE-27375) SharedWorkOptimizer assigns a common cache key to MapJoin operators that should not share MapJoin tables

2023-05-26 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park reassigned HIVE-27375:
---

Assignee: Seonggon Namgung

> SharedWorkOptimizer assigns a common cache key to MapJoin operators that 
> should not share MapJoin tables
> 
>
> Key: HIVE-27375
> URL: https://issues.apache.org/jira/browse/HIVE-27375
> Project: Hive
>  Issue Type: Bug
>Reporter: Sungwoo Park
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
>
> When hive.optimize.shared.work.mapjoin.cache.reuse is set to true, 
> SharedWorkOptimizer sometimes assigns a common cache key to MapJoin operators 
> that should not share MapJoin tables. This bug occurs only for MapJoin 
> operators with 3 or more parent operators.
> Example:
> MAPJOIN[575] (RS_83, GBY_66, RS_85)
> MAPJOIN[585] (RS_212, RS_213, GBY_210)
> In this example, both MAPJOIN[575] and MAPJOIN[585] have three parent 
> operators. The current implementation assigns a common cache key to 
> MAPJOIN[575] and MAPJOIN[585] because RS_83 and RS_212 are equivalent.
> However, MAPJOIN[575] uses GBY_66 for its big table whereas MAPJOIN[585] uses 
> GBY_210 for its big table. As a result, the MapJoin table loaded by one 
> operator cannot be used by the other.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27375) SharedWorkOptimizer assigns a common cache key to MapJoin operators that should not share MapJoin tables

2023-05-25 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-27375:
---

 Summary: SharedWorkOptimizer assigns a common cache key to MapJoin 
operators that should not share MapJoin tables
 Key: HIVE-27375
 URL: https://issues.apache.org/jira/browse/HIVE-27375
 Project: Hive
  Issue Type: Bug
Reporter: Sungwoo Park


When hive.optimize.shared.work.mapjoin.cache.reuse is set to true, 
SharedWorkOptimizer sometimes assigns a common cache key to MapJoin operators 
that should not share MapJoin tables. This bug occurs only for MapJoin 
operators with 3 or more parent operators.

Example:
MAPJOIN[575] (RS_83, GBY_66, RS_85)
MAPJOIN[585] (RS_212, RS_213, GBY_210)

In this example, both MAPJOIN[575] and MAPJOIN[585] have three parent 
operators. The current implementation assigns a common cache key to 
MAPJOIN[575] and MAPJOIN[585] because RS_83 and RS_212 are equivalent.

However, MAPJOIN[575] uses GBY_66 for its big table whereas MAPJOIN[585] uses 
GBY_210 for its big table. As a result, the MapJoin table loaded by one 
operator cannot be used by the other.
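
A purely illustrative sketch of the keying problem (none of these classes exist in 
Hive; the signature strings only mimic the example above, with "RS-sig" shared 
because RS_83 and RS_212 are equivalent):
{code:java}
import java.util.List;
import java.util.Objects;

// Illustration only: if the cache key is derived from one (equivalent) parent,
// MAPJOIN[575] and MAPJOIN[585] collide even though their big tables differ
// (GBY_66 vs. GBY_210), so the loaded MapJoin table cannot actually be reused.
public final class MapJoinCacheKeySketch {

  // Stand-in for a MapJoin operator: parent signatures plus the big-table position.
  record MapJoin(List<String> parentSignatures, int bigTablePos) {}

  // Problematic key: considers only the first parent signature.
  static String keyFromOneParent(MapJoin mj) {
    return mj.parentSignatures().get(0);
  }

  // Safer key: covers every parent and which of them is the big table.
  static String keyFromAllParents(MapJoin mj) {
    return mj.parentSignatures() + "#big=" + mj.bigTablePos();
  }

  public static void main(String[] args) {
    MapJoin mj575 = new MapJoin(List.of("RS-sig", "GBY_66", "RS_85"), 1);
    MapJoin mj585 = new MapJoin(List.of("RS-sig", "RS_213", "GBY_210"), 2);

    System.out.println(Objects.equals(keyFromOneParent(mj575), keyFromOneParent(mj585)));   // true: false sharing
    System.out.println(Objects.equals(keyFromAllParents(mj575), keyFromAllParents(mj585))); // false
  }
}
{code}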




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27226) FullOuterJoin with filter expressions is not computed correctly

2023-04-25 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716128#comment-17716128
 ] 

Sungwoo Park commented on HIVE-27226:
-

For the record, the case for MergeJoin should not set 
hive.join.shortcut.unmatched.rows to false because this configuration key is 
for internal use and should be set to true. By setting it to false, we 
effectively revert HIVE-25822.

> FullOuterJoin with filter expressions is not computed correctly
> ---
>
> Key: HIVE-27226
> URL: https://issues.apache.org/jira/browse/HIVE-27226
> Project: Hive
>  Issue Type: Bug
>Reporter: Seonggon Namgung
>Priority: Major
>  Labels: hive-4.0.0-must
>
> I tested many OuterJoin queries as an extension of HIVE-27138, and I found 
> that Hive returns incorrect results for a query containing FullOuterJoin with 
> filter expressions. In a nutshell, all JoinOperators that run on the Tez engine 
> return incorrect results for OuterJoin queries, and one of the reasons for the 
> incorrect computation comes from CommonJoinOperator, which is the base of all 
> JoinOperators. I attached the queries and configuration that I used at the 
> bottom of the document. I am still inspecting these problems, and I will share 
> an update once I find another reason. Any comments and opinions 
> would be appreciated.
> First of all, I observed that current Hive ignores filter expressions 
> contained in MapJoinOperator. For example, the attached result of query1 
> shows that MapJoinOperator performs inner join, not full outer join. This 
> problem stems from the removal of the filterMap. When converting a JoinOperator to 
> a MapJoinOperator, ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin() 
> removes the filterMap of the MapJoinOperator. Because MapJoinOperator does not 
> evaluate filter expressions if the filterMap is null, this change makes 
> MapJoinOperator ignore filter expressions, and it always joins tables 
> regardless of whether they satisfy the filter expressions or not. To solve this 
> problem, I disable FullOuterMapJoinOptimization and apply the patch for 
> HIVE-27138, which prevents the NPE. (The patch is available at the following 
> link: LINK.) The rest of this document uses this modified Hive, but most of 
> the problems also occur with current Hive.
> The second problem I found is that Hive returns the same left-null or 
> right-null rows multiple times when it uses MapJoinOperator or 
> CommonMergeJoinOperator. This is caused by the logic of the current 
> CommonJoinOperator. Both of these JoinOperators join tables in 2 steps. 
> First, they create RowContainers, each of which is a group of rows from one 
> table that share the same key. Second, they call 
> CommonJoinOperator#checkAndGenObject() with the created RowContainers. This 
> method checks the filterTag of each row in the RowContainers and forwards a 
> joined row if all filter conditions are met. For OuterJoin, checkAndGenObject() 
> forwards non-matching rows if there is no matching row in the RowContainer. The 
> problem happens when there are multiple RowContainers for the same key and 
> table. For example, suppose that there are two left RowContainers and one 
> right RowContainer. If none of the rows in the two left RowContainers satisfies 
> the filter condition, then checkAndGenObject() will forward a Left-Null row for 
> each right row. Because checkAndGenObject() is called with each left 
> RowContainer, there will be two duplicated Left-Null rows for every right row.
> In the case of MapJoinOperator, it always creates a singleton RowContainer for 
> the big table. Therefore, it always produces duplicated non-matching rows. 
> CommonMergeJoinOperator also creates multiple RowContainers for the big table, 
> whose size is hive.join.emit.interval. In the experiment below, I also set 
> hive.join.shortcut.unmatched.rows=false and hive.exec.reducers.max=1 to 
> disable the specialized algorithm for OuterJoin of 2 tables and to force calling 
> checkAndGenObject() before all rows with the same keys are gathered. I didn't 
> observe this problem when using VectorMapJoinOperator, and I will inspect 
> VectorMapJoinOperator to see whether we can reproduce the problem with it.
> I think the second problem is not limited to FullOuterJoin, but I couldn't 
> find such a query as of now. This will also be added to this issue if I can 
> write a query that reproduces the second problem without FullOuterJoin.
> I also found that Hive returns a wrong result for query2 even when I used 
> VectorMapJoinOperator. I am still inspecting this problem and I will add an 
> update on it when I find the reason.
>  
> Experiment:
>  
> {code:java}
>  Configuration
> set hive.optimize.shared.work=false;
> -- Std MapJoin
> set hive.auto.convert.join=true;
> set hive.vectorized.execution.enabled=false;
> -- Vec MapJoi

[jira] [Commented] (HIVE-27269) VectorizedMapJoin returns wrong result for TPC-DS query 97

2023-04-19 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713935#comment-17713935
 ] 

Sungwoo Park commented on HIVE-27269:
-

I tested with 10TB TPC-DS, and found the difference was much larger.

> VectorizedMapJoin returns wrong result for TPC-DS query 97
> --
>
> Key: HIVE-27269
> URL: https://issues.apache.org/jira/browse/HIVE-27269
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Seonggon Namgung
>Priority: Major
>
> TPC-DS query 97 returns wrong results when hive.auto.convert.join and 
> hive.vectorized.execution.enabled are set to true.
>  
> Result of query 97 on 1TB text dataset:
> CommonMergeJoinOperator(hive.auto.convert.join=false): 534151529, 
> 284185{*}746{*}, 84163
> MapJoinOperator(hive.auto.convert.join=true, 
> hive.vectorized.execution.enabled=false): 534151529, 284185{*}746{*}, 84163
> VectorMapJoinOperator(hive.auto.convert.join=true, 
> hive.vectorized.execution.enabled=true): 534151529, 284185{*}388{*}, 84163
>  
> Also, I observed that VectorizedMapJoin returned different results for the 100GB 
> dataset when I ran query 97 twice, but I have not been able to reproduce it since.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HIVE-26654) Test with the TPC-DS benchmark

2023-03-30 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707060#comment-17707060
 ] 

Sungwoo Park edited comment on HIVE-26654 at 3/30/23 10:38 PM:
---

[~brahmareddy] We did not try with MapReduce engine and tested only with Tez 
engine.


was (Author: glapark):
@brahmareddy We did not try with MapReduce engine and tested only with Tez 
engine.

> Test with the TPC-DS benchmark 
> ---
>
> Key: HIVE-26654
> URL: https://issues.apache.org/jira/browse/HIVE-26654
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Priority: Major
>  Labels: hive-4.0.0-must
>
> This Jira reports the result of running system tests using the TPC-DS 
> benchmark. The test scenario is:
> 1) create a database consisting of external tables from a 100GB or 1TB TPC-DS 
> text dataset
> 2) load a database consisting of ORC tables
> 3) compute column statistics
> 4) run TPC-DS queries
> 5) check the results for correctness
> For step 5), we will compare the results against Hive 3 (which has been 
> tested against SparkSQL and Presto). We use Hive on Tez.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26654) Test with the TPC-DS benchmark

2023-03-30 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707060#comment-17707060
 ] 

Sungwoo Park commented on HIVE-26654:
-

@brahmareddy We did not try with MapReduce engine and tested only with Tez 
engine.

> Test with the TPC-DS benchmark 
> ---
>
> Key: HIVE-26654
> URL: https://issues.apache.org/jira/browse/HIVE-26654
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Priority: Major
>  Labels: hive-4.0.0-must
>
> This Jira reports the result of running system tests using the TPC-DS 
> benchmark. The test scenario is:
> 1) create a database consisting of external tables from a 100GB or 1TB TPC-DS 
> text dataset
> 2) load a database consisting of ORC tables
> 3) compute column statistics
> 4) run TPC-DS queries
> 5) check the results for correctness
> For step 5), we will compare the results against Hive 3 (which has been 
> tested against SparkSQL and Presto). We use Hive on Tez.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27138) MapJoinOperator throws NPE when computing OuterJoin with filter expressions on small table

2023-03-23 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27138:

Labels: hive-4.0.0-must pull-request-available  (was: 
pull-request-available)

> MapJoinOperator throws NPE when computing OuterJoin with filter expressions 
> on small table
> --
>
> Key: HIVE-27138
> URL: https://issues.apache.org/jira/browse/HIVE-27138
> Project: Hive
>  Issue Type: Bug
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: hive-4.0.0-must, pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hive throws an NPE when running mapjoin_filter_on_outerjoin.q using the Tez engine. 
> (I used TestMiniLlapCliDriver.)
> The NPE is thrown by CommonJoinOperator.getFilterTag(), which just retrieves 
> the last object from the given list.
> To the best of my knowledge, if Hive selects MapJoin to perform a Join 
> operation, the filterTag should be computed and appended to a row before the row 
> is passed to MapJoinOperator.
> In the case of the MapReduce engine, this is done by HashTableSinkOperator.
> However, I cannot find any logic preparing the filterTag for small tables when 
> Hive uses the Tez engine.
> I think there are 2 available options:
> 1. Don't use MapJoinOperator if a small table has a filter expression.
> 2. Add new logic that computes and passes the filterTag to MapJoinOperator.
> I am working on the second option and ready to discuss it.
> I would be grateful if you could give any opinion about this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27138) MapJoinOperator throws NPE when computing OuterJoin with filter expressions on small table

2023-03-23 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704423#comment-17704423
 ] 

Sungwoo Park commented on HIVE-27138:
-

This commit also fixes a bug in LazyBinaryStruct.java:

https://github.com/apache/hive/blob/a1ff44ccd434373c7eef56fc081b40c343a23f33/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java#L224

Here fieldStart and fieldLength should not be declared as local variables 
because they shadow private fields of the class. Because of this bug, assert() 
in the method getShort() is violated.
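
A simplified illustration of the shadowing pattern (this is not the actual 
LazyBinaryStruct code, only the same pattern with the same field names):
{code:java}
// Simplified sketch of the bug: 'int' re-declares the names as locals, so the
// private fields are never updated and a later assert that reads them fails.
public class FieldShadowingSketch {
  private int fieldStart;
  private int fieldLength;

  void parseField(int start, int length) {
    // Bug: these are new local variables that shadow the fields above.
    int fieldStart = start;
    int fieldLength = length;
    // Fix: drop the 'int' so the assignments target the fields instead:
    //   fieldStart = start;  fieldLength = length;
  }

  short getShort(byte[] bytes) {
    // Reads the stale fields, so this assertion can be violated for valid input.
    assert fieldLength == 2 : "unexpected field length " + fieldLength;
    return (short) (((bytes[fieldStart] & 0xff) << 8) | (bytes[fieldStart + 1] & 0xff));
  }
}
{code}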


> MapJoinOperator throws NPE when computing OuterJoin with filter expressions 
> on small table
> --
>
> Key: HIVE-27138
> URL: https://issues.apache.org/jira/browse/HIVE-27138
> Project: Hive
>  Issue Type: Bug
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hive throws an NPE when running mapjoin_filter_on_outerjoin.q using the Tez engine. 
> (I used TestMiniLlapCliDriver.)
> The NPE is thrown by CommonJoinOperator.getFilterTag(), which just retrieves 
> the last object from the given list.
> To the best of my knowledge, if Hive selects MapJoin to perform a Join 
> operation, the filterTag should be computed and appended to a row before the row 
> is passed to MapJoinOperator.
> In the case of the MapReduce engine, this is done by HashTableSinkOperator.
> However, I cannot find any logic preparing the filterTag for small tables when 
> Hive uses the Tez engine.
> I think there are 2 available options:
> 1. Don't use MapJoinOperator if a small table has a filter expression.
> 2. Add new logic that computes and passes the filterTag to MapJoinOperator.
> I am working on the second option and ready to discuss it.
> I would be grateful if you could give any opinion about this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26654) Test with the TPC-DS benchmark

2023-03-11 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699313#comment-17699313
 ] 

Sungwoo Park commented on HIVE-26654:
-

HIVE-27006 has been added as a subtask because it describes the same problem as 
described in HIVE-26621.

> Test with the TPC-DS benchmark 
> ---
>
> Key: HIVE-26654
> URL: https://issues.apache.org/jira/browse/HIVE-26654
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Priority: Major
>
> This Jira reports the result of running system tests using the TPC-DS 
> benchmark. The test scenario is:
> 1) create a database consisting of external tables from a 100GB or 1TB TPC-DS 
> text dataset
> 2) load a database consisting of ORC tables
> 3) compute column statistics
> 4) run TPC-DS queries
> 5) check the results for correctness
> For step 5), we will compare the results against Hive 3 (which has been 
> tested against SparkSQL and Presto). We use Hive on Tez.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27006) ParallelEdgeFixer inserts misconfigured operator and does not connect it in Tez DAG

2023-03-11 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27006:

Parent: HIVE-26654
Issue Type: Sub-task  (was: Bug)

> ParallelEdgeFixer inserts misconfigured operator and does not connect it in 
> Tez DAG
> ---
>
> Key: HIVE-27006
> URL: https://issues.apache.org/jira/browse/HIVE-27006
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
> Attachments: after.PEF.png, tez-dag.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Hive fails to run the below query on a 1TB ORC-formatted TPC-DS dataset because 
> a runtime error happens in one Operator.
> I found that the problematic operator is inserted by ParallelEdgeFixer.
> Also I observed that the corresponding vertex has no descendant vertex 
> although its ReduceSinkOperator has a SemiJoin edge connected to 
> TableScanOperator.
> (I attached the figure of Tez DAG and OperatorGraph. One can check that 
> Cluster6 and Cluster7 are connected while Reducer4 and Map7 are not.)
>  
> Query
> {code:java}
> set hive.optimize.shared.work=true;
> set hive.optimize.shared.work.parallel.edge.support=true;
> with
>   inv00 as (select inv_item_sk, inv_warehouse_sk from inventory, date_dim 
> where inv_date_sk = d_date_sk and d_year = 2000),
>   inv01 as (select inv_item_sk, inv_warehouse_sk from inventory, date_dim 
> where inv_date_sk = d_date_sk and d_year = 2001),
>   inv02 as (select inv_item_sk, inv_warehouse_sk from inventory, date_dim 
> where inv_date_sk = d_date_sk and d_year = 2002),
>   sd00 as (select inv_item_sk id, w_zip zip from inv00 full outer join 
> warehouse on inv_warehouse_sk = w_warehouse_sk where w_state = 'SD'),
>   sd01 as (select inv_item_sk id, w_zip zip from inv01 full outer join 
> warehouse on inv_warehouse_sk = w_warehouse_sk where w_state = 'SD'),
>   sd02 as (select inv_item_sk id, w_zip zip from inv02 full outer join 
> warehouse on inv_warehouse_sk = w_warehouse_sk where w_state = 'SD')
> select * from sd00, sd01, sd02 where sd00.id = sd01.id and sd00.id = sd02.id; 
> {code}
>  
> Error message
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row
>         at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:385)
>         at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:301)
>         ... 18 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: cannot find field _col0 from []
>         at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:384)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>         at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>         at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>         ... 19 more
> Caused by: java.lang.RuntimeException: cannot find field _col0 from []
>         at 
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:550)
>         at 
> org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:153)
>         at 
> org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:56)
>         at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:1073)
>         at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:1099)
>         at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:305)
>         ... 22 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HIVE-26654) Test with the TPC-DS benchmark

2023-03-11 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699312#comment-17699312
 ] 

Sungwoo Park edited comment on HIVE-26654 at 3/12/23 7:14 AM:
--

HIVE-26986 has been added as a subtask because TPC-DS query 71 returns wrong 
results on a 1TB ORC table. HIVE-26986 also fixes HIVE-26660.


was (Author: glapark):
HIVE-26986 has been added as a subtask because TPC-DS query 71 returns wrong 
results on a 1TB ORC table. HIVE-26986 also fixes HIVE-26621 and HIVE-26660.

> Test with the TPC-DS benchmark 
> ---
>
> Key: HIVE-26654
> URL: https://issues.apache.org/jira/browse/HIVE-26654
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Priority: Major
>
> This Jira reports the result of running system tests using the TPC-DS 
> benchmark. The test scenario is:
> 1) create a database consisting of external tables from a 100GB or 1TB TPC-DS 
> text dataset
> 2) load a database consisting of ORC tables
> 3) compute column statistics
> 4) run TPC-DS queries
> 5) check the results for correctness
> For step 5), we will compare the results against Hive 3 (which has been 
> tested against SparkSQL and Presto). We use Hive on Tez.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26654) Test with the TPC-DS benchmark

2023-03-11 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699312#comment-17699312
 ] 

Sungwoo Park commented on HIVE-26654:
-

HIVE-26986 has been added as a subtask because TPC-DS query 71 returns wrong 
results on a 1TB ORC table. HIVE-26986 also fixes HIVE-26621 and HIVE-26660.

> Test with the TPC-DS benchmark 
> ---
>
> Key: HIVE-26654
> URL: https://issues.apache.org/jira/browse/HIVE-26654
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Priority: Major
>
> This Jira reports the result of running system tests using the TPC-DS 
> benchmark. The test scenario is:
> 1) create a database consisting of external tables from a 100GB or 1TB TPC-DS 
> text dataset
> 2) load a database consisting of ORC tables
> 3) compute column statistics
> 4) run TPC-DS queries
> 5) check the results for correctness
> For step 5), we will compare the results against Hive 3 (which has been 
> tested against SparkSQL and Presto). We use Hive on Tez.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26986) A DAG created by OperatorGraph is not equal to the Tez DAG.

2023-03-11 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26986:

Parent: HIVE-26654
Issue Type: Sub-task  (was: Bug)

> A DAG created by OperatorGraph is not equal to the Tez DAG.
> ---
>
> Key: HIVE-26986
> URL: https://issues.apache.org/jira/browse/HIVE-26986
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
> Attachments: Query71 OperatorGraph.png, Query71 TezDAG.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> A DAG created by OperatorGraph is not equal to the corresponding DAG that is 
> submitted to Tez.
> Because of this problem, ParallelEdgeFixer reports a pair of normal edges as 
> a parallel edge.
> We observe this problem by comparing OperatorGraph and Tez DAG when running 
> TPC-DS query 71 on 1TB ORC format managed table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27134) SharedWorkOptimizer merges TableScan operators that have different DPP parents

2023-03-11 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park resolved HIVE-27134.
-
Resolution: Duplicate

> SharedWorkOptimizer merges TableScan operators that have different DPP parents
> --
>
> Key: HIVE-27134
> URL: https://issues.apache.org/jira/browse/HIVE-27134
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sungwoo Park
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26654) Test with the TPC-DS benchmark

2023-03-11 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17699310#comment-17699310
 ] 

Sungwoo Park commented on HIVE-26654:
-

HIVE-26968 has been added as a subtask because TPC-DS query 64 returns wrong 
results on a 1TB Iceberg table.


> Test with the TPC-DS benchmark 
> ---
>
> Key: HIVE-26654
> URL: https://issues.apache.org/jira/browse/HIVE-26654
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Priority: Major
>
> This Jira reports the result of running system tests using the TPC-DS 
> benchmark. The test scenario is:
> 1) create a database consisting of external tables from a 100GB or 1TB TPC-DS 
> text dataset
> 2) load a database consisting of ORC tables
> 3) compute column statistics
> 4) run TPC-DS queries
> 5) check the results for correctness
> For step 5), we will compare the results against Hive 3 (which has been 
> tested against SparkSQL and Presto). We use Hive on Tez.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26968) SharedWorkOptimizer merges TableScan operators that have different DPP parents

2023-03-11 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26968:

Parent: HIVE-26654
Issue Type: Sub-task  (was: Bug)

> SharedWorkOptimizer merges TableScan operators that have different DPP parents
> --
>
> Key: HIVE-26968
> URL: https://issues.apache.org/jira/browse/HIVE-26968
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Critical
>  Labels: pull-request-available
> Attachments: TPC-DS Query64 OperatorGraph.pdf
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> SharedWorkOptimizer merges TableScan operators that have different DPP 
> parents, which leads to the creation of a semantically wrong query plan.
> In our environment, running TPC-DS query 64 on a 1TB Iceberg-format table 
> returns no rows because of this problem. (The correct result has 7094 rows.)
> We use hive.optimize.shared.work=true, 
> hive.optimize.shared.work.extended=true, and 
> hive.optimize.shared.work.dppunion=false to reproduce the bug.
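
Not the actual fix, but a minimal sketch of the kind of guard the optimizer would need (hypothetical type and method names; the real SharedWorkOptimizer tracks DPP edges differently):
{code:java}
import java.util.Set;

// Sketch only (hypothetical names): before merging two TableScan operators,
// require that both scans are targets of exactly the same set of DPP
// (dynamic partition pruning) sources; merging scans with different DPP
// parents silently drops one scan's pruning semantics.
public final class DppMergeGuardSketch {
  public static boolean safeToMerge(Set<String> dppSourcesOfScanA,
                                    Set<String> dppSourcesOfScanB) {
    return dppSourcesOfScanA.equals(dppSourcesOfScanB);
  }
}
{code}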



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26659) TPC-DS query 16, 69, 94 return wrong results.

2023-02-19 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690998#comment-17690998
 ] 

Sungwoo Park commented on HIVE-26659:
-

For testing hive.auto.convert.anti.join, we used Hive on Tez.

1) create a database consisting of external tables from a 100GB TPC-DS text 
dataset
2) load a database consisting of ORC tables
3) compute column statistics
4) run TPC-DS queries
5) check the results for correctness

With hive.auto.convert.anti.join set to true (default), wrong results are 
returned. Setting hive.auto.convert.anti.join to false fixed the errors.

> TPC-DS query 16, 69, 94 return wrong results.
> -
>
> Key: HIVE-26659
> URL: https://issues.apache.org/jira/browse/HIVE-26659
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Priority: Major
>
> TPC-DS query 16, 69, 94 return wrong results when hive.auto.convert.anti.join 
> is set to true.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27082) AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of default partition name

2023-02-14 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688515#comment-17688515
 ] 

Sungwoo Park commented on HIVE-27082:
-

https://github.com/apache/hive/pull/4062

> AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of 
> default partition name
> 
>
> Key: HIVE-27082
> URL: https://issues.apache.org/jira/browse/HIVE-27082
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Affects Versions: 3.1.3, 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Assignee: Sungwoo Park
>Priority: Major
>  Labels: pull-request-available
>
> This JIRA deals with non-deterministic behavior of Hive in generating DAGs.
> The non-deterministic behavior of Hive in generating DAGs is due to the logic 
> in AggregateStatsCache.findBestMatch() called from AggregateStatsCache.get(), 
> as well as the disproportionate distribution of Nulls in 
> HIVE_DEFAULT_PARTITION.
> Here is what is happening in the case of the TPC-DS dataset. Let us use 
> web_sales table and ws_web_site_sk column in the 10TB TPC-DS dataset as a 
> running example.
> In the course of running TPC-DS queries, Hive asks MetaStore about the column 
> statistics of 1823 partNames in the web_sales/ws_web_site_sk combination, 
> either without HIVE_DEFAULT_PARTITION or with HIVE_DEFAULT_PARTITION.
> --- Without HIVE_DEFAULT_PARTITION, it reports a total of 901180 nulls.
> --- With HIVE_DEFAULT_PARTITION, however, it reports a total of 1800087 nulls, 
> almost twice as many.
> The first call to MetaStore returns the correct result, but all subsequent 
> requests are likely to return the same result from the cache, irrespective of 
> the inclusion of HIVE_DEFAULT_PARTITION. This is because 
> AggregateStatsCache.findBestMatch() treats HIVE_DEFAULT_PARTITION in the same 
> way as other partNames, and the difference in the size of partNames[] is just 
> 1. The outcome depends on the duration of intervening queries, so everything 
> is now non-deterministic.
> If a wrong value of numNulls is returned, Hive generates a different DAG 
> which may take much longer than the correct one. The problem is 
> particularly pronounced here because of the huge number of nulls in 
> HIVE_DEFAULT_PARTITION. It is ironic to see that the query optimizer is so 
> efficient that a single wrong guess of numNulls creates a very inefficient 
> DAG.
> Note that this behavior cannot be avoided by setting 
> hive.metastore.aggregate.stats.cache.max.variance to zero because the 
> difference in the number of partNames[] between the argument and the entry in 
> the cache is just 1.
> So, AggregateStatsCache.findBestMatch() should treat HIVE_DEFAULT_PARTITION 
> in a special way, by not returning the result in the cache if there is a 
> difference in the inclusion of partName HIVE_DEFAULT_PARTITION (or should 
> provide the user with an option to activate this feature).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27082) AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of default partition name

2023-02-14 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27082:

External issue URL: https://github.com/apache/hive/pull/4062

> AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of 
> default partition name
> 
>
> Key: HIVE-27082
> URL: https://issues.apache.org/jira/browse/HIVE-27082
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Affects Versions: 3.1.3, 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Assignee: Sungwoo Park
>Priority: Major
>  Labels: pull-request-available
>
> This JIRA deals with non-deterministic behavior of Hive in generating DAGs.
> The non-deterministic behavior of Hive in generating DAGs is due to the logic 
> in AggregateStatsCache.findBestMatch() called from AggregateStatsCache.get(), 
> as well as the disproportionate distribution of Nulls in 
> HIVE_DEFAULT_PARTITION.
> Here is what is happening in the case of the TPC-DS dataset. Let us use 
> web_sales table and ws_web_site_sk column in the 10TB TPC-DS dataset as a 
> running example.
> In the course of running TPC-DS queries, Hive asks MetaStore about the column 
> statistics of 1823 partNames in the web_sales/ws_web_site_sk combination, 
> either without HIVE_DEFAULT_PARTITION or with HIVE_DEFAULT_PARTITION.
> --- Without HIVE_DEFAULT_PARTITION, it reports a total of 901180 nulls.
> --- With HIVE_DEFAULT_PARTITION, however, it reports a total of 1800087 nulls, 
> almost twice as many.
> The first call to MetaStore returns the correct result, but all subsequent 
> requests are likely to return the same result from the cache, irrespective of 
> the inclusion of HIVE_DEFAULT_PARTITION. This is because 
> AggregateStatsCache.findBestMatch() treats HIVE_DEFAULT_PARTITION in the same 
> way as other partNames, and the difference in the size of partNames[] is just 
> 1. The outcome depends on the duration of intervening queries, so everything 
> is now non-deterministic.
> If a wrong value of numNulls is returned, Hive generates a different DAG 
> which may take much longer than the correct one. The problem is 
> particularly pronounced here because of the huge number of nulls in 
> HIVE_DEFAULT_PARTITION. It is ironic to see that the query optimizer is so 
> efficient that a single wrong guess of numNulls creates a very inefficient 
> DAG.
> Note that this behavior cannot be avoided by setting 
> hive.metastore.aggregate.stats.cache.max.variance to zero because the 
> difference in the number of partNames[] between the argument and the entry in 
> the cache is just 1.
> So, AggregateStatsCache.findBestMatch() should treat HIVE_DEFAULT_PARTITION 
> in a special way, by not returning the result in the cache if there is a 
> difference in the inclusion of partName HIVE_DEFAULT_PARTITION (or should 
> provide the user with an option to activate this feature).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27082) AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of default partition name

2023-02-14 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27082:

External issue URL:   (was: https://github.com/apache/hive/pull/4062)

> AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of 
> default partition name
> 
>
> Key: HIVE-27082
> URL: https://issues.apache.org/jira/browse/HIVE-27082
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Affects Versions: 3.1.3, 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Assignee: Sungwoo Park
>Priority: Major
>  Labels: pull-request-available
>
> This JIRA deals with non-deterministic behavior of Hive in generating DAGs.
> The non-deterministic behavior of Hive in generating DAGs is due to the logic 
> in AggregateStatsCache.findBestMatch() called from AggregateStatsCache.get(), 
> as well as the disproportionate distribution of Nulls in 
> HIVE_DEFAULT_PARTITION.
> Here is what is happening in the case of the TPC-DS dataset. Let us use 
> web_sales table and ws_web_site_sk column in the 10TB TPC-DS dataset as a 
> running example.
> In the course of running TPC-DS queries, Hive asks MetaStore about the column 
> statistics of 1823 partNames in the web_sales/ws_web_site_sk combination, 
> either without HIVE_DEFAULT_PARTITION or with HIVE_DEFAULT_PARTITION.
> --- Without HIVE_DEFAULT_PARTITION, it reports a total of 901180 nulls.
> --- With HIVE_DEFAULT_PARTITION, however, it reports a total of 1800087 nulls, 
> almost twice as many.
> The first call to MetaStore returns the correct result, but all subsequent 
> requests are likely to return the same result from the cache, irrespective of 
> the inclusion of HIVE_DEFAULT_PARTITION. This is because 
> AggregateStatsCache.findBestMatch() treats HIVE_DEFAULT_PARTITION in the same 
> way as other partNames, and the difference in the size of partNames[] is just 
> 1. The outcome depends on the duration of intervening queries, so everything 
> is now non-deterministic.
> If a wrong value of numNulls is returned, Hive generates a different DAG 
> which may take much longer than the correct one. The problem is 
> particularly pronounced here because of the huge number of nulls in 
> HIVE_DEFAULT_PARTITION. It is ironic to see that the query optimizer is so 
> efficient that a single wrong guess of numNulls creates a very inefficient 
> DAG.
> Note that this behavior cannot be avoided by setting 
> hive.metastore.aggregate.stats.cache.max.variance to zero because the 
> difference in the number of partNames[] between the argument and the entry in 
> the cache is just 1.
> So, AggregateStatsCache.findBestMatch() should treat HIVE_DEFAULT_PARTITION 
> in a special way, by not returning the result in the cache if there is a 
> difference in the inclusion of partName HIVE_DEFAULT_PARTITION (or should 
> provide the user with an option to activate this feature).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27082) AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of default partition name

2023-02-14 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27082:

Description: 
This JIRA deals with non-deterministic behavior of Hive in generating DAGs.

The non-deterministic behavior of Hive in generating DAGs is due to the logic in 
AggregateStatsCache.findBestMatch() called from AggregateStatsCache.get(), as 
well as the disproportionate distribution of Nulls in HIVE_DEFAULT_PARTITION.

Here is what is happening in the case of the TPC-DS dataset. Let us use 
web_sales table and ws_web_site_sk column in the 10TB TPC-DS dataset as a 
running example.

In the course of running TPC-DS queries, Hive asks MetaStore about the column 
statistics of 1823 partNames in the web_sales/ws_web_site_sk combination, 
either without HIVE_DEFAULT_PARTITION or with HIVE_DEFAULT_PARTITION.

--- Without HIVE_DEFAULT_PARTITION, it reports a total of 901180 nulls.
--- With HIVE_DEFAULT_PARTITION, however, it reports a total of 1800087 nulls, 
almost twice as many.

The first call to MetaStore returns the correct result, but all subsequent 
requests are likely to return the same result from the cache, irrespective of 
the inclusion of HIVE_DEFAULT_PARTITION. This is because 
AggregateStatsCache.findBestMatch() treats HIVE_DEFAULT_PARTITION in the same 
way as other partNames, and the difference in the size of partNames[] is just 
1. The outcome depends on the duration of intervening queries, so everything is 
now non-deterministic.

If a wrong value of numNulls is returned, Hive generates a different DAG which 
may take much longer than the correct one. The problem is particularly 
pronounced here because of the huge number of nulls in HIVE_DEFAULT_PARTITION. 
It is ironic to see that the query optimizer is so efficient that a single 
wrong guess of numNulls creates a very inefficient DAG.

Note that this behavior cannot be avoided by setting 
hive.metastore.aggregate.stats.cache.max.variance to zero because the 
difference in the number of partNames[] between the argument and the entry in 
the cache is just 1.

So, AggregateStatsCache.findBestMatch() should treat HIVE_DEFAULT_PARTITION in 
a special way, by not returning the result in the cache if there is a 
difference in the inclusion of partName HIVE_DEFAULT_PARTITION (or should 
provide the user with an option to activate this feature).
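
A minimal sketch of the proposed check, with hypothetical class, method, and parameter names rather than the real AggregateStatsCache internals:
{code:java}
import java.util.List;

// Sketch only (hypothetical names): reuse a cached aggregate only if the request
// and the cached entry agree on whether the default partition
// (e.g. __HIVE_DEFAULT_PARTITION__) appears in their partition name lists.
// A size difference of 1 is not enough to reject the cached entry, so the
// inclusion of the default partition must be compared explicitly.
public final class DefaultPartitionMatchSketch {
  public static boolean sameDefaultPartitionInclusion(List<String> requestedPartNames,
                                                      List<String> cachedPartNames,
                                                      String defaultPartitionName) {
    boolean requestHasDefault = requestedPartNames.stream()
        .anyMatch(p -> p.contains(defaultPartitionName));
    boolean cachedHasDefault = cachedPartNames.stream()
        .anyMatch(p -> p.contains(defaultPartitionName));
    return requestHasDefault == cachedHasDefault;
  }
}
{code}
Under this idea, findBestMatch() would skip (or down-rank) a cached entry for which the check returns false, instead of treating the default partition like any other partName.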


  was:
This pull request deals with non-deterministic behavior of Hive in generating 
DAGs. From the discussion thread:

The non-deterministic behavior of Hive in generating DAGs is due to the logic in 
AggregateStatsCache.findBestMatch() called from AggregateStatsCache.get(), as 
well as the disproportionate distribution of Nulls in HIVE_DEFAULT_PARTITION.

Here is what is happening in the case of the TPC-DS dataset. Let us use 
web_sales table and ws_web_site_sk column in the 10TB TPC-DS dataset as a 
running example.

In the course of running TPC-DS queries, Hive asks MetaStore about the column 
statistics of 1823 partNames in the web_sales/ws_web_site_sk combination, 
either without HIVE_DEFAULT_PARTITION or with HIVE_DEFAULT_PARTITION.

--- Without HIVE_DEFAULT_PARTITION, it reports a total of 901180 nulls.

--- With HIVE_DEFAULT_PARTITION, however, it reports a total of 1800087 nulls, 
almost twice as many.

The first call to MetaStore returns the correct result, but all subsequent 
requests are likely to return the same result from the cache, irrespective of 
the inclusion of HIVE_DEFAULT_PARTITION. This is because 
AggregateStatsCache.findBestMatch() treats HIVE_DEFAULT_PARTITION in the same 
way as other partNames, and the difference in the size of partNames[] is just 
1. The outcome depends on the duration of intervening queries, so everything is 
now non-deterministic.

If a wrong value of numNulls is returned, Hive generates a different DAG which 
may take much longer than the correct one. The problem is particularly 
pronounced here because of the huge number of nulls in HIVE_DEFAULT_PARTITION. 
It is ironic to see that the query optimizer is so efficient that a single 
wrong guess of numNulls creates a very inefficient DAG.

Note that this behavior cannot be avoided by setting 
hive.metastore.aggregate.stats.cache.max.variance to zero because the 
difference in the number of partNames[] between the argument and the entry in 
the cache is just 1.

So, AggregateStatsCache.findBestMatch() should treat HIVE_DEFAULT_PARTITION in 
a special way, by not returning the result in the cache if there is a 
difference in the inclusion of partName HIVE_DEFAULT_PARTITION (or should 
provide the user with an option to activate this feature).



> AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of 
> default partition name
> 
>
> Key: HIVE-27082
> URL

[jira] [Assigned] (HIVE-27082) AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of default partition name

2023-02-14 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park reassigned HIVE-27082:
---


> AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of 
> default partition name
> 
>
> Key: HIVE-27082
> URL: https://issues.apache.org/jira/browse/HIVE-27082
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Affects Versions: 4.0.0-alpha-2, 3.1.3
>Reporter: Sungwoo Park
>Assignee: Sungwoo Park
>Priority: Major
>  Labels: pull-request-available
>
> This pull request deals with non-deterministic behavior of Hive in 
> generating DAGs. From the discussion thread:
> The non-deterministic behavior of Hive in generating DAGs is due to the logic 
> in AggregateStatsCache.findBestMatch() called from AggregateStatsCache.get(), 
> as well as the disproportionate distribution of Nulls in 
> HIVE_DEFAULT_PARTITION.
> Here is what is happening in the case of the TPC-DS dataset. Let us use 
> web_sales table and ws_web_site_sk column in the 10TB TPC-DS dataset as a 
> running example.
> In the course of running TPC-DS queries, Hive asks MetaStore about the column 
> statistics of 1823 partNames in the web_sales/ws_web_site_sk combination, 
> either without HIVE_DEFAULT_PARTITION or with HIVE_DEFAULT_PARTITION.
> --- Without HIVE_DEFAULT_PARTITION, it reports a total of 901180 nulls.
> --- With HIVE_DEFAULT_PARTITION, however, it reports a total of 1800087 nulls, 
> almost twice as many.
> The first call to MetaStore returns the correct result, but all subsequent 
> requests are likely to return the same result from the cache, irrespective of 
> the inclusion of HIVE_DEFAULT_PARTITION. This is because 
> AggregateStatsCache.findBestMatch() treats HIVE_DEFAULT_PARTITION in the same 
> way as other partNames, and the difference in the size of partNames[] is just 
> 1. The outcome depends on the duration of intervening queries, so everything 
> is now non-deterministic.
> If a wrong value of numNulls is returned, Hive generates a different DAG 
> which may take much longer than the correct one. The problem is 
> particularly pronounced here because of the huge number of nulls in 
> HIVE_DEFAULT_PARTITION. It is ironic to see that the query optimizer is so 
> efficient that a single wrong guess of numNulls creates a very inefficient 
> DAG.
> Note that this behavior cannot be avoided by setting 
> hive.metastore.aggregate.stats.cache.max.variance to zero because the 
> difference in the number of partNames[] between the argument and the entry in 
> the cache is just 1.
> So, AggregateStatsCache.findBestMatch() should treat HIVE_DEFAULT_PARTITION 
> in a special way, by not returning the result in the cache if there is a 
> difference in the inclusion of partName HIVE_DEFAULT_PARTITION (or should 
> provide the user with an option to activate this feature).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27006) ParallelEdgeFixer inserts misconfigured operator and does not connect it in Tez DAG

2023-01-31 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-27006:

Affects Version/s: 4.0.0-alpha-2

> ParallelEdgeFixer inserts misconfigured operator and does not connect it in 
> Tez DAG
> ---
>
> Key: HIVE-27006
> URL: https://issues.apache.org/jira/browse/HIVE-27006
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
> Attachments: after.PEF.png, tez-dag.png
>
>
> Hive fails to run the query below on a 1TB ORC-formatted TPC-DS dataset because 
> a runtime error occurs in one operator.
> I found that the problematic operator is inserted by ParallelEdgeFixer.
> Also I observed that the corresponding vertex has no descendant vertex 
> although its ReduceSinkOperator has a SemiJoin edge connected to 
> TableScanOperator.
> (I attached the figure of Tez DAG and OperatorGraph. One can check that 
> Cluster6 and Cluster7 are connected while Reducer4 and Map7 are not.)
>  
> Query
> {code:java}
> set hive.optimize.shared.work=true;
> set hive.optimize.shared.work.parallel.edge.support=true;
> with
>   inv00 as (select inv_item_sk, inv_warehouse_sk from inventory, date_dim 
> where inv_date_sk = d_date_sk and d_year = 2000),
>   inv01 as (select inv_item_sk, inv_warehouse_sk from inventory, date_dim 
> where inv_date_sk = d_date_sk and d_year = 2001),
>   inv02 as (select inv_item_sk, inv_warehouse_sk from inventory, date_dim 
> where inv_date_sk = d_date_sk and d_year = 2002),
>   sd00 as (select inv_item_sk id, w_zip zip from inv00 full outer join 
> warehouse on inv_warehouse_sk = w_warehouse_sk where w_state = 'SD'),
>   sd01 as (select inv_item_sk id, w_zip zip from inv01 full outer join 
> warehouse on inv_warehouse_sk = w_warehouse_sk where w_state = 'SD'),
>   sd02 as (select inv_item_sk id, w_zip zip from inv02 full outer join 
> warehouse on inv_warehouse_sk = w_warehouse_sk where w_state = 'SD')
> select * from sd00, sd01, sd02 where sd00.id = sd01.id and sd00.id = sd02.id; 
> {code}
>  
> Error message
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row
>         at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:385)
>         at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:301)
>         ... 18 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: cannot find field _col0 from []
>         at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:384)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>         at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>         at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>         ... 19 more
> Caused by: java.lang.RuntimeException: cannot find field _col0 from []
>         at 
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:550)
>         at 
> org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:153)
>         at 
> org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:56)
>         at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:1073)
>         at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:1099)
>         at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:305)
>         ... 22 more {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-22912) Support native submission of Hive queries to a Kubernetes Cluster

2023-01-25 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680597#comment-17680597
 ] 

Sungwoo Park commented on HIVE-22912:
-

[~surbhi04] Hive on MR3 is a solution matching your description (HS2 and worker 
Pods all run directly on Kubernetes without requiring Hadoop), except that the 
execution engine MR3 (which replaces Tez) is not an open-source project. Hive 
3.1.3 on MR3 on Kubernetes is stable and currently used in production 
environments.

> Support native submission of Hive queries to a Kubernetes Cluster
> -
>
> Key: HIVE-22912
> URL: https://issues.apache.org/jira/browse/HIVE-22912
> Project: Hive
>  Issue Type: New Feature
>Reporter: Surbhi Aggarwal
>Priority: Major
>
> So many big data applications are already integrated or trying to natively 
> integrate with Kubernetes engine. Should we not work together to support hive 
> with this engine?
> If efforts are already being spent on this, please point me to it. Thanks !



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26986) A DAG created by OperatorGraph is not equal to the Tez DAG.

2023-01-25 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26986:

Affects Version/s: 4.0.0-alpha-2

> A DAG created by OperatorGraph is not equal to the Tez DAG.
> ---
>
> Key: HIVE-26986
> URL: https://issues.apache.org/jira/browse/HIVE-26986
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>
> A DAG created by OperatorGraph is not equal to the corresponding DAG that is 
> submitted to Tez.
> Because of this problem, ParallelEdgeFixer reports a pair of normal edges as 
> a parallel edge.
> We observed this problem by comparing the OperatorGraph and the Tez DAG when running 
> TPC-DS query 71 on a 1TB ORC-format managed table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26968) SharedWorkOptimizer merges TableScan operators that have different DPP parents

2023-01-25 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680530#comment-17680530
 ] 

Sungwoo Park commented on HIVE-26968:
-

Considering that the shared work optimizer (with the default value of 
hive.optimize.shared.work.extended=true) is extensively used, I suggest setting 
the priority to Critical.

> SharedWorkOptimizer merges TableScan operators that have different DPP parents
> --
>
> Key: HIVE-26968
> URL: https://issues.apache.org/jira/browse/HIVE-26968
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> SharedWorkOptimizer merges TableScan operators that have different DPP 
> parents, which leads to the creation of a semantically wrong query plan.
> In our environment, running TPC-DS query 64 on a 1TB Iceberg-format table 
> returns no rows because of this problem. (The correct result has 7094 rows.)
> We use hive.optimize.shared.work=true, 
> hive.optimize.shared.work.extended=true, and 
> hive.optimize.shared.work.dppunion=false to reproduce the bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26968) SharedWorkOptimizer merges TableScan operators that have different DPP parents

2023-01-25 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26968:

Affects Version/s: 4.0.0-alpha-2

> SharedWorkOptimizer merges TableScan operators that have different DPP parents
> --
>
> Key: HIVE-26968
> URL: https://issues.apache.org/jira/browse/HIVE-26968
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> SharedWorkOptimizer merges TableScan operators that have different DPP 
> parents, which leads to the creation of a semantically wrong query plan.
> In our environment, running TPC-DS query 64 on a 1TB Iceberg-format table 
> returns no rows because of this problem. (The correct result has 7094 rows.)
> We use hive.optimize.shared.work=true, 
> hive.optimize.shared.work.extended=true, and 
> hive.optimize.shared.work.dppunion=false to reproduce the bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26968) SharedWorkOptimizer merges TableScan operators that have different DPP parents

2023-01-25 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26968:

Priority: Critical  (was: Major)

> SharedWorkOptimizer merges TableScan operators that have different DPP parents
> --
>
> Key: HIVE-26968
> URL: https://issues.apache.org/jira/browse/HIVE-26968
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> SharedWorkOptimizer merges TableScan operators that have different DPP 
> parents, which leads to the creation of a semantically wrong query plan.
> In our environment, running TPC-DS query 64 on a 1TB Iceberg-format table 
> returns no rows because of this problem. (The correct result has 7094 rows.)
> We use hive.optimize.shared.work=true, 
> hive.optimize.shared.work.extended=true, and 
> hive.optimize.shared.work.dppunion=false to reproduce the bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26157) Change Iceberg storage handler authz URI to metadata location

2023-01-03 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654074#comment-17654074
 ] 

Sungwoo Park commented on HIVE-26157:
-

Calling IcebergTableUtil.getTable() in the new code throws the same 
NullPointerException as reported in HIVE-25964, e.g., when creating Iceberg 
tables from existing ORC tables using CTAS with 
hive.security.authorization.enabled set to true. Is this patch an improvement 
that can be reverted at the cost of efficiency?

 

> Change Iceberg storage handler authz URI to metadata location
> -
>
> Key: HIVE-26157
> URL: https://issues.apache.org/jira/browse/HIVE-26157
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> In HIVE-25964, the authz URI has been changed to "iceberg://db.table".
> It is possible to set the metadata pointers of table A to point to table B, 
> and therefore you could read table B's data via querying table A.
> {code:sql}
> alter table A set tblproperties 
> ('metadata_location'='/path/to/B/snapshot.json', 
> 'previous_metadata_location'='/path/to/B/prev_snapshot.json');  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26732) Iceberg uses "null" and does not use the configuration key "hive.exec.default.partition.name" for default partitions.

2022-11-28 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17639894#comment-17639894
 ] 

Sungwoo Park commented on HIVE-26732:
-

We have started looking into this issue. Not sure how long it is going to take, 
though. Let me report back when we make any progress.

 

> Iceberg uses "null" and does not use the configuration key 
> "hive.exec.default.partition.name" for default partitions.
> -
>
> Key: HIVE-26732
> URL: https://issues.apache.org/jira/browse/HIVE-26732
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0-alpha-1
>Reporter: Sungwoo Park
>Priority: Major
>
> When creating an Iceberg table from an existing ORC table with "insert 
> overwrite", the directory corresponding to the default partition uses "null" 
> instead of the value for the configuration key 
> "hive.exec.default.partition.name".
> For example, we create an Iceberg table from an existing ORC table 
> tpcds_bin_partitioned_orc_1000.catalog_sales:
> {code:java}
> create table catalog_sales ( cs_sold_time_sk     bigint, cs_ship_date_sk     
> bigint, cs_bill_customer_sk   bigint, cs_bill_cdemo_sk    bigint, 
> cs_bill_hdemo_sk    bigint, cs_bill_addr_sk     bigint, cs_ship_customer_sk   
> bigint, cs_ship_cdemo_sk    bigint, cs_ship_hdemo_sk    bigint, 
> cs_ship_addr_sk     bigint, cs_call_center_sk   bigint, cs_catalog_page_sk    
> bigint, cs_ship_mode_sk     bigint, cs_warehouse_sk     bigint, cs_item_sk    
>   bigint, cs_promo_sk     bigint, cs_order_number     bigint, cs_quantity     
> int, cs_wholesale_cost   double, cs_list_price     double, cs_sales_price    
> double, cs_ext_discount_amt   double, cs_ext_sales_price    double, 
> cs_ext_wholesale_cost   double, cs_ext_list_price   double, cs_ext_tax      
> double, cs_coupon_amt     double, cs_ext_ship_cost    double, cs_net_paid     
> double, cs_net_paid_inc_tax   double, cs_net_paid_inc_ship  double, 
> cs_net_paid_inc_ship_tax  double, cs_net_profit     double) partitioned by 
> (cs_sold_date_sk bigint) STORED BY ICEBERG stored as orc;
> insert overwrite table catalog_sales select * from 
> tpcds_bin_partitioned_orc_1000.catalog_sales;
> {code}
> Iceberg creates a directory for the default partition like:
> {code:java}
> /hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=null{code}
> which should be:
> {code:java}
> /hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=__HIVE_DEFAULT_PARTITION__{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26732) Iceberg uses "null" and does not use the configuration key "hive.exec.default.partition.name" for default partitions.

2022-11-13 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26732:

Description: 
When creating an Iceberg table from an existing ORC table with "insert 
overwrite", the directory corresponding to the default partition uses "null" 
instead of the value for the configuration key 
"hive.exec.default.partition.name".

For example, we create an Iceberg table from an existing ORC table 
tpcds_bin_partitioned_orc_1000.catalog_sales:
{code:java}
create table catalog_sales ( cs_sold_time_sk     bigint, cs_ship_date_sk     
bigint, cs_bill_customer_sk   bigint, cs_bill_cdemo_sk    bigint, 
cs_bill_hdemo_sk    bigint, cs_bill_addr_sk     bigint, cs_ship_customer_sk   
bigint, cs_ship_cdemo_sk    bigint, cs_ship_hdemo_sk    bigint, cs_ship_addr_sk 
    bigint, cs_call_center_sk   bigint, cs_catalog_page_sk    bigint, 
cs_ship_mode_sk     bigint, cs_warehouse_sk     bigint, cs_item_sk      bigint, 
cs_promo_sk     bigint, cs_order_number     bigint, cs_quantity     int, 
cs_wholesale_cost   double, cs_list_price     double, cs_sales_price    double, 
cs_ext_discount_amt   double, cs_ext_sales_price    double, 
cs_ext_wholesale_cost   double, cs_ext_list_price   double, cs_ext_tax      
double, cs_coupon_amt     double, cs_ext_ship_cost    double, cs_net_paid     
double, cs_net_paid_inc_tax   double, cs_net_paid_inc_ship  double, 
cs_net_paid_inc_ship_tax  double, cs_net_profit     double) partitioned by 
(cs_sold_date_sk bigint) STORED BY ICEBERG stored as orc;
insert overwrite table catalog_sales select * from 
tpcds_bin_partitioned_orc_1000.catalog_sales;
{code}
Iceberg creates a directory for the default partition like:
{code:java}
/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=null{code}
which should be:
{code:java}
/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=__HIVE_DEFAULT_PARTITION__{code}
 
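As an illustration only (hypothetical helper, not the actual Hive/Iceberg integration code), the expected behavior is to substitute the configured default partition name whenever the partition column value is null:
{code:java}
// Sketch only (hypothetical helper): build a partition directory name, falling
// back to the value of hive.exec.default.partition.name for null values
// instead of the literal string "null".
public final class PartitionDirSketch {
  public static String partitionDir(String column, Object value, String defaultPartitionName) {
    String dirValue = (value == null) ? defaultPartitionName : value.toString();
    return column + "=" + dirValue;
  }

  public static void main(String[] args) {
    // Prints cs_sold_date_sk=__HIVE_DEFAULT_PARTITION__ rather than cs_sold_date_sk=null.
    System.out.println(partitionDir("cs_sold_date_sk", null, "__HIVE_DEFAULT_PARTITION__"));
  }
}
{code}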

  was:
When creating an Iceberg table from an existing ORC table with "insert 
overwrite", the directory corresponding to the default partition uses "null" 
instead of the value for the configuration key 
"hive.exec.default.partition.name".

For example, we create an Iceberg table from an existing ORC table 
tpcds_bin_partitioned_orc_1000.catalog_sales:
{code:java}
create table catalog_sales ( cs_sold_time_sk     bigint, cs_ship_date_sk     
bigint, cs_bill_customer_sk   bigint, cs_bill_cdemo_sk    bigint, 
cs_bill_hdemo_sk    bigint, cs_bill_addr_sk     bigint, cs_ship_customer_sk   
bigint, cs_ship_cdemo_sk    bigint, cs_ship_hdemo_sk    bigint, cs_ship_addr_sk 
    bigint, cs_call_center_sk   bigint, cs_catalog_page_sk    bigint, 
cs_ship_mode_sk     bigint, cs_warehouse_sk     bigint, cs_item_sk      bigint, 
cs_promo_sk     bigint, cs_order_number     bigint, cs_quantity     int, 
cs_wholesale_cost   double, cs_list_price     double, cs_sales_price    double, 
cs_ext_discount_amt   double, cs_ext_sales_price    double, 
cs_ext_wholesale_cost   double, cs_ext_list_price   double, cs_ext_tax      
double, cs_coupon_amt     double, cs_ext_ship_cost    double, cs_net_paid     
double, cs_net_paid_inc_tax   double, cs_net_paid_inc_ship  double, 
cs_net_paid_inc_ship_tax  double, cs_net_profit     double) partitioned by 
(cs_sold_date_sk bigint) STORED BY ICEBERG stored as orc;
insert overwrite table catalog_sales select * from 
tpcds_bin_partitioned_orc_1000.catalog_sales;
{code}
Iceberg creates a directory for the default partition like:

/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=null

which should be:

/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=__HIVE_DEFAULT_PARTITION__

 


> Iceberg uses "null" and does not use the configuration key 
> "hive.exec.default.partition.name" for default partitions.
> -
>
> Key: HIVE-26732
> URL: https://issues.apache.org/jira/browse/HIVE-26732
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0-alpha-1
>Reporter: Sungwoo Park
>Priority: Major
>
> When creating an Iceberg table from an existing ORC table with "insert 
> overwrite", the directory corresponding to the default partition uses "null" 
> instead of the value for the configuration key 
> "hive.exec.default.partition.name".
> For example, we create an Iceberg table from an existing ORC table 
> tpcds_bin_partitioned_orc_1000.catalog_sales:
> {code:java}
> create table catalog_sales ( cs_sold_time_sk     bigint, cs_ship_date_sk     
> bigint, cs_bill_customer_sk   bigint, cs_bill_cdemo_sk    bigint, 
> cs_bill_hdemo_sk    bigint, cs_bill_addr_sk     bigint, cs_ship_

[jira] [Updated] (HIVE-26732) Iceberg uses "null" and does not use the configuration key "hive.exec.default.partition.name" for default partitions.

2022-11-13 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26732:

Description: 
When creating an Iceberg table from an existing ORC table with "insert 
overwrite", the directory corresponding to the default partition uses "null" 
instead of the value for the configuration key 
"hive.exec.default.partition.name".

For example, we create an Iceberg table from an existing ORC table 
tpcds_bin_partitioned_orc_1000.catalog_sales:
{code:java}
create table catalog_sales ( cs_sold_time_sk     bigint, cs_ship_date_sk     
bigint, cs_bill_customer_sk   bigint, cs_bill_cdemo_sk    bigint, 
cs_bill_hdemo_sk    bigint, cs_bill_addr_sk     bigint, cs_ship_customer_sk   
bigint, cs_ship_cdemo_sk    bigint, cs_ship_hdemo_sk    bigint, cs_ship_addr_sk 
    bigint, cs_call_center_sk   bigint, cs_catalog_page_sk    bigint, 
cs_ship_mode_sk     bigint, cs_warehouse_sk     bigint, cs_item_sk      bigint, 
cs_promo_sk     bigint, cs_order_number     bigint, cs_quantity     int, 
cs_wholesale_cost   double, cs_list_price     double, cs_sales_price    double, 
cs_ext_discount_amt   double, cs_ext_sales_price    double, 
cs_ext_wholesale_cost   double, cs_ext_list_price   double, cs_ext_tax      
double, cs_coupon_amt     double, cs_ext_ship_cost    double, cs_net_paid     
double, cs_net_paid_inc_tax   double, cs_net_paid_inc_ship  double, 
cs_net_paid_inc_ship_tax  double, cs_net_profit     double) partitioned by 
(cs_sold_date_sk bigint) STORED BY ICEBERG stored as orc;
insert overwrite table catalog_sales select * from 
tpcds_bin_partitioned_orc_1000.catalog_sales;
{code}
Iceberg creates a directory for the default partition like:

/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=null

which should be:

/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=__HIVE_DEFAULT_PARTITION__

 

  was:
When creating an Iceberg table from an existing ORC table with "insert 
overwrite", the directory corresponding to the default partition uses "null" 
instead of the value for the configuration key 
"hive.exec.default.partition.name".

For example, we create a Iceberg table from an existing ORC table 
tpcds_bin_partitioned_orc_1000.catalog_sales:
{code:java}
create table catalog_sales ( cs_sold_time_sk     bigint, cs_ship_date_sk     
bigint, cs_bill_customer_sk   bigint, cs_bill_cdemo_sk    bigint, 
cs_bill_hdemo_sk    bigint, cs_bill_addr_sk     bigint, cs_ship_customer_sk   
bigint, cs_ship_cdemo_sk    bigint, cs_ship_hdemo_sk    bigint, cs_ship_addr_sk 
    bigint, cs_call_center_sk   bigint, cs_catalog_page_sk    bigint, 
cs_ship_mode_sk     bigint, cs_warehouse_sk     bigint, cs_item_sk      bigint, 
cs_promo_sk     bigint, cs_order_number     bigint, cs_quantity     int, 
cs_wholesale_cost   double, cs_list_price     double, cs_sales_price    double, 
cs_ext_discount_amt   double, cs_ext_sales_price    double, 
cs_ext_wholesale_cost   double, cs_ext_list_price   double, cs_ext_tax      
double, cs_coupon_amt     double, cs_ext_ship_cost    double, cs_net_paid     
double, cs_net_paid_inc_tax   double, cs_net_paid_inc_ship  double, 
cs_net_paid_inc_ship_tax  double, cs_net_profit     double) partitioned by 
(cs_sold_date_sk bigint) STORED BY ICEBERG stored as orc;
insert overwrite table catalog_sales select * from 
tpcds_bin_partitioned_orc_1000.catalog_sales;
{code}
Iceberg creates a directory for the default partition like:

/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=null

which should be:

/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=__HIVE_DEFAULT_PARTITION__

 


> Iceberg uses "null" and does not use the configuration key 
> "hive.exec.default.partition.name" for default partitions.
> -
>
> Key: HIVE-26732
> URL: https://issues.apache.org/jira/browse/HIVE-26732
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Affects Versions: 4.0.0-alpha-1
>Reporter: Sungwoo Park
>Priority: Major
>
> When creating an Iceberg table from an existing ORC table with "insert 
> overwrite", the directory corresponding to the default partition uses "null" 
> instead of the value for the configuration key 
> "hive.exec.default.partition.name".
> For example, we create an Iceberg table from an existing ORC table 
> tpcds_bin_partitioned_orc_1000.catalog_sales:
> {code:java}
> create table catalog_sales ( cs_sold_time_sk     bigint, cs_ship_date_sk     
> bigint, cs_bill_customer_sk   bigint, cs_bill_cdemo_sk    bigint, 
> cs_bill_hdemo_sk    bigint, cs_bill_addr_sk     bigint, cs_ship_customer_sk   
> bigint, cs_ship_

[jira] [Commented] (HIVE-26655) TPC-DS query 17 returns wrong results

2022-10-24 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623108#comment-17623108
 ] 

Sungwoo Park commented on HIVE-26655:
-

This case is a bit different because the query completes successfully but 
returns wrong results, which is worse than failing to compile or finish the query.

I think users often run queries without computing statistics, e.g., because 
they forget to compute statistics or because they use text datasets, and so on.

> TPC-DS query 17 returns wrong results
> -
>
> Key: HIVE-26655
> URL: https://issues.apache.org/jira/browse/HIVE-26655
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sungwoo Park
>Priority: Major
>
> When tested with 100GB ORC tables, the number of rows returned by query 17 is 
> not stable. It returns fewer rows than the correct result (55 rows).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26655) TPC-DS query 17 returns wrong results

2022-10-23 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622800#comment-17622800
 ] 

Sungwoo Park commented on HIVE-26655:
-

I tested again with the latest release in the master branch (Hive on Tez), 
using the 100GB TPC-DS benchmark.

 

1. Before computing statistics with ANALYZE TABLE

With hive.tez.bloom.filter.merge.threads=0, query 17 returns correct results 
(55 rows) consistently.

With hive.tez.bloom.filter.merge.threads=1, query 17 returns wrong results 
inconsistently (7 rows, 18 rows, 17 rows, and so on).

 

2. After computing statistics with ANALYZE TABLE

Query 17 returns correct results (55 rows) whether 
hive.tez.bloom.filter.merge.threads is set to 0 or 1.

 

Here is an example of running EXPLAIN ANALYZE.

 
{code:java}
0: jdbc:hive2://blue0:9852/> set hive.tez.bloom.filter.merge.threads=1;
No rows affected (0.005 seconds)
0: jdbc:hive2://blue0:9852/> EXPLAIN ANALYZE select  i_item_id ,i_item_desc 
,s_state ,count(ss_quantity) as store_sales_quantitycount ,avg(ss_quantity) as 
store_sales_quantityave ,stddev_samp(ss_quantity) as store_sales_quantitystdev 
,stddev_samp(ss_quantity)/avg(ss_quantity) as store_sales_quantitycov 
,count(sr_return_quantity) as_store_returns_quantitycount 
,avg(sr_return_quantity) as_store_returns_quantityave 
,stddev_samp(sr_return_quantity) as_store_returns_quantitystdev 
,stddev_samp(sr_return_quantity)/avg(sr_return_quantity) as 
store_returns_quantitycov ,count(cs_quantity) as catalog_sales_quantitycount 
,avg(cs_quantity) as catalog_sales_quantityave 
,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitystdev 
,stddev_samp(cs_quantity)/avg(cs_quantity) as catalog_sales_quantitycov from 
store_sales ,store_returns ,catalog_sales ,date_dim d1 ,date_dim d2 ,date_dim 
d3 ,store ,item where d1.d_quarter_name = '2000Q1' and d1.d_date_sk = 
ss_sold_date_sk and i_item_sk = ss_item_sk and s_store_sk = ss_store_sk and 
ss_customer_sk = sr_customer_sk and ss_item_sk = sr_item_sk and 
ss_ticket_number = sr_ticket_number and sr_returned_date_sk = d2.d_date_sk and 
d2.d_quarter_name in ('2000Q1','2000Q2','2000Q3') and sr_customer_sk = 
cs_bill_customer_sk and sr_item_sk = cs_item_sk and cs_sold_date_sk = 
d3.d_date_sk and d3.d_quarter_name in ('2000Q1','2000Q2','2000Q3') group by 
i_item_id ,i_item_desc ,s_state order by i_item_id ,i_item_desc ,s_state limit 
100;
++
|                      Explain                       |
++
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Map 1 <- Map 10 (BROADCAST_EDGE), Map 4 (BROADCAST_EDGE), Map 5 
(BROADCAST_EDGE), Map 7 (BROADCAST_EDGE), Reducer 9 (BROADCAST_EDGE) |
| Map 7 <- Map 4 (BROADCAST_EDGE), Map 8 (BROADCAST_EDGE), Reducer 6 
(BROADCAST_EDGE) |
| Map 8 <- Map 4 (BROADCAST_EDGE)                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
| Reducer 3 <- Reducer 2 (SIMPLE_EDGE)               |
| Reducer 6 <- Map 5 (CUSTOM_SIMPLE_EDGE)            |
| Reducer 9 <- Map 8 (CUSTOM_SIMPLE_EDGE)            |
|                                                    | 
| Stage-0                                            |
|   Fetch Operator                                   | 
|     limit:100                                      |
|     Stage-1                                        |
|       Reducer 3 vectorized                         |
|       File Output Operator [FS_294]                |
|         Limit [LIM_293] (rows=100/13 width=466)    |
|           Number of rows:100                       |
|           Select Operator [SEL_292] (rows=15835/13 width=466) |
|             
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"]
 |
|           <-Reducer 2 [SIMPLE_EDGE] vectorized     |
|             SHUFFLE [RS_291]                       |
|               Select Operator [SEL_290] (rows=15835/13 width=458) |
|                 
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13"]
 |
|                 Group By Operator [GBY_289] (rows=15835/13 width=466) |
|                   
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10","_col11","_col12","_col13","_col14"],aggregations:["count(VALUE._col0)","sum(VALUE._col1)","sum(VALUE._col2)","sum(VALUE._col3)","count(VALUE._col4)","sum(VALUE._col5)","sum(VALUE._col6)","sum(VALUE._col7)","count(VALUE._col8)","sum(VALUE._col9)","sum(VALUE._col10)","sum(VALUE._col11)"],keys:KEY._col0,
 KEY._col1, KEY._col2 | |                 <-Map 1 [SIMPLE_EDGE] vectorized   |
|                   SHUFFLE [RS_288]                 |
|            

[jira] [Updated] (HIVE-26660) TPC-DS query 71 returns wrong results

2022-10-22 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26660:

Description: 
TPC-DS query 71 returns wrong results when tested with the 100GB dataset. The query 
fails with an error:

 

Caused by: org.apache.hadoop.hive.common.NoDynamicValuesException: Value does 
not exist in registry: RS_39_item_i_item_sk_min
    at 
org.apache.hadoop.hive.ql.exec.tez.DynamicValueRegistryTez.getValue(DynamicValueRegistryTez.java:77)
    at 
org.apache.hadoop.hive.ql.plan.DynamicValue.getValue(DynamicValue.java:128)
    at 
org.apache.hadoop.hive.ql.exec.vector.expressions.gen.FilterLongColumnBetweenDynamicValue.evaluate(FilterLongColumnBetweenDynamicValue.java:88)
    at 
org.apache.hadoop.hive.ql.exec.vector.expressions.FilterExprAndExpr.evaluate(FilterExprAndExpr.java:42)
    at 
org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:125)
    at org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
    at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:171)
    at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:809)
    at 
org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:842)

 

 

  was:
TPC-DS query 71 returns wrong results when tested with 100GB dataset.

 

 


> TPC-DS query 71 returns wrong results
> -
>
> Key: HIVE-26660
> URL: https://issues.apache.org/jira/browse/HIVE-26660
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sungwoo Park
>Priority: Major
>
> TPC-DS query 71 returns wrong results when tested with the 100GB dataset. The 
> query fails with an error:
>  
> Caused by: org.apache.hadoop.hive.common.NoDynamicValuesException: Value does 
> not exist in registry: RS_39_item_i_item_sk_min
>     at 
> org.apache.hadoop.hive.ql.exec.tez.DynamicValueRegistryTez.getValue(DynamicValueRegistryTez.java:77)
>     at 
> org.apache.hadoop.hive.ql.plan.DynamicValue.getValue(DynamicValue.java:128)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.expressions.gen.FilterLongColumnBetweenDynamicValue.evaluate(FilterLongColumnBetweenDynamicValue.java:88)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.expressions.FilterExprAndExpr.evaluate(FilterExprAndExpr.java:42)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.VectorFilterOperator.process(VectorFilterOperator.java:125)
>     at 
> org.apache.hadoop.hive.ql.exec.Operator.vectorForward(Operator.java:919)
>     at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:171)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.deliverVectorizedRowBatch(VectorMapOperator.java:809)
>     at 
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:842)
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26660) TPC-DS query 71 returns wrong results

2022-10-22 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26660:

Parent: HIVE-26654
Issue Type: Sub-task  (was: Bug)

> TPC-DS query 71 returns wrong results
> -
>
> Key: HIVE-26660
> URL: https://issues.apache.org/jira/browse/HIVE-26660
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sungwoo Park
>Priority: Major
>
> TPC-DS query 71 returns wrong results when tested with the 100GB dataset.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26659) TPC-DS query 16, 69, 94 return wrong results.

2022-10-22 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26659:

Parent: HIVE-26654
Issue Type: Sub-task  (was: Bug)

> TPC-DS query 16, 69, 94 return wrong results.
> -
>
> Key: HIVE-26659
> URL: https://issues.apache.org/jira/browse/HIVE-26659
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-2
>Reporter: Sungwoo Park
>Priority: Major
>
> TPC-DS query 16, 69, 94 return wrong results when hive.auto.convert.anti.join 
> is set to true.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2022-10-20 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621458#comment-17621458
 ] 

Sungwoo Park commented on HIVE-23880:
-

Created a new ticket: HIVE-26655

> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 4.0.0-alpha-1
>
> Attachments: lipwig-output3605036885489193068.svg
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Merging bloom filters in semijoin reduction can become the main bottleneck in 
> the case of a large number of source mapper tasks (~1000, Map 1 in the example below) 
> and a large number of expected entries (50M) in the bloom filters.
> For example in TPCDS Q93:
> {code}
> select /*+ semi(store_returns, sr_item_sk, store_sales, 7000)*/ 
> ss_customer_sk
> ,sum(act_sales) sumsales
>   from (select ss_item_sk
>   ,ss_ticket_number
>   ,ss_customer_sk
>   ,case when sr_return_quantity is not null then 
> (ss_quantity-sr_return_quantity)*ss_sales_price
> else 
> (ss_quantity*ss_sales_price) end act_sales
> from store_sales left outer join store_returns on (sr_item_sk = 
> ss_item_sk
>and 
> sr_ticket_number = ss_ticket_number)
> ,reason
> where sr_reason_sk = r_reason_sk
>   and r_reason_desc = 'reason 66') t
>   group by ss_customer_sk
>   order by sumsales, ss_customer_sk
> limit 100;
> {code}
> On 10TB-30TB scale there is a chance that from 3-4 mins of query runtime 1-2 
> mins are spent with merging bloom filters (Reducer 2), as in:  
> [^lipwig-output3605036885489193068.svg] 
> {code}
> --
> VERTICES  MODESTATUS  TOTAL  COMPLETED  RUNNING  PENDING  
> FAILED  KILLED
> --
> Map 3 ..  llap SUCCEEDED  1  100  
>  0   0
> Map 1 ..  llap SUCCEEDED   1263   126300  
>  0   0
> Reducer 2 llap   RUNNING  1  010  
>  0   0
> Map 4 llap   RUNNING   6154  0  207 5947  
>  0   0
> Reducer 5 llapINITED 43  00   43  
>  0   0
> Reducer 6 llapINITED  1  001  
>  0   0
> --
> VERTICES: 02/06  [>>--] 16%   ELAPSED TIME: 149.98 s
> --
> {code}
> For example, 70M entries in a bloom filter lead to 436 465 696 bits, so 
> merging 1263 bloom filters means running ~ 1263 * 436 465 696 bitwise OR 
> operations, which is a very hot codepath, but can be parallelized.
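
The parallelization idea can be sketched as follows (illustration only, not the actual VectorUDAFBloomFilterMerge code): each worker thread ORs a disjoint range of 64-bit words of the result, so the merge needs no locking:
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: parallel bitwise-OR merge of equally sized bloom filter bit sets.
// Each thread owns a disjoint word range of the result, so no locking is needed.
public final class ParallelBloomMergeSketch {
  public static long[] merge(List<long[]> filters, int threads) throws Exception {
    int words = filters.get(0).length;
    long[] result = new long[words];
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<?>> futures = new ArrayList<>();
    int chunk = (words + threads - 1) / threads;
    for (int t = 0; t < threads; t++) {
      final int start = t * chunk;
      final int end = Math.min(words, start + chunk);
      futures.add(pool.submit(() -> {
        for (long[] filter : filters) {
          for (int i = start; i < end; i++) {
            result[i] |= filter[i];   // OR this filter's words into the result
          }
        }
      }));
    }
    for (Future<?> f : futures) {
      f.get();                        // wait for completion, propagate failures
    }
    pool.shutdown();
    return result;
  }
}
{code}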



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26655) TPC-DS query 17 returns wrong results

2022-10-20 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26655:

Parent: HIVE-26654
Issue Type: Sub-task  (was: Bug)

> TPC-DS query 17 returns wrong results
> -
>
> Key: HIVE-26655
> URL: https://issues.apache.org/jira/browse/HIVE-26655
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Sungwoo Park
>Priority: Major
> Fix For: 4.0.0-alpha-2
>
>
> When tested with 100GB ORC tables, the number of rows returned by query 17 is 
> not stable. It returns fewer rows than the correct result (55 rows).
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-24167) TPC-DS query 14 fails while generating plan for the filter

2022-10-20 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621449#comment-17621449
 ] 

Sungwoo Park commented on HIVE-24167:
-

I created an umbrella Jira ticket, HIVE-26654. I think we could move this Jira 
under HIVE-26654 as a sub-task.

> TPC-DS query 14 fails while generating plan for the filter
> --
>
> Key: HIVE-24167
> URL: https://issues.apache.org/jira/browse/HIVE-24167
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: Zoltan Haindrich
>Priority: Major
>
> TPC-DS query 14 (cbo_query14.q and query4.q) fail with NPE on the metastore 
> with the partitioned TPC-DS 30TB dataset while generating the plan for the 
> filter.
> The problem can be reproduced using the PR in HIVE-23965.
> The current stack trace shows that the NPE occurs while trying to display a 
> debug message, but even if this line didn't exist it would fail again later on.
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at 
> org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
> at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:355)
> at 
> org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(Q

[jira] (HIVE-26621) TPC-DS query 2 fails with java.lang.RuntimeException: cannot find field _col0 from []

2022-10-20 Thread Sungwoo Park (Jira)


[ https://issues.apache.org/jira/browse/HIVE-26621 ]


Sungwoo Park deleted comment on HIVE-26621:
-

was (Author: glapark):
I created an umbrella Jira ticket HIVE-26654. I think we could move this Jira 
as a sub-task of HIVE-26654.

> TPC-DS query 2 fails with java.lang.RuntimeException: cannot find field _col0 
> from []
> -
>
> Key: HIVE-26621
> URL: https://issues.apache.org/jira/browse/HIVE-26621
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-1
>Reporter: Sungwoo Park
>Priority: Major
>
> How to reproduce:
> 1. Build Hive using the master branch. Tez version is 0.10.2.
> 2. Generate 100GB TPC-DS text data.
> 3. Load the TPC-DS data using ORC format.
> 4. Run TPC-DS queries. Query 2 fails with the error shown below.
>  
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: cannot find field _col0 from []
>     at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:384)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>     at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>     at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>     ... 19 more
> Caused by: java.lang.RuntimeException: cannot find field _col0 from []
>     at 
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:550)
>     at 
> org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:153)
>     at 
> org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:56)
>     at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:1073)
>     at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:1099)
>     at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:305)
>     ... 22 more
>  
> {code}
>  
> A slightly simplified query that produces the same error is:
> {code:java}
> with wscs as
>  (select sold_date_sk
>         ,sales_price
>   from (select ws_sold_date_sk sold_date_sk
>               ,ws_ext_sales_price sales_price
>         from web_sales) x
>         union all
>        (select cs_sold_date_sk sold_date_sk
>               ,cs_ext_sales_price sales_price
>         from catalog_sales)),
>  wswscs as
>  (select d_week_seq
>  from wscs
>      ,date_dim
>  where d_date_sk = sold_date_sk
>  group by d_week_seq)
>  select d_week_seq1
>  from
>  (select wswscs.d_week_seq d_week_seq1
>   from wswscs,date_dim
>   where date_dim.d_week_seq = wswscs.d_week_seq and d_year = 2001) y,
>  (select wswscs.d_week_seq d_week_seq2
>   from wswscs,date_dim
>   where date_dim.d_week_seq = wswscs.d_week_seq and d_year = 2001+1) z
>  where d_week_seq1=d_week_seq2-53
>  order by d_week_seq1;{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-26621) TPC-DS query 2 fails with java.lang.RuntimeException: cannot find field _col0 from []

2022-10-20 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621448#comment-17621448
 ] 

Sungwoo Park commented on HIVE-26621:
-

I created an umbrella Jira ticket, HIVE-26654. I think we could move this Jira 
under HIVE-26654 as a sub-task.

> TPC-DS query 2 fails with java.lang.RuntimeException: cannot find field _col0 
> from []
> -
>
> Key: HIVE-26621
> URL: https://issues.apache.org/jira/browse/HIVE-26621
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-1
>Reporter: Sungwoo Park
>Priority: Major
>
> How to reproduce:
> 1. Build Hive using the master branch. Tez version is 0.10.2.
> 2. Generate 100GB TPC-DS text data.
> 3. Load the TPC-DS data using ORC format.
> 4. Run TPC-DS queries. Query 2 fails with the error shown below.
>  
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: cannot find field _col0 from []
>     at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:384)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>     at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>     at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>     ... 19 more
> Caused by: java.lang.RuntimeException: cannot find field _col0 from []
>     at 
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:550)
>     at 
> org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:153)
>     at 
> org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:56)
>     at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:1073)
>     at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:1099)
>     at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:305)
>     ... 22 more
>  
> {code}
>  
> A slightly simplified query that produces the same error is:
> {code:java}
> with wscs as
>  (select sold_date_sk
>         ,sales_price
>   from (select ws_sold_date_sk sold_date_sk
>               ,ws_ext_sales_price sales_price
>         from web_sales) x
>         union all
>        (select cs_sold_date_sk sold_date_sk
>               ,cs_ext_sales_price sales_price
>         from catalog_sales)),
>  wswscs as
>  (select d_week_seq
>  from wscs
>      ,date_dim
>  where d_date_sk = sold_date_sk
>  group by d_week_seq)
>  select d_week_seq1
>  from
>  (select wswscs.d_week_seq d_week_seq1
>   from wswscs,date_dim
>   where date_dim.d_week_seq = wswscs.d_week_seq and d_year = 2001) y,
>  (select wswscs.d_week_seq d_week_seq2
>   from wswscs,date_dim
>   where date_dim.d_week_seq = wswscs.d_week_seq and d_year = 2001+1) z
>  where d_week_seq1=d_week_seq2-53
>  order by d_week_seq1;{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-26621) TPC-DS query 2 fails with java.lang.RuntimeException: cannot find field _col0 from []

2022-10-20 Thread Sungwoo Park (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sungwoo Park updated HIVE-26621:

Parent: HIVE-26654
Issue Type: Sub-task  (was: Bug)

> TPC-DS query 2 fails with java.lang.RuntimeException: cannot find field _col0 
> from []
> -
>
> Key: HIVE-26621
> URL: https://issues.apache.org/jira/browse/HIVE-26621
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-1
>Reporter: Sungwoo Park
>Priority: Major
>
> How to reproduce:
> 1. Build Hive using the master branch. Tez version is 0.10.2.
> 2. Generate 100GB TPC-DS text data.
> 3. Load the TPC-DS data using ORC format.
> 4. Run TPC-DS queries. Query 2 fails with the error shown below.
>  
> {code:java}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: cannot find field _col0 from []
>     at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:384)
>     at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>     at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>     at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>     ... 19 more
> Caused by: java.lang.RuntimeException: cannot find field _col0 from []
>     at 
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:550)
>     at 
> org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:153)
>     at 
> org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:56)
>     at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:1073)
>     at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:1099)
>     at 
> org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.process(ReduceSinkOperator.java:305)
>     ... 22 more
>  
> {code}
>  
> A slightly simplified query that produces the same error is:
> {code:java}
> with wscs as
>  (select sold_date_sk
>         ,sales_price
>   from (select ws_sold_date_sk sold_date_sk
>               ,ws_ext_sales_price sales_price
>         from web_sales) x
>         union all
>        (select cs_sold_date_sk sold_date_sk
>               ,cs_ext_sales_price sales_price
>         from catalog_sales)),
>  wswscs as
>  (select d_week_seq
>  from wscs
>      ,date_dim
>  where d_date_sk = sold_date_sk
>  group by d_week_seq)
>  select d_week_seq1
>  from
>  (select wswscs.d_week_seq d_week_seq1
>   from wswscs,date_dim
>   where date_dim.d_week_seq = wswscs.d_week_seq and d_year = 2001) y,
>  (select wswscs.d_week_seq d_week_seq2
>   from wswscs,date_dim
>   where date_dim.d_week_seq = wswscs.d_week_seq and d_year = 2001+1) z
>  where d_week_seq1=d_week_seq2-53
>  order by d_week_seq1;{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-23880) Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge

2022-10-20 Thread Sungwoo Park (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621440#comment-17621440
 ] 

Sungwoo Park commented on HIVE-23880:
-

We are testing Hive 4 using the TPC-DS 1TB benchmark. So far, we have found 
that three queries (query 2, query 14-2, query 17) either fail or return wrong 
results. Let me create a new Jira ticket with three sub-tasks. More sub-tasks 
will be added if needed.

I wonder if anyone runs system tests with the TPC-DS benchmark. From our 
previous testing, checking query plans (e.g., using HIVE-23965) is not enough. 
For example, loading ORC tables can fail.

> Bloom filters can be merged in a parallel way in VectorUDAFBloomFilterMerge
> ---
>
> Key: HIVE-23880
> URL: https://issues.apache.org/jira/browse/HIVE-23880
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 4.0.0-alpha-1
>
> Attachments: lipwig-output3605036885489193068.svg
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Merging bloom filters in semijoin reduction can become the main bottleneck in 
> the case of a large number of source mapper tasks (~1000, Map 1 in the example 
> below) and a large number of expected entries (50M) in the bloom filters.
> For example, in TPC-DS Q93:
> {code}
> select /*+ semi(store_returns, sr_item_sk, store_sales, 7000)*/ 
> ss_customer_sk
> ,sum(act_sales) sumsales
>   from (select ss_item_sk
>   ,ss_ticket_number
>   ,ss_customer_sk
>   ,case when sr_return_quantity is not null then 
> (ss_quantity-sr_return_quantity)*ss_sales_price
> else 
> (ss_quantity*ss_sales_price) end act_sales
> from store_sales left outer join store_returns on (sr_item_sk = 
> ss_item_sk
>and 
> sr_ticket_number = ss_ticket_number)
> ,reason
> where sr_reason_sk = r_reason_sk
>   and r_reason_desc = 'reason 66') t
>   group by ss_customer_sk
>   order by sumsales, ss_customer_sk
> limit 100;
> {code}
> At 10TB-30TB scale, 1-2 minutes out of a 3-4 minute query runtime can be spent 
> merging bloom filters (Reducer 2), as in:  
> [^lipwig-output3605036885489193068.svg] 
> {code}
> --
> VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
> --
> Map 3 ..      llap     SUCCEEDED      1          1        0        0       0       0
> Map 1 ..      llap     SUCCEEDED   1263       1263        0        0       0       0
> Reducer 2     llap       RUNNING      1          0        1        0       0       0
> Map 4         llap       RUNNING   6154          0      207     5947       0       0
> Reducer 5     llap        INITED     43          0        0       43       0       0
> Reducer 6     llap        INITED      1          0        0        1       0       0
> --
> VERTICES: 02/06  [>>--] 16%   ELAPSED TIME: 149.98 s
> --
> {code}
> For example, 70M entries lead to a bloom filter of 436,465,696 bits, so 
> merging 1263 bloom filters means running ~ 1263 * 436,465,696 bitwise OR 
> operations, which is a very hot code path, but it can be parallelized.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)