[jira] [Commented] (HIVE-27788) Exception in Sort Merge join with Group By + PTF Operator

2023-11-15 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786646#comment-17786646
 ] 

Stamatis Zampetakis commented on HIVE-27788:


Thanks for the clarification [~kkasa]! If PTF is irrelevant then let's update 
at least the summary accordingly so that we have a more meaningful entry in the 
release notes.

> Exception in Sort Merge join with Group By + PTF Operator
> -
>
> Key: HIVE-27788
> URL: https://issues.apache.org/jira/browse/HIVE-27788
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 4.0.0-beta-1
>Reporter: Riju Trivedi
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: auto_sortmerge_join_17.q
>
>
> Sort-merge join with Group By + PTF operator leads to a runtime exception:
> {code:java}
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:313)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:291)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:293)
>   ... 15 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:387)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:303)
>   ... 17 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:392)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:372)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:316)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:127)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.handleOutputRows(PTFOperator.java:337)
>   at 
> org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.processRow(PTFOperator.java:325)
>   at 
> org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:139)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:372)
>   ... 18 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:534)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:488)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:390)
>   ... 31 more
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:313)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:522)
>   ... 33 more {code}
> Issue can be reproduced with [^auto_sortmerge_join_17.q]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27877) Bump org.apache.avro:avro from 1.11.1 to 1.11.3

2023-11-15 Thread Ayush Saxena (Jira)
Ayush Saxena created HIVE-27877:
---

 Summary: Bump org.apache.avro:avro from 1.11.1 to 1.11.3 
 Key: HIVE-27877
 URL: https://issues.apache.org/jira/browse/HIVE-27877
 Project: Hive
  Issue Type: Improvement
Reporter: Ayush Saxena


PR from [dependabot|https://github.com/apps/dependabot]

https://github.com/apache/hive/pull/4764



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27871) Fix some formatting problems in YarnQueueHelper

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-27871:
--
Labels: newbie pull-request-available  (was: newbie)

> Fix some formatting problems in YarnQueueHelper
> ---
>
> Key: HIVE-27871
> URL: https://issues.apache.org/jira/browse/HIVE-27871
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: Mahesh Raju Somalaraju
>Priority: Major
>  Labels: newbie, pull-request-available
>
> https://github.com/apache/hive/blob/cbc5d2d7d650f90882c5c4ad0026a94d2e586acb/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/YarnQueueHelper.java#L54-L57
> {code}
>   private static String webapp_conf_key = YarnConfiguration.RM_WEBAPP_ADDRESS;
>   private static String webapp_ssl_conf_key = 
> YarnConfiguration.RM_WEBAPP_HTTPS_ADDRESS;
>   private static String yarn_HA_enabled = YarnConfiguration.RM_HA_ENABLED;
>   private static String yarn_HA_rmids = YarnConfiguration.RM_HA_IDS;
> {code}
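>
> A minimal sketch of the conventional cleanup (an assumption of what the fix would look like, not the merged patch): the fields are effectively constants, so they can become static final with Java constant naming.
> {code:java}
> // Hypothetical cleanup; field names follow the usual UPPER_SNAKE_CASE
> // convention for constants. The referenced YarnConfiguration keys are real.
> private static final String WEBAPP_CONF_KEY = YarnConfiguration.RM_WEBAPP_ADDRESS;
> private static final String WEBAPP_SSL_CONF_KEY = YarnConfiguration.RM_WEBAPP_HTTPS_ADDRESS;
> private static final String YARN_HA_ENABLED = YarnConfiguration.RM_HA_ENABLED;
> private static final String YARN_HA_RM_IDS = YarnConfiguration.RM_HA_IDS;
> {code}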



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27843) Add QueryOperation to Hive proto logger for post execution hook information

2023-11-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-27843:
--
Labels: pull-request-available  (was: )

> Add QueryOperation to Hive proto logger for post execution hook information
> ---
>
> Key: HIVE-27843
> URL: https://issues.apache.org/jira/browse/HIVE-27843
> Project: Hive
>  Issue Type: Task
>Reporter: Ramesh Kumar Thangarajan
>Assignee: Ramesh Kumar Thangarajan
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the query operation type is missing from the proto logger.
> Add QueryOperation to the Hive proto logger as part of the post-execution hook information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27876) Incorrect query results on tables with ClusterBy & SortBy

2023-11-15 Thread Naresh P R (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naresh P R updated HIVE-27876:
--
Description: 
Repro:

 
{code:java}
create external table test_bucket(age int, name string, dept string) clustered 
by (age, name) sorted by (age asc, name asc) into 2 buckets stored as orc;
insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');

//empty wrong results
select age, name, count(*) from test_bucket group by  age, name having count(*) 
> 1; 
+--+---+--+
| age  | name  | _c2  |
+--+---+--+
+--+---+--+

// Workaround
set hive.map.aggr=false;
select age, name, count(*) from test_bucket group by  age, name having count(*) 
> 1; 
+--++--+
| age  |  name  | _c2  |
+--++--+
| 1    | user1  | 2    |
| 2    | user2  | 2    |
+--++--+ {code}
 

 

  was:
Repro:

 
{code:java}
create external table test_bucket(age int, name string, dept string) clustered 
by (age, name) sorted by (age asc, name asc) into 2 buckets stored as orc;
insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');

//empty wrong results with default CDP configs
select age, name, count(*) from test_bucket group by  age, name having count(*) 
> 1; 
+--+---+--+
| age  | name  | _c2  |
+--+---+--+
+--+---+--+

// Workaround
set hive.map.aggr=false;
select age, name, count(*) from test_bucket group by  age, name having count(*) 
> 1; 
+--++--+
| age  |  name  | _c2  |
+--++--+
| 1    | user1  | 2    |
| 2    | user2  | 2    |
+--++--+ {code}
 

 


> Incorrect query results on tables with ClusterBy & SortBy
> -
>
> Key: HIVE-27876
> URL: https://issues.apache.org/jira/browse/HIVE-27876
> Project: Hive
>  Issue Type: Bug
>Reporter: Naresh P R
>Priority: Major
>
> Repro:
>  
> {code:java}
> create external table test_bucket(age int, name string, dept string) 
> clustered by (age, name) sorted by (age asc, name asc) into 2 buckets stored 
> as orc;
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> //empty wrong results
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +--+---+--+
> | age  | name  | _c2  |
> +--+---+--+
> +--+---+--+
> // Workaround
> set hive.map.aggr=false;
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +--++--+
> | age  |  name  | _c2  |
> +--++--+
> | 1    | user1  | 2    |
> | 2    | user2  | 2    |
> +--++--+ {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27876) Incorrect query results on tables with ClusterBy & SortBy

2023-11-15 Thread Naresh P R (Jira)
Naresh P R created HIVE-27876:
-

 Summary: Incorrect query results on tables with ClusterBy & SortBy
 Key: HIVE-27876
 URL: https://issues.apache.org/jira/browse/HIVE-27876
 Project: Hive
  Issue Type: Bug
Reporter: Naresh P R


Repro:

 
{code:java}
create external table test_bucket(age int, name string, dept string) clustered 
by (age, name) sorted by (age asc, name asc) into 2 buckets stored as orc;
insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');

//empty wrong results with default CDP configs
select age, name, count(*) from test_bucket group by  age, name having count(*) 
> 1; 
+--+---+--+
| age  | name  | _c2  |
+--+---+--+
+--+---+--+

// Workaround
set hive.map.aggr=false;
select age, name, count(*) from test_bucket group by  age, name having count(*) 
> 1; 
+--++--+
| age  |  name  | _c2  |
+--++--+
| 1    | user1  | 2    |
| 2    | user2  | 2    |
+--++--+ {code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27875) OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed input stream

2023-11-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-27875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor resolved HIVE-27875.
-
Resolution: Duplicate

> OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed input stream
> --
>
> Key: HIVE-27875
> URL: https://issues.apache.org/jira/browse/HIVE-27875
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>
> This codepath:
> {code}
> hiveserver2 <14>1 2023-11-15T16:05:10.504Z hiveserver2-0 hiveserver2 1 
> a51a9165-623b-4837-b087-818cd7e78d88 [mdc@18060 class="s3a.S3AInputStream" 
> level="INFO" operationLogLevel="EXECUTION" 
> queryId="hive_20231115160510_cdab039d-efd4-4711-b75c-c382798b7640" 
> sessionId="e7f1b1b3-ad51-4823-8a9b-17228c2216ef" 
> thread="HiveServer2-Background-Pool: Thread-164"] Reopen called, 
> trace\rjava.lang.RuntimeException
> at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:262)
> at 
> org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:437)
> at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:284)
> at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
> at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:408)
> at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
> at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:404)
> at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:282)
> at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:326)
> at 
> org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:429)
> at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:547)
> at 
> org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:838)
> at 
> org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:118)
> at org.apache.orc.impl.ReaderImpl.read(ReaderImpl.java:702)
> at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:806)
> at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:567)
> at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61)
> at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:112)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.validateInput(OrcInputFormat.java:655)
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.checkInputFormat(HiveFileFormatUtils.java:207)
> at 
> org.apache.hadoop.hive.ql.exec.MoveTask.checkFileFormats(MoveTask.java:826)
> at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:493)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
> at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:356)
> at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:329)
> at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
> at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:107)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:809)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:546)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:540)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:190)
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:235)
> at 
> org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:92)
> at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:340)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:360)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:829)\r
> {code}
> ReaderImpl.extractFileTail creates an FSDataInputStream, but we never call
> close() on that reader from OrcInputFormat.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-27875) OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed input stream

2023-11-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-27875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-27875:

Description: 
This codepath:
{code}
hiveserver2 <14>1 2023-11-15T16:05:10.504Z hiveserver2-0 hiveserver2 1 
a51a9165-623b-4837-b087-818cd7e78d88 [mdc@18060 class="s3a.S3AInputStream" 
level="INFO" operationLogLevel="EXECUTION" 
queryId="hive_20231115160510_cdab039d-efd4-4711-b75c-c382798b7640" 
sessionId="e7f1b1b3-ad51-4823-8a9b-17228c2216ef" 
thread="HiveServer2-Background-Pool: Thread-164"] Reopen called, 
trace\rjava.lang.RuntimeException
at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:262)
at 
org.apache.hadoop.fs.s3a.S3AInputStream.lambda$lazySeek$1(S3AInputStream.java:437)
at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$3(Invoker.java:284)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
at org.apache.hadoop.fs.s3a.Invoker.lambda$maybeRetry$5(Invoker.java:408)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:404)
at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:282)
at org.apache.hadoop.fs.s3a.Invoker.maybeRetry(Invoker.java:326)
at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:429)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:547)
at 
org.apache.hadoop.fs.s3a.S3AInputStream.readFully(S3AInputStream.java:838)
at 
org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:118)
at org.apache.orc.impl.ReaderImpl.read(ReaderImpl.java:702)
at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:806)
at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:567)
at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:61)
at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:112)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.validateInput(OrcInputFormat.java:655)
at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.checkInputFormat(HiveFileFormatUtils.java:207)
at 
org.apache.hadoop.hive.ql.exec.MoveTask.checkFileFormats(MoveTask.java:826)
at org.apache.hadoop.hive.ql.exec.MoveTask.execute(MoveTask.java:493)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:356)
at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:329)
at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:246)
at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:107)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:809)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:546)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:540)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:190)
at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:235)
at 
org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:92)
at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:340)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at 
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:360)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)\r
{code}

ReaderImpl.extractFileTail creates an FSDataInputStream, but we never call
close() on that reader from OrcInputFormat.
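
A minimal sketch of the implied fix (an assumption, not the committed patch): close the reader so the stream opened for the file tail is released. It assumes org.apache.orc.Reader implements java.io.Closeable, as it does in recent ORC releases; the helper class and method names below are hypothetical.
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;

public final class OrcTailProbe {
  // try-with-resources guarantees the Reader (and the FSDataInputStream it
  // opened in extractFileTail) is closed, so no socket is left in CLOSE_WAIT.
  static boolean canReadTail(Path path, Configuration conf) throws IOException {
    try (Reader reader = OrcFile.createReader(path, OrcFile.readerOptions(conf))) {
      return reader.getNumberOfRows() >= 0;
    }
  }
}
{code}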

> OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed input stream
> --
>
> Key: HIVE-27875
> URL: https://issues.apache.org/jira/browse/HIVE-27875
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>
> This codepath:
> {code}
> hiveserver2 <14>1 2023-11-15T16:05:10.504Z hiveserver2-0 hiveserver2 1 
> a51a9165-623b-4837-b087-818cd7e78d88 [mdc@18060 class="s3a.S3AInputStream" 
> level="IN

[jira] [Assigned] (HIVE-27875) OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed input stream

2023-11-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-27875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor reassigned HIVE-27875:
---

Assignee: László Bodor

> OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed input stream
> --
>
> Key: HIVE-27875
> URL: https://issues.apache.org/jira/browse/HIVE-27875
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Work started] (HIVE-27875) OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed input stream

2023-11-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HIVE-27875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-27875 started by László Bodor.
---
> OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed input stream
> --
>
> Key: HIVE-27875
> URL: https://issues.apache.org/jira/browse/HIVE-27875
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27875) OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed input stream

2023-11-15 Thread Jira
László Bodor created HIVE-27875:
---

 Summary: OrcInputFormat leaks a CLOSE_WAIT socket with an unclosed 
input stream
 Key: HIVE-27875
 URL: https://issues.apache.org/jira/browse/HIVE-27875
 Project: Hive
  Issue Type: Improvement
Reporter: László Bodor






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27874) Parallelize JDBC datatype conversion

2023-11-15 Thread Kurt Deschler (Jira)
Kurt Deschler created HIVE-27874:


 Summary: Parallelize JDBC datatype conversion 
 Key: HIVE-27874
 URL: https://issues.apache.org/jira/browse/HIVE-27874
 Project: Hive
  Issue Type: Improvement
Reporter: Kurt Deschler
Assignee: Kurt Deschler


JDBC datatype conversion is currently performed synchronously by client
applications as part of the getObject() calls. This can bottleneck client
applications that are rapidly fetching rows via JDBC.
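
For context, a typical fetch loop using the standard java.sql API; every getObject() call below performs the datatype conversion on the calling thread, which is the per-row cost this jira proposes to move off the critical path (the query and column indexes are illustrative only):
{code:java}
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class FetchLoop {
  static void drain(Connection conn) throws SQLException {
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT id, name FROM t")) {
      while (rs.next()) {
        // Each getObject() converts the wire value to a Java object
        // synchronously, per column, per row.
        Object id = rs.getObject(1);
        Object name = rs.getObject(2);
      }
    }
  }
}
{code}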



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27869) Iceberg: Select on HadoopTable fails at HiveIcebergStorageHandler#canProvideColStats

2023-11-15 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko resolved HIVE-27869.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

> Iceberg: Select on HadoopTable fails at 
> HiveIcebergStorageHandler#canProvideColStats
> 
>
> Key: HIVE-27869
> URL: https://issues.apache.org/jira/browse/HIVE-27869
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: zhangbutao
>Assignee: zhangbutao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Steps to reproduce (latest master code):
> 1) Create path-based HadoopTable by Spark:
>  
> {code:java}
> ./spark-3.3.1-bin-hadoop3/bin/spark-sql \
>   --master local \
>   --deploy-mode client \
>   --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
>   --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
>   --conf spark.sql.catalog.spark_catalog.type=hadoop \
>   --conf spark.sql.catalog.spark_catalog.warehouse=hdfs://localhost:8028/tmp/testiceberg;
> create table ice_test_001(id int) using iceberg;
> insert into ice_test_001(id) values(1),(2),(3);{code}
>  
> 2) Create iceberg table based on the HadoopTable by Hive:
> {code:java}
> CREATE EXTERNAL TABLE ice_test_001 STORED BY 
> 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 
> 'hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001' TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table'); {code}
> 3) Select the HadoopTable by Hive
> // launch a Tez task to scan the data
> *set hive.fetch.task.conversion=none;*
> {code:java}
> jdbc:hive2://localhost:1/default> select * from ice_test_001;
> Error: Error while compiling statement: FAILED: IllegalArgumentException 
> Pathname 
> /tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  from 
> hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  is not a valid DFS filename. (state=42000,code=4) {code}
> Full stacktrace:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Pathname 
> /tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  from 
> hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  is not a valid DFS filename.
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:256)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1752)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1749)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  ~[hadoop-common-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1764)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760) 
> ~[hadoop-common-3.3.1.jar:?]
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.canProvideColStats(HiveIcebergStorageHandler.java:540)
>  ~[hive-iceberg-handler-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.canProvideColStatistics(HiveIcebergStorageHandler.java:533)
>  ~[hive-iceberg-handler-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.getTableColumnStats(StatsUtils.java:1073)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:302)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:193)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:181)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$TableScanStatsRule.process(StatsRulesProcFactory.java:173)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
>  ~[hive-exec-4.0.0-be

[jira] [Commented] (HIVE-27869) Iceberg: Select on HadoopTable fails at HiveIcebergStorageHandler#canProvideColStats

2023-11-15 Thread Denys Kuzmenko (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786403#comment-17786403
 ] 

Denys Kuzmenko commented on HIVE-27869:
---

Merged to master.
[~zhangbutao], thanks for the patch!

> Iceberg: Select on HadoopTable fails at 
> HiveIcebergStorageHandler#canProvideColStats
> 
>
> Key: HIVE-27869
> URL: https://issues.apache.org/jira/browse/HIVE-27869
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: zhangbutao
>Assignee: zhangbutao
>Priority: Major
>  Labels: pull-request-available
>
> Steps to reproduce (latest master code):
> 1) Create path-based HadoopTable by Spark:
>  
> {code:java}
> ./spark-3.3.1-bin-hadoop3/bin/spark-sql \
>   --master local \
>   --deploy-mode client \
>   --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
>   --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
>   --conf spark.sql.catalog.spark_catalog.type=hadoop \
>   --conf spark.sql.catalog.spark_catalog.warehouse=hdfs://localhost:8028/tmp/testiceberg;
> create table ice_test_001(id int) using iceberg;
> insert into ice_test_001(id) values(1),(2),(3);{code}
>  
> 2) Create iceberg table based on the HadoopTable by Hive:
> {code:java}
> CREATE EXTERNAL TABLE ice_test_001 STORED BY 
> 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 
> 'hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001' TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table'); {code}
> 3) Select the HadoopTable by Hive
> // launch a Tez task to scan the data
> *set hive.fetch.task.conversion=none;*
> {code:java}
> jdbc:hive2://localhost:1/default> select * from ice_test_001;
> Error: Error while compiling statement: FAILED: IllegalArgumentException 
> Pathname 
> /tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  from 
> hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  is not a valid DFS filename. (state=42000,code=4) {code}
> Full stacktrace:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Pathname 
> /tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  from 
> hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  is not a valid DFS filename.
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:256)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1752)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1749)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  ~[hadoop-common-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1764)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760) 
> ~[hadoop-common-3.3.1.jar:?]
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.canProvideColStats(HiveIcebergStorageHandler.java:540)
>  ~[hive-iceberg-handler-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.canProvideColStatistics(HiveIcebergStorageHandler.java:533)
>  ~[hive-iceberg-handler-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.getTableColumnStats(StatsUtils.java:1073)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:302)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:193)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:181)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$TableScanStatsRule.process(StatsRulesProcFactory.java:173)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.jav

[jira] [Updated] (HIVE-27869) Iceberg: Select on HadoopTable fails at HiveIcebergStorageHandler#canProvideColStats

2023-11-15 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-27869:
--
Summary: Iceberg: Select on HadoopTable fails at 
HiveIcebergStorageHandler#canProvideColStats  (was: Iceberg: Select on 
HadoopTables fails at HiveIcebergStorageHandler#canProvideColStats)

> Iceberg: Select on HadoopTable fails at 
> HiveIcebergStorageHandler#canProvideColStats
> 
>
> Key: HIVE-27869
> URL: https://issues.apache.org/jira/browse/HIVE-27869
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: zhangbutao
>Assignee: zhangbutao
>Priority: Major
>  Labels: pull-request-available
>
> Steps to reproduce (latest master code):
> 1) Create path-based HadoopTable by Spark:
>  
> {code:java}
> ./spark-3.3.1-bin-hadoop3/bin/spark-sql \
>   --master local \
>   --deploy-mode client \
>   --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
>   --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
>   --conf spark.sql.catalog.spark_catalog.type=hadoop \
>   --conf spark.sql.catalog.spark_catalog.warehouse=hdfs://localhost:8028/tmp/testiceberg;
> create table ice_test_001(id int) using iceberg;
> insert into ice_test_001(id) values(1),(2),(3);{code}
>  
> 2) Create iceberg table based on the HadoopTable by Hive:
> {code:java}
> CREATE EXTERNAL TABLE ice_test_001 STORED BY 
> 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 
> 'hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001' TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table'); {code}
> 3) Select the HadoopTable by Hive
> // launch a Tez task to scan the data
> *set hive.fetch.task.conversion=none;*
> {code:java}
> jdbc:hive2://localhost:1/default> select * from ice_test_001;
> Error: Error while compiling statement: FAILED: IllegalArgumentException 
> Pathname 
> /tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  from 
> hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  is not a valid DFS filename. (state=42000,code=4) {code}
> Full stacktrace:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Pathname 
> /tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  from 
> hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  is not a valid DFS filename.
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:256)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1752)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1749)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  ~[hadoop-common-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1764)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760) 
> ~[hadoop-common-3.3.1.jar:?]
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.canProvideColStats(HiveIcebergStorageHandler.java:540)
>  ~[hive-iceberg-handler-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.canProvideColStatistics(HiveIcebergStorageHandler.java:533)
>  ~[hive-iceberg-handler-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.getTableColumnStats(StatsUtils.java:1073)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:302)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:193)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:181)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$TableScanStatsRule.process(StatsRulesProcFactory.java:173)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.

[jira] [Updated] (HIVE-27869) Iceberg: Select on HadoopTables fails at HiveIcebergStorageHandler#canProvideColStats

2023-11-15 Thread Denys Kuzmenko (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-27869:
--
Summary: Iceberg: Select on HadoopTables fails at 
HiveIcebergStorageHandler#canProvideColStats  (was: Iceberg: Select  
HadoopTables will fail at HiveIcebergStorageHandler::canProvideColStats)

> Iceberg: Select on HadoopTables fails at 
> HiveIcebergStorageHandler#canProvideColStats
> -
>
> Key: HIVE-27869
> URL: https://issues.apache.org/jira/browse/HIVE-27869
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: zhangbutao
>Assignee: zhangbutao
>Priority: Major
>  Labels: pull-request-available
>
> Steps to reproduce (latest master code):
> 1) Create path-based HadoopTable by Spark:
>  
> {code:java}
> ./spark-3.3.1-bin-hadoop3/bin/spark-sql \
>   --master local \
>   --deploy-mode client \
>   --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
>   --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
>   --conf spark.sql.catalog.spark_catalog.type=hadoop \
>   --conf spark.sql.catalog.spark_catalog.warehouse=hdfs://localhost:8028/tmp/testiceberg;
> create table ice_test_001(id int) using iceberg;
> insert into ice_test_001(id) values(1),(2),(3);{code}
>  
> 2) Create iceberg table based on the HadoopTable by Hive:
> {code:java}
> CREATE EXTERNAL TABLE ice_test_001 STORED BY 
> 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' LOCATION 
> 'hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001' TBLPROPERTIES 
> ('iceberg.catalog'='location_based_table'); {code}
> 3) Select the HadoopTable by Hive
> // launch a Tez task to scan the data
> *set hive.fetch.task.conversion=none;*
> {code:java}
> jdbc:hive2://localhost:1/default> select * from ice_test_001;
> Error: Error while compiling statement: FAILED: IllegalArgumentException 
> Pathname 
> /tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  from 
> hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  is not a valid DFS filename. (state=42000,code=4) {code}
> Full stacktrace:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Pathname 
> /tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  from 
> hdfs://localhost:8028/tmp/testiceberg/default/ice_test_001/stats/hdfs:/localhost:8028/tmp/testiceberg/default/ice_test_0018020750642632422610
>  is not a valid DFS filename.
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:256)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1752)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1749)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  ~[hadoop-common-3.3.1.jar:?]
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1764)
>  ~[hadoop-hdfs-client-3.3.1.jar:?]
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760) 
> ~[hadoop-common-3.3.1.jar:?]
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.canProvideColStats(HiveIcebergStorageHandler.java:540)
>  ~[hive-iceberg-handler-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.iceberg.mr.hive.HiveIcebergStorageHandler.canProvideColStatistics(HiveIcebergStorageHandler.java:533)
>  ~[hive-iceberg-handler-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.getTableColumnStats(StatsUtils.java:1073)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:302)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:193)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.stats.StatsUtils.collectStatistics(StatsUtils.java:181)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hive.ql.optimizer.stats.annotation.StatsRulesProcFactory$TableScanStatsRule.process(StatsRulesProcFactory.java:173)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>         at 

[jira] [Created] (HIVE-27873) Slow JDBC fetch from Impala HS2

2023-11-15 Thread Kurt Deschler (Jira)
Kurt Deschler created HIVE-27873:


 Summary: Slow JDBC fetch from Impala HS2
 Key: HIVE-27873
 URL: https://issues.apache.org/jira/browse/HIVE-27873
 Project: Hive
  Issue Type: Improvement
  Components: JDBC
Reporter: Kurt Deschler
Assignee: Kurt Deschler


The fix for HIVE-20621 leads to poor performance when fetching from Impala, since
isHasResultSet is not set by Impala. The existing logic calls
getOperationStatus() on every row fetched, which severely impacts fetch
performance and also results in a completion message being logged for each row.
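
A hedged sketch of the kind of change this suggests; the wrapper class and its fields are hypothetical, not the actual Hive JDBC internals, and only the TCLIService thrift types are real. The idea is to memoize the terminal answer so the remote status RPC is not repeated for every row:
{code:java}
import org.apache.hive.service.rpc.thrift.TCLIService;
import org.apache.hive.service.rpc.thrift.TGetOperationStatusReq;
import org.apache.hive.service.rpc.thrift.TGetOperationStatusResp;
import org.apache.hive.service.rpc.thrift.TOperationHandle;
import org.apache.hive.service.rpc.thrift.TOperationState;
import org.apache.thrift.TException;

// Hypothetical wrapper, not actual Hive JDBC code.
final class OperationStatusCache {
  private final TCLIService.Client client;
  private final TOperationHandle operationHandle;
  private Boolean finished; // null until a terminal state has been observed

  OperationStatusCache(TCLIService.Client client, TOperationHandle handle) {
    this.client = client;
    this.operationHandle = handle;
  }

  boolean isOperationFinished() throws TException {
    if (finished == null) {
      // The remote round trip the description complains about: doing this
      // once per fetched row serializes every row on an RPC.
      TGetOperationStatusResp status =
          client.GetOperationStatus(new TGetOperationStatusReq(operationHandle));
      if (status.getOperationState() == TOperationState.FINISHED_STATE) {
        finished = Boolean.TRUE; // cache the terminal answer
      }
    }
    return finished != null;
  }
}
{code}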



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-27872) Support multi-stream hs2 fetch from JDBC driver

2023-11-15 Thread Kurt Deschler (Jira)
Kurt Deschler created HIVE-27872:


 Summary: Support multi-stream hs2 fetch from JDBC driver
 Key: HIVE-27872
 URL: https://issues.apache.org/jira/browse/HIVE-27872
 Project: Hive
  Issue Type: Improvement
  Components: JDBC
Reporter: Kurt Deschler
Assignee: Kurt Deschler


The Thrift HS2 protocol supports sharing session and statement handles, as well
as providing a start-row offset for result batches. These primitives can be used
to connect multiple client streams to a query result, transfer results in
parallel, and provide a properly ordered result to the client application via
standard JDBC interfaces.
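
A hedged sketch of the reassembly idea under the stated primitives; nothing here is an existing Hive API. It assumes each stream already has its own connection and that the TOperationHandle can be shared across them (the subject of this proposal), and uses TRowSet.startRowOffset, a real thrift field, to restore row order:
{code:java}
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hive.service.rpc.thrift.TCLIService;
import org.apache.hive.service.rpc.thrift.TFetchOrientation;
import org.apache.hive.service.rpc.thrift.TFetchResultsReq;
import org.apache.hive.service.rpc.thrift.TFetchResultsResp;
import org.apache.hive.service.rpc.thrift.TOperationHandle;
import org.apache.hive.service.rpc.thrift.TRowSet;

public final class ParallelFetchSketch {
  // Each client is a separate connection/transport; sharing the operation
  // handle across them is assumed here, as proposed above.
  static void fetchInParallel(List<TCLIService.Client> clients,
                              TOperationHandle sharedOp) throws Exception {
    ConcurrentSkipListMap<Long, TRowSet> ordered = new ConcurrentSkipListMap<>();
    ExecutorService pool = Executors.newFixedThreadPool(clients.size());
    for (TCLIService.Client client : clients) {
      pool.submit(() -> {
        TFetchResultsReq req =
            new TFetchResultsReq(sharedOp, TFetchOrientation.FETCH_NEXT, 10_000);
        TFetchResultsResp resp = client.FetchResults(req);
        TRowSet batch = resp.getResults();
        // startRowOffset keys the batch back into its global position.
        ordered.put(batch.getStartRowOffset(), batch);
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.MINUTES);
    // 'ordered' now yields the batches in row order for the JDBC layer to drain.
  }
}
{code}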



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27788) Exception in Sort Merge join with Group By + PTF Operator

2023-11-15 Thread Krisztian Kasa (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786353#comment-17786353
 ] 

Krisztian Kasa commented on HIVE-27788:
---

[~zabetak], [~amansinha]
I think the summary of this jira is misleading. As per my analysis, the issue is
caused by:
 * the operator tree in a reducer having a merge join operator where one of the join
branches has more than one GBY:
{code}
RS-...-GBY-...-GBY-...-MERGEJOIN-...
   RS-...-/
{code}
 * the data having unique values in the GBY key(s) processed by that branch, or at
least in the last 3 records of the record stream.

The presence of the PTF operator is irrelevant to this issue; it can be any operator.
Please see another example:
[https://github.com/apache/hive/blob/17525f169b9a08cd715bfb42899e45b7c689c77a/ql/src/test/results/clientpositive/llap/subquery_in_having.q.out#L263-L391]

 

> Exception in Sort Merge join with Group By + PTF Operator
> -
>
> Key: HIVE-27788
> URL: https://issues.apache.org/jira/browse/HIVE-27788
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 4.0.0-beta-1
>Reporter: Riju Trivedi
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: auto_sortmerge_join_17.q
>
>
> Sort-merge join with Group By + PTF operator leads to a runtime exception:
> {code:java}
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:313)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:291)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:293)
>   ... 15 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:387)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:303)
>   ... 17 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:392)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:372)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:316)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:127)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.handleOutputRows(PTFOperator.java:337)
>   at 
> org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.processRow(PTFOperator.java:325)
>   at 
> org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:139)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:372)
>   ... 18 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:534)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:488)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:390)
>   ... 31 more
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource

[jira] [Commented] (HIVE-27788) Exception in Sort Merge join with Group By + PTF Operator

2023-11-15 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786303#comment-17786303
 ] 

Stamatis Zampetakis commented on HIVE-27788:


Hey [~kkasa], can you elaborate a bit on the role of the PTF in this problem?
The way the summary and description are written, they imply that PTF is
necessary for the problem to occur, but the extended analysis above does not
refer to it. I agree with Aman that it would be useful to clarify this part and,
if possible, simplify the repro further.

> Exception in Sort Merge join with Group By + PTF Operator
> -
>
> Key: HIVE-27788
> URL: https://issues.apache.org/jira/browse/HIVE-27788
> Project: Hive
>  Issue Type: Bug
>  Components: Operators
>Affects Versions: 4.0.0-beta-1
>Reporter: Riju Trivedi
>Assignee: Krisztian Kasa
>Priority: Major
>  Labels: pull-request-available
> Attachments: auto_sortmerge_join_17.q
>
>
> Sort-merge join with Group By + PTF operator leads to a runtime exception:
> {code:java}
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while 
> processing row
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:313)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.run(ReduceRecordProcessor.java:291)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:293)
>   ... 15 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:387)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:303)
>   ... 17 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:392)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:372)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:316)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:127)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.handleOutputRows(PTFOperator.java:337)
>   at 
> org.apache.hadoop.hive.ql.exec.PTFOperator$PTFInvocation.processRow(PTFOperator.java:325)
>   at 
> org.apache.hadoop.hive.ql.exec.PTFOperator.process(PTFOperator.java:139)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:888)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:94)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:372)
>   ... 18 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:534)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchNextGroup(CommonMergeJoinOperator.java:488)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.joinOneGroup(CommonMergeJoinOperator.java:390)
>   ... 31 more
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Attempting to overwrite 
> nextKeyWritables[1]
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:313)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.fetchOneRow(CommonMergeJoinOperator.java:522)
>   ... 33 more {code}
> Issue can be reproduced with [^auto_sortmerge_join_17.q]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (HIVE-26986) A DAG created by OperatorGraph is not equal to the Tez DAG.

2023-11-15 Thread Seonggon Namgung (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786224#comment-17786224
 ] 

Seonggon Namgung edited comment on HIVE-26986 at 11/15/23 8:22 AM:
---

[~kkasa]

1. This issue is not about data correctness; it addresses the insertion
of unnecessary ReduceSink operators, which causes unnecessary shuffling at
runtime.

The unnecessary insertion is performed by ParallelEdgeFixer (PEF), and it makes
a wrong decision because OperatorGraph creates a wrong DAG from the given query
plan. My previous comments explain how OperatorGraph groups operators into a
vertex (a cluster in OperatorGraph terms) in the wrong way.

Since this issue originates from OperatorGraph, not PEF or
SharedWorkOptimizer (SWO), the submitted PR introduces TestOperatorGraph, which
tests the behaviour of OperatorGraph. You can check the problem by running this
test on the master branch. The following explains the added test in more
detail.

The test compares the 2 DAGs generated by OperatorGraph and TezCompiler. The
following graph represents the query plan used in the test:
TS1┐
TS2┴UNION─SEL─RS─GBY─RS

The correct DAG corresponding to the query plan should be:
Map1: \{TS1, SEL, RS1}
Map2: \{TS2, SEL, RS1}
Reduce: \{GBY, RS2}

But the current OperatorGraph groups the operators into 2 clusters as follows:
Cluster1: \{TS1, TS2, UNION, SEL, RS1}
Cluster2: \{GBY, RS2}

2. As I mentioned above, this issue is unrelated to data correctness. Moreover,
PEF is applied to the query plan regardless of the value of
`hive.optimize.shared.work.parallel.edge.support`. I think the test attached to
the PR is sufficient to verify this issue.

FYI, `hive.optimize.shared.work.parallel.edge.support` controls the types of
edges that are allowed to form a parallel edge. If it is set to true,
DynamicPartitionPruning (DPP), SemiJoinReduction, and Broadcast edges can form
a parallel edge; if not, only DPP edges can. As a consequence, SWO can create
parallel edges regardless of the value of
`hive.optimize.shared.work.parallel.edge.support`, so Hive always runs PEF
after SWO to resolve parallel edges by adding extra RS operators.

 


was (Author: JIRAUSER298608):
@kkasa

1. This issue is not about data correctness; it addresses the insertion
of unnecessary ReduceSink operators, which causes unnecessary shuffling at
runtime.

The unnecessary insertion is performed by ParallelEdgeFixer (PEF), and it makes
a wrong decision because OperatorGraph creates a wrong DAG from the given query
plan. My previous comments explain how OperatorGraph groups operators into a
vertex (a cluster in OperatorGraph terms) in the wrong way.

Since this issue originates from OperatorGraph, not PEF or
SharedWorkOptimizer (SWO), the submitted PR introduces TestOperatorGraph, which
tests the behaviour of OperatorGraph. You can check the problem by running this
test on the master branch. The following explains the added test in more
detail.

The test compares the 2 DAGs generated by OperatorGraph and TezCompiler. The
following graph represents the query plan used in the test:
TS1┐
TS2┴UNION─SEL─RS─GBY─RS

The correct DAG corresponding to the query plan should be:
Map1: \{TS1, SEL, RS1}
Map2: \{TS2, SEL, RS1}
Reduce: \{GBY, RS2}

But the current OperatorGraph groups the operators into 2 clusters as follows:
Cluster1: \{TS1, TS2, UNION, SEL, RS1}
Cluster2: \{GBY, RS2}

2. As I mentioned above, this issue is unrelated to data correctness. Moreover,
PEF is applied to the query plan regardless of the value of
`hive.optimize.shared.work.parallel.edge.support`. I think the test attached to
the PR is sufficient to verify this issue.

FYI, `hive.optimize.shared.work.parallel.edge.support` controls the types of
edges that are allowed to form a parallel edge. If it is set to true,
DynamicPartitionPruning (DPP), SemiJoinReduction, and Broadcast edges can form
a parallel edge; if not, only DPP edges can. As a consequence, SWO can create
parallel edges regardless of the value of
`hive.optimize.shared.work.parallel.edge.support`, so Hive always runs PEF
after SWO to resolve parallel edges by adding extra RS operators.

 

> A DAG created by OperatorGraph is not equal to the Tez DAG.
> ---
>
> Key: HIVE-26986
> URL: https://issues.apache.org/jira/browse/HIVE-26986
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: hive-4.0.0-must, pull-request-available
> Attachments: Query71 OperatorGraph.png, Qu

[jira] [Commented] (HIVE-26986) A DAG created by OperatorGraph is not equal to the Tez DAG.

2023-11-15 Thread Seonggon Namgung (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786224#comment-17786224
 ] 

Seonggon Namgung commented on HIVE-26986:
-

@kkasa

1. This issue is not about data correctness; it addresses the insertion
of unnecessary ReduceSink operators, which causes unnecessary shuffling at
runtime.

The unnecessary insertion is performed by ParallelEdgeFixer (PEF), and it makes
a wrong decision because OperatorGraph creates a wrong DAG from the given query
plan. My previous comments explain how OperatorGraph groups operators into a
vertex (a cluster in OperatorGraph terms) in the wrong way.

Since this issue originates from OperatorGraph, not PEF or
SharedWorkOptimizer (SWO), the submitted PR introduces TestOperatorGraph, which
tests the behaviour of OperatorGraph. You can check the problem by running this
test on the master branch. The following explains the added test in more
detail.

The test compares the 2 DAGs generated by OperatorGraph and TezCompiler. The
following graph represents the query plan used in the test:
TS1┐
TS2┴UNION─SEL─RS─GBY─RS

The correct DAG corresponding to the query plan should be:
Map1: \{TS1, SEL, RS1}
Map2: \{TS2, SEL, RS1}
Reduce: \{GBY, RS2}

But the current OperatorGraph groups the operators into 2 clusters as follows:
Cluster1: \{TS1, TS2, UNION, SEL, RS1}
Cluster2: \{GBY, RS2}

2. As I mentioned above, this issue is unrelated to data correctness. Moreover,
PEF is applied to the query plan regardless of the value of
`hive.optimize.shared.work.parallel.edge.support`. I think the test attached to
the PR is sufficient to verify this issue.

FYI, `hive.optimize.shared.work.parallel.edge.support` controls the types of
edges that are allowed to form a parallel edge. If it is set to true,
DynamicPartitionPruning (DPP), SemiJoinReduction, and Broadcast edges can form
a parallel edge; if not, only DPP edges can. As a consequence, SWO can create
parallel edges regardless of the value of
`hive.optimize.shared.work.parallel.edge.support`, so Hive always runs PEF
after SWO to resolve parallel edges by adding extra RS operators.

 

> A DAG created by OperatorGraph is not equal to the Tez DAG.
> ---
>
> Key: HIVE-26986
> URL: https://issues.apache.org/jira/browse/HIVE-26986
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0-alpha-2
>Reporter: Seonggon Namgung
>Assignee: Seonggon Namgung
>Priority: Major
>  Labels: hive-4.0.0-must, pull-request-available
> Attachments: Query71 OperatorGraph.png, Query71 TezDAG.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A DAG created by OperatorGraph is not equal to the corresponding DAG that is 
> submitted to Tez.
> Because of this problem, ParallelEdgeFixer reports a pair of normal edges as
> a parallel edge.
> We observed this problem by comparing the OperatorGraph and the Tez DAG when
> running TPC-DS query 71 on a 1TB ORC-format managed table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)