[jira] [Comment Edited] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2018-09-03 Thread Saurabh Seth (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16601916#comment-16601916
 ] 

Saurabh Seth edited comment on HIVE-17921 at 9/3/18 9:24 AM:
-

I took a stab at debugging this. OrcSplit.canUseLlapIo thinks vectorization is 
being used when it is not, so LlapRecordReader ends up being used for the delta 
splits (wrapped in OrcOiBatchToRowReader, because LlapInputFormat knows that 
vectorization isn't being used). For the original data splits, 
OrcInputFormat.NullKeyRecordReader is used. These two RecordReaders create 
OrcStructs with different schemas (through createValue): OrcOiBatchToRowReader 
adds an extra field for the ROW__ID, but NullKeyRecordReader doesn't. Since this 
struct is cached (by MRReaderMapred), whether the cached OrcStruct has the extra 
ROW__ID field depends on which split comes first within the TezGroupedSplit. In 
this test case, an original file split is first, and hence NullKeyRecordReader's 
OrcStruct is used. When this OrcStruct is given to OrcOiBatchToRowReader to 
fetch values (from the delta splits), the reader doesn't populate the record 
identifier, neither in the OrcStruct nor in the IOContext (in 
HiveContextAwareRecordReader). So all modified records in the delta splits 
end up having null ROW__IDs.
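
As a rough illustration of the failure mode described above, here is a minimal, 
self-contained sketch. It is not Hive or Tez code: Row, PlainReader, RowIdReader 
and readAll are hypothetical stand-ins for OrcStruct, NullKeyRecordReader, 
OrcOiBatchToRowReader and the grouped-split read loop, and the "drop the id when 
there is no slot for it" behaviour is an assumption made to mirror the symptom.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins only; none of these are Hive or Tez classes.
final class CachedValueSketch {

    /** Row container with a fixed number of fields (stand-in for OrcStruct). */
    static final class Row {
        final Object[] fields;
        Row(int numFields) { fields = new Object[numFields]; }
    }

    /** Reader that fills a caller-supplied row, in the style of a mapred RecordReader. */
    interface Reader {
        Row createValue();
        boolean next(Row value);
        boolean producesRowId();
    }

    /** One data field, no row id (stand-in for NullKeyRecordReader). */
    static final class PlainReader implements Reader {
        private final Iterator<String> data;
        PlainReader(List<String> data) { this.data = data.iterator(); }
        public Row createValue() { return new Row(1); }
        public boolean producesRowId() { return false; }
        public boolean next(Row value) {
            if (!data.hasNext()) return false;
            value.fields[0] = data.next();
            return true;
        }
    }

    /** One data field plus a row id (stand-in for OrcOiBatchToRowReader). */
    static final class RowIdReader implements Reader {
        private final Iterator<String> data;
        private long nextId = 0;
        RowIdReader(List<String> data) { this.data = data.iterator(); }
        public Row createValue() { return new Row(2); }
        public boolean producesRowId() { return true; }
        public boolean next(Row value) {
            if (!data.hasNext()) return false;
            value.fields[0] = data.next();
            // Assumed behaviour: the id is silently dropped if the row has no slot for it.
            if (value.fields.length > 1) value.fields[1] = nextId;
            nextId++;
            return true;
        }
    }

    /**
     * Reads every reader in order, reusing the row created by the FIRST reader.
     * Returns the row ids observed for rows coming from id-producing readers.
     */
    static List<Object> readAll(List<Reader> readers) {
        List<Object> rowIds = new ArrayList<>();
        Row cached = readers.get(0).createValue(); // created once, then reused
        for (Reader reader : readers) {
            while (reader.next(cached)) {
                if (reader.producesRowId()) {
                    rowIds.add(cached.fields.length > 1 ? cached.fields[1] : null);
                }
            }
        }
        return rowIds;
    }

    public static void main(String[] args) {
        List<Reader> originalFirst = Arrays.asList(
            new PlainReader(Arrays.asList("a")),
            new RowIdReader(Arrays.asList("b", "c")));
        List<Reader> deltaFirst = Arrays.asList(
            new RowIdReader(Arrays.asList("b", "c")),
            new PlainReader(Arrays.asList("a")));
        System.out.println(readAll(originalFirst));
        System.out.println(readAll(deltaFirst));
    }
}
```

In this sketch, when the plain reader comes first the cached row has a single 
field, so both rows from the id-producing reader come back with a null id. When 
the id-producing reader comes first, the ids are 0 and 1. The result depends 
only on split order, which is the same kind of order dependence seen in the test.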

I have fixed OrcSplit.canUseLlapIo and the patch is attached.

A related question: should OrcOiBatchToRowReader and NullKeyRecordReader be 
"compatible" and work when both are used from a TezGroupedSplitsRecordReader?


> Aggregation with struct in LLAP produces wrong result
> -
>
> Key: HIVE-17921
> URL: https://issues.apache.org/jira/browse/HIVE-17921
> Project: Hive
>  Issue Type: Sub-task
>  Components: llap, Transactions
>Affects Versions: 3.0.0
>Reporter: Eugene Koifman
>Assignee: Saurabh Seth
>Priority: Blocker
> Attachments: HIVE-17921.patch
>
>
> Consider 
> {noformat}
> select ROW__ID, count(*) from over10k_orc_bucketed group by ROW__ID having 
> count(*) > 1;
> {noformat}
> in acid_vectorization_original.q (available since HIVE-17458). When run using 
> TestMiniLlapCliDriver, it produces "NULL, N", where N varies from run to run.
> The right answer is an empty result set, as can be seen by running
> {noformat}
> select ROW__ID, * from over10k_orc_bucketed where ROW__ID is null
> {noformat}
> in the same test.
> This is with 
> {noformat}
> set hive.vectorized.execution.enabled=true;
> set hive.vectorized.row.identifier.enabled=true;
> {noformat}
> It fails with TestMiniLlapCliDriver but not TestMiniTezCliDriver. See 
> acid_vectorization_original_tez.q, which has an identical query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-17921) Aggregation with struct in LLAP produces wrong result

2017-12-07 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282666#comment-16282666
 ] 

Eugene Koifman edited comment on HIVE-17921 at 12/7/17 11:00 PM:
-

I also have 
{noformat}select ROW__ID from T group by ROW__ID having count(*) > 1{noformat}
in TestTxnNoBuckets.testInsertFromUnion(), which runs on MR, and it works OK there.




