[ 
https://issues.apache.org/jira/browse/HIVE-16177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-16177:
----------------------------------
    Description: 
{noformat}
create table T(a int, b int) clustered by (a) into 2 buckets stored as orc 
TBLPROPERTIES('transactional'='false');
insert into T(a,b) values(1,2);
insert into T(a,b) values(1,3);
alter table T SET TBLPROPERTIES ('transactional'='true');
{noformat}

    //we should now have bucket files 000001_0 and 000001_0_copy_1

but OrcRawRecordMerger.OriginalReaderPair.next() is not aware that there can be 
copy_N files and numbers the rows of each bucket file from 0, thus generating 
duplicate IDs

{noformat}
select ROW__ID, INPUT__FILE__NAME, a, b from T
{noformat}

produces 
{noformat}
{"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2
{"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
{noformat}
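
To make the collision concrete, here is a minimal, self-contained Java sketch. It is not the code in OrcRawRecordMerger and not what the attached patch does; the class CopyNRowIds and both method names are hypothetical. The first method restarts numbering at 0 for every file, as described above. The second shows one possible way out, assuming the files of a bucket are always visited in the same order: carry a running offset across them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Hive.
final class CopyNRowIds {

    // Numbering as described in the report: each file of the bucket
    // restarts at 0, so different files yield the same rowid values.
    static List<Long> perFileNumbering(Map<String, Integer> rowsPerFile) {
        List<Long> ids = new ArrayList<>();
        for (int rowCount : rowsPerFile.values()) {
            for (long r = 0; r < rowCount; r++) {
                ids.add(r);
            }
        }
        return ids;
    }

    // One possible alternative (an assumption, not the actual fix):
    // add the number of rows in all preceding files of the bucket.
    static List<Long> offsetNumbering(Map<String, Integer> rowsPerFile) {
        List<Long> ids = new ArrayList<>();
        long offset = 0;
        for (int rowCount : rowsPerFile.values()) {
            for (long r = 0; r < rowCount; r++) {
                ids.add(offset + r);
            }
            offset += rowCount;
        }
        return ids;
    }

    public static void main(String[] args) {
        // File name -> row count, in a fixed order (LinkedHashMap keeps it).
        Map<String, Integer> files = new LinkedHashMap<>();
        files.put("000001_0", 1);
        files.put("000001_0_copy_1", 1);
        System.out.println(perFileNumbering(files)); // [0, 0]
        System.out.println(offsetNumbering(files));  // [0, 1]
    }
}
```

With the two single-row files from the example, per-file numbering gives rowid 0 twice, matching the output above, while the offset variant gives 0 and 1. The offset idea only works if every reader orders the files of a bucket identically.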

[~owen.omalley], do you have any thoughts on a good way to handle this?

The attached patch has a few changes to make Acid even recognize copy_N, but 
this is just a prerequisite.  The new UT demonstrates the issue.


  was:
insert into T(a,b) values(1,2)
insert into T(a,b) values(1,3)

    //we should now have bucket files 000001_0 and 000001_0_copy_1

but OrcRawRecordMerger.OriginalReaderPair.next() doesn't know that there can be 
copy_N files and numbers rows in each bucket from 0 thus generating duplicate 
IDs


[~owen.omalley], do you have any thoughts on a good way to handle this?

attached patch has a few changes to make Acid even recognize copy_N but this is 
just a pre-requisite.  The new UT demonstrates the issue.



> non Acid to acid conversion doesn't handle _copy_N files
> --------------------------------------------------------
>
>                 Key: HIVE-16177
>                 URL: https://issues.apache.org/jira/browse/HIVE-16177
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>            Reporter: Eugene Koifman
>            Priority: Critical
>         Attachments: HIVE-16177.01.patch
>
>
> {noformat}
> create table T(a int, b int) clustered by (a) into 2 buckets stored as orc 
> TBLPROPERTIES('transactional'='false');
> insert into T(a,b) values(1,2);
> insert into T(a,b) values(1,3);
> alter table T SET TBLPROPERTIES ('transactional'='true');
> {noformat}
>     //we should now have bucket files 000001_0 and 000001_0_copy_1
> but OrcRawRecordMerger.OriginalReaderPair.next() is not aware that there can 
> be copy_N files and numbers the rows of each bucket file from 0, thus 
> generating duplicate IDs
> {noformat}
> select ROW__ID, INPUT__FILE__NAME, a, b from T
> {noformat}
> produces 
> {noformat}
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0,1,2
> {"transactionid":0,"bucketid":1,"rowid":0},file:/Users/ekoifman/dev/hiverwgit/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnCommands.../warehouse/nonacidorctbl/000001_0_copy_1,1,3
> {noformat}
> [~owen.omalley], do you have any thoughts on a good way to handle this?
> The attached patch has a few changes to make Acid even recognize copy_N, but 
> this is just a prerequisite.  The new UT demonstrates the issue.


