Hi David, Not entirely sure what you are doing here :), my guess is that you are trying to write ACID tables outside of hive. Am I right? What is the exact use-case? There might be better solutions out there than writing the files by hand.
As for your question below: Yes, the files should be ordered by: originalTransacion, bucket, rowId triple, otherwise you will get wrong results. Thanks, Peter > On Nov 19, 2019, at 13:30, David Morin <[email protected]> wrote: > > here after more details about ORC content and the fact we have duplicate rows: > > /delta_0011365_0011365_0000/bucket_00003 > > {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11365,"row":{"TS":1574156027915254212,"cle":5218,...}} > {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11365,"row":{"TS":1574156027915075038,"cle":5216,...}} > > > /delta_0011368_0011368_0000/bucket_00003 > > {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11368,"row":null} > {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11368,"row":null} > > /delta_0011369_0011369_0000/bucket_00003 > > {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":1,"currentTransaction":11369,"row":{"TS":1574157407855174144,"cle":5216,...}} > {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":0,"currentTransaction":11369,"row":{"TS":1574157407855265906,"cle":5218,...}} > > +-------------------------------------------------+-------+--+ > | row__id | cle | > +-------------------------------------------------+-------+--+ > | {"transactionid":11367,"bucketid":0,"rowid":0} | 5209 | > | {"transactionid":11369,"bucketid":0,"rowid":0} | 5211 | > | {"transactionid":11369,"bucketid":1,"rowid":0} | 5210 | > | {"transactionid":11369,"bucketid":2,"rowid":0} | 5214 | > | {"transactionid":11369,"bucketid":2,"rowid":1} | 5215 | > | {"transactionid":11365,"bucketid":3,"rowid":0} | 5218 | > | {"transactionid":11365,"bucketid":3,"rowid":1} | 5216 | > | {"transactionid":11369,"bucketid":3,"rowid":1} | 5216 | > | {"transactionid":11369,"bucketid":3,"rowid":0} | 5218 | > | {"transactionid":11369,"bucketid":4,"rowid":0} | 5217 | > | {"transactionid":11369,"bucketid":4,"rowid":1} | 5213 | > | {"transactionid":11369,"bucketid":7,"rowid":0} | 5212 | > +-------------------------------------------------+-------+--+ > > As you can see we have duplicate rows for column "cle" 5216 and 5218 > Do we have to keep the rowids ordered ? because this is the only difference I > have noticed based on some tests with beeline. > > Thanks > > > > Le mar. 19 nov. 2019 à 00:18, David Morin <[email protected] > <mailto:[email protected]>> a écrit : > Hello, > > I'm trying to understand the purpose of the rowid column inside ORC delta file > {"transactionid":11359,"bucketid":5,"rowid":0} > Orc view: > {"operation":0,"originalTransaction":11359,"bucket":5,"rowId":0,"currentTransaction":11359,"row":...} > I use HDP 2.6 => Hive 2 > > If I want to be idempotent with INSERT / DELETE / INSERT. > Do we have to keep the same rowid ? > It seems that when the rowid is changed during the second INSERT I have a > duplicate row. > For me, I can create a new rowid for the new transaction during the second > INSERT but that seems to generate duplicate records. > > Regards, > David > > >
