Re: ORC: duplicate record - rowid meaning ?

Peter Vary Fri, 29 Nov 2019 03:18:54 -0800

Hi David,

I'm not entirely sure what you are doing here :). My guess is that you are
trying to write ACID tables outside of Hive. Am I right? What is the exact
use case? There might be better solutions than writing the files by hand.


As for your question below: yes, the rows in each file should be ordered by the
(originalTransaction, bucket, rowId) triple; otherwise you will get wrong results.
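
To illustrate the ordering requirement, here is a minimal Python sketch, not Hive code. It assumes the events are available as plain dicts with the same field names as the ORC dump quoted below; the helper names sort_events and is_ordered are hypothetical. It only shows sorting by the triple before writing a file and checking whether an existing sequence is already in order.

```python
from operator import itemgetter

# Sort key named above: (originalTransaction, bucket, rowId).
ACID_KEY = itemgetter("originalTransaction", "bucket", "rowId")

def sort_events(events):
    """Return the events ordered by the (originalTransaction, bucket, rowId) triple."""
    return sorted(events, key=ACID_KEY)

def is_ordered(events):
    """Return True if the events are already in non-decreasing key order."""
    keys = [ACID_KEY(e) for e in events]
    return all(a <= b for a, b in zip(keys, keys[1:]))

# Delete events as listed for delta_0011368_0011368_0000/bucket_00003
# in the quoted message (rowId 1 is written before rowId 0).
delete_delta = [
    {"operation": 2, "originalTransaction": 11365, "bucket": 3, "rowId": 1,
     "currentTransaction": 11368, "row": None},
    {"operation": 2, "originalTransaction": 11365, "bucket": 3, "rowId": 0,
     "currentTransaction": 11368, "row": None},
]

print(is_ordered(delete_delta))      # False
fixed = sort_events(delete_delta)
print([e["rowId"] for e in fixed])   # [0, 1]
```

Applied to the dump below, the same check would also flag delta_0011369_0011369_0000, where rowId 1 again precedes rowId 0.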

Thanks,
Peter

> On Nov 19, 2019, at 13:30, David Morin <[email protected]> wrote:
> 
> Here are more details about the ORC content and the duplicate rows we see:
> 
> /delta_0011365_0011365_0000/bucket_00003
> 
> {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11365,"row":{"TS":1574156027915254212,"cle":5218,...}}
> {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11365,"row":{"TS":1574156027915075038,"cle":5216,...}}
> 
> 
> /delta_0011368_0011368_0000/bucket_00003
> 
> {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11368,"row":null}
> {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11368,"row":null}
> 
> /delta_0011369_0011369_0000/bucket_00003
> 
> {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":1,"currentTransaction":11369,"row":{"TS":1574157407855174144,"cle":5216,...}}
> {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":0,"currentTransaction":11369,"row":{"TS":1574157407855265906,"cle":5218,...}}
> 
> +-------------------------------------------------+-------+--+
> |                     row__id                     |  cle  |
> +-------------------------------------------------+-------+--+
> | {"transactionid":11367,"bucketid":0,"rowid":0}  | 5209  |
> | {"transactionid":11369,"bucketid":0,"rowid":0}  | 5211  |
> | {"transactionid":11369,"bucketid":1,"rowid":0}  | 5210  |
> | {"transactionid":11369,"bucketid":2,"rowid":0}  | 5214  |
> | {"transactionid":11369,"bucketid":2,"rowid":1}  | 5215  |
> | {"transactionid":11365,"bucketid":3,"rowid":0}  | 5218  |
> | {"transactionid":11365,"bucketid":3,"rowid":1}  | 5216  |
> | {"transactionid":11369,"bucketid":3,"rowid":1}  | 5216  |
> | {"transactionid":11369,"bucketid":3,"rowid":0}  | 5218  |
> | {"transactionid":11369,"bucketid":4,"rowid":0}  | 5217  |
> | {"transactionid":11369,"bucketid":4,"rowid":1}  | 5213  |
> | {"transactionid":11369,"bucketid":7,"rowid":0}  | 5212  |
> +-------------------------------------------------+-------+--+
> 
> As you can see, we have duplicate rows for "cle" values 5216 and 5218.
> Do we have to keep the rowids ordered? This is the only difference I
> have noticed based on some tests with beeline.
> 
> Thanks
> 
> 
> 
> On Tue, Nov 19, 2019 at 00:18, David Morin <[email protected]> wrote:
> Hello,
> 
> I'm trying to understand the purpose of the rowid column inside an ORC delta file:
> {"transactionid":11359,"bucketid":5,"rowid":0}
> ORC view:
> {"operation":0,"originalTransaction":11359,"bucket":5,"rowId":0,"currentTransaction":11359,"row":...}
> I use HDP 2.6 => Hive 2
> 
> I want an INSERT / DELETE / INSERT sequence to be idempotent.
> Do we have to keep the same rowid?
> It seems that when the rowid changes during the second INSERT, I get a
> duplicate row.
> As I understand it, I should be able to create a new rowid for the new
> transaction during the second INSERT, but that seems to generate duplicate
> records.
> 
> Regards,
> David
> 
> 
>
