On Sun, May 19, 2019 at 11:21 PM Bhargav Bipinchandra Naik (Seller Platform-BLR) <bhargav.n...@flipkart.com> wrote:
> Hi Alan, > > > Are write_ids monotonically increasing? > They are assigned monotonically, but the transactions they are a part of may commit at different times, so you can't use it as a low water mark. That is, if you looked at the state of the table at time t1 and say that write_id1 and write_id3 had been committed, it does not mean that there won't be a write_id2 the next time you look, as the transaction for write_id2 could have started before the transaction for write_id3 but finished after. > Are write_ids accessible in the hive query? > For e.g.: > select * from table_name where write_id > N; > No. For full ACID (ORC) tables the write_id is part of a pseudo-column struct called row__id (not to be confused with the row_id mentioned before, sorry we overloaded the term). For insert only ACID (Non-ORC tables) the write id is inferred from the filename. In both cases the metastore doesn't know about these columns, and thus I believe will fail the query saying "no such column". Alan. > > Basically I am trying to understand if I can use write_id to consume only > updated rows. > Store the maximum write_id(X) seen in the result and next time query for > all rows with row_id greater than X. > > Thanks, > Bhargav > > On Fri, May 17, 2019 at 10:37 PM Alan Gates <alanfga...@gmail.com> wrote: > >> Sorry, looks like you sent this earlier and I missed it. >> >> A couple of things. One, write_id is per transaction per table. So for >> table T, all rows written in w1 will have the same write_id, though they >> will each have their own monotonically increasing row_ids. Row_ids are >> scoped by a write_id, so if both w1 and w2 insert a 100 rows, w1 would have >> write_id 1, and row_ids 0-99 while w2's rows would have write_id 2 and >> row_ids 0-99. >> >> Two, If w1 and w2 both attempted to update or delete (not insert) records >> from the same partition of table T, then w1 would fail at commit time >> because it would see that w2 had already committed and there's a possible >> conflict. This avoids lost updates and deleted records magically >> reappearing. >> >> Alan. >> >> On Fri, May 17, 2019 at 4:44 AM Bhargav Bipinchandra Naik (Seller >> Platform-BLR) <bhargav.n...@flipkart.com> wrote: >> >>> Is the following scenario supported? >>> >>> *timestamp:* t1 < t2 < t3 < t4 < t5 < t6 >>> >>> *w1 -* transaction which updates subset of rows in table T {start_time: >>> t1, end_time: t5} >>> *w2 -* transaction which updates subset of rows in table T {start_time: >>> t2, end_time: t3} >>> *r1 - *job which reads rows from table T {start_time: t4} >>> *r2 - *job which reads rows from table T {start_time: t6} >>> >>> - Is the write_id strictly increasing number across rows? >>> - Is the write_id a version number per row and not a global construct? >>> - Will the subset of rows updated by w1 have write_ids greater than >>> write_ids of row updated by w2? >>> >>> Say if job r1 consumed the data at t4 had maximum write_id 100. >>> Will rows updated by job w1 (end_time: t5) always have write_id > 100? >>> >>> Basically I need some kind of checkpoint using which the next run of the >>> read job can read only the data updated since the checkpoint. >>> >>> Thanks, >>> -Bhargav >>> >>> >>> >>> >>>