I realized I used the wrong terminology below: the sequenceId gets stashed
in the WALKey, not the WALEdit. The value is the same as what used to be on
the cells in the corresponding WALEdit.

So in WALCellMapper, where we have both the WALKey and the WALEdit, we
currently pull the cells off the edit and write them to the mapper context
output. To solve this, we would call WALKey.getSequenceId() and set that
value on all of the cells in the corresponding edit before sending them to
the context.
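
To make the idea concrete, here is a minimal, self-contained sketch of the
stamping step and of the tiebreak it enables. It deliberately does not use
the HBase classes: SimpleCell, SimpleEntry, stampSequenceIds and compare are
hypothetical stand-ins invented for illustration, and the comparator is a
simplified model (it ignores cell type) of the newest-first ordering
described in this thread, not the real CellComparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT the HBase classes.
class SeqIdSketch {

    /** Hypothetical cell: coordinates, a value, and a mutable sequenceId. */
    static final class SimpleCell {
        final String row, family, qualifier, value;
        final long timestamp;
        long sequenceId = 0L; // 0 models a cell whose sequenceId was dropped by WAL encoding

        SimpleCell(String row, String family, String qualifier, long timestamp, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Hypothetical WAL entry: the sequenceId lives on the key, the cells on the edit. */
    static final class SimpleEntry {
        final long keySequenceId;
        final List<SimpleCell> cells;

        SimpleEntry(long keySequenceId, List<SimpleCell> cells) {
            this.keySequenceId = keySequenceId;
            this.cells = cells;
        }
    }

    /** Copies the entry-level sequenceId onto every cell before the cells are emitted. */
    static List<SimpleCell> stampSequenceIds(SimpleEntry entry) {
        List<SimpleCell> out = new ArrayList<>();
        for (SimpleCell cell : entry.cells) {
            cell.sequenceId = entry.keySequenceId;
            out.add(cell);
        }
        return out;
    }

    /** Row, family, qualifier ascending; then timestamp and sequenceId descending. */
    static int compare(SimpleCell a, SimpleCell b) {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        c = Long.compare(b.timestamp, a.timestamp);
        if (c != 0) return c;
        // Same coordinates and timestamp: the higher sequenceId sorts first.
        return Long.compare(b.sequenceId, a.sequenceId);
    }

    public static void main(String[] args) {
        // Two writes to the same row/cf/qf/timestamp, logged in separate entries.
        SimpleCell older = new SimpleCell("row1", "cf", "q", 100L, "old");
        SimpleCell newer = new SimpleCell("row1", "cf", "q", 100L, "new");

        List<SimpleCell> emitted = new ArrayList<>();
        emitted.addAll(stampSequenceIds(new SimpleEntry(9L, Arrays.asList(newer))));
        emitted.addAll(stampSequenceIds(new SimpleEntry(7L, Arrays.asList(older))));

        emitted.sort(SeqIdSketch::compare);
        System.out.println(emitted.get(0).value); // prints "new"
    }
}
```

Without the stamping step both cells would carry sequenceId 0 and compare as
equal, so which one sorts first would depend only on arrival order.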

On Thu, Feb 16, 2023 at 5:19 PM Bryan Beaudreault <bbeaudrea...@gmail.com>
wrote:

> Hey all,
>
> We've been working on integrating backup/restore into our stack. We have
> some user tables which override cells -- meaning write the same
> row/cf/qf/timestamp but with different values. Normally HBase would handle
> deduping those and returning the most recently written. This is due to the
> usage of sequenceId in the memstore as a tiebreaker in CellComparator.
>
> We noticed when trying to do an incremental restore (which uses WALPlayer)
> of one of these tables, we'd non-deterministically get different values
> returned for these cells... often not the latest. I believe this is because
> we lose the sequenceId context in WALPlayer.
>
> Our WAL encoding drops sequenceIds from cells, but stashes the same
> sequenceId in each WALEdit. I think we could update WALPlayer (which reads
> WALEdit and WALEntry) to pull the sequenceId from the WALEdit and inject
> it into each cell that gets written to the context.
>
> The next step would be to update CellSerialization to pass it along there
> as well. At this point our existing CellSortReducer would handle
> appropriately sorting based on sequenceId when timestamps are equal, and
> the HFiles written by WALPlayer would more accurately reflect what a normal
> HBase write would do. The sequenceIds would eventually be pruned out by
> compactions as they usually are.
>
> Any concerns with this approach?
>
> See jira https://issues.apache.org/jira/browse/HBASE-27649
>
