We're migrating data from a previous iteration of a table to a new one. The
process involves a MapReduce job that scans the source table and writes the
equivalent data to the new table. The source table has 6000+ regions and it
splits frequently because we're still ingesting time series data into it.
On the write side we use buffered writes into the new table, and a YARN
resource pool limits how much writing happens concurrently.
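
For context, the driver is set up roughly as in the sketch below (table and
class names are simplified, the real job carries more configuration, and
MigrationMapper is shown further down):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class MigrationDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "table-migration");
    job.setJarByClass(MigrationDriver.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // scanner caching for throughput
    scan.setCacheBlocks(false);  // full scan, so skip the block cache

    // One map task per region of the source table; the mapper writes to
    // the destination itself, so this is a map-only job.
    TableMapReduceUtil.initTableMapperJob(
        "source_table", scan, MigrationMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}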

First, I should say that this job took a long time but still mostly worked.
However, we've built a mechanism that compares requested data as fetched
from each of the two tables, and it found that a small fraction of rows
(0.02%) is missing from the destination. We've ruled out a few things
already:

* A functional bug in the job that would have resulted in skipping that
0.02% of the rows.
* The possibility that the missing data did not yet exist when the
migration job initially ran.

At a high level, the suspects are:

* The source table splitting while the job ran, resulting in some input
keys not being read. However, since each HBase region (and therefore each
input split) is bounded by a startKey/endKey, we would not expect keys to
be skipped unless there is a bug somewhere.
* The writing/flushing losing a batch. Since we buffer writes and flush
everything in the cleanup of each map task (see the sketch after this
list), we would expect a write failure to fail the task, trigger a retry,
and therefore not be a problem in the end. Given that the flush is
synchronous and, as we understand it, only completes once the data is in
the WAL and memstore, this also seems unlikely unless there's a bug.
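
To make that second suspect concrete, the mapper's write path is roughly
the following (again simplified; the translation into the new schema is
shown here as a plain cell copy):

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;

public class MigrationMapper extends TableMapper<NullWritable, NullWritable> {
  private Connection connection;
  private BufferedMutator mutator;

  @Override
  protected void setup(Context context) throws IOException {
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    mutator = connection.getBufferedMutator(TableName.valueOf("new_table"));
  }

  @Override
  protected void map(ImmutableBytesWritable key, Result row, Context context)
      throws IOException {
    Put put = new Put(row.getRow());
    for (Cell cell : row.rawCells()) {
      // The real job maps the source row into the new schema; here we
      // just copy the cells across.
      put.add(cell);
    }
    mutator.mutate(put);  // buffered; actual RPCs happen asynchronously
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    // Synchronous flush of anything still buffered; a failure here should
    // throw, fail the task attempt, and trigger a retry of the whole split.
    mutator.flush();
    mutator.close();
    connection.close();
  }
}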

I should add that we've compared a 1% sample of the source rows (comparing
all of them is really time consuming given the size of the data) and found
that the missing data often appears in clusters of adjacent source HBase
row keys. This doesn't really point to either the scan side or the write
side (a failure in either would produce similar output), but we thought it
was interesting. That said, a few of the missing keys aren't clustered.
That could be because we've only run the comparison on 1% of the data, or
it could be that whatever is causing this can also affect very isolated
cases.
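
For reference, the sampled comparison is conceptually along these lines
(simplified: RandomRowFilter stands in for our actual sampling, and the
real check compares row contents rather than mere existence):

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.RandomRowFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class SampledComparison {
  public static void main(String[] args) throws IOException {
    try (Connection connection = ConnectionFactory.createConnection();
         Table source = connection.getTable(TableName.valueOf("source_table"));
         Table destination = connection.getTable(TableName.valueOf("new_table"))) {

      // Scan roughly 1% of the source row keys; only the keys are needed.
      Scan scan = new Scan();
      scan.setFilter(new FilterList(
          new RandomRowFilter(0.01f), new FirstKeyOnlyFilter()));
      scan.setCacheBlocks(false);

      try (ResultScanner scanner = source.getScanner(scan)) {
        for (Result sourceRow : scanner) {
          // For each sampled key, check the destination (the real
          // mechanism diffs the fetched cells, not just existence).
          if (!destination.exists(new Get(sourceRow.getRow()))) {
            System.out.println("missing: " + Bytes.toStringBinary(sourceRow.getRow()));
          }
        }
      }
    }
  }
}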

We're now trying to understand how this could have happened, both to gauge
how it could impact other jobs/applications and to gain enough confidence
to write a modified version of the migration job that re-migrates the
skipped/missing data.

Any ideas or advice would be much appreciated.

Thanks!

-- 
Alex
