You can run RowCounter on the source table multiple times. With the region servers under load, you would observe inconsistent results across runs if this bug is affecting you.
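Something like this should work (RowCounter is the MapReduce job that ships with HBase; substitute your actual source table name for the placeholder):

  hbase org.apache.hadoop.hbase.mapreduce.RowCounter <source_table>

If the row count reported in the job counters differs across runs over the same data, that would be consistent with the scan-side bug.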
On Sun, Feb 5, 2017 at 12:54 PM, Alexandre Normand <alexandre.norm...@gmail.com> wrote:

> Thanks, Ted. We're running HBase 1.0.0-cdh5.5.4, which isn't in the fixed
> versions, so this might be related. It's somewhat reassuring to think that
> this would be missed data on the scan/source side, because that would mean
> that our other ingest/write workloads wouldn't be affected.
>
> From reading the JIRA description, it sounds like it would be difficult to
> confirm that we've been affected by this bug. Am I right?
>
> On Sun, Feb 5, 2017 at 12:36 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Which release of HBase are you using?
> >
> > To be specific, does the release have HBASE-15378?
> >
> > Cheers
> >
> > On Sun, Feb 5, 2017 at 11:32 AM, Alexandre Normand <
> > alexandre.norm...@gmail.com> wrote:
> >
> > > We're migrating data from a previous iteration of a table to a new one,
> > > and this process involved an MR job that scans data from the source
> > > table and writes the equivalent data to the new table. The source table
> > > has 6000+ regions, and it frequently splits because we're still
> > > ingesting time series data into it. We used buffered writing when
> > > writing to the new table, and we have a YARN resource pool to limit the
> > > concurrent writing.
> > >
> > > First, I should say that this job took a long time but still mostly
> > > worked. However, we've built a mechanism to compare requested data
> > > fetched from each of the tables and found that some rows (0.02%) are
> > > missing from the destination. We've ruled out a few things already:
> > >
> > > * A functional bug in the job that would have resulted in skipping that
> > > 0.02% of the rows.
> > > * The possibility that the data didn't yet exist when the migration job
> > > initially ran.
> > >
> > > At a high level, the suspects could be:
> > >
> > > * The source table splitting could have resulted in some input keys not
> > > being read. However, since an HBase split is defined by a
> > > startKey/endKey, this would not be expected unless there was a bug in
> > > there somehow.
> > > * The writing/flushing losing a batch. Since we're buffering writes and
> > > flush everything in the cleanup of map tasks, we would expect write
> > > failures to cause task failures/retries and therefore not to be a
> > > problem in the end. Given that this flush is synchronous and, according
> > > to our understanding, completes when the data is in the WAL and
> > > memstore, this also seems unlikely unless there's a bug.
> > >
> > > I should add that we've extracted a sample of 1% of the source rows
> > > (doing all of them is really time-consuming because of the size of the
> > > data) and found that the missing data often appears in clusters of
> > > source HBase row keys. This doesn't really help point to a problem on
> > > the scan side or the write side (since a failure in either would result
> > > in similar output), but we thought it was interesting. That said, we do
> > > have a few missing keys that aren't clustered. This could be because
> > > we've only run the comparison on 1% of the data, or it could be that
> > > whatever is causing this can affect very isolated cases.
> > > We're now trying to understand how this could have happened, both to
> > > understand how it could impact other jobs/applications and to be
> > > confident in writing a modified version of the migration job to
> > > re-migrate the skipped/missing data.
> > >
> > > Any ideas or advice would be much appreciated.
> > >
> > > Thanks!
> > >
> > > --
> > > Alex
>
> --
> Alex
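For reference, here is a minimal sketch of the buffered-write pattern described above, assuming a BufferedMutator flushed synchronously in the mapper's cleanup(); the class name, destination table name, and overall structure are illustrative, not the actual job:

import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;

// Illustrative only: copies each scanned row to the destination table through
// a client-side write buffer, flushing in cleanup() so that a failed flush
// fails the task attempt and triggers a retry.
public class MigrationMapper extends TableMapper<NullWritable, NullWritable> {

  private Connection connection;
  private BufferedMutator mutator;

  @Override
  protected void setup(Context context) throws IOException {
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    // "new_table" is a placeholder for the destination table name.
    mutator = connection.getBufferedMutator(TableName.valueOf("new_table"));
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result result, Context context)
      throws IOException {
    Put put = new Put(row.copyBytes());
    // Copy every cell of the source row into the destination Put.
    for (Cell cell : result.rawCells()) {
      put.add(cell);
    }
    mutator.mutate(put); // buffered client-side; not yet sent to the region server
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      // Synchronous flush: returns once the buffered mutations have been
      // accepted (WAL + memstore); throws on failure, failing the task.
      mutator.flush();
    } finally {
      mutator.close();
      connection.close();
    }
  }
}

With this shape, losing a batch would require flush() to report success without the mutations reaching the WAL/memstore, which matches your reasoning that the write path is the less likely suspect.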