You can run rowcounter on the source tables multiple times.

With the region servers under load, you would observe inconsistent results
across different runs if the bug is present.
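
For a quicker spot check than a full rowcounter pass
(hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>), you can
also count a fixed key range client-side and repeat it. A rough sketch,
with placeholder table and key names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeRowCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("source_table"))) {
      Scan scan = new Scan();
      // Placeholder key range; use the same range for every run.
      scan.setStartRow(Bytes.toBytes("row-000"));
      scan.setStopRow(Bytes.toBytes("row-999"));
      // One cell per row is enough for counting; larger caching means fewer RPCs.
      scan.setFilter(new FirstKeyOnlyFilter());
      scan.setCaching(1000);
      long count = 0;
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result ignored : scanner) {
          count++;
        }
      }
      System.out.println("rows in range: " + count);
    }
  }
}

Getting different counts for the same range across consecutive runs, with
no writes to that range in between, would be a strong hint.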

On Sun, Feb 5, 2017 at 12:54 PM, Alexandre Normand <
alexandre.norm...@gmail.com> wrote:

> Thanks, Ted. We're running HBase 1.0.0-cdh5.5.4, which isn't among the
> fixed versions, so this might be related. It's somewhat reassuring to
> think that the data would have been missed on the scan/source side,
> because that would mean our other ingest/write workloads aren't affected.
>
> From reading the jira description, it sounds like it would be difficult
> to confirm that we've been affected by this bug. Am I right?
>
> On Sun, Feb 5, 2017 at 12:36 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Which release of HBase are you using?
> >
> > To be specific, does the release have HBASE-15378?
> >
> > Cheers
> >
> > On Sun, Feb 5, 2017 at 11:32 AM, Alexandre Normand <
> > alexandre.norm...@gmail.com> wrote:
> >
> > > We're migrating data from a previous iteration of a table to a new
> > > one, and this process involved an MR job that scans data from the
> > > source table and writes the equivalent data to the new table. The
> > > source table has 6000+ regions and it splits frequently because
> > > we're still ingesting time series data into it. We used buffered
> > > writes on the other end when writing to the new table, and we have a
> > > yarn resource pool to limit the concurrent writing.
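> > >
> > > For context, the mapper is roughly along these lines (a heavily
> > > simplified sketch; the table name and the use of BufferedMutator are
> > > illustrative rather than our exact code):
> > >
> > > import java.io.IOException;
> > > import org.apache.hadoop.hbase.Cell;
> > > import org.apache.hadoop.hbase.TableName;
> > > import org.apache.hadoop.hbase.client.BufferedMutator;
> > > import org.apache.hadoop.hbase.client.Connection;
> > > import org.apache.hadoop.hbase.client.ConnectionFactory;
> > > import org.apache.hadoop.hbase.client.Put;
> > > import org.apache.hadoop.hbase.client.Result;
> > > import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > > import org.apache.hadoop.hbase.mapreduce.TableMapper;
> > > import org.apache.hadoop.io.NullWritable;
> > >
> > > public class CopyTableMapper extends TableMapper<NullWritable, NullWritable> {
> > >   private Connection connection;
> > >   private BufferedMutator mutator;
> > >
> > >   @Override
> > >   protected void setup(Context context) throws IOException {
> > >     // Separate client connection and write buffer for the destination table.
> > >     connection = ConnectionFactory.createConnection(context.getConfiguration());
> > >     mutator = connection.getBufferedMutator(TableName.valueOf("new_table"));
> > >   }
> > >
> > >   @Override
> > >   protected void map(ImmutableBytesWritable key, Result value, Context context)
> > >       throws IOException, InterruptedException {
> > >     // Copy every cell of the source row into an equivalent Put.
> > >     Put put = new Put(value.getRow());
> > >     for (Cell cell : value.rawCells()) {
> > >       put.add(cell);
> > >     }
> > >     // Buffered: the actual RPCs happen as the buffer fills and on flush().
> > >     mutator.mutate(put);
> > >   }
> > >
> > >   @Override
> > >   protected void cleanup(Context context) throws IOException {
> > >     // Synchronous flush; a write failure here should fail (and retry) the task.
> > >     mutator.flush();
> > >     mutator.close();
> > >     connection.close();
> > >   }
> > > }
> > >
> > > The important part for this discussion is that writes are buffered in
> > > map() and only guaranteed to be flushed once cleanup() runs.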
> > >
> > > First, I should say that this job took a long time but still mostly
> > > worked. However, we've built a mechanism to compare requested data
> > > fetched from each of the two tables (sketched below) and found that
> > > some rows (0.02%) are missing from the destination. We've ruled out
> > > a few things already:
> > >
> > > * A functional bug in the job that would have skipped that 0.02% of the rows.
> > > * The possibility that the data didn't exist yet when the migration job first ran.
> > >
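> > > The comparison boils down to something like this (illustrative
> > > sketch: the table name and row keys are made up, and in practice the
> > > sampled keys come from a scan of the source table):
> > >
> > > import java.util.ArrayList;
> > > import java.util.List;
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.hbase.HBaseConfiguration;
> > > import org.apache.hadoop.hbase.TableName;
> > > import org.apache.hadoop.hbase.client.Connection;
> > > import org.apache.hadoop.hbase.client.ConnectionFactory;
> > > import org.apache.hadoop.hbase.client.Get;
> > > import org.apache.hadoop.hbase.client.Result;
> > > import org.apache.hadoop.hbase.client.Table;
> > > import org.apache.hadoop.hbase.util.Bytes;
> > >
> > > public class FindMissingRows {
> > >   public static void main(String[] args) throws Exception {
> > >     // Row keys sampled from the source table (placeholders here).
> > >     List<byte[]> sampledKeys = new ArrayList<byte[]>();
> > >     sampledKeys.add(Bytes.toBytes("row-000123"));
> > >     sampledKeys.add(Bytes.toBytes("row-000124"));
> > >
> > >     Configuration conf = HBaseConfiguration.create();
> > >     try (Connection connection = ConnectionFactory.createConnection(conf);
> > >          Table destination = connection.getTable(TableName.valueOf("new_table"))) {
> > >       List<Get> gets = new ArrayList<Get>();
> > >       for (byte[] rowKey : sampledKeys) {
> > >         gets.add(new Get(rowKey));
> > >       }
> > >       // Batched multi-get against the destination table.
> > >       Result[] results = destination.get(gets);
> > >       for (int i = 0; i < results.length; i++) {
> > >         if (results[i].isEmpty()) {
> > >           System.out.println("missing: " + Bytes.toString(sampledKeys.get(i)));
> > >         }
> > >       }
> > >     }
> > >   }
> > > }
> > >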
> > > At a high level, the suspects could be:
> > >
> > > * The source table splitting could have resulted in some input keys
> > > not being read. However, since each HBase region/split is defined by
> > > a startKey/endKey range, we wouldn't expect keys to be skipped unless
> > > there's a bug in there somehow (a quick sanity check on the region
> > > boundaries is sketched after this list).
> > > * The writing/flushing losing a batch. Since we buffer writes and
> > > flush everything in the cleanup of the map tasks (as in the mapper
> > > sketch above), we would expect write failures to cause task
> > > failures/retries and therefore not to be a problem in the end. Given
> > > that this flush is synchronous and, as we understand it, completes
> > > once the data is in the WAL and memstore, this also seems unlikely
> > > unless there's a bug.
> > >
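> > > For what it's worth, a point-in-time sanity check that the table's
> > > current region boundaries are contiguous looks roughly like this
> > > (illustrative; it says nothing about the boundaries at the moment
> > > the job's input splits were computed):
> > >
> > > import org.apache.hadoop.conf.Configuration;
> > > import org.apache.hadoop.hbase.HBaseConfiguration;
> > > import org.apache.hadoop.hbase.TableName;
> > > import org.apache.hadoop.hbase.client.Connection;
> > > import org.apache.hadoop.hbase.client.ConnectionFactory;
> > > import org.apache.hadoop.hbase.client.RegionLocator;
> > > import org.apache.hadoop.hbase.util.Bytes;
> > > import org.apache.hadoop.hbase.util.Pair;
> > >
> > > public class CheckRegionCoverage {
> > >   public static void main(String[] args) throws Exception {
> > >     Configuration conf = HBaseConfiguration.create();
> > >     try (Connection connection = ConnectionFactory.createConnection(conf);
> > >          RegionLocator locator =
> > >              connection.getRegionLocator(TableName.valueOf("source_table"))) {
> > >       Pair<byte[][], byte[][]> keys = locator.getStartEndKeys();
> > >       byte[][] startKeys = keys.getFirst();
> > >       byte[][] endKeys = keys.getSecond();
> > >       // Regions come back sorted; each region's end key should equal the
> > >       // next region's start key, i.e. no gaps in the key space.
> > >       for (int i = 0; i < startKeys.length - 1; i++) {
> > >         if (!Bytes.equals(endKeys[i], startKeys[i + 1])) {
> > >           System.out.println("gap after " + Bytes.toStringBinary(endKeys[i]));
> > >         }
> > >       }
> > >       System.out.println("regions checked: " + startKeys.length);
> > >     }
> > >   }
> > > }
> > >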
> > > I should add that we've extracted a sample of 1% of the source rows
> > > (doing all of them is really time consuming given the size of the
> > > data) and found that the missing data often appears in clusters of
> > > source HBase row keys. This doesn't really help point at the scan
> > > side or the write side (a failure in either would produce a similar
> > > result), but we thought it was interesting. That said, we do have a
> > > few missing keys that aren't clustered. This could be because we've
> > > only run the comparison on 1% of the data, or it could be that
> > > whatever is causing this can also affect very isolated cases.
> > >
> > > We're now trying to understand how this could have happened, both to
> > > assess how it could impact other jobs/applications and to have
> > > confidence in a modified version of the migration job that we'd use
> > > to re-migrate the skipped/missing data.
> > >
> > > Any ideas or advice would be much appreciated.
> > >
> > > Thanks!
> > >
> > > --
> > > Alex
> > >
> >
> --
> Alex
>
