Reporting back with some results.

We ran several RowCounters and each one gives us the same count back. That
could simply be because RowCounter is much more lightweight than our
migration job (which reads every cell and writes an equivalent version to
another table), so it's hard to tell.
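
For reference, the check is conceptually something like the following
single-client count (a rough sketch, not the actual RowCounter MR job; the
table name and connection config are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

public class NaiveRowCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("source_table"))) {
      Scan scan = new Scan();
      scan.setFilter(new FirstKeyOnlyFilter()); // one cell per row is enough to count it
      scan.setCaching(1000);
      scan.setCacheBlocks(false);               // full scan; don't churn the block cache
      long count = 0;
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result ignored : scanner) {
          count++;
        }
      }
      System.out.println("row count: " + count);
    }
  }
}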

Taking a step back, it looks like the bug described in HBASE-15378 was
introduced in 1.1.0, which wouldn't affect us since we're still on
1.0.0-cdh5.5.4.

I guess that puts us back to square one. Any other ideas?

On Sun, Feb 5, 2017 at 1:10 PM Alexandre Normand <
alexandre.norm...@gmail.com> wrote:

> That's a good suggestion. I'll give that a try.
>
> Thanks again!
>
> On Sun, Feb 5, 2017 at 1:07 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
> You can run rowcounter on the source tables multiple times.
>
> With region servers under load, you would observe inconsistent results from
> different runs.
>
> On Sun, Feb 5, 2017 at 12:54 PM, Alexandre Normand <
> alexandre.norm...@gmail.com> wrote:
>
> > Thanks, Ted. We're running HBase 1.0.0-cdh5.5.4, which isn't in the fixed
> > versions, so this might be related. It's somewhat reassuring to think that
> > this would be data missed on the scan/source side, because that would mean
> > our other ingest/write workloads aren't affected.
> >
> > From reading the JIRA description, it sounds like it would be difficult to
> > confirm that we've been affected by this bug. Am I right?
> >
> > On Sun, Feb 5, 2017 at 12:36 PM Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Which release of HBase are you using?
> > >
> > > To be specific, does the release have HBASE-15378?
> > >
> > > Cheers
> > >
> > > On Sun, Feb 5, 2017 at 11:32 AM, Alexandre Normand <
> > > alexandre.norm...@gmail.com> wrote:
> > >
> > > > We're migrating data from a previous iteration of a table to a new one,
> > > > and this process involved an MR job that scans the source table and
> > > > writes the equivalent data to the new table. The source table has 6000+
> > > > regions and it splits frequently because we're still ingesting time
> > > > series data into it. On the write side, we used buffered writes to the
> > > > new table, and we have a YARN resource pool to limit the concurrent
> > > > writing.
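> > > >
> > > > For concreteness, the scan side of the job is wired roughly like this (a
> > > > simplified sketch; class and table names are placeholders):
> > > >
> > > > import org.apache.hadoop.conf.Configuration;
> > > > import org.apache.hadoop.hbase.HBaseConfiguration;
> > > > import org.apache.hadoop.hbase.client.Scan;
> > > > import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> > > > import org.apache.hadoop.mapreduce.Job;
> > > > import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
> > > >
> > > > public class MigrationDriver {
> > > >   public static void main(String[] args) throws Exception {
> > > >     Configuration conf = HBaseConfiguration.create();
> > > >     Job job = Job.getInstance(conf, "table-migration");
> > > >     job.setJarByClass(MigrationDriver.class);
> > > >
> > > >     Scan scan = new Scan();
> > > >     scan.setCaching(500);        // larger scanner caching for a full-table scan
> > > >     scan.setCacheBlocks(false);  // don't pollute the block cache
> > > >
> > > >     TableMapReduceUtil.initTableMapperJob(
> > > >         "source_table",          // input table (placeholder name)
> > > >         scan,
> > > >         MigrationMapper.class,   // mapper that re-writes each row (sketched below)
> > > >         null,                    // no mapper output key: map-only job
> > > >         null,                    // no mapper output value
> > > >         job);
> > > >     job.setOutputFormatClass(NullOutputFormat.class);
> > > >     job.setNumReduceTasks(0);
> > > >     System.exit(job.waitForCompletion(true) ? 0 : 1);
> > > >   }
> > > > }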
> > > >
> > > > First, I should say that this job took a long time but still mostly
> > > > worked. However, we've built a mechanism to compare requested data
> > > > fetched from each of the two tables and found that some rows (0.02%) are
> > > > missing from the destination. We've ruled out a few things already:
> > > >
> > > > * A functional bug in the job that would have resulted in skipping that
> > > > 0.02% of the rows.
> > > > * The possibility that the data simply didn't exist yet when the
> > > > migration job initially ran.
> > > >
> > > > At a high level, the suspects could be:
> > > >
> > > > * The source table splitting could have resulted in some input keys not
> > > > being read. However, since an HBase split is defined by a
> > > > startKey/endKey range, this shouldn't happen unless there's a bug in
> > > > there somehow.
> > > > * The writing/flushing losing a batch. Since we buffer writes and flush
> > > > everything in the cleanup of the map tasks (see the sketch below), we
> > > > would expect write failures to cause task failures/retries and therefore
> > > > not to be a problem in the end. Given that this flush is synchronous
> > > > and, according to our understanding, completes once the data is in the
> > > > WAL and memstore, this also seems unlikely unless there's a bug.
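> > > >
> > > > To make the buffering/flushing concrete, the write side looks roughly
> > > > like this (a simplified sketch; the destination table name and exact
> > > > error handling are placeholders):
> > > >
> > > > import java.io.IOException;
> > > > import org.apache.hadoop.hbase.Cell;
> > > > import org.apache.hadoop.hbase.TableName;
> > > > import org.apache.hadoop.hbase.client.BufferedMutator;
> > > > import org.apache.hadoop.hbase.client.Connection;
> > > > import org.apache.hadoop.hbase.client.ConnectionFactory;
> > > > import org.apache.hadoop.hbase.client.Put;
> > > > import org.apache.hadoop.hbase.client.Result;
> > > > import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > > > import org.apache.hadoop.hbase.mapreduce.TableMapper;
> > > > import org.apache.hadoop.io.NullWritable;
> > > >
> > > > public class MigrationMapper extends TableMapper<NullWritable, NullWritable> {
> > > >   private Connection connection;
> > > >   private BufferedMutator mutator;
> > > >
> > > >   @Override
> > > >   protected void setup(Context context) throws IOException {
> > > >     connection = ConnectionFactory.createConnection(context.getConfiguration());
> > > >     mutator = connection.getBufferedMutator(TableName.valueOf("destination_table"));
> > > >   }
> > > >
> > > >   @Override
> > > >   protected void map(ImmutableBytesWritable key, Result value, Context context)
> > > >       throws IOException {
> > > >     Put put = new Put(value.getRow());
> > > >     for (Cell cell : value.rawCells()) {
> > > >       put.add(cell);              // copy every cell of the source row
> > > >     }
> > > >     mutator.mutate(put);          // buffered: may not have reached the server yet
> > > >   }
> > > >
> > > >   @Override
> > > >   protected void cleanup(Context context) throws IOException {
> > > >     mutator.flush();              // synchronous: a failure here should fail the task
> > > >     mutator.close();
> > > >     connection.close();
> > > >   }
> > > > }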
> > > >
> > > > I should add that we've extracted a sample of 1% of the source rows
> > > > (doing all of them is really time-consuming because of the size of the
> > > > data) and found that missing data often appears in clusters within the
> > > > source HBase row key space. This doesn't really help point to a problem
> > > > on the scan side versus the write side (since a failure in either would
> > > > result in similar output), but we thought it was interesting. That said,
> > > > we do have a few missing keys that aren't clustered. This could be
> > > > because we've only run the comparison on 1% of the data, or it could be
> > > > that whatever is causing this can also affect very isolated cases.
> > > >
> > > > We're now trying to understand how this could have happened, both to
> > > > assess how it could impact other jobs/applications and to have enough
> > > > confidence to write a modified version of the migration job that
> > > > re-migrates the skipped/missing data.
> > > >
> > > > Any ideas or advice would be much appreciated.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > > Alex
> > > >
> > >
> > --
> > Alex
> >
>
> --
> Alex
>
-- 
Alex
