I did some testing and found an interesting behavior that you might be
able to comment on.

When running the test against apache/branch-1.1 and apache/branch-1.2 using
the following command the tests consistently failed for me:
`mvn -pl hbase-it -am -Dtest=NoUnitTests
-Dit.test=IntegrationTestRegionReplicaReplication verify`

If I remove line 103 from the test, the test passes both on the Apache
branches and on CDH based on 1.2:
    conf.setLong(HConstants.HREGION_MEMSTORE_FLUSH_SIZE, 1024L * 1024 * 4); // flush every 4 MB

Do you know why setting hbase.hregion.memstore.flush.size is needed? As far
as I understand the test verifies that async WAL replication works. Don't
we bypass that functionality if we flush too frequently?
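
For reference, the write, pause, then read flow described further down in
this thread can be sketched roughly as below. This is only an illustration
under assumptions: it uses the JDK's java.util.concurrent.DelayQueue as a
stand-in for the test's ConstantDelayQueue, and the class name
DelayedReadQueueSketch, the DelayedKey type and the 200 ms delay are
hypothetical, not taken from the HBase code.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class DelayedReadQueueSketch {

    /** A written key that becomes visible to readers only after a fixed delay. */
    public static final class DelayedKey implements Delayed {
        public final long key;
        private final long readyAtNanos;

        public DelayedKey(long key, long delayMs) {
            this.key = key;
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            // Remaining time until a reader is allowed to see this key.
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long readDelayMs = 200; // hypothetical stand-in for read_delay_ms
        DelayQueue<DelayedKey> readQueue = new DelayQueue<>();

        // Writer side: the key is queued only after the (simulated) primary write returns.
        readQueue.add(new DelayedKey(1000L, readDelayMs));

        // Reader side: poll() returns null while the delay has not yet elapsed.
        System.out.println("visible immediately: " + (readQueue.poll() != null));

        // take() blocks until the delay expires; only then would a reader check a replica.
        DelayedKey ready = readQueue.take();
        System.out.println("verify key on replica: " + ready.key);
    }
}
```

In this sketch a reader can never pick up a key before the configured delay
has passed, which is the window in which region replication is expected to
catch up.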

Thanks,
Peter

On Mon, Jun 19, 2017 at 2:55 AM, Devaraj Das <d...@hortonworks.com> wrote:

> If it is failing consistently I'd suspect we have introduced a bug in the
> 1.2 line or something. We do run the same test with a version based on
> 1.1.2 (HDP-2.3 and beyond) and it works fine
>
>
>
>
> On Sun, Jun 18, 2017 at 8:26 AM -0700, "Peter Somogyi" <
> psomo...@cloudera.com> wrote:
>
>
> I'm using HBase based on version 1.2.
>
> On Sat, Jun 17, 2017 at 4:00 PM, Devaraj Das  wrote:
>
> > Peter, which version of HBase are you testing with?
> >
> >
> >
> >
> > On Thu, Jun 15, 2017 at 11:57 PM -0700, "Peter Somogyi" <
> > psomo...@cloudera.com> wrote:
> >
> >
> > I tried with those parameters but the test still failed.
> > I noticed that some of the rows were not replicated to the replicas just
> > after I called flush manually. I think memstore replication is not working
> > on my system even though it is enabled in the configuration.
> > I will look into it today.
> >
> > On Fri, Jun 16, 2017 at 7:09 AM, Devaraj Das  wrote:
> >
> > > Peter, do have a look at IntegrationTestRegionReplicaReplication.java ..
> > > At the top of the file, the ways to specify the options are documented ..
> > > You need to add something like
> > > -DIntegrationTestRegionReplicaReplication.read_delay_ms ..
> > > ________________________________________
> > > From: Josh Elser
> > > Sent: Thursday, June 15, 2017 10:40 AM
> > > To: dev@hbase.apache.org
> > > Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > >
> > > I'd start by trying read_delay_ms=60000, region_replication=2,
> > > num_keys_per_server=5000, num_regions_per_server=5 with maybe tens of
> > > reader and writer threads.
> > >
> > > Again, this can be quite dependent on the kind of hardware you have.
> > > You'll definitely have to tweak ;)
> > >
> > > On 6/15/17 4:44 AM, Peter Somogyi wrote:
> > > > Thanks Josh and Devaraj!
> > > >
> > > > I will try to increase the timeouts. Devaraj, could you share the
> > > > parameters you used for this test which worked?
> > > >
> > > > On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das wrote:
> > > >
> > > >> That sounds about right, Josh. Peter, in our internal testing we have
> > > >> seen this test failing, and increasing timeouts (look at the test code
> > > >> options to do with increasing timeout) helped quite a bit.
> > > >> ________________________________________
> > > >> From: Josh Elser
> > > >> Sent: Wednesday, June 14, 2017 3:17 PM
> > > >> To: dev@hbase.apache.org
> > > >> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > > >>
> > > >> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> > > >>> Hi,
> > > >>>
> > > >>> As one of my first tasks with HBase I started to look into
> > > >>> why IntegrationTestRegionReplicaReplication fails. I would like to
> > > >>> get some suggestions from you.
> > > >>>
> > > >>> I noticed that when I run the test using a normal cluster or a
> > > >>> minicluster I get the same error messages: "Error checking data for
> > > >>> key [null], no data returned". I looked into the code and here are my
> > > >>> conclusions.
> > > >>>
> > > >>> There are multiple threads writing data in parallel, which is read by
> > > >>> multiple reader threads simultaneously. Each writer gets a portion of
> > > >>> the keys to write (e.g. 0-2000) and these keys are added to a
> > > >>> ConstantDelayQueue. The reader threads get the elements (e.g.
> > > >>> key=1000) from the queue and assume that all the keys up to this one
> > > >>> are already in the database. Since we're using multiple writers it can
> > > >>> happen that another thread has not yet written key=500, and verifying
> > > >>> these keys will cause the test failure.
> > > >>>
> > > >>> Do you think my assumption is correct?
> > > >>
> > > >> Hi Peter,
> > > >>
> > > >> No, as my memory serves, this is not correct. Readers are not made
> > > >> aware of keys to verify until the write occurs plus some delay. The
> > > >> delay is used to provide enough time for the internal region
> > > >> replication to take effect.
> > > >>
> > > >> So: primary-write, pause, [region replication happens in background],
> > > >> add updated key to read queue, reader gets key from queue and verifies
> > > >> the value on a replica.
> > > >>
> > > >> The primary should always have seen the new value for a key. If the
> > > >> test is showing that a replica does not see the result, it's either a
> > > >> timing issue (you need to give a larger delay for HBase to perform the
> > > >> region replication) or a bug in the region replication framework
> > > >> itself. That said, if you can show that you are seeing what you
> > > >> describe, that sounds like the test framework itself is broken :)
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > >
> > >
> >
> >
>
>
