I did some testing and found an interesting behavior that you might be able to comment on.
When I run the test against apache/branch-1.1 and apache/branch-1.2 with
the following command, it fails consistently for me:

`mvn -pl hbase-it -am -Dtest=NoUnitTests -Dit.test=IntegrationTestRegionReplicaReplication verify`

If I remove line 103 from the test, it passes on both Apache branches
and on CDH based on v1.2:

    conf.setLong(HConstants.HREGION_MEMSTORE_FLUSH_SIZE, 1024L * 1024 * 4); // flush every 4 MB

Do you know why setting hbase.hregion.memstore.flush.size is needed? As
far as I understand, the test verifies that async WAL replication works.
Don't we bypass that functionality if we flush too frequently?

Thanks,
Peter

On Mon, Jun 19, 2017 at 2:55 AM, Devaraj Das <d...@hortonworks.com> wrote:

> If it is failing consistently I'd suspect we have introduced a bug in the
> 1.2 line or something. We do run the same test with a version based on
> 1.1.2 (HDP-2.3 and beyond) and it works fine.
>
> On Sun, Jun 18, 2017 at 8:26 AM -0700, "Peter Somogyi"
> <psomo...@cloudera.com> wrote:
>
> I'm using HBase based on version 1.2.
>
> On Sat, Jun 17, 2017 at 4:00 PM, Devaraj Das wrote:
>
> > Peter, which version of HBase are you testing with?
> >
> > On Thu, Jun 15, 2017 at 11:57 PM -0700, "Peter Somogyi"
> > <psomo...@cloudera.com> wrote:
> >
> > I tried with those parameters but the test still failed.
> > I noticed that some of the rows were not replicated to the replicas just
> > after I called flush manually. I think memstore replication is not
> > working on my system even though it is enabled in the configuration.
> > I will look into it today.
> >
> > On Fri, Jun 16, 2017 at 7:09 AM, Devaraj Das wrote:
> >
> > > Peter, do have a look at IntegrationTestRegionReplicaReplication.java ..
> > > At the top of the file, the ways to specify the options are documented ..
> > > You need to add something like
> > > -DIntegrationTestRegionReplicaReplication.read_delay_ms
> > > ..
> > > ________________________________________
> > > From: Josh Elser
> > > Sent: Thursday, June 15, 2017 10:40 AM
> > > To: dev@hbase.apache.org
> > > Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > >
> > > I'd start by trying read_delay_ms=60000, region_replication=2,
> > > num_keys_per_server=5000, num_regions_per_server=5 with maybe tens of
> > > reader and writer threads.
> > >
> > > Again, this can be quite dependent on the kind of hardware you have.
> > > You'll definitely have to tweak ;)
> > >
> > > On 6/15/17 4:44 AM, Peter Somogyi wrote:
> > > > Thanks Josh and Devaraj!
> > > >
> > > > I will try to increase the timeouts. Devaraj, could you share the
> > > > parameters you used for this test which worked?
> > > >
> > > > On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das wrote:
> > > >
> > > >> That sounds about right, Josh. Peter, in our internal testing we have
> > > >> seen this test failing, and increasing timeouts (look at the test code
> > > >> options to do with increasing timeout) helped quite a bit.
> > > >> ________________________________________
> > > >> From: Josh Elser
> > > >> Sent: Wednesday, June 14, 2017 3:17 PM
> > > >> To: dev@hbase.apache.org
> > > >> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > > >>
> > > >> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> > > >>> Hi,
> > > >>>
> > > >>> As one of my first tasks with HBase I started to look into why
> > > >>> IntegrationTestRegionReplicaReplication fails. I would like to get
> > > >>> some suggestions from you.
> > > >>>
> > > >>> I noticed that when I run the test using a normal cluster or a
> > > >>> minicluster I get the same error messages: "Error checking data for
> > > >>> key [null], no data returned". I looked into the code and here are my
> > > >>> conclusions.
> > > >>>
> > > >>> There are multiple threads writing data in parallel, which is read by
> > > >>> multiple reader threads simultaneously.
> > > >>> Each writer gets a portion of the keys to write (e.g. 0-2000) and
> > > >>> these keys are added to a ConstantDelayQueue.
> > > >>> The reader threads get the elements (e.g. key=1000) from the queue
> > > >>> and these reader threads assume that all the keys up to this one are
> > > >>> already in the database. Since we're using multiple writers it can
> > > >>> happen that another thread has not yet written key=500, and verifying
> > > >>> these keys will cause the test failure.
> > > >>>
> > > >>> Do you think my assumption is correct?
> > > >>
> > > >> Hi Peter,
> > > >>
> > > >> No, as my memory serves, this is not correct. Readers are not made
> > > >> aware of keys to verify until the write occurs plus some delay. The
> > > >> delay is used to provide enough time for the internal region
> > > >> replication to take effect.
> > > >>
> > > >> So: primary-write, pause, [region replication happens in background],
> > > >> add updated key to read queue, reader gets key from queue, verifies
> > > >> the value on a replica.
> > > >>
> > > >> The primary should always have seen the new value for a key. If the
> > > >> test is showing that a replica does not see the result, it's either a
> > > >> timing issue (you need to give a larger delay for HBase to perform the
> > > >> region replication) or a bug in the region replication framework
> > > >> itself. That said, if you can show that you are seeing what you
> > > >> describe, that sounds like the test framework itself is broken :)
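
The sequence described at the bottom of the thread (write to the primary, wait, then hand the key to a reader) can be sketched with the JDK's java.util.concurrent.DelayQueue. This is only an illustration of the idea, not the ConstantDelayQueue that the HBase test uses. The names DelayedKeyQueueSketch, publishWrittenKey, pollReadyKey and takeReadyKey are made up for the example, and the delay value stands in for read_delay_ms.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: keys become visible to readers only after a fixed delay. */
final class DelayedKeyQueueSketch {

    /** A written key plus the time at which readers may start verifying it. */
    static final class DelayedKey implements Delayed {
        final long key;
        final long readyAtNanos;

        DelayedKey(long key, long delayMs) {
            this.key = key;
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            // Remaining time until the key may be read; zero or negative means ready.
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    private final DelayQueue<DelayedKey> queue = new DelayQueue<>();
    private final long readDelayMs;

    DelayedKeyQueueSketch(long readDelayMs) {
        this.readDelayMs = readDelayMs;
    }

    /** Writer side: call this only after the write to the primary has succeeded. */
    void publishWrittenKey(long key) {
        queue.add(new DelayedKey(key, readDelayMs));
    }

    /** Reader side, non-blocking: returns null if no key has aged past the delay. */
    Long pollReadyKey() {
        DelayedKey ready = queue.poll();
        return ready == null ? null : ready.key;
    }

    /** Reader side, blocking: waits until some key has aged past the delay. */
    long takeReadyKey() throws InterruptedException {
        return queue.take().key;
    }

    public static void main(String[] args) throws InterruptedException {
        DelayedKeyQueueSketch keys = new DelayedKeyQueueSketch(200); // stand-in for read_delay_ms
        keys.publishWrittenKey(1000L);                 // the write to the primary already happened
        System.out.println("immediately: " + keys.pollReadyKey());
        System.out.println("after delay: " + keys.takeReadyKey());
    }
}
```

In this model a reader never sees a key that has not been written, because the writer enqueues it only after its own write. A verification failure would then point at the delay being too short for replication, or at replication itself, which matches the two possibilities raised in the thread.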