Hi Nick,

Thank you for your answer!

"I'm not quite comfortable with removing the warning yet (more testing to be
done), but I'm hoping to get to that point, at least for our production
workloads, if not for 2.3.0, within the early releases of 2.3.x."

What would make you comfortable? Do you have specific criteria in mind? Are
there test results you would like to see? Use of different load, bigger
cluster or longer runtime? I could try to provide these if you would let me
know what you are looking for.

Thanks,
Szabolcs

On Fri, Feb 7, 2020 at 12:03 AM Nick Dimiduk <[email protected]> wrote:

> Hi Szabolcs,
>
> Looks like that note was dropped in via HBASE-20830, about 2 years back.
> From what we've seen on our clusters, kicking the tires with branch-2.2 and
> branch-2, that was indeed the case. Based on that recent experience and
> that of others', there's a number of fixes for Procedures and hbck2 that
> better handle read replica regions. Try this git incantation [0] and look
> for issues related to "unknown servers" and "RIT".
>
> I'm not quite comfortable with removing the warning yet (more testing to be
> done), but I'm hoping to get to that point, at least for our production
> workloads, if not for 2.3.0, within the early releases of 2.3.x.
>
> Maybe someone else who's using and abusing this feature on a branch-2
> release could add their experience?
>
> Thanks,
> Nick
>
> [0]: git log --oneline branch-2 -- hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/ | head -n20
>
> On Thu, Feb 6, 2020 at 1:26 AM Szabolcs Bukros <[email protected]> wrote:
>
> > Hi!
> >
> > In the current Reference Guide the section "75. Timeline-consistent High
> > Available Reads" is flagged as "maybe broken. Use it with caution". I'm not
> > familiar with the original reason it was flagged but I have spent a few
> > weeks working on this and after a few small fixes it looked stable enough.
> > I think we should remove this warning for new 2.2+ releases. Below are some
> > details about the fixes and the testing I did.
> >
> > Fixes:
> > - HBASE-23589 FlushDescriptor contains non-matching family/output
> >   combinations <https://issues.apache.org/jira/browse/HBASE-23589>
> > - HBASE-23601 OutputSink.WriterThread exception gets stuck and repeated
> >   indefinitely <https://issues.apache.org/jira/browse/HBASE-23601>
> >
> > Testing:
> > After the fixes I run IntegrationTestRegionReplicaReplication for testing
> > on a 4 machine cluster (3 RS, 30GB heap/RS). I used the default test
> > parameters, only increased read_delay_ms to 60000. The longest
> > uninterrupted run I tried was 8 hours and I encountered no issues. Even
> > adding in the chaos monkeys (slowDeterministic) hasn't revealed any new
> > correctness issues with the feature.
> >
> > Next steps:
> > - Further testing. I realize IntegrationTestRegionReplicaReplication
> >   provides a very uniform, unrealistic load, using different data could be
> >   interesting. If someone would find the time to run a few tests or propose
> >   some scenarios I would be grateful.
> > - I was thinking of providing a cleaner flush logic on replication side,
> >   but my proposal might have too much overhead and the current logic while
> >   having issues works after the previous fixes. The proposal can be found in
> >   HBASE-23591, any feedback would be welcomed.
> >
> > Thoughts?
> >
>

------------------------------
