Hi!

In the current Reference Guide the section "75. Timeline-consistent High
Available Reads" is flagged as "maybe broken. Use it with caution". I'm not
familiar with the original reason it was flagged, but I have spent a few
weeks working on this, and after a few small fixes it looks stable enough.
I think we should remove this warning for new 2.2+ releases. Below are some
details about the fixes and the testing I did.

Fixes:
- HBASE-23589 FlushDescriptor contains non-matching family/output
combinations <https://issues.apache.org/jira/browse/HBASE-23589>
- HBASE-23601 OutputSink.WriterThread exception gets stuck and repeated
indefinitely <https://issues.apache.org/jira/browse/HBASE-23601>

Testing:
After the fixes I ran IntegrationTestRegionReplicaReplication on a
4-machine cluster (3 RS, 30GB heap/RS). I used the default test
parameters and only increased read_delay_ms to 60000. The longest
uninterrupted run I tried was 8 hours, and I encountered no issues. Even
adding the chaos monkeys (slowDeterministic) did not reveal any new
correctness issues with the feature.

Next steps:
- Further testing. I realize IntegrationTestRegionReplicaReplication
provides a very uniform, unrealistic load, so using different data could
be interesting. If someone could find the time to run a few tests or
propose some scenarios, I would be grateful.
- I was thinking of providing cleaner flush logic on the replication side,
but my proposal might have too much overhead, and the current logic, while
it has issues, works after the fixes above. The proposal can be found in
HBASE-23591; any feedback would be welcome.

Thoughts?
