Hi Duo, Just to be clear: You are saying go ahead with the merge, but then also go back and start this discussion fresh, to see if anything was missed and more can be done?
> On Nov 16, 2020, at 11:25 PM, 张铎 <[email protected]> wrote: > > Oh, this is my fault. I mean the old behavior IS to go to primary replica > first, which is what we want to change here. > > And what I commented on jira, is to say that we do not need to get a > performance improvement before merging, it is not the goal of this issue. > And I suggested that if we want to show our advantage, we need to get the > primary replica fucked up. I do not know why then the discussion went to > the HedgeRead and I could not poll it back. I do not think this should > block the merging but even though it was still very hard to communicate, so > I assumed this means we still have a big gap on what we want to solve here, > thus I voted a -1 here. > > I think we need to go back to the beginning, to reach an agreement on the > goal here. Let’s review the design doc again to see if we missed something > which lead us to this situation. > > And I need to say that, I do not want to block the issue to be merged. I > tried my best to speed up the process. I suggested to land the changes at > client side to master directly but was refused. I helped to add scan on > specific replica feature soon on branch-2 to let the port to branch-2 can > be landed cleanly. > > On a mobile device so can not review the code or PR. Very busy these days. > And the health examination this morning told me that I had a high blood > pressure. Not a good birthday present. Will get back to the issue when > possible. > > Thanks. > > Stack <[email protected]>于2020年11月17日 周二06:34写道: > >>> On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <[email protected]> >>> wrote: >>> >>> So what is your purpose of distributing the request of region location >>> lookup? It is just because you want to 'distribute the request of region >>> location lookup'? >>> >>> Then I'm -1 on merging. We should reach an agreement on what we want to >>> solve before merging at least. >>> >>> >> HERE.1 >> >> >>> I've helped this issue from the design doc step. For me, the purpose for >>> this issue is clear. We want to prevent the hotspot of meta, so the >>> solution is simple, enable meta replica, and then just modify the client >> to >>> not always go to primary replica first(this is the old behavior even with >>> meta replica feature on). >>> And this will introduce another problem that, there is no meta region >>> replication implementation for meta read replicas, which means the >> latency >>> will be large as we can only sync the data between replicas through >> region >>> flush, so we implement meta region replication. >>> >>> So I think it is very important to verify that we have truly distributed >>> the request of region location lookup, and also make sure that we could >>> support more requests of region location lookup. Otherwise this feature >> is >>> useless. >>> >>> And I agree with Andrew that, since the feature is default off on >> branch-2 >>> and has no regression, it is OK to merge for now. Theoretically our >>> approach here should work, so even it does not work for now, I think we >>> could fix the problems to make it work. >>> >>> >> HERE.2 >> >> I agree with all of the above between HERE.1 and HERE.2 (except the >> suggestion that the old behavior of read replicas is that they went to the >> replica first; they go to the primary first -- see [1], [2]). >> >> Lets work with any misalignment of understanding/communication offline and >> not in the way of merge. >> >> Thanks, >> S >> >> 1. http://hbase.apache.org/book.html#_timeline_consistency "In case a read >> is performed with Consistency.TIMELINE, then the read RPC will be sent to >> the primary region server first." >> 2. >> >> https://github.com/apache/hbase/blob/branch-2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallableWithReplicas.java#L195 >> >> >> >>> But your reply above made me wonder whether we are talking about the same >>> thing. That's why I'm -1 here. I'm not going to force you to do the test >>> suggested by me, as I said it could be done after merging, just want to >>> reach an agreement on the goal of this feature. >>> >>> Thanks. >>> >>> Stack <[email protected]> 于2020年11月16日周一 下午12:35写道: >>> >>>> On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell < >> [email protected] >>>> >>>> wrote: >>>> >>>>> I agree with Duo’s comment that a performance gain is unlikely but >>> would >>>>> be orthogonal anyway; >>>> >>>> >>>> Perf observation is just an aside in the issue. Perf is orthogonal as >> you >>>> say above (as long as no regression). >>>> >>>> >>>> >>>>> it’s an availability gain that is the goal. We can assume it based on >>>>> theory of operation and unit test results but the gain should be >> tested >>>> and >>>>> measured on a cluster too. >>>>> >>>> >>>> >>>> The feature is about distributing load on hbase:meta to alleviate >>>> hotspotting; it makes read replicas more live so replicas are more >> likely >>>> to satisfy location lookups making read replicas more effective. That >>> read >>>> replicas improve HA is presumed -- it was the original justification >> for >>>> this years old commit -- but HA is not the focus of this addition; >> hence >>> no >>>> reports on effectiveness in this area. >>>> >>>> I have no problem working on such tests/reports but suggest that they >> are >>>> done post merge. >>>> >>>> >>>> >>>>> That said, the results of the testing thus far indicate no >> regression, >>>>> which gives me confidence to support a merge. Specifically, a merge >> to >>>>> “unblock” 2.4 (we aren’t really blocked, we are waiting), provided >> the >>>>> default there is the feature is configured off. But please indicate >> in >>>>> documentation and release notes that the feature is not widely tested >>>> yet - >>>>> as is customarily done for new functionality like this. >>>>> >>>>> >>>> No problem w/ flagging the feature as new. >>>> >>>> Thanks, >>>> S >>>> >>>> >>>> >>>>> >>>>>> On Nov 15, 2020, at 5:20 AM, 张铎 <[email protected]> wrote: >>>>>> >>>>>> Replied on jira, I think we missed an important scenario when >>> testing. >>>>>> >>>>>> Thanks. >>>>>> >>>>>> Stack <[email protected]> 于2020年11月15日周日 上午2:30写道: >>>>>> >>>>>>> HBASE-18070 makes it so hbase:meta read replicas can run closer to >>> the >>>>>>> primary, (< second lags rather than minutes). It adds Async WAL >>>>>>> Replication[1] on the hbase:meta table; i.e. edits are sprayed >>> across >>>>>>> replicas as they arrive at the primary's WAL. Before this work, >>> Async >>>>> WAL >>>>>>> Replication was only available on user-space tables and the only >>>> option >>>>> for >>>>>>> hbase:meta read replicas was reloading the primaries hfiles on a >>>> period >>>>>>> (minutes). HBASE-18070 also adds an optional client-side >>> 'LoadBalance' >>>>>>> policy that favors read replicas ahead of primary reads falling >> back >>>> to >>>>> the >>>>>>> primary on fault. Together, these additions allow distributing >>>>> hbase:meta >>>>>>> read load across primary and replicas alleviating 'hotspotting'. >>>>>>> >>>>>>> I would like to merge the feature to master branch Monday evening >> if >>>>> there >>>>>>> is no objection (Soon after I'll merge to branch-2 so this feature >>> can >>>>>>> hopefully be included in the upcoming 2.4.0RC). >>>>>>> >>>>>>> * For the design, see [2]. >>>>>>> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise >>> this >>>>>>> feature, see [3]. >>>>>>> * For a PE report that compared performance before and after, see >>>>>>> HBASE-25127 (no regression). >>>>>>> * A report on ITBLL runs is pending to be attached to HBASE-18070 >>> but >>>>> runs >>>>>>> so far show no regression with the feature enabled (ITBLL runs >> were >>>> done >>>>>>> against a backport of this feature to branch-2 as the ITBLL state >> of >>>>> master >>>>>>> is currently an unknown). >>>>>>> >>>>>>> Testing continues mainly looking for further improvement and to >>> better >>>>>>> understand this feature in operation. Documentation is included >> but >>> in >>>>> need >>>>>>> of polish (working on it). >>>>>>> >>>>>>> Dump any questions here and I'll be happy to respond. If you need >>> more >>>>> time >>>>>>> to review, just shout. >>>>>>> >>>>>>> Thanks and thanks to all who contributed to this feature; the >>>> reviewers >>>>> and >>>>>>> the testers in particular. >>>>>>> >>>>>>> S >>>>>>> >>>>>>> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication >>>>>>> 2. >>>>>>> >>>>>>> >>>>> >>>> >>> >> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit# >>>>>>> This patch is currently missing HBASE-25280, a bug found in >> testing. >>>>>>> 3. https://github.com/apache/hbase/pull/2643 >>>>>>> >>>>> >>>> >>> >>
