Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Andrew Purtell Tue, 17 Nov 2020 07:56:35 -0800

Hi Duo,

Just to be clear: You are saying go ahead with the merge, but then also go back 
and start this discussion fresh, to see if anything was missed and more can be 
done?


> On Nov 16, 2020, at 11:25 PM, 张铎 <[email protected]> wrote:
> 
> Oh, this is my fault. I mean the old behavior IS to go to primary replica
> first, which is what we want to change here.
> 
> And what I commented  on jira, is to say that we do not need to get a
> performance improvement before merging, it is not the goal of this issue.
> And I suggested that if we want to show our advantage, we need to get the
> primary replica fucked up. I do not know why then the discussion went to
> the HedgeRead and I could not poll it back. I do not think this should
> block the merging but even though it was still very hard to communicate, so
> I assumed this means we still have a big gap on what we want to solve here,
> thus I voted a -1 here.
> 
> I think we need to go back to the beginning, to reach an agreement on the
> goal here. Let’s review the design doc again to see if we missed something
> which lead us to this situation.
> 
> And I need to say that, I do not want to block the issue to be merged. I
> tried my best to speed up the process. I suggested to land the changes at
> client side to master directly but was refused. I helped to add scan on
> specific replica feature soon on branch-2 to let the port to branch-2 can
> be landed cleanly.
> 
> On a mobile device so can not review the code or PR. Very busy these days.
> And the health examination this morning told me that I had a high blood
> pressure. Not a good birthday present. Will get back to the issue when
> possible.
> 
> Thanks.
> 
> Stack <[email protected]>于2020年11月17日 周二06:34写道：
> 
>>> On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <[email protected]>
>>> wrote:
>>> 
>>> So what is your purpose of distributing the request of region location
>>> lookup? It is just because you want to 'distribute the request of region
>>> location lookup'?
>>> 
>>> Then I'm -1 on merging. We should reach an agreement on what we want to
>>> solve before merging at least.
>>> 
>>> 
>> HERE.1
>> 
>> 
>>> I've helped this issue from the design doc step. For me, the purpose for
>>> this issue is clear. We want to prevent the hotspot of meta, so the
>>> solution is simple, enable meta replica, and then just modify the client
>> to
>>> not always go to primary replica first(this is the old behavior even with
>>> meta replica feature on).
>>> And this will introduce another problem that, there is no meta region
>>> replication implementation for meta read replicas, which means the
>> latency
>>> will be large as we can only sync the data between replicas through
>> region
>>> flush, so we implement meta region replication.
>>> 
>>> So I think it is very important to verify that we have truly distributed
>>> the request of region location lookup, and also make sure that we could
>>> support more requests of region location lookup. Otherwise this feature
>> is
>>> useless.
>>> 
>>> And I agree with Andrew that, since the feature is default off on
>> branch-2
>>> and has no regression, it is OK to merge for now. Theoretically our
>>> approach here should work, so even it does not work for now, I think we
>>> could fix the problems to make it work.
>>> 
>>> 
>> HERE.2
>> 
>> I agree with all of the above between HERE.1 and HERE.2 (except the
>> suggestion that the old behavior of read replicas is that they went to the
>> replica first; they go to the primary first -- see [1], [2]).
>> 
>> Lets work with any misalignment of understanding/communication offline and
>> not in the way of merge.
>> 
>> Thanks,
>> S
>> 
>> 1. http://hbase.apache.org/book.html#_timeline_consistency "In case a read
>> is performed with Consistency.TIMELINE, then the read RPC will be sent to
>> the primary region server first."
>> 2.
>> 
>> https://github.com/apache/hbase/blob/branch-2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallableWithReplicas.java#L195
>> 
>> 
>> 
>>> But your reply above made me wonder whether we are talking about the same
>>> thing. That's why I'm -1 here. I'm not going to force you to do the test
>>> suggested by me, as I said it could be done after merging, just want to
>>> reach an agreement on the goal of this feature.
>>> 
>>> Thanks.
>>> 
>>> Stack <[email protected]> 于2020年11月16日周一 下午12:35写道：
>>> 
>>>> On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <
>> [email protected]
>>>> 
>>>> wrote:
>>>> 
>>>>> I agree with Duo’s comment that a performance gain is unlikely but
>>> would
>>>>> be orthogonal anyway;
>>>> 
>>>> 
>>>> Perf observation is just an aside in the issue. Perf is orthogonal as
>> you
>>>> say above (as long as no regression).
>>>> 
>>>> 
>>>> 
>>>>> it’s an availability gain that is the goal. We can assume it based on
>>>>> theory of operation and unit test results but the gain should be
>> tested
>>>> and
>>>>> measured on a cluster too.
>>>>> 
>>>> 
>>>> 
>>>> The feature is about distributing load on hbase:meta to alleviate
>>>> hotspotting; it makes read replicas more live so replicas are more
>> likely
>>>> to satisfy location lookups making read replicas more effective. That
>>> read
>>>> replicas improve HA is presumed -- it was the original justification
>> for
>>>> this years old commit -- but HA is not the focus of this addition;
>> hence
>>> no
>>>> reports on effectiveness in this area.
>>>> 
>>>> I have no problem working on such tests/reports but suggest that they
>> are
>>>> done post merge.
>>>> 
>>>> 
>>>> 
>>>>> That said, the results of the testing thus far indicate no
>> regression,
>>>>> which gives me confidence to support a merge. Specifically, a merge
>> to
>>>>> “unblock” 2.4 (we aren’t really blocked, we are waiting), provided
>> the
>>>>> default there is the feature is configured off. But please indicate
>> in
>>>>> documentation and release notes that the feature is not widely tested
>>>> yet -
>>>>> as is customarily done for new functionality like this.
>>>>> 
>>>>> 
>>>> No problem w/ flagging the feature as new.
>>>> 
>>>> Thanks,
>>>> S
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>>> On Nov 15, 2020, at 5:20 AM, 张铎 <[email protected]> wrote:
>>>>>> 
>>>>>> Replied on jira, I think we missed an important scenario when
>>> testing.
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> Stack <[email protected]> 于2020年11月15日周日 上午2:30写道：
>>>>>> 
>>>>>>> HBASE-18070 makes it so hbase:meta read replicas can run closer to
>>> the
>>>>>>> primary, (< second lags rather than minutes). It adds Async WAL
>>>>>>> Replication[1] on the hbase:meta table; i.e. edits are sprayed
>>> across
>>>>>>> replicas as they arrive at the primary's WAL. Before this work,
>>> Async
>>>>> WAL
>>>>>>> Replication was only available on user-space tables and the only
>>>> option
>>>>> for
>>>>>>> hbase:meta read replicas was reloading the primaries hfiles on a
>>>> period
>>>>>>> (minutes). HBASE-18070 also adds an optional client-side
>>> 'LoadBalance'
>>>>>>> policy that favors read replicas ahead of primary reads falling
>> back
>>>> to
>>>>> the
>>>>>>> primary on fault. Together, these additions allow distributing
>>>>> hbase:meta
>>>>>>> read load across primary and replicas alleviating 'hotspotting'.
>>>>>>> 
>>>>>>> I would like to merge the feature to master branch Monday evening
>> if
>>>>> there
>>>>>>> is no objection (Soon after I'll merge to branch-2 so this feature
>>> can
>>>>>>> hopefully be included in the upcoming 2.4.0RC).
>>>>>>> 
>>>>>>> * For the design, see [2].
>>>>>>> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise
>>> this
>>>>>>> feature, see [3].
>>>>>>> * For a PE report that compared performance before and after, see
>>>>>>> HBASE-25127 (no regression).
>>>>>>> * A report on ITBLL runs is pending to be attached to HBASE-18070
>>> but
>>>>> runs
>>>>>>> so far show no regression with the feature enabled (ITBLL runs
>> were
>>>> done
>>>>>>> against a backport of this feature to branch-2 as the ITBLL state
>> of
>>>>> master
>>>>>>> is currently an unknown).
>>>>>>> 
>>>>>>> Testing continues mainly looking for further improvement and to
>>> better
>>>>>>> understand this feature in operation. Documentation is included
>> but
>>> in
>>>>> need
>>>>>>> of polish (working on it).
>>>>>>> 
>>>>>>> Dump any questions here and I'll be happy to respond. If you need
>>> more
>>>>> time
>>>>>>> to review, just shout.
>>>>>>> 
>>>>>>> Thanks and thanks to all who contributed to this feature; the
>>>> reviewers
>>>>> and
>>>>>>> the testers in particular.
>>>>>>> 
>>>>>>> S
>>>>>>> 
>>>>>>> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
>>>>>>> 2.
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
>>>>>>> This patch is currently missing HBASE-25280, a bug found in
>> testing.
>>>>>>> 3. https://github.com/apache/hbase/pull/2643
>>>>>>> 
>>>>> 
>>>> 
>>> 
>>

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Reply via email to