Thanks, Duo, for your offer to coordinate on writing "Part 3" of this series; sounds great! Although I see TRSP#assign being used by SCP directly while assigning regions, I have yet to take a detailed look into HBASE-20881 <https://issues.apache.org/jira/browse/HBASE-20881> and the relevant work. Let me reach out to you over Slack and we can take it from there.

On Sun, Sep 12, 2021 at 7:02 PM 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

> Thank you Viraj and Andrew, the blog posts are outstanding!
>
> And I think we'd better have a part 3, about the ServerCrashProcedure (SCP) :)
>
> In 2.0 and 2.1, we used MoveRegionProcedure, AssignRegionProcedure and UnassignRegionProcedure, and one of the reasons why we removed them all and introduced a single TRSP to do assign/unassign/move/reopen is SCP.
>
> If a region server crashes, obviously, we cannot assign regions to it any more, so we should have a way to stop the procedures that are still trying to assign regions to the dead server. And even for unassigning a region, we still need to bring it online first and then unassign it. For example, when disabling a table, we must make sure that all the data in the memstore has been flushed to storage, so we need to bring the region online and then do a clean close.
>
> In 2.0 and 2.1, we had 3 procedures for region assignment, and there were lots of corner cases when we wanted to interrupt them from SCP, which made the code really hard to understand and buggy. So finally, we introduced a TRSP to replace them all, so that SCP only needs to interrupt one type of procedure.
>
> This is the story :)
>
> I could help if you guys want to write the part 3 about SCP :)
>
> Thanks.
>
> On Wed, Sep 8, 2021 at 2:27 AM, Viraj Jasani <vjas...@apache.org> wrote:
>
> > As some HBase users are still running HBase 1.x versions in their production environments, and branch-1 is trending toward EOL, now is really the right time to evaluate and understand the features and core design changes provided by HBase 2.x versions.
> >
> > As the majority of us are already aware, one of the key features with significant architectural changes provided by HBase 2 is AssignmentManagerV2 (AMv2).
> >
> > However, we don't seem to have one place explaining 1) *the evolution of AM* and 2) how it manages region assignments with better scalability, reliability and fault-tolerance.
> >
> > Keeping this in mind, Andrew and I have published a two-part series of blog posts explaining this evolution. Part 1 provides a) a basic introduction to HBase concepts, and b) AM and its shortcomings in previous versions that AMv2 is trying to resolve. Part 2 provides detailed info about Pv2 and how AMv2 leverages it, and also state diagrams explaining some of the complex region assignment workflows. The intention of the state diagrams is for devs/users to be able to a) understand region assignment workflows in depth, b) walk through the code more easily, and c) debug and root-cause issues with better knowledge.
> >
> > Part 1:
> > https://engineering.salesforce.com/evolution-of-region-assignment-in-the-apache-hbase-architecture-part-1-c43b1becc522
> >
> > Part 2:
> > https://engineering.salesforce.com/evolution-of-region-assignment-in-the-apache-hbase-architecture-part-2-9568fb3790b