Thanks for your reply and clarification!

So it sounds like a fencing mechanism?

I'd also like to look for JIRAs about this issue, that is,
coordination during master switchover. Maybe something like this one[1]?

Best,
tison.

[1] https://issues.apache.org/jira/browse/HBASE-5549


Wellington Chevreuil <wellington.chevre...@gmail.com> wrote on Thu, Jun 6, 2019 at 10:15 PM:

> Hey Zili,
>
> Besides what Duo explained previously, just clarifying some concepts in
> your previous description:
>
> > 1) RegionServer started a full gc and timed out on ZooKeeper. Thus
> > ZooKeeper regarded it as failed.
> >
> ZK just knows about sessions and clients, not the type of client connecting
> to it. Clients open a session in ZK, then ping ZK back periodically to keep
> the session alive. In the case of long full GC pauses, the client (the RS,
> in this case) will fail to ping back within the required period. At this
> point, ZK will *expire* the session.
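>
> A minimal Java sketch of that session lifecycle (the quorum address and
> znode path below are made up for illustration): the client registers an
> ephemeral znode, a background thread handles the periodic pings, and if
> the pings stall for longer than the session timeout, ZK expires the
> session server-side and deletes the ephemeral node.
>
> import org.apache.zookeeper.*;
>
> public class SessionExpiryDemo {
>     public static void main(String[] args) throws Exception {
>         ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> {
>             // The client only learns of the expiration when it manages to
>             // reconnect; the server expired the session long before this.
>             if (event.getState() == Watcher.Event.KeeperState.Expired) {
>                 System.out.println("session expired, must start over");
>             }
>         });
>         // The ephemeral node lives exactly as long as the session does.
>         zk.create("/demo/liveness", new byte[0],
>                 ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
>         Thread.sleep(Long.MAX_VALUE); // a long GC pause would stall the pings
>     }
> }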
>
> > 2) ZooKeeper launched a new RegionServer, and the new one started to serve.
> >
> ZK doesn't launch a new RS; it doesn't know about RSes, only client
> sessions. With the session expiration, the Master will be notified that an
> RS is potentially gone, and will start the process explained by Duo.
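>
> To make that notification concrete, here is a sketch of the watching side
> (the znode path is illustrative, not HBase's actual layout): the Master
> sets a watch on the RS's ephemeral znode, and when ZK expires the session
> and deletes the node, the watch fires with a NodeDeleted event.
>
> import org.apache.zookeeper.*;
>
> public class ServerTracker {
>     static void track(ZooKeeper zk, String rsZnode) throws Exception {
>         // exists() sets the watch whether or not the node is still there,
>         // so there is no race with the deletion.
>         zk.exists(rsZnode, event -> {
>             if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
>                 System.out.println(rsZnode + " is gone, start recovery");
>             }
>         });
>     }
> }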
>
> > 3) The old RegionServer finished gc and thought itself was still active
> > and serving.
> >
> What really happens here is that once the RS is back from GC, it will try
> to ping ZK again on that session; ZK will reject it because the session has
> already expired, and then the RS will kill itself.
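>
> And roughly what the returning client sees (a sketch, not HBase's actual
> code path): any operation on the expired session fails with
> SessionExpiredException, and since an expired session can never be
> revived, the only safe reaction is to stop serving.
>
> static void onBackFromGc(org.apache.zookeeper.ZooKeeper zk) {
>     try {
>         zk.exists("/demo/liveness", false); // any call on the old session
>     } catch (org.apache.zookeeper.KeeperException.SessionExpiredException e) {
>         // A fresh ZooKeeper handle would be a brand-new identity, so the
>         // process aborts instead -- this is the "kill itself" step.
>         System.exit(1);
>     } catch (Exception retriable) {
>         // connection loss etc. would be retried, omitted here
>     }
> }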
>
> On Thu, Jun 6, 2019 at 14:58, 张铎 (Duo Zhang) <palomino...@gmail.com>
> wrote:
>
> > Once an RS is started, it will create its wal directory and start to
> > write wals into it. And if the master thinks an RS is dead, it will
> > rename the wal directory of that RS and call recover lease on all the
> > wal files under the directory to make sure that they are all closed. So
> > even when the RS comes back after a long GC, before it kills itself
> > because of the SessionExpiredException, it can not accept any write
> > requests any more: its old wal file is closed, and the wal directory is
> > also gone, so it can not create new wal files either.
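> >
> > A sketch of that fencing in plain HDFS calls (the "-splitting" suffix
> > mirrors what HBase uses, but treat the details as illustrative): rename
> > first so the old RS can not create new wal files, then recover the lease
> > on every existing wal so the old writer can never append again.
> >
> > import org.apache.hadoop.fs.FileStatus;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.hdfs.DistributedFileSystem;
> >
> > public class WalFencingSketch {
> >     static void fence(DistributedFileSystem fs, Path walDir) throws Exception {
> >         // 1) Rename: the wal directory vanishes from under the dead RS.
> >         Path splitting = new Path(walDir + "-splitting");
> >         fs.rename(walDir, splitting);
> >         // 2) Recover leases: force-close each wal so the old writer is
> >         //    fenced off. recoverLease() returns false until recovery
> >         //    completes, so real code would poll it.
> >         for (FileStatus f : fs.listStatus(splitting)) {
> >             fs.recoverLease(f.getPath());
> >         }
> >     }
> > }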
> >
> > Of course, you may still read from the dead RS at this moment, so
> > theoretically you could read stale data, which means HBase can not
> > guarantee ‘external consistency’.
> >
> > Hope this solves your problem.
> >
> > Thanks.
> >
> > Zili Chen <wander4...@gmail.com> wrote on Thu, Jun 6, 2019 at 9:38 PM:
> >
> > > Hi,
> > >
> > > Recently, in the book ZooKeeper: Distributed Process Coordination
> > > (Chapter 5, section 5.3), I found a paragraph mentioning that HBase
> > > once suffered from:
> > >
> > > 1) RegionServer started a full gc and timed out on ZooKeeper. Thus
> > > ZooKeeper regarded it as failed.
> > > 2) ZooKeeper launched a new RegionServer, and the new one started to
> > > serve.
> > > 3) The old RegionServer finished gc and thought itself was still active
> > > and serving.
> > >
> > > I'm interested in it and would like to know how the HBase community
> > > overcame this issue.
> > >
> > > Best,
> > > tison.
> > >
> >
>
