Re: What is Dead Region Servers and how to clear them up?

Yu Li Sun, 28 May 2017 10:14:46 -0700

Thanks for the additional information Jeff, interesting scenario.

Let me re-explain: a dead server means that this node (or container, in your
case) once ran a regionserver process but no longer does. It doesn't
indicate the current health of the cluster; it only records that fact and
alerts the operator to check those nodes/containers and find out what
killed them. But I admit this can cause confusion.


And as I proposed in my previous mail, I think in the Yarn/Mesos deployment
scenario we need to supply a command to clear those dead servers. To be
more specific: once all the recovery actions are done, whether automatic
ones like WAL splitting and ZK cleanup or manual ones like hbck -repair,
and we're sure we no longer need to care about those dead servers, we
could remove them from the master UI. If this is what you're after, I
could open a JIRA and get the work done (smile).
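
To make the idea concrete, here is a minimal sketch in Python of what such
a command could do. It is only a toy model under my own assumptions, not
HBase code: the class DeadServerTracker and all of its method names are
hypothetical, and "recovering" simply stands for "WAL split or other
cleanup is still pending for this server".

```python
from dataclasses import dataclass, field


@dataclass
class DeadServerTracker:
    """Toy model of a master's dead-server list (hypothetical, not HBase code)."""
    online: set = field(default_factory=set)
    dead: set = field(default_factory=set)
    # Servers whose recovery (e.g. WAL splitting) has not finished yet.
    recovering: set = field(default_factory=set)

    def server_started(self, name):
        # A server that reports in again drops off the dead list.
        self.online.add(name)
        self.dead.discard(name)

    def server_expired(self, name):
        # The process went away: remember it as dead and mark it as recovering.
        self.online.discard(name)
        self.dead.add(name)
        self.recovering.add(name)

    def recovery_finished(self, name):
        self.recovering.discard(name)

    def clear_dead_servers(self, names=None):
        """Remove dead servers that need no further attention.

        Returns (cleared, skipped); servers still recovering are skipped.
        """
        candidates = set(self.dead) if names is None else set(names) & self.dead
        skipped = candidates & self.recovering
        cleared = candidates - skipped
        self.dead -= cleared
        return sorted(cleared), sorted(skipped)


tracker = DeadServerTracker()
for rs in ("rs1", "rs2", "rs3"):
    tracker.server_started(rs)
tracker.server_expired("rs1")
tracker.server_expired("rs2")
tracker.recovery_finished("rs1")
print(tracker.clear_dead_servers())  # (['rs1'], ['rs2'])
```

The point of the skipped list is that the command should refuse to drop a
server whose recovery is still in progress, so the operator only clears
entries that really need no more attention.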

Let me know your thoughts, thanks.

Best Regards,
Yu

On 28 May 2017 at 23:26, jeff saremi <jeffsar...@hotmail.com> wrote:

> I think more and more deployments are being made dynamic using Yarn and
> Mesos. Going back to a fixed set of servers is not going to eliminate the
> problem i'm talking about. Making assumptions that the region servers come
> back on the same node is too optimistic.
>
> Let me try this a different way to see if I can make my point:
>
> - A cluster is either healthy or not healthy.
>
> - If the cluster is unhealthy, then it can be made healthy using either
> external tools (hbck) or the internal agreement of master-regionserver. If
> this is not achievable, then the cluster must be discarded.
>
> - The cluster is now healthy, meaning that no information should be
> lingering on such as dead server, dead regions, or whatever anywhere in the
> system. And moreover no such information must ever be brought up to the
> attention of the administrators of the cluster.
>
> - If there is such information still hiding in some place in the system,
> then it only means that the mechanism (hbck or hbase itself) that made the
> system healthy did not complete its job in cleaning up what is needed to be
> cleaned up
>
>
>
> ------------------------------
> *From:* Ted Yu <yuzhih...@gmail.com>
> *Sent:* Saturday, May 27, 2017 1:54:50 PM
>
> *To:* dev@hbase.apache.org
> *Cc:* Hbase-User; Yu Li
> *Subject:* Re: What is Dead Region Servers and how to clear them up?
>
> The involvement of Yarn can explain why you observed relatively more dead
> servers (compared to traditional deployment).
>
> Suppose in first run, Yarn allocates containers for region servers on a set
> of nodes. Subsequently, Yarn may choose nodes (for the same number of
> servers) which are not exactly the same nodes in the previous run.
>
> What Yu Li described as restarting server is on the same node where the
> server was running previously.
>
> Cheers
>
> On Sat, May 27, 2017 at 11:59 AM, jeff saremi <jeffsar...@hotmail.com>
> wrote:
>
> > Yes. we don't have fixed servers with the exceptions of ZK machines.
> >
> > We have 3 yarn jobs one for each of master, region, and thrift servers
> > each launched separately with different number of nodes. I hope that's
> not
> > what is causing problems.
> >
> > ________________________________
> > From: Ted Yu <yuzhih...@gmail.com>
> > Sent: Saturday, May 27, 2017 11:27:36 AM
> > To: dev@hbase.apache.org
> > Cc: Hbase-User; Yu Li
> > Subject: Re: What is Dead Region Servers and how to clear them up?
> >
> > Jeff:
> > bq. We run our cluster on Yarn and upon restarting jobs in Yarn
> >
> > Can you clarify a bit more - are you running hbase processes inside Yarn
> > container ?
> >
> > Cheers
> >
> > On Sat, May 27, 2017 at 10:58 AM, jeff saremi <jeffsar...@hotmail.com>
> > wrote:
> >
> > > Thanks @Yu Li
> > >
> > > You are absolutely correct. Dead RS's will happen regardless. My issue
> > > with this is more "psychological". If I have done everything needed to
> be
> > > done to ensure that RSs are running fine and regions are assigned and
> > such
> > > and hbck reports are consistent then how is this list of dead region
> > > servers helping me? other than causing anxiety?
> > > We run our cluster on Yarn and upon restarting jobs in Yarn we get a
> lot
> > > of inconsistent, unavailable regions. (and this is only one scenario).
> > Then
> > > we'll run hbck with -repair option (and i was wrong here too: hbck does
> > > take care of some issues) and restart the master(s). After that there
> > seem
> > > to be no more issues other than dead region servers being still
> reported.
> > > We should not have this anymore after having taken all precautions to
> > reset
> > > the system properly.
> > >
> > > I was trying to write something similar to what hbck would do to take
> > > care of this specific issue. I wouldn't mind contributing to the hbck
> > > itself either. However I needed to understand where this list comes
> from
> > > and why. These are things that I could possibly automate (after all the
> > > other steps i mentioned):
> > > - check the ZK list of RS's. If any of the dead RS's found, remove node
> > >
> > > - check hdfs root WALs folder. If there are any with the dead RS's name
> > in
> > > them, delete them. (here we need to take precaution as @Enis mentioned;
> > > possibly if the node timestamp has not been changed in a while)
> > >
> > > - what else? These steps are not enough
> > >
> > > For instance, we currently have 17 servers being reported as dead. Only
> > > 3-4 of them show up in hdfs with "-splitting" in their WALS folder.
> Where
> > > do the rest come from?
> > > thanks
> > >
> > > Jeff
> > >
> > > ________________________________
> > > From: Yu Li <car...@gmail.com>
> > > Sent: Friday, May 26, 2017 10:18:09 PM
> > > To: Hbase-User
> > > Cc: dev@hbase.apache.org
> > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > >
> > > bq. And having a list of "dead" servers is not a healthy thing to have.
> > > I don't think the existence of "dead" servers means the service is
> > > unhealthy, especially in a distributed system. Besides hbase, HDFS also
> > > shows Live and Dead nodes in namenode UI, and people won't regard HDFS
> as
> > > unhealthy if there're dead nodes.
> > >
> > > In HBase, if some RS aborts due to unexpected issue like long GC,
> > normally
> > > we will restart it and once it's restarted and report to master, it
> will
> > be
> > > removed from the dead server list. So when we observed dead server in
> > > Master UI, the first thing is to check the root cause and restart it if
> > it
> > > won't cause further issue.
> > >
> > > However, sometimes we may find the server aborted due to some hardware
> > > failure and we must offline the server for repairing. Or we need to
> move
> > > some nodes to join other clusters so we stop the RS process on
> purpose. I
> > > guess this is the case you're dealing with @jeff? If so, I think it's a
> > > reasonable requirement that we supply a command in hbase to clear the dead
> > > nodes when the operator is sure they are no longer serving.
> > >
> > > Best Regards,
> > > Yu
> > >
> > > On 27 May 2017 at 04:49, Enis Söztutar <enis....@gmail.com> wrote:
> > >
> > > > In general if there are no regions in transition, the WAL recovery
> has
> > > > already finished. You can watch the master's log4j log for those
> > entries,
> > > > but the lack of regions in transition is the easiest way to identify.
> > > >
> > > > Enis
> > > >
> > > > On Fri, May 26, 2017 at 12:14 PM, jeff saremi <
> jeffsar...@hotmail.com>
> > > > wrote:
> > > >
> > > > > thanks Enis
> > > > >
> > > > > I apologize for earlier
> > > > >
> > > > > This looks very close to our issue
> > > > > When you say: "there is no "WAL" recovery is happening", how could
> i
> > > make
> > > > > sure of that? Thanks
> > > > >
> > > > > Jeff
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Enis Söztutar <enis....@gmail.com>
> > > > > Sent: Friday, May 26, 2017 11:47:11 AM
> > > > > To: dev@hbase.apache.org
> > > > > Cc: hbase-user
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > Jeff, please be respectful to the people who are trying to help you.
> > > This
> > > > is
> > > > > not acceptable behavior and will result in consequences next time.
> > > > >
> > > > > On the specific issue that you are seeing, it is highly likely that
> > you
> > > > are
> > > > > seeing this: https://issues.apache.org/jira/browse/HBASE-14223.
> > Having
> > > > > those servers in the dead servers list will not hurt operations, or
> > > > > runtimes or anything else. Possibly for those servers, there is no new
> > > > > instance of the regionserver running on the same host and ports.
> > > > >
> > > > > If you want to manually clean out these, you can follow these
> steps:
> > > > >  - Manually move these directories from the file system:
> > > > > <hbase_hdfs>/WALs/dead-server-splitting
> > > > >  - ONLY do this if you are sure that there is no "WAL" recovery is
> > > > > happening, and there is only WAL files with names containing
> ".meta."
> > > > >  - Restart HBase master.
> > > > >
> > > > > Upon restart, you can see that these do not show up anymore. For
> more
> > > > > technical details, please refer to the jira link.
> > > > >
> > > > > Enis
> > > > >
> > > > > On Fri, May 26, 2017 at 11:03 AM, jeff saremi <
> > jeffsar...@hotmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thank you for the GFY answer
> > > > > >
> > > > > > And i guess to figure out how to fix these I can always go
> through
> > > the
> > > > > > HBase source code.
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Dima Spivak <dimaspi...@apache.org>
> > > > > > Sent: Friday, May 26, 2017 9:58:00 AM
> > > > > > To: hbase-user
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them
> up?
> > > > > >
> > > > > > Sending this back to the user mailing list.
> > > > > >
> > > > > > RegionServers can die for many reasons. Looking at your
> > RegionServer
> > > > log
> > > > > > files should give hints as to why it's happening.
> > > > > >
> > > > > >
> > > > > > -Dima
> > > > > >
> > > > > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <
> > jeffsar...@hotmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > I had posted this to the user mailing list and I have not got
> any
> > > > > direct
> > > > > > > answer to my question.
> > > > > > >
> > > > > > > Where do dead RS's come from and how can they be cleaned up?
> > > Someone
> > > > in
> > > > > > > the midst of developers should know this.
> > > > > > >
> > > > > > > thanks
> > > > > > >
> > > > > > > Jeff
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
> > > > > > > Sent: Thursday, May 25, 2017 10:23:17 AM
> > > > > > > To: u...@hbase.apache.org
> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them
> > up?
> > > > > > >
> > > > > > > I'm still looking to get hints on how to remove the dead
> regions.
> > > > > thanks
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
> > > > > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
> > > > > > > To: u...@hbase.apache.org
> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them
> > up?
> > > > > > >
> > > > > > > i'm trying to eliminate the dead region servers.
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Ted Yu <yuzhih...@gmail.com>
> > > > > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
> > > > > > > To: u...@hbase.apache.org
> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them
> > up?
> > > > > > >
> > > > > > > bq. running hbck (many times
> > > > > > >
> > > > > > > Can you describe the specific inconsistencies you were trying
> to
> > > > > resolve
> > > > > > ?
> > > > > > > Depending on the inconsistencies, advice can be given on the
> best
> > > > known
> > > > > > > hbck command arguments to use.
> > > > > > >
> > > > > > > Feel free to pastebin master log if needed.
> > > > > > >
> > > > > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <
> > > > jeffsar...@hotmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > these are the things I have done so far:
> > > > > > > >
> > > > > > > >
> > > > > > > > - restarting master (few times)
> > > > > > > >
> > > > > > > > - running hbck (many times; this tool does not seem to be
> doing
> > > > > > anything
> > > > > > > > at all)
> > > > > > > >
> > > > > > > > - checking the list of region servers in ZK (none of the dead
> > > ones
> > > > > are
> > > > > > > > listed here)
> > > > > > > >
> > > > > > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead
> > ones
> > > > > only 3
> > > > > > > > are listed here with "-splitting" at the end of their names and they
> > > > > > > > contain one single file like: 1493846660401..meta.1493922323600.meta
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
> > > > > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
> > > > > > > > To: u...@hbase.apache.org
> > > > > > > > Subject: What is Dead Region Servers and how to clear them
> up?
> > > > > > > >
> > > > > > > > Apparently having dead region servers is so common that a
> > section
> > > > of
> > > > > > the
> > > > > > > > master console is dedicated to that?
> > > > > > > > How can we clean this up (preferably in an automated
> fashion)?
> > > Why
> > > > > > isn't
> > > > > > > > this being done by HBase automatically?
> > > > > > > >
> > > > > > > >
> > > > > > > > thanks
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
