Re: What is Dead Region Servers and how to clear them up?

Yu Li Tue, 30 May 2017 04:34:48 -0700

Thanks for the confirmation Jeff, have opened HBASE-18131
<https://issues.apache.org/jira/browse/HBASE-18131> for this, FYI.


Best Regards,
Yu

On 29 May 2017 at 03:48, jeff saremi <jeffsar...@hotmail.com> wrote:

> Yes Yu. What you're suggesting would work for us too and would still be
> appreciated.
>
> thanks a lot
>
> jeff
> ------------------------------
> *From:* Yu Li <car...@gmail.com>
> *Sent:* Sunday, May 28, 2017 10:13:38 AM
> *To:* jeff saremi
> *Cc:* dev@hbase.apache.org; hbase-user
>
> *Subject:* Re: What is Dead Region Servers and how to clear them up?
>
> Thanks for the additional information Jeff, interesting scenario.
>
> Let me re-explain: a dead server means that on this node (or container, in
> your case) there was a regionserver process once but there is none now. This
> doesn't indicate the current health state of the cluster; it only records
> the fact and alerts the operator to check those nodes/containers and find
> out what caused the process to die. But I admit that this might cause
> confusion.
>
> And as I proposed in my previous mail, I think in the Yarn/Mesos deployment
> scenario we need to supply a command to clear those dead servers. To be
> more specific: once all the actions are done, whether automatic ones like
> WAL splitting and ZK cleanup or manual ones like hbck -repair, and as long
> as we're sure we no longer need to care about those dead servers, we could
> remove them from the master UI. If this satisfies what you need, I could
> open a JIRA and get the work done (smile).
>
> Let me know your thoughts, thanks.
>
> Best Regards,
> Yu
>
> On 28 May 2017 at 23:26, jeff saremi <jeffsar...@hotmail.com> wrote:
>
>> I think more and more deployments are being made dynamic using Yarn and
>> Mesos. Going back to a fixed set of servers is not going to eliminate the
>> problem i'm talking about. Making assumptions that the region servers come
>> back on the same node is too optimistic.
>>
>> Let me try this a different way to see if I can make my point:
>>
>> - A cluster is either healthy or not healthy.
>>
>> - If the cluster is unhealthy, then it can be made healthy using either
>> external tools (hbck) or the internal agreement of master-regionserver. If
>> this is not achievable, then the cluster must be discarded.
>>
>> - The cluster is now healthy, meaning that no information such as dead
>> servers, dead regions, or anything else should be lingering anywhere in
>> the system. Moreover, no such information must ever be brought to the
>> attention of the administrators of the cluster.
>>
>> - If such information is still hiding somewhere in the system, it only
>> means that the mechanism (hbck or hbase itself) that made the system
>> healthy did not finish cleaning up what needed to be cleaned up.
>>
>>
>>
>> ------------------------------
>> *From:* Ted Yu <yuzhih...@gmail.com>
>> *Sent:* Saturday, May 27, 2017 1:54:50 PM
>>
>> *To:* dev@hbase.apache.org
>> *Cc:* Hbase-User; Yu Li
>> *Subject:* Re: What is Dead Region Servers and how to clear them up?
>>
>> The involvement of Yarn can explain why you observed relatively more dead
>> servers (compared to traditional deployment).
>>
>> Suppose in the first run, Yarn allocates containers for region servers on
>> a set of nodes. Subsequently, Yarn may choose nodes (for the same number of
>> servers) which are not exactly the same nodes as in the previous run.
>>
>> What Yu Li described, restarting the server, happens on the same node
>> where the server was running previously.
>>
>> Cheers
>>
>> On Sat, May 27, 2017 at 11:59 AM, jeff saremi <jeffsar...@hotmail.com>
>> wrote:
>>
>> > Yes. We don't have fixed servers, with the exception of the ZK machines.
>> >
>> > We have 3 Yarn jobs, one each for the master, region, and thrift servers,
>> > each launched separately with a different number of nodes. I hope that's
>> > not what is causing problems.
>> >
>> > ________________________________
>> > From: Ted Yu <yuzhih...@gmail.com>
>> > Sent: Saturday, May 27, 2017 11:27:36 AM
>> > To: dev@hbase.apache.org
>> > Cc: Hbase-User; Yu Li
>> > Subject: Re: What is Dead Region Servers and how to clear them up?
>> >
>> > Jeff:
>> > bq. We run our cluster on Yarn and upon restarting jobs in Yarn
>> >
>> > Can you clarify a bit more - are you running hbase processes inside Yarn
>> > container ?
>> >
>> > Cheers
>> >
>> > On Sat, May 27, 2017 at 10:58 AM, jeff saremi <jeffsar...@hotmail.com>
>> > wrote:
>> >
>> > > Thanks @Yu Li <car...@gmail.com>
>> > >
>> > > You are absolutely correct. Dead RS's will happen regardless. My issue
>> > > with this is more "psychological". If I have done everything needed to
>> > > ensure that RSs are running fine, regions are assigned, and hbck
>> > > reports are consistent, then how is this list of dead region servers
>> > > helping me, other than causing anxiety?
>> > > We run our cluster on Yarn and upon restarting jobs in Yarn we get a lot
>> > > of inconsistent, unavailable regions (and this is only one scenario).
>> > > Then we run hbck with the -repair option (and I was wrong here too: hbck
>> > > does take care of some issues) and restart the master(s). After that
>> > > there seem to be no more issues other than dead region servers still
>> > > being reported. We should not have this anymore after having taken all
>> > > precautions to reset the system properly.
>> > >
>> > > I was trying to write something similar to what hbck would do to take
>> > > care of this specific issue. I wouldn't mind contributing to hbck
>> > > itself either. However I needed to understand where this list comes
>> > > from and why. These are things that I could possibly automate (after
>> > > all the other steps I mentioned):
>> > > - check the ZK list of RS's. If any of the dead RS's are found, remove
>> > > the node
>> > >
>> > > - check the hdfs root WALs folder. If there are any entries with the
>> > > dead RS's name in them, delete them. (Here we need to take precautions,
>> > > as @Enis mentioned; possibly only if the node timestamp has not changed
>> > > in a while.)
>> > >
>> > > - what else? These steps are not enough
>> > >
>> > > For instance, we currently have 17 servers being reported as dead. Only
>> > > 3-4 of them show up in hdfs with "-splitting" in their WALs folder.
>> > > Where do the rest come from?
>> > > thanks
>> > >
>> > > Jeff
>> > >
>> > > ________________________________
>> > > From: Yu Li <car...@gmail.com>
>> > > Sent: Friday, May 26, 2017 10:18:09 PM
>> > > To: Hbase-User
>> > > Cc: dev@hbase.apache.org
>> > > Subject: Re: What is Dead Region Servers and how to clear them up?
>> > >
>> > > bq. And having a list of "dead" servers is not a healthy thing to have.
>> > > I don't think the existence of "dead" servers means the service is
>> > > unhealthy, especially in a distributed system. Besides hbase, HDFS also
>> > > shows Live and Dead nodes in the namenode UI, and people won't regard
>> > > HDFS as unhealthy if there are dead nodes.
>> > >
>> > > In HBase, if some RS aborts due to an unexpected issue like a long GC,
>> > > normally we will restart it, and once it has restarted and reported to
>> > > the master, it will be removed from the dead server list. So when we
>> > > observe a dead server in the Master UI, the first thing to do is check
>> > > the root cause and restart the server if that won't cause further
>> > > issues.
>> > >
>> > > However, sometimes we may find the server aborted due to some hardware
>> > > failure and we must take the server offline for repair. Or we need to
>> > > move some nodes to other clusters, so we stop the RS process on
>> > > purpose. I guess this is the case you're dealing with @jeff? If so, I
>> > > think it's a reasonable requirement that we supply a command in hbase
>> > > to clear the dead nodes once the operator is sure they no longer serve.
>> > >
>> > > Best Regards,
>> > > Yu
>> > >
>> > > On 27 May 2017 at 04:49, Enis Söztutar <enis....@gmail.com> wrote:
>> > >
>> > > > In general if there are no regions in transition, the WAL recovery has
>> > > > already finished. You can watch the master's log4j log for those
>> > > > entries, but the lack of regions in transition is the easiest way to
>> > > > identify this.
>> > > >
>> > > > Enis
>> > > >
>> > > > On Fri, May 26, 2017 at 12:14 PM, jeff saremi <
>> jeffsar...@hotmail.com>
>> > > > wrote:
>> > > >
>> > > > > thanks Enis
>> > > > >
>> > > > > I apologize for earlier
>> > > > >
>> > > > > This looks very close to our issue
>> > > > > When you say: "there is no "WAL" recovery is happening", how could
>> > > > > I make sure of that? Thanks
>> > > > >
>> > > > > Jeff
>> > > > >
>> > > > >
>> > > > > ________________________________
>> > > > > From: Enis Söztutar <enis....@gmail.com>
>> > > > > Sent: Friday, May 26, 2017 11:47:11 AM
>> > > > > To: dev@hbase.apache.org
>> > > > > Cc: hbase-user
>> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
>> > > > >
>> > > > > Jeff, please be respectful to the people who are trying to help you.
>> > > > > This is not acceptable behavior and will result in consequences next
>> > > > > time.
>> > > > >
>> > > > > On the specific issue that you are seeing, it is highly likely that
>> > > > > you are seeing this:
>> > > > > https://issues.apache.org/jira/browse/HBASE-14223. Having those
>> > > > > servers in the dead servers list will not hurt operations, runtimes,
>> > > > > or anything else. Possibly for those servers, there is no new
>> > > > > instance of the regionserver running on the same host and port.
>> > > > >
>> > > > > If you want to manually clean these out, you can follow these steps:
>> > > > >  - Manually move these directories from the file system:
>> > > > > <hbase_hdfs>/WALs/dead-server-splitting
>> > > > >  - ONLY do this if you are sure that no "WAL" recovery is happening,
>> > > > > and there are only WAL files with names containing ".meta."
>> > > > >  - Restart HBase master.
>> > > > >
>> > > > > Upon restart, you can see that these do not show up anymore. For
>> > > > > more technical details, please refer to the jira link.
>> > > > >
>> > > > > Enis
>> > > > >
>> > > > > On Fri, May 26, 2017 at 11:03 AM, jeff saremi <
>> > jeffsar...@hotmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Thank you for the GFY answer
>> > > > > >
>> > > > > > And i guess to figure out how to fix these I can always go through
>> > > > > > the HBase source code.
>> > > > > >
>> > > > > >
>> > > > > > ________________________________
>> > > > > > From: Dima Spivak <dimaspi...@apache.org>
>> > > > > > Sent: Friday, May 26, 2017 9:58:00 AM
>> > > > > > To: hbase-user
>> > > > > > Subject: Re: What is Dead Region Servers and how to clear them
>> up?
>> > > > > >
>> > > > > > Sending this back to the user mailing list.
>> > > > > >
>> > > > > > RegionServers can die for many reasons. Looking at your
>> > > > > > RegionServer log files should give hints as to why it's happening.
>> > > > > >
>> > > > > >
>> > > > > > -Dima
>> > > > > >
>> > > > > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <
>> > jeffsar...@hotmail.com
>> > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > I had posted this to the user mailing list and I have not got
>> > > > > > > any direct answer to my question.
>> > > > > > >
>> > > > > > > Where do dead RS's come from and how can they be cleaned up?
>> > > > > > > Someone in the midst of developers should know this.
>> > > > > > >
>> > > > > > > thanks
>> > > > > > >
>> > > > > > > Jeff
>> > > > > > >
>> > > > > > > ________________________________
>> > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
>> > > > > > > Sent: Thursday, May 25, 2017 10:23:17 AM
>> > > > > > > To: u...@hbase.apache.org
>> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them
>> > up?
>> > > > > > >
>> > > > > > > I'm still looking to get hints on how to remove the dead
>> > > > > > > regions. thanks
>> > > > > > >
>> > > > > > > ________________________________
>> > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
>> > > > > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
>> > > > > > > To: u...@hbase.apache.org
>> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them
>> > up?
>> > > > > > >
>> > > > > > > i'm trying to eliminate the dead region servers.
>> > > > > > >
>> > > > > > > ________________________________
>> > > > > > > From: Ted Yu <yuzhih...@gmail.com>
>> > > > > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
>> > > > > > > To: u...@hbase.apache.org
>> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them
>> > up?
>> > > > > > >
>> > > > > > > bq. running hbck (many times
>> > > > > > >
>> > > > > > > Can you describe the specific inconsistencies you were trying
>> > > > > > > to resolve? Depending on the inconsistencies, advice can be
>> > > > > > > given on the best known hbck command arguments to use.
>> > > > > > >
>> > > > > > > Feel free to pastebin master log if needed.
>> > > > > > >
>> > > > > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <
>> > > > jeffsar...@hotmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > these are the things I have done so far:
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > - restarting master (few times)
>> > > > > > > >
>> > > > > > > > - running hbck (many times; this tool does not seem to be
>> > > > > > > > doing anything at all)
>> > > > > > > >
>> > > > > > > > - checking the list of region servers in ZK (none of the dead
>> > > > > > > > ones are listed here)
>> > > > > > > >
>> > > > > > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead
>> > > > > > > > ones only 3 are listed here with "-splitting" at the end of
>> > > > > > > > their names and they contain one single file like:
>> > > > > > > > 1493846660401..meta.1493922323600.meta
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > ________________________________
>> > > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
>> > > > > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
>> > > > > > > > To: u...@hbase.apache.org
>> > > > > > > > Subject: What is Dead Region Servers and how to clear them
>> up?
>> > > > > > > >
>> > > > > > > > Apparently having dead region servers is so common that a
>> > > > > > > > section of the master console is dedicated to that?
>> > > > > > > > How can we clean this up (preferably in an automated fashion)?
>> > > > > > > > Why isn't this being done by HBase automatically?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > thanks
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
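
The conditions given in the thread for manually clearing "-splitting" WAL
directories (no regions in transition, and only files whose names contain
".meta.") can be sketched as a small pre-check. This is only an illustration,
not part of HBase or hbck: the function names, the dictionary input, and the
sample host names are hypothetical, and it assumes the operator has already
collected the directory listing (for example with "hdfs dfs -ls") and the
regions-in-transition count from the master UI. It also assumes WAL directory
names follow the "host,port,startcode" server-name form with an optional
"-splitting" suffix.

```python
from dataclasses import dataclass

SPLITTING_SUFFIX = "-splitting"


@dataclass(frozen=True)
class ServerName:
    """A region server identity in the assumed 'host,port,startcode' form."""
    host: str
    port: int
    startcode: int


def parse_wal_dir(dir_name):
    """Split a WAL directory name into (ServerName, is_splitting)."""
    is_splitting = dir_name.endswith(SPLITTING_SUFFIX)
    base = dir_name[:-len(SPLITTING_SUFFIX)] if is_splitting else dir_name
    # Split from the right so a host name containing commas cannot shift fields.
    host, port, startcode = base.rsplit(",", 2)
    return ServerName(host, int(port), int(startcode)), is_splitting


def removable_splitting_dirs(wal_listing, regions_in_transition):
    """Return the '-splitting' directories that pass the thread's conditions.

    wal_listing maps a directory name under <hbase_hdfs>/WALs to the list of
    file names inside it (hypothetical input gathered by the operator).
    """
    # Regions in transition suggest WAL recovery may still be running.
    if regions_in_transition > 0:
        return []
    candidates = []
    for dir_name, files in sorted(wal_listing.items()):
        _, is_splitting = parse_wal_dir(dir_name)
        # Keep only directories whose files are all meta WALs. Note that an
        # empty directory also passes this check.
        if is_splitting and all(".meta." in name for name in files):
            candidates.append(dir_name)
    return candidates
```

The function only reports candidates; moving the directories and restarting
the master remain manual steps, as described earlier in the thread.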
