Thanks for the confirmation Jeff, have opened HBASE-18131 <https://issues.apache.org/jira/browse/HBASE-18131> for this, FYI.
Best Regards,
Yu

On 29 May 2017 at 03:48, jeff saremi <jeffsar...@hotmail.com> wrote:

> Yes Yu. What you're suggesting would work for us too and would still be appreciated.
>
> thanks a lot
>
> jeff
> ------------------------------
> *From:* Yu Li <car...@gmail.com>
> *Sent:* Sunday, May 28, 2017 10:13:38 AM
> *To:* jeff saremi
> *Cc:* dev@hbase.apache.org; hbase-user
> *Subject:* Re: What is Dead Region Servers and how to clear them up?
>
> Thanks for the additional information Jeff, interesting scenario.
>
> Let me re-explain: dead server means that on this node (or container, in your case) there was a regionserver process once but not now. This doesn't indicate the current health state of the cluster; it only states that fact and alerts the operator to check those nodes/containers to see what problem caused them to die. But I admit that these might cause confusion.
>
> And as I proposed in the previous mail, I think in the Yarn/Mesos deployment scenario we need to supply a command to clear those dead servers. To be more specific, after all the actions, no matter automatic ones like WAL split and zk clearance, or the manual ones like hbck -repair, as long as we're sure we don't need to care about those dead servers any more, we could remove them from master UI. If this satisfies what you desire, I could open a JIRA and get the work done (smile).
>
> Let me know your thoughts, thanks.
>
> Best Regards,
> Yu
>
> On 28 May 2017 at 23:26, jeff saremi <jeffsar...@hotmail.com> wrote:
>
>> I think more and more deployments are being made dynamic using Yarn and Mesos. Going back to a fixed set of servers is not going to eliminate the problem i'm talking about. Making assumptions that the region servers come back on the same node is too optimistic.
>>
>> Let me try this a different way to see if I can make my point:
>>
>> - A cluster is either healthy or not healthy.
>>
>> - If the cluster is unhealthy, then it can be made healthy using either external tools (hbck) or the internal agreement of master-regionserver. If this is not achievable, then the cluster must be discarded.
>>
>> - The cluster is now healthy, meaning that no information should be lingering on such as dead server, dead regions, or whatever anywhere in the system. And moreover no such information must ever be brought up to the attention of the administrators of the cluster.
>>
>> - If there is such information still hiding in some place in the system, then it only means that the mechanism (hbck or hbase itself) that made the system healthy did not complete its job in cleaning up what is needed to be cleaned up
>>
>> ------------------------------
>> *From:* Ted Yu <yuzhih...@gmail.com>
>> *Sent:* Saturday, May 27, 2017 1:54:50 PM
>> *To:* dev@hbase.apache.org
>> *Cc:* Hbase-User; Yu Li
>> *Subject:* Re: What is Dead Region Servers and how to clear them up?
>>
>> The involvement of Yarn can explain why you observed relatively more dead servers (compared to traditional deployment).
>>
>> Suppose in first run, Yarn allocates containers for region servers on a set of nodes. Subsequently, Yarn may choose nodes (for the same number of servers) which are not exactly the same nodes in the previous run.
>>
>> What Yu Li described as restarting server is on the same node where the server was running previously.
>>
>> Cheers
>>
>> On Sat, May 27, 2017 at 11:59 AM, jeff saremi <jeffsar...@hotmail.com> wrote:
>>
>> > Yes. we don't have fixed servers with the exceptions of ZK machines.
>> >
>> > We have 3 yarn jobs one for each of master, region, and thrift servers each launched separately with different number of nodes. I hope that's not what is causing problems.
>> >
>> > ________________________________
>> > From: Ted Yu <yuzhih...@gmail.com>
>> > Sent: Saturday, May 27, 2017 11:27:36 AM
>> > To: dev@hbase.apache.org
>> > Cc: Hbase-User; Yu Li
>> > Subject: Re: What is Dead Region Servers and how to clear them up?
>> >
>> > Jeff:
>> > bq. We run our cluster on Yarn and upon restarting jobs in Yarn
>> >
>> > Can you clarify a bit more - are you running hbase processes inside Yarn container ?
>> >
>> > Cheers
>> >
>> > On Sat, May 27, 2017 at 10:58 AM, jeff saremi <jeffsar...@hotmail.com> wrote:
>> >
>> > > Thanks @Yu Li
>> > >
>> > > You are absolutely correct. Dead RS's will happen regardless. My issue with this is more "psychological". If I have done everything needed to be done to ensure that RSs are running fine and regions are assigned and such and hbck reports are consistent then how is this list of dead region servers helping me? other than causing anxiety?
>> > > We run our cluster on Yarn and upon restarting jobs in Yarn we get a lot of inconsistent, unavailable regions. (and this is only one scenario). Then we'll run hbck with -repair option (and i was wrong here too: hbck does take care of some issues) and restart the master(s). After that there seem to be no more issues other than dead region servers being still reported. We should not have this anymore after having taken all precautions to reset the system properly.
>> > >
>> > > I was trying to write something similar to what hbck would do to take care of this specific issue. I wouldn't mind contributing to the hbck itself either. However I needed to understand where this list comes from and why. These are things that I could possibly automate (after all the other steps i mentioned):
>> > > - check the ZK list of RS's.
>> > >   If any of the dead RS's are found, remove the node
>> > >
>> > > - check hdfs root WALs folder. If there are any with the dead RS's name in them, delete them. (here we need to take precaution as @Enis mentioned; possibly if the node timestamp has not been changed in a while)
>> > >
>> > > - what else? These steps are not enough
>> > >
>> > > For instance, we currently have 17 servers being reported as dead. Only 3-4 of them show up in hdfs with "-splitting" in their WALS folder. Where do the rest come from?
>> > > thanks
>> > >
>> > > Jeff
>> > >
>> > > ________________________________
>> > > From: Yu Li <car...@gmail.com>
>> > > Sent: Friday, May 26, 2017 10:18:09 PM
>> > > To: Hbase-User
>> > > Cc: dev@hbase.apache.org
>> > > Subject: Re: What is Dead Region Servers and how to clear them up?
>> > >
>> > > bq. And having a list of "dead" servers is not a healthy thing to have.
>> > > I don't think the existence of "dead" servers means the service is unhealthy, especially in a distributed system. Besides hbase, HDFS also shows Live and Dead nodes in namenode UI, and people won't regard HDFS as unhealthy if there're dead nodes.
>> > >
>> > > In HBase, if some RS aborts due to an unexpected issue like long GC, normally we will restart it, and once it's restarted and reports to master, it will be removed from the dead server list. So when we observe a dead server in Master UI, the first thing is to check the root cause and restart it if it won't cause further issue.
>> > >
>> > > However, sometimes we may find the server aborted due to some hardware failure and we must offline the server for repairing. Or we need to move some nodes to join other clusters so we stop the RS process on purpose. I guess this is the case you're dealing with @jeff?
>> > > If so, I think it's a reasonable requirement that we supply a command in hbase to clear the dead nodes when the operator is sure they no longer serve.
>> > >
>> > > Best Regards,
>> > > Yu
>> > >
>> > > On 27 May 2017 at 04:49, Enis Söztutar <enis....@gmail.com> wrote:
>> > >
>> > > > In general if there are no regions in transition, the WAL recovery has already finished. You can watch the master's log4j log for those entries, but the lack of regions in transition is the easiest way to identify.
>> > > >
>> > > > Enis
>> > > >
>> > > > On Fri, May 26, 2017 at 12:14 PM, jeff saremi <jeffsar...@hotmail.com> wrote:
>> > > >
>> > > > > thanks Enis
>> > > > >
>> > > > > I apologize for earlier
>> > > > >
>> > > > > This looks very close to our issue
>> > > > > When you say: "there is no "WAL" recovery is happening", how could i make sure of that? Thanks
>> > > > >
>> > > > > Jeff
>> > > > >
>> > > > > ________________________________
>> > > > > From: Enis Söztutar <enis....@gmail.com>
>> > > > > Sent: Friday, May 26, 2017 11:47:11 AM
>> > > > > To: dev@hbase.apache.org
>> > > > > Cc: hbase-user
>> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
>> > > > >
>> > > > > Jeff, please be respectful to the people who are trying to help you. This is not acceptable behavior and will result in consequences next time.
>> > > > >
>> > > > > On the specific issue that you are seeing, it is highly likely that you are seeing this: https://issues.apache.org/jira/browse/HBASE-14223. Having those servers in the dead servers list will not hurt operations, or runtimes or anything else. Possibly for those servers, there is no new instance of the regionserver running on the same host and ports.
>> > > > >
>> > > > > If you want to manually clean out these, you can follow these steps:
>> > > > > - Manually move these directories from the file system: <hbase_hdfs>/WALs/dead-server-splitting
>> > > > > - ONLY do this if you are sure that there is no "WAL" recovery is happening, and there is only WAL files with names containing ".meta."
>> > > > > - Restart HBase master.
>> > > > >
>> > > > > Upon restart, you can see that these do not show up anymore. For more technical details, please refer to the jira link.
>> > > > >
>> > > > > Enis
>> > > > >
>> > > > > On Fri, May 26, 2017 at 11:03 AM, jeff saremi <jeffsar...@hotmail.com> wrote:
>> > > > >
>> > > > > > Thank you for the GFY answer
>> > > > > >
>> > > > > > And i guess to figure out how to fix these I can always go through the HBase source code.
>> > > > > >
>> > > > > > ________________________________
>> > > > > > From: Dima Spivak <dimaspi...@apache.org>
>> > > > > > Sent: Friday, May 26, 2017 9:58:00 AM
>> > > > > > To: hbase-user
>> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
>> > > > > >
>> > > > > > Sending this back to the user mailing list.
>> > > > > >
>> > > > > > RegionServers can die for many reasons. Looking at your RegionServer log files should give hints as to why it's happening.
>> > > > > >
>> > > > > > -Dima
>> > > > > >
>> > > > > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <jeffsar...@hotmail.com> wrote:
>> > > > > >
>> > > > > > > I had posted this to the user mailing list and I have not got any direct answer to my question.
>> > > > > > >
>> > > > > > > Where do dead RS's come from and how can they be cleaned up? Someone in the midst of developers should know this.
>> > > > > > >
>> > > > > > > thanks
>> > > > > > >
>> > > > > > > Jeff
>> > > > > > >
>> > > > > > > ________________________________
>> > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
>> > > > > > > Sent: Thursday, May 25, 2017 10:23:17 AM
>> > > > > > > To: u...@hbase.apache.org
>> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
>> > > > > > >
>> > > > > > > I'm still looking to get hints on how to remove the dead regions. thanks
>> > > > > > >
>> > > > > > > ________________________________
>> > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
>> > > > > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
>> > > > > > > To: u...@hbase.apache.org
>> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
>> > > > > > >
>> > > > > > > i'm trying to eliminate the dead region servers.
>> > > > > > >
>> > > > > > > ________________________________
>> > > > > > > From: Ted Yu <yuzhih...@gmail.com>
>> > > > > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
>> > > > > > > To: u...@hbase.apache.org
>> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
>> > > > > > >
>> > > > > > > bq. running hbck (many times
>> > > > > > >
>> > > > > > > Can you describe the specific inconsistencies you were trying to resolve ?
>> > > > > > > Depending on the inconsistencies, advice can be given on the best known hbck command arguments to use.
>> > > > > > >
>> > > > > > > Feel free to pastebin master log if needed.
>> > > > > > >
>> > > > > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <jeffsar...@hotmail.com> wrote:
>> > > > > > >
>> > > > > > > > these are the things I have done so far:
>> > > > > > > >
>> > > > > > > > - restarting master (few times)
>> > > > > > > >
>> > > > > > > > - running hbck (many times; this tool does not seem to be doing anything at all)
>> > > > > > > >
>> > > > > > > > - checking the list of region servers in ZK (none of the dead ones are listed here)
>> > > > > > > >
>> > > > > > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead ones only 3 are listed here with "-splitting" at the end of their names and they contain one single file like: 1493846660401..meta.1493922323600.meta
>> > > > > > > >
>> > > > > > > > ________________________________
>> > > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
>> > > > > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
>> > > > > > > > To: u...@hbase.apache.org
>> > > > > > > > Subject: What is Dead Region Servers and how to clear them up?
>> > > > > > > >
>> > > > > > > > Apparently having dead region servers is so common that a section of the master console is dedicated to that?
>> > > > > > > > How can we clean this up (preferably in an automated fashion)? Why isn't this being done by HBase automatically?
>> > > > > > > >
>> > > > > > > > thanks
>> > > > > > > >
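
For anyone wanting to automate the manual check Enis describes in this thread, here is a minimal sketch, not a tested tool. It only covers the "directory ends in -splitting and holds nothing but .meta. WAL files" condition; confirming that no regions are in transition still has to be done separately before moving anything. The function name, the shape of the input (a dict from directory name to file names, gathered however you list <hbase_hdfs>/WALs), and the sample server names are all hypothetical. Skipping empty directories is an extra conservative assumption, not something stated in the thread.

```python
from typing import Dict, List


def splitting_dirs_safe_to_move(wal_listing: Dict[str, List[str]]) -> List[str]:
    """Return "-splitting" WAL directories that contain only ".meta." files.

    wal_listing maps each directory name under <hbase_hdfs>/WALs to the
    file names inside it (hypothetical input format).
    """
    safe = []
    for dirname, files in sorted(wal_listing.items()):
        if not dirname.endswith("-splitting"):
            continue  # not a leftover splitting directory, leave it alone
        # Assumption: an empty directory is skipped rather than treated as safe.
        if files and all(".meta." in name for name in files):
            safe.append(dirname)
    return safe


# Made-up listing for illustration only.
listing = {
    "rs1,16020,1493846660401": ["rs1%2C16020%2C1493846660401.1493922000000"],
    "rs2,16020,1493840000000-splitting": [
        "rs2%2C16020%2C1493840000000..meta.1493922323600.meta"
    ],
    "rs3,16020,1493850000000-splitting": [
        "rs3%2C16020%2C1493850000000.1493922999999"
    ],
}
print(splitting_dirs_safe_to_move(listing))
# prints ['rs2,16020,1493840000000-splitting']
```

The function only reports candidates; moving the directories and restarting the master are left as manual steps, as in the original instructions.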