Thanks for the additional information, Jeff; interesting scenario. Let me re-explain: a dead server means that on this node (or container, in your case) there was once a regionserver process, but there is not one now. This does not indicate the current health of the cluster; it only states that fact and alerts the operator to check those nodes/containers and find out what caused the process to die. But I admit that this might cause confusion.

And as I proposed in my previous mail, I think that in the Yarn/Mesos deployment scenario we need to supply a command to clear those dead servers. To be more specific: after all the recovery actions have completed, whether automatic ones like WAL splitting and ZK cleanup or manual ones like hbck -repair, and as long as we're sure we no longer need to care about those dead servers, we could remove them from the master UI. If this satisfies what you need, I could open a JIRA and get the work done (smile). Let me know your thoughts, thanks.

Best Regards,
Yu

On 28 May 2017 at 23:26, jeff saremi <jeffsar...@hotmail.com> wrote:

> I think more and more deployments are being made dynamic using Yarn and Mesos. Going back to a fixed set of servers is not going to eliminate the problem I'm talking about. Assuming that the region servers come back on the same node is too optimistic.
>
> Let me try this a different way to see if I can make my point:
>
> - A cluster is either healthy or not healthy.
>
> - If the cluster is unhealthy, then it can be made healthy using either external tools (hbck) or the internal agreement of master-regionserver. If this is not achievable, then the cluster must be discarded.
>
> - The cluster is now healthy, meaning that no information such as dead servers, dead regions, or whatever should be lingering anywhere in the system. And moreover no such information must ever be brought to the attention of the administrators of the cluster.
>
> - If there is such information still hiding in some place in the system, then it only means that the mechanism (hbck or hbase itself) that made the system healthy did not complete its job of cleaning up what needed to be cleaned up.
>
> ------------------------------
> *From:* Ted Yu <yuzhih...@gmail.com>
> *Sent:* Saturday, May 27, 2017 1:54:50 PM
> *To:* dev@hbase.apache.org
> *Cc:* Hbase-User; Yu Li
> *Subject:* Re: What is Dead Region Servers and how to clear them up?
>
> The involvement of Yarn can explain why you observed relatively more dead servers (compared to traditional deployment).
>
> Suppose in the first run, Yarn allocates containers for region servers on a set of nodes. Subsequently, Yarn may choose nodes (for the same number of servers) which are not exactly the same nodes as in the previous run.
>
> What Yu Li described as restarting a server happens on the same node where the server was running previously.
>
> Cheers
>
> On Sat, May 27, 2017 at 11:59 AM, jeff saremi <jeffsar...@hotmail.com> wrote:
>
> > Yes. We don't have fixed servers, with the exception of the ZK machines.
> >
> > We have 3 yarn jobs, one for each of master, region, and thrift servers, each launched separately with a different number of nodes. I hope that's not what is causing problems.
> >
> > ________________________________
> > From: Ted Yu <yuzhih...@gmail.com>
> > Sent: Saturday, May 27, 2017 11:27:36 AM
> > To: dev@hbase.apache.org
> > Cc: Hbase-User; Yu Li
> > Subject: Re: What is Dead Region Servers and how to clear them up?
> >
> > Jeff:
> > bq. We run our cluster on Yarn and upon restarting jobs in Yarn
> >
> > Can you clarify a bit more - are you running hbase processes inside Yarn container ?
> >
> > Cheers
> >
> > On Sat, May 27, 2017 at 10:58 AM, jeff saremi <jeffsar...@hotmail.com> wrote:
> >
> > > Thanks @Yu Li<mailto:car...@gmail.com <car...@gmail.com>>
> > >
> > > You are absolutely correct. Dead RS's will happen regardless. My issue with this is more "psychological". If I have done everything needed to ensure that RSs are running fine, regions are assigned, and hbck reports are consistent, then how is this list of dead region servers helping me, other than causing anxiety?
> > > We run our cluster on Yarn and upon restarting jobs in Yarn we get a lot of inconsistent, unavailable regions (and this is only one scenario).
> > > Then we'll run hbck with the -repair option (and I was wrong here too: hbck does take care of some issues) and restart the master(s). After that there seem to be no more issues other than dead region servers still being reported. We should not have this anymore after having taken all precautions to reset the system properly.
> > >
> > > I was trying to write something similar to what hbck would do to take care of this specific issue. I wouldn't mind contributing to hbck itself either. However I needed to understand where this list comes from and why. These are things that I could possibly automate (after all the other steps I mentioned):
> > > - check the ZK list of RS's. If any of the dead RS's are found, remove the node
> > >
> > > - check the hdfs root WALs folder. If there are any entries with the dead RS's name in them, delete them. (here we need to take precaution as @Enis mentioned; possibly only if the node timestamp has not changed in a while)
> > >
> > > - what else? These steps are not enough
> > >
> > > For instance, we currently have 17 servers being reported as dead. Only 3-4 of them show up in hdfs with "-splitting" in their WALs folder. Where do the rest come from?
> > > thanks
> > >
> > > Jeff
> > >
> > > ________________________________
> > > From: Yu Li <car...@gmail.com>
> > > Sent: Friday, May 26, 2017 10:18:09 PM
> > > To: Hbase-User
> > > Cc: dev@hbase.apache.org
> > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > >
> > > bq. And having a list of "dead" servers is not a healthy thing to have.
> > > I don't think the existence of "dead" servers means the service is unhealthy, especially in a distributed system. Besides hbase, HDFS also shows Live and Dead nodes in the namenode UI, and people won't regard HDFS as unhealthy if there are dead nodes.
> > >
> > > In HBase, if some RS aborts due to an unexpected issue like a long GC, normally we will restart it, and once it's restarted and reports to the master, it will be removed from the dead server list. So when we observe a dead server in the Master UI, the first thing is to check the root cause and restart it if that won't cause further issues.
> > >
> > > However, sometimes we may find the server aborted due to some hardware failure and we must take the server offline for repair. Or we need to move some nodes to join other clusters, so we stop the RS process on purpose. I guess this is the case you're dealing with @jeff? If so, I think it's a reasonable requirement that we supply a command in hbase to clear the dead nodes when the operator is sure they no longer serve.
> > >
> > > Best Regards,
> > > Yu
> > >
> > > On 27 May 2017 at 04:49, Enis Söztutar <enis....@gmail.com> wrote:
> > >
> > > > In general if there are no regions in transition, the WAL recovery has already finished. You can watch the master's log4j log for those entries, but the lack of regions in transition is the easiest way to identify this.
> > > >
> > > > Enis
> > > >
> > > > On Fri, May 26, 2017 at 12:14 PM, jeff saremi <jeffsar...@hotmail.com> wrote:
> > > >
> > > > > thanks Enis
> > > > >
> > > > > I apologize for earlier
> > > > >
> > > > > This looks very close to our issue
> > > > > When you say: "there is no "WAL" recovery is happening", how could I make sure of that? Thanks
> > > > >
> > > > > Jeff
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: Enis Söztutar <enis....@gmail.com>
> > > > > Sent: Friday, May 26, 2017 11:47:11 AM
> > > > > To: dev@hbase.apache.org
> > > > > Cc: hbase-user
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > Jeff, please be respectful to the people who are trying to help you.
> > > > > This is not acceptable behavior and will result in consequences next time.
> > > > >
> > > > > On the specific issue that you are seeing, it is highly likely that you are seeing this: https://issues.apache.org/jira/browse/HBASE-14223. Having those servers in the dead servers list will not hurt operations, or runtimes or anything else. Possibly for those servers, there is no new instance of the regionserver running on the same host and ports.
> > > > >
> > > > > If you want to manually clean these out, you can follow these steps:
> > > > > - Manually move these directories from the file system: <hbase_hdfs>/WALs/dead-server-splitting
> > > > > - ONLY do this if you are sure that there is no "WAL" recovery is happening, and there are only WAL files with names containing ".meta."
> > > > > - Restart HBase master.
> > > > >
> > > > > Upon restart, you can see that these do not show up anymore. For more technical details, please refer to the jira link.
> > > > >
> > > > > Enis
> > > > >
> > > > > On Fri, May 26, 2017 at 11:03 AM, jeff saremi <jeffsar...@hotmail.com> wrote:
> > > > >
> > > > > > Thank you for the GFY answer
> > > > > >
> > > > > > And i guess to figure out how to fix these I can always go through the HBase source code.
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: Dima Spivak <dimaspi...@apache.org>
> > > > > > Sent: Friday, May 26, 2017 9:58:00 AM
> > > > > > To: hbase-user
> > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > Sending this back to the user mailing list.
> > > > > >
> > > > > > RegionServers can die for many reasons. Looking at your RegionServer log files should give hints as to why it's happening.
> > > > > >
> > > > > >
> > > > > > -Dima
> > > > > >
> > > > > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <jeffsar...@hotmail.com> wrote:
> > > > > >
> > > > > > > I had posted this to the user mailing list and I have not got any direct answer to my question.
> > > > > > >
> > > > > > > Where do dead RS's come from and how can they be cleaned up? Someone in the midst of developers should know this.
> > > > > > >
> > > > > > > thanks
> > > > > > >
> > > > > > > Jeff
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
> > > > > > > Sent: Thursday, May 25, 2017 10:23:17 AM
> > > > > > > To: u...@hbase.apache.org
> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > > >
> > > > > > > I'm still looking to get hints on how to remove the dead regions. thanks
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
> > > > > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
> > > > > > > To: u...@hbase.apache.org
> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > > >
> > > > > > > i'm trying to eliminate the dead region servers.
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Ted Yu <yuzhih...@gmail.com>
> > > > > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
> > > > > > > To: u...@hbase.apache.org
> > > > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > > > >
> > > > > > > bq. running hbck (many times
> > > > > > >
> > > > > > > Can you describe the specific inconsistencies you were trying to resolve ? Depending on the inconsistencies, advice can be given on the best known hbck command arguments to use.
> > > > > > >
> > > > > > > Feel free to pastebin master log if needed.
> > > > > > >
> > > > > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <jeffsar...@hotmail.com> wrote:
> > > > > > >
> > > > > > > > these are the things I have done so far:
> > > > > > > >
> > > > > > > >
> > > > > > > > - restarting master (few times)
> > > > > > > >
> > > > > > > > - running hbck (many times; this tool does not seem to be doing anything at all)
> > > > > > > >
> > > > > > > > - checking the list of region servers in ZK (none of the dead ones are listed here)
> > > > > > > >
> > > > > > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead ones only 3 are listed here with "-splitting" at the end of their names and they contain one single file like: 1493846660401..meta. 1493922323600.meta
> > > > > > > >
> > > > > > > > ________________________________
> > > > > > > > From: jeff saremi <jeffsar...@hotmail.com>
> > > > > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
> > > > > > > > To: u...@hbase.apache.org
> > > > > > > > Subject: What is Dead Region Servers and how to clear them up?
> > > > > > > >
> > > > > > > > Apparently having dead region servers is so common that a section of the master console is dedicated to that?
> > > > > > > > How can we clean this up (preferably in an automated fashion)? Why isn't this being done by HBase automatically?
> > > > > > > >
> > > > > > > > thanks
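
The manual checks discussed in this thread (is a dead server still listed in ZK, does it have a "-splitting" directory under <hbase_hdfs>/WALs, and does that directory hold only ".meta." WAL files) could be sketched as a small classification step. This is only an illustrative sketch, not an HBase tool: `classify_dead_servers` and `DeadServerReport` are hypothetical names, and the three inputs are assumed to have been collected by hand (from the master UI, a ZK listing, and an HDFS listing). It calls no HBase, ZK, or HDFS API and deletes nothing.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Suffix seen on WAL directories of dead servers in this thread.
SPLITTING_SUFFIX = "-splitting"


@dataclass
class DeadServerReport:
    """Hypothetical per-server summary of the manual checks from the thread."""
    server: str
    in_zk: bool           # server name still present in the ZK region server list
    splitting_dir: bool   # "<server>-splitting" directory exists under WALs
    only_meta_wals: bool  # that directory holds no files other than ".meta." WALs


def classify_dead_servers(
    dead_servers: Iterable[str],
    zk_servers: Iterable[str],
    wal_dirs: Dict[str, List[str]],
) -> List[DeadServerReport]:
    """Cross-check dead servers against hand-collected ZK and WAL listings.

    wal_dirs maps each directory name under <hbase_hdfs>/WALs to the file
    names it contains. All inputs are assumptions supplied by the caller.
    """
    zk = set(zk_servers)
    reports = []
    for server in dead_servers:
        files = wal_dirs.get(server + SPLITTING_SUFFIX)
        has_dir = files is not None
        # An empty "-splitting" directory trivially counts as "only meta WALs".
        only_meta = has_dir and all(".meta." in name for name in files)
        reports.append(DeadServerReport(server, server in zk, has_dir, only_meta))
    return reports


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = classify_dead_servers(
        dead_servers=["rs1,16020,1", "rs2,16020,2"],
        zk_servers=[],
        wal_dirs={"rs1,16020,1-splitting": ["a..meta.123.meta"]},
    )
    for report in sample:
        print(report)
```

A server reported with `splitting_dir=True` and `only_meta_wals=True` matches the case Enis describes; whether it is actually safe to move the directory still depends on confirming that there are no regions in transition, which this sketch does not check. Servers with `splitting_dir=False` are the ones whose origin remains the open question in the thread.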