The split is done by the region servers (the master coordinates). Is there anything interesting in their logs?
On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <dejan.men...@gmail.com> wrote:

> With the client, the issue was that it kept retrying connections to the same region servers even after the regions were reassigned. Lowering the retry count helped in this specific use case, but yes, we still want the servers to reallocate quickly.
>
> What got me here:
>
> http://hbase.apache.org/book.html#mttr
>
> I basically set the configuration exactly the way it's explained there. *zookeeper.session.timeout* is (and was before) 60000 (one minute).
>
> So basically what happens is:
>
> - I have a test cluster consisting of four nodes, every node being a DataNode and a RegionServer.
> - I simulate a network partition on one of them (connect to it through the console and take the network interface down, simulating network issues we had recently).
> - After a short time I see in HBase that my RegionServer is dead, and as the total number of regions I see the previous total minus the number of regions that were hosted on the node whose RegionServer just 'disappeared'.
> - At this point I want my cluster to recover as soon as possible, to start serving the missing regions.
> - The first thing I see in the HMaster logs is:
>
> 2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [{name_of_node_I_took_down},60020,1426860403261]
>
> 2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
>
> 2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]
>
> 2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]
>
> 2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 acquired by {fqdn_of_another_live_node},60020,1426859445623
>
> 2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
>
> The same entry, still with status = in_progress, is then logged again every five seconds or so (14:17:31, 14:17:36, 14:17:42, 14:17:48, and so on).
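>
> For reference, the MTTR-related settings that section of the book covers boil down to a handful of properties. A minimal sketch in Java form; the property names are the standard HBase/HDFS ones, the values are only illustrative, and the dfs.* ones normally belong in hdfs-site.xml on the NameNode rather than in client code:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>
>     public class MttrSettings {
>         public static void main(String[] args) {
>             // Loads hbase-site.xml / hbase-default.xml from the classpath.
>             Configuration conf = HBaseConfiguration.create();
>             // How long ZooKeeper waits before declaring a RegionServer dead;
>             // this bounds how quickly the master notices the failure.
>             conf.setInt("zookeeper.session.timeout", 60000); // 1 minute
>             // HDFS stale-node detection, so reads and writes start avoiding
>             // a silent DataNode well before it is declared fully dead.
>             conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
>             conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
>             conf.setLong("dfs.namenode.stale.datanode.interval", 30000); // 30 s
>             System.out.println("session timeout = " + conf.get("zookeeper.session.timeout"));
>         }
>     }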
>
> In the meantime, in the hbase...out log I got this:
>
> ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
>
> java.io.IOException: Call to {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
>         at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
>         at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
>         at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
>         at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
>         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
>         at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
>         at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
>         at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
>         at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>         at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>         at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>         at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>         at org.mortbay.jetty.Server.handle(Server.java:326)
>         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>         at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>         at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
>         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
>         at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
>         at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
>         at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)
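>
> The waitTime=60044 / rpcTimeout=60000 above looks like hbase.rpc.timeout (default 60000 ms) expiring: the master's own web UI (table_jsp calling HBaseAdmin.getCompactionState) blocks on the dead server for the full minute. For comparison, a sketch of the same call from a standalone client with a lower timeout; the table name 'my_table' is hypothetical:
>
>     import java.io.IOException;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>
>     public class CompactionStateProbe {
>         public static void main(String[] args) throws IOException, InterruptedException {
>             Configuration conf = HBaseConfiguration.create();
>             // Fail the RPC after 10 s instead of the default 60 s.
>             conf.setInt("hbase.rpc.timeout", 10000);
>             HBaseAdmin admin = new HBaseAdmin(conf);
>             try {
>                 // Same admin call the master UI makes in the stack trace above.
>                 System.out.println(admin.getCompactionState("my_table"));
>             } finally {
>                 admin.close();
>             }
>         }
>     }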
>
> Besides this same issue, please note that the first message was at 2015-03-20 14:17:26,015. And then (this is where the transition finally started):
>
> 2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
>
> 2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
>
> 2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting] in 909083ms
>
> 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, ts=1426860639088, server={name_of_node_I_took_down},60020,1426860403261} to {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
>
> 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from {name_of_node_I_took_down},60020,1426860403261
>
> 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, ts=1426860641783, server={name_of_node_I_took_down},60020,1426860403261} to {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
>
> 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 25ab6e7b42e36ddaa723d71be5954543 from {name_of_node_I_took_down},60020,1426860403261
>
> Note that the SplitLogManager task finished at 14:32:35 and the region transitions started just after that. The split began at 14:17:26, and 909083 ms is 15 minutes and 9 seconds, so this is the 15 minutes I'm talking about.
>
> What am I missing?
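>
> The task the master keeps polling lives in ZooKeeper under /hbase/splitWAL, the same znode path printed in the log lines, so its state can also be inspected directly while a split hangs. A one-shot sketch, assuming a ZooKeeper quorum reachable at zk1:2181 (hypothetical host) and the default znode layout:
>
>     import java.util.List;
>     import org.apache.zookeeper.ZooKeeper;
>
>     public class SplitWalTasks {
>         public static void main(String[] args) throws Exception {
>             // One-shot read; the no-op lambda serves as the default watcher.
>             ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, event -> { });
>             try {
>                 // Each child znode is one WAL-splitting task; its data holds the
>                 // owner/state that the SplitLogManager log lines above report.
>                 List<String> tasks = zk.getChildren("/hbase/splitWAL", false);
>                 for (String task : tasks) {
>                     byte[] data = zk.getData("/hbase/splitWAL/" + task, false, null);
>                     System.out.println(task + " -> " + data.length + " bytes of task state");
>                 }
>             } finally {
>                 zk.close();
>             }
>         }
>     }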
>
> On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <nkey...@gmail.com> wrote:
>
> > You've changed the value of zookeeper.session.timeout to 15 minutes? A very reasonable target is 1 minute before relocating the regions; that's the default iirc. You can push it down to 20s, but then stop-the-world GC pauses become more of an issue. 15 minutes is really a lot. The hdfs stale mode must always be used, with a lower timeout than the hbase one.
> >
> > Client side there should be nothing to do (except for advanced stuff); at each retry the client checks the location of the regions. If you lower the number of retries the client will fail sooner, but usually you don't want the client to fail, you want the servers to reallocate quickly.
> >
> > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <dejan.men...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Sorry for the slightly late update, but I managed to narrow it down a little.
> > >
> > > We didn't upgrade yet, as we are using the Hortonworks distribution right now, and even if we upgrade we would get 0.98.4. However, it looks like the issue here was in our use case and configuration (still looking into it).
> > >
> > > Basically, initially I saw that when one server goes down we start having performance issues in general, but that turned out to be on our client side, due to caching: clients kept trying to reconnect to nodes that were offline, and later tried to fetch regions from those nodes too. This is basically why, on the server side, I didn't manage to see anything in the logs that was even slightly interesting or pointed me in the desired direction.
> > >
> > > Another question that popped up: when a server is down (and with it the DataNode and the HRegionServer it was hosting), what's the optimal time to set for the HMaster to consider the server dead and reassign its regions somewhere else? This is another performance bottleneck we hit while the regions are inaccessible. In our case it's configured to 15 minutes, and simple logic tells me that if you want it to happen earlier you configure a lower number of retries, but the issue is, as always, in the details, so I'm not sure if anyone knows better math for this?
> > >
> > > And the last question: is it possible to manually force HBase to reassign regions? In this case, while the HMaster keeps retrying to contact the node that's dead, it's impossible to force it using the 'balancer' command (see the sketch below).
> > >
> > > Thanks a lot!
> > >
> > > Dejan
> > >
> > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <dejan.men...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > To be very honest, there's no particular reason why we stick to this one, besides a current lack of time to go through the upgrade process, but that looks like it's going to be the next step.
> > > >
> > > > Had a crazy day and didn't have time to go through all the logs again; also, one of the machines (the last one where we had this issue) was fully reprovisioned yesterday, so I don't have the logs from there anymore.
> > > >
> > > > Besides upgrading, which I will bring up today, can you just point me to the specific RPC issue in 0.98.0? The thing is that we've seen some strange RPC behavior in this case, and I just want to check whether it's the same thing (we were already suspecting RPC).
> > > >
> > > > Thanks a lot!
> > > > Dejan
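> > >
> > > Regarding the forced-reassignment question above: the 0.98 client API does expose this. A sketch, where the table name, region names, and target server are made up for illustration:
> > >
> > >     import java.io.IOException;
> > >     import org.apache.hadoop.conf.Configuration;
> > >     import org.apache.hadoop.hbase.HBaseConfiguration;
> > >     import org.apache.hadoop.hbase.client.HBaseAdmin;
> > >     import org.apache.hadoop.hbase.util.Bytes;
> > >
> > >     public class ForceReassign {
> > >         public static void main(String[] args) throws IOException {
> > >             Configuration conf = HBaseConfiguration.create();
> > >             HBaseAdmin admin = new HBaseAdmin(conf);
> > >             try {
> > >                 // Ask the master to (re)assign a single region; the argument
> > >                 // is the full region name, e.g. "table,startkey,timestamp.<hash>."
> > >                 admin.assign(Bytes.toBytes(
> > >                     "mytable,,1426860404905.0e7cc87a4ef6c47a779186f5bf79a01c."));
> > >                 // Or move a region, addressed by its encoded name (the bare
> > >                 // hash), to a specific "host,port,startcode" destination.
> > >                 admin.move(Bytes.toBytes("0e7cc87a4ef6c47a779186f5bf79a01c"),
> > >                     Bytes.toBytes("host2.example.com,60020,1426859445623"));
> > >             } finally {
> > >                 admin.close();
> > >             }
> > >         }
> > >     }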
> > > >
> > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell <apurt...@apache.org> wrote:
> > > >
> > > > > Is there a particular reason why you are using HBase 0.98.0? The latest 0.98 release is 0.98.11. There's a known performance issue with 0.98.0 pertaining to RPC that was fixed in later releases; you should move up from 0.98.0. In addition, hundreds of improvements and bug fixes have gone into the ten releases since 0.98.0.
> > > > >
> > > > > On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges <dejan.men...@gmail.com> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > We have a strange issue with HBase performance (overall cluster performance) when one of the DataNodes in the cluster unexpectedly goes down.
> > > > > >
> > > > > > The scenario is as follows:
> > > > > > - The cluster works fine; it's stable.
> > > > > > - One DataNode unexpectedly goes down (PSU issue, network issue, anything).
> > > > > > - The whole HBase cluster goes down (performance becomes so bad that we have to restart all RegionServers to bring it back to life).
> > > > > >
> > > > > > The funniest (and latest) incident happened when we added a new node to the cluster (with 8 x 4T SATA disks) and left just the DataNode running on it, to give it a couple of days to accumulate some data. At some point, due to a hardware issue, the server rebooted (twice within three hours), at a moment when it held maybe 5% of the data it would have had in a couple of days. Nothing besides the DataNode was running on it, and once it went down it affected literally everything; restarting the RegionServers in the end fixed it.
> > > > > >
> > > > > > We are using HBase 0.98.0 with Hadoop 2.4.0
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > >    - Andy
> > > > >
> > > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)