You should run with a backup master in a production cluster.  The failover
process works very well and will cause no downtime.  I've done it literally
hundreds of times across our multiple production HBase clusters.

Even if you don't have a backup master, you should still be fine with
restarting the master.  It can handle a brief blip without any problems,
from what I've seen.  The master is really only used for coordination such
as region moves, regionserver failovers, etc.  Your clients can still retrieve data
from your regionservers, as long as no servers die in the brief moment you
are masterless.
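
For anyone following the thread, here is a rough sketch of how an expiring
"failed servers" list like the one nkeywal describes below could behave. This
is not HBase's actual RpcClient code: the class name, method names and the
injectable clock are all hypothetical, and only the 2 second default and the
idea of a short-lived in-memory entry come from the thread. It also hints at
why raising hbase.ipc.client.failed.servers.expiry "should not help": if the
caller keeps using a stale address, every new connection failure simply
re-adds the entry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical illustration only; NOT HBase's real implementation. */
public class FailedServersSketch {
    // address -> time (in ms) at which the entry stops counting as failed
    private final Map<String, Long> expiryByAddress = new ConcurrentHashMap<>();
    private final long expiryMillis;
    private final LongSupplier clock; // injectable so the sketch is easy to test

    public FailedServersSketch(long expiryMillis, LongSupplier clock) {
        this.expiryMillis = expiryMillis;
        this.clock = clock;
    }

    /** Record a connection failure; later callers fail fast until it expires. */
    public void addFailure(String address) {
        expiryByAddress.put(address, clock.getAsLong() + expiryMillis);
    }

    /** True while the address is still inside its expiry window. */
    public boolean isFailed(String address) {
        Long expiry = expiryByAddress.get(address);
        if (expiry == null) {
            return false;
        }
        if (clock.getAsLong() >= expiry) {
            // Entry is stale: drop it so the next caller tries a real connect.
            expiryByAddress.remove(address, expiry);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock, in milliseconds
        FailedServersSketch failed = new FailedServersSketch(2000L, () -> now[0]);

        String staleAddress = "host1/192.168.2.20:60020"; // address from the stack trace
        failed.addFailure(staleAddress);
        System.out.println(failed.isFailed(staleAddress)); // inside the 2 s window

        now[0] = 2000L; // window has elapsed
        System.out.println(failed.isFailed(staleAddress));

        // A caller that still holds the old address fails to connect again,
        // which puts the entry straight back in the list.
        failed.addFailure(staleAddress);
        System.out.println(failed.isFailed(staleAddress));
    }
}
```

The point of the sketch is that the list itself is short-lived and harmless.
The lasting problem in this thread appears to be the stale address held by the
master (see HBASE-13067 below), which is why a master restart or failover
clears it.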

On Thu, Mar 5, 2015 at 5:53 AM, Sandeep Reddy <sandeepvre...@outlook.com>
wrote:

> Since ours is a production cluster we can't restart the master.
> In our test cluster I tested this scenario, and it was resolved after
> restarting the master.
> Other than restarting the master I couldn't find any solution.
> Thanks,Sandeep.
>
> > From: nkey...@gmail.com
> > Date: Wed, 4 Mar 2015 14:55:03 +0100
> > Subject: Re: Where is HBase failed servers list stored
> > To: user@hbase.apache.org
> >
> > If I understand the issue correctly, restarting the master should solve the
> > problem.
> >
> > On Wed, Mar 4, 2015 at 5:55 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> > > Please see HBASE-13067 Fix caching of stubs to allow IP address changes of
> > > restarted remote servers
> > >
> > > Cheers
> > >
> > > On Tue, Mar 3, 2015 at 8:26 PM, Sandeep L <sandeepvre...@outlook.com>
> > > wrote:
> > >
> > > > Hi nkeywal,
> > > > While trying to get more details about this issue, I found that
> > > > HMaster is trying to connect to the wrong IP address.
> > > > Here is the exact issue:
> > > > For unavoidable reasons we had to change the IP address of a
> > > > regionserver, and then updated the new IP address in the /etc/hosts
> > > > file across all HBase servers. I started the RegionServer from the
> > > > master with the start-hbase.sh script, and jps output on the
> > > > regionserver shows that the regionserver process is up and running.
> > > > But when running the hbase balancer, HMaster tries to connect to the
> > > > old IP address instead of the new one.
> > > > One more thing: when I checked the regionserver status on port 60010,
> > > > it shows as up and running.
> > > > Thanks,Sandeep.
> > > >
> > > > > From: nkey...@gmail.com
> > > > > Date: Tue, 3 Mar 2015 19:01:01 +0100
> > > > > Subject: Re: Where is HBase failed servers list stored
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > > It's in local memory. When HBase cannot connect to a server, it puts it
> > > > > > into the "failedServerList" for 2 seconds. This is to avoid having all the
> > > > > > threads going into a potentially long socket timeout. Are you sure that you
> > > > > > can connect from the master to this machine/port?
> > > > >
> > > > > > You can change the time it stays in the list with
> > > > > > hbase.ipc.client.failed.servers.expiry (in milliseconds), but it should not
> > > > > > help.
> > > > >
> > > > > > You should have another exception before this one in the logs (the one that
> > > > > > initially put this region server in this failedServerList).
> > > > >
> > > > > > On Tue, Mar 3, 2015 at 12:08 PM, Sandeep L <sandeepvre...@outlook.com>
> > > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > > > While trying to run hbase balancer I am getting the error message
> > > > > > > "This server is in the failed servers list". Because of this the
> > > > > > > cluster is not getting balanced.
> > > > > > > Even though the regionserver is up and running, HMaster is unable to
> > > > > > > connect to it.
> > > > > > > The odd thing is that HMaster is able to start the regionserver and it is
> > > > > > > detected as up and running, but HMaster is unable to assign regions to it.
> > > > > > > Can someone suggest a solution for this?
> > > > > > > Following is the full stack trace:
> > > > > > > org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: host1/192.168.2.20:60020
> > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:853)
> > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1543)
> > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1442)
> > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
> > > > > > >     at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
> > > > > > >     at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.openRegion(AdminProtos.java:20964)
> > > > > > >     at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:671)
> > > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2097)
> > > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1577)
> > > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1550)
> > > > > > >     at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:104)
> > > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager.handleRegion(AssignmentManager.java:999)
> > > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager$6.run(AssignmentManager.java:1447)
> > > > > > >     at org.apache.hadoop.hbase.master.AssignmentManager$3.run(AssignmentManager.java:1260)
> > > > > > >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > > > > > >     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> > > > > > >     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > > > > > >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > > > > >     at java.lang.Thread.run(Thread.java:745)
> > > > > > Thanks,Sandeep.
> > > >
> > > >
> > >
>
>
