[
https://issues.apache.org/jira/browse/HBASE-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Daniel Cryans updated HBASE-1439:
--------------------------------------
Priority: Blocker (was: Major)
I did a little experiment to confirm some assumptions I had. I started one
master, one RS and one ZK server on my machine on a fresh HDFS. Right when the RS
finished opening the ROOT and META regions, I killed the ZK server to simulate
a network partition. The master got:
{code}
org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Got ZooKeeper event, state: Disconnected, type: None, path: null
{code}
The RS got:
{code}
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
{code}
Then both nodes kept running as if nothing had happened, and that's a real
problem. If ZK is the boss and you can't talk to it, you should do an emergency
shutdown. They have something like that in Bigtable:
{quote}
To ensure that a Bigtable cluster is not vulnerable to networking issues
between the master and Chubby, the master kills itself if its Chubby session
expires.
{quote}
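To illustrate what I mean, here is a rough, untested sketch of the kind of watcher the master and the RS could install: treat Disconnected as transient (the ZK client keeps retrying on its own within the session timeout), but treat Expired as fatal and abort on the spot. The class name and the abort() hook are made up for the example; the real emergency-shutdown path would be whatever hook the master/RS already have.
{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Illustrative sketch only: SessionExpiryWatcher and abort() are placeholders,
// not existing HBase classes.
public class SessionExpiryWatcher implements Watcher {
  public void process(WatchedEvent event) {
    switch (event.getState()) {
      case Disconnected:
        // Transient: the ZK client library retries the connection on its own
        // within the session timeout, so log it and keep going for now.
        System.err.println("Lost contact with ZooKeeper, waiting for reconnect");
        break;
      case Expired:
        // The quorum gave up on this session, so our ephemeral znodes
        // (e.g. the one under /hbase/rs) are already gone. The only safe
        // thing left is an emergency shutdown, Bigtable-style.
        System.err.println("ZooKeeper session expired, aborting");
        abort();
        break;
      default:
        break;
    }
  }

  private void abort() {
    // Placeholder for an emergency shutdown: stop serving immediately rather
    // than risk two servers claiming the same regions.
    Runtime.getRuntime().halt(1);
  }
}
{code}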
I also tried deleting the hbase/rs/number znode to simulate an RS that is unable
to contact ZK but can still talk to HDFS (a reproduction sketch follows the log
excerpts below), and I got a boatload of:
{code}
2009-06-23 14:32:21,502 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60000, call regionServerReport(address: 192.168.1.86:60020, startcode: 1245781887285, load: (requests=0, regions=2, usedHeap=25, maxHeap=991), [Lorg.apache.hadoop.hbase.HMsg;@4baa2c23, [Lorg.apache.hadoop.hbase.HRegionInfo;@1137d4a4) from 192.168.1.86:37090: error: org.apache.hadoop.hbase.Leases$LeaseStillHeldException
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:251)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:662)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913)
2009-06-23 14:32:21,511 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60000, call regionServerReport(address: 192.168.1.86:60020, startcode: 1245781887285, load: (requests=0, regions=2, usedHeap=26, maxHeap=991), [Lorg.apache.hadoop.hbase.HMsg;@6e4eeaaf, [Lorg.apache.hadoop.hbase.HRegionInfo;@7f11bfbc) from 192.168.1.86:37090: error: org.apache.hadoop.hbase.Leases$LeaseStillHeldException
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:251)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:662)
	at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913)
2009-06-23 14:32:21,515 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60000, call regionServerReport(address: 192.168.1.86:60020, startcode: 1245781887285, load: (requests=0, regions=2, usedHeap=26, maxHeap=991), [Lorg.apache.hadoop.hbase.HMsg;@664310d0, [Lorg.apache.hadoop.hbase.HRegionInfo;@3d04fc23) from 192.168.1.86:37090: error: org.apache.hadoop.hbase.Leases$LeaseStillHeldException
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:251)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:662)
	at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913)
{code}
The RS was receiving the same kind of errors too:
{code}
2009-06-23 14:32:21,507 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 0)
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:561)
	at java.lang.Thread.run(Thread.java:619)
2009-06-23 14:32:21,512 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:561)
	at java.lang.Thread.run(Thread.java:619)
{code}
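For the record, here is roughly how the znode can be pulled out from under the RS by hand. This is a sketch rather than the exact steps I ran: the quorum address is assumed to be localhost:2181 and the parent path follows the hbase/rs layout mentioned above.
{code}
import java.util.List;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch only: deletes every ephemeral RS znode under /hbase/rs
// to simulate region servers that lost ZK but can still reach HDFS.
public class DeleteRsZnodes {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, new Watcher() {
      public void process(WatchedEvent event) {
        // no-op: this is just a short-lived administrative session
      }
    });
    List<String> servers = zk.getChildren("/hbase/rs", false);
    for (String server : servers) {
      // version -1 means "delete whatever version of the znode is there"
      zk.delete("/hbase/rs/" + server, -1);
    }
    zk.close();
  }
}
{code}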
I'm marking this issue as a blocker since it's really bad.
> race between master and regionserver after missed heartbeat
> -----------------------------------------------------------
>
> Key: HBASE-1439
> URL: https://issues.apache.org/jira/browse/HBASE-1439
> Project: Hadoop HBase
> Issue Type: Bug
> Affects Versions: 0.19.1
> Environment: CentOS 5.2 x86_64, HBase 0.19.1, Hadoop 0.19.1
> Reporter: Andrew Purtell
> Assignee: Jean-Daniel Cryans
> Priority: Blocker
> Fix For: 0.20.0
>
>
> Seen on one of our 0.19.1 clusters:
> {code}
> java.io.FileNotFoundException: File does not exist: hdfs://jdc2-atr-dc-2.atr.trendmicro.com:50000/data/hbase/log_10.3.134.207_1242286427894_60020/hlog.dat.1242528291898
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:415)
> 	at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:679)
> 	at org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
> 	at org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1426)
> 	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:753)
> 	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:716)
> 	at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:249)
> 	at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:442)
> 	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:377)
> 2009-05-17 04:05:55,481 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server 10.3.134.207:60020: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> {code}
> I do not have the region server log yet, but here is my conjecture:
> Here, the write ahead log for 10.3.134.207 is missing in DFS:
> java.io.FileNotFoundException:
> hdfs://jdc2-atr-dc-2.atr.trendmicro.com:50000/data/hbase/log_10.3.134.207_1242286427894_60020/hlog.dat.1242528291898
> when the master tries to split it after declaring the region server crashed.
> There have been recent trouble reports on this cluster that indicate severe
> memory stress, e.g. kernel panics due to OOM. Based on that I think it is
> likely that the region server here missed a heartbeat so the master declared
> it crashed and began to split the log. But, the log was then deleted out from
> underneath the master's split thread. I think the region server was actually
> still running but partially swapped out or the node was otherwise overloaded
> so it missed its heartbeat. Then, when the region server came back after
> being swapped in, it realized it missed its heartbeat and shut down, deleting
> the log as is normally done.
> Even if the above is not the actual cause in this case, I think the scenario
> is plausible. What do you think?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.