Was the region server running on 10.147.1.168? If so, can you pastebin the region server log around 2014-02-18 01:17:31?
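In case it helps, here is a rough sketch for pulling that window out of the log (the default path and the leading-timestamp layout are assumptions; adjust both to your install):
{code:title=LogWindow.java}
import java.io.BufferedReader;
import java.io.FileReader;

// Prints log lines whose timestamp falls in a two-minute window around
// the delete event. The default path below is hypothetical; pass the
// real log file as the first argument.
public class LogWindow {
    static final String FROM = "2014-02-18 01:16:30";
    static final String TO   = "2014-02-18 01:18:30";

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0]
                : "/var/log/hbase/hbase-regionserver.log";
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            boolean inWindow = false;
            while ((line = in.readLine()) != null) {
                // Strip a leading '[' for bracketed layouts like the master log.
                String s = line.startsWith("[") ? line.substring(1) : line;
                if (s.length() >= 19 && s.charAt(4) == '-' && s.charAt(13) == ':') {
                    // Zero-padded timestamps compare correctly as plain strings.
                    String ts = s.substring(0, 19);
                    inWindow = ts.compareTo(FROM) >= 0 && ts.compareTo(TO) <= 0;
                }
                // Untimestamped lines (stack traces) follow their header line.
                if (inWindow) {
                    System.out.println(line);
                }
            }
        }
    }
}
{code}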
Thanks

On Tue, Feb 18, 2014 at 9:17 PM, takeshi <[email protected]> wrote:
> Hi, all
>
> We also encountered on our POC cluster what seems like the issue you
> were talking about. We still have not found the root cause, but we would
> like to share some findings here which may help others' study.
>
> Environment:
> We are using cdh-4.2.1 (hadoop-2.0.0, hbase-0.94.2) as our base distro,
> with some bugfixes ported from other Apache versions, running on 14
> machines with CentOS-5.3.
>
> The error messages we found for the issue:
>
> The HBase master reports a "Parent directory doesn't exist" error
> {code:title=hbase-master.log}
> [2014-02-18 01:20:30,550][master-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8100,1392703701863.archivedHFileCleaner][WARN][org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache]: Snapshot directory: hdfs://cluster1/user/SPN-hbase/.snapshot doesn't exist
> [2014-02-18 01:20:36,597][IPC Server handler 47 on 8100][ERROR][org.apache.hadoop.hbase.master.HMaster]: Region server aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 reported a fatal error:
> ABORTING region server aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776: IOE in log roller
> Cause:
> java.io.IOException: Exception in createWriter
>     at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
>     at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:694)
>     at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:631)
>     at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
>     at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: cannot get log writer
>     at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:750)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
>     ... 4 more
> Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
>
>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
>     at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:747)
>     ... 5 more
> {code}
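>
> To make the failing check concrete: a minimal sketch (illustration only,
> using the plain FileSystem API; the server name is copied from the log
> above) that probes the WAL directory whose absence is exactly what
> FSNamesystem.verifyParentDir rejects when the log roller tries to create
> a new writer:
> {code:title=CheckWalDir.java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Probes the WAL directory whose absence makes HLog.rollWriter() abort.
> public class CheckWalDir {
>     public static void main(String[] args) throws Exception {
>         // Picks up fs.defaultFS from core-site.xml on the classpath.
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(conf);
>         Path walDir = new Path("/user/SPN-hbase/.logs/"
>                 + "aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776");
>         // create() under walDir fails with "Parent directory doesn't exist"
>         // when these are false, because the namenode rejects startFile.
>         System.out.println(walDir + " exists: " + fs.exists(walDir));
>         System.out.println(".logs exists: " + fs.exists(walDir.getParent()));
>     }
> }
> {code}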
>
> The namenode performed mkdirs for "/<hbase-root>/.logs/<regionserver>"
> as requested by principal "hbase/<fqdn>@LAB", but later removed
> "/<hbase-root>/.logs" as requested by "hdfs/<fqdn>@LAB"!
> {code:title=namenode.log (with hdfs audit enabled)}
> 2014-02-18 01:08:38,203 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=mkdirs src=/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 dst=null perm=hbase:supergroup:rwxr-xr-x
> ...
> 2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs dst=null perm=null
> ...
> 2014-02-18 01:19:10,132 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
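>
> The audit trail suggests that this sequence alone is enough to produce
> the LeaseExpiredException. A minimal two-client sketch of it
> (illustration only: scratch paths, tiny block size so the second write
> crosses a block boundary quickly; run against a test cluster, not
> production):
> {code:title=DeleteUnderWriter.java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> // Writer A holds an open file under a directory; writer B (playing the
> // hdfs principal) removes the directory recursively; A's next block
> // allocation then fails with LeaseExpiredException, as in the WAL above.
> public class DeleteUnderWriter {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fsA = FileSystem.get(conf);
>         FileSystem fsB = FileSystem.newInstance(conf); // separate DFSClient
>         Path dir = new Path("/tmp/lease-repro/.logs");
>         Path file = new Path(dir, "open-wal-file");
>
>         // 1 MB blocks so a 2 MB write needs a second block (addBlock RPC).
>         FSDataOutputStream out = fsA.create(file, true, 4096, (short) 1, 1L << 20);
>         out.write(new byte[512 * 1024]);
>         out.hflush(); // file is live; the lease is held by fsA's client
>
>         fsB.delete(dir, true); // the "cmd=delete src=.../.logs" event
>
>         out.write(new byte[2 * 1024 * 1024]); // forces a new block
>         out.hflush(); // expected: LeaseExpiredException "No lease on ..."
>     }
> }
> {code}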
> 2014-02-18 01:19:10,132 INFO org.apache.hadoop.ipc.Server: IPC Server handler 83 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.147.1.168:35703: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2419)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2410)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2203)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:480)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
> {code}
>
> Way to reproduce:
> Currently we have a way to reproduce this issue: run the ycsb workloads
> [a-f] against HBase while, on the other side, simply executing "service
> <hadoop,hbase-daemons> start". We keep running both sides 10 times in a
> loop per round, and the issue can usually be reproduced in the first or
> second round. But I am not sure whether it is related to the issue
> discussed at the beginning of this thread.
>
> Follow-ups:
> Now I am following the weird message that appeared in the namenode.log
> {code}
> 2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs dst=null perm=null
> {code}
> to see what more I can dig out, for example with an audit-log scan like
> the sketch below.
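>
> A rough sketch of that scan (the audit log path is an assumption; the
> field layout matches the entries above):
> {code:title=AuditScan.java}
> import java.io.BufferedReader;
> import java.io.FileReader;
>
> // Flags audit entries that delete the HBase WAL root, showing who did it.
> public class AuditScan {
>     public static void main(String[] args) throws Exception {
>         String path = args.length > 0 ? args[0]
>                 : "/var/log/hadoop-hdfs/hdfs-audit.log"; // hypothetical default
>         try (BufferedReader in = new BufferedReader(new FileReader(path))) {
>             String line;
>             while ((line = in.readLine()) != null) {
>                 if (line.contains("cmd=delete")
>                         && line.contains("src=/user/SPN-hbase/.logs")) {
>                     System.out.println(line); // prints ugi=, ip=, timestamp
>                 }
>             }
>         }
>     }
> }
> {code}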
>
> I'll keep updating this thread with any findings, and welcome any
> opinions. Thanks.
>
> Best regards
>
> takeshi