Hi All,

Here is the link to the log files from the test I did:
https://dl.dropboxusercontent.com/u/9473777/20140219-hbase-logs/20140219-hbase-logs.tar.gz
Best regards

takeshi


2014-02-19 13:47 GMT+08:00 takeshi <[email protected]>:

Hi, all

I think I should describe the test environment I used to reproduce this issue in more detail:

| hostname | ip | roles |
| aws-scottm-tmh6-1 | 10.147.1.168 | activeNN, ZKFC, JN, ZK, activeJT, MRZK, NN, TT, HM, RS |
| aws-scottm-tmh6-2 | 10.138.150.16 | standbyNN, ZKFC, JN, standbyJT, MRZK, NN, TT, HM, RS |
| aws-scottm-tmh6-3 | 10.134.214.132 | JN, NN, TT, RS |

P.S. At this point I do not think this issue is related to the MR daemons.

I will find some other place to put the log files for people interested in following this work.

Best regards

takeshi


2014-02-19 13:34 GMT+08:00 takeshi <[email protected]>:

Hi, Ted

Yes, there is a regionserver running on 10.147.1.168; the relevant log snippet is below. This regionserver did nothing around 2014-02-18 01:17:31, and it only detected an error around 01:18:53.

{code}
[2014-02-18 01:08:40,123][REPL IPC Server handler 0 on 8120][INFO ][org.apache.hadoop.ipc.HBaseServer]: REPL IPC Server handler 0 on 8120: starting
[2014-02-18 01:08:40,124][REPL IPC Server handler 1 on 8120][INFO ][org.apache.hadoop.ipc.HBaseServer]: REPL IPC Server handler 1 on 8120: starting
[2014-02-18 01:08:40,126][REPL IPC Server handler 2 on 8120][INFO ][org.apache.hadoop.ipc.HBaseServer]: REPL IPC Server handler 2 on 8120: starting
[2014-02-18 01:08:40,128][regionserver8120][INFO ][org.apache.hadoop.hbase.regionserver.HRegionServer]: Serving as aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776, RPC listening on aws-scottm-tmh6-1.ap-southeast-1.compute.internal/10.147.1.168:8120, sessionid=0x14443974ce70009
[2014-02-18 01:08:40,130][regionserver8120][INFO ][org.apache.hadoop.hbase.regionserver.HRegionServer]: Registered RegionServer MXBean
[2014-02-18 01:08:40,142][SplitLogWorker-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776][INFO ][org.apache.hadoop.hbase.regionserver.SplitLogWorker]: SplitLogWorker aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 starting
[2014-02-18 01:18:52,027][PRI IPC Server handler 0 on 8120][INFO ][org.apache.hadoop.hbase.regionserver.HRegionServer]: Received request to open region: usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae.
[2014-02-18 01:18:52,551][RS_OPEN_REGION-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776-0][INFO ][org.apache.hadoop.hbase.regionserver.HRegion]: Setting up tabledescriptor config now ...
[2014-02-18 01:18:52,584][RS_OPEN_REGION-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776-0][INFO ][org.apache.hadoop.hbase.coprocessor.CoprocessorHost]: System coprocessor org.apache.hadoop.hbase.security.token.TokenProvider was loaded successfully with priority (536870911).
[2014-02-18 01:18:52,591][RS_OPEN_REGION-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776-0][INFO ][org.apache.hadoop.hbase.coprocessor.CoprocessorHost]: System coprocessor org.apache.hadoop.hbase.security.access.AccessController was loaded successfully with priority (536870912).
[2014-02-18 01:18:52,946][StoreOpenerThread-usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae.-1][INFO ][org.apache.hadoop.hbase.regionserver.Store]: time to purge deletes set to 0ms in store family
[2014-02-18 01:18:52,974][StoreFileOpenerThread-family-1][ERROR][org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics]: Inconsistent configuration. Previous configuration for using table name in metrics: true, new configuration: false
[2014-02-18 01:18:52,983][StoreFileOpenerThread-family-1][INFO ][org.apache.hadoop.hbase.util.ChecksumType]: Checksum using org.apache.hadoop.util.PureJavaCrc32
[2014-02-18 01:18:52,983][StoreFileOpenerThread-family-1][INFO ][org.apache.hadoop.hbase.util.ChecksumType]: Checksum can use org.apache.hadoop.util.PureJavaCrc32C
[2014-02-18 01:18:53,305][RS_OPEN_REGION-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776-0][INFO ][org.apache.hadoop.hbase.regionserver.HRegion]: Onlined usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae.; next sequenceid=343
...
[2014-02-18 01:18:53,644][PostOpenDeployTasks:e8be76322f79998d58750d87f268e0ae][INFO ][org.apache.hadoop.hbase.catalog.MetaEditor]: Updated row usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae. with server=aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
[2014-02-18 01:18:53,644][PostOpenDeployTasks:e8be76322f79998d58750d87f268e0ae][INFO ][org.apache.hadoop.hbase.regionserver.HRegionServer]: Done with post open deploy task for region=usertable,,1392704274325.e8be76322f79998d58750d87f268e0ae., daughter=false
[2014-02-18 01:19:10,142][Thread-11][WARN ][org.apache.hadoop.hdfs.DFSClient]: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2419)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2410)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2203)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:480)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

        at org.apache.hadoop.ipc.Client.call(Client.java:1225)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at $Proxy11.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at $Proxy14.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1029)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
{code}
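For reference, this is roughly how I pull such a time window out of a long log file. It is only a rough sketch in Python (the script name and the 2-minute default window are illustrative): it assumes every entry starts with a "[YYYY-MM-DD HH:MM:SS,mmm]" prefix as above, and lines without a timestamp, such as stack traces, are attached to the most recent timestamped entry.

{code}
#!/usr/bin/env python
# Print all log entries whose timestamp falls inside a window around a target time.
# Assumes entries start with "[YYYY-MM-DD HH:MM:SS,mmm]"; continuation lines
# (e.g. stack traces) inherit the previous entry's timestamp.
import re
import sys
from datetime import datetime, timedelta

TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")

def dump_window(log_path, target, minutes=2):
    target_ts = datetime.strptime(target, "%Y-%m-%d %H:%M:%S")
    lo = target_ts - timedelta(minutes=minutes)
    hi = target_ts + timedelta(minutes=minutes)
    current_ts = None
    with open(log_path) as f:
        for line in f:
            m = TS_RE.match(line)
            if m:
                current_ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            if current_ts is not None and lo <= current_ts <= hi:
                sys.stdout.write(line)

if __name__ == "__main__":
    # e.g. ./log_window.py regionserver.log "2014-02-18 01:17:31" 2
    dump_window(sys.argv[1], sys.argv[2], int(sys.argv[3]) if len(sys.argv) > 3 else 2)
{code}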
Best regards

takeshi


2014-02-19 11:26 GMT+08:00 Ted Yu <[email protected]>:

Was a region server running on 10.147.1.168?

If so, can you pastebin the region server log around 2014-02-18 01:17:31?

Thanks


On Tue, Feb 18, 2014 at 9:17 PM, takeshi <[email protected]> wrote:

Hi, all

We also encountered this issue on our POC cluster, and it looks like the issue you were talking about. We still have not found the root cause, but we would like to share some findings here that may help others' investigation.

Environment:
We are using CDH 4.2.1 (hadoop-2.0.0, hbase-0.94.2) as our base distro, with some bugfixes ported from other Apache versions, running on 14 machines with CentOS 5.3.
The error msg we found for the issue:

HBase master reports a "Parent directory doesn't exist" error
{code:title=hbase-master.log}
[2014-02-18 01:20:30,550][master-aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8100,1392703701863.archivedHFileCleaner][WARN ][org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache]: Snapshot directory: hdfs://cluster1/user/SPN-hbase/.snapshot doesn't exist
[2014-02-18 01:20:36,597][IPC Server handler 47 on 8100][ERROR][org.apache.hadoop.hbase.master.HMaster]: Region server aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 reported a fatal error:
ABORTING region server aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776: IOE in log roller
Cause:
java.io.IOException: Exception in createWriter
        at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:66)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:694)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:631)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:750)
        at org.apache.hadoop.hbase.regionserver.wal.HLogFileSystem.createWriter(HLogFileSystem.java:60)
        ... 4 more
Caused by: java.io.IOException: java.io.FileNotFoundException: Parent directory doesn't exist: /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:1726)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1848)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1770)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1747)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:418)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:205)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44068)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)

        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:173)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:747)
        ... 5 more
{code}
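Since the abort comes from the WAL parent directory disappearing, one simple thing we run on the side while reproducing is a small poll that reports whenever the .logs directory appears or goes missing. This is only a rough sketch: it relies on the exit code of "hadoop fs -test -d", and /user/SPN-hbase is the hbase.rootdir in our environment, so adjust the path (and the 5-second interval) for yours.

{code}
#!/usr/bin/env python
# Poll HDFS and report when the HBase WAL parent directory appears or disappears.
# Relies on "hadoop fs -test -d <path>" returning exit code 0 when the directory exists.
import subprocess
import time
from datetime import datetime

WAL_DIR = "/user/SPN-hbase/.logs"  # hbase.rootdir + "/.logs" in our setup

def wal_dir_exists():
    return subprocess.call(["hadoop", "fs", "-test", "-d", WAL_DIR]) == 0

if __name__ == "__main__":
    last = None
    while True:
        exists = wal_dir_exists()
        if exists != last:  # only report state changes
            print("%s %s %s" % (datetime.now().isoformat(), WAL_DIR,
                                "exists" if exists else "MISSING"))
            last = exists
        time.sleep(5)
{code}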
The namenode mkdirs for "/<hbase-root>/.logs/<regionserver>" was requested by principal "hbase/<fqdn>@LAB", but "/<hbase-root>/.logs" was removed later at the request of "hdfs/<fqdn>@LAB"!
{code:title=namenode.log (with hdfs audit enabled)}
2014-02-18 01:08:38,203 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=mkdirs src=/user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776 dst=null perm=hbase:supergroup:rwxr-xr-x
...
2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs dst=null perm=null
...
2014-02-18 01:19:10,132 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hbase/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
2014-02-18 01:19:10,132 INFO org.apache.hadoop.ipc.Server: IPC Server handler 83 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.147.1.168:35703: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/SPN-hbase/.logs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal,8120,1392703714776/aws-scottm-tmh6-1.ap-southeast-1.compute.internal%2C8120%2C1392703714776.1392703718212 File does not exist. Holder DFSClient_NONMAPREDUCE_-2013292460_25 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2419)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2410)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2203)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:480)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1695)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1691)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1689)
{code}
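To spot this more quickly in a long audit log, I use a small scan that reports every delete under the HBase root together with the principal that issued it. Again, this is only a rough sketch: it assumes the "key=value" audit line format shown above, and /user/SPN-hbase is our hbase.rootdir.

{code}
#!/usr/bin/env python
# Scan HDFS namenode audit logs and report delete operations under the HBase root,
# together with the Kerberos principal (ugi=) and client ip that issued them.
import re
import sys

HBASE_ROOT = "/user/SPN-hbase"  # adjust to your hbase.rootdir
FIELD_RE = re.compile(r"(\w+)=(\S+)")

def report_deletes(audit_path):
    with open(audit_path) as f:
        for line in f:
            if "FSNamesystem.audit" not in line or "cmd=delete" not in line:
                continue
            fields = dict(FIELD_RE.findall(line))
            src = fields.get("src", "")
            if src == HBASE_ROOT or src.startswith(HBASE_ROOT + "/"):
                timestamp = line[:23]  # e.g. "2014-02-18 01:17:31,189"
                print("%s delete %s by %s from %s" %
                      (timestamp, src, fields.get("ugi", "?"), fields.get("ip", "?")))

if __name__ == "__main__":
    # e.g. ./audit_deletes.py /var/log/hadoop-hdfs/hdfs-audit.log
    for path in sys.argv[1:]:
        report_deletes(path)
{code}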
Way to reproduce:
Currently we have a way to reproduce this issue: run the ycsb workloads [a-f] against HBase while, on the other side, simply executing "service <hadoop,hbase-daemons> start". We keep running this 10 times in a loop on both sides for one round, and it can usually be reproduced in the first or second round. But I am not sure whether it is related to the issue discussed at the beginning of this thread.


Follow-ups:
Now I am following up on the odd message that appeared in namenode.log:
{code}
2014-02-18 01:17:31,189 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: allowed=true ugi=hdfs/aws-scottm-tmh6-1.ap-southeast-1.compute.internal@LAB (auth:KERBEROS) ip=/10.147.1.168 cmd=delete src=/user/SPN-hbase/.logs dst=null perm=null
{code}
to see what more I can dig out.


I'll keep updating this thread with any findings, and welcome any opinions. Thanks.


Best regards

takeshi
