[ https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460607#comment-16460607 ]
stack commented on HBASE-20503:
-------------------------------

.0001 is a test that reproduces the log pasted in above, only this time there is no hang-up of the WAL sub-system... it seems to keep going. A file whose lease we are currently recovering gets "archived" (the periodic WAL roller thread intercedes; it considers any file whose edits have been synced eligible for rolling), which means the NN does not recover the WAL lease; it fails with FileNotFoundException (default hadoop2, not hadoop3). The log roller should at least wait until the lease is recovered, I'd think. Let me try to find the original log to see if there are other clues as to how this combination locked up the WAL subsystem such that meta could no longer take edits, stuck appending to the WAL.

> [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
> -----------------------------------------------------------------------------------------
>
> Key: HBASE-20503
> URL: https://issues.apache.org/jira/browse/HBASE-20503
> Project: HBase
> Issue Type: Bug
> Components: wal
> Reporter: stack
> Priority: Critical
> Fix For: 2.0.1
>
> Attachments: 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch
>
> Scale test. Startup w/ 30k regions over ~250 nodes. This RS is furiously trying to open regions assigned by the Master. Importantly, it is carrying hbase:meta. Twenty minutes in, meta goes dead after an exception up out of AsyncFSWAL. The process had been restarted so I couldn't get a thread dump. Suspicious is that we archive a WAL and then get an FNFE because we went to access the WAL in its old location. [~Apache9] mind taking a look? Does this FNFE on rolling kill the WAL sub-system? Thanks.
> DFS complaining on file open for a few files getting blocks from remote dead DNs, e.g. {{2018-04-25 10:05:21,506 WARN
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing remote block reader.
> java.net.ConnectException: Connection refused}}
> AsyncFSWAL complaining: "AbstractFSWAL: Slow sync cost: 103 ms".
> About ten minutes in, we get this:
> {code}
> 2018-04-25 10:15:16,532 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:15:16,680 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta with entries=10819, filesize=7.57 MB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta
> 2018-04-25 10:15:16,680 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta
> 2018-04-25 10:15:16,686 WARN org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
> at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:15:16,687 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter: normal close failed, try recover
> java.io.IOException: stream already broken
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:521)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:144)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> We want to roll because we failed sync, we fail writing the trailer because the stream is broken, we fail close for the same reason, then we fall into a block waiting on lease recovery (Should we be trying to recover the lease inline? We fail to recover the lease after 4 seconds; is that ok?).
> We seem to recover fine though. We continue. I see some meta flushes.
> Twenty minutes in, we have a similar event-type for the meta and rsgroup WALs that are on this server. What is interesting is the FileNotFound. We seem to be recovering the lease on a file we just archived.
> {code}
> 2018-04-25 10:24:17,558 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:24:17,579 INFO org.apache.hadoop.hbase.regionserver.HRegion: Onlined 3aa9f94ed073b55c35147eae127b3ebe; next sequenceid=306801
> 2018-04-25 10:24:18,015 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta with entries=21496, filesize=12.68 MB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677057560.meta
> 2018-04-25 10:24:18,016 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta
> 2018-04-25 10:24:18,049 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 with entries=168, filesize=68.67 KB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022
> 2018-04-25 10:24:18,050 WARN org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
> at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:24:18,051 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks for IntegrationTestBigLinkedList_20180418091745,\x1C\x84\xA4z\x07\xF5an,1524068594035.4cad45e6d4430b1a39538a045f2c3909.
> 2018-04-25 10:24:18,051 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks for IntegrationTestBigLinkedList_20180418091745,0\x11\xD3\xC0\x1E\x8F\x8F\x08,1524068594035.3aa9f94ed073b55c35147eae127b3ebe.
> 2018-04-25 10:24:18,050 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724
> 2018-04-25 10:24:18,051 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter: normal close failed, try recover
> java.io.IOException: stream already broken
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:521)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:144)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:24:18,053 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724
> 2018-04-25 10:24:18,058 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Failed to recover lease, attempt=0 on file=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 after 5ms
> 2018-04-25 10:24:20,434 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping HBase metrics system...
> 2018-04-25 10:24:20,436 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system stopped.
> 2018-04-25 10:24:20,940 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
> 2018-04-25 10:24:20,941 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
> 2018-04-25 10:24:20,941 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system started
> 2018-04-25 10:24:22,156 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: close old writer failed
> java.io.FileNotFoundException: File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> {code}
> Here is NN log for the above rsgroup file
> {code}
> 2018-04-25 10:24:18,030 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1096417978_22686485, replicas=10.17.208.15:20002, 10.17.218.44:20002, 10.17.218.35:20002 for /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022
> 2018-04-25 10:24:58,220 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1739935309_1, pending creates: 3], src=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 from client DFSClient_NONMAPREDUCE_1739935309_1
> 2018-04-25 10:24:58,221 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1739935309_1, pending creates: 3], src=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022
> 2018-04-25 10:24:58,221 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 has not been closed. Lease recovery is in progress. RecoveryId = 22686546 for block blk_1096417978_22686485
> 2018-04-25 10:25:00,419 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(oldBlock=BP-245617012-10.17.XXX-XX-XXXX556508602:blk_1096417978_22686485, file=/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022, newgenerationstamp=22686546, newlength=1066, newtargets=[10.17.208.15:20002, 10.17.218.44:20002]) successful
> 2018-04-25 10:25:02,257 INFO org.apache.hadoop.ipc.Server: IPC Server handler 128 on 8020, call Call#23009 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 10.17.208.15:35769: java.io.FileNotFoundException: File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022
> {code}
> Five minutes later same thing happens for the meta file:
> {code}
> 2018-04-25 10:33:28,347 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta with entries=941, filesize=668.65 KB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677608280.meta
> 2018-04-25 10:33:28,347 WARN org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
> at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:33:28,347 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta
> 2018-04-25 10:33:28,348 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter: normal close failed, try recover
> java.io.IOException: stream already broken
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:521)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:144)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:33:28,349 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
> 2018-04-25 10:33:28,353 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Failed to recover lease, attempt=0 on file=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta after 4ms
> 2018-04-25 10:33:28,355 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677057560.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677057560.meta
> 2018-04-25 10:33:28,361 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
> 2018-04-25 10:33:28,405 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677098139 with entries=3, filesize=1.49 KB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677608367
> 2018-04-25 10:33:28,405 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677098139 to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677098139
> 2018-04-25 10:33:30,931 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache: totalSize=328.08 MB, freeSize=2.84 GB, max=3.16 GB, blockCount=2835, accesses=21635, hits=21635, hitRatio=100.00%, , cachingAccesses=21617, cachingHits=21617, cachingHitsRatio=100.00%, evictions=179, evicted=0, evictedPerRun=0.0
> 2018-04-25 10:33:32,359 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: close old writer failed
> java.io.FileNotFoundException: File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Now we seem to have a sick WAL.
> There is this lone exception after 5 minutes with no accompanying attempt at a roll:
> {code}
> 2018-04-25 10:39:12,741 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> meta seems to have stalled taking edits, seemingly because it cannot flush (write its flush marker...)
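The FileNotFoundException above falls out of an ordering problem: recoverLease keeps being retried against a path that the roller has meanwhile renamed into oldWALs, so the close of the old writer can never complete. A toy model of that race (hypothetical names and a simplified in-memory namespace; this is not HBase or HDFS code) behaves the same way:

```java
import java.io.FileNotFoundException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the archive-vs-lease-recovery race described in the log.
// States per path: "open" -> "recovering" -> "closed".
public class WalArchiveRace {
    static final Map<String, String> namespace = new ConcurrentHashMap<>();

    // Like the NN's recoverLease: first call only schedules recovery and
    // reports the file is not yet closed; an unknown path throws FNFE.
    static boolean recoverLease(String path) throws FileNotFoundException {
        String state = namespace.get(path);
        if (state == null) {
            throw new FileNotFoundException("File does not exist: " + path);
        }
        if (state.equals("open")) {
            namespace.put(path, "recovering");
            return false;                       // attempt=0 fails, recovery pending
        }
        return state.equals("closed");
    }

    // Like the roller archiving: rename the WAL out of /hbase/WALs.
    static void archive(String path) {
        String state = namespace.remove(path);
        namespace.put("/hbase/oldWALs/" + path.substring(path.lastIndexOf('/') + 1), state);
    }

    public static void main(String[] args) {
        String wal = "/hbase/WALs/rs1/rs1.meta.1524677098059.meta";
        namespace.put(wal, "open");
        try {
            System.out.println("attempt=0 closed=" + recoverLease(wal));
            archive(wal);                       // roller archives mid-recovery
            recoverLease(wal);                  // retry still targets the OLD path
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage()); // mirrors the NN's FNFE above
        }
    }
}
```

Under this model the retry can never succeed once the rename has happened, which matches stack's suggestion that the roller should hold off archiving until lease recovery on the file has finished.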
> Ten minutes later I get this:
> {code}
> 2018-04-25 10:49:13,051 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region hbase:meta,,1
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
> at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:136)
> at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:718)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:596)
> at org.apache.hadoop.hbase.regionserver.HRegion.doSyncOfUnflushedWALChanges(HRegion.java:2549)
> at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2495)
> at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2353)
> at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2327)
> at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2218)
> at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:497)
> at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:466)
> at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:69)
> at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:252)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> Lots of log about wanting to flush but being unable to, and these:
> 2018-04-25 10:44:12,776 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1524677952741,"responsesize":1558,"method":"Multi","param":"region= hbase:meta,,1, for 1 actions and 1st row key=BUNDLE-REDACTED\\xBF\\x08\\x1E\"\\xE9X.\\x1A,1524000668742.5add42a952519e671face3500922057f.","processingtimems":300034,"client":"10.17.208.13:44534","queuetimems":0,"class":"HRegionServer"}
> On the end is one of these:
> {code}
> 2018-04-25 10:33:32,359 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: close old writer failed
> java.io.FileNotFoundException: File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
> at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
> at org.apache.hadoop.hdfs.DFSClient.recoverLease(DFSClient.java:873)
> at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:290)
> at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:287)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:302)
> at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverLease(FSHDFSUtils.java:283)
> at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:216)
> at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:163)
> at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.recoverAndClose(FanOutOneBlockAsyncDFSOutput.java:555)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:147)
> at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737)
> at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
> at org.apache.hadoop.ipc.Client.call(Client.java:1437)
> at org.apache.hadoop.ipc.Client.call(Client.java:1347)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
> at com.sun.proxy.$Proxy24.recoverLease(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.recoverLease(ClientNamenodeProtocolTranslatorPB.java:683)
> at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy25.recoverLease(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
> at com.sun.proxy.$Proxy26.recoverLease(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.recoverLease(DFSClient.java:871)
> ... 13 more
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)