[ https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-20503:
--------------------------

    Priority: Major  (was: Critical)

> [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-20503
>                 URL: https://issues.apache.org/jira/browse/HBASE-20503
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: stack
>            Priority: Major
>         Attachments: 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch, 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch
>
> Scale test. Startup with 30k regions over ~250 nodes. This RS is furiously trying to open regions assigned by the Master. Importantly, it is carrying hbase:meta. Twenty minutes in, meta goes dead after an exception up out of AsyncFSWAL. The process had been restarted, so I couldn't get a thread dump. What is suspicious is that we archive a WAL and then get an FNFE because we go to access the WAL at its old location. [~Apache9], mind taking a look? Does this FNFE during rolling kill the WAL sub-system? Thanks.
> DFS complains on file open for a few files getting blocks from remote, dead DNs, e.g. {{2018-04-25 10:05:21,506 WARN org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing remote block reader. java.net.ConnectException: Connection refused}}
> AsyncFSWAL complains: "AbstractFSWAL: Slow sync cost: 103 ms".
> About ten minutes in, we get this:
> {code}
> 2018-04-25 10:15:16,532 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:15:16,680 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta with entries=10819, filesize=7.57 MB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta
> 2018-04-25 10:15:16,680 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta
> 2018-04-25 10:15:16,686 WARN org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
>     at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:15:16,687 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter: normal close failed, try recover
> java.io.IOException: stream already broken
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:521)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:144)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> We want to roll because we failed a sync; we fail writing the trailer because the stream is broken; we fail the close for the same reason; and then we block waiting on lease recovery. (Should we be trying to recover the lease inline? We fail to recover the lease after 4 seconds, is that ok?)
> We seem to recover fine though. We continue. I see some meta flushes.
> Twenty minutes in, we have a similar event for the meta and rsgroup WALs that are on this server. What is interesting is the FileNotFound. We seem to be recovering the lease on a file we just archived.
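> On the inline lease-recovery question above, for reference: bounded inline recovery is essentially a retry loop around DistributedFileSystem#recoverLease, which returns true once the NameNode has closed the file. A minimal sketch, illustrative only; the class name, deadline, and retry-interval parameters are made up here, and this is not the actual FSHDFSUtils code:
> {code}
> import java.io.IOException;
>
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> /** Illustrative only: bounded, inline lease recovery on an HDFS file. */
> public final class LeaseRecoverySketch {
>
>   /**
>    * Ask the NameNode to recover the lease on {@code path}, retrying until it reports the
>    * file closed or {@code deadlineMs} elapses.
>    * @return true if the lease was recovered within the deadline
>    */
>   public static boolean recoverLeaseInline(DistributedFileSystem dfs, Path path,
>       long deadlineMs, long retryIntervalMs) throws IOException, InterruptedException {
>     long start = System.currentTimeMillis();
>     while (true) {
>       // recoverLease() returns true once the file is closed and the lease is released.
>       if (dfs.recoverLease(path)) {
>         return true;
>       }
>       long elapsed = System.currentTimeMillis() - start;
>       if (elapsed >= deadlineMs) {
>         // Caller decides whether giving up here should fail the roll or just be logged,
>         // as in the "Failed to recover lease, attempt=N" messages in the logs.
>         return false;
>       }
>       Thread.sleep(Math.min(retryIntervalMs, deadlineMs - elapsed));
>     }
>   }
>
>   private LeaseRecoverySketch() {
>   }
> }
> {code}
> The "Failed to recover lease, attempt=0" lines in the logs are presumably the first pass through a loop of this kind; whether giving up after only a few seconds is safe is the open question above. The log excerpt for the event just described follows: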
> {code}
> 2018-04-25 10:24:17,558 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:24:17,579 INFO org.apache.hadoop.hbase.regionserver.HRegion: Onlined 3aa9f94ed073b55c35147eae127b3ebe; next sequenceid=306801
> 2018-04-25 10:24:18,015 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta with entries=21496, filesize=12.68 MB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677057560.meta
> 2018-04-25 10:24:18,016 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta
> 2018-04-25 10:24:18,049 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 with entries=168, filesize=68.67 KB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022
> 2018-04-25 10:24:18,050 WARN org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
>     at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:24:18,051 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks for IntegrationTestBigLinkedList_20180418091745,\x1C\x84\xA4z\x07\xF5an,1524068594035.4cad45e6d4430b1a39538a045f2c3909.
> 2018-04-25 10:24:18,051 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks for IntegrationTestBigLinkedList_20180418091745,0\x11\xD3\xC0\x1E\x8F\x8F\x08,1524068594035.3aa9f94ed073b55c35147eae127b3ebe.
> 2018-04-25 10:24:18,050 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724
> 2018-04-25 10:24:18,051 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter: normal close failed, try recover
> java.io.IOException: stream already broken
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:521)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:144)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:24:18,053 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724
> 2018-04-25 10:24:18,058 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Failed to recover lease, attempt=0 on file=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 after 5ms
> 2018-04-25 10:24:20,434 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping HBase metrics system...
> 2018-04-25 10:24:20,436 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system stopped.
> 2018-04-25 10:24:20,940 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
> 2018-04-25 10:24:20,941 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
> 2018-04-25 10:24:20,941 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system started
> 2018-04-25 10:24:22,156 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: close old writer failed
> java.io.FileNotFoundException: File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> {code}
> Here is the NN log for the above rsgroup file:
> {code}
> 2018-04-25 10:24:18,030 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1096417978_22686485, replicas=10.17.208.15:20002, 10.17.218.44:20002, 10.17.218.35:20002 for /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022
> 2018-04-25 10:24:58,220 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease. Holder: DFSClient_NONMAPREDUCE_1739935309_1, pending creates: 3], src=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 from client DFSClient_NONMAPREDUCE_1739935309_1
> 2018-04-25 10:24:58,221 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder: DFSClient_NONMAPREDUCE_1739935309_1, pending creates: 3], src=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022
> 2018-04-25 10:24:58,221 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 has not been closed. Lease recovery is in progress. RecoveryId = 22686546 for block blk_1096417978_22686485
> 2018-04-25 10:25:00,419 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(oldBlock=BP-245617012-10.17.XXX-XX-XXXX556508602:blk_1096417978_22686485, file=/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022, newgenerationstamp=22686546, newlength=1066, newtargets=[10.17.208.15:20002, 10.17.218.44:20002]) successful
> 2018-04-25 10:25:02,257 INFO org.apache.hadoop.ipc.Server: IPC Server handler 128 on 8020, call Call#23009 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from 10.17.208.15:35769: java.io.FileNotFoundException: File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022
> {code}
> Five minutes later, the same thing happens for the meta file:
> {code}
> 2018-04-25 10:33:28,347 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta with entries=941, filesize=668.65 KB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677608280.meta
> 2018-04-25 10:33:28,347 WARN org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
> java.io.IOException: stream already broken
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201)
>     at org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:33:28,347 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta
> 2018-04-25 10:33:28,348 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter: normal close failed, try recover
> java.io.IOException: stream already broken
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:521)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:144)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2018-04-25 10:33:28,349 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
> 2018-04-25 10:33:28,353 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Failed to recover lease, attempt=0 on file=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta after 4ms
> 2018-04-25 10:33:28,355 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677057560.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677057560.meta
> 2018-04-25 10:33:28,361 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
> 2018-04-25 10:33:28,405 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677098139 with entries=3, filesize=1.49 KB; new WAL /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677608367
> 2018-04-25 10:33:28,405 INFO org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677098139 to hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677098139
> 2018-04-25 10:33:30,931 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache: totalSize=328.08 MB, freeSize=2.84 GB, max=3.16 GB, blockCount=2835, accesses=21635, hits=21635, hitRatio=100.00%, , cachingAccesses=21617, cachingHits=21617, cachingHitsRatio=100.00%, evictions=179, evicted=0, evictedPerRun=0.0
> 2018-04-25 10:33:32,359 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: close old writer failed
> java.io.FileNotFoundException: File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> {code}
> Now we seem to have a sick WAL.
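> The FileNotFoundException above looks like an ordering problem: the close of the old writer falls back to lease recovery on the old WAL path, but the roller has already renamed that file into /hbase/oldWALs. A minimal sketch of the suspected interleaving; the class and method names here are hypothetical and this is not HBase's actual roll/archive code:
> {code}
> import java.io.FileNotFoundException;
> import java.io.IOException;
>
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hdfs.DistributedFileSystem;
>
> /** Illustrative only: the suspected archive-vs-recover race behind the FNFE above. */
> public final class ArchiveVsRecoverSketch {
>
>   /** Archiver side: move the finished WAL out of /hbase/WALs into /hbase/oldWALs. */
>   static void archiveWal(DistributedFileSystem dfs, Path walPath, Path oldWalsDir) throws IOException {
>     Path target = new Path(oldWalsDir, walPath.getName());
>     // After this rename succeeds, walPath no longer exists under /hbase/WALs.
>     if (!dfs.rename(walPath, target)) {
>       throw new IOException("Failed to archive " + walPath + " to " + target);
>     }
>   }
>
>   /** Close side: the normal close failed ("stream already broken"), so fall back to lease recovery. */
>   static void recoverAndClose(DistributedFileSystem dfs, Path walPath) throws IOException {
>     try {
>       // If archiveWal() already ran for this file, the NameNode answers recoverLease with
>       // FileNotFoundException, which is the exact stack seen in the RS log above.
>       dfs.recoverLease(walPath);
>     } catch (FileNotFoundException fnfe) {
>       // One possible mitigation: treat "already archived" as "already closed" and move on,
>       // rather than letting the close path fail.
>       if (!dfs.exists(walPath)) {
>         return;
>       }
>       throw fnfe;
>     }
>   }
>
>   private ArchiveVsRecoverSketch() {
>   }
> }
> {code}
> Whether the archive should be allowed to run before the old writer's recover-and-close has finished is the real question; the sketch only shows why the recoverLease call sees a missing file.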
> There is this lone exception after 5 minutes, with no accompanying attempt at a roll:
> {code}
> 2018-04-25 10:39:12,741 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> meta seems to have stalled taking edits, apparently because it cannot flush (it cannot write its flush marker...). Ten minutes later I get this:
> {code}
> 2018-04-25 10:49:13,051 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region hbase:meta,,1
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
>     at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:136)
>     at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:718)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:596)
>     at org.apache.hadoop.hbase.regionserver.HRegion.doSyncOfUnflushedWALChanges(HRegion.java:2549)
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2495)
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2353)
>     at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2327)
>     at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2218)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:497)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:466)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:69)
>     at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:252)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> Lots of logging about wanting to flush but not being able to, plus these:
> 2018-04-25 10:44:12,776 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1524677952741,"responsesize":1558,"method":"Multi","param":"region= hbase:meta,,1, for 1 actions and 1st row key=BUNDLE-REDACTED\\xBF\\x08\\x1E\"\\xE9X.\\x1A,1524000668742.5add42a952519e671face3500922057f.","processingtimems":300034,"client":"10.17.208.13:44534","queuetimems":0,"class":"HRegionServer"}
> At the end is one of these:
> {code}
> 2018-04-25 10:33:32,359 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: close old writer failed
> java.io.FileNotFoundException: File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
>
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>     at org.apache.hadoop.hdfs.DFSClient.recoverLease(DFSClient.java:873)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:290)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:287)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:302)
>     at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverLease(FSHDFSUtils.java:283)
>     at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:216)
>     at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:163)
>     at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.recoverAndClose(FanOutOneBlockAsyncDFSOutput.java:555)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:147)
>     at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
>
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1437)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1347)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>     at com.sun.proxy.$Proxy24.recoverLease(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.recoverLease(ClientNamenodeProtocolTranslatorPB.java:683)
>     at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>     at com.sun.proxy.$Proxy25.recoverLease(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
>     at com.sun.proxy.$Proxy26.recoverLease(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.recoverLease(DFSClient.java:871)
>     ... 13 more
> {code}
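> For reference, the 300000 ms in the TimeoutIOException above is just a bounded wait on a sync future: the flush blocks until the WAL consumer completes the sync for its txid, and if nothing ever completes it, the wait expires with the "WAL system stuck?" message. A minimal sketch of that pattern, using plain java.util.concurrent rather than HBase's SyncFuture class:
> {code}
> import java.io.IOException;
> import java.io.InterruptedIOException;
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> /** Illustrative only: how a flush-side wait turns a wedged WAL into a timeout. */
> public final class SyncWaitSketch {
>
>   /** Wait up to {@code timeoutMs} for the sync covering {@code txid} to complete. */
>   static void blockOnSync(CompletableFuture<Long> syncFuture, long txid, long timeoutMs)
>       throws IOException {
>     try {
>       syncFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
>     } catch (TimeoutException e) {
>       // Nobody completed the future: the consumer thread is wedged or the writer is broken,
>       // so flushes (and the memstore behind them) back up behind this wait.
>       throw new IOException("Failed to get sync result after " + timeoutMs + " ms for txid="
>           + txid + ", WAL system stuck?", e);
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>       throw (IOException) new InterruptedIOException("interrupted waiting on sync").initCause(e);
>     } catch (ExecutionException e) {
>       throw new IOException("sync failed", e.getCause());
>     }
>   }
>
>   private SyncWaitSketch() {
>   }
> }
> {code}
> If the consume/close path above really is stuck, for example blocked behind the failed close of the old writer, every such wait will expire the same way, which would explain meta taking no edits and the flush failing after exactly 300000 ms.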