[ https://issues.apache.org/jira/browse/HBASE-20503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365637#comment-17365637 ]
Emil Kleszcz edited comment on HBASE-20503 at 6/19/21, 1:32 PM:
----------------------------------------------------------------
Hi, we experienced the same issue in HBase 2.3.4 on one of our production clusters this week. It happened a few weeks after upgrading from HBase 2.2.4, where we had never observed this problem. We run on HDP 3.2.1. On average we have around 800 regions per RS, and the workload in those days was nothing unusual. The problem started on the RS hosting the meta region. We observed the following in the RS log:
{code:java}
<2021-06-15T10:31:28.284+0200> <INFO> <wal.AbstractFSWAL>: <Slow sync cost: 134 ms (...)
<2021-06-15T11:09:00.406+0200> <WARN> <wal.AbstractProtobufLogWriter>: <Failed to write trailer, non-fatal, continuing...>
java.io.IOException: stream already broken
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:420)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:509)
(...)
<2021-06-15T11:11:39.744+0200> <WARN> <wal.AsyncFSWAL>: <close old writer failed>
java.io.FileNotFoundException: File does not exist: /hbase/WALs/ (...)
<2021-06-15T11:15:59.241+0200> <WARN> <asyncfs.FanOutOneBlockAsyncDFSOutputHelper>: <create fan-out dfs output /hbase/WALs
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/WALs (...)
<2021-06-15T11:31:42.845+0200> <WARN> <wal.AsyncFSWAL>: <sync failed>
java.io.IOException: stream already broken (...)
<2021-06-15T11:39:39.986+0200> <ERROR> <regionserver.CompactSplit>: <Compaction failed Request=regionName=XXX,00309759-01086330590,1526681161957.b6620a3b9d0c2e32ab097a84939c6975., storeName=r, fileCount=2, fileSize=2.8 G (2.8 G, 3.6 M), priority=298, time=1623749480843>
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=54056852, WAL system stuck? (...)
<2021-06-16T18:43:05.220+0200> <ERROR> <regionserver.CompactSplit>: <Compaction failed region=hbase:meta,,1.1588230740, storeName=1588230740/info, priority=294, startTime=1623861481313>
{code}
From then on, compactions started failing on many regions, including meta. Two days later one RS went down.
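The TimeoutIOException above ("Failed to get sync result after 300000 ms ..., WAL system stuck?") is what a flush or compaction thread reports when it blocks on a WAL sync future that the WAL's consume loop never completes once the underlying output stream is broken and a clean roll cannot finish. Below is a minimal, self-contained sketch of that failure mode in plain Java; it is not HBase code, and every class, field, and method name in it is made up for illustration:
{code:java}
import java.util.concurrent.*;

/**
 * Toy model (not HBase code): callers append to a "WAL" and block on a sync
 * future with a timeout. Once the writer hits a broken-stream error it stops
 * acking syncs, so later callers time out, analogous to
 * "Failed to get sync result after 300000 ms ... WAL system stuck?".
 */
public class StuckWalSketch {
  private final BlockingQueue<CompletableFuture<Void>> pendingSyncs = new LinkedBlockingQueue<>();
  private volatile boolean streamBroken = false;

  /** Single consume thread: drains appends and acks syncs, like the WAL's event loop. */
  void startConsumer() {
    Thread t = new Thread(() -> {
      try {
        while (true) {
          CompletableFuture<Void> f = pendingSyncs.take();
          if (!streamBroken) {
            f.complete(null);           // normal case: the sync is acked
          }
          // broken case: the future is simply never completed
        }
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    });
    t.setDaemon(true);
    t.start();
  }

  /** What a flush or compaction effectively does: append an edit, then wait for the sync. */
  void appendAndSync(long timeoutMs) throws InterruptedException {
    CompletableFuture<Void> syncFuture = new CompletableFuture<>();
    pendingSyncs.put(syncFuture);
    try {
      syncFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
      System.out.println("sync ok");
    } catch (ExecutionException e) {
      System.out.println("sync failed: " + e.getCause());
    } catch (TimeoutException e) {
      System.out.println("Failed to get sync result after " + timeoutMs + " ms, WAL system stuck?");
    }
  }

  public static void main(String[] args) throws Exception {
    StuckWalSketch wal = new StuckWalSketch();
    wal.startConsumer();
    wal.appendAndSync(1000);   // completes normally
    wal.streamBroken = true;   // simulate "java.io.IOException: stream already broken"
    wal.appendAndSync(1000);   // times out, like the CompactSplit/MemStoreFlusher errors above
  }
}
{code}
Running it prints one successful sync followed by the timeout message, which is essentially the shape of the CompactSplit and MemStoreFlusher failures quoted in the logs here.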
These RS-side failures triggered an avalanche of stuck procedures in the HMaster:
{code:java}
<2021-06-17T08:53:13.862+0200> <INFO> <procedure.MasterProcedureScheduler>: <Took xlock for pid=2293949, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=XXX, parent=b31a3040431b34256e265cd6c5a0c4e6, daughterA=502a98e2851e9e3036b746ab7a1f1371, daughterB=89429106a89ff859e396d22c1977798f>
<2021-06-17T08:53:13.862+0200> <INFO> <procedure.MasterProcedureScheduler>: <Took xlock for pid=2293949, state=RUNNABLE:SPLIT_TABLE_REGION_PREPARE; SplitTableRegionProcedure table=XXX, parent=b31a3040431b34256e265cd6c5a0c4e6, daughterA=502a98e2851e9e3036b746ab7a1f1371, daughterB=89429106a89ff859e396d22c1977798f>
<2021-06-17T08:53:13.866+0200> <INFO> <procedure2.ProcedureExecutor>: <Initialized subprocedures=[{pid=2293950, ppid=2293949, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=XXX, region=b31a3040431b34256e265cd6c5a0c4e6, UNASSIGN}]>
<2021-06-17T08:53:13.867+0200> <INFO> <procedure.MasterProcedureScheduler>: <Took xlock for pid=2293950, ppid=2293949, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=XXX, region=b31a3040431b34256e265cd6c5a0c4e6, UNASSIGN>
<2021-06-17T08:53:13.867+0200> <INFO> <assignment.RegionStateStore>: <pid=2293950 updating hbase:meta row=b31a3040431b34256e265cd6c5a0c4e6, regionState=CLOSING, regionLocation=XXX,16020,1623644749045>
<2021-06-17T08:53:28.443+0200> <WARN> <procedure2.ProcedureExecutor>: <Worker stuck PEWorker-7(pid=2293950), run time 14.5760 sec>
{code}
With HA enabled, the HMasters started flipping over, and we saw more and more RITs stuck in OPENING and CLOSING states pointing to stale RSs (old timestamps or null). Only a manual fix (forcing the states of the affected tables and regions) recovered the cluster. I hope we have a patch for this soon.
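For context, the "Worker stuck PEWorker-7(pid=2293950), run time 14.5760 sec" line above is the procedure executor's stuck-worker check firing: a worker records when it picked up its current procedure, and a monitor periodically warns about any worker that has been on the same procedure for longer than a threshold, which is what happens once region-state updates against hbase:meta stop completing. A rough, illustrative sketch of that detection pattern follows (plain Java, not the actual ProcedureExecutor implementation; the names and the threshold are invented):
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

/**
 * Toy sketch of a stuck-worker check: workers note when they start a task,
 * and a monitor warns when a worker has been on the same task for too long.
 * This mirrors the "Worker stuck PEWorker-N, run time X sec" log above, but
 * is illustrative only, not the real ProcedureExecutor code.
 */
public class StuckWorkerSketch {
  static class Worker extends Thread {
    volatile long taskStartMillis = -1;   // -1 means idle
    volatile long currentPid = -1;
    private final CountDownLatch blockedOnMeta;

    Worker(String name, CountDownLatch blockedOnMeta) {
      super(name);
      this.blockedOnMeta = blockedOnMeta;
    }

    @Override public void run() {
      // Simulate executing one procedure that blocks on a meta update
      // which never completes because meta's WAL is stuck.
      currentPid = 2293950L;
      taskStartMillis = System.currentTimeMillis();
      try {
        blockedOnMeta.await();            // never released in this sketch
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    final long stuckThresholdMillis = 10_000;   // illustrative threshold
    List<Worker> workers = new CopyOnWriteArrayList<>();
    CountDownLatch metaUpdateNeverFinishes = new CountDownLatch(1);

    Worker w = new Worker("PEWorker-7", metaUpdateNeverFinishes);
    workers.add(w);
    w.setDaemon(true);
    w.start();

    // Monitor loop: warn about any worker stuck on one task past the threshold.
    for (int i = 0; i < 3; i++) {
      Thread.sleep(5_000);
      long now = System.currentTimeMillis();
      for (Worker worker : workers) {
        long start = worker.taskStartMillis;
        if (start > 0 && now - start > stuckThresholdMillis) {
          System.out.printf("Worker stuck %s(pid=%d), run time %.4f sec%n",
              worker.getName(), worker.currentPid, (now - start) / 1000.0);
        }
      }
    }
  }
}
{code}
The point of the warning is only that a worker has been executing one procedure for suspiciously long; the worker thread itself is still alive, which is why the warnings repeat while the procedures never finish.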
> [AsyncFSWAL] Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
> -----------------------------------------------------------------------------------------
>
> Key: HBASE-20503
> URL: https://issues.apache.org/jira/browse/HBASE-20503
> Project: HBase
> Issue Type: Bug
> Components: wal
> Reporter: Michael Stack
> Priority: Major
> Attachments: 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch, 0001-HBASE-20503-AsyncFSWAL-Failed-to-get-sync-result-aft.patch
>
> Scale test. Startup w/ 30k regions over ~250 nodes. This RS is furiously trying to open the regions assigned to it by the Master. Importantly, it is carrying hbase:meta. Twenty minutes in, meta goes dead after an exception up out of AsyncFSWAL. The process had been restarted, so I couldn't get a thread dump. What is suspicious is that we archive a WAL and then get an FNFE because we go to access the WAL at its old location. [~Apache9] mind taking a look? Does this FNFE during the roll kill the WAL sub-system?
Thanks. > DFS complaining on file open for a few files getting blocks from remote dead > DNs: e.g. {{2018-04-25 10:05:21,506 WARN > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory: I/O error constructing > remote block reader. > java.net.ConnectException: Connection refused}} > AsyncFSWAL complaining: "AbstractFSWAL: Slow sync cost: 103 ms" . > About ten minutes in, we get this: > {code} > 2018-04-25 10:15:16,532 WARN > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed > java.io.IOException: stream already broken > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513) > > > > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2018-04-25 10:15:16,680 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta > with entries=10819, filesize=7.57 MB; new WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta > 2018-04-25 10:15:16,680 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving > hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta > to > hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524675848653.meta > 2018-04-25 10:15:16,686 WARN > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to > write trailer, non-fatal, continuing... 
> java.io.IOException: stream already broken > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201) > at > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2018-04-25 10:15:16,687 WARN > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter: normal close > failed, try recover > java.io.IOException: stream already broken > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:521) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:144) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} > We want to roll because we failed sync, we fail writing trailer because > stream is broke, we fail close for same reason, then we fall into a block on > waiting on lease recovery (Should be trying to recover the lease inline? We > fail to recover the lease after 4 seconds, that ok?). > We seem to recover fine though. We continue. I see some meta flushes. > Twenty minutes in, we have similar event-type for the meta and rsgroup wals > that are on this server. What is interesting is the FileNotFound. We seem to > be recovering lease on a file we just archived. 
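That ordering matters: the roll archives the old WAL into /hbase/oldWALs almost immediately, while the failed close of the old writer falls back to lease recovery against the original /hbase/WALs path, which by then has already been moved away, hence the FileNotFoundException in the logs below. The following is a self-contained toy model of that race in plain Java; a ConcurrentHashMap stands in for the HDFS namespace, and none of the names are real HBase or HDFS APIs:
{code:java}
import java.io.FileNotFoundException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

/**
 * Toy model of the archive-vs-lease-recovery race: one thread "archives" the
 * old WAL (moves it from WALs/ to oldWALs/) while another thread, handling the
 * failed close of the old writer, tries to recover the lease on the original
 * WALs/ path and hits FileNotFoundException. Nothing here is real HBase or
 * HDFS code; paths and names are illustrative.
 */
public class ArchiveVsRecoverSketch {
  // path -> contents; stands in for the filesystem namespace
  static final Map<String, String> namespace = new ConcurrentHashMap<>();

  static void recoverLease(String path) throws FileNotFoundException {
    if (!namespace.containsKey(path)) {
      throw new FileNotFoundException("File does not exist: " + path);
    }
    // real recovery would talk to the NameNode; existence is enough here
  }

  public static void main(String[] args) throws Exception {
    String oldWal = "/hbase/WALs/rs1/rs1.regiongroup-0.1524676516724";
    String archivedPath = "/hbase/oldWALs/rs1.regiongroup-0.1524676516724";
    namespace.put(oldWal, "wal-bytes");
    CountDownLatch archiveDone = new CountDownLatch(1);

    // Thread 1: the roll finishes and the archiver moves the WAL to oldWALs.
    Thread archiver = new Thread(() -> {
      namespace.put(archivedPath, namespace.remove(oldWal));
      System.out.println("Archiving " + oldWal + " to " + archivedPath);
      archiveDone.countDown();
    });

    // Thread 2: "normal close failed, try recover" -> lease recovery on the
    // original path, which loses the race with the archiver.
    Thread closer = new Thread(() -> {
      try {
        archiveDone.await();              // force the losing interleaving
        recoverLease(oldWal);
        System.out.println("lease recovered on " + oldWal);
      } catch (FileNotFoundException | InterruptedException e) {
        System.out.println("close old writer failed: " + e);
      }
    });

    archiver.start();
    closer.start();
    archiver.join();
    closer.join();
  }
}
{code}
In the sketch the losing interleaving is forced with a latch; on the real cluster it is simply a timing race between the log archiver and the recover-and-close path of the old writer.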
> {code} > 2018-04-25 10:24:17,558 WARN > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed > java.io.IOException: stream already broken > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547) > > > > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > > > > at java.lang.Thread.run(Thread.java:748) > 2018-04-25 10:24:17,579 INFO org.apache.hadoop.hbase.regionserver.HRegion: > Onlined 3aa9f94ed073b55c35147eae127b3ebe; next sequenceid=306801 > 2018-04-25 10:24:18,015 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta > with entries=21496, filesize=12.68 MB; new WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677057560.meta > 2018-04-25 10:24:18,016 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving > hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta > to > hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676253923.meta > 2018-04-25 10:24:18,049 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 > with entries=168, filesize=68.67 KB; new WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 > 2018-04-25 10:24:18,050 WARN > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to > write trailer, non-fatal, continuing... 
> java.io.IOException: stream already broken > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201) > at > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2018-04-25 10:24:18,051 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks > for > IntegrationTestBigLinkedList_20180418091745,\x1C\x84\xA4z\x07\xF5an,1524068594035.4cad45e6d4430b1a39538a045f2c3909. > 2018-04-25 10:24:18,051 INFO > org.apache.hadoop.hbase.regionserver.HRegionServer: Post open deploy tasks > for > IntegrationTestBigLinkedList_20180418091745,0\x11\xD3\xC0\x1E\x8F\x8F\x08,1524068594035.3aa9f94ed073b55c35147eae127b3ebe. > 2018-04-25 10:24:18,050 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving > hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 > to > hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 > 2018-04-25 10:24:18,051 WARN > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter: normal close > failed, try recover > java.io.IOException: stream already broken > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:521) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:144) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2018-04-25 10:24:18,053 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: > Recover lease on dfs file > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 > 2018-04-25 10:24:18,058 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Failed > to recover lease, attempt=0 on > 
file=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 > after 5ms > 2018-04-25 10:24:20,434 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping HBase metrics > system... > 2018-04-25 10:24:20,436 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system > stopped. > 2018-04-25 10:24:20,940 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: > loaded properties from hadoop-metrics2.properties > 2018-04-25 10:24:20,941 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot > period at 10 second(s). > 2018-04-25 10:24:20,941 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: HBase metrics system > started > 2018-04-25 10:24:22,156 WARN > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: close old writer failed > java.io.FileNotFoundException: File does not exist: > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524676516724 > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > {code} > Here is NN log for the above rsgroup file > {code} > 2018-04-25 10:24:18,030 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* > allocate blk_1096417978_22686485, replicas=10.17.208.15:20002, > 10.17.218.44:20002, 10.17.218.35:20002 for > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 > 2018-04-25 10:24:58,220 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease. > Holder: DFSClient_NONMAPREDUCE_1739935309_1, pending creates: 3], > src=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 > from client DFSClient_NONMAPREDUCE_1739935309_1 > 2018-04-25 10:24:58,221 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. 
> Holder: DFSClient_NONMAPREDUCE_1739935309_1, pending creates: 3], > src=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 > 2018-04-25 10:24:58,221 WARN org.apache.hadoop.hdfs.StateChange: DIR* > NameSystem.internalReleaseLease: File > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 > has not been closed. Lease recovery is in progress. RecoveryId = 22686546 > for block blk_1096417978_22686485 > 2018-04-25 10:25:00,419 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > commitBlockSynchronization(oldBlock=BP-245617012-10.17.XXX-XX-XXXX556508602:blk_1096417978_22686485, > > file=/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022, > newgenerationstamp=22686546, newlength=1066, newtargets=[10.17.208.15:20002, > 10.17.218.44:20002]) successful > 2018-04-25 10:25:02,257 INFO org.apache.hadoop.ipc.Server: IPC Server handler > 128 on 8020, call Call#23009 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from > 10.17.208.15:35769: java.io.FileNotFoundException: File does not exist: > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677058022 > {code} > Five minutes later same thing happens for the meta file: > {code} > 2018-04-25 10:33:28,347 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta > with entries=941, filesize=668.65 KB; new WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677608280.meta > 2018-04-25 10:33:28,347 WARN > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter: Failed to > write trailer, non-fatal, continuing... 
> java.io.IOException: stream already broken > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424) > > > > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513) > > > > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.lambda$writeWALTrailerAndMagic$3(AsyncProtobufLogWriter.java:210) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.write(AsyncProtobufLogWriter.java:166) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.writeWALTrailerAndMagic(AsyncProtobufLogWriter.java:201) > at > org.apache.hadoop.hbase.regionserver.wal.AbstractProtobufLogWriter.writeWALTrailer(AbstractProtobufLogWriter.java:233) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:143) > > > > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2018-04-25 10:33:28,347 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving > hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta > to > hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524676516535.meta > > 2018-04-25 10:33:28,348 WARN > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter: normal close > failed, try recover > java.io.IOException: stream already broken > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.endBlock(FanOutOneBlockAsyncDFSOutput.java:521) > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.close(FanOutOneBlockAsyncDFSOutput.java:565) > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:144) > > > > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2018-04-25 10:33:28,349 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: > Recover lease on dfs file > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta > 2018-04-25 10:33:28,353 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Failed > to recover lease, attempt=0 on > file=/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta > after 4ms > 2018-04-25 10:33:28,355 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving > hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677057560.meta > to > hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677057560.meta > 2018-04-25 10:33:28,361 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving > hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta > to > 
hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta > 2018-04-25 10:33:28,405 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Rolled WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677098139 > with entries=3, filesize=1.49 KB; new WAL > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677608367 > 2018-04-25 10:33:28,405 INFO > org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL: Archiving > hdfs://ns1/hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677098139 > to > hdfs://ns1/hbase/oldWALs/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.vc0205.halxg.cloudera.com%2C22101%2C1524675808073.regiongroup-0.1524677098139 > 2018-04-25 10:33:30,931 INFO org.apache.hadoop.hbase.io.hfile.LruBlockCache: > totalSize=328.08 MB, freeSize=2.84 GB, max=3.16 GB, blockCount=2835, > accesses=21635, hits=21635, hitRatio=100.00%, , cachingAccesses=21617, > cachingHits=21617, cachingHitsRatio=100.00%, evictions=179, evicted=0, > evictedPerRun=0.0 > 2018-04-25 10:33:32,359 WARN > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: close old writer failed > java.io.FileNotFoundException: File does not exist: > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > {code} > Now we seem to have a sick WAL. 
> There is this lone exception after 5 minutes with no accompanying attempt at a roll:
> {code}
> 2018-04-25 10:39:12,741 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:424)
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:513)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:134)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:364)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:547)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> meta seems to have stalled taking edits, seemingly because it cannot flush (write its flush marker...). Ten minutes later I get this:
> {code}
> 2018-04-25 10:49:13,051 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region hbase:meta,,1
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=160912, WAL system stuck?
>         at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:136)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:718)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:596)
>         at org.apache.hadoop.hbase.regionserver.HRegion.doSyncOfUnflushedWALChanges(HRegion.java:2549)
>         at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2495)
>         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2353)
>         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2327)
>         at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2218)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:497)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:466)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:69)
>         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:252)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> There is lots of log about wanting to flush but not being able to, plus entries like this:
> 2018-04-25 10:44:12,776 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$MultiRequest)","starttimems":1524677952741,"responsesize":1558,"method":"Multi","param":"region= hbase:meta,,1, for 1 actions and 1st row key=BUNDLE-REDACTED\\xBF\\x08\\x1E\"\\xE9X.\\x1A,1524000668742.5add42a952519e671face3500922057f.","processingtimems":300034,"client":"10.17.208.13:44534","queuetimems":0,"class":"HRegionServer"}
> On the end is one of these:
> {code}
> 2018-04-25 10:33:32,359 WARN org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL: close old writer failed
> java.io.FileNotFoundException: File does not exist: /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta
>         at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85) > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > > > > > at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > > > > > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > > > > > at > java.security.AccessController.doPrivileged(Native Method) > > > > > at javax.security.auth.Subject.doAs(Subject.java:422) > > > > > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > > > > > > > > > > at > sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > > > > > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > > > > > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > > > > at > java.lang.reflect.Constructor.newInstance(Constructor.java:423) > > > > > at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121) > > > > > at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88) > > > > > at > org.apache.hadoop.hdfs.DFSClient.recoverLease(DFSClient.java:873) > > > > > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:290) > > > > > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:287) > > > > > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > > > > > at > org.apache.hadoop.hdfs.DistributedFileSystem.recoverLease(DistributedFileSystem.java:302) > > > > > at > org.apache.hadoop.hbase.util.FSHDFSUtils.recoverLease(FSHDFSUtils.java:283) > > > > > at > org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:216) > > > > > at > org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:163) > > > > > at > org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.recoverAndClose(FanOutOneBlockAsyncDFSOutput.java:555) > > > > at > org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.close(AsyncProtobufLogWriter.java:147) > > > > at > org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.lambda$executeClose$8(AsyncFSWAL.java:742) > > > > > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > > > > > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > > > > > at java.lang.Thread.run(Thread.java:748) > > > > > Caused by: > org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File > does not exist: > /hbase/WALs/vc0205.halxg.cloudera.com,22101,1524675808073/vc0205.halxg.cloudera.com%2C22101%2C1524675808073.meta.1524677098059.meta > > > at > 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85) > > > > > at > org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75) > > > > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2490) > > > > > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:809) > > > > > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:737) > > > > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > > > > at > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > > > > > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869) > > > > > at > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815) > > > > > at java.security.AccessController.doPrivileged(Native > Method) > > > > at > javax.security.auth.Subject.doAs(Subject.java:422) > > > > > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962) > > > > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675) > > > > > > > > > > at > org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491) > > > > > at org.apache.hadoop.ipc.Client.call(Client.java:1437) > > > > > at > org.apache.hadoop.ipc.Client.call(Client.java:1347) > > > > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) > > > > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) > > > > > at com.sun.proxy.$Proxy24.recoverLease(Unknown Source) > > > > > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.recoverLease(ClientNamenodeProtocolTranslatorPB.java:683) > > > > at > sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source) > > > > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > > > > at java.lang.reflect.Method.invoke(Method.java:498) > > > > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) > > > > > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) > > > > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) > > > > > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) > > > > > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) > > > > > at com.sun.proxy.$Proxy25.recoverLease(Unknown Source) > > > > > at > sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source) > > > > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > > > > at java.lang.reflect.Method.invoke(Method.java:498) > > > > > at > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372) > > > > > at com.sun.proxy.$Proxy26.recoverLease(Unknown Source) > > > > > at > org.apache.hadoop.hdfs.DFSClient.recoverLease(DFSClient.java:871) > > > > > ... 13 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)