Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
Right — we are on HBase 2.2.6. The trigger of our hang looks very much like what the first issue describes, and I reproduced it with its test case. The point of applying the second issue is that when a WAL sync gets stuck, we want the RegionServer to be interrupted; otherwise the RS process hangs for several hours before it finally exits with an OOM, and during that whole window writes on the node stop completely.

On Thu, Jun 1, 2023 at 10:39, Duo Zhang (张铎) wrote:
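The behavior leojie asks for above — abort the RegionServer promptly when a WAL sync is stuck, instead of letting it hang until OOM — is the essence of HBASE-25905. As a minimal sketch only (the class and method names below are invented for illustration and are not the actual HBase code), the pattern is a bounded wait on the sync future followed by an abort hook:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch of the HBASE-25905 idea: never wait on a WAL sync
// forever; after a deadline, treat the WAL as stuck and abort the server
// so its regions can be reassigned. All names here are hypothetical.
public class SyncWatchdogSketch {
    static void blockOnSync(CompletableFuture<Long> syncFuture, long timeoutMs, Runnable abort) {
        try {
            syncFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // "WAL system stuck?" -- abort instead of hanging for hours.
            abort.run();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Long> stuck = new CompletableFuture<>(); // sync that never completes
        boolean[] aborted = {false};
        blockOnSync(stuck, 50, () -> aborted[0] = true);
        System.out.println(aborted[0] ? "aborted" : "still waiting");
    }
}
```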
Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
For the first issue, the situation is that one DataNode responds faster than all the others and then dies; that can lead to the hang. The second issue is that the WAL may get stuck during shutdown, which mainly keeps the RegionServer from exiting.

Both should be fixed in 2.4.10 and later versions; you can give one of those a try.

On Thu, Jun 1, 2023 at 09:34, leojie wrote:
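To make Duo's description of the first issue concrete: a fan-out WAL output waits for an ack from every replica, so a packet whose fastest DataNode acks and then dies can otherwise leave its waiters stuck. The toy model below is not the real FanOutOneBlockAsyncDFSOutput — it is a hypothetical sketch of the invariant the fix enforces: when any replica connection breaks, every pending sync future must be failed, including packets some replicas already acked:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

// Toy model of a fan-out output stream writing packets to N replicas.
// (Hypothetical; the real class is FanOutOneBlockAsyncDFSOutput.)
class MiniFanOutOutput {
    private final int replicas;
    private final TreeMap<Long, CompletableFuture<Void>> pending = new TreeMap<>();
    private final Map<Long, Integer> acks = new HashMap<>();
    private boolean broken = false;

    MiniFanOutOutput(int replicas) { this.replicas = replicas; }

    synchronized CompletableFuture<Void> flush(long txid) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        if (broken) {
            f.completeExceptionally(new IOException("stream already broken"));
            return f;
        }
        pending.put(txid, f);
        acks.put(txid, 0);
        return f;
    }

    // One replica acknowledged packet txid; complete the sync on the last ack.
    synchronized void ack(long txid) {
        Integer n = acks.get(txid);
        if (n == null) return;
        if (n + 1 == replicas) {
            acks.remove(txid);
            pending.remove(txid).complete(null);
        } else {
            acks.put(txid, n + 1);
        }
    }

    // A replica connection died. Fail *every* pending sync, even packets
    // the dead (fast) replica had already acked -- otherwise their
    // waiters hang forever, which is the stuck-WAL symptom.
    synchronized void channelFailed(Throwable cause) {
        broken = true;
        pending.values().forEach(f -> f.completeExceptionally(cause));
        pending.clear();
        acks.clear();
    }
}

public class FanOutSketch {
    public static void main(String[] args) {
        MiniFanOutOutput out = new MiniFanOutOutput(3);
        CompletableFuture<Void> sync = out.flush(1L);
        out.ack(1L); // the "fast" DataNode acks first...
        out.channelFailed(new IOException("Connection reset by peer")); // ...then dies
        System.out.println(sync.isCompletedExceptionally() ? "sync failed fast" : "sync still pending");
    }
}
```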
Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
Many thanks for the earlier answers. In the issue list I found the following patches:
https://issues.apache.org/jira/browse/HBASE-26679 — its test case stably reproduces the blocking exception on the version we run.
https://issues.apache.org/jira/browse/HBASE-25905 — this is the fix you committed: when a WAL sync blocks, the RegionServer process is interrupted immediately. Please correct me if I have misread it. I will apply these two fixes first and then keep watching the cluster.

On Mon, May 15, 2023 at 15:51, leojie wrote:
Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
Thanks for the reply. While the problem was occurring, a rather large table was being scanned via snapshot with a high concurrency setting, which drove read-block ops very high and put pressure on the DataNodes (compared with history, the XceiverCount metric on a few DNs stayed abnormally high). After we stopped that job, the cluster metrics have been stable for the past two days and the "WAL system stuck?" exception has not reappeared.

DN pressure feels like the trigger for this phenomenon, but probably not the direct reason the RS only dies after the WAL sync has been hanging for an hour or two. I will try upgrading the JDK and read through this part of the code again.

On Sat, May 13, 2023 at 18:58, Duo Zhang (张铎) wrote:
Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
This still looks like the connection was dropped due to network jitter? I would suggest checking the various cluster metrics around the time this log was emitted.

You could also try upgrading the JDK:

https://bugs.openjdk.org/browse/JDK-8215355

That one was only fixed after 8u251. The report says it only shows up with jemalloc, but I suspect I ran into it before as well, and after upgrading it never happened again.

On Thu, May 11, 2023 at 12:03, leojie wrote:
Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
The Java version is 1.8.0_131.

Going back through the logs before the time of the exception, I found the following related errors:

2023-05-11 10:59:32,711 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2023-05-11 10:59:32,728 INFO [ganglia] impl.MetricsSinkAdapter: ganglia thread interrupted.
2023-05-11 10:59:32,728 INFO [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2023-05-11 10:59:33,038 WARN [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
java.io.IOException: stream already broken
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:423)
        at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:512)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:148)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:379)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:566)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
2023-05-11 10:59:33,040 WARN [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
java.io.IOException: stream already broken
        (identical stack trace as above)
2023-05-11 10:59:33,040 WARN [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
        at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
(the NativeIoException WARN entry above then repeats many more times with the same 10:59:33,040 timestamp)
Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
Can you scroll further back in the logs for other exceptions? From this it does look like an AsyncFSWAL bug caused the hang, but from a quick look there do not seem to be any related fixes after 2.2.6; there were some earlier ones.

Also, which JDK version are you on? I recall early JDK 8 builds had a synchronized bug that could reorder execution.

On Tue, May 9, 2023 at 14:45, leojie wrote:
TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
hi all,

Asking the community for help with a problem. For the past two days we keep hitting an exception around 12:50. Details:

HBase version: 2.2.6
Hadoop version: 3.3.1

Symptom: on one node of an isolation group (which hosts only one table), the write call queue blocks at some point. From that moment, both read and write QPS on the table drop to 0 and clients can no longer read or write it. From the time the RS call queue starts blocking, the log keeps printing errors like:

2023-05-08 12:42:27,310 ERROR [MemStoreFlusher.2] regionserver.MemStoreFlusher: Cache flush failed for region user_feature_v2,eacf_1658057555,1660314723816.2376cc2326b5372131cc530b115d959a.
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
        at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:155)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:743)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:625)
        at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:602)
        at org.apache.hadoop.hbase.regionserver.HRegion.doSyncOfUnflushedWALChanges(HRegion.java:2754)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2691)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2549)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2523)
        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2409)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:611)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:580)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:68)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:360)
        at java.lang.Thread.run(Thread.java:748)

The node's memstore cannot flush data to the WAL; all other metrics on the node look normal and HDFS is under no pressure. After restarting the blocked node, the table recovers. I captured a jstack during the incident.

Could someone from the community help locate the cause when you have time?
The jstack file is attached to the issue: https://issues.apache.org/jira/browse/HBASE-27850