Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?

2023-05-31 Thread leojie
Right, we are on HBase 2.2.6. The cause of our hang looks very much like what the first ISSUE describes, and I reproduced it with its test case as well. The goal of applying the second ISSUE is that when a WAL
sync hangs, the RegionServer can be interrupted; otherwise the RS process hangs for several hours before it finally exits with an OOM, and during that whole time writes on that node stop completely.


Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?

2023-05-31 Thread Duo Zhang
The first issue covers the case where one DN responds faster than the others and then dies; that can cause a hang.
The second issue is that shutting down the WAL can itself hang, which mainly prevents the RegionServer from exiting.

These should all be fixed in versions after 2.4.10; you can give one of those a try.
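One plausible way the first kind of hang arises is an ack-counting hazard in a fan-out writer. The Java below is a hypothetical illustration, not the actual FanOutOneBlockAsyncDFSOutput code: the sync future completes only once every replica has acked, so a replica that acks early and then dies must still fail the outstanding future, or callers wait forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (not HBase code) of the ack-counting hazard: the
// fan-out writer completes the sync future only after every replica acks.
// If the failure handler skipped replicas that had "already acked", a DN
// that acks fast and then dies would leave the future pending forever.
public class FanOutAckSketch {
    final AtomicInteger unacked;
    final CompletableFuture<Void> syncFuture = new CompletableFuture<>();

    FanOutAckSketch(int replicas) {
        this.unacked = new AtomicInteger(replicas);
    }

    // Called when one replica acknowledges the flushed data.
    void onAck() {
        if (unacked.decrementAndGet() == 0) {
            syncFuture.complete(null);
        }
    }

    // Safe handling: any replica failure fails the whole sync, even if that
    // replica had already acked, so waiters are always released.
    void onReplicaFailure(Throwable cause) {
        syncFuture.completeExceptionally(cause);
    }
}
```

With this handling, a pipeline of three replicas where the fastest DN acks and then drops its connection still fails the sync promptly instead of hanging the WAL consumer.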


Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?

2023-05-31 Thread leojie
Many thanks for your earlier answers. In the issue list I found the following patches:
https://issues.apache.org/jira/browse/HBASE-26679 — its test case stably reproduces the blocking exception on the version we run
https://issues.apache.org/jira/browse/HBASE-25905 — this is the fix you committed: when a WAL
sync blocks, it immediately aborts the RegionServer process. Please correct me if I have misunderstood. I will apply these two fixes first and then watch how the cluster behaves.


Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?

2023-05-15 Thread leojie
Thanks for the reply. While this problem was occurring, a fairly large table was being scanned via snapshot with a high concurrency setting. That drove read block
ops very high and put pressure on the DataNodes (compared with history, the XceiverCount metric on a few DNs stayed abnormally high).
After we stopped that job, cluster metrics have been stable for the past two days and the "WAL system stuck?" exception has not reappeared.

My feeling is that the DN pressure was a trigger for this phenomenon, but probably not the direct reason the RS only died after the WAL
sync had hung for an hour or two. I will try upgrading the JDK and read through this part of the code again.


Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?

2023-05-13 Thread Duo Zhang
This still looks like the connection dropped because of network jitter? I would suggest checking the cluster's various metrics around the time of this log.

You could also try upgrading the JDK:

https://bugs.openjdk.org/browse/JDK-8215355

This was only fixed in 8u251 and later. Although the report says it only occurs with jemalloc, I suspect I ran into it before too, and after upgrading it never happened again.


Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?

2023-05-10 Thread leojie
The Java version is 1.8.0_131.

Looking back through the logs before the exception occurred, I found the following related errors:
2023-05-11 10:59:32,711 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2023-05-11 10:59:32,728 INFO  [ganglia] impl.MetricsSinkAdapter: ganglia thread interrupted.
2023-05-11 10:59:32,728 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2023-05-11 10:59:33,038 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
java.io.IOException: stream already broken
at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:423)
at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:512)
at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:148)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:379)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:566)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
[the same two "sync failed" WARN entries, "stream already broken" and "Connection reset by peer", repeat many more times at 10:59:33,040; the rest of the log is truncated in the archive]

Re: TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?

2023-05-10 Thread Duo Zhang
Can you scroll back further in the log for other exceptions? This looks like an AsyncFSWAL bug causing it to hang. I had a look, though, and after 2.2.6
there do not seem to be any fixes related to this, while there were some before it.

Also, what is your JDK version? I recall that early JDK 8 releases had a synchronized bug that could scramble execution order.



TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?

2023-05-09 Thread leojie
hi all
Asking the community for help with a problem. For the past two days we keep hitting an exception around 12:50. Details:
HBase version: 2.2.6
Hadoop version: 3.3.1
Symptom: on one node in an isolation group (the group holds only one table), the write call
queue blocked at some point. From that moment the table's read and write QPS both dropped to 0 and clients could neither read nor write it. Starting from the time the RS call queue blocked, the log kept printing the following error:
2023-05-08 12:42:27,310 ERROR [MemStoreFlusher.2] regionserver.MemStoreFlusher: Cache flush failed for region
user_feature_v2,eacf_1658057555,1660314723816.2376cc2326b5372131cc530b115d959a.
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync
result after 300000 ms for txid=16920651960, WAL system stuck?
at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:155)
at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:743)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:625)
at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:602)
at org.apache.hadoop.hbase.regionserver.HRegion.doSyncOfUnflushedWALChanges(HRegion.java:2754)
at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2691)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2549)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2523)
at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2409)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:611)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:580)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:68)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:360)
at java.lang.Thread.run(Thread.java:748)
The node could not flush memstore data into the WAL, while all its other metrics were normal and HDFS was under no pressure. After restarting the blocked node, the table recovered. I captured a jstack during the incident.
Could someone in the community help track down the cause when you have a moment?
jstack file: see the attachment of https://issues.apache.org/jira/browse/HBASE-27850
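For readers unfamiliar with the failure mode: the TimeoutIOException above comes from a bounded wait on a WAL sync future (SyncFuture.get / blockOnSync in the trace). Below is a minimal sketch of that wait pattern with illustrative names and a shortened timeout; it is not HBase's actual implementation, but shows why a stuck WAL consumer surfaces as this timeout rather than blocking forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the bounded-wait pattern behind "Failed to get sync result
// after 300000 ms": the flusher waits on a sync future with a fixed
// timeout. If the WAL consumer thread never completes the future, the
// wait times out and the caller reports it instead of hanging.
public class SyncWaitSketch {

    // Waits for the sync future; returns true if it completed in time.
    // In HBase a timeout here becomes a TimeoutIOException with the
    // "WAL system stuck?" message seen in the log.
    static boolean blockOnSync(CompletableFuture<Long> syncFuture, long timeoutMs) {
        try {
            syncFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // A future that is never completed models a stuck WAL consumer.
        CompletableFuture<Long> stuck = new CompletableFuture<>();
        System.out.println(blockOnSync(stuck, 100) ? "synced" : "timed out");

        CompletableFuture<Long> done = CompletableFuture.completedFuture(16920651960L);
        System.out.println(blockOnSync(done, 100) ? "synced" : "timed out");
    }
}
```

Note that the timeout only reports the problem; it does not unstick the WAL, which is why the write call queue stayed blocked until the node was restarted.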