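This still looks like network jitter caused the connection to drop? I'd suggest checking the cluster's metrics around the time this log was emitted.

You could also try upgrading the JDK.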

https://bugs.openjdk.org/browse/JDK-8215355

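This was only fixed after 8u251. Although the issue says it only occurs when using jemalloc, I suspect I ran into it before as well, and it never appeared again after upgrading.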
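
To check which nodes are still on a JDK 8 build older than that fix, a small helper like the one below may be handy. This is only a sketch: the function names are made up for illustration, it assumes version strings in the `1.8.0_NNN` form reported further down in this thread, and treating update 251 as the cutoff is an assumption taken from the note above rather than from the bug report itself.

```python
import re
from typing import Optional

# Assumed cutoff, taken from the note above ("only fixed after 8u251").
FIXED_UPDATE = 251


def jdk8_update(version: str) -> Optional[int]:
    """Return the update number of a JDK 8 version string such as
    '1.8.0_131', or None if the string is not in that form."""
    m = re.fullmatch(r"1\.8\.0_(\d+)(?:-.*)?", version.strip())
    return int(m.group(1)) if m else None


def predates_fix(version: str) -> bool:
    """True if this is a JDK 8 build with an update number below the cutoff."""
    update = jdk8_update(version)
    return update is not None and update < FIXED_UPDATE


if __name__ == "__main__":
    # 1.8.0_131 is the version reported in this thread.
    print(predates_fix("1.8.0_131"))
```

Non-JDK-8 version strings return `None` from `jdk8_update` and are therefore never flagged, so they would need to be checked separately.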

leojie <leo...@apache.org> wrote on Thu, May 11, 2023 at 12:03:

> The Java version is 1.8.0_131
>
> I looked through the logs from before the time of the exception and found the following related errors:
> 2023-05-11 10:59:32,711 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
> 2023-05-11 10:59:32,728 INFO  [ganglia] impl.MetricsSinkAdapter: ganglia thread interrupted.
> 2023-05-11 10:59:32,728 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
> 2023-05-11 10:59:33,038 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:423)
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:512)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:148)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:379)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:566)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> java.io.IOException: stream already broken
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush0(FanOutOneBlockAsyncDFSOutput.java:423)
>         at org.apache.hadoop.hbase.io.asyncfs.FanOutOneBlockAsyncDFSOutput.flush(FanOutOneBlockAsyncDFSOutput.java:512)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncProtobufLogWriter.sync(AsyncProtobufLogWriter.java:148)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:379)
>         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.consume(AsyncFSWAL.java:566)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:748)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,040 WARN  [AsyncFSWAL-0-hdfs://hadoop-bdxs-hb1-namenode/hbase] wal.AsyncFSWAL: sync failed
> org.apache.hbase.thirdparty.io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer
>         at org.apache.hbase.thirdparty.io.netty.channel.unix.FileDescriptor.readAddress(..)(Unknown Source)
> 2023-05-11 10:59:33,229 INFO  [HBase-Metrics2-1] impl.MetricsConfig: Loaded properties from hadoop-metrics2-hbase.properties
> 2023-05-11 10:59:33,231 INFO  [HBase-Metrics2-1] impl.MetricsSinkAdapter: Sink ganglia started
> 2023-05-11 10:59:33,257 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled Metric snapshot period at 20 second(s).
> 2023-05-11 10:59:33,257 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
>
>
>
> 张铎(Duo Zhang) <palomino...@gmail.com> wrote on Thu, May 11, 2023 at 10:25:
>
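> > Could you scroll further up and check whether there are any other exceptions? This looks like a bug in AsyncFSWAL that caused it to hang. That said, I looked through the history and
> > there do not seem to be any related fixes after 2.2.6, although there were some before it.
> >
> > Also, which JDK version are you running? As I recall, early JDK 8 releases had a synchronized bug that could cause out-of-order execution.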
> >
> > leojie <leo...@apache.org> wrote on Tue, May 9, 2023 at 14:45:
> >
> > > hi all
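> > >     I'd like to ask the community for help with a problem. For the past two days we have kept hitting an exception at around 12:50, described below:
> > >     HBase version: 2.2.6
> > >     Hadoop version: 3.3.1
> > >     Symptom: on one node in an isolation group (which holds only one table), the write call
> > > queue became blocked at some point. From that moment the table's read and write QPS both dropped to 0, and clients could not read or write the table. Starting from the time the RS call
> > > queue became blocked, the following error was logged repeatedly: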
> > > 2023-05-08 12:42:27,310 ERROR [MemStoreFlusher.2] regionserver.MemStoreFlusher: Cache flush failed for region user_feature_v2,eacf_1658057555,1660314723816.2376cc2326b5372131cc530b115d959a.
> > > org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=16920651960, WAL system stuck?
> > >         at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:155)
> > >         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:743)
> > >         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:625)
> > >         at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.sync(AsyncFSWAL.java:602)
> > >         at org.apache.hadoop.hbase.regionserver.HRegion.doSyncOfUnflushedWALChanges(HRegion.java:2754)
> > >         at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2691)
> > >         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2549)
> > >         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2523)
> > >         at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2409)
> > >         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:611)
> > >         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:580)
> > >         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:68)
> > >         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:360)
> > >         at java.lang.Thread.run(Thread.java:748)
> > >
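> > > The node could not flush data from its memstore to the WAL file. The node's other metrics were all normal, and HDFS was not under pressure. After the blocked node was restarted, the table returned to normal. I have attached the jstack file captured during the incident.
> > > I'd appreciate it if someone in the community could help pinpoint the cause when they have time.
> > > For the jstack file, see the attachments of the issue: https://issues.apache.org/jira/browse/HBASE-27850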
> > >
> >
>
