RongtongJin commented on issue #2427: URL: https://github.com/apache/rocketmq/issues/2427#issuecomment-731016501
> 这几天在写CURD,这个问题还是存在, > > #### store.log中[PAGECACHERT]的统计存在明显的刷盘超时: > `2020-11-18 14:22:05 INFO StoreStatsService - [PAGECACHERT] TotalPut 50, PutMessageDistributeTime [<=0ms]:10 [0~10ms]:35 [10~50ms]:0 [50~100ms]:0 [100~200ms]:0 [200~500ms]:0 [500ms~1s]:1 [1~2s]:1 [2~3s]:1 [3~4s]:0 [4~5s]:2 [5~10s]:0 [10s~]:0 ` > > #### 在每次发生这个现象的时候都会出现这些异常: > 我们的broker配置如下: > > ``` > brokerClusterName = DefaultCluster > brokerName = broker-a > brokerId = 0 > deleteWhen = 04 > fileReservedTime = 48 > brokerRole = ASYNC_MASTER > flushDiskType = ASYNC_FLUSH > ``` > > 看PAGECACHERT中有 2--3s:1、4--5s:2 这种明显的超时 > > ``` > 2020-11-18 14:21:05 INFO StoreStatsService - [PAGECACHERT] TotalPut 52, PutMessageDistributeTime [<=0ms]:9 [0~10ms]:43 [10~50ms]:0 [50~100ms]:0 [100~200ms]:0 [200~500ms]:0 [500ms~1s]:0 [1~2s]:0 [2~3s]:0 [3~4s]:0 [4~5s]:0 [5~10s]:0 [10s~]:0 > 2020-11-18 14:21:14 WARN ReadSocketService - ha housekeeping, found this connection[/slaveIP:47894] expired, 20019 > 2020-11-18 14:21:14 INFO ReadSocketService - ReadSocketService service end > 2020-11-18 14:21:14 INFO AcceptSocketService - HAService receive new connection, /slaveIP:33400 > 2020-11-18 14:21:14 INFO ReadSocketService - ReadSocketService service started > 2020-11-18 14:21:14 INFO WriteSocketService - WriteSocketService service started > 2020-11-18 14:21:14 INFO WriteSocketService - WriteSocketService service end > 2020-11-18 14:21:19 INFO ReadSocketService - slave[/slaveIP:33400] request offset 2235247317 > 2020-11-18 14:21:19 INFO WriteSocketService - master transfer data from 2235247317 to slave[/slaveIP:33400], and slave request 2235247317 > 2020-11-18 14:21:19 WARN SendMessageThread_40 - putMessage not in lock eclipse time(ms)=4370, bodyLength=492 > 2020-11-18 14:21:19 WARN SendMessageThread_37 - putMessage not in lock eclipse time(ms)=4056, bodyLength=466 > 2020-11-18 14:21:19 WARN SendMessageThread_34 - putMessage not in lock eclipse time(ms)=2280, bodyLength=459 > 2020-11-18 14:21:19 WARN SendMessageThread_10 - putMessage not in lock eclipse time(ms)=1081, bodyLength=523 > 2020-11-18 14:21:19 WARN SendMessageThread_28 - putMessage not in lock eclipse time(ms)=719, bodyLength=444 > 2020-11-18 14:22:05 INFO StoreStatsService - [STORETPS] put_tps 0.8332777814812345 get_found_tps 0.8332777814812345 get_miss_tps 4.466368908739417 get_transfered_tps 0.8332777814812345 > 2020-11-18 14:22:05 INFO StoreStatsService - [PAGECACHERT] TotalPut 50, PutMessageDistributeTime [<=0ms]:10 [0~10ms]:35 [10~50ms]:0 [50~100ms]:0 [100~200ms]:0 [200~500ms]:0 [500ms~1s]:1 [1~2s]:1 [2~3s]:1 [3~4s]:0 [4~5s]:2 [5~10s]:0 [10s~]:0 > ``` > > #### GC的情况: > 查看了GC的日志,发现GC是没有什么问题的,GC时间都很短而且在RemotingTimeoutException异常发生时没有出现GC > > 认真研读了Commitlog#putmessage(msg) 的逻辑,同时观察日志没有出现明显的异常,现在似乎是卡在这里了,是不是我们的内存配置,或者是主从交换的配置有什么问题,看到有一个issue:[issues428](https://github.com/apache/rocketmq/issues/428)的情况跟我一样,是有关于心跳的问题吗? > > @RongtongJin 期待您的回答,万分感谢您! 那就看看是不是磁盘的问题,应该刷盘慢的时间段await, io util也很高,但看起来业务量不大。配置看起来设置比较简单,MQ这边可以调优一下线程数sendMessageThreadPoolNums,开启transientStorePoolEnable减少刷盘毛刺。可以执行下distribution/bin/os.sh脚本优化一下操作系统参数,特别是一些脏页writeback参数、IO请求调度算法等,另外尽量不要使用swap分区,需要通过工具来复现找出具体原因。 和issue428应该没有什么关系,刷盘慢是原因,timeout是结果。 ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
