BTW, I am now developing a new end-to-end batch feature, in which the batch
will not be decoded anywhere on the path from producer to consumer, which
should give higher throughput and lower latency.
The storage layer is ready, but the network layer is still to be finished.
The code will be shared on GitHub some day when I get some spare time.



2018-06-15 10:35 GMT+08:00 dongeforever <[email protected]>:

> Hi:
>
> Here is my understanding; I hope it is useful to you:
>
> 1. Batch sending is recommended for data collection, such as website
> activity tracking or log aggregation, which requires high throughput and
> can tolerate some latency. On the contrary, simple sending is good for
> message delivery in latency-sensitive distributed systems, such as trading
> or financial scenarios.
>
> 2. If REJECTREQUEST happens, it usually means the broker is carrying a
> heavy load. It is suggested to expand the cluster; just as you say, add a
> new Master to the cluster.
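When using batch sending for such high-throughput collection, one producer-side detail worth noting is that a single batch request also has a size cap on the broker (commonly 4 MB by default), so large batches are usually split first. A stdlib-only sketch of such a splitter; the class name and the limit are illustrative, and in real code the elements would be `org.apache.rocketmq.common.message.Message` instances rather than raw bodies:

```java
import java.util.ArrayList;
import java.util.List;

// Splits message bodies into sub-batches whose total size stays under a
// limit, so one oversized batch is neither rejected by the broker nor
// holds its put lock for too long. Pure-JDK sketch, not RocketMQ code.
public class BatchSplitter {
    public static List<List<byte[]>> split(List<byte[]> bodies, int maxBytes) {
        List<List<byte[]>> batches = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        int currentSize = 0;
        for (byte[] body : bodies) {
            if (!current.isEmpty() && currentSize + body.length > maxBytes) {
                batches.add(current);       // flush the full sub-batch
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(body);
            currentSize += body.length;
        }
        if (!current.isEmpty()) {
            batches.add(current);           // flush the remainder
        }
        return batches;
    }
}
```

Each resulting sub-batch can then be sent as one batch request, keeping every request under the broker's limit.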
>
>
> dongeforever
> Best Regards
>
> 2018-06-15 10:17 GMT+08:00 Von Gosling <[email protected]>:
>
>> Hi,
>>
>> Batch Message is designed for the streaming case in which you could get a
>> high throughput.
>>
>> Best Regards,
>>
>>
>> On 15 Jun 2018, at 09:39, Jason Joo <[email protected]> wrote:
>>
>> Hello,
>>
>> We have a cluster with a 2-master/2-slave (2m-2s) architecture. TPS is
>> between 1,000 and 6,000, with about 200 million messages per 24 hours
>> typically.
>> The commit log size is about 200 GB and can be fully cached in memory.
>>
>> Our servers (one RocketMQ instance each, 4 servers in this cluster):
>>
>> 6-core Xeon E5-2620 x 2 (24 threads total)
>> 256 GB DDR3 1600 MHz
>> 240 GB Intel SSD 530 (SSDSC2BW240H6) x 6 (RAID 1)
>>
>>
>> To improve producing performance (mainly to reduce network packets) we
>> started to use the batch sending feature, and after that we began to
>> receive a lot of "slow log" entries (batch size is 50~70 messages,
>> 100~1000 B per message body).
>>
>> After some tuning (osPageCacheBusyTimeOutMills, waitTimeMillsInSendQueue
>> to 500 ms, etc.) we only managed to increase the "time cost" printed in
>> the slow log.
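For reference, the tuning mentioned above goes into the broker's configuration file; an illustrative broker.conf fragment, where only waitTimeMillsInSendQueue=500 comes from this thread and the other value is an assumed example:

```properties
# broker.conf fragment (values illustrative)
# how long a put may busy the page cache before the broker reports "busy"
osPageCacheBusyTimeOutMills=2000
# how long a send request may wait in the send queue before being rejected
waitTimeMillsInSendQueue=500
```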
>>
>> The pagecache rt in log before using batch sending:
>>
>> [PAGECACHERT] TotalPut 50650, PutMessageDistributeTime [<=0ms]:44250
>> [0~10ms]:5865 [10~50ms]:514 [50~100ms]:10 [100~200ms]:11 [200~500ms]:0
>> [500ms~1s]:0 [1~2s]:0 [2~3s]:0 [3~4s]:0 [4~5s]:0 [5~10s]:0 [10s~]:0
>>
>>
>> after:
>>
>> [PAGECACHERT] TotalPut 1970, PutMessageDistributeTime [<=0ms]:1428
>> [0~10ms]:459 [10~50ms]:4 [50~100ms]:2 [100~200ms]:5 [200~500ms]:10
>> [500ms~1s]:10 [1~2s]:25 [2~3s]:24 [3~4s]:3 [4~5s]:0 [5~10s]:0 [10s~]:0
>>
>> After reading CommitLog, I found that the whole batch is appended inside
>> a single lock. So it makes sense that the average execution time of a PUT
>> would increase, but we also receive some "in lock" slow-log entries, which
>> trigger flow control for a while:
>>
>> 2018-06-14 21:02:42 WARN SendMessageThread_102 - [NOTIFYME]putMessages
>> in lock cost time(ms)=1373, bodyLength=49380
>> AppendMessageResult=AppendMessageResult{status=PUT_OK,
>> wroteOffset=68238809866237, wroteBytes=54420,
>> msgId='0A0A63CF00002A9F00003E1016059FFD,0A0A63CF00002A9F00003E101605A37D,0A0A63CF00002A9F00003E101605A6DE,0A0A63CF00002A9F00003E101605A8C3,0A0A63CF00002A9F00003E101605ABA3,0A0A63CF00002A9F00003E101605AE81,0A0A63CF00002A9F00003E101605B066,0A0A63CF00002A9F00003E101605B24B,0A0A63CF00002A9F00003E101605B4CC,0A0A63CF00002A9F00003E101605B7F5,0A0A63CF00002A9F00003E101605BB78,0A0A63CF00002A9F00003E101605BE70,0A0A63CF00002A9F00003E101605C19F,0A0A63CF00002A9F00003E101605C4BF,0A0A63CF00002A9F00003E101605C78B,0A0A63CF00002A9F00003E101605CAE4,0A0A63CF00002A9F00003E101605CDC4,0A0A63CF00002A9F00003E101605D206,0A0A63CF00002A9F00003E101605D581,0A0A63CF00002A9F00003E101605D83B,0A0A63CF00002A9F00003E101605DB5B,0A0A63CF00002A9F00003E101605DED0,0A0A63CF00002A9F00003E101605E1AA,0A0A63CF00002A9F00003E101605E54D,0A0A63CF00002A9F00003E101605E8C7,0A0A63CF00002A9F00003E101605EB92,0A0A63CF00002A9F00003E101605EF0C,0A0A63CF00002A9F00003E101605F0D8,0A0A63CF00002A9F00003E101605F346,0A0A63CF00002A9F00003E101605F5FB,0A0A63CF00002A9F00003E101605F8D6,0A0A63CF00002A9F00003E101605FC51,0A0A63CF00002A9F00003E101605FFD1,0A0A63CF00002A9F00003E101606019D,0A0A63CF00002A9F00003E1016060509,0A0A63CF00002A9F00003E1016060816,0A0A63CF00002A9F00003E1016060B7A,0A0A63CF00002A9F00003E1016060E61,0A0A63CF00002A9F00003E10160611DF,0A0A63CF00002A9F00003E1016061558,0A0A63CF00002A9F00003E10160618A6,0A0A63CF00002A9F00003E1016061B68,0A0A63CF00002A9F00003E1016061E6B,0A0A63CF00002A9F00003E10160621E5,0A0A63CF00002A9F00003E10160624A2,0A0A63CF00002A9F00003E101606281D,0A0A63CF00002A9F00003E1016062AE3,0A0A63CF00002A9F00003E1016062E60,0A0A63CF00002A9F00003E101606313E,0A0A63CF00002A9F00003E101606343A,0A0A63CF00002A9F00003E10160637B5,0A0A63CF00002A9F00003E1016063AE3,0A0A63CF00002A9F00003E1016063E67,0A0A63CF00002A9F00003E1016064218,0A0A63CF00002A9F00003E10160644E0,0A0A63CF00002A9F00003E101606487A,0A0A63CF00002A9F00003E1016064BF4,0A0A63CF00002A9F00003E1016064E61,0A0A63CF00002A9F00003E101606512E,0A0A63CF00002A9F00003E1016065489,0A0A63CF00
002A9F00003E101606574F,0A0A63CF00002A9F00003E1016065AE3,0A0A63CF00002A9F00003E1016065CC8,0A0A63CF00002A9F00003E1016065FED,0A0A63CF00002A9F00003E10160662E2,0A0A63CF00002A9F00003E10160664C7,0A0A63CF00002A9F00003E10160667F6,0A0A63CF00002A9F00003E1016066B81,0A0A63CF00002A9F00003E1016066E54,0A0A63CF00002A9F00003E1016067132',
>> storeTimestamp=1528981361218, logicsOffset=5784993480, pagecacheRT=1373,
>> msgNum=70}
>> 2018-06-14 21:02:42 WARN SendMessageThread_102 - not in lock eclipse time(ms)=1374, bodyLength=49380
>> 2018-06-14 21:02:42 WARN SendMessageThread_79 - not in lock eclipse time(ms)=688, bodyLength=274
>> 2018-06-14 21:02:42 WARN SendMessageThread_84 - not in lock eclipse time(ms)=562, bodyLength=48242
>> 2018-06-14 21:02:42 WARN SendMessageThread_107 - not in lock eclipse time(ms)=839, bodyLength=274
>> 2018-06-14 21:02:42 WARN SendMessageThread_97 - not in lock eclipse time(ms)=1199, bodyLength=49469
>> 2018-06-14 21:02:42 WARN SendMessageThread_111 - not in lock eclipse time(ms)=1307, bodyLength=47329
>> 2018-06-14 21:02:42 WARN SendMessageThread_91 - not in lock eclipse time(ms)=786, bodyLength=275
>> 2018-06-14 21:02:42 WARN SendMessageThread_67 - not in lock eclipse time(ms)=807, bodyLength=575
>> 2018-06-14 21:02:42 WARN SendMessageThread_65 - not in lock eclipse time(ms)=1326, bodyLength=47294
>> 2018-06-14 21:02:43 INFO FlushRealTimeService - Flush data to disk costs 2271 ms
>> 2018-06-14 21:02:43 WARN SendMessageThread_116 - not in lock eclipse time(ms)=1542, bodyLength=50039
>> 2018-06-14 21:02:43 WARN SendMessageThread_119 - not in lock eclipse time(ms)=1202, bodyLength=50024
>> 2018-06-14 21:02:43 WARN SendMessageThread_55 - not in lock eclipse time(ms)=1436, bodyLength=907
>>
>>
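The in-lock behavior shown in the log above, where the whole batch is appended under a single lock acquisition, can be sketched in plain JDK code. Class and method names here are illustrative, not RocketMQ's actual internals:

```java
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of a CommitLog-style batch put: every message of
// the batch is appended while one lock is held, so the in-lock time
// grows with the batch size and can exceed the broker's busy timeout.
public class BatchAppendSketch {
    private final ReentrantLock putMessageLock = new ReentrantLock();
    private final StringBuilder commitLog = new StringBuilder();

    // Appends the whole batch under one lock acquisition and returns
    // how long the lock was held, in nanoseconds.
    public long putMessages(String[] batch) {
        putMessageLock.lock();
        long begin = System.nanoTime();
        try {
            for (String msg : batch) {
                commitLog.append(msg); // all N appends happen in-lock
            }
            return System.nanoTime() - begin;
        } finally {
            putMessageLock.unlock();
        }
    }
}
```

With 50~70 messages per batch, each acquisition does 50~70 appends' worth of work, which is consistent with the larger per-PUT times in the distribution above.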
>> So the questions are:
>>
>>    - Is batch sending recommended?
>>    - Since it may be "slower" compared to simple sending, should I
>>    increase the in-lock timeout?
>>    - How can we solve the performance problem (many REJECTREQUEST
>>    responses are generated towards producers)? Should we add a new
>>    Master to the cluster?
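On the producer side, one common way to cope with REJECTREQUEST responses while the cluster is busy or being expanded is retry with exponential backoff. A stdlib-only sketch, where SendFailure and the Sender interface are hypothetical stand-ins for the real client's exception and send call:

```java
// Retry-with-backoff around a send that may be rejected while the
// broker is busy. SendFailure and Sender are illustrative stand-ins,
// not part of the RocketMQ client API.
public class RetrySender {
    static class SendFailure extends RuntimeException {}

    interface Sender { void doSend() throws SendFailure; }

    // Retries up to maxAttempts with doubling backoff; returns the
    // number of attempts actually made, or rethrows after the last one.
    public static int sendWithBackoff(Sender sender, int maxAttempts,
                                      long initialBackoffMs) throws InterruptedException {
        long backoff = initialBackoffMs;
        for (int attempt = 1; ; attempt++) {
            try {
                sender.doSend();
                return attempt;
            } catch (SendFailure e) {
                if (attempt >= maxAttempts) throw e;
                Thread.sleep(backoff); // give the busy broker time to drain
                backoff *= 2;
            }
        }
    }
}
```

Backoff only smooths over transient rejections; sustained REJECTREQUEST still points to expanding the cluster, as suggested earlier in the thread.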
>>
>> Best regards,
>>
>> Jason
>>
>>
>>
>
