Hi all,

Another update on this topic.
- I managed a much cleaner capture of the TCP traffic between the PUB
and the SUB using the switch's Port Mirror (SPAN) feature. Indeed the
larger-than-MTU packets have now disappeared.
- I measured the message generation rate for my use case more
accurately: it turns out to be one message every 2.5usec (~400k msg/s).
- I measured the average msg size more precisely; in my use case it's 296B.
- Some trivial computations on the PHY link I'm using (raw speed of
20Gbps) show that sending ~300B on a 20Gbps link takes only about 120nsec.
- The same trivial computations give a TCP throughput upper bound, for a
message generation interval of 2.5usec, of roughly 950Mbps (296B x 8 /
2.5usec), which matches the ~900Mbps I'm measuring as outgoing
throughput from the PUB socket.
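The arithmetic in the list above can be double-checked in a few lines (a quick sketch; the 296B size and the 2.5usec interval are the figures measured above):

```python
# Back-of-envelope check of the numbers quoted in the list above.
LINK_BPS = 20e9          # raw PHY speed: 20 Gbps
MSG_BYTES = 296          # measured average message size
MSG_INTERVAL_S = 2.5e-6  # measured generation interval: one msg per 2.5 usec

# Time to serialize one ~300B message onto the 20 Gbps link.
wire_time_ns = MSG_BYTES * 8 / LINK_BPS * 1e9
print(f"wire time per message: {wire_time_ns:.0f} ns")

# Throughput ceiling when producing one message every 2.5 usec.
ceiling_mbps = MSG_BYTES * 8 / MSG_INTERVAL_S / 1e6
print(f"throughput ceiling: {ceiling_mbps:.0f} Mbps")
```

This prints roughly 118 ns and 947 Mbps, consistent with the ~900Mbps measured at the PUB socket.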

Based on the considerations above, I now believe my problem is no longer
the "quality" of the TCP connection produced by my PUB socket: my
software is bound by the message generation rate, not by the speed of
the link.

However I'm still far from solving my original problem. My software is
receiving roughly 900Mbps (from a SUB socket) and generating 900Mbps
(out of a PUB socket), and to do that it is scaled to 16 ZMQ background
threads (!!)... that really sounds like too many.

I'll start a different email thread (just for the mailing-list history)
about another "strange" effect I found that is impacting the CPU usage
of the ZMQ background threads...

Francesco


On Sun, 28 Mar 2021 at 17:43, Francesco <
francesco.monto...@gmail.com> wrote:

> Hi all,
>
> A few more questions after inspecting ZMQ source code:
> - I see that in June 2019 the following PR was merged:
> https://github.com/zeromq/libzmq/pull/3555   It exposes
> ZMQ_OUT_BATCH_SIZE.
> At first glance it seems to be exactly what I was looking for, but the
> default value is already quite high (8192 bytes)... in my use case it
> would probably be enough to coalesce a maximum of 5 or 6 messages to
> reach the MTU size.
> - The thread publishing on my PUB zmq socket probably takes between
> 100 and 500usec to generate a new message. That means generating 5
> messages might take 2.5msec in the worst case. I would be OK paying
> this latency in order to improve throughput... is there any way to
> achieve that? What happens if I disable the code in ZMQ that sets
> TCP_NODELAY and replace it with TCP_CORK? Do you think I could get
> some kind of breakage of my PUB/SUB connections?
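TCP_CORK itself is easy to exercise on a plain Linux socket outside of ZMQ; libzmq does not expose it, so the following is only a sketch of the kernel behavior being asked about (plain sockets, no ZMQ involved, Linux-only):

```python
import socket

# Sketch: cork a TCP socket so the kernel coalesces many small writes
# into MTU-sized segments, then uncork to flush whatever is pending.
def send_corked(sock: socket.socket, messages) -> None:
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)  # hold partial frames
    try:
        for msg in messages:
            sock.sendall(msg)  # small writes accumulate in the kernel
    finally:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)  # flush now

# The option can be toggled on any TCP socket, even before connecting:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CORK))
s.close()
```

Note that per tcp(7) the kernel flushes a corked socket after 200ms anyway, so corking cannot hold data back indefinitely.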
>
> and one consideration:
>  - I discovered why my tcpdump capture contains larger-than-MTU
> packets (even though they are <1%): capturing traffic on the same
> server that sends/receives the traffic is not a good idea:
>
> https://blog.packet-foo.com/2014/05/the-drawbacks-of-local-packet-captures/
> https://packetbomb.com/how-can-the-packet-size-be-greater-than-the-mtu/
> I will try to acquire tcpdumps from the SPAN port of a managed switch,
> though I don't think the results will change much.
>
> Thanks for any hint,
> Francesco
>
>
> On Sat, 27 Mar 2021 at 10:22, Francesco <
> francesco.monto...@gmail.com> wrote:
>
>> Hi Jim,
>> You're right, and I do plan to change the MTU to 9000. However even
>> now, with the MTU at 1500, I see that most packets are very far from
>> that limit.
>> Attached is a screenshot of the capture:
>>
>> [image: tcp_capture.png]
>>
>> Looking at the timestamps, I see that the packets of size 583B and
>> 376B are spaced only about 100us apart, and the 376B and 366B packets
>> are spaced 400us apart.
>> In this case I'd be more than happy to pay some extra latency to
>> merge all three packets together.
>>
>> After some more digging I found this code in ZMQ:
>>
>>     //  Disable Nagle's algorithm. We are doing data batching on 0MQ
>>     //  level, so using Nagle wouldn't improve throughput in anyway,
>>     //  but it would hurt latency.
>>     int nodelay = 1;
>>     const int rc =
>>       setsockopt (s_, IPPROTO_TCP, TCP_NODELAY,
>>                   reinterpret_cast<char *> (&nodelay), sizeof (int));
>>     assert_success_or_recoverable (s_, rc);
>>     if (rc != 0)
>>         return rc;
>>
>> Now my next question is: where is this "data batching on 0MQ level"
>> happening? Can I tune it somehow? Can I restore Nagle's algorithm?
>> I also saw from
>>   https://man7.org/linux/man-pages/man7/tcp.7.html
>> that TCP_CORK can be set as a socket option to try to optimize
>> throughput... is there any way to do that through ZMQ?
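For context: the "data batching on 0MQ level" happens in the I/O engine's encoder, which packs as many queued messages as fit into a single buffer (out_batch_size bytes, 8192 by default) before each send syscall. A simplified, hypothetical illustration of that idea (not the actual libzmq code):

```python
# Hypothetical sketch of encoder-style batching: pack a queue of small
# messages into buffers of at most `batch_size` bytes, so that each
# send() syscall carries many messages instead of one.
OUT_BATCH_SIZE = 8192  # libzmq default

def batch_messages(queue, batch_size=OUT_BATCH_SIZE):
    """Drain `queue` (an iterable of bytes) into batches <= batch_size."""
    batches, current = [], bytearray()
    for msg in queue:
        if current and len(current) + len(msg) > batch_size:
            batches.append(bytes(current))
            current = bytearray()
        current += msg
    if current:
        batches.append(bytes(current))
    return batches

# Thirty 296B messages collapse into two syscall-sized buffers:
print([len(b) for b in batch_messages([b"x" * 296] * 30)])  # [7992, 888]
```

Note this batching only helps when messages are already queued; with a slow producer each buffer holds a single message, which matches the small packets seen in the capture.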
>>
>> Thanks!!
>>
>> Francesco
>>
>>
>>
>>
>> On Sat, 27 Mar 2021 at 05:01, Jim Melton <jim@melton.space>
>> wrote:
>>
>>> Small TCP packets will never achieve maximum throughput. This is
>>> independent of ZMQ: each packet carries fixed per-packet overhead
>>> (headers, interrupts, ACK processing).
>>>
>>> For a 20 Gbps network, you need a larger MTU to achieve close to
>>> theoretical bandwidth, and each packet needs to be close to MTU. Jumbo MTU
>>> is typically 9000 bytes. The TCP ACK packets will kill your throughput,
>>> though.
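The standard-vs-jumbo efficiency gap can be quantified (a sketch, assuming 40B of IPv4+TCP headers and 38B of Ethernet framing overhead per packet):

```python
# Rough TCP goodput efficiency for full-sized frames at a given MTU,
# assuming IPv4 + TCP headers (40B) and Ethernet framing overhead (38B).
def efficiency(mtu, l3l4_overhead=40, eth_overhead=38):
    payload = mtu - l3l4_overhead
    return payload / (mtu + eth_overhead)

print(f"MTU 1500: {efficiency(1500):.1%}")  # ~94.9%
print(f"MTU 9000: {efficiency(9000):.1%}")  # ~99.1%
```

The bigger win from jumbo frames, though, is fewer packets per byte moved, hence fewer interrupts and syscalls, rather than the raw header savings.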
>>> --
>>> Jim Melton
>>> (303) 829-0447
>>> http://blogs.melton.space/pharisee/
>>> jim@melton.space
>>>
>>>
>>>
>>>
>>> On Mar 26, 2021, at 4:17 PM, Francesco <francesco.monto...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> I'm using ZMQ in a product that moves a lot of data using TCP as
>>> transport and PUB-SUB as communication pattern. "A lot" here means
>>> around 1Gbps. The software is actually a mono-directional chain of
>>> small components, each linked to the previous one with a SUB socket
>>> (to receive data) and to the next with a PUB socket (to send data).
>>> I'm debugging an issue with one of these components that receives
>>> 1.1Gbps from its SUB socket and sends out 1.1Gbps on its PUB socket
>>> (no wonder the two numbers match, since the component does no
>>> aggregation whatsoever).
>>>
>>> The "problem" is that we are currently using 16 ZMQ background
>>> threads to move a total of 2.2Gbps through that software component
>>> (note the physical links can carry up to 20Gbps, so we're far from
>>> saturating the link). IIRC the "golden rule" for sizing the number
>>> of ZMQ background threads is 1 thread per 1Gbps.
>>> As you can see we're very far from this golden rule, and that's
>>> what I'm trying to debug.
>>>
>>> The ZMQ background threads have a CPU usage ranging from 80% to 98%.
>>> Using strace I see that these threads spend most of their time in
>>> the sendto syscall.
>>> So I started digging into the quality of the TX side of the TCP
>>> connection, recording a short trace of the traffic leaving the
>>> software component.
>>>
>>> Analyzing the traffic with Wireshark, it turns out that the TCP
>>> packets of the PUB connection are pretty small:
>>> * 50% of them are 66B long; these are the (incoming) TCP ACK packets
>>> * 21% are in the range 160B-320B
>>> * 18% are in the range 320B-640B
>>> * 5% are in the range 640B-1280B
>>> * just 3% reach the 1500B MTU
>>> * [a <1% fraction even exceeds the link's 1500B MTU, which I'm not
>>> sure how is possible]
>>>
>>> My belief is that having fewer packets, each close to the MTU of
>>> the link, should greatly improve performance. Would you agree with
>>> that?
>>> Is there any configuration I can apply on the PUB socket to force
>>> the Linux TCP stack to generate fewer but larger TCP segments on
>>> the wire?
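One approach that does not depend on TCP tuning is coalescing at the application level: merge small messages into one payload before handing it to the PUB socket, flushing when either the payload nears the MTU or a latency budget expires. A hypothetical sketch (send below stands in for the real socket/ZMQ send; the payload size and delay figures are assumptions):

```python
import time

# Hypothetical application-level coalescer: accumulate small messages
# and flush when the buffer nears the MTU payload or a latency budget
# expires, trading a bounded delay for fewer, larger segments.
MTU_PAYLOAD = 1448    # 1500B MTU minus IPv4+TCP headers and typical options
MAX_DELAY_S = 500e-6  # latency we are willing to pay (500 usec, an assumption)

class Coalescer:
    def __init__(self, send):
        self.send = send  # callable taking one bytes payload
        self.buf = bytearray()
        self.deadline = None

    def push(self, msg: bytes) -> None:
        if not self.buf:
            self.deadline = time.monotonic() + MAX_DELAY_S
        self.buf += msg
        if len(self.buf) >= MTU_PAYLOAD or time.monotonic() >= self.deadline:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.send(bytes(self.buf))
            self.buf = bytearray()

sent = []
c = Coalescer(sent.append)
for _ in range(10):
    c.push(b"m" * 296)  # ten 296B messages...
c.flush()
print([len(p) for p in sent])  # ...leave as a few near-MTU payloads
```

The receiver then has to split the merged payload back into messages, e.g. by length-prefixing each one, which is the price of doing the batching above ZMQ.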
>>>
>>> Thanks for any hint,
>>>
>>> Francesco
>>>
>>>
>>> _______________________________________________
>>> zeromq-dev mailing list
>>> zeromq-dev@lists.zeromq.org
>>> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>>
>>>
>>
