To explain point two: you can't easily impose an ordering on messages across multiple connections, even to the same peer. It's a fairly fundamental limit. The only reason a single ZMQ connection can provide in-order delivery is that it leans on TCP to correct the duplicated segments, out-of-order delivery, random bit-flips, lost segments, and other chaos that goes on at the IP level. If all your application needs is to get many messages to the other end as fast as possible, then by all means open multiple connections; likewise if you have urgent messages that must be processed asynchronously and dodge head-of-line blocking behind the high-volume channel. But if you need your messages processed in some global order, you're better off using a single connection rather than reinventing half of TCP.
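(If out-of-order arrival across connections is tolerable on the wire but not for processing, one option is to tag each message with a sequence number at the sender and re-sequence at the receiver. Here's a minimal, transport-agnostic sketch in Python; plain data structures stand in for the sockets, so everything here is illustrative, not part of any ZMQ API:)

```python
import heapq

class Resequencer:
    """Buffers out-of-order (seq, payload) pairs and releases them in order."""
    def __init__(self):
        self.next_seq = 0   # next sequence number we expect to deliver
        self.pending = []   # min-heap of (seq, payload) not yet deliverable

    def push(self, seq, payload):
        """Accept a message from any connection; return the messages that
        are now deliverable in global order (possibly an empty list)."""
        heapq.heappush(self.pending, (seq, payload))
        ready = []
        while self.pending and self.pending[0][0] == self.next_seq:
            _, p = heapq.heappop(self.pending)
            ready.append(p)
            self.next_seq += 1
        return ready

# Messages arrive interleaved from two connections, out of global order:
rx = Resequencer()
arrivals = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]
delivered = []
for seq, payload in arrivals:
    delivered.extend(rx.push(seq, payload))
print(delivered)  # ['a', 'b', 'c', 'd']
```

Note the buffer is unbounded: a slow connection can make it grow without limit, which is exactly where flow control re-enters the picture.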
You also can't use credit-based flow control here: PUSH and PULL are unidirectional, so the PULL socket has no way to send credit back to the PUSH. Replace PUSH with DEALER and PULL with ROUTER and you can. But even then, credit-based flow control is about bounding the number of messages/bytes/other units of work currently in flight in the system; if your goal is solely to saturate a link, it's actually the opposite of what you want.

Now, my experience is at the extreme other end of the spectrum from your use case, but if I were inclined to optimize for Maximum Fast, two lines of investigation occur to me at this point.

First, what is happening at the network level once throughput hits that plateau? I'd take a pcap during the test and open it in Wireshark afterwards; look at the times payload-carrying packets go out and when the corresponding ACK packets come back. If they bunch up in any way, some tuning of TCP options at the socket-option or possibly kernel-sysctl level may be called for.

Second, where is that CPU time actually being spent? Intuitively I expect this will take more effort to bear fruit, which is why I'd save it for after poking at the network, but I'd make a flamegraph[1] (substitute your favorite profiler here) and look for hotspots I might be able to optimize. Past the threshold where CPU utilization decreases, things get harder, since that suggests time waiting on locks or hardware starts to dominate. There are ways to interrogate waits into and past the kernel, but I've never had to do it, so I can't tell you how painful it might be.

Good luck, friend.

[1]: https://github.com/brendangregg/FlameGraph

On Thu, Oct 24, 2019, 9:38 PM Brett Viren via zeromq-dev <zeromq-dev@lists.zeromq.org> wrote:

> Hi again,
>
> Doron Somech <somdo...@gmail.com> writes:
>
> > You need to create multiple connections to enjoy the multiple io threads.
> >
> > So in the remote/local_thr connect to the same endpoint 100 times and
> > create 10 io threads.
> >
> > You don't need to create multiple sockets, just call connect multiple
> > times with same address.
>
> I keep working on evaluating ZeroMQ against this 100 Gbps network when I
> get a spare moment. You can see some initial results in the attached
> PNG. As-is, things look pretty decent but there are two effects I see
> which I don't fully understand and I think are important to achieving
> something closer to saturation.
>
> 1) As you can see in the PNG, as the message size increases beyond 10 kB
> the 10 I/O threads become less and less active. This activity seems
> correlated with throughput. But, why the die-off as we go to higher
> message size and why the resurgence at ~1 MB? Might there be some
> additional tricks to lift the throughput? Are such large messages
> simply not reasonable?
>
> 2) I've instrumented my tester to include a sequence count in each
> message and it uncovers that this multi-thread/multi-connect trick may
> lead to messages arriving to the receiver out-of-order. Given PUSH is
> round-robin and PULL is fair queued, I naively didn't expect this. But
> seeing it, I have two guesses. 1) I don't actually know what
> "fair-queued" really means :) and 2) if a mute state is getting hit then
> maybe all bets are off. I do wonder if adding "credit based" transfers
> might solve this ordering. Eg, if N credits (or fewer) are used given N
> connects, might the round-robin/fair-queue ordering stay in lock step?
>
> Any ideas are welcome. Thanks!
>
> -Brett.
>
> _______________________________________________
> zeromq-dev mailing list
> zeromq-dev@lists.zeromq.org
> https://lists.zeromq.org/mailman/listinfo/zeromq-dev
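P.S. To make the credit-based idea concrete: credits bound work in flight rather than fix ordering. A minimal sketch of the accounting, in transport-agnostic Python with a plain deque standing in for the DEALER-to-ROUTER channel (every name here is illustrative, not a ZMQ API):

```python
from collections import deque

CREDITS = 4  # receiver grants this many messages of headroom

def run_credit_flow(messages):
    """Simulate sender/receiver with credit-based flow control.
    Returns (delivered, max_in_flight) so the bound can be checked."""
    to_receiver = deque()   # stands in for the DEALER->ROUTER data channel
    credits = CREDITS       # sender's current credit balance
    pending = deque(messages)
    delivered = []
    max_in_flight = 0
    while pending or to_receiver:
        # Sender: transmit only while it holds credit.
        while pending and credits > 0:
            to_receiver.append(pending.popleft())
            credits -= 1
        max_in_flight = max(max_in_flight, len(to_receiver))
        # Receiver: consume one message, then grant one credit back
        # (in real ZMQ this would be a message on the reverse channel).
        if to_receiver:
            delivered.append(to_receiver.popleft())
            credits += 1
    return delivered, max_in_flight

msgs = list(range(10))
delivered, peak = run_credit_flow(msgs)
print(delivered == msgs, peak)  # True 4
```

The in-order delivery here comes from the single queue, not from the credits; with N real connections, credits cap the backlog on each path but fan-in at the PULL/ROUTER side can still interleave.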