Hi all,

today I took some time to build a memory-pooling mechanism into the ZMQ
local_thr/remote_thr benchmarking utilities.
Here's the result:
 https://github.com/zeromq/libzmq/pull/3631
The PR is a work in progress: a simple modification that shows the effect
of avoiding malloc/free when creating zmq_msg_t in the standard ZMQ
benchmark utilities.

In particular, the very fast, lock-free, single-producer/single-consumer
queue from:
https://github.com/cameron314/readerwriterqueue
is used to maintain a list of free buffers shared between the "remote_thr"
main thread and its ZMQ background IO thread.
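
The idea, as a simplified sketch (the names and sizes below are mine, not
the exact code of the PR, and it assumes messages never exceed
MAX_MSG_SIZE): the deallocation callback runs on the ZMQ background IO
thread (single producer) while the benchmark main thread picks free buffers
up again (single consumer), so an SPSC queue is enough:

    #include <zmq.h>
    #include <cstdlib>
    #include "readerwriterqueue.h" // from cameron314/readerwriterqueue

    static const size_t MAX_MSG_SIZE = 210;

    // free-buffer list between the ZMQ background IO thread (producer)
    // and the benchmark main thread (consumer)
    static moodycamel::ReaderWriterQueue<void *> s_free_buffers (8192);

    // invoked by libzmq on its IO thread once the message content is released
    static void return_to_pool (void *data_, void *)
    {
        if (!s_free_buffers.try_enqueue (data_))
            free (data_); // queue full: just release the buffer
    }

    static int send_pooled (void *socket_, size_t msg_size_)
    {
        void *buf = NULL;
        if (!s_free_buffers.try_dequeue (buf))
            buf = malloc (MAX_MSG_SIZE); // pool empty: fall back to malloc

        zmq_msg_t msg;
        zmq_msg_init_data (&msg, buf, msg_size_, return_to_pool, NULL);
        return zmq_msg_send (&msg, socket_, 0);
    }

Note that in this sketch zmq_msg_init_data() still performs its small
internal allocation for the reference-counting block; what gets recycled is
the payload buffer.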

Here are the graphical results:
with mallocs / no memory pool:
   https://cdn1.imggmi.com/uploads/2019/8/13/9f009b91df394fa945cd2519fd993f50-full.png
with memory pool:
   https://cdn1.imggmi.com/uploads/2019/8/13/f3ae0d6d58e9721b63129c23fe7347a6-full.png

Doing the math, the memory-pooled approach shows:

 - roughly the same performance for messages <= 32B,
 - +15% pps/throughput @ 64B,
 - +60% pps/throughput @ 128B,
 - +70% pps/throughput @ 210B.

[The tests were stopped at 210B because my current quick-and-dirty memory
pool has a fixed max message size of about 210B.]

Honestly this is not a huge speedup, even if it's still interesting.
With these changes the performance now seems to be bounded by the
"local_thr" side rather than by "remote_thr": the ZMQ background IO
thread of local_thr is the only thread at 100% across the two systems,
and its "perf top" now shows:

  15,02%  libzmq.so.5.2.3     [.] zmq::metadata_t::add_ref
  14,91%  libzmq.so.5.2.3     [.] zmq::v2_decoder_t::size_ready
   8,94%  libzmq.so.5.2.3     [.] zmq::ypipe_t<zmq::msg_t, 256>::write
   6,97%  libzmq.so.5.2.3     [.] zmq::msg_t::close
   5,48%  libzmq.so.5.2.3     [.] zmq::decoder_base_t<zmq::v2_decoder_t, zmq::shared_message_memory_allo
   5,40%  libzmq.so.5.2.3     [.] zmq::pipe_t::write
   4,94%  libzmq.so.5.2.3     [.] zmq::shared_message_memory_allocator::inc_ref
   2,59%  libzmq.so.5.2.3     [.] zmq::msg_t::init_external_storage
   1,63%  [kernel]            [k] copy_user_enhanced_fast_string
   1,56%  libzmq.so.5.2.3     [.] zmq::msg_t::data
   1,43%  libzmq.so.5.2.3     [.] zmq::msg_t::init
   1,34%  libzmq.so.5.2.3     [.] zmq::pipe_t::check_write
   1,24%  libzmq.so.5.2.3     [.] zmq::stream_engine_base_t::in_event_internal
   1,24%  libzmq.so.5.2.3     [.] zmq::msg_t::size

Do you know what this profile might indicate?
I would have expected that ZMQ background thread to spend most of its
time in its read() system call (from the TCP socket)...

Thanks,
Francesco


On Fri, 19 Jul 2019 at 18:15 Francesco
<francesco.monto...@gmail.com> wrote:
>
> Hi Yan,
> Unfortunately I have put my attempts in this area on hold after getting some 
> strange results (possibly because I tried it in a complex application 
> context... I should probably try hacking a simple ZeroMQ example instead!).
>
> I'm also a bit surprised that nobody has tried and posted online a way to 
> achieve something similar (memory-pooled zmq send)... but anyway it remains 
> in my plans to try it out when I have a bit more spare time...
> If you manage to get some results earlier, I would be eager to know :-)
>
> Francesco
>
>
> On Fri, 19 Jul 2019 at 04:02 Yan, Liming (NSB - CN/Hangzhou) 
> <liming....@nokia-sbell.com> wrote:
>>
>> Hi, Francesco
>>    Could you please share the final solution and benchmark results for plan 
>> 2? Big thanks.
>>    I'm interested in this because I had tried something similar before with 
>> zmq_msg_init_data() and zmq_msg_send(), but failed because of two issues. 1) 
>> My process runs in the background for a long time, and I eventually found 
>> that it occupies more and more memory until it exhausts the system memory; 
>> it seems this approach leaks memory. 2) I provided *ffn for deallocation, 
>> but the memory is freed back much more slowly than it is consumed, so in the 
>> end my own customized pool could also be exhausted. How do you solve this?
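>>
>>    For clarity, the pattern I mean is roughly the following (a simplified 
>> sketch; my_pool_t and the my_pool_* helpers are just placeholders for my own 
>> pool, not a real API):
>>
>>     /* called by libzmq, on its IO thread, once the message is released */
>>     static void give_back (void *data_, void *hint_)
>>     {
>>         my_pool_t *pool = static_cast<my_pool_t *> (hint_);
>>         my_pool_release (pool, data_); /* must be thread-safe */
>>     }
>>
>>     /* sending path: no malloc, the buffer comes from my own pool */
>>     void *buf = my_pool_acquire (pool, len);
>>     memcpy (buf, payload, len);
>>
>>     zmq_msg_t msg;
>>     zmq_msg_init_data (&msg, buf, len, give_back, pool);
>>     zmq_msg_send (&msg, socket, 0);
>>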
>>    I had to go back to using zmq_send(). I know it has a memory-copy penalty, 
>> but it's the easiest and most stable way to send messages. I'm still using 
>> 0MQ 4.1.x.
>>    Thanks.
>>
>> BR
>> Yan Limin
>>
>> -----Original Message-----
>> From: zeromq-dev [mailto:zeromq-dev-boun...@lists.zeromq.org] On Behalf Of 
>> Luca Boccassi
>> Sent: Friday, July 05, 2019 4:58 PM
>> To: ZeroMQ development list <zeromq-dev@lists.zeromq.org>
>> Subject: Re: [zeromq-dev] Memory pool for zmq_msg_t
>>
>> There's no need to change the source for experimenting: you can just use 
>> _init_data without a callback and then with a callback (yes, the first case 
>> will leak memory, but it's just a test), and measure the difference between 
>> the two cases. You can then immediately see whether it's worth pursuing 
>> further optimisations or not.
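>>
>> Concretely, something along these lines (illustrative only; buf, len, socket 
>> and the with/without switch are placeholders, and buf is assumed to be a 
>> heap buffer you own):
>>
>>     static void free_cb (void *data_, void *) { free (data_); }
>>
>>     zmq_msg_t msg;
>>     if (with_callback)
>>         /* content_t gets malloc'ed; libzmq frees the buffer via free_cb
>>            once the message is fully released */
>>         zmq_msg_init_data (&msg, buf, len, free_cb, NULL);
>>     else
>>         /* constant message: no internal malloc, the buffer is deliberately
>>            leaked for the test */
>>         zmq_msg_init_data (&msg, buf, len, NULL, NULL);
>>     zmq_msg_send (&msg, socket, 0);
>>
>> and then compare the throughput of the two variants.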
>>
>> _external_storage is an implementation detail, and it's non-shared because 
>> it's used only in the receive case, with a reference to the TCP buffer used 
>> in the system call for zero-copy receives. Exposing it would mean that those 
>> kinds of messages could not be used with pub-sub or radio-dish, as they 
>> can't have multiple references without copying them, which means there would 
>> be a semantic difference between the different message initialisation APIs, 
>> unlike now, when the difference is only in who owns the buffer. It would 
>> make the API quite messy in my opinion, and be quite confusing, as pub/sub 
>> is probably the best-known pattern.
>>
>> On Thu, 2019-07-04 at 23:20 +0200, Francesco wrote:
>> > Hi Luca,
>> > thanks for the details. Now I understand why the "content_t" needs to be
>> > allocated dynamically: it's just like the control block used by STL's
>> > std::shared_ptr<>.
>> >
>> > And you're right: I'm not sure how much gain there is in removing 100%
>> > of the malloc operations from my TX path... still, I would be curious to
>> > find out, but right now it seems I would need to patch the ZMQ source
>> > code to achieve that.
>> >
>> > Anyway, I wonder if it would be possible to expose in the public API a
>> > method like "zmq::msg_t::init_external_storage()" that, AFAICS, allows
>> > creating a non-shared zero-copy long message... it appears to be used
>> > only by the v2 decoder internally right now...
>> > Is there a specific reason why that's not accessible from the public
>> > API?
>> >
>> > Thanks,
>> > Francesco
>> >
>> >
>> >
>> >
>> >
>> > On Thu, 4 Jul 2019 at 20:25 Luca Boccassi
>> > <luca.bocca...@gmail.com> wrote:
>> > > Another reason for that small struct to be on the heap is so that it
>> > > can be shared among all the copies of the message (eg: a pub socket
>> > > has N copies of the message on the stack, one for each subscriber).
>> > > The struct has an atomic counter in it, so that when all the copies
>> > > of the message on the stack have been closed, the userspace buffer
>> > > deallocation callback can be invoked. If the atomic counter were on
>> > > the stack inlined in the message, this wouldn't work.
>> > > So even if room were to be found, a malloc would still be needed.
>> > >
>> > > If you _really_ are worried about it, and testing shows it makes a
>> > > difference, then one option could be to pre-allocate a set of these
>> > > metadata structures at startup, and just assign them when the
>> > > message is created. It's possible, but increases complexity quite a
>> > > bit, so it needs to be worth it.
>> > >
>> > > On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
>> > > > The second malloc cannot be avoided, but it's tiny and fixed in size
>> > > > at compile time, so the compiler and glibc will be able to optimize
>> > > > it to death.
>> > > >
>> > > > The reason for that is that there's not enough room in the 64 bytes
>> > > > to store that structure, and increasing the message allocation on
>> > > > the stack past 64 bytes means it will no longer fit in a single cache
>> > > > line, which will incur a performance penalty far worse than the small
>> > > > malloc (I tested this some time ago). That is of course unless you
>> > > > are running on s390 or a POWER with a 256-byte cacheline, but given
>> > > > it's part of the ABI it would be a bit of a mess for the benefit of
>> > > > very few users, if any.
>> > > >
>> > > > So I'd recommend just going with the second plan, and comparing the
>> > > > result when passing a deallocation function vs not passing it (yes,
>> > > > it will leak the memory but it's just for the test). My bet is that
>> > > > the difference will not be that large.
>> > > >
>> > > > On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
>> > > > > Hi Stephan, Hi Luca,
>> > > > >
>> > > > > thanks for your hints. However I inspected
>> > > > > https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
>> > > > > and I don't think it avoids malloc()... see my point 2) below.
>> > > > >
>> > > > > Indeed I realized that the current ZMQ API probably does not allow
>> > > > > me to achieve 100% of what I intended to do.
>> > > > > Let me rephrase my target: I want to be able to
>> > > > >  - memory pool creation: do one large memory allocation of, say, 1M
>> > > > > zmq_msg_t only at the start of my program; let's say I create all
>> > > > > these zmq_msg_t with a size of 2k bytes each (let's assume this is
>> > > > > the max message size possible in my app);
>> > > > >  - during the application lifetime: call zmq_msg_send() at any time,
>> > > > > always avoiding malloc() operations (just picking the first
>> > > > > available unused entry of zmq_msg_t from the memory pool).
>> > > > >
>> > > > > Initially I thought that was possible, but I think I have identified
>> > > > > 2 blocking issues:
>> > > > > 1) If I try to recycle zmq_msg_t directly: in this case I will fail,
>> > > > > because I cannot really change only the "size" member of a
>> > > > > zmq_msg_t without reallocating it... so I'm forced (in my example)
>> > > > > to always send 2k bytes out (!!)
>> > > > > 2) If I create only a memory pool of 2k-byte buffers and then wrap
>> > > > > the first available buffer inside a zmq_msg_t (allocated on the
>> > > > > stack, not on the heap): in this case I need to know when the
>> > > > > internals of ZMQ have finished using the zmq_msg_t and thus when I
>> > > > > can mark that buffer as available again in my memory pool. However
>> > > > > I see that the zmq_msg_init_data() ZMQ code contains:
>> > > > >
>> > > > >     //  Initialize constant message if there's no need to deallocate
>> > > > >     if (ffn_ == NULL) {
>> > > > > ...
>> > > > >         _u.cmsg.data = data_;
>> > > > >         _u.cmsg.size = size_;
>> > > > > ...
>> > > > >     } else {
>> > > > > ...
>> > > > >         _u.lmsg.content =
>> > > > >           static_cast<content_t *> (malloc (sizeof (content_t)));
>> > > > > ...
>> > > > >         _u.lmsg.content->data = data_;
>> > > > >         _u.lmsg.content->size = size_;
>> > > > >         _u.lmsg.content->ffn = ffn_;
>> > > > >         _u.lmsg.content->hint = hint_;
>> > > > >         new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
>> > > > >     }
>> > > > >
>> > > > > So I skip the malloc() operation only if I pass ffn_ == NULL. The
>> > > > > problem is that if I pass ffn_ == NULL, then I have no way to know
>> > > > > when the internals of ZMQ have finished using the zmq_msg_t...
>> > > > >
>> > > > > Any way to work around either issue 1) or issue 2)?
>> > > > >
>> > > > > I understand that the malloc is just of sizeof(content_t) ~= 40B...
>> > > > > but still I'd like to avoid it...
>> > > > >
>> > > > > Thanks!
>> > > > > Francesco
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, 4 Jul 2019 at 14:58 Stephan Opfer
>> > > > > <op...@vs.uni-kassel.de> wrote:
>> > > > > > On 04.07.19 14:29, Luca Boccassi wrote:
>> > > > > > > How users make use of these primitives is up to them though, I
>> > > > > > > don't think anything special was shared before, as far as I
>> > > > > > > remember.
>> > > > > >
>> > > > > > Some examples can be found here:
>> > > > > > https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>> > > > > >
>> > > > > > The classes Publisher and Subscriber should replace the publisher
>> > > > > > and subscriber in a former Robot-Operating-System-based system. I
>> > > > > > hope that the subscriber is actually using the method Luca is
>> > > > > > talking about on the receiving side.
>> > > > > >
>> > > > > > The message data here is a Cap'n Proto container that we "simply"
>> > > > > > serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
>> > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> >
>> >
>> --
>> Kind regards,
>> Luca Boccassi
_______________________________________________
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
