On Wed, May 6, 2020 at 10:03 AM Antoine Pitrou <solip...@pitrou.net> wrote:
>
> On Tue, 5 May 2020 18:59:34 -0700
> Nathaniel Smith <n...@pobox.com> wrote:
> > On Tue, May 5, 2020 at 3:47 PM Guido van Rossum <gu...@python.org> wrote:
> > >
> > > This sounds like a significant milestone!
> > >
> > > Is there some kind of optimized communication possible yet between
> > > subinterpreters? (Otherwise I still worry that it's no better than
> > > subprocesses -- and it could be worse because when one subinterpreter
> > > experiences a hard crash or runs out of memory, all others have to die
> > > with it.)
> >
> > As far as I understand it, the subinterpreter folks have given up on
> > optimized passing of objects, and are only hoping to do optimized
> > (zero-copy) passing of raw memory buffers.
>
> Which would be useful already, especially with pickle out-of-band
> buffers.
Sure, zero cost is always better than some cost, I'm not denying that
:-). What I'm trying to understand is whether the difference is
meaningful enough to justify subinterpreters' increased complexity,
fragility, and ecosystem breakage.

If your data is in large raw memory buffers to start with (like numpy
arrays or arrow dataframes), then yeah, serialization costs are a
smaller proportion of IPC costs. And out-of-band buffers are an
elegant way of letting pickle users take advantage of that speedup
while still using the familiar pickle API. Thanks for writing that
PEP :-).

But when you're in the regime where you're working with large raw
memory buffers, that's also the regime where inter-process shared
memory becomes really efficient. Hence projects like Ray/Plasma [1],
which exist today, and even work for sharing data across languages
and across multi-machine clusters. And the pickle out-of-band buffer
API is general enough to work with shared memory too.

And even if you can't quite manage zero-copy, and have to settle for
one-copy... optimized raw data copying is just *really fast*, similar
to memory access speeds. And CPU-bound, big-data-crunching apps are
by definition going to access that memory and do stuff with it that's
much more expensive than a single memcpy. So I still have trouble
figuring out how skipping a single memcpy will make subinterpreters
significantly faster than subprocesses in any real-world scenario.

-n

[1] https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
    https://github.com/ray-project/ray

--
Nathaniel J. Smith -- https://vorpus.org
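
P.S. Since out-of-band buffers keep coming up: for anyone who hasn't
played with them yet, here's a minimal sketch of the protocol-5 API
(stdlib only, Python 3.8+; the names and sizes are just illustrative):

    import pickle

    # A large buffer we'd rather not copy into the pickle stream.
    big = bytearray(b"x" * 10_000_000)

    buffers = []
    # Wrapping the object in PickleBuffer opts it into out-of-band
    # transfer: buffer_callback collects the raw buffer views instead
    # of embedding the bytes in the payload.
    payload = pickle.dumps(pickle.PickleBuffer(big), protocol=5,
                           buffer_callback=buffers.append)

    # The payload itself is now tiny; the 10 MB travels separately,
    # by whatever transport you like (shared memory, vectored I/O, ...).
    restored = pickle.loads(payload, buffers=buffers)
    assert bytes(restored) == bytes(big)

The API is transport-agnostic on purpose, which is exactly why it
works just as well between processes as it would between
subinterpreters.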
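
And the shared-memory side is similarly short these days. A sketch
using the stdlib's multiprocessing.shared_memory (also 3.8+); both
handles live in one process here for brevity, but the second attach
would normally happen in a different process:

    from multiprocessing import shared_memory

    # Create a block that any process on the machine can map by name.
    shm = shared_memory.SharedMemory(create=True, size=1_000_000)
    shm.buf[:5] = b"hello"

    # A consumer attaches to the same block; both sides now see the
    # same physical pages, so there's no copy at all.
    peer = shared_memory.SharedMemory(name=shm.name)
    assert bytes(peer.buf[:5]) == b"hello"

    peer.close()
    shm.close()
    shm.unlink()  # release the block once everyone is done

That's the same underlying mechanism that Plasma-style object stores
build on, with lifetime management and an object index layered on
top.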