On Wed, May 6, 2020 at 12:36 PM Nathaniel Smith <n...@pobox.com> wrote:

>
> Sure, zero cost is always better than some cost, I'm not denying that
> :-). What I'm trying to understand is whether the difference is
> meaningful enough to justify subinterpreters' increased complexity,
> fragility, and ecosystem breakage.
>
> If your data is in large raw memory buffers to start with (like numpy
> arrays or arrow dataframes), then yeah, serialization costs are a
> smaller proportion of IPC costs. And out-of-band buffers are an
> elegant way of letting pickle users take advantage of that speedup
> while still using the familiar pickle API. Thanks for writing that PEP
> :-).
>
> But when you're in the regime where you're working with large raw
> memory buffers, then that's also the regime where inter-process
> shared-memory becomes really efficient. Hence projects like Ray/Plasma
> [1], which exist today, and even work for sharing data across
> languages and across multi-machine clusters. And the pickle
> out-of-band buffer API is general enough to work with shared memory
> too.
>
> And even if you can't quite manage zero-copy, and have to settle for
> one-copy... optimized raw data copying is just *really fast*, similar
> to memory access speeds. And CPU-bound, big-data-crunching apps are by
> definition going to access that memory and do stuff with it that's
> much more expensive than a single memcpy. So I still have trouble
> figuring out how skipping a single memcpy will make subinterpreters
> significantly faster than subprocesses in any real-world scenario.
>
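
For concreteness, the out-of-band path being described (pickle protocol 5 /
PEP 574) looks roughly like this -- a minimal sketch, with numpy only as a
stand-in payload:

    import pickle
    import numpy as np  # stand-in payload; any buffer-exposing object works

    # A large frame-sized array standing in for the data being shipped around.
    frame = np.zeros((1080, 1920, 3), dtype=np.uint8)

    out_of_band = []
    # With protocol 5, the array's data is handed to buffer_callback instead
    # of being copied into the pickle stream.
    meta = pickle.dumps(frame, protocol=5, buffer_callback=out_of_band.append)

    # The receiving side reattaches the buffers; if the transport put them in
    # shared memory, reconstruction can be a view rather than a copy.
    restored = pickle.loads(meta, buffers=out_of_band)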

While large object copies are fairly fast -- I wouldn't say trivial; a
gigabyte copy introduces noticeable lag when you're processing enough of
them -- the flip side of having large objects is that you want to avoid
piling up so many copies that you run into memory pressure and the dreaded
swapping. A fully parallel multiprocessing engine, where every fork takes
its chunk of data and does everything needed to it, won't gain much from
zero-copy as long as memory limits aren't hit. But a processing pipeline
involves many copies, especially if a central dispatch thread passes data
from stage to stage. That becomes a big deal when stages can slow down or
stall at any time, especially in low-latency applications like video
conferencing, where dispatch needs the flexibility to skip steps or add
extra workers to shove a frame out the door, and using signals to tell
separate processes to do that adds latency and overhead.
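
Roughly the shape I have in mind, as a minimal sketch (the frame size and
helper names are made up): dispatch writes a frame into shared memory once,
and every stage attaches by name and works in place, so the stages hand
around a short string instead of the frame itself.

    from multiprocessing import shared_memory
    import numpy as np

    FRAME_SHAPE = (1080, 1920, 3)            # assumed 1080p RGB frame
    FRAME_BYTES = int(np.prod(FRAME_SHAPE))

    def publish_frame(raw: bytes) -> shared_memory.SharedMemory:
        """The one bulk copy: dispatch writes the frame into a block it owns."""
        shm = shared_memory.SharedMemory(create=True, size=FRAME_BYTES)
        shm.buf[:FRAME_BYTES] = raw
        return shm

    def stage(shm_name: str) -> None:
        """A pipeline stage (normally a worker process): attach by name,
        mutate in place."""
        shm = shared_memory.SharedMemory(name=shm_name)
        frame = np.ndarray(FRAME_SHAPE, dtype=np.uint8, buffer=shm.buf)
        frame //= 2                          # stand-in for real processing
        del frame                            # drop the view so close() succeeds
        shm.close()

    # Dispatch side: stages only ever see shm.name, never a copy of the frame.
    shm = publish_frame(bytes(FRAME_BYTES))
    stage(shm.name)
    shm.close()
    shm.unlink()                             # dispatch frees the block at the end

Skipping a stage is then just dispatch not passing the name along, and an
extra worker is just one more attach -- no signals, no re-serialization.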

Not that I'm recommending anyone go out and build a pure Python
videoconferencing system right now, but it's a use case I'm familiar with.
(I use Python to test new ideas before converting them into C++.)