On Tue, Aug 13, 2013 at 12:17 AM, Demian Brecht <demianbre...@gmail.com> wrote:
> Hi all,
>
> Some work that I'm doing atm is in some serious need of
> parallelization. As such, I've been digging into the multiprocessing
> module more than I've had to before and I had a few questions come up
> as a result:
>
> (Running 2.7.5+ on OSX)
>
> 1. From what I've read, a new Python interpreter instance is kicked
> off for every worker. My immediate assumption was that the file that
> the code was in would be reloaded for every instance. After some
> digging, this is obviously not the case (print __name__ at the top of
> the file only yields a single output line). So, I'm assuming that
> there's some optimization that passes off the bytecode within the
> interpreter? How exactly does this work? (I couldn't really find much
> in the docs about it, or am I just not looking in the right place?)

I don't know about OSX specifically, but I believe it forks, the same
as on Linux. That means all of your module-level initialization code
runs once, in the parent, and the workers inherit the result. Be aware
that this is NOT the case on Windows, where multiprocessing spawns a
fresh interpreter and re-imports your module for each worker.

http://en.wikipedia.org/wiki/Fork_(operating_system)

Effectively, code execution proceeds down a single thread until the
point of forking, and then the fork call returns twice: once in the
parent (which gets the child's pid) and once in the child (which gets
zero). It can be messy to explain, but it makes great sense once you
grok it!
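
A tiny, untested sketch of that, just os.fork() on its own (nothing to
do with your actual code):

import os

# fork() returns twice: the parent gets the child's pid, the child gets 0.
pid = os.fork()
if pid == 0:
    print("child:  pid %d" % os.getpid())
    os._exit(0)          # child exits without running parent-level cleanup
else:
    print("parent: pid %d forked child %d" % (os.getpid(), pid))
    os.waitpid(pid, 0)   # reap the child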

> 2. For cases using methods such as map_async/wait, once the bytecode
> has been passed into the child process, `target` is called `n` times
> until the current queue is empty. Is this correct?

That would be about right, yes. The intention is that it's equivalent
to map(), only it splits the work across multiple processes; so the
expectation is that target is called once for each item yielded by the
iterable, with those calls spread across the pool's workers.
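
Something like this untested sketch, where square() and the input
range are just placeholders rather than anything from your code:

import multiprocessing

def square(n):
    return n * n

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    result = pool.map_async(square, range(10))  # square() runs once per item
    result.wait()                               # block until the work is done
    print(result.get())  # [0, 1, 4, 9, ...] -- same ordering map() would give
    pool.close()
    pool.join()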

> 3. Because __main__ is only run when the root process imports, if
> using global, READ-ONLY objects, such as, say, a database connection,
> then it might be better from a performance standpoint to initialize
> that at main, relying on the interpreter references to be passed
> around correctly. I've read some blogs and such that suggest that you
> should create a new database connection within your child process
> targets (or code called into by the targets). This seems to be less
> than optimal to me if my assumption is correct.

This depends hugely on the objects you're working with. If your
database connection uses a TCP socket, for instance, all forked
processes will share the same socket, which will most likely result in
interleaved writes and messed-up reads. But with a log file, that
might be okay (especially if you have some kind of atomicity guarantee
that ensures that individual log entries don't interleave). The
problem isn't really the Python objects (which will have been happily
cloned by the fork() procedure), but the OS-level resources used.
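
For instance (a made-up log file, not your code), a file opened before
the fork is the same OS-level resource in both processes afterwards:

import os

log = open("worker.log", "a")   # opened before forking: shared afterwards

pid = os.fork()
if pid == 0:
    # Child: these writes land in the very same file the parent writes to.
    log.write("entry from child %d\n" % os.getpid())
    log.flush()
    os._exit(0)
else:
    log.write("entry from parent %d\n" % os.getpid())
    log.flush()
    os.waitpid(pid, 0)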

With a good database like PostgreSQL, and reasonable numbers of
workers (say, 10-50, rather than 1000-5000), you should be able to
simply establish separate connections for each subprocess without
worrying about performance. If you really do need far more workers
than that, it might be best to use one of the multiprocessing module's
queueing/semaphoring facilities and either dedicate one process to all
the database work, or let every worker share one connection, strictly
serially. But if you can manage with separate connections, that would
be the easiest, safest, and simplest to debug.
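
The separate-connections approach can be as simple as a Pool
initializer. This sketch assumes psycopg2 and invents a DSN and a
users table, so adjust it to whatever you actually have:

import multiprocessing

import psycopg2   # assumed driver; any DB-API module works the same way

conn = None   # per-process global, filled in by the initializer

def init_worker(dsn):
    global conn
    conn = psycopg2.connect(dsn)   # one connection per child process

def lookup(user_id):
    # Hypothetical query against a hypothetical users table.
    cur = conn.cursor()
    cur.execute("SELECT name FROM users WHERE id = %s", (user_id,))
    row = cur.fetchone()
    cur.close()
    return row

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=10,
                                initializer=init_worker,
                                initargs=("dbname=example",))
    print(pool.map(lookup, range(100)))
    pool.close()
    pool.join()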

> 4. Related to 3, read-only objects that are initialized prior to being
> passed into a sub-process are safe to reuse as long as they are
> treated as being immutable. Any other objects should use one of the
> shared memory features.
>
> Is this more or less correct, or am I just off my rocker?

When you fork, each process will get its own clone of the objects in
the parent. For read-only objects (module-level constants and such),
this is fine, as you say. The issue is if you want another process to
"see" the change you made. That's when you need some form of shared
data.
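
A quick way to see the difference (the counter names are made up):

import multiprocessing

plain_counter = 0                               # cloned into each child
shared_counter = multiprocessing.Value('i', 0)  # lives in shared memory

def work():
    global plain_counter
    plain_counter += 1                 # changes only the child's private copy
    with shared_counter.get_lock():
        shared_counter.value += 1      # the parent sees this one

if __name__ == '__main__':
    p = multiprocessing.Process(target=work)
    p.start()
    p.join()
    print(plain_counter)          # still 0 in the parent
    print(shared_counter.value)   # 1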

So, yes, more or less correct; at least, what you've said is mostly
right for Unix - there may be some additional caveats for OSX
specifically that I'm not aware of. But I expect they'll be minor;
it's mainly Windows, which doesn't *have* fork(2), where there are
major differences.

ChrisA
