Awesome, thanks for the detailed response, Chris.

On Tue, Aug 13, 2013 at 8:03 AM, Chris Angelico <ros...@gmail.com> wrote:
> On Tue, Aug 13, 2013 at 12:17 AM, Demian Brecht <demianbre...@gmail.com>
> wrote:
>> Hi all,
>>
>> Some work that I'm doing atm is in some serious need of
>> parallelization. As such, I've been digging into the multiprocessing
>> module more than I've had to before and I had a few questions come up
>> as a result:
>>
>> (Running 2.7.5+ on OSX)
>>
>> 1. From what I've read, a new Python interpreter instance is kicked
>> off for every worker. My immediate assumption was that the file that
>> the code was in would be reloaded for every instance. After some
>> digging, this is obviously not the case (print __name__ at the top of
>> the file only yields a single output line). So, I'm assuming that
>> there's some optimization that passes off the bytecode within the
>> interpreter? How, exactly, does this work? (I couldn't really find much
>> in the docs about it, or am I just not looking in the right place?)
>
> I don't know about OSX specifically, but I believe it forks, same as
> on Linux. That means all your initialization code is done once. Be
> aware that this is NOT the case on Windows.
>
> http://en.wikipedia.org/wiki/Fork_(operating_system)
>
> Effectively, code execution proceeds down a single thread until the
> point of forking, and then the fork call returns twice. Can be messy
> to explain but it makes great sense once you grok it!
>
>> 2. For cases using methods such as map_async/wait, once the bytecode
>> has been passed into the child process, `target` is called `n` times
>> until the current queue is empty. Is this correct?
>
> That would be about right, yes. The intention is that it's equivalent
> to map(), only it splits the work across multiple processes; so the
> expectation is that it will call target for each yielded item in the
> iterable.
>
>> 3. Because __main__ is only run when the root process imports, if
>> using global, READ-ONLY objects, such as, say, a database connection,
>> then it might be better from a performance standpoint to initialize
>> that at main, relying on the interpreter references to be passed
>> around correctly. I've read some blogs and such that suggest that you
>> should create a new database connection within your child process
>> targets (or code called into by the targets). This seems to be less
>> than optimal to me if my assumption is correct.
>
> This depends hugely on the objects you're working with. If your
> database connection uses a TCP socket, for instance, all forked
> processes will share the same socket, which will most likely result in
> interleaved writes and messed-up reads. But with a log file, that
> might be okay (especially if you have some kind of atomicity guarantee
> that ensures that individual log entries don't interleave). The
> problem isn't really the Python objects (which will have been happily
> cloned by the fork() procedure), but the OS-level resources used.
>
> With a good database like PostgreSQL, and reasonable numbers of
> workers (say, 10-50, rather than 1000-5000), you should be able to
> simply establish separate connections for each subprocess without
> worrying about performance. If you really need billions of worker
> processes, it might be best to use one of the multiprocessing module's
> queueing/semaphoring facilities and either have one process that does
> all databasing, or let them all use it but serially. But if you can
> manage with separate connections, that would be the easiest, safest,
> and simplest to debug.
>
>> 4. Related to 3, read-only objects that are initialized prior to being
>> passed into a sub-process are safe to reuse as long as they are
>> treated as being immutable. Any other objects should use one of the
>> shared memory features.
>>
>> Is this more or less correct, or am I just off my rocker?
>
> When you fork, each process will get its own clone of the objects in
> the parent. For read-only objects (module-level constants and such),
> this is fine, as you say. The issue is if you want another process to
> "see" the change you made. That's when you need some form of shared
> data.
>
> So, yes, more or less correct; at least, what you've said is mostly
> right for Unix - there may be some additional caveats for OSX
> specifically that I'm not aware of. But I expect they'll be minor;
> it's mainly Windows, which doesn't *have* fork(2), where there are
> major differences.
>
> ChrisA
> --
> http://mail.python.org/mailman/listinfo/python-list
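
A minimal sketch of the clone-on-fork behaviour described in the answer to
question 4, for anyone who wants to try it. It is an illustration under
stated assumptions, not code from the thread: it assumes a Unix-like
platform where the "fork" start method is available, and it is written for
Python 3 (multiprocessing.get_context does not exist on 2.7, where forking
is simply what multiprocessing does on Unix). The names SETTINGS,
plain_counter, worker and run_demo are hypothetical.

```python
import multiprocessing as mp

# Assumption: a Unix-like platform where the "fork" start method exists.
ctx = mp.get_context("fork")

# Module-level state, created once in the parent before any fork.
SETTINGS = {"batch_size": 100}  # treated as read-only by the workers
plain_counter = 0               # ordinary object: each child gets its own clone


def worker(shared_counter, results):
    global plain_counter
    # Inherited data is readable without re-running module-level code.
    results.put(SETTINGS["batch_size"])
    # This only changes the child's private clone.
    plain_counter += 1
    # This change reaches the parent, because Value lives in shared memory.
    with shared_counter.get_lock():
        shared_counter.value += 1


def run_demo(n_workers=4):
    shared_counter = ctx.Value("i", 0)
    results = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(shared_counter, results))
        for _ in range(n_workers)
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    seen = [results.get() for _ in procs]
    for p in procs:
        p.join()
    return plain_counter, shared_counter.value, seen


if __name__ == "__main__":
    print(run_demo())
```

With four workers, run_demo() should return (0, 4, [100, 100, 100, 100]):
every child could read SETTINGS, the parent's plain_counter never moved, and
only the shared Value recorded the increments. The same reasoning is why an
OS-level resource such as a database socket is better opened inside each
worker than inherited from the parent.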

--
Demian Brecht
http://demianbrecht.github.com