On Tue, Aug 13, 2013 at 12:17 AM, Demian Brecht <demianbre...@gmail.com> wrote:
> Hi all,
>
> Some work that I'm doing atm is in some serious need of
> parallelization. As such, I've been digging into the multiprocessing
> module more than I've had to before and I had a few questions come up
> as a result:
>
> (Running 2.7.5+ on OSX)
>
> 1. From what I've read, a new Python interpreter instance is kicked
> off for every worker. My immediate assumption was that the file that
> the code was in would be reloaded for every instance. After some
> digging, this is obviously not the case (print __name__ at the top of
> the file only yield a single output line). So, I'm assuming that
> there's some optimization that passes of the bytecode within the
> interpreter? How, exactly does this work? (I couldn't really find much
> in the docs about it, or am I just not looking in the right place?)

I don't know about OSX specifically, but I believe it forks, same as
on Linux. That means all your initialization code is done once. Be
aware that this is NOT the case on Windows.

http://en.wikipedia.org/wiki/Fork_(operating_system)

Effectively, code execution proceeds down a single thread until the
point of forking, and then the fork call returns twice. Can be messy
to explain but it makes great sense once you grok it!

> 2. For cases using methods such as map_async/wait, once the bytecode
> has been passed into the child process, `target` is called `n` times
> until the current queue is empty. Is this correct?

That would be about right, yes. The intention is that it's equivalent
to map(), only it splits the work across multiple processes; so the
expectation is that it will call target for each yielded item in the
iterable.

> 3. Because __main__ is only run when the root process imports, if
> using global, READ-ONLY objects, such as, say, a database connection,
> then it might be better from a performance standpoint to initialize
> that at main, relying on the interpreter references to be passed
> around correctly. I've read some blogs and such that suggest that you
> should create a new database connection within your child process
> targets (or code called into by the targets). This seems to be less
> than optimal to me if my assumption is correct.

This depends hugely on the objects you're working with. If your
database connection uses a TCP socket, for instance, all forked
processes will share the same socket, which will most likely result in
interleaved writes and messed-up reads. But with a log file, that
might be okay (especially if you have some kind of atomicity guarantee
that ensures that individual log entries don't interleave). The
problem isn't really the Python objects (which will have been happily
cloned by the fork() procedure), but the OS-level resources used.

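To make that distinction concrete, here is a minimal sketch, assuming a Unix-like system where os.fork() exists (so not Windows). The fork_demo name and the pipe are chosen purely for illustration and stand in for a real resource such as a socket. The list is an ordinary Python object and gets cloned; the pipe is an OS-level resource and is shared.

```python
import os

def fork_demo():
    """Fork once; return (parent's list after the fork, bytes read from the pipe)."""
    data = ["set before fork"]       # ordinary Python object: cloned by fork()
    read_fd, write_fd = os.pipe()    # OS-level resource: inherited and shared

    pid = os.fork()                  # returns twice: 0 in the child, child's pid in the parent
    if pid == 0:
        try:
            # Child: change its private copy of the list, then write to the shared pipe.
            os.close(read_fd)
            data.append("added in child")
            os.write(write_fd, b"hello from child")
            os.close(write_fd)
        finally:
            os._exit(0)              # exit at once, without running the parent's cleanup code

    # Parent: close our copy of the write end so the read loop sees EOF
    # once the child has closed its copy.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 1024)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)               # reap the child
    return data, b"".join(chunks)

if __name__ == "__main__":
    parent_data, message = fork_demo()
    print(parent_data)   # the child's append is not visible in the parent
    print(message)       # the bytes written to the inherited pipe are
```

With several children writing to one inherited descriptor, nothing coordinates their writes, which is where the interleaving risk for a shared socket comes from.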
With a good database like PostgreSQL, and reasonable numbers of
workers (say, 10-50, rather than 1000-5000), you should be able to
simply establish separate connections for each subprocess without
worrying about performance. If you really need billions of worker
processes, it might be best to use one of the multiprocessing module's
queueing/semaphoring facilities and either have one process that does
all databasing, or let them all use it but serially. But if you can
manage with separate connections, that would be the easiest, safest,
and simplest to debug.

> 4. Related to 3, read-only objects that are initialized prior to being
> passed into a sub-process are safe to reuse as long as they are
> treated as being immutable. Any other objects should use one of the
> shared memory features.
>
> Is this more or less correct, or am I just off my rocker?

When you fork, each process will get its own clone of the objects in
the parent. For read-only objects (module-level constants and such),
this is fine, as you say. The issue is if you want another process to
"see" the change you made. That's when you need some form of shared
data.

So, yes, more or less correct; at least, what you've said is mostly
right for Unix - there may be some additional caveats for OSX
specifically that I'm not aware of. But I expect they'll be minor;
it's mainly Windows, which doesn't *have* fork(2), where there are
major differences.

ChrisA
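For the separate-connections approach in point 3, one possible shape is a Pool whose initializer opens a connection in each worker. This is a rough sketch only: FakeConnection, init_worker, work and run are hypothetical names, and FakeConnection stands in for whatever your real database driver provides. It uses the Python 3 spelling multiprocessing.get_context("fork") to ask for fork explicitly; on 2.7 you would call multiprocessing.Pool directly.

```python
import multiprocessing
import os

_conn = None  # per-process global; each worker fills in its own copy


class FakeConnection(object):
    """Hypothetical stand-in for a real database connection."""

    def __init__(self):
        self.owner_pid = os.getpid()   # remember which process opened it

    def query(self, n):
        return n * n                   # placeholder for a real query


def init_worker():
    # Runs once in each worker process, after the fork, so every
    # worker ends up with its own connection rather than a shared one.
    global _conn
    _conn = FakeConnection()


def work(n):
    # Called once per item, like the function passed to map().
    return (_conn.owner_pid, _conn.query(n))


def run(n_workers=4, items=range(10)):
    ctx = multiprocessing.get_context("fork")   # Unix only
    pool = ctx.Pool(processes=n_workers, initializer=init_worker)
    try:
        return pool.map(work, items)            # results come back in input order
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    for pid, square in run():
        print(pid, square)
```

Each result carries the pid of the worker that produced it, so you can see that the connections were opened in the children and not in the parent.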