OK, I've done a little more digging. It seems that in v0.4, remote workers are started differently. This is my understanding: only one worker on each host is started directly from the master process. Additional workers on that host are started from the first worker on the host. Output from these additional workers is therefore routed via the first worker on the host, rather than directly to the master process. Somehow this causes the intermingled output.
To work around this, I can start all workers directly from the master process, and the output is orderly again (as in v0.3). Presumably the new indirect method in v0.4 was introduced to speed up adding remote workers. I don't really understand much of this, and I'm not sure how connecting all workers directly to the master process affects performance or scalability. Intuitively it doesn't sound good, but for my purpose it does give more readable output.

To help speed up worker startup, I can start workers on different hosts in parallel (while the workers on any one host are still started serially, each directly from the master process):

    # machines is a collection of (host, nworkers) tuples
    @sync begin
        for (host, nworkers) in machines
            # one task per host, so hosts are handled in parallel
            @async begin
                # workers on the same host are added one at a time
                for i = 1:nworkers
                    addprocs([(host, 1)])
                end
            end
        end
    end