Nick Guenther <n...@kousu.ca> added the comment:

Thank you for taking the time to consider my points! Yes, I think you 
understood exactly what I was getting at.

I slept on it, thought over what I'd posted the day after, and came to most of 
the points you raise myself, especially that a serialized next() would mean 
serialized processing. So the naive approach is out.

I am motivated by trying to introduce backpressure to my pipelines. The example 
you gave has potentially unbounded memory usage; if I simply slow the consumer 
down with sleep() I get a memory leak and the main Python process pinning a 
CPU, even though it "isn't" doing anything:

    import itertools, multiprocessing, sys, time

    def worker(x):
        return x ** 3  # cubes, matching the output below

    with multiprocessing.Pool(4) as pool:
        for i, v in enumerate(pool.imap(worker, itertools.count(1)), 1):
            time.sleep(7)
            print(f"At {i}->{v}, memory usage is {sys.getallocatedblocks()}")

At 1->1, memory usage is 230617
At 2->8, memory usage is 411053
At 3->27, memory usage is 581439
At 4->64, memory usage is 748584
At 5->125, memory usage is 909694
At 6->216, memory usage is 1074304
At 7->343, memory usage is 1238106
At 8->512, memory usage is 1389162
At 9->729, memory usage is 1537830
At 10->1000, memory usage is 1648502
At 11->1331, memory usage is 1759864
At 12->1728, memory usage is 1909807
At 13->2197, memory usage is 2005700
At 14->2744, memory usage is 2067407
At 15->3375, memory usage is 2156479
At 16->4096, memory usage is 2240936
At 17->4913, memory usage is 2328123
At 18->5832, memory usage is 2456865
At 19->6859, memory usage is 2614602
At 20->8000, memory usage is 2803736
At 21->9261, memory usage is 2999129

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND     
11314 kousu     20   0  303308  40996   6340 S  91.4  2.1   0:21.68 python3.8   
11317 kousu     20   0   54208  10264   4352 S  16.2  0.5   0:03.76 python3.8   
11315 kousu     20   0   54208  10260   4352 S  15.8  0.5   0:03.74 python3.8   
11316 kousu     20   0   54208  10260   4352 S  15.8  0.5   0:03.74 python3.8   
11318 kousu     20   0   54208  10264   4352 S  15.5  0.5   0:03.72 python3.8 

It seems to me that any use of Pool.imap() either has this same issue lurking, 
or is run on a small finite data set where you are better off using Pool.map().

I like generators because they keep work constant-time and constant-memory, 
which seems like a natural fit for stream processing. Unix pipelines have 
backpressure built in, because write() blocks when the pipe buffer is full.

I did come up with one possibility after sleeping on it again: run the final 
iteration in parallel, perhaps via a special plist() method that makes as many 
parallel next() calls as it can. There are definitely details to work out, but 
I plan to prototype it when I have spare time in the next couple of weeks.

You're entirely right that it's a risky change to propose, so maybe it would be 
best expressed as a package if I get it working. Can I keep this issue open in 
the meantime, to see if it draws insights from anyone else?

Thanks again for taking a look! That's really cool of you!

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40110>
_______________________________________