Tom-Newton commented on issue #15233: URL: https://github.com/apache/arrow/issues/15233#issuecomment-2502104233
We just discovered this problem and I'm quite keen to get it fixed. In our case we are using the `AzureFileSystem`, but it implements `background_writes` in a very similar way to the `S3FileSystem`, so it's no surprise that it suffers from the same problem. I think the problem is exactly what @westonpace was describing, but to be precise about how it manifests in this case:

1. With a large number of files and `use_threads=true`, `CopyFiles` submits a large number of copy operations to the IO thread pool, enough to occupy all the IO threads. https://github.com/apache/arrow/blob/main/cpp/src/arrow/filesystem/filesystem.cc#L648-L650
2. With `background_writes=true`, each `Write` (https://github.com/apache/arrow/tree/main/cpp/src/arrow/filesystem/util_internal.cc#L56) to Azure or S3 actually just writes to an in-memory buffer and submits a task to the IO thread pool to do the actual upload.
3. After all the write calls, the `OutputStream` is destructed, which calls `Flush()`. `Flush()` blocks, waiting for all the tasks submitted by calls to `Write` to complete their actual uploads. Unfortunately, this is a deadlock, because there are no free threads in the IO thread pool to actually do any of the uploads.

If we compare to @westonpace's rules:

> This yields roughly the following rules:
> - The user thread (the python top-level thread) should block on a top-level future
> - CPU threads should never block (outside of minor blocking on mutex guards to sequence a tiny critical section)
> - I/O threads should only block on OS calls. They should never block waiting for other tasks.

The third rule is violated, because the IO threads spawned by the copy are blocked waiting for the background-write IO tasks to complete. A toy reproduction of this pattern is sketched at the end of this comment.

A few vague ideas on solutions, though none of them seem particularly attractive:

1. The `OutputStream` destructor does not synchronously `Flush`. Instead it returns a future for the `Flush` (see the second sketch at the end of this comment).
   - This is what @westonpace described.
   - I fear this could require changes to the `FileSystem` interface, particularly around the destructor calling `Flush`. Maybe there are other options though.
2. Try setting priorities when submitting tasks.
   - `CopyFiles` spawns its tasks with lower priority, so they cannot starve the tasks that actually do the uploads.
   - If lower-priority tasks can be preempted, then this seems like a perfect solution. If they cannot, there will still be rare race conditions where all the IO threads fill up with copy operations and cause the deadlock.
3. Use a different thread pool, e.g. the CPU thread pool, for `CopyFiles`.
   - It sort of makes sense to treat uploads with `background_writes=true` as CPU work, because each `Write` is basically just writing to an in-memory buffer, which is probably more CPU than IO.
   - Using the same thread pool at both levels is good for controlling concurrency.
   - For downloads it is definitely wrong to use the CPU pool, because the blocking IO would run directly on the CPU threads.
4. Limit how many tasks `CopyFiles` can run in parallel to slightly fewer than the IO thread pool's capacity (see the semaphore sketch at the end of this comment).
   - There will still be lots of wasted threads.
5. Disable background writes if there are more files than IO threads.
   - If there is one really huge file among lots of small ones, the large file will be uploaded synchronously, which will be bad for performance.
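To make the failure mode concrete, here is a minimal, self-contained C++ sketch of the pattern. None of this is Arrow code: `FixedPool`, `kIoThreads`, and the task bodies are made up for illustration, standing in for the IO thread pool, the `CopyFiles` fan-out, and the background writes respectively.

```cpp
// Toy reproduction (NOT Arrow code) of the deadlock described above: outer
// "copy" tasks fill a fixed-size pool, then each blocks waiting on an inner
// "upload" task submitted to the same pool, which can never be scheduled.
#include <condition_variable>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class FixedPool {
 public:
  explicit FixedPool(size_t n) {
    for (size_t i = 0; i < n; ++i) {
      workers_.emplace_back([this] {
        while (true) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(mu_);
            cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
            if (stop_ && queue_.empty()) return;
            task = std::move(queue_.front());
            queue_.pop();
          }
          task();
        }
      });
    }
  }
  ~FixedPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }
  std::future<void> Submit(std::function<void()> fn) {
    // shared_ptr wrapper keeps the move-only packaged_task copyable for std::function.
    auto task = std::make_shared<std::packaged_task<void()>>(std::move(fn));
    std::future<void> fut = task->get_future();
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push([task] { (*task)(); });
    }
    cv_.notify_one();
    return fut;
  }

 private:
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> queue_;
  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
};

int main() {
  constexpr size_t kIoThreads = 4;
  FixedPool io_pool(kIoThreads);

  // Step 1: CopyFiles-style fan-out fills every IO thread with a copy task.
  std::vector<std::future<void>> copies;
  for (size_t i = 0; i < kIoThreads; ++i) {
    copies.push_back(io_pool.Submit([&io_pool, i] {
      // Step 2: each Write() buffers and submits the real upload to the SAME pool.
      std::future<void> upload =
          io_pool.Submit([i] { std::cout << "upload " << i << " done\n"; });
      // Step 3: the destructor's Flush() blocks on the upload -- but every pool
      // thread is stuck right here, so no thread is free to run it. Deadlock.
      upload.wait();
    }));
  }
  for (auto& c : copies) c.wait();  // never returns
  std::cout << "finished (never printed)\n";
}
```

With `kIoThreads` copy tasks all parked in `upload.wait()`, the queued upload tasks can never run and `main` hangs, which is the same shape as the `Flush()` deadlock above.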
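For idea 1, here is a rough sketch of the same toy model with the blocking removed. This fragment reuses `FixedPool`, `io_pool`, and `kIoThreads` from the sketch above and replaces the loop in `main`: the copy task hands the pending upload back to the caller as a future instead of blocking on it, so only the top-level user thread ever waits, which is what the first of @westonpace's rules permits.

```cpp
// Idea 1 in the toy model: expose pending uploads as futures instead of
// blocking inside the pool task (stand-in for Flush() returning a future).
std::mutex uploads_mu;
std::vector<std::future<void>> uploads;

std::vector<std::future<void>> copies;
for (size_t i = 0; i < kIoThreads; ++i) {
  copies.push_back(io_pool.Submit([&, i] {
    std::future<void> upload =
        io_pool.Submit([i] { std::cout << "upload " << i << " done\n"; });
    // Instead of Flush() blocking here, hand the pending work to the caller.
    std::lock_guard<std::mutex> lock(uploads_mu);
    uploads.push_back(std::move(upload));
  }));
}
for (auto& c : copies) c.wait();   // returns quickly: copy tasks never block
for (auto& u : uploads) u.wait();  // only the user thread blocks, which is allowed
```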
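And a rough sketch of idea 4, using a C++20 `std::counting_semaphore` to cap in-flight copies. This one is standalone; raw threads stand in for pool tasks just to show the throttling mechanism, and the `kIoThreads - 1` split is an arbitrary illustrative choice, not a tuned value.

```cpp
// Idea 4 sketch: throttle copy submissions so background uploads always have
// at least one free thread. Requires C++20 (<semaphore>, std::jthread).
#include <iostream>
#include <semaphore>
#include <thread>
#include <vector>

int main() {
  constexpr int kIoThreads = 4;
  // Keep at least one slot free so background uploads can always make progress.
  constexpr int kMaxConcurrentCopies = kIoThreads - 1;
  std::counting_semaphore<kMaxConcurrentCopies> slots(kMaxConcurrentCopies);

  std::vector<std::jthread> copies;
  for (int i = 0; i < 16; ++i) {
    slots.acquire();  // blocks the submitter once the cap is reached
    copies.emplace_back([&slots, i] {
      std::cout << "copy " << i << " running\n";  // the copy itself
      slots.release();  // free the slot once this copy (and its flush) is done
    });
  }
}  // ~jthread joins all copies
```

The caveat in the list still applies: any threads held back this way sit idle whenever there are no background writes to run.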
