Documentation of concurrency of the compute API?

Aldrin Thu, 10 Mar 2022 14:01:29 -0800

Hello!

I'm wondering if there's any documentation that describes the
concurrency/parallelism architecture of the compute API. I'd also be
interested in recommended approaches for observing how the threads used by
Arrow perform: should I check a processor ID and infer performance from
that, or are there particular tools that the community uses?


Specifically, I am wondering whether the concurrency differs when the input
is a ChunkedArray rather than an Array, or when ChunkedArrays have different
numbers of chunks (1 chunk vs. tens or hundreds). I see a large difference
between the total time to apply compute functions to a single table
(concatenated from many small tables) and the total time to apply the same
compute functions to each sub-table in the composition. I'm trying to
figure out where that difference may come from, and I'm wondering if it's
related to parallelism within Arrow.

I searched the GitHub issues and JIRA issues (e.g. [1]) for this
information, but I couldn't find anything. The pyarrow API seems to have
functions I could use to investigate (cpu_count and set_cpu_count), but
that seems like an indirect approach.

[1]: https://issues.apache.org/jira/browse/ARROW-12726


Thank you!

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
