As far as I know (and my knowledge here may be dated), the compute
kernels themselves do not do any concurrency.  There are certainly
compute kernels that could benefit from concurrency in this manner
(many of them trivially so), and I think things are set up so that,
if we decide to tackle this feature, we could do so in a systematic
way (instead of writing something for each kernel).

I believe that kernels, if given a unique kernel context, should be thread-safe.
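
So, for example, something like this should be safe (a rough sketch,
not an official recipe; it assumes pyarrow's compute functions
release the GIL while they run, so Python threads can actually
overlap):

from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.compute as pc

# Eight chunks of a million values each (made-up sizes).
chunked = pa.chunked_array(
    [pa.array(range(1_000_000)) for _ in range(8)])

# One independent kernel invocation per chunk; the kernels
# themselves will not spawn any threads.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(pc.sum, chunked.chunks))

total = pc.sum(pa.array([s.as_py() for s in partial_sums]))
print(total)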

The streaming compute engine, on the other hand, does support
concurrency.  It is mostly driven by the scanner at the moment (e.g.,
each batch we fetch from the scanner gets a fresh thread task for
running through the execution plan), but there is some intra-node
concurrency in the hash join and (I think) the hash aggregate nodes.
This has been sufficient to saturate cores on the benchmarks we run.
I know there is ongoing interest in understanding and improving our
concurrency here.
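
For instance, Table.group_by and Table.join in recent pyarrow
releases are (as I understand it) executed by this engine, so they
pick up that intra-node concurrency.  A small sketch:

import pyarrow as pa

left = pa.table({"key": [1, 2, 3, 1, 2], "x": [10, 20, 30, 40, 50]})
right = pa.table({"key": [1, 2], "y": ["a", "b"]})

# Hash aggregate node under the hood (as I understand it).
agg = left.group_by("key").aggregate([("x", "sum")])

# Hash join node under the hood (as I understand it).
joined = left.join(right, keys="key")

print(agg)
print(joined)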

The scanner supports concurrency.  It will typically fetch multiple
files at once and, for each file, it will fetch multiple batches at
once (assuming the file has more than one batch).
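
In pyarrow that concurrency is exposed through the scanner options.
A minimal sketch (the path and batch size are made up):

import pyarrow.dataset as ds

dataset = ds.dataset("/path/to/parquet_dir", format="parquet")

# use_threads=True lets the scanner read multiple files, and
# multiple batches per file, concurrently.
scanner = dataset.scanner(use_threads=True, batch_size=64 * 1024)

for batch in scanner.to_batches():
    pass  # batches arrive as the concurrent reads complete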

> I see a large difference between the total time to apply compute functions to 
> a single table (concatenated from many small tables) and the time to apply 
> compute functions to each sub-table in the composition.

Which one is better?  Can you share a reproducible example?
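
Something along these lines would help us dig in (a rough sketch;
the table shapes and counts are made up).  For what it's worth,
pa.cpu_count / pa.set_cpu_count control the size of the internal
thread pool that the scanner and the exec plan draw from:

import time

import pyarrow as pa
import pyarrow.compute as pc

small_tables = [pa.table({"x": list(range(10_000))})
                for _ in range(500)]
big = pa.concat_tables(small_tables)  # one table, many chunks
big_single = big.combine_chunks()     # one table, one chunk

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(label, time.perf_counter() - start)

timed("per-table", lambda: [pc.sum(t["x"]) for t in small_tables])
timed("chunked  ", lambda: pc.sum(big["x"]))
timed("combined ", lambda: pc.sum(big_single["x"]))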

On Thu, Mar 10, 2022 at 12:01 PM Aldrin <akmon...@ucsc.edu> wrote:
>
> Hello!
>
> I'm wondering if there's any documentation that describes the 
> concurrency/parallelism architecture for the compute API. I'd also be 
> interested in any recommended approaches for observing the performance of 
> the threads Arrow uses: should I try to check a processor ID and infer 
> performance, or are there particular tools that the community uses?
>
> Specifically, I am wondering if the concurrency is going to be different when 
> using a ChunkedArray as an input compared to an Array, or for ChunkedArrays 
> with various chunk sizes (1 chunk vs. tens or hundreds). I see a large 
> difference between the total time to apply compute functions to a single 
> table (concatenated from many small tables) and the time to apply compute 
> functions to each sub-table in the composition. I'm trying to figure out 
> where that difference may come from and I'm wondering if it's related to 
> parallelism within Arrow.
>
> I tried using the GitHub issues and JIRA issues (e.g., [1]) as a way to 
> sleuth out the info, but I couldn't find anything. The pyarrow API seems to 
> have functions I could try to use to figure it out (cpu_count and 
> set_cpu_count), but that seems like a vague road.
>
> [1]: https://issues.apache.org/jira/browse/ARROW-12726
>
>
> Thank you!
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
