I think there's one minor misunderstanding, but I like the essence of the
feedback.

To clarify, the MeanAggr::Accumulate function aggregates over the points of
a sample, where each row is a sample and the columns are its corresponding
values, e.g.:

columns (values) |  c0  |  c1  |  c2 |  c3 |   c4
row 0 (sample 0) |   1  |   2  |   3 |   4 |     5
row 1 (sample 1) |   1  |   4  |  27 | 256 |  3125

For this tiny example, applying Accumulate "by slice" means that I apply it
once on row 0, then again on row 1, and sum the two times. "By table" means
that I concatenate row 0 and row 1, then apply Accumulate once on the
resulting table. Combine isn't currently being considered (it's for when I
split on columns). You can sort of see this in [1], though that diagram also
illustrates sequential calls of Accumulate instead of using Combine. I will
explain this more in a reproducible example.
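
To make the comparison concrete, the two timings are gathered roughly like
so (a simplified sketch, not my exact code; it assumes MeanAggr::Accumulate
takes a shared_ptr<Table>, which is close to, but not exactly, the real
signature in my repo):

#include <chrono>
#include <memory>
#include <vector>

#include <arrow/api.h>

// NOTE: MeanAggr is my own class (see statops.cpp linked earlier in the
// thread); only the way Accumulate is called matters for this sketch.
void CompareTimings(const std::vector<std::shared_ptr<arrow::Table>>& slices) {
  using clock = std::chrono::steady_clock;

  // "by slice": call Accumulate once per slice; the reported time is the sum
  MeanAggr slice_aggr;
  auto start = clock::now();
  for (const auto& slice : slices) { slice_aggr.Accumulate(slice); }
  auto time_by_slice = clock::now() - start;

  // "by table": concatenate the slices first, then call Accumulate once
  auto whole_table = arrow::ConcatenateTables(slices).ValueOrDie();
  MeanAggr table_aggr;
  start = clock::now();
  table_aggr.Accumulate(whole_table);
  auto time_by_table = clock::now() - start;

  // (in the real code these durations get recorded per table)
  (void) time_by_slice;
  (void) time_by_table;
}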

Given the clarification, I am not sure the suggested local calculations
help here, but maybe you mean I shouldn't use so many shared pointers? That
said, I do think I'll try shortening the code path by operating on Arrays
whenever I'm applying a function to a Table that I know has only one chunk
(because I constructed it that way). That seems like it should help isolate
some of the overhead.
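
Something like this is what I have in mind (just a sketch; "add" stands in
for whichever arithmetic kernel is actually being applied, and the helper
name is made up for illustration):

#include <memory>

#include <arrow/api.h>
#include <arrow/compute/api.h>

// If both columns have exactly one chunk, call the kernel on the underlying
// Arrays directly instead of going through the ChunkedArray code path.
arrow::Result<arrow::Datum>
AddColumns(const std::shared_ptr<arrow::ChunkedArray>& left,
           const std::shared_ptr<arrow::ChunkedArray>& right) {
  if (left->num_chunks() == 1 && right->num_chunks() == 1) {
    return arrow::compute::CallFunction("add", {left->chunk(0), right->chunk(0)});
  }

  return arrow::compute::CallFunction("add", {left, right});
}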

Thanks for the feedback!

[1]:
https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/fb688531169421a5b5985d2cbfee100e793cae2f/resources/assets/TStatistic_Diagram.png

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Thu, Mar 10, 2022 at 7:49 PM Niranda Perera <[email protected]>
wrote:

> Okay, one thing I immediately see is that there are a lot of memory
> allocations/deallocations happening in your approach, IMO. arrow::compute
> functions don't mutate their inputs, so every result you get is freshly
> allocated in memory, and when you overwrite an existing shared_ptr you
> deallocate the previous buffers. This happens in both MeanAggr::Combine
> and MeanAggr::Accumulate, and it could be a reason why the split version
> is slower; the single-table version only has to go through
> MeanAggr::Accumulate.
>
> If I may suggest an alternative approach, I'd do this for the variance
> calculation:
>
> class MeanAggr {
>   int64_t count_;
>   std::vector<std::shared_ptr<arrow::Array>> sums_;
>   std::vector<std::shared_ptr<arrow::Array>> sum_squares_;
> };
>
> At every Accumulate, I will calculate the local sums and sums of squares,
> and extend the class members with the chunks of the resulting ChunkedArrays
> (which are Arrays). At the end, I'll create ChunkedArrays from these
> vectors and use E(x^2) - E(x)^2 to calculate the variance. I feel like this
> might reduce the number of extra allocs and deallocs.
>
> On Thu, Mar 10, 2022 at 9:47 PM Aldrin <[email protected]> wrote:
>
>> You're correct with the first clarification. I am not (currently) slicing
>> column-wise.
>>
>> And yes, I am calculating variance, mean, etc. so that I can calculate
>> the t-statistic.
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>>
>> On Thu, Mar 10, 2022 at 5:16 PM Niranda Perera <[email protected]>
>> wrote:
>>
>>> Or are you slicing column-wise?
>>>
>>> On Thu, Mar 10, 2022 at 8:14 PM Niranda Perera <[email protected]>
>>> wrote:
>>>
>>>> From the looks of it, you are trying to calculate variance, mean, etc.
>>>> over rows, aren't you?
>>>>
>>>> I need to clarify this statement a bit:
>>>> "Where "by slice" is total time, summed from running the function on
>>>> each slice and "by table" is the time of just running the function on the
>>>> table concatenated from each slice."
>>>> So, I assume you are originally using a `vector<shared_ptr<Table>>
>>>> slices`. For the former case, you are passing each slice to
>>>> `MeanAggr::Accumulate`, and for the latter case, you are calling
>>>> arrow::ConcatenateTables(slices) and passing the result as a single table?
>>>>
>>>> On Thu, Mar 10, 2022 at 7:41 PM Aldrin <[email protected]> wrote:
>>>>
>>>>> Oh, but the short answer is that I'm using: Add, Subtract, Divide,
>>>>> Multiply, Power, and Absolute. Sometimes with both inputs being
>>>>> ChunkedArrays, sometimes with 1 input being a ChunkedArray and the other
>>>>> being a scalar.
>>>>>
>>>>> Aldrin Montana
>>>>> Computer Science PhD Student
>>>>> UC Santa Cruz
>>>>>
>>>>>
>>>>> On Thu, Mar 10, 2022 at 4:38 PM Aldrin <[email protected]> wrote:
>>>>>
>>>>>> Hi Niranda!
>>>>>>
>>>>>> Sure thing, I've linked to my code. [1] is essentially the function
>>>>>> being called, and [2] is an example of a wrapper function (more in that
>>>>>> file) I wrote to reduce boilerplate (to make [1] more readable). But, now
>>>>>> that I look at [2] again, which I wrote before I really knew much about
>>>>>> smart pointers, I wonder if some of what I benchmarked is overhead from
>>>>>> misusing C++ structures?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> [1]:
>>>>>> https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
>>>>>> [2]:
>>>>>> https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18
>>>>>>
>>>>>> Aldrin Montana
>>>>>> Computer Science PhD Student
>>>>>> UC Santa Cruz
>>>>>>
>>>>>>
>>>>>> On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Aldrin,
>>>>>>>
>>>>>>> It would be helpful to know what sort of compute operators you are
>>>>>>> using.
>>>>>>>
>>>>>>> On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:
>>>>>>>
>>>>>>>> I will work on a reproducible example.
>>>>>>>>
>>>>>>>> As a sneak peek, what I was seeing was the following (pasted in
>>>>>>>> gmail, see [1] for markdown version):
>>>>>>>>
>>>>>>>> Table ID       Columns  Rows   Rows (slice)  Slice count  Time (ms, total; by slice)  Time (ms, total; by table)
>>>>>>>> E-GEOD-100618  415      20631  299           69           644.065                     410
>>>>>>>> E-GEOD-76312   2152     27120  48            565          25607.927                   2953
>>>>>>>> E-GEOD-106540  2145     24480  45            544          25193.507                   3088
>>>>>>>>
>>>>>>>> Where "by slice" is the total time, summed from running the function
>>>>>>>> on each slice, and "by table" is the time from running the function
>>>>>>>> once on the table concatenated from each slice.
>>>>>>>>
>>>>>>>> The difference was large (but not *so* large) for ~70 slices (1.5x),
>>>>>>>> but for ~550 slices (and 6x fewer rows per slice, 5x more columns)
>>>>>>>> the difference became significant (~10x).
>>>>>>>>
>>>>>>>> I will follow up here when I have a more reproducible example. I
>>>>>>>> also started doing this before tensors were available, so I'll try to 
>>>>>>>> see
>>>>>>>> how that changes performance.
>>>>>>>>
>>>>>>>>
>>>>>>>> [1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a
>>>>>>>>
>>>>>>>> Aldrin Montana
>>>>>>>> Computer Science PhD Student
>>>>>>>> UC Santa Cruz
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> As far as I know (and my knowledge here may be dated), the compute
>>>>>>>>> kernels themselves do not do any concurrency.  There are certainly
>>>>>>>>> compute kernels that could benefit from concurrency in this manner
>>>>>>>>> (many kernels naively so), and I think things are set up so that, if
>>>>>>>>> we decide to tackle this feature, we could do so in a systematic way
>>>>>>>>> (instead of writing something for each kernel).
>>>>>>>>>
>>>>>>>>> I believe that kernels, if given a unique kernel context, should
>>>>>>>>> be thread safe.
>>>>>>>>>
>>>>>>>>> The streaming compute engine, on the other hand, does support
>>>>>>>>> concurrency.  It is mostly driven by the scanner at the moment (e.g.
>>>>>>>>> each batch we fetch from the scanner gets a fresh thread task for
>>>>>>>>> running through the execution plan), but there is some intra-node
>>>>>>>>> concurrency in the hash join and (I think) the hash aggregate nodes.
>>>>>>>>> This has been sufficient to saturate cores on the benchmarks we run.
>>>>>>>>> I know there is ongoing interest in understanding and improving our
>>>>>>>>> concurrency here.
>>>>>>>>>
>>>>>>>>> The scanner supports concurrency.  It will typically fetch multiple
>>>>>>>>> files at once and, for each file, it will fetch multiple batches at
>>>>>>>>> once (assuming the file has more than one batch).
>>>>>>>>>
>>>>>>>>> > I see a large difference between the total time to apply compute
>>>>>>>>> functions to a single table (concatenated from many small tables) 
>>>>>>>>> compared
>>>>>>>>> to applying compute functions to each sub-table in the composition.
>>>>>>>>>
>>>>>>>>> Which one is better?  Can you share a reproducible example?
>>>>>>>>>
>>>>>>>>> On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Hello!
>>>>>>>>> >
>>>>>>>>> > I'm wondering if there's any documentation that describes the
>>>>>>>>> concurrency/parallelism architecture for the compute API. I'd also be
>>>>>>>>> interested if there are recommended approaches for seeing performance 
>>>>>>>>> of
>>>>>>>>> threads used by Arrow--should I try to check a processor ID and infer
>>>>>>>>> performance or are there particular tools that the community uses?
>>>>>>>>> >
>>>>>>>>> > Specifically, I am wondering if the concurrency is going to be
>>>>>>>>> different when using a ChunkedArray as an input compared to an Array 
>>>>>>>>> or for
>>>>>>>>> ChunkedArrays with various chunk sizes (1 chunk vs tens or hundreds). 
>>>>>>>>> I see
>>>>>>>>> a large difference between the total time to apply compute functions 
>>>>>>>>> to a
>>>>>>>>> single table (concatenated from many small tables) compared to 
>>>>>>>>> applying
>>>>>>>>> compute functions to each sub-table in the composition. I'm trying to
>>>>>>>>> figure out where that difference may come from and I'm wondering if 
>>>>>>>>> it's
>>>>>>>>> related to parallelism within Arrow.
>>>>>>>>> >
>>>>>>>>> > I tried using the github issues and JIRA issues (e.g.  [1]) as a
>>>>>>>>> way to sleuth the info, but I couldn't find anything. The pyarrow API 
>>>>>>>>> seems
>>>>>>>>> to have functions I could try and use to figure it out (cpu_count and
>>>>>>>>> set_cpu_count), but that seems like a vague road.
>>>>>>>>> >
>>>>>>>>> > [1]: https://issues.apache.org/jira/browse/ARROW-12726
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Thank you!
>>>>>>>>> >
>>>>>>>>> > Aldrin Montana
>>>>>>>>> > Computer Science PhD Student
>>>>>>>>> > UC Santa Cruz
>>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> Niranda Perera
>>>> https://niranda.dev/
>>>> @n1r44 <https://twitter.com/N1R44>
>>>>
>>>>
>>>
>>> --
>>> Niranda Perera
>>> https://niranda.dev/
>>> @n1r44 <https://twitter.com/N1R44>
>>>
>>>
>
> --
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>
>
