Sorry, I think I missed the smart pointers in my response. It *should* be
smart pointers; otherwise you'll lose the allocations when the Arrays go out
of scope. It should have been:
class MeanAggr {
  int64_t count_;
  std::vector<std::shared_ptr<arrow::Array>> sums_;
  std::vector<std::shared_ptr<arrow::Array>> sum_squares_;
};
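
To make that concrete, here is a rough, untested sketch of how Accumulate and a
final variance step could look with those members. The Accumulate/Variance
names, the row-oriented layout (a row is a sample, columns are its values), and
double-typed columns are my assumptions for illustration, not code from
Aldrin's repo; error handling is minimal:

#include <memory>
#include <vector>

#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

class MeanAggr {
 public:
  // Element-wise sum and sum-of-squares across the slice's columns, giving
  // one running value per row (per sample). The per-slice results are
  // appended to the member vectors, so nothing previously accumulated is
  // reallocated.
  arrow::Status Accumulate(const std::shared_ptr<arrow::Table>& slice) {
    arrow::Datum sum = slice->column(0);
    arrow::Datum sum_sq;
    ARROW_ASSIGN_OR_RAISE(sum_sq, cp::Multiply(slice->column(0), slice->column(0)));

    for (int i = 1; i < slice->num_columns(); ++i) {
      const auto& col = slice->column(i);
      ARROW_ASSIGN_OR_RAISE(sum, cp::Add(sum, col));
      ARROW_ASSIGN_OR_RAISE(auto sq, cp::Multiply(col, col));
      ARROW_ASSIGN_OR_RAISE(sum_sq, cp::Add(sum_sq, sq));
    }

    // Columns are the per-sample values, so the count is the same for every
    // row-slice (an assumption of this sketch).
    count_ = slice->num_columns();

    // Extend the members with the resulting chunks.
    for (const auto& chunk : sum.chunked_array()->chunks())    sums_.push_back(chunk);
    for (const auto& chunk : sum_sq.chunked_array()->chunks()) sum_squares_.push_back(chunk);
    return arrow::Status::OK();
  }

  // Population variance via Var(X) = E[X^2] - E[X]^2, element-wise over all
  // accumulated rows. (Sample variance would need an n/(n-1) correction.)
  arrow::Result<arrow::Datum> Variance() const {
    auto sums    = std::make_shared<arrow::ChunkedArray>(sums_);
    auto sum_sqs = std::make_shared<arrow::ChunkedArray>(sum_squares_);
    arrow::Datum n{static_cast<double>(count_)};

    ARROW_ASSIGN_OR_RAISE(auto mean,      cp::Divide(sums, n));
    ARROW_ASSIGN_OR_RAISE(auto mean_sq,   cp::Divide(sum_sqs, n));
    ARROW_ASSIGN_OR_RAISE(auto mean_mean, cp::Multiply(mean, mean));
    return cp::Subtract(mean_sq, mean_mean);
  }

 private:
  int64_t count_ = 0;
  std::vector<std::shared_ptr<arrow::Array>> sums_;
  std::vector<std::shared_ptr<arrow::Array>> sum_squares_;
};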

On Fri, Mar 11, 2022 at 3:16 AM Aldrin <[email protected]> wrote:

> Actually, I think I understand now; I misread "extending the class members". But I think the point got across--if I know my table has a single chunk, then I can do the operations on the arrays and then I can wrap the result in a ChunkedArray or Table. For each slice, I can just maintain the results in a vector without smart pointers.
>
> I'll definitely try this. Thanks!
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
> On Thu, Mar 10, 2022 at 11:35 PM Aldrin <[email protected]> wrote:
>
>> I think there's one minor misunderstanding, but I like the essence of the feedback.
>>
>> To clarify, the MeanAggr::Accumulate function is used to gather over points of a sample, where a row is considered a sample and columns are the corresponding values, e.g.:
>>
>> columns (values)  | c0 | c1 | c2 | c3  | c4
>> row 0 (sample 0)  | 1  | 2  | 3  | 4   | 5
>> row 1 (sample 1)  | 1  | 4  | 27 | 256 | 3125
>>
>> For this tiny example, applying Accumulate "by slice" means that I apply it once on row 0, then again on row 1, and I add the times together. "By table" means that I concatenate row 0 and row 1, then apply Accumulate on the resulting table. Combine isn't currently being considered (it's for when I split on columns). You can sort of see this in [1], but it also illustrates sequential calls of Accumulate instead of using Combine. I will explain this more in a reproducible example.
>>
>> Given the clarification, I am not sure if the suggested local calculations are helpful, but maybe you mean I shouldn't use so many shared pointers? Although, I do think I'll try reducing the code path by using Arrays when I'm applying to a Table that I know has only 1 chunk (because I have specified it that way). This seems like it should help isolate some of the overhead.
>>
>> Thanks for the feedback!
>>
>> [1]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/fb688531169421a5b5985d2cbfee100e793cae2f/resources/assets/TStatistic_Diagram.png
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>> On Thu, Mar 10, 2022 at 7:49 PM Niranda Perera <[email protected]> wrote:
>>
>>> Okay, one thing I immediately see is that there are a lot of memory allocations/deallocations happening in the approach you have given, IMO. arrow::compute methods are immutable, so when you get an answer, it would be allocated freshly in memory, and when you update an existing shared_ptr, you would be deallocating the previous buffers. This is happening in both MeanAggr::Combine and MeanAggr::Accumulate, and it could be a reason why the split version is slower. The single-table version only has to go through MeanAggr::Accumulate.
>>>
>>> If I may suggest an alternative approach, I'd do this for the variance calculation:
>>>
>>> class MeanAggr{
>>>   int64_t count_;
>>>   vector<Array> sums_;
>>>   vector<Array> sum_squares_;
>>> }
>>>
>>> At every Accumulate, I will calculate local sums and sum squares, and extend the class members with the resultant ChunkedArray's chunks (which are Arrays). At the end, I'll create some ChunkedArrays from these vectors, and use E(x^2)-E(x)^2 to calculate the variance. I feel like this might reduce the number of extra allocs and deallocs.
>>>
>>> On Thu, Mar 10, 2022 at 9:47 PM Aldrin <[email protected]> wrote:
>>>
>>>> You're correct with the first clarification. I am not (currently) slicing column-wise.
>>>>
>>>> And yes, I am calculating variance, mean, etc. so that I can calculate the t-statistic.
>>>>
>>>> Aldrin Montana
>>>> Computer Science PhD Student
>>>> UC Santa Cruz
>>>>
>>>> On Thu, Mar 10, 2022 at 5:16 PM Niranda Perera <[email protected]> wrote:
>>>>
>>>>> Or are you slicing column-wise?
>>>>>
>>>>> On Thu, Mar 10, 2022 at 8:14 PM Niranda Perera <[email protected]> wrote:
>>>>>
>>>>>> From the looks of it, you are trying to calculate variance, mean, etc. over rows, isn't it?
>>>>>>
>>>>>> I need to clarify a bit on this statement: "Where "by slice" is total time, summed from running the function on each slice and "by table" is the time of just running the function on the table concatenated from each slice." So, I assume you are originally using a `vector<shared_ptr<Table>> slices`. For the former case, you are passing each slice to `MeanAggr::Accumulate`, and for the latter case, you are calling arrow::Concatenate(slices) and passing the result as a single table?
>>>>>>
>>>>>> On Thu, Mar 10, 2022 at 7:41 PM Aldrin <[email protected]> wrote:
>>>>>>
>>>>>>> Oh, but the short answer is that I'm using: Add, Subtract, Divide, Multiply, Power, and Absolute. Sometimes with both inputs being ChunkedArrays, sometimes with one input being a ChunkedArray and the other being a scalar.
>>>>>>>
>>>>>>> Aldrin Montana
>>>>>>> Computer Science PhD Student
>>>>>>> UC Santa Cruz
>>>>>>>
>>>>>>> On Thu, Mar 10, 2022 at 4:38 PM Aldrin <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Niranda!
>>>>>>>>
>>>>>>>> Sure thing, I've linked to my code. [1] is essentially the function being called, and [2] is an example of a wrapper function (more in that file) I wrote to reduce boilerplate (to make [1] more readable). But, now that I look at [2] again, which I wrote before I really knew much about smart pointers, I wonder if some of what I benchmarked is overhead from misusing C++ structures?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> [1]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/statops.cpp#L96
>>>>>>>> [2]: https://gitlab.com/skyhookdm/skytether-singlecell/-/blob/58839eb921c53d17ac32129be6af214ae4b58a13/src/cpp/processing/numops.cpp#L18
>>>>>>>>
>>>>>>>> Aldrin Montana
>>>>>>>> Computer Science PhD Student
>>>>>>>> UC Santa Cruz
>>>>>>>>
>>>>>>>> On Thu, Mar 10, 2022 at 4:30 PM Niranda Perera <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Aldrin,
>>>>>>>>>
>>>>>>>>> It would be helpful to know what sort of compute operators you are using.
>>>>>>>>>
>>>>>>>>> On Thu, Mar 10, 2022, 19:12 Aldrin <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I will work on a reproducible example.
>>>>>>>>>>
>>>>>>>>>> As a sneak peek, what I was seeing was the following (pasted in gmail, see [1] for markdown version):
>>>>>>>>>>
>>>>>>>>>> Table ID      | Columns | Rows  | Rows (slice) | Slice count | Time (ms) total; by slice | Time (ms) total; by table
>>>>>>>>>> E-GEOD-100618 | 415     | 20631 | 299          | 69          | 644.065                   | 410
>>>>>>>>>> E-GEOD-76312  | 2152    | 27120 | 48           | 565         | 25607.927                 | 2953
>>>>>>>>>> E-GEOD-106540 | 2145    | 24480 | 45           | 544         | 25193.507                 | 3088
>>>>>>>>>>
>>>>>>>>>> Where "by slice" is total time, summed from running the function on each slice, and "by table" is the time of just running the function on the table concatenated from each slice.
>>>>>>>>>>
>>>>>>>>>> The difference was large (but not *so* large) for ~70 iterations (1.5x); but for ~550 iterations (and 6x fewer rows, 5x more columns) the difference became significant (~10x).
>>>>>>>>>>
>>>>>>>>>> I will follow up here when I have a more reproducible example. I also started doing this before tensors were available, so I'll try to see how that changes performance.
>>>>>>>>>>
>>>>>>>>>> [1]: https://gist.github.com/drin/4b2e2ea97a07c9ad54647bcdc462611a
>>>>>>>>>>
>>>>>>>>>> Aldrin Montana
>>>>>>>>>> Computer Science PhD Student
>>>>>>>>>> UC Santa Cruz
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 10, 2022 at 2:32 PM Weston Pace <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> As far as I know (and my knowledge here may be dated), the compute kernels themselves do not do any concurrency. There are certainly compute kernels that could benefit from concurrency in this manner (many kernels naively so), and I think things are set up so that, if we decide to tackle this feature, we could do so in a systematic way (instead of writing something for each kernel).
>>>>>>>>>>>
>>>>>>>>>>> I believe that kernels, if given a unique kernel context, should be thread safe.
>>>>>>>>>>>
>>>>>>>>>>> The streaming compute engine, on the other hand, does support concurrency. It is mostly driven by the scanner at the moment (e.g. each batch we fetch from the scanner gets a fresh thread task for running through the execution plan), but there is some intra-node concurrency in the hash join and (I think) the hash aggregate nodes. This has been sufficient to saturate cores on the benchmarks we run. I know there is ongoing interest in understanding and improving our concurrency here.
>>>>>>>>>>>
>>>>>>>>>>> The scanner supports concurrency. It will typically fetch multiple files at once and, for each file, it will fetch multiple batches at once (assuming the file has more than one batch).
>>>>>>>>>>>
>>>>>>>>>>> > I see a large difference between the total time to apply compute functions to a single table (concatenated from many small tables) compared to applying compute functions to each sub-table in the composition.
>>>>>>>>>>>
>>>>>>>>>>> Which one is better? Can you share a reproducible example?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 10, 2022 at 12:01 PM Aldrin <[email protected]> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hello!
>>>>>>>>>>> >
>>>>>>>>>>> > I'm wondering if there's any documentation that describes the concurrency/parallelism architecture for the compute API. I'd also be interested if there are recommended approaches for seeing performance of threads used by Arrow--should I try to check a processor ID and infer performance, or are there particular tools that the community uses?
>>>>>>>>>>> >
>>>>>>>>>>> > Specifically, I am wondering if the concurrency is going to be different when using a ChunkedArray as an input compared to an Array, or for ChunkedArrays with various chunk sizes (1 chunk vs tens or hundreds). I see a large difference between the total time to apply compute functions to a single table (concatenated from many small tables) compared to applying compute functions to each sub-table in the composition. I'm trying to figure out where that difference may come from and I'm wondering if it's related to parallelism within Arrow.
>>>>>>>>>>> >
>>>>>>>>>>> > I tried using the github issues and JIRA issues (e.g. [1]) as a way to sleuth the info, but I couldn't find anything. The pyarrow API seems to have functions I could try and use to figure it out (cpu_count and set_cpu_count), but that seems like a vague road.
>>>>>>>>>>> >
>>>>>>>>>>> > [1]: https://issues.apache.org/jira/browse/ARROW-12726
>>>>>>>>>>> >
>>>>>>>>>>> > Thank you!
>>>>>>>>>>> >
>>>>>>>>>>> > Aldrin Montana
>>>>>>>>>>> > Computer Science PhD Student
>>>>>>>>>>> > UC Santa Cruz
>>>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Niranda Perera
>>>>>> https://niranda.dev/
>>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>>
>>>>>
>>>>> --
>>>>> Niranda Perera
>>>>> https://niranda.dev/
>>>>> @n1r44 <https://twitter.com/N1R44>
>>>>>
>>>
>>> --
>>> Niranda Perera
>>> https://niranda.dev/
>>> @n1r44 <https://twitter.com/N1R44>
>>>

--
Niranda Perera
https://niranda.dev/
@n1r44 <https://twitter.com/N1R44>
