Re: Documentation of concurrency of the compute API?

2022-03-23 Thread Aldrin
Ooh, [2] was very interesting to read. I am also adding parameters to combine a specified number of chunks together (also uses combineChunks) and invoke the function on a specified limit of columns. If desirable, I can add some of these descriptions and such to the JIRA and maybe I can use it as
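
A hedged sketch of the combine-chunks idea mentioned above: concatenating a ChunkedArray's chunks into one contiguous Array before invoking a compute function, so the kernel sees one large input instead of many small ones. This uses arrow::Concatenate as an approximation; it is not necessarily the combineChunks helper the message refers to, and the function name AddCombined is illustrative.

// Sketch: combine chunks first, then make one call to the "add" kernel.
#include <memory>
#include <arrow/api.h>
#include <arrow/array/concatenate.h>
#include <arrow/compute/api.h>

arrow::Result<arrow::Datum>
AddCombined(const std::shared_ptr<arrow::ChunkedArray>& lhs,
            const std::shared_ptr<arrow::ChunkedArray>& rhs) {
  // Concatenate each input's chunks into a single contiguous Array.
  ARROW_ASSIGN_OR_RAISE(auto lhs_flat, arrow::Concatenate(lhs->chunks()));
  ARROW_ASSIGN_OR_RAISE(auto rhs_flat, arrow::Concatenate(rhs->chunks()));

  // One kernel invocation over the combined data.
  return arrow::compute::CallFunction("add", {lhs_flat, rhs_flat});
}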

Re: Documentation of concurrency of the compute API?

2022-03-23 Thread Weston Pace
Thank you for providing such an interesting (and enlightening) reproduction. In both cases you are calling the compute functions in exactly the same way. For example, you are adding two chunked arrays in both cases. The key difference appears to be that you are either making 1 call to `Add` with a
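
For illustration, a minimal sketch of the two call patterns being contrasted above: a single `Add` over the full ChunkedArrays versus one `Add` per row-slice. The slicing scheme, rows_per_slice parameter, and function names are assumptions, not Aldrin's exact code.

#include <cstdint>
#include <memory>
#include <vector>
#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// "By table": one kernel invocation sees all of the data.
arrow::Result<arrow::Datum>
AddByTable(const std::shared_ptr<arrow::ChunkedArray>& lhs,
           const std::shared_ptr<arrow::ChunkedArray>& rhs) {
  return cp::CallFunction("add", {lhs, rhs});
}

// "By slice": many kernel invocations, each over a small row range.
arrow::Result<std::vector<arrow::Datum>>
AddBySlice(const std::shared_ptr<arrow::ChunkedArray>& lhs,
           const std::shared_ptr<arrow::ChunkedArray>& rhs,
           int64_t rows_per_slice) {
  std::vector<arrow::Datum> results;
  for (int64_t offset = 0; offset < lhs->length(); offset += rows_per_slice) {
    // Slice is zero-copy; the kernel still runs once per slice.
    auto lhs_slice = lhs->Slice(offset, rows_per_slice);
    auto rhs_slice = rhs->Slice(offset, rows_per_slice);
    ARROW_ASSIGN_OR_RAISE(auto sum,
                          cp::CallFunction("add", {lhs_slice, rhs_slice}));
    results.push_back(std::move(sum));
  }
  return results;
}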

Re: Documentation of concurrency of the compute API?

2022-03-21 Thread Aldrin
Sorry it took a while, but here is a repository I put together that should reproducibly illustrate what I am seeing, and what I'd like to understand better (if not improve) [1]. The linked source code [2] shows 2 places where I am collecting times (using std::chrono::steady_clock in C++). The
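
A minimal sketch of the kind of timing described above, wrapping a compute call with std::chrono::steady_clock. The "add" call and the TimedAdd name are placeholders for whatever is being measured in the linked repository.

#include <chrono>
#include <iostream>
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Status TimedAdd(const arrow::Datum& lhs, const arrow::Datum& rhs) {
  auto t_start = std::chrono::steady_clock::now();
  ARROW_ASSIGN_OR_RAISE(auto result,
                        arrow::compute::CallFunction("add", {lhs, rhs}));
  auto t_stop = std::chrono::steady_clock::now();

  auto elapsed_ms =
      std::chrono::duration_cast<std::chrono::milliseconds>(t_stop - t_start);
  std::cout << "add over " << result.length() << " rows took "
            << elapsed_ms.count() << " ms" << std::endl;
  return arrow::Status::OK();
}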

Re: Documentation of concurrency of the compute API?

2022-03-11 Thread Niranda Perera
Sorry, I think I've missed the smart pointers in my response. It *should* be smart pointers, otherwise you'll lose the allocation when you go out of context. It should have been: class MeanAggr { int64_t count_; vector<shared_ptr<...>> sums_; vector<shared_ptr<...>> sum_squares_; } On Fri, Mar 11, 2022 at 3:16 AM Aldrin wrote:
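
A hedged reconstruction of the class shape sketched above. The template parameters were stripped by the archive, so holding one shared_ptr<arrow::ChunkedArray> per column is an assumption about the element type, not necessarily Niranda's exact suggestion.

#include <cstdint>
#include <memory>
#include <vector>
#include <arrow/api.h>

class MeanAggr {
  // Number of rows (samples) accumulated so far.
  int64_t count_;

  // One running sum and one running sum-of-squares per column. Smart
  // pointers keep the compute results alive after the accumulate call
  // returns, which is the point being made above.
  std::vector<std::shared_ptr<arrow::ChunkedArray>> sums_;
  std::vector<std::shared_ptr<arrow::ChunkedArray>> sum_squares_;
};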

Re: Documentation of concurrency of the compute API?

2022-03-11 Thread Aldrin
Actually, I think I understand now; I misread "extending the class members". But I think the point got across--if I know my table has a single chunk, then I can do the operations on the arrays and then I can wrap the result in a ChunkedArray or Table. For each slice, I can just maintain the
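
A minimal sketch of the idea above, assuming a single-chunk ChunkedArray: run the kernel on the underlying Array, then wrap the result back into a ChunkedArray. The function name and the use of the "abs" kernel are illustrative.

#include <memory>
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>>
AbsoluteSingleChunk(const std::shared_ptr<arrow::ChunkedArray>& column) {
  // Assumes the caller has already verified there is exactly one chunk.
  std::shared_ptr<arrow::Array> values = column->chunk(0);

  // Operate directly on the Array rather than the ChunkedArray.
  ARROW_ASSIGN_OR_RAISE(auto result,
                        arrow::compute::CallFunction("abs", {values}));

  // Wrap the resulting Array back into a ChunkedArray for the caller.
  return std::make_shared<arrow::ChunkedArray>(
      arrow::ArrayVector{result.make_array()});
}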

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Aldrin
I think there's one minor misunderstanding, but I like the essence of the feedback. To clarify, the MeanAggr::Accumulate function is used to gather over points of a sample, where a row is considered a sample, and columns are corresponding values, e.g.: columns (values) | c0 | c1 | c2 | c3

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Niranda Perera
Okay, one thing I immediately see is that there are a lot of memory allocations/deallocations happening in the approach you have given, IMO. arrow::compute methods are immutable, so when you get an answer, it will be freshly allocated in memory, and when you update an existing shared_ptr, you
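
A small sketch illustrating the allocation pattern being pointed out: each compute call writes into newly allocated buffers, and reassigning a shared_ptr to the new result releases the previous one, so a loop pays an allocation and a deallocation per iteration. The function and variable names are illustrative, not the code under discussion.

#include <memory>
#include <vector>
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Status AccumulateSums(
    const std::vector<std::shared_ptr<arrow::ChunkedArray>>& batches,
    std::shared_ptr<arrow::ChunkedArray>* running_sum) {
  // Assumes *running_sum has already been initialized by the caller.
  for (const auto& batch : batches) {
    // "add" never mutates its inputs; the result lives in fresh buffers.
    ARROW_ASSIGN_OR_RAISE(
        auto summed,
        arrow::compute::CallFunction("add", {*running_sum, batch}));

    // Reassigning drops the reference to the previous result, so its
    // buffers are freed: one allocation + one deallocation per iteration.
    *running_sum = summed.chunked_array();
  }
  return arrow::Status::OK();
}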

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Aldrin
You're correct with the first clarification. I am not (currently) slicing column-wise. And yes, I am calculating variance, mean, etc. so that I can calculate the t-statistic. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Thu, Mar 10, 2022 at 5:16 PM Niranda Perera wrote: > Or

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Niranda Perera
Or are you slicing column-wise? On Thu, Mar 10, 2022 at 8:14 PM Niranda Perera wrote: > From the looks of it, you are trying to calculate variance, mean, etc over > rows, isn't it? > > I need to clarify a bit on this statement. > "Where "by slice" is total time, summed from running the function

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Niranda Perera
From the looks of it, you are trying to calculate variance, mean, etc. over rows, isn't it? I need to clarify a bit on this statement. "Where "by slice" is total time, summed from running the function on each slice and "by table" is the time of just running the function on the table concatenated

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Aldrin
Oh, but the short answer is that I'm using: Add, Subtract, Divide, Multiply, Power, and Absolute. Sometimes with both inputs being ChunkedArrays, sometimes with 1 input being a ChunkedArray and the other being a scalar. Aldrin Montana Computer Science PhD Student UC Santa Cruz On Thu, Mar 10,
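
For reference, a sketch of the mixed ChunkedArray/scalar call pattern described above, using the "power" kernel as one example of the listed functions. The function name SquareColumn and the exponent value are illustrative assumptions.

#include <memory>
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Result<arrow::Datum>
SquareColumn(const std::shared_ptr<arrow::ChunkedArray>& column) {
  // One input is a ChunkedArray, the other a Scalar; both become Datums.
  arrow::Datum exponent{std::make_shared<arrow::DoubleScalar>(2.0)};
  return arrow::compute::CallFunction("power", {column, exponent});
}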

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Aldrin
Hi Niranda! Sure thing, I've linked to my code. [1] is essentially the function being called, and [2] is an example of a wrapper function (more in that file) I wrote to reduce boilerplate (to make [1] more readable). But, now that I look at [2] again, which I wrote before I really knew much about

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Niranda Perera
Hi Aldrin, It would be helpful to know what sort of compute operators you are using. On Thu, Mar 10, 2022, 19:12 Aldrin wrote: > I will work on a reproducible example. > > As a sneak peek, what I was seeing was the following (pasted in gmail, see > [1] for markdown version): > > Table ID

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Aldrin
I will work on a reproducible example. As a sneak peek, what I was seeing was the following (pasted in gmail, see [1] for markdown version):

Table ID      | Columns | Rows  | Rows (slice) | Slice count | Time (ms) total; by slice | Time (ms) total; by table
E-GEOD-100618 | 415     | 20631 | 299          | 69          | 644.065                   | 410
E-GEOD-76312

Re: Documentation of concurrency of the compute API?

2022-03-10 Thread Weston Pace
As far as I know (and my knowledge here may be dated) the compute kernels themselves do not do any concurrency. There are certainly compute kernels that could benefit from concurrency in this manner (many kernels naively so) and I think things are set up so that, if we decide to tackle this
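
Given that the kernels themselves don't parallelize (per the message above), one hedged way for a caller to get concurrency today is to dispatch per-chunk compute calls onto threads themselves. The sketch below uses std::async purely for illustration; it is not Arrow's internal mechanism, and the function name is an assumption.

#include <future>
#include <memory>
#include <vector>
#include <arrow/api.h>
#include <arrow/compute/api.h>

arrow::Result<std::vector<arrow::Datum>>
AbsolutePerChunkParallel(const std::shared_ptr<arrow::ChunkedArray>& column) {
  // Launch one "abs" call per chunk on its own task.
  std::vector<std::future<arrow::Result<arrow::Datum>>> futures;
  for (const auto& chunk : column->chunks()) {
    futures.push_back(std::async(std::launch::async, [chunk]() {
      return arrow::compute::CallFunction("abs", {chunk});
    }));
  }

  // Collect the per-chunk results, propagating the first error seen.
  std::vector<arrow::Datum> results;
  for (auto& fut : futures) {
    ARROW_ASSIGN_OR_RAISE(auto result, fut.get());
    results.push_back(std::move(result));
  }
  return results;
}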

Documentation of concurrency of the compute API?

2022-03-10 Thread Aldrin
Hello! I'm wondering if there's any documentation that describes the concurrency/parallelism architecture for the compute API. I'd also be interested to know if there are recommended approaches for observing the performance of the threads used by Arrow--should I try to check a processor ID and infer performance, or
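
Not an answer to the documentation question, but a small sketch of knobs that do exist: Arrow exposes a global CPU thread pool whose capacity can be queried and changed, which can be useful when trying to attribute timings to threading. Whether any given compute call actually uses that pool is a separate question (see Weston's reply above).

#include <iostream>
#include <arrow/api.h>
#include <arrow/util/thread_pool.h>

int main() {
  // Number of threads Arrow's global CPU thread pool may use.
  std::cout << "CPU thread pool capacity: "
            << arrow::GetCpuThreadPoolCapacity() << std::endl;

  // Lower (or raise) the capacity to observe the effect on timings.
  auto status = arrow::SetCpuThreadPoolCapacity(2);
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;
    return 1;
  }
  std::cout << "New capacity: " << arrow::GetCpuThreadPoolCapacity()
            << std::endl;
  return 0;
}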