Ooh, [2] was very interesting to read.
I am also adding parameters to combine a specified number of chunks
together (also uses combineChunks) and invoke the function on a specified
limit of columns. If desirable, I can add some of these descriptions and
such to the JIRA and maybe I can use it as
Thank you for providing such an interesting (and enlightening)
reproduction. In both cases you are calling the compute functions exactly
the same. For example, you are adding two chunked arrays in both cases.
The key difference appears to be that you are either making one call to
`Add` with a
Sorry it took a while, but here is a repository I put together that should
reproducibly illustrate what I am seeing, and what I'd like to understand
better (if not improve)[1].
The linked source code [2] shows 2 places where I am collecting times
(using std::chrono::steady_clock in C++). The
Sorry, I think I've missed the smart pointers in my response. It *should* be
smart pointers, otherwise you'll lose the allocation when you go out of
context. It should have been,
class MeanAggr {
  int64_t count_;
  std::vector<std::shared_ptr<arrow::ChunkedArray>> sums_;
  std::vector<std::shared_ptr<arrow::ChunkedArray>> sum_squares_;
};
On Fri, Mar 11, 2022 at 3:16 AM Aldrin wrote:
Actually, I think I understand now; I misread "extending the class
members". But I think the point got across--if I know my table has a single
chunk, then I can do the operations on the arrays and then I can wrap the
result in a ChunkedArray or Table. For each slice, I can just maintain the
I think there's one minor misunderstanding, but I like the essence of the
feedback.
To clarify, the MeanAggr::Accumulate function is used to gather over points
of a sample, where a row is considered a sample, and columns are
corresponding values, e.g.:
columns (values) | c0 | c1 | c2 | c3
Okay, one thing I immediately see is that there are a lot of memory
allocations/deallocations happening in the approach you have given, IMO.
arrow::compute methods are immutable, so when you get an answer, it would
be allocated freshly in memory, and when you update an existing shared_ptr,
you
You're correct with the first clarification. I am not (currently) slicing
column-wise.
And yes, I am calculating variance, mean, etc. so that I can calculate the
t-statistic.
Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
On Thu, Mar 10, 2022 at 5:16 PM Niranda Perera
wrote:
Or are you slicing column-wise?
On Thu, Mar 10, 2022 at 8:14 PM Niranda Perera
wrote:
> From the looks of it, you are trying to calculate variance, mean, etc over
> rows, isn't it?
>
> I need to clarify a bit on this statement.
> "Where "by slice" is total time, summed from running the function
From the looks of it, you are trying to calculate variance, mean, etc. over
rows, isn't it?
I need to clarify a bit on this statement.
"Where "by slice" is total time, summed from running the function on each
slice and "by table" is the time of just running the function on the table
concatenated
Oh, but the short answer is that I'm using: Add, Subtract, Divide,
Multiply, Power, and Absolute. Sometimes with both inputs being
ChunkedArrays, sometimes with 1 input being a ChunkedArray and the other
being a scalar.
On Thu, Mar 10,
Hi Niranda!
Sure thing, I've linked to my code. [1] is essentially the function being
called, and [2] is an example of a wrapper function (more in that file) I
wrote to reduce boilerplate (to make [1] more readable). But, now that I
look at [2] again, which I wrote before I really knew much about
Hi Aldrin,
It would be helpful to know what sort of compute operators you are using.
On Thu, Mar 10, 2022, 19:12 Aldrin wrote:
> I will work on a reproducible example.
>
> As a sneak peek, what I was seeing was the following (pasted in gmail, see
> [1] for markdown version):
>
> Table ID
I will work on a reproducible example.
As a sneak peek, what I was seeing was the following (pasted in gmail, see
[1] for markdown version):
Table ID       Columns  Rows   Rows (slice)  Slice count  Time (ms) total; by slice  Time (ms) total; by table
E-GEOD-100618  415      20631  299           69           644.065                    410
E-GEOD-76312
As far as I know (and my knowledge here may be dated) the compute
kernels themselves do not do any concurrency. There are certainly
compute kernels that could benefit from concurrency in this manner
(many kernels naively so) and I think things are set up so that, if we
decide to tackle this
Hello!
I'm wondering if there's any documentation that describes the
concurrency/parallelism architecture for the compute API. I'd also be
interested if there are recommended approaches for seeing performance of
threads used by Arrow--should I try to check a processor ID and infer
performance or