How many rows match your timestamp criteria? In other words, how many rows
are you applying the function to?  If there is an earlier exact match
filter on a timestamp that only matches one (or a few) rows, then are you
sure the expression evaluation (and not the filtering) is the costly spot?
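
As a quick way to check, here is a minimal sketch that splits the two phases
and times them separately.  It is only an illustration: the "mjd" column
name, target_mjd, ra_target, dec_target, and threshold are placeholders for
your actual values, and pc_angular_separation is the function from your
message, applied eagerly to the column arrays rather than to field
expressions.

```
import time
import pyarrow.compute as pc

# Placeholders: substitute the real timestamp column and target values.
t0 = time.perf_counter()
subset = table.filter(pc.equal(table["mjd"], target_mjd))
t1 = time.perf_counter()

sep = pc_angular_separation(ra_target, dec_target,
                            subset["RA_deg"], subset["Dec_deg"])
hits = subset.filter(pc.less(sep, threshold))
t2 = time.perf_counter()

print(f"{subset.num_rows} rows survive the timestamp filter")
print(f"timestamp filter: {t1 - t0:.4f}s, separation filter: {t2 - t1:.4f}s")
```

If nearly all of the time is in the first step then the expression itself is
not the bottleneck.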

> But I expected that Acero would need to only visit columnar values once

I'm not sure what this means.  There has been very little work on
optimizing expression evaluation (what optimization work there has been has
focused on the individual kernels themselves).  Acero will not
"fuse" the kernels and does no expression optimization.  If your expression
has 20 compute calls in it then Acero will make 20 passes over the data.
In fact, these passes aren't even in place (e.g. 20 intermediate arrays will
be allocated).
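
If you want to see that concretely, here is a small illustration (byte counts
are approximate; validity bitmaps and allocator behavior will shift the exact
numbers):

```
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(range(1_000_000), type=pa.float64())  # ~8 MB of float64 data

before = pa.total_allocated_bytes()
tmp = pc.multiply(arr, arr)   # materializes a new ~8 MB intermediate array
out = pc.add(tmp, arr)        # materializes another ~8 MB array
after = pa.total_allocated_bytes()

print(after - before)  # roughly two extra arrays' worth of memory
```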

> Should I instead think of Acero as mainly about working on very large
> datasets?

Yes.  At the moment I would expect that pyarrow compute kernels are more or
less as fast as the numpy variants (there are some exceptions: string
functions tend to be faster, some are slower, and no one has done an
exhaustive survey).  Running these through Acero adds some overhead but gives
you the ability to run on larger-than-memory data.  There is potential for
Acero to implement some clever tricks (like those I described earlier)
which might make it faster (instead of adding overhead).  However, I do not
know if anyone is working on these.
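
For the larger-than-memory case the usual entry point is a dataset-backed
scan.  A minimal sketch, assuming a recent pyarrow where scanner filters can
be arbitrary compute expressions; the "observations/" path, the "mjd" column,
and the target/threshold values are placeholders, and pc_angular_separation
is the function from your message:

```
import pyarrow.compute as pc
import pyarrow.dataset as ds

# Placeholder path to a parquet dataset on disk.
dataset = ds.dataset("observations/", format="parquet")

sep = pc_angular_separation(ra_target, dec_target,
                            pc.field("RA_deg"), pc.field("Dec_deg"))
predicate = (pc.field("mjd") == target_mjd) & (sep < threshold)

# The scanner streams record batches through Acero, so the whole dataset
# never has to be resident in memory at once.
for batch in dataset.scanner(filter=predicate).to_batches():
    print(batch.num_rows)  # process each filtered batch here
```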

On Mon, Aug 21, 2023 at 6:35 PM Chak-Pong Chung <chakpongch...@gmail.com>
wrote:

> Could you provide a script with which people can reproduce the problem for
> the performance comparison? That way we can take a closer look.
>
> On Mon, Aug 21, 2023 at 8:42 PM Spencer Nelson <swnel...@uw.edu> wrote:
>
>> I'd like some help calibrating my expectations regarding acero
>> performance. I'm finding that some pretty naive numpy is about 10x faster
>> than acero for my use case.
>>
>> I'm working with a table with 13,000,000 values. The values are angular
>> positions on the sky and times. I'd like to filter to a specific one of the
>> times, and to values within a calculated great-circle distance on the sky.
>>
>> I've implemented the Vincenty formula (
>> https://en.wikipedia.org/wiki/Great-circle_distance) for this:
>>
>> ```
>> import pyarrow.compute as pc
>>
>> def pc_angular_separation(lon1, lat1, lon2, lat2):
>>     sdlon = pc.sin(pc.subtract(lon2, lon1))
>>     cdlon = pc.cos(pc.subtract(lon2, lon1))
>>     slat1 = pc.sin(lat1)
>>     slat2 = pc.sin(lat2)
>>     clat1 = pc.cos(lat1)
>>     clat2 = pc.cos(lat2)
>>
>>     num1 = pc.multiply(clat2, sdlon)
>>     num2 = pc.subtract(pc.multiply(slat2, clat1),
>>                        pc.multiply(pc.multiply(clat2, slat1), cdlon))
>>     denominator = pc.add(pc.multiply(slat2, slat1),
>>                          pc.multiply(pc.multiply(clat2, clat1), cdlon))
>>     hypot = pc.sqrt(pc.add(pc.multiply(num1, num1),
>>                            pc.multiply(num2, num2)))
>>     return pc.atan2(hypot, denominator)
>> ```
>>
>> The resulting pyarrow.compute.Expression is fairly monstrous:
>>
>> <pyarrow.compute.Expression
>> atan2(sqrt(add(multiply(multiply(cos(Dec_deg), sin(subtract(RA_deg,
>> 168.9776949652776))), multiply(cos(Dec_deg), sin(subtract(RA_deg,
>> 168.9776949652776)))), multiply(subtract(multiply(sin(Dec_deg),
>> -0.9304510671785976), multiply(multiply(cos(Dec_deg), 0.3664161726591893),
>> cos(subtract(RA_deg, 168.9776949652776)))), subtract(multiply(sin(Dec_deg),
>> -0.9304510671785976), multiply(multiply(cos(Dec_deg), 0.3664161726591893),
>> cos(subtract(RA_deg, 168.9776949652776))))))), add(multiply(sin(Dec_deg),
>> 0.3664161726591893), multiply(multiply(cos(Dec_deg), -0.9304510671785976),
>> cos(subtract(RA_deg, 168.9776949652776)))))>
>>
>> Then my Acero graph is very simple. Just a table source node, then a
>> filter node on the timestamp (for exact match), and then another filter
>> node for a computed value of that expression under a threshold.
>>
>> For 13 million observations, this takes about 15ms on my laptop using
>> Acero.
>>
>> But the same computation done with totally naive numpy is about 3ms.
>>
>> The numpy version has no fanciness, just calling numpy trigonometric
>> functions and materializing all the intermediate results like you might
>> imagine, then eventually coming up with a boolean mask over everything and
>> calling `table.filter(mask)`.
>>
>> So finally, my question: is this about what I should expect? I know Acero
>> has an advantage that it *would* work if my data were larger than fits
>> in memory, which is not true of my numpy approach. But I expected that
>> Acero would need to only visit columnar values once, so it should be able
>> to outpace the numpy approach. Should I instead think of Acero as mainly
>> about working on very large datasets?
>>
>> -Spencer
>>
>
>
> --
> Regards,
> Chak-Pong
>
