Hi!

I got surprising results when comparing numpy and pyarrow performance.

val = np.uint8(115)

numpy runs at about the same speed whether the scalar is np.uint8(115) or a plain 115:

%timeit np.count_nonzero(data_np == val)
591 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit np.count_nonzero(data_np == 115)
598 µs ± 3.73 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Strangely, it is fastest for b"s":

%timeit np.count_nonzero(data_np == b"s")
403 µs ± 3.15 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

pc.equal is roughly 2.8 times slower than numpy for np.uint8(115):

%timeit pc.equal(data_pa, val).sum().as_py()
1.64 ms ± 8.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

but much, much slower (roughly 9.5 times slower again) for a plain 115:

%timeit pc.equal(data_pa, 115).sum().as_py()
15.6 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And fails for b"s":

%timeit pc.equal(data_pa, b"s").sum().as_py()
ArrowNotImplementedError: Function 'equal' has no kernel matching
input types (uint8, binary)

I filed this as https://github.com/apache/arrow/issues/38640

Is there any chance of getting pc.equal performance closer to numpy?

BR,

Jacek