wesm commented on pull request #7442:
URL: https://github.com/apache/arrow/pull/7442#issuecomment-644881357
I found some issues in the Python benchmarks I posted before. Here's the
updated setup and the current numbers.
setup (the prior results included the cost of converting NumPy booleans to
Arrow booleans; I also added a "worst case scenario" where 50% of values are
not selected):
```
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
string_values = pa.array([pd.util.testing.rands(16)
                          for i in range(10000)] * 100)
double_values = pa.array(np.random.randn(1000000))
all_but_one = np.ones(len(string_values), dtype=bool)
all_but_one[500000] = False
one_in_2 = np.array(np.random.binomial(1, 0.50, size=1000000), dtype=bool)
one_in_100 = np.array(np.random.binomial(1, 0.01, size=1000000), dtype=bool)
one_in_1000 = np.array(np.random.binomial(1, 0.001, size=1000000),
                       dtype=bool)
all_but_one = pa.array(all_but_one)
one_in_2 = pa.array(one_in_2)
one_in_100 = pa.array(one_in_100)
one_in_1000 = pa.array(one_in_1000)
```
before:
```
In [2]: timeit pc.filter(double_values, all_but_one)
5.15 ms ± 26.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [3]: timeit pc.filter(double_values, one_in_100)
1.45 ms ± 8.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: timeit pc.filter(double_values, one_in_1000)
1.37 ms ± 8.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [5]: timeit pc.filter(double_values, one_in_2)
7.08 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: timeit pc.filter(string_values, all_but_one)
11 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: timeit pc.filter(string_values, one_in_100)
1.64 ms ± 9.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: timeit pc.filter(string_values, one_in_1000)
1.45 ms ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: timeit pc.filter(string_values, one_in_2)
11.4 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
after:
```
In [2]: timeit pc.filter(double_values, all_but_one)
370 µs ± 2.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [3]: timeit pc.filter(double_values, one_in_100)
645 µs ± 3.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: timeit pc.filter(double_values, one_in_1000)
124 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [6]: timeit pc.filter(double_values, one_in_2)
5.11 ms ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: timeit pc.filter(string_values, all_but_one)
6.51 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [9]: timeit pc.filter(string_values, one_in_100)
680 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: timeit pc.filter(string_values, one_in_1000)
188 µs ± 849 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: timeit pc.filter(string_values, one_in_2)
7.73 ms ± 63.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
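For a quick comparison, the mean timings reported above can be turned into speedup ratios. This is only a sketch: the timings are copied by hand from the two listings (converted to microseconds), and the dictionary layout and the name `timings_us` are illustrative choices.

```python
# (before, after) mean timings in microseconds, copied from the listings above.
timings_us = {
    ("double", "all_but_one"): (5150.0, 370.0),
    ("double", "one_in_100"): (1450.0, 645.0),
    ("double", "one_in_1000"): (1370.0, 124.0),
    ("double", "one_in_2"): (7080.0, 5110.0),
    ("string", "all_but_one"): (11000.0, 6510.0),
    ("string", "one_in_100"): (1640.0, 680.0),
    ("string", "one_in_1000"): (1450.0, 188.0),
    ("string", "one_in_2"): (11400.0, 7730.0),
}

# Speedup = time before / time after; values above 1 mean the new code is faster.
speedups = {case: before / after for case, (before, after) in timings_us.items()}

for (dtype, mask), ratio in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{dtype:6s} {mask:12s} {ratio:5.1f}x")
```

Every case improves, with the smallest gains in the 50% "worst case" masks and the largest in the all-but-one filter on doubles.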