I extended your benchmark as follows: https://colab.research.google.com/gist/vibhatha/6fbd112902422ed786d19f83d2c54a41/arrow_filter_benchmark.ipynb

If the record batches are extracted, the filtering is done per batch, and the table is re-created from the filtered record batches, the timing is fine. Please check whether my code is accurate in doing that; a condensed sketch of the approach follows.
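A minimal sketch (not the exact notebook code; the key column is assumed to be the last column, as in Wes's benchmark below):

import pyarrow as pa
import pyarrow.compute as pc

def filter_by_batches(table, key_index, value):
    # Split the table into its record batches, filter each batch
    # independently, then stitch the results back into a table.
    filtered = [
        batch.filter(pc.equal(batch[key_index], value))
        for batch in table.to_batches()
    ]
    return pa.Table.from_batches(filtered, schema=table.schema)

# e.g., with Wes's table below: filter_by_batches(t, -1, 5)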
With Regards,
Vibhatha Abeykoon

On Thu, Nov 12, 2020 at 10:42 AM Wes McKinney <[email protected]> wrote:
> In my setup here I did:
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.compute as pc
> import numpy as np
>
> num_rows = 10_000_000
> data = np.random.randn(num_rows)
>
> df = pd.DataFrame({'data{}'.format(i): data
>                    for i in range(100)})
>
> df['key'] = np.random.randint(0, 100, size=num_rows)
>
> rb = pa.record_batch(df)
> t = pa.table(df)
>
> I found that the performance of filtering a record batch is very
> similar to pandas:
>
> In [22]: %timeit df[df.key == 5]
> 71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
> 75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> whereas the performance of filtering a table is absolutely abysmal
> (no idea what's going on here):
>
> In [23]: %timeit t.filter(pc.equal(t[-1], 5))
> 961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>
> A few obvious notes:
>
> * Evidently, these code paths haven't been greatly optimized, so
>   someone ought to take a look at this.
> * Everything here is single-threaded in Arrow-land. The end goal is
>   to parallelize everything (predicate evaluation, filtering) on the
>   CPU thread pool.
>
> On Wed, Nov 11, 2020 at 4:27 PM Vibhatha Abeykoon <[email protected]> wrote:
> >
> > Adding to the performance scenario: I also implemented some operators
> > on top of the Arrow compute API, and I observed performance similar
> > to NumPy and Pandas.
> >
> > But underneath, what I observed in Pandas was the use of NumPy ops:
> >
> > https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/ops/array_ops.py#L195
> > https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/series.py#L4999
> >
> > @Wes
> >
> > So this would mean that Pandas may have performance similar to NumPy
> > in filtering cases. Is this a correct assumption?
> >
> > The filter compute function itself was very fast; most of the time is
> > spent on creating the mask when there are multiple columns. For about
> > 10M records I observed a 1.5x ratio of execution time between the
> > Arrow-compute-based filtering method and Pandas.
> >
> > Is the performance gap due to vectorization or some other factor?
> >
> > With Regards,
> > Vibhatha Abeykoon
> >
> > On Wed, Nov 11, 2020 at 2:36 PM Jason Sachs <[email protected]> wrote:
> >>
> >> Ugh, let me reformat that, since the PonyMail browser interface
> >> thinks ">>>" is a triply quoted message.
> >>
> >> <<< t = pa.Table.from_pandas(df0)
> >> <<< t
> >> pyarrow.Table
> >> timestamp: int64
> >> index: int32
> >> value: int64
> >> <<< import pyarrow.compute as pc
> >> <<< def select_by_index(table, ival):
> >>         value_index = table.column('index')
> >>         index_type = value_index.type.to_pandas_dtype()
> >>         mask = pc.equal(value_index, index_type(ival))
> >>         return table.filter(mask)
> >> <<< %timeit t2 = select_by_index(t, 515)
> >> 2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >> <<< %timeit t2 = select_by_index(t, 3)
> >> 8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >> <<< %timeit df0[df0['index'] == 515]
> >> 1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >> <<< %timeit df0[df0['index'] == 3]
> >> 10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >> <<< print("ALL:%d, 3:%d, 515:%d" % (len(df0),
> >>           np.count_nonzero(df0['index'] == 3),
> >>           np.count_nonzero(df0['index'] == 515)))
> >> ALL:1225000, 3:200000, 515:195
> >> <<< df0.memory_usage()
> >> Index            128
> >> timestamp    9800000
> >> index        4900000
> >> value        9800000
> >> dtype: int64
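Regarding Wes's note above about the CPU thread pool: that parallelization would live in Arrow's C++ layer, but a rough user-level approximation of per-batch parallelism from Python might look like the sketch below (a made-up helper, assuming pyarrow's compute kernels release the GIL while they run so the worker threads can actually overlap):

from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.compute as pc

def parallel_filter_by_batches(table, key_index, value, max_workers=4):
    # Build the mask and filter each record batch on a thread pool.
    # pyarrow's kernels execute in C++ and are expected to release
    # the GIL, letting the threads make progress in parallel.
    def filter_one(batch):
        return batch.filter(pc.equal(batch[key_index], value))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        filtered = list(pool.map(filter_one, table.to_batches()))
    return pa.Table.from_batches(filtered, schema=table.schema)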

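On Vibhatha's observation above that most of the time goes into creating the mask when multiple columns are involved: with pyarrow.compute, per-column predicates are combined into a single boolean mask before one filter() call. A small illustration, reusing the table t and column names from Wes's benchmark (pc.and_ and pc.greater are pyarrow.compute kernels):

import pyarrow.compute as pc

# Two-column predicate over Wes's table: key == 5 AND data0 > 0.
# Each comparison yields a boolean array; pc.and_ combines them
# into the single mask handed to filter().
mask = pc.and_(pc.equal(t['key'], 5), pc.greater(t['data0'], 0))
result = t.filter(mask)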