I extended your benchmark as follows: https://colab.research.google.com/gist/vibhatha/6fbd112902422ed786d19f83d2c54a41/arrow_filter_benchmark.ipynb

If the record batches are extracted, the filtering is done per batch, and the table is re-created from the filtered record batches, the timing is fine. Please check whether my code is accurate in doing that; a condensed sketch of the approach follows.
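A minimal sketch (not the exact notebook code; the key column is assumed to be the last column, as in Wes's benchmark below):

import pyarrow as pa
import pyarrow.compute as pc

def filter_by_batches(table, key_index, value):
    # Split the table into its record batches, filter each batch
    # independently, then stitch the results back into a table.
    filtered = [
        batch.filter(pc.equal(batch[key_index], value))
        for batch in table.to_batches()
    ]
    return pa.Table.from_batches(filtered, schema=table.schema)

# e.g., with Wes's table below: filter_by_batches(t, -1, 5)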
With Regards,
Vibhatha Abeykoon

On Thu, Nov 12, 2020 at 10:42 AM Wes McKinney <[email protected]> wrote:
> In my setup here I did:
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.compute as pc
> import numpy as np
>
> num_rows = 10_000_000
> data = np.random.randn(num_rows)
>
> df = pd.DataFrame({'data{}'.format(i): data
>                    for i in range(100)})
>
> df['key'] = np.random.randint(0, 100, size=num_rows)
>
> rb = pa.record_batch(df)
> t = pa.table(df)
>
> I found that the performance of filtering a record batch is very
> similar to pandas:
>
> In [22]: %timeit df[df.key == 5]
> 71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
> 75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> whereas the performance of filtering a table is absolutely abysmal
> (no idea what's going on here):
>
> In [23]: %timeit t.filter(pc.equal(t[-1], 5))
> 961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>
> A few obvious notes:
>
> * Evidently, these code paths haven't been greatly optimized, so
>   someone ought to take a look at this.
> * Everything here is single-threaded in Arrow-land. The end goal is
>   to parallelize everything (predicate evaluation, filtering) on the
>   CPU thread pool.
>
> On Wed, Nov 11, 2020 at 4:27 PM Vibhatha Abeykoon <[email protected]> wrote:
> >
> > Adding to the performance scenario: I also implemented some operators
> > on top of the Arrow compute API, and I observed performance similar
> > to NumPy and Pandas.
> >
> > But underneath, what I observed in Pandas was the use of NumPy ops:
> >
> > https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/ops/array_ops.py#L195
> > https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/series.py#L4999
> >
> > @Wes
> >
> > So this would mean that Pandas may have performance similar to NumPy
> > in filtering cases. Is this a correct assumption?
> >
> > The filter compute function itself was very fast; most of the time is
> > spent on creating the mask when there are multiple columns. For about
> > 10M records I observed a 1.5x ratio of execution time between the
> > Arrow-compute-based filtering method and Pandas.
> >
> > Is the performance gap due to vectorization or some other factor?
> >
> > With Regards,
> > Vibhatha Abeykoon
> >
> > On Wed, Nov 11, 2020 at 2:36 PM Jason Sachs <[email protected]> wrote:
> >>
> >> Ugh, let me reformat that, since the PonyMail browser interface
> >> thinks ">>>" is a triply quoted message.
> >>
> >> <<< t = pa.Table.from_pandas(df0)
> >> <<< t
> >> pyarrow.Table
> >> timestamp: int64
> >> index: int32
> >> value: int64
> >> <<< import pyarrow.compute as pc
> >> <<< def select_by_index(table, ival):
> >>         value_index = table.column('index')
> >>         index_type = value_index.type.to_pandas_dtype()
> >>         mask = pc.equal(value_index, index_type(ival))
> >>         return table.filter(mask)
> >> <<< %timeit t2 = select_by_index(t, 515)
> >> 2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >> <<< %timeit t2 = select_by_index(t, 3)
> >> 8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >> <<< %timeit df0[df0['index'] == 515]
> >> 1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> >> <<< %timeit df0[df0['index'] == 3]
> >> 10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >> <<< print("ALL:%d, 3:%d, 515:%d" % (len(df0),
> >>           np.count_nonzero(df0['index'] == 3),
> >>           np.count_nonzero(df0['index'] == 515)))
> >> ALL:1225000, 3:200000, 515:195
> >> <<< df0.memory_usage()
> >> Index            128
> >> timestamp    9800000
> >> index        4900000
> >> value        9800000
> >> dtype: int64
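Regarding Wes's note above about the CPU thread pool: that parallelization would live in Arrow's C++ layer, but a rough user-level approximation of per-batch parallelism from Python might look like the sketch below (a made-up helper, assuming pyarrow's compute kernels release the GIL while they run so the worker threads can actually overlap):

from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.compute as pc

def parallel_filter_by_batches(table, key_index, value, max_workers=4):
    # Build the mask and filter each record batch on a thread pool.
    # pyarrow's kernels execute in C++ and are expected to release
    # the GIL, letting the threads make progress in parallel.
    def filter_one(batch):
        return batch.filter(pc.equal(batch[key_index], value))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        filtered = list(pool.map(filter_one, table.to_batches()))
    return pa.Table.from_batches(filtered, schema=table.schema)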

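On Vibhatha's observation above that most of the time goes into creating the mask when multiple columns are involved: with pyarrow.compute, per-column predicates are combined into a single boolean mask before one filter() call. A small illustration, reusing the table t and column names from Wes's benchmark (pc.and_ and pc.greater are pyarrow.compute kernels):

import pyarrow.compute as pc

# Two-column predicate over Wes's table: key == 5 AND data0 > 0.
# Each comparison yields a boolean array; pc.and_ combines them
# into the single mask handed to filter().
mask = pc.and_(pc.equal(t['key'], 5), pc.greater(t['data0'], 0))
result = t.filter(mask)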