On 2020/11/11 18:17:32, Wes McKinney <[email protected]> wrote:
> You should be able to use the kernels available in pyarrow.compute to
> do this -- there might be a few that are missing, but if you can't
> find what you need please open a Jira issue so it goes into the
> backlog
>
> On Wed, Nov 11, 2020 at 11:43 AM Jason Sachs <[email protected]> wrote:
> >
> > I do a lot of the following operation:
> >
> > subframe = df[df['ID'] == k]
> >
> > where df is a Pandas DataFrame with a small number of columns but a
> > moderately large number of rows (say 200K - 5M). The columns are usually
> > simple... for example's sake let's call them int64 TIMESTAMP, uint32 ID,
> > int64 VALUE.
> >
> > I am moving the source data to Parquet format. I don't really care whether
> > I do this in PyArrow or Pandas, but I need to perform these subframe
> > selections frequently and would like to speed them up. (The idea being,
> > load the data into memory once, and then expect to perform subframe
> > selection anywhere from 10 - 1000 times to extract appropriate data for
> > further processing.)
> >
> > Is there a suggested method? Any ideas?
> >
> > I've tried
> >
> > subframe = df.query('ID == %d' % k)
> >
> > and flirted with the idea of using Gandiva as per
> > https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/
> > but it looks a bit rough + I had to manually tweak the types of literal
> > constants to support something other than a float64.
>
I'm a bit of a beginner in pyarrow, so I have something that kind of works (see https://stackoverflow.com/questions/64581590), but it doesn't seem to be significantly more efficient than pandas filtering; Joris filed https://issues.apache.org/jira/browse/ARROW-10423.

Below is an example in Jupyter. (DataFrame df0 has about 1.2M rows; some index values appear in 200K rows and others in only 195, so it covers both dense and sparse selections.) I'm not sure how to copy and paste Jupyter output so that input and output stay clearly delineated, so I've manually added ">>>" prompts.

I guess the good news is that pandas does pretty well on its own. The whole dataset takes about 24.5 MB, so a 10 ms query works out to roughly 2.45 GB/s of brute-force processing; I suppose I shouldn't complain too much.

>>> t = pa.Table.from_pandas(df0)
>>> t
pyarrow.Table
timestamp: int64
index: int32
value: int64
>>> import pyarrow.compute as pc
>>> def select_by_index(table, ival):
...     value_index = table.column('index')
...     index_type = value_index.type.to_pandas_dtype()
...     mask = pc.equal(value_index, index_type(ival))
...     return table.filter(mask)
>>> %timeit t2 = select_by_index(t, 515)
2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit t2 = select_by_index(t, 3)
8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df0[df0['index'] == 515]
1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df0[df0['index'] == 3]
10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> print("ALL:%d, 3:%d, 515:%d" % (len(df0), np.count_nonzero(df0['index'] == 3), np.count_nonzero(df0['index'] == 515)))
ALL:1225000, 3:200000, 515:195
>>> df0.memory_usage()
Index              128
timestamp      9800000
index          4900000
value          9800000
dtype: int64
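In case anyone wants to poke at this without my data, here's a rough self-contained sketch of the same experiment. The synthetic df0 below only mimics the shape of my real data (1,225,000 rows, one index value with 200K rows, one with 195, int32 index column), so absolute timings will differ; select_by_index is the same helper as in the session above.

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

# Synthetic stand-in for df0: 1,225,000 rows, with index value 3 appearing
# 200,000 times (dense) and index value 515 appearing 195 times (sparse).
rng = np.random.default_rng(0)
ids = np.concatenate([
    np.full(200_000, 3),
    np.full(195, 515),
    rng.integers(1000, 2000, size=1_024_805),
]).astype(np.int32)
df0 = pd.DataFrame({
    "timestamp": np.arange(len(ids), dtype=np.int64),
    "index": ids,
    "value": np.zeros(len(ids), dtype=np.int64),
})

t = pa.Table.from_pandas(df0)

def select_by_index(table, ival):
    # Cast the literal to the column's own dtype so the comparison isn't
    # promoted to a wider type.
    value_index = table.column("index")
    index_type = value_index.type.to_pandas_dtype()
    mask = pc.equal(value_index, index_type(ival))
    return table.filter(mask)

# Arrow-side filter vs. plain pandas boolean indexing
# (wrap these two lines in %timeit in Jupyter to compare).
sub_arrow = select_by_index(t, 515).to_pandas()
sub_pandas = df0[df0["index"] == 515]
assert len(sub_arrow) == len(sub_pandas) == 195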
