Tabular ID query (subframe selection based on an integer ID)

Jason Sachs Wed, 11 Nov 2020 09:44:08 -0800

I do a lot of the following operation:

    subframe = df[df['ID'] == k]


where df is a pandas DataFrame with a small number of columns but a moderately 
large number of rows (say 200K to 5M). The columns are usually simple; for 
example's sake, let's call them int64 TIMESTAMP, uint32 ID, and int64 VALUE.
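
To make the setup concrete, here is a minimal sketch of a frame with that schema and the baseline selection. The row count, ID range, and random values are made up for illustration; only the column names and dtypes come from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: sizes and value ranges are assumptions.
rng = np.random.default_rng(0)
n = 200_000
df = pd.DataFrame({
    "TIMESTAMP": np.arange(n, dtype=np.int64),
    "ID": rng.integers(0, 500, size=n, dtype=np.uint32),
    "VALUE": rng.integers(-1000, 1000, size=n, dtype=np.int64),
})

k = 42
# Baseline: build a boolean mask over the whole ID column, then select rows.
subframe = df[df["ID"] == k]
```

Each selection done this way compares every row's ID against k, so the cost is paid again on every query.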

I am moving the source data to Parquet format. I don't really care whether I do 
this in PyArrow or pandas, but I need to perform these subframe selections 
frequently and would like to speed them up. (The idea is to load the data into 
memory once and then perform subframe selection anywhere from 10 to 1000 times 
to extract the appropriate data for further processing.)
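
One possible shape for the load-once, select-many pattern is sketched below. It is an illustration only, not something I have benchmarked: it groups the frame by ID a single time and then looks up row positions per ID. The helper name select_by_id and the sample data are hypothetical; groupby(...).indices and iloc are standard pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the TIMESTAMP / ID / VALUE schema.
rng = np.random.default_rng(1)
n = 200_000
df = pd.DataFrame({
    "TIMESTAMP": np.arange(n, dtype=np.int64),
    "ID": rng.integers(0, 500, size=n, dtype=np.uint32),
    "VALUE": rng.integers(-1000, 1000, size=n, dtype=np.int64),
})

# One pass over the data: dict mapping each ID to the integer
# positions of the rows that carry it.
positions_by_id = df.groupby("ID").indices

def select_by_id(frame, positions, k):
    """Return the rows of frame whose ID equals k (empty if k is absent)."""
    pos = positions.get(k, np.empty(0, dtype=np.intp))
    return frame.iloc[pos]

subframe = select_by_id(df, positions_by_id, 42)
```

The grouping cost is paid once up front; each later selection is a dictionary lookup plus a positional take. Whether that beats the boolean mask depends on the number of distinct IDs and on how many selections are made.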

Is there a suggested method? Any ideas?

I've tried

    subframe = df.query('ID == %d' % k)

and flirted with the idea of using Gandiva as per 
https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/
but it looks a bit rough, and I had to manually tweak the types of literal 
constants to support something other than float64.
