Hi,

I'm working on implementing filter-like functionality over a
Table::Column, given an indicator array.
For now, we've implemented a basic version that uses Array::Slice while
iterating over Table::Column->chunks(), but it seems slow compared to
pandas when the contiguous runs are of length <= 128.

Total rows in table:  646035

crk: Period:  128

Expected output rows:  640987

PROFILE: select starting

arrowops actual_num_rows:  0

PROFILE: select completed in 795ms

PROFILE: pandas starting

pandas actual_num_rows:  640987

PROFILE: pandas completed in 737ms

Before I dive into more observations about my implementation, I wanted to
check if there are alternatives I could consider that might have
substantially different performance characteristics.

For comparison, pandas always takes ~750 ms to process the 650k rows,
regardless of the indicator distribution, but our implementation, which
copies the array slice(s) into an output (ArrayVector), degrades quite a
bit as the contiguous run length shrinks.

For a degenerate case with alternating 1's and 0's in the indicator:

Expected output rows:  323017

PROFILE: select starting

arrowops actual_num_rows:  323017

PROFILE: select completed in 72110ms

PROFILE: pandas starting

pandas actual_num_rows:  323017

PROFILE: pandas completed in 492ms

Thanks in advance,
Ravi.
