Hi, I'm working on implementing filter-like functionality over a Table::Column given an indicator array. For now, we have a basic version that uses Array::Slice while iterating over Table::Column->chunks(), but it is slow compared to pandas when the contiguous runs in the indicator have length <= 128.
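For illustration only, here is a minimal sketch of the two shapes this operation can take, written against plain std::vector rather than Arrow types. IndicatorToRuns and GatherSelected are hypothetical names, not Arrow APIs and not our actual code; the int64_t value type and the uint8_t indicator are likewise assumptions. The first function yields one (offset, length) pair per contiguous run of 1's, which is the unit a per-run slice would operate on; the second copies the selected values into a single output in one pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: returns (offset, length) for each contiguous run of
// non-zero entries in the indicator. A slice-per-run approach would make one
// slice call per element of the result.
std::vector<std::pair<int64_t, int64_t>> IndicatorToRuns(
    const std::vector<uint8_t>& indicator) {
  std::vector<std::pair<int64_t, int64_t>> runs;
  const int64_t n = static_cast<int64_t>(indicator.size());
  int64_t i = 0;
  while (i < n) {
    if (indicator[i] == 0) {
      ++i;  // skip unselected rows
      continue;
    }
    const int64_t start = i;
    while (i < n && indicator[i] != 0) {
      ++i;  // extend the current run
    }
    runs.emplace_back(start, i - start);
  }
  return runs;
}

// Hypothetical alternative: copy every selected value into one output
// buffer in a single pass, with no per-run objects.
std::vector<int64_t> GatherSelected(const std::vector<int64_t>& values,
                                    const std::vector<uint8_t>& indicator) {
  assert(values.size() == indicator.size());
  std::vector<int64_t> out;
  out.reserve(values.size());  // upper bound on the output size
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (indicator[i] != 0) {
      out.push_back(values[i]);
    }
  }
  return out;
}
```

With an alternating 1/0 indicator, IndicatorToRuns returns one run of length 1 for every selected row, so the number of runs equals the number of output rows, whereas GatherSelected always produces a single output regardless of run length.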

    Total rows in table: 646035
    crk: Period: 128
    Expected output rows: 640987
    PROFILE: select starting
    arrowops actual_num_rows: 0
    PROFILE: select completed in 795ms
    PROFILE: pandas starting
    pandas actual_num_rows: 640987
    PROFILE: pandas completed in 737ms

Before I dive into more observations about my implementation, I wanted to check whether there are alternatives I could consider that might have substantially different performance characteristics.

For comparison, pandas always takes ~750 ms to process the 650k rows, regardless of the indicator distribution, but our implementation, which copies the array slice(s) into an output ArrayVector, degrades quite a bit as the contiguous run length shrinks. For a degenerate case with alternating 1's and 0's in the indicator:

    Expected output rows: 323017
    PROFILE: select starting
    arrowops actual_num_rows: 323017
    PROFILE: select completed in 72110ms
    PROFILE: pandas starting
    pandas actual_num_rows: 323017
    PROFILE: pandas completed in 492ms

Thanks in advance,
Ravi.
