hi Ravi,

It's hard to say anything without looking at some code. Are you working on a pull request? This seems like potentially useful functionality to have in the project.

- Wes

On Thu, May 9, 2019 at 1:10 PM Ravi Kiran Chirravuri <[email protected]> wrote:
>
> Hi,
>
> I'm working on implementing a filter-like functionality over a
> Table::Column given an indicator array.
> For now, we've implemented a basic version leveraging Array::Slice while
> iterating over the Table::Column->chunks() but this seems slow when the
> contiguous runs are of length <= 128 (compared to pandas).
>
> Total rows in table: 646035
> crk: Period: 128
> Expected output rows: 640987
> PROFILE: select starting
> arrowops actual_num_rows: 0
> PROFILE: select completed in 795ms
> PROFILE: pandas starting
> pandas actual_num_rows: 640987
> PROFILE: pandas completed in 737ms
>
> Before I dive into more observations about my implementation, I wanted to
> check if there are alternatives I could consider that might have
> substantially different performance characteristics.
>
> For comparison - pandas always takes ~750 ms for processing the 650k rows
> indifferent to the indicator distribution; but our implementation to copy
> the array-slice(s) into an output (ArrayVector) degrades quite a bit as the
> contiguous run-length reduces.
>
> For a degenerate case with alternating 1's and 0's in indicator:
>
> Expected output rows: 323017
> PROFILE: select starting
> arrowops actual_num_rows: 323017
> PROFILE: select completed in 72110ms
> PROFILE: pandas starting
> pandas actual_num_rows: 323017
> PROFILE: pandas completed in 492ms
>
> Thanks in advance,
> Ravi.
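
One alternative that might be worth comparing against the slice-per-run approach described above is to copy the selected values directly into a single preallocated output, so that the cost per row does not depend on how long the contiguous runs are. The sketch below is only an illustration under stated assumptions: it uses plain std::vector in place of Arrow arrays, ignores nulls, and the names FilterChunks and CountSelected are hypothetical and not part of the Arrow API. Whether per-slice overhead is the actual bottleneck in the implementation above is not confirmed in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow function): count the selected rows so
// the output buffer can be allocated once, up front.
std::size_t CountSelected(const std::vector<std::uint8_t>& indicator) {
  std::size_t n = 0;
  for (std::uint8_t flag : indicator) n += (flag != 0);
  return n;
}

// Hypothetical helper (not an Arrow function): filter a chunked column in a
// single pass. `chunks` stands in for the result of Column::chunks(), and
// `indicator` holds one 0/1 entry per row, numbered across all chunks.
std::vector<std::int64_t> FilterChunks(
    const std::vector<std::vector<std::int64_t>>& chunks,
    const std::vector<std::uint8_t>& indicator) {
  std::vector<std::int64_t> out;
  out.reserve(CountSelected(indicator));  // one allocation for the output

  std::size_t row = 0;  // global row index across chunk boundaries
  for (const auto& chunk : chunks) {
    for (std::int64_t value : chunk) {
      // Rows beyond the end of the indicator are treated as not selected.
      if (row < indicator.size() && indicator[row] != 0) {
        out.push_back(value);
      }
      ++row;
    }
  }
  return out;
}
```

With chunks {10, 11, 12} and {13, 14} and the indicator 1, 0, 1, 0, 1, this returns 10, 12, 14. The work is one comparison and at most one copy per input row, so an alternating indicator costs about the same as one with long runs.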
