Re: Efficient way to copy contiguous run(s) of entries

Wes McKinney Thu, 09 May 2019 16:44:32 -0700

hi Ravi,

It's hard to say anything without looking at some code. Are you
working on a pull request? This seems like potentially useful
functionality to have in the project.


- Wes

On Thu, May 9, 2019 at 1:10 PM Ravi Kiran Chirravuri
<[email protected]> wrote:
>
> Hi,
>
> I'm working on implementing filter-like functionality over a
> Table::Column, given an indicator array.
> For now, we've implemented a basic version that uses Array::Slice while
> iterating over Table::Column->chunks(), but it seems slow compared to
> pandas when the contiguous runs have length <= 128.
>
>   Total rows in table:  646035
>   crk: Period:  128
>   Expected output rows:  640987
>   PROFILE: select starting
>   arrowops actual_num_rows:  0
>   PROFILE: select completed in 795ms
>   PROFILE: pandas starting
>   pandas actual_num_rows:  640987
>   PROFILE: pandas completed in 737ms
>
> Before I dive into more observations about my implementation, I wanted to
> check if there are alternatives I could consider that might have
> substantially different performance characteristics.
>
> For comparison, pandas always takes ~750 ms to process the 650k rows,
> regardless of the indicator distribution, but our implementation, which
> copies the array slice(s) into an output (ArrayVector), degrades quite a
> bit as the contiguous run length shrinks.
>
> For a degenerate case with alternating 1's and 0's in indicator:
>
>   Expected output rows:  323017
>   PROFILE: select starting
>   arrowops actual_num_rows:  323017
>   PROFILE: select completed in 72110ms
>   PROFILE: pandas starting
>   pandas actual_num_rows:  323017
>   PROFILE: pandas completed in 492ms
>
> Thanks in advance,
> Ravi.
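
For readers of the archive, here is a minimal sketch of one alternative to
creating a separate slice per run: count the selected rows once, allocate a
single output buffer, and append each contiguous run into it. This is not the
implementation discussed in this thread and it does not use the Arrow API.
Plain std::vector stands in for the column data and the indicator array, and
the name FilterByRuns is hypothetical. Whether this would close the gap with
pandas on the data above is untested.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arrow): keep values[i] wherever
// indicator[i] is non-zero, copying one contiguous run at a time into a
// single output buffer instead of materializing one slice per run.
std::vector<int64_t> FilterByRuns(const std::vector<int64_t>& values,
                                  const std::vector<uint8_t>& indicator) {
  // Assumption: the indicator has exactly one entry per value.
  assert(values.size() == indicator.size());
  const std::size_t n = values.size();

  // First pass: count selected rows so the output is allocated only once.
  std::size_t selected = 0;
  for (std::size_t i = 0; i < n; ++i) {
    selected += (indicator[i] != 0) ? 1 : 0;
  }

  std::vector<int64_t> out;
  out.reserve(selected);

  // Second pass: locate each run [start, i) of set indicators and append it.
  std::size_t i = 0;
  while (i < n) {
    if (indicator[i] == 0) {
      ++i;
      continue;
    }
    const std::size_t start = i;
    while (i < n && indicator[i] != 0) {
      ++i;
    }
    out.insert(out.end(), values.data() + start, values.data() + i);
  }
  return out;
}
```

With alternating 1's and 0's every run has length one, so the work per row is
a single element append into the already reserved buffer rather than the
creation of a new slice object.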
