For reference, this was also opened as an issue on GitHub, and I
answered there: https://github.com/apache/arrow/issues/35126
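
For archive readers, here is a minimal sketch (not a verbatim copy of the
answer in the linked issue) of the workaround that the reproducer below
already hints at in its commented-out combine_chunks() line: copy the data
into a single chunk per column once, right after concatenation, so that the
repeated slicing no longer pays a per-chunk cost. The number of input tables
is reduced here purely to keep the example quick.

```
import numpy as np
import pyarrow as pa

batch_size = 1024

# Build many small single-chunk tables, as in the reproducer below
# (100 instead of 8555, purely for brevity).
tables = [
    pa.Table.from_pydict({str(i): np.arange(batch_size) for i in range(10)})
    for _ in range(100)
]

# concat_tables is zero-copy and keeps one chunk per input table.
block = pa.concat_tables(tables)
print(block.column(0).num_chunks)  # -> 100

# combine_chunks() copies the data once into a single chunk per column;
# after that, slice() only has to look at one chunk.
block = block.combine_chunks()
print(block.column(0).num_chunks)  # -> 1

# Split the table into fixed-size pieces with repeated slicing.
pieces = []
while block.num_rows > batch_size:
    pieces.append(block.slice(0, batch_size))
    block = block.slice(batch_size)
pieces.append(block)
print(len(pieces))  # -> 100
```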

On Wed, 12 Apr 2023 at 06:34, Jiajun Yao <[email protected]> wrote:
>
> When a pyarrow table has many chunks, the slice function is slow, as
> demonstrated by the following code:
>
> ```
>
> import time
> import numpy as np
> import pyarrow as pa
>
> batch_size = 1024
>
> batches = []
> for _ in range(8555):
>     batch = {}
>     for i in range(10):
>         batch[str(i)] = np.array([j for j in range(batch_size)])
>     batches.append(pa.Table.from_pydict(batch))
> block = pa.concat_tables(batches, promote=True)
>
> # Without the below line, the time is 345s and with it, the time is 0.07s.
> # block = block.combine_chunks()
>
> start = time.perf_counter()
> while block.num_rows > batch_size:
>     block.slice(0, batch_size)
>     block = block.slice(batch_size, block.num_rows - batch_size)
>
> duration = time.perf_counter() - start
> print(f"Duration: {duration}")
>
> ```
>
> Several questions:
>
> Is this slice slowness expected when a table has many chunks?
> Is there a way to tell pyarrow.concat_tables to return a table with a single 
> chunk so I can avoid an extra copy by calling combine_chunks()?
>
>
> --
> Thanks,
> Jiajun Yao
