When a pyarrow table has many chunks, Table.slice is slow, as
demonstrated by the following code:

```
import time

import numpy as np
import pyarrow as pa

batch_size = 1024

# Build a table out of 8555 single-chunk tables.
batches = []
for _ in range(8555):
    batch = {}
    for i in range(10):
        batch[str(i)] = np.array([j for j in range(batch_size)])
    batches.append(pa.Table.from_pydict(batch))
block = pa.concat_tables(batches, promote=True)
# Without the line below, the time is 345s; with it, 0.07s.
# block = block.combine_chunks()

start = time.perf_counter()
while block.num_rows > batch_size:
    # Take one batch off the front, then shrink the table.
    block.slice(0, batch_size)
    block = block.slice(batch_size, block.num_rows - batch_size)

duration = time.perf_counter() - start
print(f"Duration: {duration}")
```
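
For context, concat_tables keeps one chunk per input table, so block
should end up with 8555 chunks in each column. A quick check (reusing
block from the script above, run right after the concat_tables call):

```
# Each input table contributes one chunk per column, so slice()
# presumably has to locate its offset within this chunk list on
# every call.
print(block.column(0).num_chunks)                   # should print 8555
print(block.combine_chunks().column(0).num_chunks)  # should print 1
```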

Several questions:

1. Is this slice slowness expected when a table has many chunks?
2. Is there a way to tell pyarrow.concat_tables to return a table with a
   single chunk, so I can avoid the extra copy from calling
   combine_chunks() (sketched below)?
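
For concreteness, this is the workaround from the commented-out line in
the script above, i.e. the extra copy I'd like to avoid:

```
# combine_chunks() copies the 8555 chunks into one contiguous chunk per
# column; after that, each slice() call is cheap (0.07s total in the
# timing above, versus 345s without it).
block = pa.concat_tables(batches, promote=True).combine_chunks()
```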


-- 
Thanks,
Jiajun Yao
