When a pyarrow Table has many chunks, Table.slice() is slow, as the following code demonstrates:
```
import time

import numpy as np
import pyarrow as pa

batch_size = 1024
batches = []
for _ in range(8555):
    batch = {}
    for i in range(10):
        batch[str(i)] = np.arange(batch_size)
    batches.append(pa.Table.from_pydict(batch))
block = pa.concat_tables(batches, promote=True)

# Without the next line the loop below takes ~345s; with it, ~0.07s.
# block = block.combine_chunks()

start = time.perf_counter()
while block.num_rows > batch_size:
    block.slice(0, batch_size)  # result intentionally discarded
    block = block.slice(batch_size, block.num_rows - batch_size)
duration = time.perf_counter() - start
print(f"Duration: {duration}")
```
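For reference, concat_tables is zero-copy, so the concatenated table keeps one chunk per input table, which I suspect is why each slice() call does work proportional to the chunk count:

```
# Each of the 8555 input tables remains its own chunk in every column.
print(block.column("0").num_chunks)  # 8555
```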
Two questions:
1. Is this slice slowness expected when a table has many chunks?
2. Is there a way to tell pyarrow.concat_tables to return a table with a
single chunk, so I can avoid the extra copy from calling combine_chunks()?
(See the sketch below.)
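For question 2, the closest I've found is to rebuild each column with pa.concat_arrays; I assume this copies the chunk buffers just as combine_chunks() does, so it still doesn't avoid the extra copy:

```
import pyarrow as pa

def concat_single_chunk(tables):
    """Sketch: concatenate tables, then collapse each column to one chunk.

    pa.concat_arrays presumably copies the chunk buffers, so I'd expect
    this to cost about the same as combine_chunks() on the result.
    """
    combined = pa.concat_tables(tables, promote=True)
    arrays = [pa.concat_arrays(col.chunks) for col in combined.columns]
    return pa.Table.from_arrays(arrays, schema=combined.schema)
```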
--
Thanks,
Jiajun Yao