Hi,

I have an Arrow file produced by HF datasets, and I am trying to load it
with `datasets.load_from_disk(the_dataset_folder)`.
I noticed that the first time I load it, it is significantly slower than on
subsequent loads. And if I come back to it a couple of days later, the first
load is slow again...

After digging a little, I found that the gap comes from the
`_memory_mapped_arrow_table_from_file` function, and in particular from the
call to `RecordBatchStreamReader.read_all`:
https://github.com/huggingface/datasets/blob/158917e24128afbbe0f03ce36ea8cd9f850ea853/src/datasets/table.py#L51

`read_all` is slow the first time (probably because of some operations that
happen only once and are cached for a few hours?), but not on subsequent
calls:

```
>>> import time
>>> import pyarrow as pa
>>> def _memory_mapped_arrow_table_from_file(filename):
...     memory_mapped_stream = pa.memory_map(filename)
...     opened_stream = pa.ipc.open_stream(memory_mapped_stream)
...     start_time = time.time()
...     _ = opened_stream.read_all()
...     print(f"{time.time()-start_time}")
...
>>> filename_slow = "train/00248-00249/cache-3d25861de64b93b5.arrow"
>>> _memory_mapped_arrow_table_from_file(filename_slow) # First time
0.24040865898132324
>>> _memory_mapped_arrow_table_from_file(filename_slow) # subsequent times
0.0006551742553710938
>>> _memory_mapped_arrow_table_from_file(filename_slow)
0.0006804466247558594
>>> _memory_mapped_arrow_table_from_file(filename_slow)
0.0009818077087402344
```
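
My current suspicion is that this is the OS page cache rather than anything
pyarrow caches internally. One way to check would be to evict the file's
pages and re-time `read_all`; below is a minimal sketch, assuming Linux and
using `os.posix_fadvise` (the `evict_from_page_cache` and `time_read_all`
helpers are just names I made up for this test):

```
import os
import time
import pyarrow as pa

filename_slow = "train/00248-00249/cache-3d25861de64b93b5.arrow"

def evict_from_page_cache(filename):
    # Ask the kernel to drop this file's pages from the page cache
    # (Linux-only; a length of 0 means "to the end of the file").
    fd = os.open(filename, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

def time_read_all(filename):
    # Same timing as in the REPL session above, wrapped in a function.
    memory_mapped_stream = pa.memory_map(filename)
    opened_stream = pa.ipc.open_stream(memory_mapped_stream)
    start_time = time.time()
    _ = opened_stream.read_all()
    print(f"{time.time() - start_time}")

time_read_all(filename_slow)            # warm run: should be fast
evict_from_page_cache(filename_slow)
time_read_all(filename_slow)            # after eviction: slow again if the page cache is the cause
```

If that reproduces the slow first read, the cost would just be the file being
paged in from disk through the memory map, not something pyarrow recomputes.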

Is there anything I can do to remove that discrepancy?
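
The only workaround I can think of is to warm the cache myself before calling
`load_from_disk`, roughly like this sketch (the `warm_page_cache` helper is
purely illustrative):

```
def warm_page_cache(filename, chunk_size=1 << 20):
    # Read the file sequentially once so the kernel pulls it into the
    # page cache; the memory-mapped read_all afterwards should then be fast.
    with open(filename, "rb") as f:
        while f.read(chunk_size):
            pass
```

But that just moves the cost to the warm-up read instead of removing it, so
I'd love to know if there is something better.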

My setup:
- Platform: Linux-4.18.0-305.57.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- PyArrow version: 9.0.0

Thanks in advance!

-- 

*Victor Sanh*

Scientist 🤗

We're hiring! <https://angel.co/company/hugging-face/jobs>

website: https://huggingface.co/
