The short answer is no, you cannot remove that discrepancy.

For a memory-mapped file, data is paged into memory (the OS page cache) the
first time it is accessed. Subsequent reads of that data don't have to go
back to disk, because the pages are already resident. In your example, those
pages are still cached when you make the subsequent calls; two days later the
OS has likely evicted them to make room for other data, which is why the
first load is slow again.

If you want more details about memory-mapped files, this SO post has some
pretty good info [1].

[1]: https://stackoverflow.com/a/6383253



Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Wed, Aug 17, 2022 at 3:15 PM Victor Sanh <[email protected]> wrote:

> Hi,
>
> I have an arrow file produced by HF datasets and I am trying to load this
> dataset/arrow file with `datasets.load_from_disk(the_dataset_folder)`.
> I noticed that the first time I load it, it is significantly slower than
> the subsequent times. If I retry two days later, it is slow again...
>
> After diving a little bit, the gap happens in the
> `_memory_mapped_arrow_table_from_file` function, and in particular in the
> call to `RecordBatchStreamReader.read_all`:
>
> https://github.com/huggingface/datasets/blob/158917e24128afbbe0f03ce36ea8cd9f850ea853/src/datasets/table.py#L51
>
> `read_all` is slow the first time (probably for some operations that are
> only happening once, and are cached for a few hours?), but not the
> subsequent times.
>
> ```
>
> >>> import time
> >>> import pyarrow as pa
> >>> def _memory_mapped_arrow_table_from_file(filename):
> ...     memory_mapped_stream = pa.memory_map(filename)
> ...     opened_stream = pa.ipc.open_stream(memory_mapped_stream)
> ...     start_time = time.time()
> ...     _ = opened_stream.read_all()
> ...     print(f"{time.time()-start_time}")
> ...
> >>> filename_slow = "train/00248-00249/cache-3d25861de64b93b5.arrow"
> >>> _memory_mapped_arrow_table_from_file(filename_slow) # First time
> 0.24040865898132324
> >>> _memory_mapped_arrow_table_from_file(filename_slow) # subsequent times
> 0.0006551742553710938
> >>> _memory_mapped_arrow_table_from_file(filename_slow)
> 0.0006804466247558594
> >>> _memory_mapped_arrow_table_from_file(filename_slow)
> 0.0009818077087402344
>
> ```
>
> Anything I can do to remove that discrepancy?
>
> My setup:
> - Platform: Linux-4.18.0-305.57.1.el8_4.x86_64-x86_64-with-glibc2.17
> - Python version: 3.8.13
> - PyArrow version: 9.0.0
>
> Thanks in advance!
>
> --
>
> *Victor Sanh*
>
> Scientist 🤗
>
> We're hiring! <https://angel.co/company/hugging-face/jobs>
>
> website: https://huggingface.co/
>
