Thanks for the answer Aldrin, makes a lot of sense!
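In case it's useful to anyone else who finds this thread, here is a minimal
sketch that should make the page-cache effect visible. It assumes Linux and
sudo rights to drop the cache, and it reuses the cache-file path from my
snippet quoted below; dropping the cache simulates the "two days later" case
where the kernel has evicted the pages.

```
import subprocess
import time

import pyarrow as pa

def timed_read_all(filename):
    """Time read_all() on a memory-mapped Arrow IPC stream file."""
    with pa.memory_map(filename) as source:
        reader = pa.ipc.open_stream(source)
        start = time.time()
        reader.read_all()
        return time.time() - start

filename = "train/00248-00249/cache-3d25861de64b93b5.arrow"

print(f"cold: {timed_read_all(filename):.4f}s")  # pages faulted in from disk
print(f"warm: {timed_read_all(filename):.4f}s")  # served from the OS page cache

# Flush dirty pages, then drop the page cache (Linux only, needs root).
subprocess.run(["sync"], check=True)
subprocess.run(["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)

print(f"cold again: {timed_read_all(filename):.4f}s")  # slow once more
```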
On Wed, Aug 17, 2022 at 6:42 PM Aldrin <[email protected]> wrote:

> The short answer is no, you cannot remove that discrepancy.
>
> For a memory-mapped file, data is brought into memory the first time it
> is accessed. Subsequent reads of that data don't require going back to
> disk, because it is already in memory. In your example, you haven't
> restarted your process, so the file data is still in memory for the
> subsequent reads.
>
> If you want more details about memory-mapped files, I think this SO post
> has some pretty good info [1].
>
> [1]: https://stackoverflow.com/a/6383253
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Wed, Aug 17, 2022 at 3:15 PM Victor Sanh <[email protected]> wrote:
>
>> Hi,
>>
>> I have an Arrow file produced by HF datasets, and I am trying to load
>> this dataset/arrow file with `datasets.load_from_disk(the_dataset_folder)`.
>> I noticed that the first time I load it, it is significantly slower than
>> the subsequent times. Two days later, I will retry loading it, and it
>> will be slow again...
>>
>> After digging in a little, the gap happens in the
>> `_memory_mapped_arrow_table_from_file` function, and in particular in
>> the call to `RecordBatchStreamReader.read_all`:
>>
>> https://github.com/huggingface/datasets/blob/158917e24128afbbe0f03ce36ea8cd9f850ea853/src/datasets/table.py#L51
>>
>> `read_all` is slow the first time (probably because of some operations
>> that only happen once and are cached for a few hours?), but not the
>> subsequent times.
>>
>> ```
>> >>> import time
>> >>> import pyarrow as pa
>> >>> def _memory_mapped_arrow_table_from_file(filename):
>> ...     memory_mapped_stream = pa.memory_map(filename)
>> ...     opened_stream = pa.ipc.open_stream(memory_mapped_stream)
>> ...     start_time = time.time()
>> ...     _ = opened_stream.read_all()
>> ...     print(f"{time.time() - start_time}")
>> ...
>> >>> filename_slow = "train/00248-00249/cache-3d25861de64b93b5.arrow"
>> >>> _memory_mapped_arrow_table_from_file(filename_slow)  # first time
>> 0.24040865898132324
>> >>> _memory_mapped_arrow_table_from_file(filename_slow)  # subsequent times
>> 0.0006551742553710938
>> >>> _memory_mapped_arrow_table_from_file(filename_slow)
>> 0.0006804466247558594
>> >>> _memory_mapped_arrow_table_from_file(filename_slow)
>> 0.0009818077087402344
>> ```
>>
>> Anything I can do to remove that discrepancy?
>>
>> My setup:
>> - Platform: Linux-4.18.0-305.57.1.el8_4.x86_64-x86_64-with-glibc2.17
>> - Python version: 3.8.13
>> - PyArrow version: 9.0.0
>>
>> Thanks in advance!
>>
>> --
>>
>> Victor Sanh
>>
>> Scientist 🤗
>>
>> We're hiring! <https://angel.co/company/hugging-face/jobs>
>>
>> website: https://huggingface.co/
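
One addendum for the archives: the cold-start cost itself can't be removed,
but it can at least be paid up front, outside the latency-sensitive path. A
minimal sketch, assuming a plain sequential read is enough to pull the file
into the OS page cache (it is on Linux):

```
def prewarm(filename, chunk_size=1 << 20):
    """Read the file sequentially once so the OS page cache is warm
    before the memory-mapped read_all() happens on the hot path."""
    with open(filename, "rb") as f:
        while f.read(chunk_size):
            pass

# Pay the disk cost here, outside the timed path.
prewarm("train/00248-00249/cache-3d25861de64b93b5.arrow")
# datasets.load_from_disk(...) should now read from a warm cache.
```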
