Thanks for the answer Aldrin, makes a lot of sense!
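In case it's useful to anyone else who finds this thread, here is a minimal
sketch that should make the page-cache effect visible. It assumes Linux and
sudo rights to drop the cache, and it reuses the cache-file path from my
snippet quoted below; dropping the cache simulates the "two days later" case
where the kernel has evicted the pages.

```
import subprocess
import time

import pyarrow as pa

def timed_read_all(filename):
    """Time read_all() on a memory-mapped Arrow IPC stream file."""
    with pa.memory_map(filename) as source:
        reader = pa.ipc.open_stream(source)
        start = time.time()
        reader.read_all()
        return time.time() - start

filename = "train/00248-00249/cache-3d25861de64b93b5.arrow"

print(f"cold: {timed_read_all(filename):.4f}s")  # pages faulted in from disk
print(f"warm: {timed_read_all(filename):.4f}s")  # served from the OS page cache

# Flush dirty pages, then drop the page cache (Linux only, needs root).
subprocess.run(["sync"], check=True)
subprocess.run(["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"], check=True)

print(f"cold again: {timed_read_all(filename):.4f}s")  # slow once more
```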
On Wed, Aug 17, 2022 at 6:42 PM Aldrin <[email protected]> wrote:

> The short answer is no, you cannot remove that discrepancy.
>
> For a memory-mapped file, data is brought into memory the first time it
> is accessed. Subsequent reads of that data don't require going back to
> disk, because it is already in memory. In your example, you haven't
> restarted your process, so the file data is still in memory for the
> subsequent reads.
>
> If you want more details about memory-mapped files, I think this SO post
> has some pretty good info [1].
>
> [1]: https://stackoverflow.com/a/6383253
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Wed, Aug 17, 2022 at 3:15 PM Victor Sanh <[email protected]> wrote:
>
>> Hi,
>>
>> I have an Arrow file produced by HF datasets, and I am trying to load
>> this dataset/arrow file with `datasets.load_from_disk(the_dataset_folder)`.
>> I noticed that the first time I load it, it is significantly slower than
>> the subsequent times. Two days later, I will retry loading it, and it
>> will be slow again...
>>
>> After digging in a little, the gap happens in the
>> `_memory_mapped_arrow_table_from_file` function, and in particular in
>> the call to `RecordBatchStreamReader.read_all`:
>>
>> https://github.com/huggingface/datasets/blob/158917e24128afbbe0f03ce36ea8cd9f850ea853/src/datasets/table.py#L51
>>
>> `read_all` is slow the first time (probably because of some operations
>> that only happen once and are cached for a few hours?), but not the
>> subsequent times.
>>
>> ```
>> >>> import time
>> >>> import pyarrow as pa
>> >>> def _memory_mapped_arrow_table_from_file(filename):
>> ...     memory_mapped_stream = pa.memory_map(filename)
>> ...     opened_stream = pa.ipc.open_stream(memory_mapped_stream)
>> ...     start_time = time.time()
>> ...     _ = opened_stream.read_all()
>> ...     print(f"{time.time() - start_time}")
>> ...
>> >>> filename_slow = "train/00248-00249/cache-3d25861de64b93b5.arrow"
>> >>> _memory_mapped_arrow_table_from_file(filename_slow)  # first time
>> 0.24040865898132324
>> >>> _memory_mapped_arrow_table_from_file(filename_slow)  # subsequent times
>> 0.0006551742553710938
>> >>> _memory_mapped_arrow_table_from_file(filename_slow)
>> 0.0006804466247558594
>> >>> _memory_mapped_arrow_table_from_file(filename_slow)
>> 0.0009818077087402344
>> ```
>>
>> Anything I can do to remove that discrepancy?
>>
>> My setup:
>> - Platform: Linux-4.18.0-305.57.1.el8_4.x86_64-x86_64-with-glibc2.17
>> - Python version: 3.8.13
>> - PyArrow version: 9.0.0
>>
>> Thanks in advance!
>>
>> --
>>
>> Victor Sanh
>>
>> Scientist 🤗
>>
>> We're hiring! <https://angel.co/company/hugging-face/jobs>
>>
>> website: https://huggingface.co/
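
One addendum for the archives: the cold-start cost itself can't be removed,
but it can at least be paid up front, outside the latency-sensitive path. A
minimal sketch, assuming a plain sequential read is enough to pull the file
into the OS page cache (it is on Linux):

```
def prewarm(filename, chunk_size=1 << 20):
    """Read the file sequentially once so the OS page cache is warm
    before the memory-mapped read_all() happens on the hot path."""
    with open(filename, "rb") as f:
        while f.read(chunk_size):
            pass

# Pay the disk cost here, outside the timed path.
prewarm("train/00248-00249/cache-3d25861de64b93b5.arrow")
# datasets.load_from_disk(...) should now read from a warm cache.
```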
