tustvold commented on issue #2738: URL: https://github.com/apache/arrow-rs/issues/2738#issuecomment-1249568144
I think most of the pieces to support this are already present: object store `get` returns a special-cased [`GetResult::File`](https://docs.rs/object_store/latest/object_store/enum.GetResult.html) for a `LocalFileSystem`, providing access to the underlying file descriptor. It would then be a case of defining a [`ChunkReader`](https://docs.rs/parquet/latest/parquet/file/reader/trait.ChunkReader.html) that makes use of memmap.

That being said, memmap helps in a relatively narrow set of circumstances, where:

* There is no in-memory buffer management system
* The same memory region is being read/written multiple times
* Small reads/writes are performed at random locations

Parquet does not really have this access pattern: it is internally a block format that performs large consecutive reads of a file once. Modern operating systems and hardware are very good at this, and this, combined with the fact that decoding these blocks is non-trivial, typically results in parquet decoding being primarily CPU-bound and not IO-bound.

I'm therefore not sure I would expect this to yield meaningful performance improvements, although happy to be corrected on this.

You may also find [this](https://db.cs.cmu.edu/mmap-cidr2022/) article insightful; it is more focused on general database workloads, but the general takeaway is that mmap is not always positive.
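For anyone who wants to experiment, here is a rough, std-only sketch of the shape such a reader might take. Everything in it is an assumption made for illustration: `RangeReader` is a hypothetical stand-in for parquet's `ChunkReader` (whose exact method signatures differ between parquet versions), `MappedFile`, `MappedSlice` and `read_range` are made-up names, and an `Arc<[u8]>` stands in for the mapped region so the example compiles without external crates. In real code the bytes would come from memory-mapping the `File` obtained from `GetResult::File`, for example with a crate such as memmap2.

```rust
use std::io::{Error, ErrorKind, Read, Result};
use std::sync::Arc;

/// Hypothetical stand-in for parquet's `ChunkReader`: hands out a reader
/// over a byte range of the underlying file.
trait RangeReader {
    type T: Read;
    /// Total length of the underlying data in bytes.
    fn len(&self) -> u64;
    /// Returns a reader over `length` bytes starting at offset `start`.
    fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
}

/// Shared, read-only view of a whole file. ASSUMPTION: `Arc<[u8]>` stands in
/// for a real memory map here.
struct MappedFile {
    data: Arc<[u8]>,
}

/// A reader over one sub-range of the shared region.
struct MappedSlice {
    data: Arc<[u8]>,
    pos: usize,
    end: usize,
}

impl Read for MappedSlice {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        // Copy as much as fits in `buf`, never going past the end of the range.
        let n = buf.len().min(self.end - self.pos);
        buf[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}

impl RangeReader for MappedFile {
    type T = MappedSlice;

    fn len(&self) -> u64 {
        self.data.len() as u64
    }

    fn get_read(&self, start: u64, length: usize) -> Result<MappedSlice> {
        // Reject ranges that overflow or extend past the end of the region.
        let end = start
            .checked_add(length as u64)
            .filter(|end| *end <= self.len())
            .ok_or_else(|| {
                Error::new(ErrorKind::UnexpectedEof, "range extends past end of mapping")
            })?;
        Ok(MappedSlice {
            // Cloning the Arc shares the region; no bytes are copied here.
            data: Arc::clone(&self.data),
            pos: start as usize,
            end: end as usize,
        })
    }
}

/// Convenience helper: read one whole range into a `Vec<u8>`.
fn read_range<R: RangeReader>(reader: &R, start: u64, length: usize) -> Result<Vec<u8>> {
    let mut out = Vec::with_capacity(length);
    reader.get_read(start, length)?.read_to_end(&mut out)?;
    Ok(out)
}
```

Note that each range is still copied once into the caller's buffer in `read`, and the decoder then reads each block a single time, which is part of why I would not expect a mapping to buy much for this access pattern.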
