tustvold commented on issue #2738:
URL: https://github.com/apache/arrow-rs/issues/2738#issuecomment-1249568144

   I think most of the pieces to support this are already present. The 
`object_store` `get` method returns a special-cased 
[`GetResult::File`](https://docs.rs/object_store/latest/object_store/enum.GetResult.html)
 for a LocalFileSystem, providing access to the underlying file descriptor. It 
would then be a case of defining a 
[`ChunkReader`](https://docs.rs/parquet/latest/parquet/file/reader/trait.ChunkReader.html)
 that makes use of memmap.
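
   As a rough illustration of the seam such a reader would sit behind, here is 
a minimal, std-only sketch. `RangeRead` and `LocalFile` are hypothetical names 
invented for this example; they are not the parquet `ChunkReader` trait or any 
`object_store` type, and they only loosely mirror the idea of "give me `length` bytes 
starting at `start`". This version uses seek + read on the file handle. A 
memmap-backed variant would presumably keep the same interface and return a slice 
of the mapping instead of issuing a read.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical stand-in for a range-reading interface.
/// NOT the real parquet `ChunkReader` trait.
trait RangeRead {
    /// Total length of the underlying data in bytes.
    fn len(&self) -> io::Result<u64>;
    /// Return exactly `length` bytes starting at byte offset `start`.
    fn get_bytes(&self, start: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// Hypothetical wrapper around an already-opened local file.
struct LocalFile {
    file: File,
}

impl RangeRead for LocalFile {
    fn len(&self) -> io::Result<u64> {
        Ok(self.file.metadata()?.len())
    }

    fn get_bytes(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        // `Read` and `Seek` are implemented for `&File`, so a shared
        // reference is enough. The file cursor is shared, so concurrent
        // callers would need external synchronization.
        let mut f = &self.file;
        f.seek(SeekFrom::Start(start))?;
        let mut buf = vec![0u8; length];
        // Fails with `UnexpectedEof` if the range runs past the end.
        f.read_exact(&mut buf)?;
        Ok(buf)
    }
}
```

   Keeping the range-read interface separate from how the bytes are fetched 
would make it straightforward to benchmark a memmap-backed implementation against 
plain reads before committing to either.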
   
   That being said, memmap helps in a relatively narrow set of circumstances, 
namely when:
   
   * There is no in-memory buffer management system
   * The same memory region is read/written multiple times
   * Small reads/writes are performed at random locations
   
   Parquet does not really have this access pattern: it is internally a block 
format that performs large consecutive reads of a file once. Modern operating 
systems and hardware are very good at this. Combined with the fact that 
decoding these blocks is non-trivial, this typically results in parquet decoding 
being primarily CPU-bound rather than IO-bound. I'm therefore not sure I would 
expect this to yield meaningful performance improvements, although I'm happy to be 
corrected on this.
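
   To make the access pattern concrete, here is a minimal, std-only sketch of 
"large consecutive reads of a file once". `scan_blocks` and the `decode` 
callback are hypothetical names for this example, not parquet APIs. The 
callback stands in for the (typically CPU-heavy) decoding work, and the block 
size is an arbitrary parameter.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Read `path` front to back in blocks of up to `block_size` bytes, handing
/// each block to `decode` exactly once. Returns the total bytes read.
fn scan_blocks<F: FnMut(&[u8])>(
    path: &Path,
    block_size: usize,
    mut decode: F,
) -> io::Result<u64> {
    assert!(block_size > 0, "block_size must be non-zero");
    let mut file = File::open(path)?;
    // One buffer, reused for every block: no random access, no re-reads.
    let mut buf = vec![0u8; block_size];
    let mut total = 0u64;
    loop {
        // `read` may return fewer bytes than requested; 0 means end of file.
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        decode(&buf[..n]);
        total += n as u64;
    }
    Ok(total)
}
```

   With a pattern like this, each byte is touched once and in order, which is 
the case where the operating system's read-ahead already does well and where the 
time spent inside `decode` tends to dominate.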
   
   You may also find [this](https://db.cs.cmu.edu/mmap-cidr2022/) article 
insightful. It is more focused on general database workloads, but the general 
takeaway is that mmap is not always a win.


