niyue edited a comment on pull request #11588:
URL: https://github.com/apache/arrow/pull/11588#issuecomment-958612133


   > Thanks for the PR.
   > I'm not sure if exposing madvise options is beneficial. If we expose 
`RANDOM`, then why not `SEQUENTIAL`?
   > I prefer don't expose these access pattern related options, just let OS do 
the job.
   > `WILLNEED` is a useful option IMO, I suppose OS won't invent a reading 
without hints.
   
   I think someone else will probably need `SEQUENTIAL` as well. In my test, 
when the access pattern is random (binary searching an array in a 
memory-mapped Arrow IPC file, in my case), the OS (Linux) still prefetches 
data, so most of the IO is wasted (90% in my test) and the page cache fills up 
with data that is never used.
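
For illustration only, here is a minimal Python sketch of what applying the random advice to a memory-mapped file amounts to on Linux. It is not the code in this PR: `map_with_random_advice` is a hypothetical helper name, and the sketch assumes Python 3.8+ on a platform that defines `MADV_RANDOM`.

```python
import mmap


def map_with_random_advice(path):
    """Memory-map `path` read-only and hint that accesses will be random.

    Hypothetical helper for illustration; not part of Arrow.
    """
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # mmap.madvise needs Python 3.8+, and MADV_RANDOM is only defined on
    # platforms that support it (e.g. Linux), so guard both.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        # Ask the kernel not to read ahead around each faulted page.
        mm.madvise(mmap.MADV_RANDOM)
    return mm
```

The advice is only a hint: reads return the same bytes with or without it, and only the kernel's readahead behaviour is expected to change.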
   
   To visualize what I found, I wrote a program that accesses an array in a 
memory-mapped record batch at exponentially growing indexes 
(`arr[1], arr[2], arr[4], arr[8], ...`). 
   Before applying the random advice, the page cache for this file looks like 
this:
   
![image](https://user-images.githubusercontent.com/27754/140003696-8a544000-162f-4e94-9fd9-00e1115c672d.png)
   
   After applying the random advice, the page cache for this file looks like 
this:
   
![image](https://user-images.githubusercontent.com/27754/140003550-64c57ea8-fea0-47b0-bb74-56b7967ee3fe.png)
   
   This is probably not a problem for fast storage (SSD), but IO will become 
the bottleneck if the storage bandwidth is limited.
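
To make the access pattern concrete, here is a rough Python sketch (not the actual test program) that lists the exponential indexes and the pages they fall on. The 8-byte element size, the 4096-byte page size, the assumption that the array starts on a page boundary, and both function names are illustrative assumptions.

```python
def exponential_indexes(length):
    """Yield 1, 2, 4, 8, ... for every power of two below `length`."""
    i = 1
    while i < length:
        yield i
        i *= 2


def pages_touched(indexes, item_size=8, page_size=4096):
    """Return the sorted page numbers that hold the given array indexes.

    Assumes the array starts at a page boundary (an illustrative
    simplification).
    """
    return sorted({(i * item_size) // page_size for i in indexes})


# An array of 2**20 8-byte values spans 2048 pages of 4096 bytes.
indexes = list(exponential_indexes(1 << 20))
pages = pages_touched(indexes)
print(len(indexes), "accesses touch", len(pages), "distinct pages")
```

Under these assumptions only a handful of the 2048 pages are actually needed, so any additional pages in the page cache come from readahead.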
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
