alamb commented on issue #3287:
URL: https://github.com/apache/arrow-rs/issues/3287#issuecomment-2584115139

   Copying @totoroyyb's high-level use case description from 
https://github.com/apache/arrow-rs/issues/6933
   
   They report a roughly 76x speedup (290ms down to 3.8ms) when disabling data validation:
   
   
   **Describe your question**
   I am using the high-level API (`FileReader` and `FileDecoder`) to read IPC 
files via mmap. I have noticed that `validate_data()` in the array-building 
process 
([here](https://github.com/apache/arrow-rs/blob/f7263e253655b2ee613be97f9d00e063444d3df5/arrow-data/src/data.rs#L1918-L1945))
 adds significant overhead.
   
   I am targeting an ultra-low-latency scenario. With `validate_data`, reading 
a 2.2GB IPC file (via mmap) takes 290ms; without it, the same read takes 3.8ms 
(tested locally by commenting the call out). The 3.8ms latency is nearly 
identical to that of the C++ Arrow implementation I tested, and I suspect the 
C++ codebase doesn't perform this sanity check (not entirely sure).
   
   The functions for "unchecked" building already exist in the codebase, but 
they are not accessible from the high-level API, so I cannot easily disable 
validation without constructing my own arrays and everything on top of them.
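   
   To illustrate the trade-off between checked and "unchecked" building, here 
is a minimal, std-only sketch. It is not arrow-rs code: `ToyStringArray`, 
`try_new`, `new_unchecked`, and `validate` are hypothetical names, and the 
checks are only assumed to resemble the kind of work a full data validation 
pass does for a string-like array. The checked constructor walks every offset 
and every byte, so its cost grows with the data size; the unchecked one only 
takes ownership of the buffers.

```rust
/// Hypothetical stand-in for a string-like array (not an arrow-rs type):
/// `offsets[i]..offsets[i + 1]` delimits value `i` inside `values`.
#[derive(Debug)]
pub struct ToyStringArray {
    offsets: Vec<usize>,
    values: Vec<u8>,
}

/// Full validation: visits every offset pair and every value byte.
fn validate(offsets: &[usize], values: &[u8]) -> Result<(), String> {
    if offsets.is_empty() {
        return Err("offsets must contain at least one entry".to_string());
    }
    for pair in offsets.windows(2) {
        let (start, end) = (pair[0], pair[1]);
        if start > end {
            return Err(format!("offsets decrease: {start} > {end}"));
        }
        if end > values.len() {
            return Err(format!("offset {end} exceeds {} value bytes", values.len()));
        }
        // Each value must be valid UTF-8 on its own.
        std::str::from_utf8(&values[start..end])
            .map_err(|e| format!("invalid UTF-8 in {start}..{end}: {e}"))?;
    }
    Ok(())
}

impl ToyStringArray {
    /// Checked constructor: cost is proportional to the size of the data.
    pub fn try_new(offsets: Vec<usize>, values: Vec<u8>) -> Result<Self, String> {
        validate(&offsets, &values)?;
        Ok(Self { offsets, values })
    }

    /// Unchecked constructor: constant time, no scan of the buffers.
    ///
    /// # Safety
    /// The caller must guarantee everything `validate` would have checked.
    pub unsafe fn new_unchecked(offsets: Vec<usize>, values: Vec<u8>) -> Self {
        Self { offsets, values }
    }

    /// Number of values stored in the array.
    pub fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    /// Returns value `i`; panics if `i` is out of range.
    pub fn value(&self, i: usize) -> &str {
        let bytes = &self.values[self.offsets[i]..self.offsets[i + 1]];
        // Sound only because construction established the UTF-8 invariant.
        unsafe { std::str::from_utf8_unchecked(bytes) }
    }
}
```

   In this sketch, exposing the unchecked path to callers means offering the 
`unsafe` constructor as an opt-in next to the checked one, which keeps 
validation as the default for untrusted input.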
   
   **I wonder if there is any better way to achieve that?**
   
   **Additional context**
   Low latency is critical in my case, so I am trying to avoid any additional 
overhead (perhaps using the C++ codebase as the baseline).
   


