alamb commented on issue #3287: URL: https://github.com/apache/arrow-rs/issues/3287#issuecomment-2584115139
Copying @totoroyyb's high-level use case description from https://github.com/apache/arrow-rs/issues/6933

They report a roughly 75x speedup (290ms down to 3.8ms) when disabling data validation:

**Describe your question**

I am using the high-level API (`FileReader` and `FileDecoder`) to read IPC files via mmap. I have noticed that `validate_data()` in the Array building process ([here](https://github.com/apache/arrow-rs/blob/f7263e253655b2ee613be97f9d00e063444d3df5/arrow-data/src/data.rs#L1918-L1945)) adds significant overhead.

I am targeting an ultra-low-latency scenario. With `validate_data` enabled, reading a 2.2GB IPC file (via mmap) took 290ms; without it, the same read took 3.8ms. I tested this locally by commenting the call out. The 3.8ms latency is nearly identical to the C++ Arrow implementation I tested, and I suspect the C++ codebase does not perform this sanity check (not entirely sure).

The functions for "unchecked" building exist in the codebase, but they are not accessible from the high-level API, so I cannot easily disable validation without constructing the arrays, and everything on top of them, myself.

**I wonder if there is any better way to achieve that?**

**Additional context**

Low latency is critical in my case, so I am trying to avoid any additional overhead (with the C++ codebase as the baseline, maybe?).
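
To illustrate why skipping validation can matter this much for memory-mapped data, here is a minimal, std-only sketch. It is not the arrow-rs implementation: `ToyStringColumn`, `try_new`, and `new_unchecked` are hypothetical names invented for this example. The checked constructor has to touch every offset and every byte (which, for an mmap'd file, also means paging the data in), while the unchecked one only stores two slices.

```rust
/// Toy stand-in for a variable-length string column: one contiguous byte
/// buffer plus offsets marking where each value starts and ends.
/// This is NOT an arrow-rs type; it only models the checked/unchecked split.
#[derive(Debug)]
struct ToyStringColumn<'a> {
    offsets: &'a [i32],
    values: &'a [u8],
}

impl<'a> ToyStringColumn<'a> {
    /// Checked constructor: cost grows with the size of the data because
    /// every offset and every byte is inspected.
    fn try_new(offsets: &'a [i32], values: &'a [u8]) -> Result<Self, String> {
        if offsets.is_empty() {
            return Err("offsets must contain at least one entry".to_string());
        }
        if offsets[0] < 0 {
            return Err("first offset must be non-negative".to_string());
        }
        // Offsets must never decrease.
        for w in offsets.windows(2) {
            if w[0] > w[1] {
                return Err(format!("offsets decrease: {} > {}", w[0], w[1]));
            }
        }
        // The last offset must stay inside the values buffer.
        let last = *offsets.last().unwrap() as usize;
        if last > values.len() {
            return Err(format!("offset {} exceeds buffer length {}", last, values.len()));
        }
        // Every value must be valid UTF-8; this reads the whole buffer.
        for w in offsets.windows(2) {
            let bytes = &values[w[0] as usize..w[1] as usize];
            std::str::from_utf8(bytes).map_err(|e| format!("invalid UTF-8: {}", e))?;
        }
        Ok(Self { offsets, values })
    }

    /// Unchecked constructor: constant time, no data is read.
    ///
    /// # Safety
    /// The caller must guarantee everything `try_new` would have verified.
    unsafe fn new_unchecked(offsets: &'a [i32], values: &'a [u8]) -> Self {
        Self { offsets, values }
    }

    /// Number of values in the column.
    fn len(&self) -> usize {
        self.offsets.len().saturating_sub(1)
    }

    /// Returns value `i`; panics if `i` is out of range.
    fn value(&self, i: usize) -> &str {
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        // Sound only because both constructors promise valid UTF-8.
        unsafe { std::str::from_utf8_unchecked(&self.values[start..end]) }
    }
}

fn main() {
    let offsets = [0, 5, 10];
    let values = b"helloarrow";

    // Validated path: safe for untrusted input, but scans all the data.
    let checked = ToyStringColumn::try_new(&offsets, values).unwrap();
    println!("checked: {} values, first = {}", checked.len(), checked.value(0));

    // Unvalidated path: only appropriate when the producer is trusted.
    let fast = unsafe { ToyStringColumn::new_unchecked(&offsets, values) };
    println!("unchecked: {} values, last = {}", fast.len(), fast.value(1));
}
```

The trade-off in the sketch is the same one the request is about: the unchecked path is only sound when the file comes from a trusted writer, which is why it sits behind `unsafe` here.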
