raulcd opened a new issue, #49321: URL: https://github.com/apache/arrow/issues/49321
### Describe the enhancement requested

As part of:
- https://github.com/apache/arrow/issues/36411

The discussion about adding a sanitizers build for PyArrow came up. I am creating this issue to track that discussion and raise it as a separate enhancement.

Summary of the discussion there so far:

> I think the main difficulty for a PyArrow sanitizers build is that the sanitizer instrumentation should be enabled in CPython as well (and potentially NumPy?).
>
> _Originally posted by @pitrou in [#36411](https://github.com/apache/arrow/issues/36411#issuecomment-3916508307)_

> You may be interested in how numpy & scipy are doing this, in conjunction with CPython. That setup uses pixi as a kind of "light-weight conda-build" orchestrator that wraps the various rebuilds (independent of whether that's via CMake/meson/whatever):
> * https://github.com/python/cpython/issues/142466
> * https://github.com/python/cpython/pull/142872
> * https://github.com/numpy/numpy/pull/30510
> * https://github.com/scipy/scipy/pull/24066
> * etc.
>
> _Originally posted by @h-vetinari in [#36411](https://github.com/apache/arrow/issues/36411#issuecomment-3916859990)_

> That's an ideal setup but I don't think it's required - you could point LD_PRELOAD to the sanitizer library to have it loaded correctly from a process that was not built with sanitizers enabled (i.e. Python). We used to do that in CI with pandas, although we did abandon it after some time because it was a maintenance burden.
>
> _Originally posted by @WillAyd in [#36411](https://github.com/apache/arrow/issues/36411#issuecomment-3916925502)_

> Is that enough, though? Ideally, the code is instrumented at compile time (memory accesses etc.). For example, if PyArrow passes a bogus memory pointer to NumPy, we want ASan to notice, and that might not happen if NumPy was not compiled with ASan enabled.
>
> _Originally posted by @pitrou in [#36411](https://github.com/apache/arrow/issues/36411#issuecomment-3916949900)_

> Yeah, for ASAN/TSAN, you need to instrument the other relevant libraries, which means rebuilding them, which is generally a huge pain, which is why the approach I referenced above provides a real benefit. Once all the pieces are in place, it comes down to
> ```
> pixi run test-asan -t some_test
> ```
> which rebuilds (& caches) instrumented cpython, numpy etc. as necessary. I haven't been very involved, but the scipy PR contains more details; and I'm pretty sure that Lucas wouldn't mind answering questions (not tagged here because it's already a bit OT).
>
> _Originally posted by @h-vetinari in [#36411](https://github.com/apache/arrow/issues/36411#issuecomment-3917154312)_

### Component(s)

Python

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
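
For readers unfamiliar with the `LD_PRELOAD` approach mentioned in the discussion, here is a minimal sketch of what it could look like when driven from Python. It is an illustration under stated assumptions, not what pandas or Arrow CI actually ran: the helper names (`find_asan_runtime`, `asan_env`, `run_tests_with_asan`) are hypothetical, it assumes Linux with a GCC toolchain whose ASan runtime is named `libasan.so`, and the `detect_leaks=0` default is an assumption made because an uninstrumented CPython tends to produce many leak reports at exit. As noted in the thread, this only loads the ASan runtime into the process; it does not instrument CPython or NumPy, so invalid accesses inside those libraries may still go unnoticed.

```python
import os
import subprocess
import sys


def find_asan_runtime(compiler="gcc"):
    """Ask the compiler where its ASan runtime lives; return None if unknown."""
    try:
        out = subprocess.run(
            [compiler, "-print-file-name=libasan.so"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None
    # GCC echoes the bare file name back when it cannot resolve it,
    # so only accept an absolute path that actually exists.
    return out if os.path.isabs(out) and os.path.exists(out) else None


def asan_env(runtime_path, base_env=None):
    """Return a copy of the environment with the ASan runtime preloaded."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    # The ASan runtime has to come first in the preload list.
    env["LD_PRELOAD"] = (
        runtime_path if not existing else f"{runtime_path}:{existing}"
    )
    # Assumption: silence leak reports coming from the uninstrumented
    # interpreter. Any ASAN_OPTIONS already set by the caller is kept.
    env.setdefault("ASAN_OPTIONS", "detect_leaks=0")
    return env


def run_tests_with_asan(pytest_args):
    """Run pytest in a child interpreter that has the ASan runtime preloaded."""
    runtime = find_asan_runtime()
    if runtime is None:
        raise RuntimeError("could not locate libasan.so via gcc")
    cmd = [sys.executable, "-m", "pytest", *pytest_args]
    return subprocess.run(cmd, env=asan_env(runtime)).returncode
```

Something like `run_tests_with_asan(["some_test_dir"])` would then run the tests against an extension module compiled with `-fsanitize=address`, while the interpreter itself stays uninstrumented.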
