josiahyan commented on pull request #8214: URL: https://github.com/apache/arrow/pull/8214#issuecomment-696065926
Yep, I'm worried that the buffers can be changed unintentionally/intentionally, and as you pointed out, needs to be always checked for and trapped. I'm wondering if the overhead of these checks are just inherent in such a flexible API, and quite significant (for ArrowBuf without flipping the `arrow.enable_unsafe_memory_access=false` flag globally, it does consume ~30% of CPU when `setSafe`ing), and worth removing for the general use-case. (where this by a Builder API that defers generalization, or a special (local) ArrowBuf mode if possible). We could jump into the unsafe APIs, like [this](https://github.com/apache/iceberg/issues/9#issuecomment-517486998) [code](https://github.com/apache/iceberg/blob/master/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java#L107), but a more generic out-of-the-box solution that makes the happy path for serialization easy would be nice. Ideally, we'd remove at least ~70% of the CPU cost of calling `setSafe` before this patch, for users that only want to append data, with an eye towards pushing it down even more given the simpler code. I'm less familiar with the Arrow code than I'd like to be though (right now), and I'm not sure if there are other ways around this that don't involve doing `arrow.enable_unsafe_memory_access=true`, and various other tweaks. There is inherent value in keeping only one or two (high level) interfaces for building Arrow buffers, and optimizing for the common use case. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
