jorgecarleitao commented on pull request #9211: URL: https://github.com/apache/arrow/pull/9211#issuecomment-761758608
@nevi-me , no need to apologize, no one saw this, and we also had no tests to cover this. Wrt to the offsets, I admit that I had a branch where I removed the `Buffer::offset` altogether. However, I saw some perf implications somewhere and abandoned the idea. We could just take the perf hit to remove this complexity, even though IMO the buffer offsets are not very painful to deal with, as they are opaque to the `ArrayData`: even bit operations automatically take that offset into account. The slot offset (`ArrayData`) is indeed more challenge, but it is part of the spec, so we need to deal with it. One aspect here is that unfortunately we have a lot of units of measurement: 1. Offset measured in slots (`ArrayData::offset`) 2. Offset measured in bytes (`Buffer::offset`, `ArrayData::offset * size_of`) 3. Offset is sometimes measured in bits (for bitmap operations) Overall, this adds complexity to how to reason about the code. I am not sure we have an easy way to address this, as they are indeed needed. One idea is to declare each measurement to be a different type and perform explicit casts (which the compiler will optimize out as they area all represented by `usize`), so that we have the compiler on our side when adding different types of offsets (e.g. slot offset vs byte offset). ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
