Manishearth opened a new issue, #10164: URL: https://github.com/apache/arrow-rs/issues/10164
> [!NOTE] > This finding was identified during an agentic unsafe Rust code review performed by Gemini AI, followed by human review and verification. ### The Issue In `src/util/bit_util.rs`, the trait `FromBytes` is exported as a public `unsafe trait` (when the `experimental` crate feature is enabled). Its `# Safety` contract promises callers safety based solely on bit pattern validity: https://github.com/apache/arrow-rs/blob/56.2.0/parquet/src/util/bit_util.rs#L40-L48 ```rust /// # Safety /// All bit patterns 00000xxxx, where there are `BIT_CAPACITY` `x`s, /// must be valid, unless BIT_CAPACITY is 0. pub unsafe trait FromBytes: Sized { const BIT_CAPACITY: usize; type Buffer: AsMut<[u8]> + Default; fn try_from_le_slice(b: &[u8]) -> Result<Self>; fn from_le_bytes(bs: Self::Buffer) -> Self; } ``` The method `BitReader::get_batch` accepts a mutable slice `&mut [T]` where `T: FromBytes`. When `size_of::<T>()` is 2, 4, or 8, `get_batch` casts `batch.as_mut_ptr()` directly to `*mut u16`, `*mut u32`, or `*mut u64` and invokes `std::slice::from_raw_parts_mut`: ```rust match size_of::<T>() { ... 2 => { let ptr = batch.as_mut_ptr() as *mut u16; ... let out = unsafe { std::slice::from_raw_parts_mut(ptr, batch.len()) }; ``` Because `batch` is a slice `&mut [T]`, its backing pointer is guaranteed only to be aligned to `align_of::<T>()`. Because the `# Safety` contract on `FromBytes` specifies no alignment obligation, downstream code can soundly implement `FromBytes` for a custom 1-byte aligned type of size 2 (e.g. `#[repr(packed)] struct UnalignedU16([u8; 2]);`). Passing `&mut [UnalignedU16]` to `get_batch` obeys all documented safety preconditions, but executes `from_raw_parts_mut` on an unaligned pointer `*mut u16`. Constructing a reference `&mut [u16]` from a misaligned pointer is immediate Undefined Behavior. <details><summary>Minimal Reproduction (Miri)</summary> ```rust // Standalone reproduction for parquet FromBytes missing alignment obligation UB. use parquet::errors::Result; use parquet::util::bit_util::{BitReader, FromBytes}; #[repr(packed)] struct UnalignedU16([u8; 2]); unsafe impl FromBytes for UnalignedU16 { const BIT_CAPACITY: usize = 16; type Buffer = [u8; 2]; fn try_from_le_slice(b: &[u8]) -> Result<Self> { Ok(UnalignedU16([b[0], b[1]])) } fn from_le_bytes(bs: Self::Buffer) -> Self { UnalignedU16(bs) } } fn main() { // 64 bytes of input data to support reading 16 values of 16 bits (32 bytes) let buf = [0u8; 64]; let mut reader = BitReader::new(bytes::Bytes::copy_from_slice(&buf)); // Storage for 16 elements of UnalignedU16 (32 bytes) + padding for offset let mut storage = [0u8; 36]; // Dynamically guarantee that slice_ptr is at an ODD address (misaligned for u16) let addr = storage.as_mut_ptr() as usize; let offset = if addr % 2 == 0 { 1 } else { 2 }; let slice_ptr = unsafe { storage.as_mut_ptr().add(offset) as *mut UnalignedU16 }; let batch = unsafe { std::slice::from_raw_parts_mut(slice_ptr, 16) }; // This will now enter the fast-path loop and trigger the alignment violation let _ = reader.get_batch(batch, 16); } ``` ```text error: Undefined Behavior: constructing invalid value of type &mut [u16]: encountered an unaligned reference (required 2 byte alignment but found 1) --> /google/src/cloud/manishearth/verify-unsafe-rust-bugs/google3/third_party/rust/parquet/v56/src/util/bit_util.rs:506:36 | 506 | let out = unsafe { std::slice::from_raw_parts_mut(ptr, batch.len()) }; | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Undefined Behavior occurred here | = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information = note: stack backtrace: 0: parquet::util::bit_util::BitReader::get_batch::<UnalignedU16> at /google/src/cloud/manishearth/verify-unsafe-rust-bugs/google3/third_party/rust/parquet/v56/src/util/bit_util.rs:506:36: 506:84 1: main at src/bin/repro1.rs:34:13: 34:40 ``` </details> <details><summary>Suggested Fix</summary> Update the `# Safety` contract of `FromBytes` to explicitly state that implementors must guarantee `align_of::<Self>() >= align_of::<Primitive>()` matching primitive types of `size_of::<Self>()`. Alternatively, seal `FromBytes` via a private supertrait so external crates cannot implement it. </details> -------------------------------------------------------------------------------- > [!NOTE] > The full audit report below also contains additional minor findings (such as missing safety comments or undocumented FFI assumptions) that are probably worth fixing as well but not the primary goal of this issue. <details><summary>Full Gemini Unsafe Code Audit Report</summary> # Unsafe Rust Review: `parquet` (`v56`) ## Overall Safety Assessment `parquet` contains a moderate amount of `unsafe` code (~20 blocks across 17 files), primarily focused on zero-copy transmutations (e.g., bloom filter blocks, bit unpacking fast-paths), Arrow buffer construction (`build_unchecked`), and string UTF-8 validation optimizations. Architecturally, the crate demonstrates strong encapsulation principles: core traits such as `ParquetValueType` and `DataType` are sealed via private supertraits (`GetDecoder`, `MakeStatistics`). This prevents downstream crates from implementing them for arbitrary types and breaking internal invariants assumed by unsafe fast-paths. However, formal safety documentation and proof-obligation hygiene are lacking: - The public `FromBytes` trait (accessible when the `experimental` crate feature flag is enabled) is unsealed and its safety contract omits alignment requirements. This allows downstream code to implement `FromBytes` for 1-byte aligned custom types (e.g., `#[repr(packed)]`) while adhering 100% to the documented contract, triggering undefined behavior when passed to `BitReader::get_batch`. - Numerous `unsafe` blocks across Arrow array readers and buffer views lack formal `// SAFETY:` comments or rely on informal comments / non-standard prefixes (e.g., `// Safety:`, `// # Safety`, `// SAFETY -`). ## Critical Findings ### Unsoundness in Public `FromBytes` Trait and `BitReader::get_batch` 🔴 🤦 - **Severity**: 🔴 High - **Threat Vector**: 🤦 Accidental Misuse - **Bug Type**: Alignment Violation, Missing Alignment Obligation The trait `FromBytes` is defined in `src/util/bit_util.rs` as: ```rust /// # Safety /// All bit patterns 00000xxxx, where there are `BIT_CAPACITY` `x`s, /// must be valid, unless BIT_CAPACITY is 0. pub unsafe trait FromBytes: Sized { const BIT_CAPACITY: usize; type Buffer: AsMut<[u8]> + Default; fn try_from_le_slice(b: &[u8]) -> Result<Self>; fn from_le_bytes(bs: Self::Buffer) -> Self; } ``` When the crate's `experimental` feature flag is enabled, `parquet::util::bit_util::FromBytes` is exported as a public, unsealed `unsafe trait`. The function `BitReader::get_batch` is defined as: ```rust pub fn get_batch<T: FromBytes>(&mut self, batch: &mut [T], num_bits: usize) -> usize ``` Internally, when `size_of::<T>()` is 2, 4, or 8, `get_batch` casts `batch.as_mut_ptr()` directly to `*mut u16`, `*mut u32`, or `*mut u64` and invokes `std::slice::from_raw_parts_mut`: ```rust match size_of::<T>() { ... 2 => { let ptr = batch.as_mut_ptr() as *mut u16; ... let out = unsafe { std::slice::from_raw_parts_mut(ptr, batch.len()) }; ``` Because `batch` is a slice `&mut [T]`, `batch.as_mut_ptr()` is guaranteed only to be aligned to `std::mem::align_of::<T>()`. Because the documented `# Safety` contract on `FromBytes` requires only validity of bit patterns and does not specify alignment obligations, a downstream crate can soundly implement `FromBytes` for an unaligned custom type of size 2: ```rust #[repr(packed)] struct UnalignedU16([u8; 2]); // size 2, align 1 unsafe impl FromBytes for UnalignedU16 { const BIT_CAPACITY: usize = 16; type Buffer = [u8; 2]; ... } ``` Calling `bit_reader.get_batch(&mut unaligned_slice, 16)` obeys all documented preconditions, but executes `from_raw_parts_mut` on an unaligned pointer `*mut u16`. Constructing a reference `&mut [u16]` from a misaligned pointer is immediate Undefined Behavior. **Remedy:** Update the `# Safety` contract of `FromBytes` to explicitly state that implementors must guarantee `std::mem::align_of::<Self>() >= std::mem::align_of::<Primitive>()` (or match the natural alignment required by primitive types of `size_of::<Self>()`). Alternatively, seal `FromBytes` via a private supertrait so external crates cannot implement it. ## Fishy Findings None. ## Missing Safety Comments Here are the exact locations where safety comments are missing or use non-standard formatting (such as `// Safety:`, `// # Safety`, or `// SAFETY -`), along with rigorous proposed proof comments: ### 1. `src/arrow/buffer/view_buffer.rs:82` - `ViewBuffer::into_array` (`Utf8View`) 🟡 ```rust // SAFETY: // - All views added to `self.views` were validated to refer to valid UTF-8 data within bounds of `self.buffers` during insertion in `append_view_unchecked` / `append_raw_view_unchecked`. // - `builder` is constructed with exact lengths and buffers matching `self.views` and `self.buffers`. let array = unsafe { builder.build_unchecked() }; ``` ### 2. `src/arrow/buffer/view_buffer.rs:91` - `ViewBuffer::into_array` (`BinaryView`) 🔴 ```rust // SAFETY: // - All views added to `self.views` refer to valid byte ranges within bounds of `self.buffers` as enforced by `ViewBuffer` insertion methods. // - `builder` is constructed with exact lengths and buffers matching `self.views` and `self.buffers`. let array = unsafe { builder.build_unchecked() }; ``` ### 3. `src/bloom_filter/mod.rs:298` - `BloomFilter::write_bitset` 🟡 ```rust // SAFETY: // - `Block` is `#[repr(transparent)] struct Block([u32; 8])`. // - `self.0` is a contiguous slice `&[Block]`. // - `self.0.as_ptr() as *const u8` is valid for reads of `self.0.len() * size_of::<Block>()` bytes. // - `u8` has no validity or alignment constraints. let slice = unsafe { std::slice::from_raw_parts( self.0.as_ptr() as *const u8, self.0.len() * size_of::<Block>(), ) }; ``` ### 4. `src/arrow/buffer/offset_buffer.rs:134` - `OffsetBuffer::into_array` 🔴 ```rust // SAFETY: // - `OffsetBuffer` is crate-internal and constructed only with monotonically increasing offsets starting at 0 and within bounds of `self.values`. // - For UTF-8 arrays, character boundary and validity constraints are verified prior to buffer construction. false => unsafe { array_data_builder.build_unchecked() }, ``` ### 5. `src/arrow/array_reader/byte_view_array.rs:343` - `ByteViewArrayDecoderPlain::read` 🔴 ```rust // SAFETY: // - `self.offset + 4 <= self.buf.len()` is explicitly checked by the conditional EOF return immediately above. let len_bytes: [u8; 4] = unsafe { buf.get_unchecked(self.offset..self.offset + 4) .try_into() .unwrap() }; ``` ### 6. `src/arrow/array_reader/byte_view_array.rs:383` - `ByteViewArrayDecoderPlain::read` 🔴 ```rust // SAFETY: // - `utf8_validation_begin <= self.offset <= buf.len()` is maintained by loop invariants and bounds checks on `self.offset`. check_valid_utf8(unsafe { buf.get_unchecked(utf8_validation_begin..self.offset) })?; ``` ### 7. `src/arrow/array_reader/byte_view_array.rs:391` - `ByteViewArrayDecoderPlain::read` 🔴 ```rust // SAFETY: // - `block_id` is returned from `output.append_block`. // - `start_offset + len == end_offset <= buf.len()` is verified by the conditional EOF return above. unsafe { output.append_view_unchecked(block_id, start_offset as u32, len); } ``` ### 8. `src/arrow/array_reader/byte_view_array.rs:400` - `ByteViewArrayDecoderPlain::read` 🔴 ```rust // SAFETY: // - `utf8_validation_begin <= self.offset <= buf.len()` is guaranteed by loop invariants. check_valid_utf8(unsafe { buf.get_unchecked(utf8_validation_begin..self.offset) })?; ``` ### 9. `src/arrow/array_reader/byte_view_array.rs:484` - `ByteViewArrayDecoderDictionary::read` 🟡 ```rust // SAFETY: // - `view` was retrieved from `dict.views`, which contains valid ByteView entries. unsafe { output.append_raw_view_unchecked(view); } ``` ### 10. `src/arrow/array_reader/byte_view_array.rs:493` - `ByteViewArrayDecoderDictionary::read` 🟡 ```rust // SAFETY: // - `view` originates from dictionary views and its buffer index is adjusted to point to the copied base buffer in `output`. unsafe { output.append_raw_view_unchecked(&view.into()); } ``` ### 11. `src/arrow/array_reader/byte_view_array.rs:571` - `ByteViewArrayDecoderDeltaLength::read` 🟡 ```rust // SAFETY: // - `block_id` corresponds to the appended data buffer. // - `current_offset + length` is bounded by `self.data.len()` as validated during decoder initialization. unsafe { output.append_view_unchecked(block_id, current_offset as u32, *length as u32) } ``` ### 12. `src/arrow/array_reader/byte_view_array.rs:642` - `ByteViewArrayDecoderDelta::read` (non-utf8 branch) 🟡 ```rust // SAFETY: // - `view` is constructed with `buffer_id` pointing to the newly allocated buffer in `output` and `offset` within bounds. unsafe { output.append_raw_view_unchecked(&view); } ``` ### 13. `src/arrow/array_reader/byte_view_array.rs:667` - `ByteViewArrayDecoderDelta::read` (utf8 branch) 🟡 ```rust // SAFETY: // - `view` is constructed with `buffer_id` pointing to the newly allocated buffer in `output` and `offset` within bounds. unsafe { output.append_raw_view_unchecked(&view); } ``` ### 14. `src/arrow/buffer/dictionary_buffer.rs:174` - `DictionaryBuffer::into_array` 🔴 ```rust // SAFETY: // - `DictionaryBuffer` is crate-internal and ensures `keys` indices are within bounds of `values`. false => unsafe { builder.build_unchecked() }, ``` ### 15. `src/arrow/array_reader/list_array.rs:227` - `ListArrayReader::consume_batch` 🔴 ```rust // SAFETY: // - `list_offsets` and `child_data` form a valid Arrow ListArray structure as constructed by internal list decoders. let list_data = unsafe { data_builder.build_unchecked() }; ``` ### 16. `src/arrow/array_reader/struct_array.rs:175` - `StructArrayReader::consume_batch` 🔴 ```rust // SAFETY: // - `children` arrays and null bitmaps are decoded with matching lengths and valid Arrow definitions. let array_data = unsafe { array_data_builder.build_unchecked() }; ``` ### 17. `src/arrow/array_reader/map_array.rs:105` - `MapArrayReader::consume_batch` 🟡 ```rust // SAFETY: // - `ListArrayReader` produces a valid `ListArray` with a `StructArray` child matching Arrow Map layout specifications. Ok(Arc::new(MapArray::from(unsafe { builder.build_unchecked() }))) ``` ### 18. `src/arrow/arrow_writer/levels.rs:509` - `LevelInfo::filter_def_levels_and_non_null_indices` 🟡 ```rust // SAFETY: // - `i < range.end <= nulls.len()` is guaranteed by the assertion `range.end <= nulls.len()` above. let valid = unsafe { nulls.value_unchecked(i) }; ``` ### 19. `src/arrow/array_reader/primitive_array.rs:220` - `PrimitiveArrayReader::consume_batch` 🔴 ```rust // SAFETY: // - `record_data` and bitmap buffer lengths match `num_values` as consumed from `RecordReader`. let array_data = unsafe { array_data.build_unchecked() }; ``` ### 20. `src/arrow/array_reader/fixed_len_byte_array.rs:180` - `FixedLenByteArrayReader::consume_batch` 🔴 ```rust // SAFETY: // - `record_data.buffer` is sized to exact multiples of `self.byte_length` matching `num_values`. let binary = FixedSizeBinaryArray::from(unsafe { array_data.build_unchecked() }); ``` ### 21. `src/arrow/array_reader/fixed_size_list_array.rs:205` - `FixedSizeListArrayReader::consume_batch` 🔴 ```rust // SAFETY: // - `child_data` length is an exact multiple of list size for `list_len` elements. let list_data = unsafe { list_builder.build_unchecked() }; ``` </details> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
