Manishearth opened a new issue, #10164:
URL: https://github.com/apache/arrow-rs/issues/10164

   > [!NOTE]
   > This finding was identified during an agentic unsafe Rust code review 
performed by Gemini AI, followed by human review and verification.
   
   ### The Issue
   
   In `src/util/bit_util.rs`, the trait `FromBytes` is exported as a public 
`unsafe trait` (when the `experimental` crate feature is enabled). Its `# 
Safety` contract promises callers safety based solely on bit pattern validity:
   
   
https://github.com/apache/arrow-rs/blob/56.2.0/parquet/src/util/bit_util.rs#L40-L48
   
   ```rust
   /// # Safety
   /// All bit patterns 00000xxxx, where there are `BIT_CAPACITY` `x`s,
   /// must be valid, unless BIT_CAPACITY is 0.
   pub unsafe trait FromBytes: Sized {
       const BIT_CAPACITY: usize;
       type Buffer: AsMut<[u8]> + Default;
       fn try_from_le_slice(b: &[u8]) -> Result<Self>;
       fn from_le_bytes(bs: Self::Buffer) -> Self;
   }
   ```
   
   The method `BitReader::get_batch` accepts a mutable slice `&mut [T]` where 
`T: FromBytes`. When `size_of::<T>()` is 2, 4, or 8, `get_batch` casts 
`batch.as_mut_ptr()` directly to `*mut u16`, `*mut u32`, or `*mut u64` and 
invokes `std::slice::from_raw_parts_mut`:
   
   ```rust
   match size_of::<T>() {
       ...
       2 => {
           let ptr = batch.as_mut_ptr() as *mut u16;
           ...
           let out = unsafe { std::slice::from_raw_parts_mut(ptr, batch.len()) 
};
   ```
   
   Because `batch` is a slice `&mut [T]`, its backing pointer is guaranteed 
only to be aligned to `align_of::<T>()`. Because the `# Safety` contract on 
`FromBytes` specifies no alignment obligation, downstream code can soundly 
implement `FromBytes` for a custom 1-byte aligned type of size 2 (e.g. 
`#[repr(packed)] struct UnalignedU16([u8; 2]);`).
   
   Passing `&mut [UnalignedU16]` to `get_batch` obeys all documented safety 
preconditions, but executes `from_raw_parts_mut` on an unaligned pointer `*mut 
u16`. Constructing a reference `&mut [u16]` from a misaligned pointer is 
immediate Undefined Behavior.
   
   <details><summary>Minimal Reproduction (Miri)</summary>
   
   ```rust
   // Standalone reproduction for parquet FromBytes missing alignment 
obligation UB.
   use parquet::errors::Result;
   use parquet::util::bit_util::{BitReader, FromBytes};
   
   #[repr(packed)]
   struct UnalignedU16([u8; 2]);
   
   unsafe impl FromBytes for UnalignedU16 {
       const BIT_CAPACITY: usize = 16;
       type Buffer = [u8; 2];
       fn try_from_le_slice(b: &[u8]) -> Result<Self> {
           Ok(UnalignedU16([b[0], b[1]]))
       }
       fn from_le_bytes(bs: Self::Buffer) -> Self {
           UnalignedU16(bs)
       }
   }
   
   fn main() {
       // 64 bytes of input data to support reading 16 values of 16 bits (32 
bytes)
       let buf = [0u8; 64];
       let mut reader = BitReader::new(bytes::Bytes::copy_from_slice(&buf));
       // Storage for 16 elements of UnalignedU16 (32 bytes) + padding for 
offset
       let mut storage = [0u8; 36];
   
       // Dynamically guarantee that slice_ptr is at an ODD address (misaligned 
for u16)
       let addr = storage.as_mut_ptr() as usize;
       let offset = if addr % 2 == 0 { 1 } else { 2 };
   
       let slice_ptr = unsafe { storage.as_mut_ptr().add(offset) as *mut 
UnalignedU16 };
       let batch = unsafe { std::slice::from_raw_parts_mut(slice_ptr, 16) };
   
       // This will now enter the fast-path loop and trigger the alignment 
violation
       let _ = reader.get_batch(batch, 16);
   }
   ```
   
   ```text
   error: Undefined Behavior: constructing invalid value of type &mut [u16]: 
encountered an unaligned reference (required 2 byte alignment but found 1)
      --> 
/google/src/cloud/manishearth/verify-unsafe-rust-bugs/google3/third_party/rust/parquet/v56/src/util/bit_util.rs:506:36
       |
   506 |                 let out = unsafe { std::slice::from_raw_parts_mut(ptr, 
batch.len()) };
       |                                    
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Undefined Behavior occurred 
here
       |
       = help: this indicates a bug in the program: it performed an invalid 
operation, and caused Undefined Behavior
       = help: see 
https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html 
for further information
       = note: stack backtrace:
               0: parquet::util::bit_util::BitReader::get_batch::<UnalignedU16>
                   at 
/google/src/cloud/manishearth/verify-unsafe-rust-bugs/google3/third_party/rust/parquet/v56/src/util/bit_util.rs:506:36:
 506:84
               1: main
                   at src/bin/repro1.rs:34:13: 34:40
   ```
   
   </details>
   
   <details><summary>Suggested Fix</summary>
   
   Update the `# Safety` contract of `FromBytes` to explicitly state that 
implementors must guarantee `align_of::<Self>() >= align_of::<Primitive>()` 
matching primitive types of `size_of::<Self>()`. Alternatively, seal 
`FromBytes` via a private supertrait so external crates cannot implement it.
   </details>
   
   
--------------------------------------------------------------------------------
   
   > [!NOTE]
   > The full audit report below also contains additional minor findings (such 
as missing safety comments or undocumented FFI assumptions) that are probably 
worth fixing as well but not the primary goal of this issue.
   
   <details><summary>Full Gemini Unsafe Code Audit Report</summary>
   
   # Unsafe Rust Review: `parquet` (`v56`)
   
   ## Overall Safety Assessment
   
   `parquet` contains a moderate amount of `unsafe` code (~20 blocks across 17 
files), primarily focused on zero-copy transmutations (e.g., bloom filter 
blocks, bit unpacking fast-paths), Arrow buffer construction 
(`build_unchecked`), and string UTF-8 validation optimizations.
   
   Architecturally, the crate demonstrates strong encapsulation principles: 
core traits such as `ParquetValueType` and `DataType` are sealed via private 
supertraits (`GetDecoder`, `MakeStatistics`). This prevents downstream crates 
from implementing them for arbitrary types and breaking internal invariants 
assumed by unsafe fast-paths.
   
   However, formal safety documentation and proof-obligation hygiene are 
lacking:
   
   -   The public `FromBytes` trait (accessible when the `experimental` crate
   feature flag is enabled) is unsealed and its safety contract omits alignment 
requirements. This allows downstream code to implement `FromBytes` for 1-byte 
aligned custom types (e.g., `#[repr(packed)]`) while adhering 100% to the 
documented contract, triggering undefined behavior when passed to 
`BitReader::get_batch`.
   
   -   Numerous `unsafe` blocks across Arrow array readers and buffer views lack
   formal `// SAFETY:` comments or rely on informal comments / non-standard 
prefixes (e.g., `// Safety:`, `// # Safety`, `// SAFETY -`).
   
   ## Critical Findings
   
   ### Unsoundness in Public `FromBytes` Trait and `BitReader::get_batch` 🔴 🤦
   
   -   **Severity**: 🔴 High
   -   **Threat Vector**: 🤦 Accidental Misuse
   -   **Bug Type**: Alignment Violation, Missing Alignment Obligation The trait
   `FromBytes` is defined in `src/util/bit_util.rs` as:
   
   ```rust
   
   /// # Safety
   /// All bit patterns 00000xxxx, where there are `BIT_CAPACITY` `x`s,
   /// must be valid, unless BIT_CAPACITY is 0.
   pub unsafe trait FromBytes: Sized {
       const BIT_CAPACITY: usize;
       type Buffer: AsMut<[u8]> + Default;
       fn try_from_le_slice(b: &[u8]) -> Result<Self>;
       fn from_le_bytes(bs: Self::Buffer) -> Self;
   }
   
   ```
   
   When the crate's `experimental` feature flag is enabled, 
`parquet::util::bit_util::FromBytes` is exported as a public, unsealed `unsafe 
trait`.
   
   The function `BitReader::get_batch` is defined as:
   
   ```rust
   
   pub fn get_batch<T: FromBytes>(&mut self, batch: &mut [T], num_bits: usize) 
-> usize
   
   ```
   
   Internally, when `size_of::<T>()` is 2, 4, or 8, `get_batch` casts 
`batch.as_mut_ptr()` directly to `*mut u16`, `*mut u32`, or `*mut u64` and 
invokes `std::slice::from_raw_parts_mut`:
   
   ```rust
   
           match size_of::<T>() {
               ...
               2 => {
                   let ptr = batch.as_mut_ptr() as *mut u16;
                   ...
                   let out = unsafe { std::slice::from_raw_parts_mut(ptr, 
batch.len()) };
   
   ```
   
   Because `batch` is a slice `&mut [T]`, `batch.as_mut_ptr()` is guaranteed 
only to be aligned to `std::mem::align_of::<T>()`. Because the documented `# 
Safety` contract on `FromBytes` requires only validity of bit patterns and does 
not specify alignment obligations, a downstream crate can soundly implement 
`FromBytes` for an unaligned custom type of size 2:
   
   ```rust
   
   #[repr(packed)]
   struct UnalignedU16([u8; 2]); // size 2, align 1
   
   unsafe impl FromBytes for UnalignedU16 {
       const BIT_CAPACITY: usize = 16;
       type Buffer = [u8; 2];
       ...
   }
   
   ```
   
   Calling `bit_reader.get_batch(&mut unaligned_slice, 16)` obeys all 
documented preconditions, but executes `from_raw_parts_mut` on an unaligned 
pointer `*mut u16`. Constructing a reference `&mut [u16]` from a misaligned 
pointer is immediate Undefined Behavior.
   
   **Remedy:** Update the `# Safety` contract of `FromBytes` to explicitly 
state that implementors must guarantee `std::mem::align_of::<Self>() >= 
std::mem::align_of::<Primitive>()` (or match the natural alignment required by 
primitive types of `size_of::<Self>()`). Alternatively, seal `FromBytes` via a 
private supertrait so external crates cannot implement it.
   
   ## Fishy Findings
   
   None.
   
   ## Missing Safety Comments
   
   Here are the exact locations where safety comments are missing or use 
non-standard formatting (such as `// Safety:`, `// # Safety`, or `// SAFETY 
-`), along with rigorous proposed proof comments:
   
   ### 1. `src/arrow/buffer/view_buffer.rs:82` - `ViewBuffer::into_array` 
(`Utf8View`) 🟡
   
   ```rust
   
   // SAFETY:
   // - All views added to `self.views` were validated to refer to valid UTF-8 
data within bounds of `self.buffers` during insertion in 
`append_view_unchecked` / `append_raw_view_unchecked`.
   // - `builder` is constructed with exact lengths and buffers matching 
`self.views` and `self.buffers`.
   let array = unsafe { builder.build_unchecked() };
   
   ```
   
   ### 2. `src/arrow/buffer/view_buffer.rs:91` - `ViewBuffer::into_array` 
(`BinaryView`) 🔴
   
   ```rust
   
   // SAFETY:
   // - All views added to `self.views` refer to valid byte ranges within 
bounds of `self.buffers` as enforced by `ViewBuffer` insertion methods.
   // - `builder` is constructed with exact lengths and buffers matching 
`self.views` and `self.buffers`.
   let array = unsafe { builder.build_unchecked() };
   
   ```
   
   ### 3. `src/bloom_filter/mod.rs:298` - `BloomFilter::write_bitset` 🟡
   
   ```rust
   
   // SAFETY:
   // - `Block` is `#[repr(transparent)] struct Block([u32; 8])`.
   // - `self.0` is a contiguous slice `&[Block]`.
   // - `self.0.as_ptr() as *const u8` is valid for reads of `self.0.len() * 
size_of::<Block>()` bytes.
   // - `u8` has no validity or alignment constraints.
   let slice = unsafe {
       std::slice::from_raw_parts(
           self.0.as_ptr() as *const u8,
           self.0.len() * size_of::<Block>(),
       )
   };
   
   ```
   
   ### 4. `src/arrow/buffer/offset_buffer.rs:134` - `OffsetBuffer::into_array` 🔴
   
   ```rust
   
   // SAFETY:
   // - `OffsetBuffer` is crate-internal and constructed only with 
monotonically increasing offsets starting at 0 and within bounds of 
`self.values`.
   // - For UTF-8 arrays, character boundary and validity constraints are 
verified prior to buffer construction.
   false => unsafe { array_data_builder.build_unchecked() },
   
   ```
   
   ### 5. `src/arrow/array_reader/byte_view_array.rs:343` - 
`ByteViewArrayDecoderPlain::read` 🔴
   
   ```rust
   
   // SAFETY:
   // - `self.offset + 4 <= self.buf.len()` is explicitly checked by the 
conditional EOF return immediately above.
   let len_bytes: [u8; 4] = unsafe {
       buf.get_unchecked(self.offset..self.offset + 4)
           .try_into()
           .unwrap()
   };
   
   ```
   
   ### 6. `src/arrow/array_reader/byte_view_array.rs:383` - 
`ByteViewArrayDecoderPlain::read` 🔴
   
   ```rust
   
   // SAFETY:
   // - `utf8_validation_begin <= self.offset <= buf.len()` is maintained by 
loop invariants and bounds checks on `self.offset`.
   check_valid_utf8(unsafe {
       buf.get_unchecked(utf8_validation_begin..self.offset)
   })?;
   
   ```
   
   ### 7. `src/arrow/array_reader/byte_view_array.rs:391` - 
`ByteViewArrayDecoderPlain::read` 🔴
   
   ```rust
   
   // SAFETY:
   // - `block_id` is returned from `output.append_block`.
   // - `start_offset + len == end_offset <= buf.len()` is verified by the 
conditional EOF return above.
   unsafe {
       output.append_view_unchecked(block_id, start_offset as u32, len);
   }
   
   ```
   
   ### 8. `src/arrow/array_reader/byte_view_array.rs:400` - 
`ByteViewArrayDecoderPlain::read` 🔴
   
   ```rust
   
   // SAFETY:
   // - `utf8_validation_begin <= self.offset <= buf.len()` is guaranteed by 
loop invariants.
   check_valid_utf8(unsafe { 
buf.get_unchecked(utf8_validation_begin..self.offset) })?;
   
   ```
   
   ### 9. `src/arrow/array_reader/byte_view_array.rs:484` - 
`ByteViewArrayDecoderDictionary::read` 🟡
   
   ```rust
   
   // SAFETY:
   // - `view` was retrieved from `dict.views`, which contains valid ByteView 
entries.
   unsafe {
       output.append_raw_view_unchecked(view);
   }
   
   ```
   
   ### 10. `src/arrow/array_reader/byte_view_array.rs:493` - 
`ByteViewArrayDecoderDictionary::read` 🟡
   
   ```rust
   
   // SAFETY:
   // - `view` originates from dictionary views and its buffer index is 
adjusted to point to the copied base buffer in `output`.
   unsafe {
       output.append_raw_view_unchecked(&view.into());
   }
   
   ```
   
   ### 11. `src/arrow/array_reader/byte_view_array.rs:571` - 
`ByteViewArrayDecoderDeltaLength::read` 🟡
   
   ```rust
   
   // SAFETY:
   // - `block_id` corresponds to the appended data buffer.
   // - `current_offset + length` is bounded by `self.data.len()` as validated 
during decoder initialization.
   unsafe { output.append_view_unchecked(block_id, current_offset as u32, 
*length as u32) }
   
   ```
   
   ### 12. `src/arrow/array_reader/byte_view_array.rs:642` - 
`ByteViewArrayDecoderDelta::read` (non-utf8 branch) 🟡
   
   ```rust
   
   // SAFETY:
   // - `view` is constructed with `buffer_id` pointing to the newly allocated 
buffer in `output` and `offset` within bounds.
   unsafe {
       output.append_raw_view_unchecked(&view);
   }
   
   ```
   
   ### 13. `src/arrow/array_reader/byte_view_array.rs:667` - 
`ByteViewArrayDecoderDelta::read` (utf8 branch) 🟡
   
   ```rust
   
   // SAFETY:
   // - `view` is constructed with `buffer_id` pointing to the newly allocated 
buffer in `output` and `offset` within bounds.
   unsafe {
       output.append_raw_view_unchecked(&view);
   }
   
   ```
   
   ### 14. `src/arrow/buffer/dictionary_buffer.rs:174` - 
`DictionaryBuffer::into_array` 🔴
   
   ```rust
   
   // SAFETY:
   // - `DictionaryBuffer` is crate-internal and ensures `keys` indices are 
within bounds of `values`.
   false => unsafe { builder.build_unchecked() },
   
   ```
   
   ### 15. `src/arrow/array_reader/list_array.rs:227` - 
`ListArrayReader::consume_batch` 🔴
   
   ```rust
   
   // SAFETY:
   // - `list_offsets` and `child_data` form a valid Arrow ListArray structure 
as constructed by internal list decoders.
   let list_data = unsafe { data_builder.build_unchecked() };
   
   ```
   
   ### 16. `src/arrow/array_reader/struct_array.rs:175` - 
`StructArrayReader::consume_batch` 🔴
   
   ```rust
   
   // SAFETY:
   // - `children` arrays and null bitmaps are decoded with matching lengths 
and valid Arrow definitions.
   let array_data = unsafe { array_data_builder.build_unchecked() };
   
   ```
   
   ### 17. `src/arrow/array_reader/map_array.rs:105` - 
`MapArrayReader::consume_batch` 🟡
   
   ```rust
   
   // SAFETY:
   // - `ListArrayReader` produces a valid `ListArray` with a `StructArray` 
child matching Arrow Map layout specifications.
   Ok(Arc::new(MapArray::from(unsafe {
       builder.build_unchecked()
   })))
   
   ```
   
   ### 18. `src/arrow/arrow_writer/levels.rs:509` - 
`LevelInfo::filter_def_levels_and_non_null_indices` 🟡
   
   ```rust
   
   // SAFETY:
   // - `i < range.end <= nulls.len()` is guaranteed by the assertion 
`range.end <= nulls.len()` above.
   let valid = unsafe { nulls.value_unchecked(i) };
   
   ```
   
   ### 19. `src/arrow/array_reader/primitive_array.rs:220` - 
`PrimitiveArrayReader::consume_batch` 🔴
   
   ```rust
   
   // SAFETY:
   // - `record_data` and bitmap buffer lengths match `num_values` as consumed 
from `RecordReader`.
   let array_data = unsafe { array_data.build_unchecked() };
   
   ```
   
   ### 20. `src/arrow/array_reader/fixed_len_byte_array.rs:180` - 
`FixedLenByteArrayReader::consume_batch` 🔴
   
   ```rust
   
   // SAFETY:
   // - `record_data.buffer` is sized to exact multiples of `self.byte_length` 
matching `num_values`.
   let binary = FixedSizeBinaryArray::from(unsafe { 
array_data.build_unchecked() });
   
   ```
   
   ### 21. `src/arrow/array_reader/fixed_size_list_array.rs:205` - 
`FixedSizeListArrayReader::consume_batch` 🔴
   
   ```rust
   
   // SAFETY:
   // - `child_data` length is an exact multiple of list size for `list_len` 
elements.
   let list_data = unsafe { list_builder.build_unchecked() };
   
   ```
   
   </details>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to