etseidl commented on code in PR #9868:
URL: https://github.com/apache/arrow-rs/pull/9868#discussion_r3174295986


##########
parquet/src/parquet_thrift.rs:
##########
@@ -678,16 +697,45 @@ impl<'a, R: ThriftCompactInputProtocol<'a>> 
ReadThrift<'a, R> for &'a [u8] {
     }
 }
 
+/// Fallback upper bound for [`Vec::with_capacity`] when the protocol does not
+/// expose a remaining-byte count. Picked conservatively: even the smallest
+/// Thrift element is at least one byte on the wire, and Parquet metadata
+/// payloads larger than this in any single list are well outside the spec.
+const READ_THRIFT_VEC_FALLBACK_CAP: usize = 1 << 20;
+
 /// Read a Thrift encoded [list] from the input protocol object.
 ///
+/// The list size is taken from the wire as an `i32`. Because that value is
+/// untrusted, the up-front [`Vec::with_capacity`] allocation is bounded by
+/// the number of bytes actually remaining in the input (when the protocol
+/// can report it) — every element on the wire costs at least one byte, so
+/// a list whose `size` exceeds the remaining input is malformed by
+/// definition. Streaming protocols that cannot report a remaining count
+/// fall back to [`READ_THRIFT_VEC_FALLBACK_CAP`].
+///
+/// Negative sizes are rejected explicitly. Although the Thrift spec uses an
+/// unsigned varint, a buggy or hostile encoder can produce a value whose
+/// `i32` interpretation is negative.
+///
 /// [list]: 
https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md#list-and-set
 pub(crate) fn read_thrift_vec<'a, T, R>(prot: &mut R) -> Result<Vec<T>>
 where
     R: ThriftCompactInputProtocol<'a>,
     T: ReadThrift<'a, R>,
 {
     let list_ident = prot.read_list_begin()?;
-    let mut res = Vec::with_capacity(list_ident.size as usize);
+    if list_ident.size < 0 {
+        return Err(general_err!(
+            "Thrift list has negative size {}",
+            list_ident.size
+        ));
+    }
+    let claimed = list_ident.size as usize;
+    let cap = match prot.remaining_bytes() {
+        Some(remaining) => claimed.min(remaining),
+        None => claimed.min(READ_THRIFT_VEC_FALLBACK_CAP),
+    };
+    let mut res = Vec::with_capacity(cap);

Review Comment:
   I'm not sure this is the correct fix. This will cap the initial allocation, 
but then below we continue to read `list_ident.size` elements and push them to 
the result. If `with_capacity` would fail, the `push` will as well 
(eventually). Capping `list_ident.size` will leave the parser in an invalid 
state if we exit decoding the list early (assuming there really are `size` 
elements present).
   
   Perhaps instead, validation should be moved to `read_list_begin`, which can 
return an error on an invalid `size`.
   
   Edit: I should have read more. I see below in the test you are expecting the 
read to eventually hit `EOF`. Also, looking at `read_list_begin`, I think the 
use of `i32` was a mistake caused by a misreading of the thrift spec. It should 
be a `u32` instead. But given the limitations imposed by java, I don't think we 
need to allow anything over 2B anyway.
   
   I think I'm ok with erroring if `size_of::<T>() * size` exceeds `remaining` 
when the latter is available, but I'm unsure about the capping done here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to