Re: [PR] Reduce one copy in `SerializedPageReader` [arrow-rs]

via GitHub Fri, 31 Oct 2025 06:12:28 -0700


alamb commented on code in PR #8745:
URL: https://github.com/apache/arrow-rs/pull/8745#discussion_r2481345379



##########
parquet/Cargo.toml:
##########
@@ -95,6 +95,7 @@ tokio = { version = "1.0", default-features = false, features 
= ["macros", "rt-m
 rand = { version = "0.9", default-features = false, features = ["std", 
"std_rng", "thread_rng"] }
 object_store = { version = "0.12.0", default-features = false, features = 
["azure", "fs"] }
 sysinfo = { version = "0.37.1", default-features = false, features = 
["system"] }
+mimalloc = { version = "*" }

Review Comment:
   I don't think we should add this in the parquet crate as it will conflict 
with downstream crates that want to use a different allocator



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -397,9 +397,9 @@ pub(crate) fn decode_page(
             }
             let decompressed_size = uncompressed_page_size - offset;
             let mut decompressed = Vec::with_capacity(uncompressed_page_size);
-            decompressed.extend_from_slice(&buffer.as_ref()[..offset]);
+            decompressed.extend_from_slice(&buffer[..offset]);

Review Comment:
   This seems an unrelated (but nice) cleanup
   
   While looking at this code, it seems like it always copies the compressed 
bytes, even when it then decompresses it immediately. I'll make a small PR to 
see if I can remove that unecessary copy



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -897,31 +897,23 @@ impl<R: ChunkReader> PageReader for 
SerializedPageReader<R> {
                         *remaining,
                     )?;
                     let data_len = header.compressed_page_size as usize;
+                    let data_start = *offset;
                     *offset += data_len as u64;
                     *remaining -= data_len as u64;
 
                     if header.r#type == PageType::INDEX_PAGE {
                         continue;
                     }
 
-                    let mut buffer = Vec::with_capacity(data_len);
-                    let read = read.take(data_len as u64).read_to_end(&mut 
buffer)?;
-
-                    if read != data_len {
-                        return Err(eof_err!(
-                            "Expected to read {} bytes of page, read only {}",
-                            data_len,
-                            read
-                        ));
-                    }
+                    let buffer = self.reader.get_bytes(data_start, data_len)?;

Review Comment:
   I confirm on review this can potentially avoid a copy if the underlying 
reader is already `Bytes`
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Reduce one copy in `SerializedPageReader` [arrow-rs]

Reply via email to