alamb commented on code in PR #8745:
URL: https://github.com/apache/arrow-rs/pull/8745#discussion_r2481345379
##########
parquet/Cargo.toml:
##########
@@ -95,6 +95,7 @@ tokio = { version = "1.0", default-features = false, features
= ["macros", "rt-m
rand = { version = "0.9", default-features = false, features = ["std",
"std_rng", "thread_rng"] }
object_store = { version = "0.12.0", default-features = false, features =
["azure", "fs"] }
sysinfo = { version = "0.37.1", default-features = false, features =
["system"] }
+mimalloc = { version = "*" }
Review Comment:
I don't think we should add this in the parquet crate as it will conflict
with downstream crates that want to use a different allocator
##########
parquet/src/file/serialized_reader.rs:
##########
@@ -397,9 +397,9 @@ pub(crate) fn decode_page(
}
let decompressed_size = uncompressed_page_size - offset;
let mut decompressed = Vec::with_capacity(uncompressed_page_size);
- decompressed.extend_from_slice(&buffer.as_ref()[..offset]);
+ decompressed.extend_from_slice(&buffer[..offset]);
Review Comment:
This seems an unrelated (but nice) cleanup
While looking at this code, it seems like it always copies the compressed
bytes, even when it then decompresses it immediately. I'll make a small PR to
see if I can remove that unecessary copy
##########
parquet/src/file/serialized_reader.rs:
##########
@@ -897,31 +897,23 @@ impl<R: ChunkReader> PageReader for
SerializedPageReader<R> {
*remaining,
)?;
let data_len = header.compressed_page_size as usize;
+ let data_start = *offset;
*offset += data_len as u64;
*remaining -= data_len as u64;
if header.r#type == PageType::INDEX_PAGE {
continue;
}
- let mut buffer = Vec::with_capacity(data_len);
- let read = read.take(data_len as u64).read_to_end(&mut
buffer)?;
-
- if read != data_len {
- return Err(eof_err!(
- "Expected to read {} bytes of page, read only {}",
- data_len,
- read
- ));
- }
+ let buffer = self.reader.get_bytes(data_start, data_len)?;
Review Comment:
I confirm on review this can potentially avoid a copy if the underlying
reader is already `Bytes`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]