Re: [PR] API for reading Variant data and metadata [arrow-rs]
alamb commented on PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#issuecomment-2905855753 I took the liberty of merging up from main and adding headers to the files to get CI to pass -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2105264488

## parquet-variant/src/variant.rs:

@@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> {
         self.bytes
     }

+    pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> {
+        let header = VariantMetadataHeader::try_new(bytes)?;
+        // Offset 1, index 0 because first element after header is dictionary size
+        let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?;
+
+        // TODO: Refactor, add test for validation
+        let valid = (0..=dict_size)
+            .map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1))
+            .scan(0, |prev, cur| {
+                let Ok(cur_offset) = cur else {
+                    return Some(false);
+                };
+                // Skip the first offset, which is always 0
+                if *prev == 0 {
+                    *prev = cur_offset;
+                    return Some(true);
+                }
+
+                let valid = cur_offset > *prev;
+                *prev = cur_offset;
+                Some(valid)
+            })
+            .all(|valid| valid);
+
+        if !valid {
+            return Err(ArrowError::InvalidArgumentError(
+                "Offsets are not monotonically increasing".to_string(),
+            ));
+        }
+        Ok(Self {
+            bytes,
+            header,
+            dict_size,
+        })
+    }
+
     /// Whether the dictionary keys are sorted and unique
     pub fn is_sorted(&self) -> bool {
-        todo!()
+        self.header.is_sorted
     }

-    /// Get the dict length
-    pub fn dict_len(&self) -> Result<usize, ArrowError> {
-        let end_location = self.offset_size()? as usize + 1;
-        if self.bytes.len() < end_location {
-            let err_str = format!(
-                "Invalid metadata bytes, must have at least length {} but has length {}",
-                &end_location,
-                self.bytes.len()
-            );
-            return Err(ArrowError::InvalidArgumentError(err_str));
-        }
-        let dict_len_bytes = &self.bytes[1..end_location];
-        let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| {
-            ArrowError::InvalidArgumentError(format!(
-                "Unable to convert dictionary_size bytes into usize: {}",
-                e,
-            ))
-        })?);
-        Ok(dict_len)
+    /// Get the dictionary size
+    pub fn dictionary_size(&self) -> usize {
+        self.dict_size
     }

-    pub fn version(&self) -> usize {
-        todo!()
+    pub fn version(&self) -> u8 {
+        self.header.version
     }

-    /// Get the offset by index
-    pub fn get_offset_by(&self, index: usize) -> Result<usize, ArrowError> {
-        todo!()
+    /// Get the offset by key-index
+    pub fn get_offset_by(&self, index: usize) -> Result<Range<usize>, ArrowError> {
+        // TODO: Should we memoize the offsets? There could be thousands of them
+        // (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294)
+        if index >= self.dict_size {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "Index {} out of bounds for dictionary of length {}",
+                index, self.dict_size
+            )));
+        }
+
+        // Skipping the header byte (setting byte_offset = 1) and the dictionary_size
+        // (setting offset_index + 1)
+        // TODO: Validate size before looking up?
+        // TODO: Fix location / bytes here, the index is wrong.
+        let start = self
+            .header
+            .offset_size
+            .unpack_usize(self.bytes, 1, index + 1)?;
+        let end = self
+            .header
+            .offset_size
+            .unpack_usize(self.bytes, 1, index + 2)?;
+        Ok(start..end)
     }

-    /// Get the header byte, which has the following form
-    ///           7   6   5   4   3     0
-    ///          +---+---+---+---+
-    /// header   |   |   |   |version|
-    ///          +---+---+---+---+
-    ///            ^       ^
-    ///            |       +-- sorted_strings
-    ///            +-- offset_size_minus_one
-    /// The version is a 4-bit value that must always contain the value 1.
-    /// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique.
-    /// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field.
-    /// - The actual number of bytes, offset_size, is offset_size_minus_one + 1
-    pub fn header(&self) -> Result<u8, ArrowError> {
-        if self.bytes.is_empty() {
-            return Err(ArrowError::InvalidArgumentError(
-                "Can't get header from empty buffer".to_string(),
-            ));
+    /// Get the key-name by index
+    pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> {
+        match self.get_offset_by(index)
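The rule that the `try_new` validation above enforces can be sketched standalone. This is a hypothetical helper (not the PR's code) that operates on already-unpacked offsets rather than raw bytes: the first dictionary offset must be 0, and each subsequent offset must be strictly greater than the previous one.

```rust
/// Hypothetical mirror of the monotonicity check in `try_new` above:
/// offsets must start at 0 and strictly increase.
fn offsets_monotonic(offsets: &[usize]) -> bool {
    offsets.first() == Some(&0) && offsets.windows(2).all(|w| w[0] < w[1])
}

fn main() {
    assert!(offsets_monotonic(&[0, 3, 7, 12]));
    assert!(!offsets_monotonic(&[0, 5, 5, 9])); // repeated offset is invalid
    assert!(!offsets_monotonic(&[1, 2, 3])); // must start at 0
}
```

The PR's version folds the same comparison into a `scan` so it can run directly over the fallible `unpack_usize` results in a single pass.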
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mapleFU commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2105173643

## parquet-variant/src/variant.rs:

@@ -0,0 +1,714 @@
+use crate::decoder::{
+    self, get_basic_type, get_primitive_type, VariantBasicType, VariantPrimitiveType,
+};
+use crate::utils::{array_from_slice, first_byte_from_slice, slice_from_slice, string_from_slice};
+use arrow_schema::ArrowError;
+use std::{
+    num::TryFromIntError,
+    ops::{Index, Range},
+};
+
+#[derive(Clone, Debug, Copy, PartialEq)]
+enum OffsetSizeBytes {
+    One = 1,
+    Two = 2,
+    Three = 3,
+    Four = 4,
+}
+
+impl OffsetSizeBytes {
+    /// Build from the `offset_size_minus_one` bits (see spec).
+    fn try_new(offset_size_minus_one: u8) -> Result<Self, ArrowError> {
+        use OffsetSizeBytes::*;
+        let result = match offset_size_minus_one {
+            0 => One,
+            1 => Two,
+            2 => Three,
+            3 => Four,
+            _ => {
+                return Err(ArrowError::InvalidArgumentError(
+                    "offset_size_minus_one must be 0–3".to_string(),
+                ))
+            }
+        };
+        Ok(result)
+    }
+
+    /// Return one unsigned little-endian value from `bytes`.
+    ///
+    /// * `bytes` – the Variant-metadata buffer.
+    /// * `byte_offset` – number of bytes to skip **before** reading the first
+    ///   value (usually `1` to move past the header byte).
+    /// * `offset_index` – 0-based index **after** the skip
+    ///   (`0` is the first value, `1` the next, …).
+    ///
+    /// Each value is `self as usize` bytes wide (1, 2, 3 or 4).
+    /// Three-byte values are zero-extended to 32 bits before the final
+    /// fallible cast to `usize`.
+    fn unpack_usize(
+        &self,
+        bytes: &[u8],
+        byte_offset: usize,  // how many bytes to skip
+        offset_index: usize, // which offset in an array of offsets
+    ) -> Result<usize, ArrowError> {
+        use OffsetSizeBytes::*;
+        let offset = byte_offset + (*self as usize) * offset_index;
+        let result = match self {
+            One => u8::from_le_bytes(array_from_slice(bytes, offset)?).into(),
+            Two => u16::from_le_bytes(array_from_slice(bytes, offset)?).into(),
+            Three => {
+                // Let's grab the three byte le-chunk first
+                let b3_chunks: [u8; 3] = array_from_slice(bytes, offset)?;
+                // Let's pad it and construct a padded u32 from it.
+                let mut buf = [0u8; 4];
+                buf[..3].copy_from_slice(&b3_chunks);
+                u32::from_le_bytes(buf)
+                    .try_into()
+                    .map_err(|e: TryFromIntError| ArrowError::InvalidArgumentError(e.to_string()))?
+            }
+            Four => u32::from_le_bytes(array_from_slice(bytes, offset)?)
+                .try_into()
+                .map_err(|e: TryFromIntError| ArrowError::InvalidArgumentError(e.to_string()))?,
+        };
+        Ok(result)
+    }
+}
+
+#[derive(Clone, Debug, Copy, PartialEq)]
+pub(crate) struct VariantMetadataHeader {
+    version: u8,
+    is_sorted: bool,
+    /// Note: This is `offset_size_minus_one` + 1
+    offset_size: OffsetSizeBytes,
+}
+
+// According to the spec this is currently always = 1, and so we store this const for validation
+// purposes and to make that visible.
+const CORRECT_VERSION_VALUE: u8 = 1;
+
+impl VariantMetadataHeader {
+    /// Tries to construct the variant metadata header, which has the form
+    ///           7   6   5   4   3     0
+    ///          +---+---+---+---+
+    /// header   |   |   |   |version|
+    ///          +---+---+---+---+
+    ///            ^       ^
+    ///            |       +-- sorted_strings
+    ///            +-- offset_size_minus_one
+    /// The version is a 4-bit value that must always contain the value 1.
+    /// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique.
+    /// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field.
+    /// - The actual number of bytes, offset_size, is offset_size_minus_one + 1
+    pub fn try_new(bytes: &[u8]) -> Result<Self, ArrowError> {
+        let header = first_byte_from_slice(bytes)?;
+
+        let version = header & 0x0F; // First four bits
+        if version != CORRECT_VERSION_VALUE {
+            let err_msg = format!(
+                "The version bytes in the header is not {CORRECT_VERSION_VALUE}, got {:b}",
+                version
+            );
+            return Err(ArrowError::InvalidArgumentError(err_msg));
+        }
+        let is_sorted = (header & 0x10) != 0; // Fifth bit
+        let offset_size_minus_one = header >> 6; // Last two bits
+        Ok(Self {
+            version,
+            is_sorted,
+            offset_size: OffsetSizeBytes::try_new(offset_size_minus_one)?,
+        })
+    }
+
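The interesting case in `unpack_usize` above is the three-byte width, which has no native Rust integer type. A minimal standalone sketch (hypothetical function name `unpack_u24_le`, bounds-checked with `Option` instead of `ArrowError` to stay dependency-free) of the zero-extend-then-read-u32 approach:

```rust
// Sketch of the 3-byte branch of `unpack_usize`: read a little-endian u24 at
// `byte_offset + 3 * offset_index` by zero-extending it into a 4-byte buffer.
fn unpack_u24_le(bytes: &[u8], byte_offset: usize, offset_index: usize) -> Option<usize> {
    let start = byte_offset + 3 * offset_index;
    let chunk: [u8; 3] = bytes.get(start..start + 3)?.try_into().ok()?;
    let mut buf = [0u8; 4];
    buf[..3].copy_from_slice(&chunk); // high byte stays 0 (zero-extension)
    Some(u32::from_le_bytes(buf) as usize)
}

fn main() {
    // One header byte, then two 3-byte LE values: 5 and 300 (0x00012C)
    let bytes = [0xFF, 5, 0, 0, 44, 1, 0];
    assert_eq!(unpack_u24_le(&bytes, 1, 0), Some(5));
    assert_eq!(unpack_u24_le(&bytes, 1, 1), Some(300));
    assert_eq!(unpack_u24_le(&[0xFF, 5, 0], 1, 0), None); // out of bounds
}
```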
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2105117041

## parquet-variant/src/variant.rs:

@@ -0,0 +1,321 @@
+use std::{borrow::Cow, ops::Index};
+
+use crate::decoder::{self, get_variant_type};
+use arrow_schema::ArrowError;
+use strum_macros::EnumDiscriminants;
+
+#[derive(Clone, Copy, Debug, PartialEq)]
+/// Encodes the Variant Metadata, see the Variant spec file for more information

Review Comment: Should we open a ticket to explore this after this one? Would like to get the API ready so my team can work in parallel on the rest of the primitives and optimisations needed
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#issuecomment-2904386363 I am refactoring the Array to have `try_new` and will try to consolidate to avoid double validations of metadata.
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104562325 (quoting the same `VariantMetadata::try_new` / `get_offset_by` hunk from parquet-variant/src/variant.rs as above)
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104557928 (on `dictionary_size` in parquet-variant/src/variant.rs, quoting the same hunk as above)

Review Comment: Np :D
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104556528 (quoting the same `VariantMetadata::try_new` / `get_offset_by` hunk from parquet-variant/src/variant.rs as above)
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104534207 (on `dictionary_size` in parquet-variant/src/variant.rs, quoting the same hunk as above)

Review Comment: Ah, I had it backward, sorry.
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104529271

## parquet-variant/src/utils.rs:

@@ -0,0 +1,42 @@
+use std::{array::TryFromSliceError, ops::Range};
+
+use arrow_schema::ArrowError;
+
+#[inline]
+pub(crate) fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], ArrowError> {
+    bytes.get(range.clone()).ok_or_else(|| {
+        ArrowError::InvalidArgumentError(format!(
+            "Tried to extract {} bytes at offset {} from {}-byte buffer",
+            range.end - range.start,
+            range.start,
+            bytes.len(),
+        ))

Review Comment: Update: If we go for the `SliceIndex` approach that just prints `{index:?}` in the error message, we no longer need to worry about overflow.
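The `SliceIndex`-style variant floated in that comment can be sketched as follows; this is an illustration, not the PR's code, and it uses a plain `String` error instead of `ArrowError` to stay self-contained. Printing the range with `{:?}` avoids computing `range.end - range.start`, which is where the overflow concern came from.

```rust
use std::ops::Range;

// Bounds-checked subslicing that reports the requested range on failure
// instead of panicking (and without any subtraction that could overflow).
fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], String> {
    bytes
        .get(range.clone())
        .ok_or_else(|| format!("index {range:?} out of bounds for {}-byte buffer", bytes.len()))
}

fn main() {
    let buf = [1u8, 2, 3, 4];
    assert_eq!(slice_from_slice(&buf, 1..3).unwrap(), &[2u8, 3][..]);
    assert!(slice_from_slice(&buf, 2..9).is_err());
}
```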
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104484565 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { +// TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294) +if index >= self.dict_size { +return Err(ArrowError::InvalidArgumentError(format!( +"Index {} out of bounds for dictionary of length {}", +index, self.dict_size +))); +} + +// Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1) +// TODO: Validate size before looking up? +// TODO: Fix location / bytes here, the index is wrong. +let start = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 1)?; +let end = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 2)?; +Ok(start..end) } -/// Get the header byte, which has the following form -/// 7 6 5 4 3 0 -/// +---+---+---+---+ -/// header | | | |version| -/// +---+---+---+---+ -/// ^ ^ -/// | +-- sorted_strings -/// +-- offset_size_minus_one -/// The version is a 4-bit value that must always contain the value 1. -/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique. 
-/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. -/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 -pub fn header(&self) -> Result { -if self.bytes.is_empty() { -return Err(ArrowError::InvalidArgumentError( -"Can't get header from empty buffer".to_string(), -)); +/// Get the key-name by index +pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> { +match self.get_offset_by(index)
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104508312

## parquet-variant/src/variant.rs:

@@ -0,0 +1,321 @@
+use std::{borrow::Cow, ops::Index};
+
+use crate::decoder::{self, get_variant_type};
+use arrow_schema::ArrowError;
+use strum_macros::EnumDiscriminants;
+
+#[derive(Clone, Copy, Debug, PartialEq)]
+/// Encodes the Variant Metadata, see the Variant spec file for more information
+pub struct VariantMetadata<'m> {
+    bytes: &'m [u8],
+}
+
+impl<'m> VariantMetadata<'m> {
+    /// View the raw bytes (needed by very low-level decoders)
+    #[inline]
+    pub const fn as_bytes(&self) -> &'m [u8] {
+        self.bytes
+    }
+
+    /// Whether the dictionary keys are sorted and unique
+    pub fn is_sorted(&self) -> bool {
+        todo!()
+    }
+
+    /// Get the dict length
+    pub fn dict_len(&self) -> Result<usize, ArrowError> {
+        let end_location = self.offset_size()? as usize + 1;
+        if self.bytes.len() < end_location {
+            let err_str = format!(
+                "Invalid metadata bytes, must have at least length {} but has length {}",
+                &end_location,
+                self.bytes.len()
+            );
+            return Err(ArrowError::InvalidArgumentError(err_str));
+        }
+        let dict_len_bytes = &self.bytes[1..end_location];
+        let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| {
+            ArrowError::InvalidArgumentError(format!(
+                "Unable to convert dictionary_size bytes into usize: {}",
+                e,
+            ))
+        })?);
+        Ok(dict_len)
+    }
+
+    pub fn version(&self) -> usize {
+        todo!()
+    }
+
+    /// Get the offset by index
+    pub fn get_offset_by(&self, index: usize) -> Result<usize, ArrowError> {
+        todo!()
+    }
+
+    /// Get the header byte, which has the following form
+    ///           7   6   5   4   3     0
+    ///          +---+---+---+---+
+    /// header   |   |   |   |version|
+    ///          +---+---+---+---+
+    ///            ^       ^
+    ///            |       +-- sorted_strings
+    ///            +-- offset_size_minus_one
+    /// The version is a 4-bit value that must always contain the value 1.
+    /// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique.
+    /// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field.
+    /// - The actual number of bytes, offset_size, is offset_size_minus_one + 1
+    pub fn header(&self) -> Result<u8, ArrowError> {
+        if self.bytes.is_empty() {
+            return Err(ArrowError::InvalidArgumentError(
+                "Can't get header from empty buffer".to_string(),
+            ));
+        }
+        Ok(self.bytes[0])
+    }
+
+    /// Get the offset_minus_one value from the header
+    pub fn offset_size_minus_one(&self) -> Result<u8, ArrowError> {
+        if self.bytes.is_empty() {
+            Err(ArrowError::InvalidArgumentError(
+                "Tried to get offset_size_minus_one from header, but self.bytes buffer is emtpy."
+                    .to_string(),
+            ))
+        } else {
+            Ok(self.bytes[0] & (0b11 << 6)) // Grab the last 2 bits
+        }
+    }
+
+    /// Get the offset_size
+    pub fn offset_size(&self) -> Result<u8, ArrowError> {
+        Ok(self.offset_size_minus_one()? + 1)
+    }
+
+    /// Get the offsets as an iterator
+    // TODO: Do we want this kind of API?
+    // TODO: Test once API is agreed upon
+    pub fn offsets(&'m self) -> Result<impl Iterator<Item = (usize, usize)> + 'm, ArrowError> {
+        struct OffsetIterators<'m> {
+            buffer: &'m [u8],
+            total: usize,
+            seen: usize,
+            offset_size: usize,
+        }
+        impl<'m> Iterator for OffsetIterators<'m> {
+            type Item = (usize, usize); // (start, end) positions of the bytes
+
+            // TODO: Check bounds here or ensure they're correct
+
+            fn next(&mut self) -> Option<Self::Item> {
+                // +1 to skip the first offset
+                if self.seen < self.total {
+                    let start = usize::from_le_bytes(
+                        self.buffer[(self.seen) * self.offset_size + 1 // +1 to skip header
+                            ..(self.seen) * self.offset_size + 1]
+                            .try_into()
+                            .ok()?,
+                    );
+                    self.seen += 1;
+                    let end = usize::from_le_bytes(
+                        self.buffer[(self.seen) * self.offset_size + 1 // +1 to skip header
+                            ..(self.seen) * self.offset_size + 1]
+                            .try_into()
+                            .ok()?,
+                    );
+
+                    Some((start, end))
+                } else {
+                    None
+                }
+            }
+        }
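The header layout documented in the hunk above (low 4 bits = version, bit 4 = sorted_strings, top 2 bits = offset_size_minus_one) can be exercised with a small free function. This is a hypothetical illustration, not the PR's code:

```rust
// Decode the Variant metadata header byte into (version, is_sorted, offset_size).
fn decode_header(header: u8) -> (u8, bool, usize) {
    let version = header & 0x0F; // low four bits
    let is_sorted = (header & 0x10) != 0; // fifth bit
    let offset_size = ((header >> 6) & 0x03) as usize + 1; // bytes per offset field
    (version, is_sorted, offset_size)
}

fn main() {
    // 0b01_0_1_0001: offset_size_minus_one = 1, sorted_strings set, version = 1
    assert_eq!(decode_header(0b0101_0001), (1, true, 2));
    assert_eq!(decode_header(0b0000_0001), (1, false, 1));
}
```

Note that it masks after the shift, unlike the quoted `self.bytes[0] & (0b11 << 6)`, which yields the two bits still in their high position rather than the value 0..=3.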
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104122900

## parquet-variant/src/utils.rs:

@@ -0,0 +1,42 @@
+use std::{array::TryFromSliceError, ops::Range};
+
+use arrow_schema::ArrowError;
+
+#[inline]
+pub(crate) fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], ArrowError> {
+    bytes.get(range.clone()).ok_or_else(|| {
+        ArrowError::InvalidArgumentError(format!(
+            "Tried to extract {} bytes at offset {} from {}-byte buffer",
+            range.end - range.start,
+            range.start,
+            bytes.len(),
+        ))
+    })
+}
+
+pub(crate) fn array_from_slice<const N: usize>(
+    bytes: &[u8],
+    offset: usize,
+) -> Result<[u8; N], ArrowError> {
+    let bytes = slice_from_slice(bytes, offset..offset + N)?;
+    bytes.try_into().map_err(map_try_from_slice_error)
+}
+
+/// To be used in `map_err` when unpacking an integer from a slice of bytes.
+pub(crate) fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
+    ArrowError::InvalidArgumentError(e.to_string())
+}
+
+pub(crate) fn non_empty_slice(slice: &[u8]) -> Result<&[u8], ArrowError> {
+    if slice.is_empty() {
+        return Err(ArrowError::InvalidArgumentError(
+            "Received empty bytes".to_string(),
+        ));
+    } else {
+        return Ok(slice);
+    }

Review Comment: Yes, has to return `Result<&u8, ArrowError>` but that worked fine as well.
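The `Result<&u8, ArrowError>` shape mentioned in the review comment (a first-byte accessor rather than a non-empty-slice check) can be sketched like this; the function name mirrors the `first_byte_from_slice` helper referenced elsewhere in the PR, but the body is illustrative and the error type is simplified to `String` to keep it self-contained:

```rust
// Return a reference to the first byte, or an error for an empty buffer.
fn first_byte_from_slice(slice: &[u8]) -> Result<&u8, String> {
    slice.first().ok_or_else(|| "Received empty bytes".to_string())
}

fn main() {
    assert_eq!(first_byte_from_slice(&[7, 8]), Ok(&7));
    assert!(first_byte_from_slice(&[]).is_err());
}
```

Building on `slice::first` keeps the happy path branch-free at the call site and avoids the redundant `return`s of the quoted `non_empty_slice`.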
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104484565 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { +// TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294) +if index >= self.dict_size { +return Err(ArrowError::InvalidArgumentError(format!( +"Index {} out of bounds for dictionary of length {}", +index, self.dict_size +))); +} + +// Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1) +// TODO: Validate size before looking up? +// TODO: Fix location / bytes here, the index is wrong. +let start = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 1)?; +let end = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 2)?; +Ok(start..end) } -/// Get the header byte, which has the following form -/// 7 6 5 4 3 0 -/// +---+---+---+---+ -/// header | | | |version| -/// +---+---+---+---+ -/// ^ ^ -/// | +-- sorted_strings -/// +-- offset_size_minus_one -/// The version is a 4-bit value that must always contain the value 1. -/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique. 
-/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. -/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 -pub fn header(&self) -> Result { -if self.bytes.is_empty() { -return Err(ArrowError::InvalidArgumentError( -"Can't get header from empty buffer".to_string(), -)); +/// Get the key-name by index +pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> { +match self.get_offset_by(index)
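The hunk above is cut off right at `get_field_by_index`'s `match self.get_offset_by(index)`. One plausible sketch of the remaining step — slice the dictionary-string region by the returned range and validate UTF-8 — is below. The `string_bytes` argument and function name are illustrative (the real method computes that region from the header and offset list), and `String` stands in for `ArrowError`.

```rust
// Sketch: resolve a key name from an offset range produced by
// get_offset_by. Bounds and UTF-8 validity are both checked.
fn field_from_range(
    string_bytes: &[u8],
    range: std::ops::Range<usize>,
) -> Result<&str, String> {
    let bytes = string_bytes
        .get(range.clone())
        .ok_or_else(|| format!("range {range:?} out of bounds"))?;
    std::str::from_utf8(bytes).map_err(|_| "invalid UTF-8 string".to_string())
}
```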
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104464049 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { +// TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294) +if index >= self.dict_size { +return Err(ArrowError::InvalidArgumentError(format!( +"Index {} out of bounds for dictionary of length {}", +index, self.dict_size +))); +} + +// Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1) +// TODO: Validate size before looking up? +// TODO: Fix location / bytes here, the index is wrong. +let start = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 1)?; +let end = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 2)?; +Ok(start..end) } -/// Get the header byte, which has the following form -/// 7 6 5 4 3 0 -/// +---+---+---+---+ -/// header | | | |version| -/// +---+---+---+---+ -/// ^ ^ -/// | +-- sorted_strings -/// +-- offset_size_minus_one -/// The version is a 4-bit value that must always contain the value 1. -/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique. 
-/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. -/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 -pub fn header(&self) -> Result { -if self.bytes.is_empty() { -return Err(ArrowError::InvalidArgumentError( -"Can't get header from empty buffer".to_string(), -)); +/// Get the key-name by index +pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> { +match self.get_offset_by(index)
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104449497 ## parquet-variant/src/utils.rs: ## @@ -0,0 +1,42 @@ +use std::{array::TryFromSliceError, ops::Range}; + +use arrow_schema::ArrowError; + +#[inline] +pub(crate) fn slice_from_slice(bytes: &[u8], range: Range) -> Result<&[u8], ArrowError> { +bytes.get(range.clone()).ok_or_else(|| { +ArrowError::InvalidArgumentError(format!( +"Tried to extract {} bytes at offset {} from {}-byte buffer", +range.end - range.start, +range.start, +bytes.len(), +)) +}) +} +pub(crate) fn array_from_slice( +bytes: &[u8], +offset: usize, +) -> Result<[u8; N], ArrowError> { +let bytes = slice_from_slice(bytes, offset..offset + N)?; +bytes.try_into().map_err(map_try_from_slice_error) +} + +/// To be used in `map_err` when unpacking an integer from a slice of bytes. +pub(crate) fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError { +ArrowError::InvalidArgumentError(e.to_string()) +} + +pub(crate) fn non_empty_slice(slice: &[u8]) -> Result<&[u8], ArrowError> { +if slice.is_empty() { +return Err(ArrowError::InvalidArgumentError( +"Received empty bytes".to_string(), +)); +} else { +return Ok(slice); +} Review Comment: Could also just add `.map(Copy::copy)` to the chain to make callers' lives easier? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
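The suggestion above — have the helper return the byte by value so callers avoid `&u8` — can be sketched with the stdlib's `Option::copied` (the `Copy` trait itself has no methods, so the literal `.map(Copy::copy)` would be `Clone::clone` or `.copied()` in practice). `String` stands in for `ArrowError`:

```rust
// Sketch: return the first byte by value instead of Result<&u8, _>,
// so callers need not dereference.
fn first_byte_from_slice(slice: &[u8]) -> Result<u8, String> {
    slice
        .first()
        .copied() // Option<&u8> -> Option<u8>
        .ok_or_else(|| "Received empty bytes".to_string())
}
```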
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r210556 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { +// TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294) +if index >= self.dict_size { +return Err(ArrowError::InvalidArgumentError(format!( +"Index {} out of bounds for dictionary of length {}", +index, self.dict_size +))); +} + +// Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1) +// TODO: Validate size before looking up? +// TODO: Fix location / bytes here, the index is wrong. +let start = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 1)?; +let end = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 2)?; +Ok(start..end) } -/// Get the header byte, which has the following form -/// 7 6 5 4 3 0 -/// +---+---+---+---+ -/// header | | | |version| -/// +---+---+---+---+ -/// ^ ^ -/// | +-- sorted_strings -/// +-- offset_size_minus_one -/// The version is a 4-bit value that must always contain the value 1. -/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique. 
-/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. -/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 -pub fn header(&self) -> Result { -if self.bytes.is_empty() { -return Err(ArrowError::InvalidArgumentError( -"Can't get header from empty buffer".to_string(), -)); +/// Get the key-name by index +pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> { +match self.get_offset_by(index)
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104428765 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { Review Comment: You're right, added another method. Code is nearly duplicate, but it avoids 2 calls to `unpack_usize` this way. Could also just call the single-version twice in the double, or make a helper, but felt like the trade-off in extra code was not worth it. Happy to change it, lmk. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
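The helper split weighed in the comment above — a single-offset reader plus a range version built from two calls to it — might look like the sketch below. `offset_size` is bytes per offset entry, entry 0 of the array is the dictionary size (so key `i` spans entries `i+1` and `i+2`), and `String` stands in for `ArrowError`.

```rust
// Read one little-endian offset-array entry; +1 skips the header byte.
fn unpack_offset(bytes: &[u8], offset_size: usize, index: usize) -> Result<usize, String> {
    let pos = 1 + offset_size * index;
    let field = bytes
        .get(pos..pos + offset_size)
        .ok_or_else(|| format!("offset entry {index} out of bounds"))?;
    Ok(field.iter().rev().fold(0, |acc, b| (acc << 8) | *b as usize))
}

// Range version: two calls to the single-offset reader, as the comment
// describes; the duplicate bounds logic lives in one place.
fn offset_range(
    bytes: &[u8],
    offset_size: usize,
    index: usize,
) -> Result<std::ops::Range<usize>, String> {
    Ok(unpack_offset(bytes, offset_size, index + 1)?
        ..unpack_offset(bytes, offset_size, index + 2)?)
}
```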
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104396739 ## parquet-variant/src/utils.rs: ## @@ -0,0 +1,42 @@ +use std::{array::TryFromSliceError, ops::Range}; + +use arrow_schema::ArrowError; + +#[inline] +pub(crate) fn slice_from_slice(bytes: &[u8], range: Range) -> Result<&[u8], ArrowError> { +bytes.get(range.clone()).ok_or_else(|| { +ArrowError::InvalidArgumentError(format!( +"Tried to extract {} bytes at offset {} from {}-byte buffer", +range.end - range.start, +range.start, +bytes.len(), +)) +}) +} +pub(crate) fn array_from_slice( +bytes: &[u8], +offset: usize, +) -> Result<[u8; N], ArrowError> { +let bytes = slice_from_slice(bytes, offset..offset + N)?; +bytes.try_into().map_err(map_try_from_slice_error) +} + +/// To be used in `map_err` when unpacking an integer from a slice of bytes. +pub(crate) fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError { +ArrowError::InvalidArgumentError(e.to_string()) +} + +pub(crate) fn non_empty_slice(slice: &[u8]) -> Result<&[u8], ArrowError> { +if slice.is_empty() { +return Err(ArrowError::InvalidArgumentError( +"Received empty bytes".to_string(), +)); +} else { +return Ok(slice); +} +} + +/// Constructs the error message for an invalid UTF-8 string. +pub(crate) fn invalid_utf8_err() -> ArrowError { +ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string()) Review Comment: Thanks, good point. Done! :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
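The thread above ends with the fixed `invalid_utf8_err` message being improved ("good point. Done!"); the exact change isn't quoted, but one plausible shape is to carry the `Utf8Error` details (which include the failing byte index) rather than a constant string. `String` stands in for `ArrowError::InvalidArgumentError`:

```rust
// Sketch: include the Utf8Error's own message (byte index, sequence
// length) instead of a fixed "invalid UTF-8 string".
fn invalid_utf8_err(e: std::str::Utf8Error) -> String {
    format!("encountered non UTF-8 data: {e}")
}
```

Used as `std::str::from_utf8(bytes).map_err(invalid_utf8_err)?`.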
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104383018 ## parquet-variant/src/variant.rs: ## @@ -1,14 +1,105 @@ -use std::ops::Index; - use crate::decoder::{ self, get_basic_type, get_primitive_type, VariantBasicType, VariantPrimitiveType, }; +use crate::utils::{array_from_slice, invalid_utf8_err, non_empty_slice, slice_from_slice}; use arrow_schema::ArrowError; +use std::{ +num::TryFromIntError, +ops::{Index, Range}, +str, +}; + +#[derive(Clone, Debug, Copy, PartialEq)] +enum OffsetSizeBytes { +One = 1, +Two = 2, +Three = 3, +Four = 4, +} + +impl OffsetSizeBytes { +fn try_new(offset_size_minus_one: u8) -> Result { +use OffsetSizeBytes::*; +let result = match offset_size_minus_one { +0 => One, +1 => Two, +2 => Three, +3 => Four, +_ => { +return Err(ArrowError::InvalidArgumentError( +"offset_size_minus_one must be 0–3".to_string(), +)) +} +}; +Ok(result) +} + +fn unpack_usize( +&self, +bytes: &[u8], +byte_offset: usize, // how many bytes to skip +offset_index: usize, // which offset in an array of offsets +) -> Result { +use OffsetSizeBytes::*; +let offset = byte_offset + (*self as usize) * offset_index; +let result = match self { +One => u8::from_le_bytes(array_from_slice(bytes, offset)?).into(), +Two => u16::from_le_bytes(array_from_slice(bytes, offset)?).into(), +// TODO: Do this one +Three => todo!(), +Four => u32::from_le_bytes(array_from_slice(bytes, offset)?) +.try_into() +.map_err(|e: TryFromIntError| ArrowError::InvalidArgumentError(e.to_string()))?, +}; +Ok(result) +} +} + +#[derive(Clone, Debug, Copy, PartialEq)] +pub(crate) struct VariantMetadataHeader { +version: u8, +is_sorted: bool, +/// Note: This is `offset_size_minus_one` + 1 +offset_size: OffsetSizeBytes, +} + +impl<'m> VariantMetadataHeader { Review Comment: You're right, removed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
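The `unpack_usize` quoted above still has `Three => todo!()`. One way to fill it in — and collapse the per-width match entirely — is to zero-extend any 1–4 byte field into a `[u8; 4]` and decode with `u32::from_le_bytes`. A self-contained sketch, with `String` standing in for `ArrowError`:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum OffsetSizeBytes {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

impl OffsetSizeBytes {
    fn unpack_usize(
        &self,
        bytes: &[u8],
        byte_offset: usize,  // how many bytes to skip
        offset_index: usize, // which offset in an array of offsets
    ) -> Result<usize, String> {
        let width = *self as usize;
        let start = byte_offset + width * offset_index;
        let field = bytes
            .get(start..start + width)
            .ok_or_else(|| format!("offset {offset_index} out of bounds"))?;
        let mut le = [0u8; 4]; // zero-extend to 4 bytes, little-endian
        le[..width].copy_from_slice(field);
        usize::try_from(u32::from_le_bytes(le)).map_err(|e| e.to_string())
    }
}
```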
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104379377 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); Review Comment: refactored, but kept the pre-validation for now ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 Review Comment: Added checks :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
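On the validation refactor discussed above: the quoted `scan` uses `*prev == 0` to mean "this is the first offset", so a zero appearing later in the list would be misread as a fresh start, and the required leading 0 is never itself checked. A stateless sketch over an already-decoded offset slice avoids both issues:

```rust
// Sketch: offsets must begin with 0 and be strictly increasing.
// windows(2) compares each adjacent pair, replacing the stateful scan.
fn offsets_monotonic(offsets: &[usize]) -> bool {
    offsets.first() == Some(&0) && offsets.windows(2).all(|w| w[0] < w[1])
}
```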
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104376735 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { +// TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294) +if index >= self.dict_size { +return Err(ArrowError::InvalidArgumentError(format!( +"Index {} out of bounds for dictionary of length {}", +index, self.dict_size +))); +} + +// Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1) +// TODO: Validate size before looking up? +// TODO: Fix location / bytes here, the index is wrong. Review Comment: The comment was outdated :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104373793

## parquet-variant/src/variant.rs: ##

@@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> {
         self.bytes
     }
 
+    pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> {
+        let header = VariantMetadataHeader::try_new(bytes)?;
+        // Offset 1, index 0 because first element after header is dictionary size
+        let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?;
+
+        // TODO: Refactor, add test for validation
+        let valid = (0..=dict_size)
+            .map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1))
+            .scan(0, |prev, cur| {
+                let Ok(cur_offset) = cur else {
+                    return Some(false);
+                };
+                // Skip the first offset, which is always 0
+                if *prev == 0 {
+                    *prev = cur_offset;
+                    return Some(true);
+                }
+
+                let valid = cur_offset > *prev;
+                *prev = cur_offset;
+                Some(valid)
+            })
+            .all(|valid| valid);
+
+        if !valid {
+            return Err(ArrowError::InvalidArgumentError(
+                "Offsets are not monotonically increasing".to_string(),
+            ));
+        }
+        Ok(Self {
+            bytes,
+            header,
+            dict_size,
+        })
+    }
+
     /// Whether the dictionary keys are sorted and unique
     pub fn is_sorted(&self) -> bool {
-        todo!()
+        self.header.is_sorted
     }
 
-    /// Get the dict length
-    pub fn dict_len(&self) -> Result<usize, ArrowError> {
-        let end_location = self.offset_size()? as usize + 1;
-        if self.bytes.len() < end_location {
-            let err_str = format!(
-                "Invalid metadata bytes, must have at least length {} but has length {}",
-                &end_location,
-                self.bytes.len()
-            );
-            return Err(ArrowError::InvalidArgumentError(err_str));
-        }
-        let dict_len_bytes = &self.bytes[1..end_location];
-        let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| {
-            ArrowError::InvalidArgumentError(format!(
-                "Unable to convert dictionary_size bytes into usize: {}",
-                e,
-            ))
-        })?);
-        Ok(dict_len)
+    /// Get the dictionary size
+    pub fn dictionary_size(&self) -> usize {
+        self.dict_size
     }
 
-    pub fn version(&self) -> usize {
-        todo!()
+    pub fn version(&self) -> u8 {
+        self.header.version
     }
 
-    /// Get the offset by index
-    pub fn get_offset_by(&self, index: usize) -> Result<usize, ArrowError> {
-        todo!()
+    /// Get the offset by key-index
+    pub fn get_offset_by(&self, index: usize) -> Result<Range<usize>, ArrowError> {
+        // TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294)
+        if index >= self.dict_size {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "Index {} out of bounds for dictionary of length {}",
+                index, self.dict_size
+            )));
+        }
+
+        // Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1)
+        // TODO: Validate size before looking up?
+        // TODO: Fix location / bytes here, the index is wrong.
+        let start = self
+            .header
+            .offset_size
+            .unpack_usize(self.bytes, 1, index + 1)?;
+        let end = self
+            .header
+            .offset_size
+            .unpack_usize(self.bytes, 1, index + 2)?;
+        Ok(start..end)
     }
 
-    /// Get the header byte, which has the following form
-    ///           7   6   5   4   3       0
-    ///          +---+---+---+---+
-    /// header   |   |   |   |version|
-    ///          +---+---+---+---+
-    ///            ^       ^
-    ///            |       +-- sorted_strings
-    ///            +-- offset_size_minus_one
-    /// The version is a 4-bit value that must always contain the value 1.
-    /// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique.
-    /// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field.
-    /// - The actual number of bytes, offset_size, is offset_size_minus_one + 1
-    pub fn header(&self) -> Result<u8, ArrowError> {
-        if self.bytes.is_empty() {
-            return Err(ArrowError::InvalidArgumentError(
-                "Can't get header from empty buffer".to_string(),
-            ));
+    /// Get the key-name by index
+    pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> {
+        match self.get_offset_by(index) {
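The header byte documented in the (now removed) doc comment packs three fields into one byte. As a standalone sketch of that decoding (the function name and `String` error type here are illustrative, not the PR's API; the bit positions follow the Parquet Variant metadata layout):

```rust
/// Decode a Variant metadata header byte: version in bits 0..=3,
/// sorted_strings flag in bit 4, offset_size_minus_one in bits 6..=7.
fn parse_metadata_header(byte: u8) -> Result<(u8, bool, usize), String> {
    let version = byte & 0x0F;
    if version != 1 {
        return Err(format!("unsupported metadata version: {version}"));
    }
    let is_sorted = (byte & 0x10) != 0;
    // The actual number of bytes per offset field is offset_size_minus_one + 1
    let offset_size = ((byte >> 6) as usize & 0x03) + 1;
    Ok((version, is_sorted, offset_size))
}

fn main() {
    // 0b0101_0001: two-byte offsets, sorted dictionary, version 1
    assert_eq!(parse_metadata_header(0b0101_0001).unwrap(), (1, true, 2));
    assert!(parse_metadata_header(0b0000_0010).is_err()); // version 2 is invalid
}
```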
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104028375

## parquet-variant/src/variant.rs: ## (same `VariantMetadata` hunk quoted as in the comment above)
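Each offset in the quoted hunk is fetched with `unpack_usize(bytes, byte_offset, index)`: the index-th fixed-width little-endian integer after `byte_offset` leading bytes. A minimal sketch of that access pattern (taking `offset_size` as an explicit parameter is an assumption made for self-containment; in the PR it lives on the parsed header):

```rust
// Read the `index`-th little-endian integer of `offset_size` bytes,
// starting `byte_offset` bytes into the buffer. Returns None on overrun.
fn unpack_usize(bytes: &[u8], offset_size: usize, byte_offset: usize, index: usize) -> Option<usize> {
    let start = byte_offset + index * offset_size;
    let field = bytes.get(start..start + offset_size)?;
    // Widen the (at most 8) little-endian bytes into a usize
    Some(field.iter().rev().fold(0usize, |acc, b| (acc << 8) | *b as usize))
}

fn main() {
    // header byte, then two 2-byte little-endian fields: 3 and 300
    let buf = [0x51u8, 0x03, 0x00, 0x2C, 0x01];
    assert_eq!(unpack_usize(&buf, 2, 1, 0), Some(3));
    assert_eq!(unpack_usize(&buf, 2, 1, 1), Some(300));
    assert_eq!(unpack_usize(&buf, 2, 1, 2), None); // past the end
}
```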
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104141735

## parquet-variant/src/variant.rs: ## (same `VariantMetadata` hunk quoted as in the comment above)

+    /// Get the dictionary size
+    pub fn dictionary_size(&self) -> usize {

Review Comment: I would prefer to keep the public api aligned with the spec

-- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104122900

## parquet-variant/src/utils.rs: ##

@@ -0,0 +1,42 @@
+use std::{array::TryFromSliceError, ops::Range};
+
+use arrow_schema::ArrowError;
+
+#[inline]
+pub(crate) fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], ArrowError> {
+    bytes.get(range.clone()).ok_or_else(|| {
+        ArrowError::InvalidArgumentError(format!(
+            "Tried to extract {} bytes at offset {} from {}-byte buffer",
+            range.end - range.start,
+            range.start,
+            bytes.len(),
+        ))
+    })
+}
+
+pub(crate) fn array_from_slice<const N: usize>(
+    bytes: &[u8],
+    offset: usize,
+) -> Result<[u8; N], ArrowError> {
+    let bytes = slice_from_slice(bytes, offset..offset + N)?;
+    bytes.try_into().map_err(map_try_from_slice_error)
+}
+
+/// To be used in `map_err` when unpacking an integer from a slice of bytes.
+pub(crate) fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
+    ArrowError::InvalidArgumentError(e.to_string())
+}
+
+pub(crate) fn non_empty_slice(slice: &[u8]) -> Result<&[u8], ArrowError> {
+    if slice.is_empty() {
+        return Err(ArrowError::InvalidArgumentError(
+            "Received empty bytes".to_string(),
+        ));
+    } else {
+        return Ok(slice);
+    }

Review Comment: Yes, it has to return `Result<&u8, ArrowError>`, but that worked fine as well, since callers could just de-ref as needed.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
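The same helpers can be exercised in isolation. This sketch swaps `ArrowError` for plain `String` errors so the example is self-contained, but otherwise mirrors the shapes in the hunk above:

```rust
use std::ops::Range;

// Fallible subslice: like `&bytes[range]` but returns Err instead of panicking.
fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], String> {
    bytes.get(range.clone()).ok_or_else(|| {
        format!(
            "Tried to extract {} bytes at offset {} from {}-byte buffer",
            range.end - range.start,
            range.start,
            bytes.len(),
        )
    })
}

// Fallible fixed-size read, e.g. to feed `u16::from_le_bytes`.
fn array_from_slice<const N: usize>(bytes: &[u8], offset: usize) -> Result<[u8; N], String> {
    let s = slice_from_slice(bytes, offset..offset + N)?;
    s.try_into().map_err(|e| format!("{e}"))
}

fn main() {
    let buf = [0x01u8, 0x2C, 0x01, 0xFF];
    assert_eq!(slice_from_slice(&buf, 1..3).unwrap(), &[0x2C, 0x01]);
    assert!(slice_from_slice(&buf, 2..9).is_err()); // out of bounds, no panic
    let le: [u8; 2] = array_from_slice(&buf, 1).unwrap();
    assert_eq!(u16::from_le_bytes(le), 300);
}
```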
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104100108

## parquet-variant/src/variant.rs: ## (same `VariantMetadata` hunk quoted as in the comment above)
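`get_field_by_index` in the quoted hunk resolves an offset range into a key name; the final step is slicing the dictionary's string region and validating UTF-8. A hedged sketch of just that step (the name `field_from_range` is illustrative; the PR does this against the metadata buffer):

```rust
use std::ops::Range;

// Slice `range` out of the dictionary's string region and validate UTF-8.
fn field_from_range(string_bytes: &[u8], range: Range<usize>) -> Result<&str, String> {
    let raw = string_bytes
        .get(range)
        .ok_or_else(|| "offset range out of bounds".to_string())?;
    std::str::from_utf8(raw).map_err(|e| e.to_string())
}

fn main() {
    // Two keys "key" and "other" stored back to back, offsets [0, 3, 8]
    let strings: &[u8] = b"keyother";
    assert_eq!(field_from_range(strings, 0..3), Ok("key"));
    assert_eq!(field_from_range(strings, 3..8), Ok("other"));
    assert!(field_from_range(strings, 5..20).is_err()); // bad offsets surface as Err
}
```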
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104092119

## parquet-variant/src/variant.rs: ## (same `VariantMetadata` hunk quoted as in the comment above)
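The invariant that `try_new` enforces on the dictionary offsets can be stated and tested in isolation: the first offset is 0 and each subsequent offset strictly increases (matching the PR's `cur_offset > *prev` check, which also rejects equal adjacent offsets). A minimal sketch, independent of the PR's scan-based implementation:

```rust
// Offsets delimit dictionary strings, so they must start at 0 and be
// strictly increasing to yield ordered, in-bounds string ranges.
fn offsets_monotonic(offsets: &[usize]) -> bool {
    matches!(offsets.first(), Some(0)) && offsets.windows(2).all(|w| w[0] < w[1])
}

fn main() {
    assert!(offsets_monotonic(&[0, 3, 8]));
    assert!(!offsets_monotonic(&[0, 3, 3])); // repeated offset rejected
    assert!(!offsets_monotonic(&[1, 2, 3])); // first offset must be 0
}
```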
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104092119 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { +// TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294) +if index >= self.dict_size { +return Err(ArrowError::InvalidArgumentError(format!( +"Index {} out of bounds for dictionary of length {}", +index, self.dict_size +))); +} + +// Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1) +// TODO: Validate size before looking up? +// TODO: Fix location / bytes here, the index is wrong. +let start = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 1)?; +let end = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 2)?; +Ok(start..end) } -/// Get the header byte, which has the following form -/// 7 6 5 4 3 0 -/// +---+---+---+---+ -/// header | | | |version| -/// +---+---+---+---+ -/// ^ ^ -/// | +-- sorted_strings -/// +-- offset_size_minus_one -/// The version is a 4-bit value that must always contain the value 1. -/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique. 
-/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. -/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 -pub fn header(&self) -> Result { -if self.bytes.is_empty() { -return Err(ArrowError::InvalidArgumentError( -"Can't get header from empty buffer".to_string(), -)); +/// Get the key-name by index +pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> { +match self.get_offset_by(index) {
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104092119 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { +// TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294) +if index >= self.dict_size { +return Err(ArrowError::InvalidArgumentError(format!( +"Index {} out of bounds for dictionary of length {}", +index, self.dict_size +))); +} + +// Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1) +// TODO: Validate size before looking up? +// TODO: Fix location / bytes here, the index is wrong. +let start = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 1)?; +let end = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 2)?; +Ok(start..end) } -/// Get the header byte, which has the following form -/// 7 6 5 4 3 0 -/// +---+---+---+---+ -/// header | | | |version| -/// +---+---+---+---+ -/// ^ ^ -/// | +-- sorted_strings -/// +-- offset_size_minus_one -/// The version is a 4-bit value that must always contain the value 1. -/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique. 
-/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. -/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 -pub fn header(&self) -> Result { -if self.bytes.is_empty() { -return Err(ArrowError::InvalidArgumentError( -"Can't get header from empty buffer".to_string(), -)); +/// Get the key-name by index +pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> { +match self.get_offset_by(index) {
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104028375 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { +// TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294) +if index >= self.dict_size { +return Err(ArrowError::InvalidArgumentError(format!( +"Index {} out of bounds for dictionary of length {}", +index, self.dict_size +))); +} + +// Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1) +// TODO: Validate size before looking up? +// TODO: Fix location / bytes here, the index is wrong. +let start = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 1)?; +let end = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 2)?; +Ok(start..end) } -/// Get the header byte, which has the following form -/// 7 6 5 4 3 0 -/// +---+---+---+---+ -/// header | | | |version| -/// +---+---+---+---+ -/// ^ ^ -/// | +-- sorted_strings -/// +-- offset_size_minus_one -/// The version is a 4-bit value that must always contain the value 1. -/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique. 
-/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. -/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 -pub fn header(&self) -> Result { -if self.bytes.is_empty() { -return Err(ArrowError::InvalidArgumentError( -"Can't get header from empty buffer".to_string(), -)); +/// Get the key-name by index +pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> { +match self.get_offset_by(index) {
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104028375 ## parquet-variant/src/variant.rs: ## @@ -18,158 +109,223 @@ impl<'m> VariantMetadata<'m> { self.bytes } +pub fn try_new(bytes: &'m [u8]) -> Result { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size = header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +return Err(ArrowError::InvalidArgumentError( +"Offsets are not monotonically increasing".to_string(), +)); +} +Ok(Self { +bytes, +header, +dict_size, +}) +} + /// Whether the dictionary keys are sorted and unique pub fn is_sorted(&self) -> bool { -todo!() +self.header.is_sorted } -/// Get the dict length -pub fn dict_len(&self) -> Result { -let end_location = self.offset_size()? 
as usize + 1; -if self.bytes.len() < end_location { -let err_str = format!( -"Invalid metadata bytes, must have at least length {} but has length {}", -&end_location, -self.bytes.len() -); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let dict_len_bytes = &self.bytes[1..end_location]; -let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { -ArrowError::InvalidArgumentError(format!( -"Unable to convert dictionary_size bytes into usize: {}", -e, -)) -})?); -Ok(dict_len) +/// Get the dictionary size +pub fn dictionary_size(&self) -> usize { +self.dict_size } -pub fn version(&self) -> usize { -todo!() +pub fn version(&self) -> u8 { +self.header.version } -/// Get the offset by index -pub fn get_offset_by(&self, index: usize) -> Result { -todo!() +/// Get the offset by key-index +pub fn get_offset_by(&self, index: usize) -> Result, ArrowError> { +// TODO: Should we memoize the offsets? There could be thousands of them (https://github.com/apache/arrow-rs/pull/7535#discussion_r2101351294) +if index >= self.dict_size { +return Err(ArrowError::InvalidArgumentError(format!( +"Index {} out of bounds for dictionary of length {}", +index, self.dict_size +))); +} + +// Skipping the header byte (setting byte_offset = 1) and the dictionary_size (setting offset_index +1) +// TODO: Validate size before looking up? +// TODO: Fix location / bytes here, the index is wrong. +let start = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 1)?; +let end = self +.header +.offset_size +.unpack_usize(self.bytes, 1, index + 2)?; +Ok(start..end) } -/// Get the header byte, which has the following form -/// 7 6 5 4 3 0 -/// +---+---+---+---+ -/// header | | | |version| -/// +---+---+---+---+ -/// ^ ^ -/// | +-- sorted_strings -/// +-- offset_size_minus_one -/// The version is a 4-bit value that must always contain the value 1. -/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique. 
-/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. -/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 -pub fn header(&self) -> Result { -if self.bytes.is_empty() { -return Err(ArrowError::InvalidArgumentError( -"Can't get header from empty buffer".to_string(), -)); +/// Get the key-name by index +pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> { +match self.get_offset_by(index) {
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2104028375 ## parquet-variant/src/variant.rs: ##
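The `try_new` offset-validation loop reviewed in this thread carries a `// TODO: Refactor, add test for validation`. A standalone sketch of the same strict check, not the PR's code, once the offsets have been unpacked into a slice (error type simplified to `String`; it also pins the first offset to 0, which the PR's comment notes is always the case):

```rust
/// Mirrors the strict monotonicity check in `VariantMetadata::try_new`:
/// the first offset must be 0 and each later offset strictly larger.
fn validate_offsets(offsets: &[usize]) -> Result<(), String> {
    if offsets.first() != Some(&0) {
        return Err("first offset must be 0".to_string());
    }
    if !offsets.windows(2).all(|w| w[0] < w[1]) {
        return Err("Offsets are not monotonically increasing".to_string());
    }
    Ok(())
}

fn main() {
    assert!(validate_offsets(&[0, 3, 7, 12]).is_ok());
    assert!(validate_offsets(&[0, 3, 3, 9]).is_err()); // not strictly increasing
    assert!(validate_offsets(&[1, 2]).is_err()); // must start at 0
    println!("ok");
}
```

Using `windows(2)` over a materialized slice avoids the `scan` state machine in the quoted code and is straightforward to unit-test.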
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103978049 ## parquet-variant/src/variant.rs: ##
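For orientation when reading these threads: the metadata header byte packs the version into the low 4 bits, the sorted_strings flag into bit 4, and offset_size_minus_one into the top 2 bits, matching the bit fiddling in `VariantMetadataHeader::try_new`. A minimal standalone sketch (function name hypothetical):

```rust
/// Sketch: unpack the Variant metadata header byte. Bit positions follow the
/// PR's VariantMetadataHeader::try_new (bit 5 is unused by this layout).
fn parse_metadata_header(header: u8) -> (u8, bool, u8) {
    let version = header & 0x0F; // low 4 bits
    let is_sorted = (header & 0x10) != 0; // bit 4
    let offset_size = ((header >> 6) & 0x03) + 1; // top 2 bits, plus one
    (version, is_sorted, offset_size)
}

fn main() {
    // offset_size_minus_one = 1 (bits 6-7), sorted = true (bit 4), version = 1.
    let (version, sorted, offset_size) = parse_metadata_header(0b0101_0001);
    assert_eq!((version, sorted, offset_size), (1, true, 2));
    println!("ok");
}
```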
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103257749 ## parquet-variant/src/decoder.rs: ## @@ -67,56 +69,30 @@ fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError { ArrowError::InvalidArgumentError(e.to_string()) } -/// Constructs the error message for an invalid UTF-8 string. -fn invalid_utf8_err() -> ArrowError { -ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string()) -} - /// Decodes an Int8 from the value section of a variant. pub(crate) fn decode_int8(value: &[u8]) -> Result<i8, ArrowError> { -if value.is_empty() { -return Err(ArrowError::InvalidArgumentError( -"Got empty value buffer so can't decode into int8.".to_string(), -)); -} -let value = i8::from_le_bytes([value[1]]); +let value = i8::from_le_bytes(array_from_slice(value, 1)?); Ok(value) } /// Decodes a long string from the value section of a variant. pub(crate) fn decode_long_string(value: &[u8]) -> Result<&str, ArrowError> { -if value.len() < 5 { -return Err(ArrowError::InvalidArgumentError( -"Tried to decode value buffer into long_string, but it's too short (len<5)." -.to_string(), -)); -} -let len = - u32::from_le_bytes(value[1..=4].try_into().map_err(map_try_from_slice_error)?) as usize; -if value.len() < len + 5 { -let err_str = format!("The length of the buffer for the long_string is soo short, it is {} and it should be at least {} ({} < {} + 5)", value.len(), len + 5 , value.len(), len); -return Err(ArrowError::InvalidArgumentError(err_str)); -} -let string_bytes = &value[5..5 + len]; -let string = str::from_utf8(string_bytes).map_err(|_| invalid_utf8_err())?; +let len = u32::from_le_bytes( +slice_from_slice(value, 1..5)? +.try_into() +.map_err(map_try_from_slice_error)?, +) as usize; +let string = +str::from_utf8(slice_from_slice(value, 5..5 + len)?).map_err(|_| invalid_utf8_err())?; Review Comment: Basically every caller ends up doing `offset..offset+len`, should the method just receive (offset, len) pairs instead?
There's technically a risk of causing an overflow panic when adding the two values inside... but we add them outside without a check right now. If anything, pulling the add operation inside (where it can be checked) would be safer? ## parquet-variant/src/utils.rs: ## @@ -0,0 +1,42 @@ +use std::{array::TryFromSliceError, ops::Range}; + +use arrow_schema::ArrowError; + +#[inline] +pub(crate) fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], ArrowError> { +bytes.get(range.clone()).ok_or_else(|| { Review Comment: aside: Strange that `Range` lacks an `impl Copy for Range` ## parquet-variant/src/decoder.rs: ## +let len = u32::from_le_bytes( +slice_from_slice(value, 1..5)? +.try_into() +.map_err(map_try_from_slice_error)?, +) as usize; Review Comment: ```suggestion let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize; ``` ## parquet-variant/src/variant.rs: ##
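scovich's suggestion above — have the helper take an `(offset, len)` pair and do a checked add inside — could look roughly like this sketch (the names `slice_at` and `DecodeError` are placeholders, not the PR's API):

```rust
#[derive(Debug)]
struct DecodeError(String);

/// Sketch of the suggested `(offset, len)` helper: the add happens inside,
/// via `checked_add`, so a huge `len` errors instead of panicking.
fn slice_at(bytes: &[u8], offset: usize, len: usize) -> Result<&[u8], DecodeError> {
    let end = offset
        .checked_add(len)
        .ok_or_else(|| DecodeError(format!("offset {offset} + len {len} overflows usize")))?;
    bytes.get(offset..end).ok_or_else(|| {
        DecodeError(format!(
            "range {offset}..{end} out of bounds for buffer of length {}",
            bytes.len()
        ))
    })
}

fn main() {
    let buf = [0u8, 1, 2, 3, 4];
    assert_eq!(slice_at(&buf, 1, 3).unwrap(), [1, 2, 3]);
    // Overflowing (offset, len) pairs error instead of panicking.
    assert!(slice_at(&buf, 1, usize::MAX).is_err());
    println!("ok");
}
```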
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103399834 ## parquet-variant/src/decoder.rs: ## Review Comment: Based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452 and the feedback there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103398459 ## parquet-variant/src/decoder.rs: ## @@ -0,0 +1,199 @@ +// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452 +// And the feedback there. +use crate::variant::VariantType; +use arrow_schema::ArrowError; +use std::{array::TryFromSliceError, str}; + +#[derive(Debug, Clone, Copy)] +pub enum VariantBasicType { +Primitive = 0, +ShortString = 1, +Object = 2, +Array = 3, +} + +#[derive(Debug, Clone, Copy)] +pub enum VariantPrimitiveType { +Null = 0, +BooleanTrue = 1, +BooleanFalse = 2, +Int8 = 3, +// TODO: Add 'legs' for the rest of primitives, once API is agreed upon +String = 16, +} + +/// Extracts the basic type from a header byte +pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> { +// See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding +let basic_type = header & 0x03; // Basic type is encoded in the first 2 bits +let basic_type = match basic_type { +0 => VariantBasicType::Primitive, +1 => VariantBasicType::ShortString, +2 => VariantBasicType::Object, +3 => VariantBasicType::Array, +_ => { +return Err(ArrowError::InvalidArgumentError(format!( +"unknown basic type: {}", +basic_type +))) +} Review Comment: Done!
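Since `header & 0x03` can only produce 0..=3, the quoted match's error arm is unreachable; a standalone sketch of an exhaustive version (enum simplified, no error path needed):

```rust
#[derive(Debug, PartialEq)]
enum VariantBasicType {
    Primitive,
    ShortString,
    Object,
    Array,
}

/// Sketch: `header & 0x03` can only be 0..=3, so the match can be exhaustive
/// and infallible, with no unreachable error arm.
fn basic_type(header: u8) -> VariantBasicType {
    match header & 0x03 {
        0 => VariantBasicType::Primitive,
        1 => VariantBasicType::ShortString,
        2 => VariantBasicType::Object,
        _ => VariantBasicType::Array, // 3 is the only remaining value
    }
}

fn main() {
    assert_eq!(basic_type(0b0000_0010), VariantBasicType::Object);
    assert_eq!(basic_type(0b1111_1111), VariantBasicType::Array);
    println!("ok");
}
```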
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103383668 ## parquet-variant/src/variant.rs: ## @@ -0,0 +1,415 @@ +use crate::decoder::{ +self, get_basic_type, get_primitive_type, VariantBasicType, VariantPrimitiveType, +}; +use crate::utils::{array_from_slice, invalid_utf8_err, non_empty_slice, slice_from_slice}; +use arrow_schema::ArrowError; +use std::{ +num::TryFromIntError, +ops::{Index, Range}, +str, +}; + +#[derive(Clone, Debug, Copy, PartialEq)] +enum OffsetSizeBytes { +One = 1, +Two = 2, +Three = 3, +Four = 4, +} + +impl OffsetSizeBytes { +fn try_new(offset_size_minus_one: u8) -> Result<Self, ArrowError> { +use OffsetSizeBytes::*; +let result = match offset_size_minus_one { +0 => One, +1 => Two, +2 => Three, +3 => Four, +_ => { +return Err(ArrowError::InvalidArgumentError( +"offset_size_minus_one must be 0–3".to_string(), +)) +} +}; +Ok(result) +} + +fn unpack_usize( +&self, +bytes: &[u8], +byte_offset: usize, // how many bytes to skip +offset_index: usize, // which offset in an array of offsets +) -> Result<usize, ArrowError> { +use OffsetSizeBytes::*; +let offset = byte_offset + (*self as usize) * offset_index; +let result = match self { +One => u8::from_le_bytes(array_from_slice(bytes, offset)?).into(), +Two => u16::from_le_bytes(array_from_slice(bytes, offset)?).into(), +Three => todo!(), // ugh, endianness +Four => u32::from_le_bytes(array_from_slice(bytes, offset)?)
+.try_into() +.map_err(|e: TryFromIntError| ArrowError::InvalidArgumentError(e.to_string()))?, +}; +Ok(result) +} +} + +#[derive(Clone, Debug, Copy, PartialEq)] +pub(crate) struct VariantMetadataHeader { +version: u8, +is_sorted: bool, +/// Note: This is `offset_size_minus_one` + 1 +offset_size: OffsetSizeBytes, +} + +impl<'m> VariantMetadataHeader { +/// Tries to construct the variant metadata header, which has the form +/// 7 6 5 4 3 0 +/// +---+---+---+---+ +/// header | | | |version| +/// +---+---+---+---+ +/// ^ ^ +/// | +-- sorted_strings +/// +-- offset_size_minus_one +/// The version is a 4-bit value that must always contain the value 1. +/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique. +/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. +/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 +pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> { +let Some(header) = bytes.get(0) else { +return Err(ArrowError::InvalidArgumentError( +"Received zero bytes".to_string(), +)); +}; + +let version = header & 0x0F; // First four bits +let is_sorted = (header & 0x10) != 0; // Fifth bit +let offset_size_minus_one = (header >> 6) & 0x03; // Last two bits +Ok(Self { +version, +is_sorted, +offset_size: OffsetSizeBytes::try_new(offset_size_minus_one)?, +}) +} +} + +#[derive(Clone, Copy, Debug, PartialEq)] +/// Encodes the Variant Metadata, see the Variant spec file for more information +pub struct VariantMetadata<'m> { +bytes: &'m [u8], +header: VariantMetadataHeader, +dict_size: usize, +} + +impl<'m> VariantMetadata<'m> { +/// View the raw bytes (needed by very low-level decoders) +#[inline] +pub const fn as_bytes(&self) -> &'m [u8] { +self.bytes +} + +pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> { +let header = VariantMetadataHeader::try_new(bytes)?; +// Offset 1, index 0 because first element after header is dictionary size +let dict_size =
header.offset_size.unpack_usize(bytes, 1, 0)?; + +// TODO: Refactor, add test for validation +let valid = (0..=dict_size) +.map(|i| header.offset_size.unpack_usize(bytes, 1, i + 1)) +.scan(0, |prev, cur| { +let Ok(cur_offset) = cur else { +return Some(false); +}; +// Skip the first offset, which is always 0 +if *prev == 0 { +*prev = cur_offset; +return Some(true); +} + +let valid = cur_offset > *prev; +*prev = cur_offset; +Some(valid) +}) +.all(|valid| valid); + +if !valid { +retur
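The `Three => todo!(), // ugh, endianness` arm above has a straightforward resolution: copy the 3 little-endian bytes into the low end of a zeroed 4-byte array and widen. A sketch under that assumption (helper name hypothetical):

```rust
/// Sketch for the 3-byte case: widen three little-endian bytes to a u32 by
/// copying them into the low bytes of a zeroed 4-byte array.
fn unpack_u24_le(bytes: &[u8], offset: usize) -> Option<u32> {
    let src = bytes.get(offset..offset.checked_add(3)?)?;
    let mut buf = [0u8; 4];
    buf[..3].copy_from_slice(src);
    Some(u32::from_le_bytes(buf))
}

fn main() {
    // Little-endian bytes [0x01, 0x02, 0x03] decode to 0x0003_0201.
    assert_eq!(unpack_u24_le(&[0x01, 0x02, 0x03], 0), Some(0x0003_0201));
    assert_eq!(unpack_u24_le(&[0xFF], 0), None); // too short
    println!("ok");
}
```

Zero-extending through a fixed-size array sidesteps any endianness gymnastics: `from_le_bytes` does the interpretation, and the high byte stays zero.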
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103381365 ## parquet-variant/src/variant.rs: ## @@ -182,83 +183,127 @@ impl<'m> VariantMetadata<'m> { } /// Get the key-name by index -pub fn get_by(&self, index: usize) -> Result<&'m str, ArrowError> { +pub fn get_field_by_index(&self, index: usize) -> Result<&'m str, ArrowError> { match self.get_offset_by(index) { -Ok(range) => { -let bytes = slice_from_slice(self.bytes, 1 + range.start..1 + range.end)?; -let result = str::from_utf8(bytes).map_err(|_| invalid_utf8_err())?; -Ok(result) -} +Ok(range) => self.get_field_by_offset(range), Err(e) => Err(e), } } -pub fn header(&self) -> VariantMetadataHeader { -self.header -} - -// TODO: Fix this + next two -/// Get the offset_minus_one value from the header -pub fn offset_size_minus_one(&self) -> Result<u8, ArrowError> { -Ok(non_empty_slice(self.bytes)?[0] & (0b11 << 6)) // Grab the last 2 bits +/// Gets the field using an offset (Range) - helper method to keep consistent API. +pub fn get_field_by_offset(&self, offset: Range<usize>) -> Result<&'m str, ArrowError> { Review Comment: made it pub(crate) as it might make some testing easier, we can always make it private if not needed for testing.
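The `get_field_by_offset` helper discussed here boils down to shift-slice-validate. A toy, self-contained sketch of just that step (the thread elsewhere notes the real byte offsets must also skip the dictionary-size and offset fields, so the `+ 1` header shift below is illustrative only):

```rust
use std::ops::Range;

/// Toy sketch of the slice-then-validate step in get_field_by_offset: shift
/// the stored range past the header byte, slice, and UTF-8 check. The `+ 1`
/// shift is illustrative; real metadata has more bytes before the strings.
fn field_by_offset(metadata: &[u8], range: Range<usize>) -> Result<&str, String> {
    let bytes = metadata
        .get(1 + range.start..1 + range.end)
        .ok_or_else(|| format!("range {range:?} out of bounds"))?;
    std::str::from_utf8(bytes).map_err(|_| "invalid UTF-8 string".to_string())
}

fn main() {
    // Hypothetical buffer: one header byte, then the bytes of "key".
    let metadata = [0x01u8, b'k', b'e', b'y'];
    assert_eq!(field_by_offset(&metadata, 0..3), Ok("key"));
    assert!(field_by_offset(&metadata, 0..9).is_err());
    println!("ok");
}
```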
Re: [PR] API for reading Variant data and metadata [arrow-rs]
alamb commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103238782 ## parquet-variant/src/decoder.rs: ## @@ -0,0 +1,199 @@ +// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452 +// And the feedback there. +use crate::variant::VariantType; +use arrow_schema::ArrowError; +use std::{array::TryFromSliceError, str}; + +#[derive(Debug, Clone, Copy)] +pub enum VariantBasicType { +Primitive = 0, +ShortString = 1, +Object = 2, +Array = 3, +} + +#[derive(Debug, Clone, Copy)] +pub enum VariantPrimitiveType { +Null = 0, +BooleanTrue = 1, +BooleanFalse = 2, Review Comment: I agree encoding boolean like this is somewhat strange, but I do think it makes sense to follow the spec closely here as this PR does. For reference: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types

| Equivalence Class | Variant Physical Type | Type ID | Equivalent Parquet Type | Binary format |
| -- | -- | -- | -- | -- |
| NullType | null | 0 | UNKNOWN | none |
| Boolean | boolean (True) | 1 | BOOLEAN | none |
| Boolean | boolean (False) | 2 | BOOLEAN | none |
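Per the type-ID table alamb quotes, the two boolean singletons are distinguished purely by primitive type ID. A standalone sketch of reading them back from a primitive header byte (basic type in the low 2 bits, primitive type ID in the high 6 bits; function name hypothetical):

```rust
/// Sketch (not the PR's API): recover the boolean singletons from a primitive
/// header byte. Low 2 bits = basic type (0 = primitive); high 6 bits = type ID.
fn decode_bool(header: u8) -> Option<bool> {
    if header & 0x03 != 0 {
        return None; // not a primitive value
    }
    match header >> 2 {
        1 => Some(true),  // boolean (True), type ID 1
        2 => Some(false), // boolean (False), type ID 2
        _ => None,
    }
}

fn main() {
    assert_eq!(decode_bool(1 << 2), Some(true));
    assert_eq!(decode_bool(2 << 2), Some(false));
    assert_eq!(decode_bool(3 << 2), None); // Int8's type ID, not a boolean
    println!("ok");
}
```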
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103236238 ## parquet-variant/src/decoder.rs: ## @@ -0,0 +1,199 @@ +// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452 +// And the feedback there. +use crate::variant::VariantType; +use arrow_schema::ArrowError; +use std::{array::TryFromSliceError, str}; + +#[derive(Debug, Clone, Copy)] +pub enum VariantBasicType { +Primitive = 0, +ShortString = 1, +Object = 2, +Array = 3, +} + +#[derive(Debug, Clone, Copy)] +pub enum VariantPrimitiveType { +Null = 0, +BooleanTrue = 1, +BooleanFalse = 2, Review Comment: IMO that specific spec wording is really weird (it's already in the `Boolean` equivalence class and has `BOOLEAN` equivalent parquet type). And JSON just calls them `true` and `false` (singletons, similar to `null`). But ultimately I doubt the variant name we choose will matter much.
Re: [PR] API for reading Variant data and metadata [arrow-rs]
alamb commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103231456 ## parquet-variant/src/variant.rs: ## +/// Gets the field using an offset (Range) - helper method to keep consistent API. +pub fn get_field_by_offset(&self, offset: Range<usize>) -> Result<&'m str, ArrowError> { Review Comment: ```suggestion fn get_field_by_offset(&self, offset: Range<usize>) -> Result<&'m str, ArrowError> { ```
Re: [PR] API for reading Variant data and metadata [arrow-rs]
alamb commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103135519 ## parquet-variant/src/decoder.rs: ## @@ -0,0 +1,149 @@ +// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452 +// And the feedback there. +use arrow_schema::ArrowError; +use std::{array::TryFromSliceError, str}; + +use crate::utils::{array_from_slice, invalid_utf8_err, non_empty_slice, slice_from_slice}; + +#[derive(Debug, Clone, Copy)] +pub enum VariantBasicType { +Primitive = 0, +ShortString = 1, +Object = 2, +Array = 3, +} + +#[derive(Debug, Clone, Copy)] +pub enum VariantPrimitiveType { +Null = 0, +BooleanTrue = 1, +BooleanFalse = 2, +Int8 = 3, +// TODO: Add types for the rest of primitives, once API is agreed upon +String = 16, +} + +/// Extracts the basic type from a header byte +pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> { +// See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding +let basic_type = header & 0x03; // Basic type is encoded in the first 2 bits +let basic_type = match basic_type { +0 => VariantBasicType::Primitive, +1 => VariantBasicType::ShortString, +2 => VariantBasicType::Object, +3 => VariantBasicType::Array, +_ => { +return Err(ArrowError::InvalidArgumentError(format!( +"unknown basic type: {}", +basic_type +))) +} +}; +Ok(basic_type) +} + +/// Extracts the primitive type from a header byte +pub(crate) fn get_primitive_type(header: u8) -> Result<VariantPrimitiveType, ArrowError> { Review Comment: @scovich 's comments got me thinking we could implement this as `impl TryFrom<u8> for VariantPrimitiveType`, which I think would be more "idiomatic" rust and would make for a better look. Similarly for other APIs in this module ## parquet-variant/src/test_variant.rs: ## @@ -0,0 +1,72 @@ +//! End-to-end check: (almost) every sample from apache/parquet-testing/variant +//! can be parsed into our `Variant`.
+ +// NOTE: We keep this file separate rather than a test mod inside variant.rs because it should be +// moved to the test folder later +use std::fs; +use std::path::{Path, PathBuf}; + +use crate::variant::{Variant, VariantMetadata}; +use arrow_schema::ArrowError; + +fn cases_dir() -> PathBuf { +Path::new(env!("CARGO_MANIFEST_DIR")) +.join("..") +.join("parquet-testing") +.join("variant") +} + +fn load_case(name: &str) -> Result<(Vec<u8>, Vec<u8>), ArrowError> { +let root = cases_dir(); +let meta = fs::read(root.join(format!("{name}.metadata")))?; +let val = fs::read(root.join(format!("{name}.value")))?; +Ok((meta, val)) +} + +fn get_primitive_cases() -> Vec<(&'static str, Variant<'static, 'static>)> { +vec![ +("primitive_boolean_false", Variant::BooleanFalse), +("primitive_boolean_true", Variant::BooleanTrue), +("primitive_int8", Variant::Int8(42)), +// Using the From trait +("primitive_string", Variant::from("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")), +// Using the From trait +("short_string", Variant::from("Less than 64 bytes (❤\u{fe0f} with utf8)")), +// TODO Reenable when https://github.com/apache/parquet-testing/issues/81 is fixed Review Comment: BTW I made a PR to fix this - https://github.com/apache/parquet-testing/pull/84 ## parquet-variant/src/decoder.rs: ##
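alamb's `impl TryFrom<u8> for VariantPrimitiveType` idea could look like the following sketch (error type simplified to `String`; the real code would presumably use `ArrowError`):

```rust
/// Sketch of alamb's suggestion: a TryFrom impl instead of a free function.
/// Variants mirror those in the PR; the error type is simplified to String.
#[derive(Debug, Clone, Copy, PartialEq)]
enum VariantPrimitiveType {
    Null = 0,
    BooleanTrue = 1,
    BooleanFalse = 2,
    Int8 = 3,
    String = 16,
}

impl TryFrom<u8> for VariantPrimitiveType {
    type Error = String;

    fn try_from(value: u8) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(VariantPrimitiveType::Null),
            1 => Ok(VariantPrimitiveType::BooleanTrue),
            2 => Ok(VariantPrimitiveType::BooleanFalse),
            3 => Ok(VariantPrimitiveType::Int8),
            16 => Ok(VariantPrimitiveType::String),
            other => Err(format!("unknown primitive type: {other}")),
        }
    }
}

fn main() {
    assert_eq!(VariantPrimitiveType::try_from(3u8), Ok(VariantPrimitiveType::Int8));
    assert!(VariantPrimitiveType::try_from(42u8).is_err());
    println!("ok");
}
```

One design consequence: callers get the conventional `u8::try_into()` spelling, and the error arm stays local to the impl rather than spread across helper functions.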
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103019756 ## parquet-variant/src/variant.rs: ## @@ -0,0 +1,321 @@ +use std::{borrow::Cow, ops::Index}; + +use crate::decoder::{self, get_variant_type}; +use arrow_schema::ArrowError; +use strum_macros::EnumDiscriminants; + +#[derive(Clone, Copy, Debug, PartialEq)] +/// Encodes the Variant Metadata, see the Variant spec file for more information +pub struct VariantMetadata<'m> { +bytes: &'m [u8], +} + +impl<'m> VariantMetadata<'m> { +/// View the raw bytes (needed by very low-level decoders) +#[inline] +pub const fn as_bytes(&self) -> &'m [u8] { +self.bytes +} + +/// Whether the dictionary keys are sorted and unique +pub fn is_sorted(&self) -> bool { +todo!() +} + +/// Get the dict length +pub fn dict_len(&self) -> Result<usize, ArrowError> { +let end_location = self.offset_size()? as usize + 1; +if self.bytes.len() < end_location { +let err_str = format!( +"Invalid metadata bytes, must have at least length {} but has length {}", +&end_location, +self.bytes.len() +); +return Err(ArrowError::InvalidArgumentError(err_str)); +} +let dict_len_bytes = &self.bytes[1..end_location]; +let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| { +ArrowError::InvalidArgumentError(format!( +"Unable to convert dictionary_size bytes into usize: {}", +e, +)) +})?); +Ok(dict_len) +} +pub fn version(&self) -> usize { +todo!() +} + +/// Get the offset by index +pub fn get_offset_by(&self, index: usize) -> Result<usize, ArrowError> { +todo!() +} + +/// Get the header byte, which has the following form +/// 7 6 5 4 3 0 +/// +---+---+---+---+ +/// header | | | |version| +/// +---+---+---+---+ +/// ^ ^ +/// | +-- sorted_strings +/// +-- offset_size_minus_one +/// The version is a 4-bit value that must always contain the value 1. +/// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique.
+/// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field. +/// - The actual number of bytes, offset_size, is offset_size_minus_one + 1 +pub fn header(&self) -> Result<u8, ArrowError> { +if self.bytes.is_empty() { +return Err(ArrowError::InvalidArgumentError( +"Can't get header from empty buffer".to_string(), +)); +} +Ok(self.bytes[0]) +} + +/// Get the offset_minus_one value from the header +pub fn offset_size_minus_one(&self) -> Result<u8, ArrowError> { +if self.bytes.is_empty() { +Err(ArrowError::InvalidArgumentError( +"Tried to get offset_size_minus_one from header, but self.bytes buffer is emtpy." +.to_string(), +)) +} else { +Ok(self.bytes[0] & (0b11 << 6)) // Grab the last 2 bits +} +} + +/// Get the offset_size +pub fn offset_size(&self) -> Result<u8, ArrowError> { +Ok(self.offset_size_minus_one()? + 1) +} + +/// Get the offsets as an iterator +// TODO: Do we want this kind of API? +// TODO: Test once API is agreed upon +pub fn offsets(&'m self) -> Result<impl Iterator<Item = (usize, usize)> + 'm, ArrowError> { +struct OffsetIterators<'m> { +buffer: &'m [u8], +total: usize, +seen: usize, +offset_size: usize, +} +impl<'m> Iterator for OffsetIterators<'m> { +type Item = (usize, usize); // (start, end) positions of the bytes + +// TODO: Check bounds here or ensure they're correct + +fn next(&mut self) -> Option<Self::Item> { +// +1 to skip the first offset +if self.seen < self.total { +let start = usize::from_le_bytes( +self.buffer[(self.seen ) * self.offset_size + 1 // +1 to skip header +..(self.seen ) * self.offset_size + 1] +.try_into() +.ok()?, +); +self.seen += 1; +let end = usize::from_le_bytes( +self.buffer[(self.seen ) * self.offset_size + 1 // +1 to skip header +..(self.seen ) * self.offset_size + 1] +.try_into() +.ok()?, +); + +Some((start, end)) +} else { +None +} +} +}
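The `offsets` iterator quoted above has the bounds bugs its TODOs flag: both ends of each slice compute the same index (an empty range), and `usize::from_le_bytes` needs exactly `size_of::<usize>()` bytes. A standalone sketch with the range arithmetic fixed, assuming an offset width `w` of 1..=4 bytes and little-endian values:

```rust
/// Sketch of the offsets iterator with the range arithmetic fixed: offset i
/// occupies bytes [1 + i*w, 1 + (i+1)*w) after the header byte, and
/// consecutive offsets pair up into (start, end) tuples.
fn offset_pairs(buffer: &[u8], total: usize, w: usize) -> impl Iterator<Item = (usize, usize)> + '_ {
    let read = move |i: usize| -> Option<usize> {
        let bytes = buffer.get(1 + i * w..1 + (i + 1) * w)?;
        // Widen to usize regardless of w by zero-extending (little-endian).
        let mut buf = [0u8; std::mem::size_of::<usize>()];
        buf[..w].copy_from_slice(bytes);
        Some(usize::from_le_bytes(buf))
    };
    (0..total).filter_map(move |i| Some((read(i)?, read(i + 1)?)))
}

fn main() {
    // One header byte, then offsets [0, 2, 5] with w = 1: two (start, end) pairs.
    let buffer = [0x00u8, 0, 2, 5];
    let pairs: Vec<_> = offset_pairs(&buffer, 2, 1).collect();
    assert_eq!(pairs, vec![(0, 2), (2, 5)]);
    println!("ok");
}
```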
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2103016796
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2102992366

## parquet-variant/src/decoder.rs: ##
@@ -0,0 +1,199 @@
+// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452
+// and the feedback there.
+use crate::variant::VariantType;
+use arrow_schema::ArrowError;
+use std::{array::TryFromSliceError, str};
+
+#[derive(Debug, Clone, Copy)]
+pub enum VariantBasicType {
+    Primitive = 0,
+    ShortString = 1,
+    Object = 2,
+    Array = 3,
+}
+
+#[derive(Debug, Clone, Copy)]
+pub enum VariantPrimitiveType {
+    Null = 0,
+    BooleanTrue = 1,
+    BooleanFalse = 2,
+    Int8 = 3,
+    // TODO: Add 'legs' for the rest of the primitives, once the API is agreed upon
+    String = 16,
+}
+
+/// Extracts the basic type from a header byte
+pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> {
+    // See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding
+    let basic_type = header & 0x03; // Basic type is encoded in the low 2 bits
+    let basic_type = match basic_type {
+        0 => VariantBasicType::Primitive,
+        1 => VariantBasicType::ShortString,
+        2 => VariantBasicType::Object,
+        3 => VariantBasicType::Array,
+        _ => {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "unknown basic type: {}",
+                basic_type
+            )))
+        }
+    };
+    Ok(basic_type)
+}
+
+/// Extracts the primitive type from a header byte
+pub(crate) fn get_primitive_type(header: u8) -> Result<VariantPrimitiveType, ArrowError> {
+    // See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding
+    // Primitive type is encoded in the upper 6 bits of the header byte
+    let primitive_type = (header >> 2) & 0x3F;
+    let primitive_type = match primitive_type {
+        0 => VariantPrimitiveType::Null,
+        1 => VariantPrimitiveType::BooleanTrue,
+        2 => VariantPrimitiveType::BooleanFalse,
+        3 => VariantPrimitiveType::Int8,
+        // TODO: Add 'legs' for the rest, once the API is agreed upon
+        16 => VariantPrimitiveType::String,
+        _ => {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "unknown primitive type: {}",
+                primitive_type
+            )))
+        }
+    };
+    Ok(primitive_type)
+}
+
+/// Extracts the variant type from the value section of a variant. The variant
+/// type is defined as the set of all basic types and all primitive types.
+pub fn get_variant_type(value: &[u8]) -> Result<VariantType, ArrowError> {
+    if value.is_empty() {
+        return Err(ArrowError::InvalidArgumentError(
+            "Tried to get variant type from empty buffer array".to_string(),
+        ));
+    }
+    let header = value[0];
+    let variant_type = match get_basic_type(header)? {
+        VariantBasicType::Primitive => match get_primitive_type(header)? {
+            VariantPrimitiveType::Null => VariantType::Null,
+            VariantPrimitiveType::Int8 => VariantType::Int8,
+            VariantPrimitiveType::BooleanTrue => VariantType::BooleanTrue,
+            VariantPrimitiveType::BooleanFalse => VariantType::BooleanFalse,
+            // TODO: Add 'legs' for the rest, once the API is agreed upon
+            VariantPrimitiveType::String => VariantType::String,
+        },
+        VariantBasicType::ShortString => VariantType::ShortString,
+        VariantBasicType::Object => VariantType::Object,
+        VariantBasicType::Array => VariantType::Array,
+    };
+    Ok(variant_type)
+}
+
+/// To be used in `map_err` when unpacking an integer from a slice of bytes.
+fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
+    ArrowError::InvalidArgumentError(e.to_string())
+}
+
+/// Constructs the error for an invalid UTF-8 string.
+fn invalid_utf8_err() -> ArrowError {
+    ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string())
+}
+
+/// Decodes an Int8 from the value section of a variant.
+pub(crate) fn decode_int8(value: &[u8]) -> Result<i8, ArrowError> {
+    if value.len() < 2 {
+        return Err(ArrowError::InvalidArgumentError(
+            "Value buffer is too short to decode an int8 (need a header byte plus one value byte)."
+                .to_string(),
+        ));
+    }
+    let value = i8::from_le_bytes([value[1]]);
+    Ok(value)
+}
+
+/// Decodes a long string from the value section of a variant.
+pub(crate) fn decode_long_string(value: &[u8]) -> Result<&str, ArrowError> {
+    if value.len() < 5 {
+        return Err(ArrowError::InvalidArgumentError(
+            "Tried to decode value buffer into long_string, but it's too short (len<5)."
+                .to_string(),
+        ));
+    }

Review Comment: Thanks! I also added exclusive range support and a `non_empty_slice(...) -> Result<&[u8], ArrowError>` for the case where we just want to, e.g., index the first byte and don't want an array.
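The value-header scheme in this hunk (basic type in the low 2 bits, primitive type id in the upper 6 bits) can be exercised in isolation. A small sketch of the same masks and shifts, outside the PR's enum and error-handling machinery:

```rust
// Sketch of the value-header decoding discussed above: the basic type
// lives in the low 2 bits, the primitive type id in the upper 6 bits.
fn basic_type(header: u8) -> u8 {
    header & 0x03
}

fn primitive_type_id(header: u8) -> u8 {
    (header >> 2) & 0x3F
}

fn main() {
    // Int8 primitive: basic type 0, primitive id 3 -> header 0b0000_1100
    let h = (3u8 << 2) | 0;
    assert_eq!(basic_type(h), 0);
    assert_eq!(primitive_type_id(h), 3);
    println!("basic={} primitive={}", basic_type(h), primitive_type_id(h));
}
```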
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2102988142

## parquet-variant/src/decoder.rs: ##
+let variant_type = match get_basic_type(header)? {
+    VariantBasicType::Primitive => match get_primitive_type(header)? {
+        VariantPrimitiveType::Null => VariantType::Null,

Review Comment: Removed, thanks @alamb
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2102987595

## parquet-variant/src/decoder.rs: ##
+#[derive(Debug, Clone, Copy)]
+pub enum VariantPrimitiveType {
+    Null = 0,
+    BooleanTrue = 1,
+    BooleanFalse = 2,

Review Comment: Just tried to keep it as close to the spec as possible, which writes boolean (True) and boolean (False). Also I write a lot of Python and `True/False` felt illegal in that regard. Will keep for now, happy to change if you guys prefer.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2102984307

## parquet-variant/src/decoder.rs: ##
+let primitive_type = match primitive_type {
+    0 => VariantPrimitiveType::Null,
+    1 => VariantPrimitiveType::BooleanTrue,
+    2 => VariantPrimitiveType::BooleanFalse,
+    3 => VariantPrimitiveType::Int8,

Review Comment: Thanks for sharing! It's a bit annoying to duplicate, but it's only a few and it's a (hopefully) locked spec so I don't mind doing it, and prefer the readability over the "right" way you show here, so keeping it as is.
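For context on the duplication being discussed: the match arms repeat the enum's discriminant values (`0 => ...Null`, `1 => ...BooleanTrue`, etc.). One duplication-free alternative, purely illustrative and not necessarily what the reviewer showed, compares against the enum's own `as u8` discriminants (`decode_basic_type` is a hypothetical name):

```rust
// Avoid repeating discriminants in a match by scanning the enum's own
// `as u8` values. Trades a tiny linear search for a single source of truth.
#[derive(Debug, Clone, Copy, PartialEq)]
enum VariantBasicType {
    Primitive = 0,
    ShortString = 1,
    Object = 2,
    Array = 3,
}

fn decode_basic_type(header: u8) -> Option<VariantBasicType> {
    use VariantBasicType::*;
    // The basic type is the low 2 bits of the header byte.
    [Primitive, ShortString, Object, Array]
        .into_iter()
        .find(|t| *t as u8 == header & 0x03)
}

fn main() {
    assert_eq!(decode_basic_type(0b10), Some(VariantBasicType::Object));
    println!("{:?}", decode_basic_type(0b10));
}
```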
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2102981820

## parquet-variant/src/decoder.rs: ##
@@ -0,0 +1,199 @@
+// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452
+// and the feedback there.

Review Comment: Sure, can do
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2102975752
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2101868041
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2101886230

## parquet-variant/src/decoder.rs: ##
@@ -0,0 +1,199 @@
+// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452
+// And the feedback there.
+use crate::variant::VariantType;
+use arrow_schema::ArrowError;
+use std::{array::TryFromSliceError, str};
+
+#[derive(Debug, Clone, Copy)]
+pub enum VariantBasicType {
+    Primitive = 0,
+    ShortString = 1,
+    Object = 2,
+    Array = 3,
+}
+
+#[derive(Debug, Clone, Copy)]
+pub enum VariantPrimitiveType {
+    Null = 0,
+    BooleanTrue = 1,
+    BooleanFalse = 2,
+    Int8 = 3,
+    // TODO: Add 'legs' for the rest of primitives, once API is agreed upon
+    String = 16,
+}
+
+/// Extracts the basic type from a header byte
+pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> {
+    // See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding
+    let basic_type = header & 0x03; // Basic type is encoded in the first 2 bits
+    let basic_type = match basic_type {
+        0 => VariantBasicType::Primitive,
+        1 => VariantBasicType::ShortString,
+        2 => VariantBasicType::Object,
+        3 => VariantBasicType::Array,
+        _ => {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "unknown basic type: {}",
+                basic_type
+            )))
+        }
+    };
+    Ok(basic_type)
+}
+
+/// Extracts the primitive type from a header byte
+pub(crate) fn get_primitive_type(header: u8) -> Result<VariantPrimitiveType, ArrowError> {
+    // See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding
+    // Primitive type is encoded in the last 6 bits of the header byte
+    let primitive_type = (header >> 2) & 0x3F;
+    let primitive_type = match primitive_type {
+        0 => VariantPrimitiveType::Null,
+        1 => VariantPrimitiveType::BooleanTrue,
+        2 => VariantPrimitiveType::BooleanFalse,
+        3 => VariantPrimitiveType::Int8,
+        // TODO: Add 'legs' for the rest, once API is agreed upon
+        16 => VariantPrimitiveType::String,
+        _ => {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "unknown primitive type: {}",
+                primitive_type
+            )))
+        }
+    };
+    Ok(primitive_type)
+}
+
+/// Extracts the variant type from the value section of a variant. The variant
+/// type is defined as the set of all basic types and all primitive types.
+pub fn get_variant_type(value: &[u8]) -> Result<VariantType, ArrowError> {
+    if value.is_empty() {
+        return Err(ArrowError::InvalidArgumentError(
+            "Tried to get variant type from empty buffer array".to_string(),
+        ));
+    }
+    let header = value[0];
+    let variant_type = match get_basic_type(header)? {
+        VariantBasicType::Primitive => match get_primitive_type(header)? {
+            VariantPrimitiveType::Null => VariantType::Null,
+            VariantPrimitiveType::Int8 => VariantType::Int8,
+            VariantPrimitiveType::BooleanTrue => VariantType::BooleanTrue,
+            VariantPrimitiveType::BooleanFalse => VariantType::BooleanFalse,
+            // TODO: Add 'legs' for the rest, once API is agreed upon
+            VariantPrimitiveType::String => VariantType::String,
+        },
+        VariantBasicType::ShortString => VariantType::ShortString,
+        VariantBasicType::Object => VariantType::Object,
+        VariantBasicType::Array => VariantType::Array,
+    };
+    Ok(variant_type)
+}
+
+/// To be used in `map_err` when unpacking an integer from a slice of bytes.
+fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
+    ArrowError::InvalidArgumentError(e.to_string())
+}
+
+/// Constructs the error message for an invalid UTF-8 string.
+fn invalid_utf8_err() -> ArrowError {
+    ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string())
+}
+
+/// Decodes an Int8 from the value section of a variant.
+pub(crate) fn decode_int8(value: &[u8]) -> Result<i8, ArrowError> {
+    // Need the header byte plus one payload byte, since the data starts at value[1]
+    if value.len() < 2 {
+        return Err(ArrowError::InvalidArgumentError(
+            "Value buffer too short to decode into int8.".to_string(),
+        ));
+    }
+    let value = i8::from_le_bytes([value[1]]);
+    Ok(value)
+}
+
+/// Decodes a long string from the value section of a variant.
+pub(crate) fn decode_long_string(value: &[u8]) -> Result<&str, ArrowError> {
+    if value.len() < 5 {
+        return Err(ArrowError::InvalidArgumentError(
+            "Tried to decode value buffer into long_string, but it's too short (len<5)."
+                .to_string(),
+        ));
+    }
+    let len =
+        u32::from_le_bytes(value[1..=4].try_into().map_err(map_try_from_slice_error)?) as usize;

Review Comment: @alamb is probably more qualified to answer the latter question here reg. arch support
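The decoders in the hunk above are easy to exercise in isolation. Below is a minimal, self-contained sketch of the int8 and long-string decoding paths; it uses `String` as the error type instead of `ArrowError` so it compiles without the crate, and the function names mirror the diff but are otherwise hypothetical:

```rust
// Sketch of the two value decoders discussed above (int8 and long string).
// Assumption: `String` errors stand in for ArrowError to stay self-contained.

fn decode_int8(value: &[u8]) -> Result<i8, String> {
    // value[0] is the header byte; the one-byte payload starts at value[1].
    match value {
        [_header, byte, ..] => Ok(i8::from_le_bytes([*byte])),
        _ => Err("buffer too short for int8".to_string()),
    }
}

fn decode_long_string(value: &[u8]) -> Result<&str, String> {
    // Layout: 1 header byte, 4-byte little-endian length, then the UTF-8 bytes.
    if value.len() < 5 {
        return Err("buffer too short for long string (len < 5)".to_string());
    }
    // value[1..5] is exactly 4 bytes, so try_into cannot fail here.
    let len = u32::from_le_bytes(value[1..5].try_into().unwrap()) as usize;
    let bytes = value
        .get(5..5 + len)
        .ok_or_else(|| "length prefix exceeds buffer".to_string())?;
    std::str::from_utf8(bytes).map_err(|e| e.to_string())
}

fn main() {
    assert_eq!(decode_int8(&[0x0C, 0xFE]), Ok(-2));

    // Header 0x40 encodes primitive id 16 (String); length 2, payload "hi".
    let buf = [0x40, 2, 0, 0, 0, b'h', b'i'];
    assert_eq!(decode_long_string(&buf), Ok("hi"));
}
```

The long-string layout assumed here (header byte, 4-byte little-endian length, UTF-8 payload) matches the `value[1..=4]` length read in the diff.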
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2101869627

## parquet-variant/src/variant.rs: ##
@@ -0,0 +1,321 @@
+use std::{borrow::Cow, ops::Index};
+
+use crate::decoder::{self, get_variant_type};
+use arrow_schema::ArrowError;
+use strum_macros::EnumDiscriminants;
+
+#[derive(Clone, Copy, Debug, PartialEq)]
+/// Encodes the Variant Metadata, see the Variant spec file for more information
+pub struct VariantMetadata<'m> {
+    bytes: &'m [u8],
+}
+
+impl<'m> VariantMetadata<'m> {
+    /// View the raw bytes (needed by very low-level decoders)
+    #[inline]
+    pub const fn as_bytes(&self) -> &'m [u8] {
+        self.bytes
+    }
+
+    /// Whether the dictionary keys are sorted and unique
+    pub fn is_sorted(&self) -> bool {
+        todo!()
+    }
+
+    /// Get the dict length
+    pub fn dict_len(&self) -> Result<usize, ArrowError> {
+        let end_location = self.offset_size()? as usize + 1;
+        if self.bytes.len() < end_location {
+            let err_str = format!(
+                "Invalid metadata bytes, must have at least length {} but has length {}",
+                &end_location,
+                self.bytes.len()
+            );
+            return Err(ArrowError::InvalidArgumentError(err_str));
+        }
+        let dict_len_bytes = &self.bytes[1..end_location];
+        let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| {
+            ArrowError::InvalidArgumentError(format!(
+                "Unable to convert dictionary_size bytes into usize: {}",
+                e,
+            ))
+        })?);
+        Ok(dict_len)
+    }
+
+    pub fn version(&self) -> usize {
+        todo!()
+    }
+
+    /// Get the offset by index
+    pub fn get_offset_by(&self, index: usize) -> Result<usize, ArrowError> {
+        todo!()
+    }
+
+    /// Get the header byte, which has the following form
+    ///              7   6   5   4   3       0
+    ///            +---+---+---+---+---------+
+    /// header     |   |   |   |   | version |
+    ///            +---+---+---+---+---------+
+    ///              ^           ^
+    ///              |           +-- sorted_strings
+    ///              +-- offset_size_minus_one
+    /// The version is a 4-bit value that must always contain the value 1.
+    /// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique.
+    /// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field.
+    /// - The actual number of bytes, offset_size, is offset_size_minus_one + 1
+    pub fn header(&self) -> Result<u8, ArrowError> {
+        if self.bytes.is_empty() {
+            return Err(ArrowError::InvalidArgumentError(
+                "Can't get header from empty buffer".to_string(),
+            ));
+        }
+        Ok(self.bytes[0])
+    }
+
+    /// Get the offset_size_minus_one value from the header
+    pub fn offset_size_minus_one(&self) -> Result<u8, ArrowError> {
+        if self.bytes.is_empty() {
+            Err(ArrowError::InvalidArgumentError(
+                "Tried to get offset_size_minus_one from header, but self.bytes buffer is empty."
+                    .to_string(),
+            ))
+        } else {
+            Ok((self.bytes[0] >> 6) & 0b11) // Grab the top 2 bits
+        }
+    }
+
+    /// Get the offset_size
+    pub fn offset_size(&self) -> Result<u8, ArrowError> {
+        Ok(self.offset_size_minus_one()? + 1)
+    }
+
+    /// Get the offsets as an iterator
+    // TODO: Do we want this kind of API?
+    // TODO: Test once API is agreed upon
+    pub fn offsets(&'m self) -> Result<impl Iterator<Item = (usize, usize)> + 'm, ArrowError> {
+        struct OffsetIterators<'m> {
+            buffer: &'m [u8],
+            total: usize,
+            seen: usize,
+            offset_size: usize,
+        }
+        impl<'m> Iterator for OffsetIterators<'m> {
+            type Item = (usize, usize); // (start, end) positions of the bytes
+
+            // TODO: Check bounds here or ensure they're correct
+            fn next(&mut self) -> Option<Self::Item> {
+                // +1 to skip the first offset
+                if self.seen < self.total {
+                    let start = usize::from_le_bytes(
+                        self.buffer[self.seen * self.offset_size + 1 // +1 to skip header
+                            ..(self.seen + 1) * self.offset_size + 1]
+                            .try_into()
+                            .ok()?,
+                    );
+                    self.seen += 1;
+                    let end = usize::from_le_bytes(
+                        self.buffer[self.seen * self.offset_size + 1 // +1 to skip header
+                            ..(self.seen + 1) * self.offset_size + 1]
+                            .try_into()
+                            .ok()?,
+                    );
+
+                    Some((start, end))
+                } else {
+                    None
+                }
+            }
+        }
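The header layout documented above (version in bits 0..=3, sorted_strings in bit 4, offset_size_minus_one in bits 6..=7) can be unpacked with shifts and masks. A standalone sketch, independent of the PR's types, with `String` errors as a stand-in:

```rust
// Sketch of unpacking the metadata header byte per the layout quoted above.
// Assumption: bit 5 is unused/reserved; `String` replaces ArrowError.

#[derive(Debug, PartialEq)]
struct MetadataHeader {
    version: u8,
    is_sorted: bool,
    /// Number of bytes per dictionary_size / offset field (1..=4).
    offset_size: u8,
}

fn parse_header(byte: u8) -> Result<MetadataHeader, String> {
    let version = byte & 0x0F; // low 4 bits
    if version != 1 {
        return Err(format!("unsupported metadata version: {version}"));
    }
    Ok(MetadataHeader {
        version,
        is_sorted: (byte >> 4) & 0x01 != 0,
        // Shift before masking: masking alone (`byte & (0b11 << 6)`) would
        // yield 0/64/128/192 rather than 0..=3.
        offset_size: ((byte >> 6) & 0b11) + 1,
    })
}

fn main() {
    // 0b0101_0001: offset_size_minus_one = 1, sorted_strings = 1, version = 1
    let h = parse_header(0b0101_0001).unwrap();
    assert_eq!(h.version, 1);
    assert!(h.is_sorted);
    assert_eq!(h.offset_size, 2);

    assert!(parse_header(0b0000_0010).is_err()); // version 2 is rejected
}
```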
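Elsewhere in this thread, `VariantMetadata::try_new` validates that the dictionary offsets are monotonically increasing. A simplified standalone version of that check, assuming `offset_size == 1` and `String` errors:

```rust
// Standalone sketch of the "offsets must be monotonically increasing" check.
// Assumption: offset_size == 1, so dictionary_size and each offset are 1 byte.

fn read_field(bytes: &[u8], index: usize) -> Result<usize, String> {
    // Byte 0 is the header; byte 1 is dictionary_size; offsets follow.
    bytes
        .get(1 + index)
        .map(|b| *b as usize)
        .ok_or_else(|| format!("field {index} out of bounds"))
}

fn validate_offsets(bytes: &[u8]) -> Result<(), String> {
    let dict_size = read_field(bytes, 0)?; // first field is dictionary_size
    let mut prev = None;
    // dict_size + 1 offsets delimit dict_size dictionary strings.
    for i in 0..=dict_size {
        let cur = read_field(bytes, i + 1)?;
        if let Some(p) = prev {
            if cur <= p {
                return Err("Offsets are not monotonically increasing".to_string());
            }
        }
        prev = Some(cur);
    }
    Ok(())
}

fn main() {
    // header, dictionary_size = 2, offsets 0, 3, 5, then string bytes "foobb"
    let good = [0b0000_0001, 2, 0, 3, 5, b'f', b'o', b'o', b'b', b'b'];
    assert!(validate_offsets(&good).is_ok());

    let bad = [0b0000_0001, 2, 0, 5, 3]; // 3 after 5: not increasing
    assert!(validate_offsets(&bad).is_err());
}
```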
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2101836364

## parquet-variant/Cargo.toml: ##
@@ -31,6 +31,9 @@
 edition = { workspace = true }
 rust-version = { workspace = true }

 [dependencies]
+arrow-schema = "55.1.0"
+strum = "0.27.1"

Review Comment: Thanks, merged it!
Re: [PR] API for reading Variant data and metadata [arrow-rs]
scovich commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2100806370

## parquet-variant/src/decoder.rs: ##
@@ -0,0 +1,199 @@
+// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452
+// And the feedback there.

Review Comment: Maybe better as a github file comment instead of a source code comment?

## parquet-variant/src/decoder.rs: ##
@@ -0,0 +1,199 @@
+// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452
+// And the feedback there.
+use crate::variant::VariantType;
+use arrow_schema::ArrowError;
+use std::{array::TryFromSliceError, str};
+
+#[derive(Debug, Clone, Copy)]
+pub enum VariantBasicType {
+    Primitive = 0,
+    ShortString = 1,
+    Object = 2,
+    Array = 3,
+}
+
+#[derive(Debug, Clone, Copy)]
+pub enum VariantPrimitiveType {
+    Null = 0,
+    BooleanTrue = 1,
+    BooleanFalse = 2,
+    Int8 = 3,
+    // TODO: Add 'legs' for the rest of primitives, once API is agreed upon
+    String = 16,
+}
+
+/// Extracts the basic type from a header byte
+pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> {
+    // See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding
+    let basic_type = header & 0x03; // Basic type is encoded in the first 2 bits
+    let basic_type = match basic_type {
+        0 => VariantBasicType::Primitive,
+        1 => VariantBasicType::ShortString,
+        2 => VariantBasicType::Object,
+        3 => VariantBasicType::Array,
+        _ => {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "unknown basic type: {}",
+                basic_type
+            )))
+        }

Review Comment: Isn't this case provably impossible to hit, given that we masked it to 2 bits and there are match arms for all four possible values? The compiler just isn't smart enough to notice (requires "value range propagation" optimizations) and/or the language forbids it (because technically the `match` is on the u8 type, not a particular u8 value).
```suggestion
_ => unreachable!(),
```
(The docs for [unreachable macro](https://doc.rust-lang.org/std/macro.unreachable.html#examples) specifically mention match arms as a use case)

## parquet-variant/src/decoder.rs: ##
@@ -0,0 +1,199 @@
+// NOTE: Largely based on the implementation of @PinkCrow007 in https://github.com/apache/arrow-rs/pull/7452
+// And the feedback there.
+use crate::variant::VariantType;
+use arrow_schema::ArrowError;
+use std::{array::TryFromSliceError, str};
+
+#[derive(Debug, Clone, Copy)]
+pub enum VariantBasicType {
+    Primitive = 0,
+    ShortString = 1,
+    Object = 2,
+    Array = 3,
+}
+
+#[derive(Debug, Clone, Copy)]
+pub enum VariantPrimitiveType {
+    Null = 0,
+    BooleanTrue = 1,
+    BooleanFalse = 2,
+    Int8 = 3,
+    // TODO: Add 'legs' for the rest of primitives, once API is agreed upon
+    String = 16,
+}
+
+/// Extracts the basic type from a header byte
+pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> {
+    // See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding
+    let basic_type = header & 0x03; // Basic type is encoded in the first 2 bits
+    let basic_type = match basic_type {
+        0 => VariantBasicType::Primitive,
+        1 => VariantBasicType::ShortString,
+        2 => VariantBasicType::Object,
+        3 => VariantBasicType::Array,
+        _ => {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "unknown basic type: {}",
+                basic_type
+            )))
+        }
+    };
+    Ok(basic_type)
+}
+
+/// Extracts the primitive type from a header byte
+pub(crate) fn get_primitive_type(header: u8) -> Result<VariantPrimitiveType, ArrowError> {
+    // See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding
+    // Primitive type is encoded in the last 6 bits of the header byte
+    let primitive_type = (header >> 2) & 0x3F;
+    let primitive_type = match primitive_type {
+        0 => VariantPrimitiveType::Null,
+        1 => VariantPrimitiveType::BooleanTrue,
+        2 => VariantPrimitiveType::BooleanFalse,
+        3 => VariantPrimitiveType::Int8,
+        // TODO: Add 'legs' for the rest, once API is agreed upon
+        16 => VariantPrimitiveType::String,
+        _ => {
+            return Err(ArrowError::InvalidArgumentError(format!(
+                "unknown primitive type: {}",
+                primitive_type
+            )))
+        }
+    };
+    Ok(primitive_type)
+}
+
+/// Extracts the variant type from the value section of a variant. The variant
+/// type is defined as the set of all basic types and all primitive types.
+pub fn get_variant_type(value: &[u8]) -> Result<VariantType, ArrowError> {
+    if value.is_empty() {
+        return Err(ArrowError::InvalidArgumentError(
+            "Tried to get varia
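The `unreachable!()` suggestion above rests on the 2-bit mask making the catch-all arm dead. A small sketch of the pattern; note it also lets the function return the enum directly rather than a `Result`:

```rust
// Sketch of the `unreachable!()` pattern suggested above: after masking to
// 2 bits, all four possible values have arms, so the catch-all cannot fire.

#[derive(Debug, PartialEq)]
enum BasicType {
    Primitive,
    ShortString,
    Object,
    Array,
}

fn basic_type(header: u8) -> BasicType {
    match header & 0x03 {
        0 => BasicType::Primitive,
        1 => BasicType::ShortString,
        2 => BasicType::Object,
        3 => BasicType::Array,
        // The compiler still requires this arm (the match is over the full
        // u8 range), but the mask guarantees it is dead code.
        _ => unreachable!(),
    }
}

fn main() {
    assert_eq!(basic_type(0b1111_1110), BasicType::Object); // low bits 0b10
    assert_eq!(basic_type(0x40), BasicType::Primitive); // low bits 0b00
}
```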
Re: [PR] API for reading Variant data and metadata [arrow-rs]
alamb commented on code in PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#discussion_r2101010184

## parquet-variant/src/variant.rs: ##
@@ -0,0 +1,321 @@
+use std::{borrow::Cow, ops::Index};
+
+use crate::decoder::{self, get_variant_type};
+use arrow_schema::ArrowError;
+use strum_macros::EnumDiscriminants;
+
+#[derive(Clone, Copy, Debug, PartialEq)]
+/// Encodes the Variant Metadata, see the Variant spec file for more information
+pub struct VariantMetadata<'m> {
+    bytes: &'m [u8],
+}
+
+impl<'m> VariantMetadata<'m> {
+    /// View the raw bytes (needed by very low-level decoders)
+    #[inline]
+    pub const fn as_bytes(&self) -> &'m [u8] {
+        self.bytes
+    }
+
+    /// Whether the dictionary keys are sorted and unique
+    pub fn is_sorted(&self) -> bool {
+        todo!()
+    }
+
+    /// Get the dict length
+    pub fn dict_len(&self) -> Result<usize, ArrowError> {
+        let end_location = self.offset_size()? as usize + 1;
+        if self.bytes.len() < end_location {
+            let err_str = format!(
+                "Invalid metadata bytes, must have at least length {} but has length {}",
+                &end_location,
+                self.bytes.len()
+            );
+            return Err(ArrowError::InvalidArgumentError(err_str));
+        }
+        let dict_len_bytes = &self.bytes[1..end_location];
+        let dict_len = usize::from_le_bytes(dict_len_bytes.try_into().map_err(|e| {
+            ArrowError::InvalidArgumentError(format!(
+                "Unable to convert dictionary_size bytes into usize: {}",
+                e,
+            ))
+        })?);
+        Ok(dict_len)
+    }
+
+    pub fn version(&self) -> usize {
+        todo!()
+    }
+
+    /// Get the offset by index
+    pub fn get_offset_by(&self, index: usize) -> Result<usize, ArrowError> {
+        todo!()
+    }
+
+    /// Get the header byte, which has the following form
+    ///              7   6   5   4   3       0
+    ///            +---+---+---+---+---------+
+    /// header     |   |   |   |   | version |
+    ///            +---+---+---+---+---------+
+    ///              ^           ^
+    ///              |           +-- sorted_strings
+    ///              +-- offset_size_minus_one
+    /// The version is a 4-bit value that must always contain the value 1.
+    /// - sorted_strings is a 1-bit value indicating whether dictionary strings are sorted and unique.
+    /// - offset_size_minus_one is a 2-bit value providing the number of bytes per dictionary size and offset field.
+    /// - The actual number of bytes, offset_size, is offset_size_minus_one + 1
+    pub fn header(&self) -> Result<u8, ArrowError> {
+        if self.bytes.is_empty() {
+            return Err(ArrowError::InvalidArgumentError(
+                "Can't get header from empty buffer".to_string(),
+            ));
+        }
+        Ok(self.bytes[0])
+    }
+
+    /// Get the offset_size_minus_one value from the header
+    pub fn offset_size_minus_one(&self) -> Result<u8, ArrowError> {
+        if self.bytes.is_empty() {
+            Err(ArrowError::InvalidArgumentError(
+                "Tried to get offset_size_minus_one from header, but self.bytes buffer is empty."
+                    .to_string(),
+            ))
+        } else {
+            Ok((self.bytes[0] >> 6) & 0b11) // Grab the top 2 bits
+        }
+    }
+
+    /// Get the offset_size
+    pub fn offset_size(&self) -> Result<u8, ArrowError> {
+        Ok(self.offset_size_minus_one()? + 1)
+    }
+
+    /// Get the offsets as an iterator
+    // TODO: Do we want this kind of API?
+    // TODO: Test once API is agreed upon
+    pub fn offsets(&'m self) -> Result<impl Iterator<Item = (usize, usize)> + 'm, ArrowError> {
+        struct OffsetIterators<'m> {
+            buffer: &'m [u8],
+            total: usize,

Review Comment: Maybe we could call this `dict_len` for consistency

## parquet-variant/src/variant.rs: ##
@@ -0,0 +1,321 @@
+use std::{borrow::Cow, ops::Index};
+
+use crate::decoder::{self, get_variant_type};
+use arrow_schema::ArrowError;
+use strum_macros::EnumDiscriminants;
+
+#[derive(Clone, Copy, Debug, PartialEq)]
+/// Encodes the Variant Metadata, see the Variant spec file for more information
+pub struct VariantMetadata<'m> {
+    bytes: &'m [u8],
+}
+
+impl<'m> VariantMetadata<'m> {
+    /// View the raw bytes (needed by very low-level decoders)
+    #[inline]
+    pub const fn as_bytes(&self) -> &'m [u8] {
+        self.bytes
+    }
+
+    /// Whether the dictionary keys are sorted and unique
+    pub fn is_sorted(&self) -> bool {
+        todo!()
+    }
+
+    /// Get the dict length
+    pub fn dict_len(&self) -> Result<usize, ArrowError> {
+        let end_location = self.offset_size()? as usize + 1;
+        if self.bytes.len() < end_location {
+            let err_str = format!(
+                "Invalid metadata bytes, must have at least length {} but has length {}",
+                &end_location,
+                self.bytes.len()
+            );
+            return Err(ArrowError:
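One pitfall in the quoted `dict_len`: `usize::from_le_bytes` requires exactly `size_of::<usize>()` bytes, while `offset_size` is only 1..=4, so a narrower field has to be zero-extended first. A sketch of such a read, with `unpack_usize` as a hypothetical helper named after the one used by the PR's `try_new`, and `String` errors for self-containment:

```rust
// Sketch of reading a dictionary_size / offset field that is 1..=4 bytes wide,
// zero-extended into a usize before calling usize::from_le_bytes.

fn unpack_usize(
    bytes: &[u8],
    byte_offset: usize, // number of bytes to skip (1 skips the header byte)
    index: usize,       // which fixed-width field to read
    width: usize,       // offset_size: bytes per field, 1..=4
) -> Result<usize, String> {
    assert!((1..=4).contains(&width)); // sketch: panic on a bad width
    let start = byte_offset + index * width;
    let field = bytes
        .get(start..start + width)
        .ok_or_else(|| format!("field {index} out of bounds"))?;
    // Zero-extend the little-endian field to the full usize width.
    let mut buf = [0u8; std::mem::size_of::<usize>()];
    buf[..width].copy_from_slice(field);
    Ok(usize::from_le_bytes(buf))
}

fn main() {
    // Header byte, then dictionary_size = 258 stored in 2 little-endian bytes.
    let metadata = [0b0101_0001u8, 0x02, 0x01];
    // Skip the header (byte_offset = 1); field 0 is dictionary_size.
    assert_eq!(unpack_usize(&metadata, 1, 0, 2), Ok(258));
    // Field 1 would run past the buffer, so it errors instead of panicking.
    assert!(unpack_usize(&metadata, 1, 1, 2).is_err());
}
```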
Re: [PR] API for reading Variant data and metadata [arrow-rs]
mkarbo commented on PR #7535: URL: https://github.com/apache/arrow-rs/pull/7535#issuecomment-2898171194

@alamb @PinkCrow007 fyi For first draft, it only contains a small subset of the primitive types, so we can get feedback on that before implementing decoding and conversion for all primitive types. Tried to incorporate and address as many comments from alamb and scovich in https://github.com/apache/arrow-rs/pull/7452 as possible, but can’t promise that all have been addressed, please let me know.