alamb commented on issue #7612: URL: https://github.com/apache/arrow-rs/issues/7612#issuecomment-3340429296
An update here: I have made significant progress on a "SansIO" type of API for the parquet readers, described here:
- https://github.com/apache/arrow-rs/issues/7983

The idea is that with those APIs, you could use your own IO routines (that is, not the AsyncRead traits).

We recently released a metadata parsing version here:
- https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaDataPushDecoder.html

This lets you decode ParquetMetaData with any IO (e.g. this example):

```rust
use bytes::Bytes;
use parquet::DecodeResult;
use parquet::errors::ParquetError;
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataPushDecoder};
use tokio::io::{AsyncRead, AsyncReadExt, AsyncSeek, AsyncSeekExt};

// This function decodes Parquet Metadata from anything that implements
// [`AsyncRead`] and [`AsyncSeek`] such as a tokio::fs::File
async fn decode_metadata(
    file_len: u64,
    mut async_source: impl AsyncRead + AsyncSeek + Unpin,
) -> Result<ParquetMetaData, ParquetError> {
    // We need a ParquetMetaDataPushDecoder to decode the metadata.
    let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
    loop {
        match decoder.try_decode() {
            // decode successful
            Ok(DecodeResult::Data(metadata)) => {
                return Ok(metadata);
            }
            Ok(DecodeResult::NeedsData(ranges)) => {
                // The decoder needs more data
                //
                // In this example we use the AsyncRead and AsyncSeek traits to read the
                // required ranges from the async source.
                let mut data = Vec::with_capacity(ranges.len());
                for range in &ranges {
                    let mut buffer = vec![0; (range.end - range.start) as usize];
                    async_source.seek(std::io::SeekFrom::Start(range.start)).await?;
                    async_source.read_exact(&mut buffer).await?;
                    data.push(Bytes::from(buffer));
                }
                // Push the data into the decoder and try to decode again on the next iteration.
                decoder.push_ranges(ranges, data).unwrap();
            }
            Ok(DecodeResult::Finished) => {
                unreachable!("returned metadata in previous match arm")
            }
            Err(e) => return Err(e),
        }
    }
}
```

I also have code to do the same for the actual parquet decoder here:
- https://github.com/apache/arrow-rs/pull/7997

Would this work better with OpenDAL?
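
To make the "SansIO" idea concrete without depending on the parquet crate, here is a minimal std-only sketch of the same push-decoder pattern. Everything in it is hypothetical and only for illustration: `ToyPushDecoder`, `ToyDecodeResult`, `decode_with_slice_io`, and the toy file layout (payload bytes followed by a 4-byte little-endian payload length) are not part of parquet or arrow-rs. The point is only that the decoder never performs IO itself; it reports which byte ranges it needs, and the caller fetches them however it likes.

```rust
use std::ops::Range;

/// Outcome of one decode attempt (hypothetical type, loosely shaped like the loop above).
#[derive(Debug, PartialEq)]
enum ToyDecodeResult {
    /// The caller must fetch these byte ranges and push them in.
    NeedsData(Vec<Range<u64>>),
    /// Decoding finished; this is the payload.
    Data(Vec<u8>),
}

/// Toy layout (an assumption for this sketch): payload, then a 4-byte LE payload length.
struct ToyPushDecoder {
    file_len: u64,
    buffers: Vec<(Range<u64>, Vec<u8>)>,
}

impl ToyPushDecoder {
    fn new(file_len: u64) -> Self {
        Self { file_len, buffers: Vec::new() }
    }

    /// Hand the decoder bytes that the caller fetched with its own IO.
    fn push_range(&mut self, range: Range<u64>, data: Vec<u8>) {
        assert_eq!(data.len() as u64, range.end - range.start);
        self.buffers.push((range, data));
    }

    /// Return the bytes for `range` if an already-pushed buffer covers it.
    fn get(&self, range: &Range<u64>) -> Option<&[u8]> {
        self.buffers.iter().find_map(|(have, data)| {
            if have.start <= range.start && range.end <= have.end {
                let offset = (range.start - have.start) as usize;
                let len = (range.end - range.start) as usize;
                Some(&data[offset..offset + len])
            } else {
                None
            }
        })
    }

    /// Try to decode with the bytes pushed so far; performs no IO.
    fn try_decode(&self) -> Result<ToyDecodeResult, String> {
        if self.file_len < 4 {
            return Err("file too short to hold the footer".to_string());
        }
        let body_end = self.file_len - 4;
        let footer = body_end..self.file_len;
        let Some(bytes) = self.get(&footer) else {
            return Ok(ToyDecodeResult::NeedsData(vec![footer]));
        };
        let mut raw = [0u8; 4];
        raw.copy_from_slice(bytes);
        let payload_len = u32::from_le_bytes(raw) as u64;
        if payload_len > body_end {
            return Err("footer claims more payload than the file holds".to_string());
        }
        let payload = (body_end - payload_len)..body_end;
        let found = self.get(&payload).map(|b| b.to_vec());
        match found {
            Some(bytes) => Ok(ToyDecodeResult::Data(bytes)),
            None => Ok(ToyDecodeResult::NeedsData(vec![payload])),
        }
    }
}

/// Drive the decoder with the simplest possible "IO": slicing an in-memory buffer.
fn decode_with_slice_io(file: &[u8]) -> Result<Vec<u8>, String> {
    let mut decoder = ToyPushDecoder::new(file.len() as u64);
    loop {
        match decoder.try_decode()? {
            ToyDecodeResult::Data(payload) => return Ok(payload),
            ToyDecodeResult::NeedsData(ranges) => {
                for range in ranges {
                    let bytes = file[range.start as usize..range.end as usize].to_vec();
                    decoder.push_range(range, bytes);
                }
            }
        }
    }
}
```

Swapping the slice reads in `decode_with_slice_io` for ranged reads from a file, an object store, or any other backend would leave the decoder untouched, which is the property the real push decoder is meant to offer.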
