alamb commented on issue #7612:
URL: https://github.com/apache/arrow-rs/issues/7612#issuecomment-3340429296

   An update here:
   
   I have made significant progress on a "SansIO" type of API for the Parquet
   readers, described here:
   - https://github.com/apache/arrow-rs/issues/7983
   
   The idea is that with those APIs, you can use your own IO routines (i.e.
   you are not tied to the AsyncRead traits).
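
To make the "bring your own IO" idea concrete, here is a minimal, std-only sketch of the push-decoder pattern. Everything in it is hypothetical and is not the parquet API: `TrailerDecoder`, `Step`, and the toy layout (`[payload][4-byte little-endian payload length]`) are invented for illustration. The decoder never reads anything itself; it only reports which byte range it needs next, and the caller fetches those bytes however it likes.

```rust
use std::ops::Range;

/// Result of one decode attempt. Hypothetical type for this sketch only.
#[derive(Debug, PartialEq)]
enum Step {
    /// The caller must fetch this byte range and push it into the decoder.
    NeedsData(Range<u64>),
    /// Decoding finished; this is the payload.
    Data(Vec<u8>),
}

/// Toy decoder for a made-up layout: `[payload][4-byte LE payload length]`.
/// It performs no IO; it only says which bytes it needs next.
struct TrailerDecoder {
    file_len: u64,
    payload_len: Option<u64>,
    payload: Option<Vec<u8>>,
}

impl TrailerDecoder {
    fn try_new(file_len: u64) -> Result<Self, String> {
        if file_len < 4 {
            return Err(format!("file of {} bytes has no trailer", file_len));
        }
        Ok(Self { file_len, payload_len: None, payload: None })
    }

    /// Return the payload if it has been pushed, else the next range needed.
    fn try_decode(&mut self) -> Step {
        let trailer_start = self.file_len - 4;
        if let Some(payload) = self.payload.take() {
            return Step::Data(payload);
        }
        match self.payload_len {
            None => Step::NeedsData(trailer_start..self.file_len),
            Some(len) => Step::NeedsData((trailer_start - len)..trailer_start),
        }
    }

    /// Accept the bytes for the range most recently requested.
    fn push(&mut self, bytes: &[u8]) -> Result<(), String> {
        match self.payload_len {
            None => {
                if bytes.len() != 4 {
                    return Err("trailer must be exactly 4 bytes".to_string());
                }
                let raw = [bytes[0], bytes[1], bytes[2], bytes[3]];
                let len = u64::from(u32::from_le_bytes(raw));
                if len > self.file_len - 4 {
                    return Err(format!("payload length {} exceeds file size", len));
                }
                self.payload_len = Some(len);
            }
            Some(len) => {
                if bytes.len() as u64 != len {
                    return Err("pushed bytes do not match requested range".to_string());
                }
                self.payload = Some(bytes.to_vec());
            }
        }
        Ok(())
    }
}

/// Drive the decoder with the simplest possible "IO": slicing a buffer.
fn decode_from_slice(file: &[u8]) -> Result<Vec<u8>, String> {
    let mut decoder = TrailerDecoder::try_new(file.len() as u64)?;
    loop {
        match decoder.try_decode() {
            Step::Data(payload) => return Ok(payload),
            Step::NeedsData(range) => {
                let bytes = &file[range.start as usize..range.end as usize];
                decoder.push(bytes)?;
            }
        }
    }
}
```

Only `decode_from_slice` knows where the bytes come from, so swapping the slice for a file, an object store client, or an async reader would leave the decoder untouched.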
   
   We recently released a metadata parsing version here:
   - https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaDataPushDecoder.html
   
   This lets you decode `ParquetMetaData` with any IO, for example:
   
   ```rust
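   // Import paths as found in recent parquet releases; adjust for your version.
   use bytes::Bytes;
   use parquet::DecodeResult;
   use parquet::errors::ParquetError;
   use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataPushDecoder};
   use tokio::io::{AsyncRead, AsyncReadExt, AsyncSeek, AsyncSeekExt};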
   // This function decodes Parquet Metadata from anything that implements
   // [`AsyncRead`] and [`AsyncSeek`] such as a tokio::fs::File
   async fn decode_metadata(
     file_len: u64,
     mut async_source: impl AsyncRead + AsyncSeek + Unpin
   ) -> Result<ParquetMetaData, ParquetError> {
     // We need a ParquetMetaDataPushDecoder to decode the metadata.
     let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len)?;
     loop {
       match decoder.try_decode() {
          // Decode successful
          Ok(DecodeResult::Data(metadata)) => return Ok(metadata),
          Ok(DecodeResult::NeedsData(ranges)) => {
             // The decoder needs more data
             //
             // In this example we use the AsyncRead and AsyncSeek traits to
             // read the required ranges from the async source.
             let mut data = Vec::with_capacity(ranges.len());
             for range in &ranges {
               let mut buffer = vec![0; (range.end - range.start) as usize];
               async_source.seek(std::io::SeekFrom::Start(range.start)).await?;
               async_source.read_exact(&mut buffer).await?;
               data.push(Bytes::from(buffer));
             }
             // Push the data into the decoder and try to decode again on the
             // next iteration.
             decoder.push_ranges(ranges, data)?;
          }
          Ok(DecodeResult::Finished) => {
             unreachable!("returned metadata in previous match arm")
          }
          Err(e) => return Err(e),
       }
     }
   }
   ```
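
Since the decoder only asks for byte ranges, the fetch step does not have to be async. As a rough sketch of the same range-reading loop with blocking `std::io` traits: `read_ranges` is a hypothetical helper, not part of the parquet crate, and it returns plain `Vec<u8>` buffers that would still need converting to `Bytes` before being passed to `push_ranges`.

```rust
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;

/// Read each requested byte range from a blocking source.
/// Hypothetical helper for illustration; not part of the parquet crate.
fn read_ranges<R: Read + Seek>(
    source: &mut R,
    ranges: &[Range<u64>],
) -> io::Result<Vec<Vec<u8>>> {
    let mut data = Vec::with_capacity(ranges.len());
    for range in ranges {
        // Reject inverted ranges instead of underflowing.
        let len = range.end.checked_sub(range.start).ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "range end precedes start")
        })?;
        let mut buffer = vec![0u8; len as usize];
        source.seek(SeekFrom::Start(range.start))?;
        source.read_exact(&mut buffer)?;
        data.push(buffer);
    }
    Ok(data)
}
```

Anything that implements `Read + Seek` works as the source, for example a `std::fs::File` or an in-memory `std::io::Cursor`.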
   
   I also have code to do the same for the actual parquet decoder here:
   - https://github.com/apache/arrow-rs/pull/7997
   
   Would this work better with OpenDAL?


