sdd commented on code in PR #364:
URL: https://github.com/apache/iceberg-rust/pull/364#discussion_r1596342297
##########
crates/iceberg/src/io.rs:
##########
@@ -206,6 +205,35 @@ impl FileIO {
}
}
+/// The struct the represents the metadata of a file.
+///
+/// TODO: we can add last modified time, content type, etc. in the future.
+pub struct FileMetadata {
+ /// The size of the file.
+ pub size: u64,
+}
+
+/// Trait for reading file.
+///
+/// # TODO
+///
+/// It's possible for us to remove the async_trait, but we need to figure
+/// out how to handle the object safety.
+#[async_trait::async_trait]
+pub trait FileRead: Send + Unpin + 'static {
Review Comment:
Maybe we could call this `IcebergFileRead` or `IcebergRead`? I suppose
that's a bit redundant, since navigating to the `use` statement where it is
imported makes it clear that the trait is ours, but I'm conscious that there
are a lot of `Read` and `Write` traits dotted around, and a distinctive name
would make it easier to see at a glance which one is ours wherever it is used.
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -187,3 +197,43 @@ impl ArrowReader {
}
}
}
+
+/// ArrowFileReader is a wrapper around a FileRead that impls parquets AsyncFileReader.
+///
+/// # TODO
+///
+/// [ParquetObjectReader](https://docs.rs/parquet/latest/src/parquet/arrow/async_reader/store.rs.html#64) contains the following hints to speed up metadata loading, we can consider adding them to this struct:
+///
+/// - `metadata_size_hint`: Provide a hint as to the size of the parquet file's footer.
+/// - `preload_column_index`: Load the Column Index as part of [`Self::get_metadata`].
+/// - `preload_offset_index`: Load the Offset Index as part of [`Self::get_metadata`].
+struct ArrowFileReader<R: FileRead> {
+ meta: FileMetadata,
+ r: R,
+}
+
+impl<R: FileRead> ArrowFileReader<R> {
+ /// Create a new ArrowFileReader
+ fn new(meta: FileMetadata, r: R) -> Self {
+ Self { meta, r }
+ }
+}
+
+impl<R: FileRead> AsyncFileReader for ArrowFileReader<R> {
+ fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, parquet::errors::Result<Bytes>> {
+ Box::pin(
+ self.r
+ .read(range.start as _..range.end as _)
Review Comment:
This `range.start as _..range.end as _` looks a bit strange. Out of interest,
why do we need the casts here?
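
For context, a cast like this is usually needed when the two sides disagree on
the integer type of the range. The sketch below assumes that `get_bytes`
receives a `Range<usize>` (as in the diff) while the underlying `read` takes a
`Range<u64>`; that second part is an assumption, not something visible in this
hunk. `read_len` and `to_u64_range` are hypothetical helpers used only for
illustration.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a reader method that takes `u64` offsets.
/// It only returns the number of bytes that would be requested.
fn read_len(range: Range<u64>) -> u64 {
    range.end - range.start
}

/// Hypothetical helper that spells out the conversion the `as _` casts perform.
/// `usize` has no `From` conversion into `u64`, so an explicit cast is used.
fn to_u64_range(range: Range<usize>) -> Range<u64> {
    range.start as u64..range.end as u64
}

fn main() {
    let range: Range<usize> = 4..12;

    // `as _` asks the compiler to infer the target type (here `u64`)
    // from the parameter type of `read_len`.
    let inferred = read_len(range.start as _..range.end as _);

    // The same conversion with the target type written out.
    let explicit = read_len(to_u64_range(range));

    assert_eq!(inferred, explicit);
    println!("requested {} bytes", inferred);
}
```

If that assumption holds, a small named helper along these lines might read
more clearly at the call site than the inline `as _` casts.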
##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -91,12 +98,15 @@ impl ArrowReader {
Ok(try_stream! {
while let Some(Ok(task)) = tasks.next().await {
- let parquet_reader = file_io
- .new_input(task.data().data_file().file_path())?
+ let parquet_file = file_io
+ .new_input(task.data().data_file().file_path())?;
+ let parquet_metadata = parquet_file.metadata().await?;
+ let parquet_reader =parquet_file
Review Comment:
Can we use `futures::try_join!` here to run both of these futures
concurrently rather than sequentially?
```rust
let (parquet_metadata, parquet_reader) = try_join!(parquet_file.metadata(),
parquet_file.reader())?;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.