Re: [PR] feat: make file scan task serializable [iceberg-rust]

via GitHub Thu, 23 May 2024 01:05:37 -0700


Fokko commented on code in PR #377:
URL: https://github.com/apache/iceberg-rust/pull/377#discussion_r1611209532



##########
crates/iceberg/src/scan.rs:
##########
@@ -463,18 +464,19 @@ impl ManifestEvaluatorCache {
 }
 
 /// A task to scan part of file.
-#[derive(Debug)]
+#[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct FileScanTask {
-    data_manifest_entry: ManifestEntryRef,
+    data_file_path: String,

Review Comment:
   This change makes a lot of sense to me. The statistics are used in the 
planning phase to filter out files where possible. The task is then handed 
over to the query engine, which opens the actual file and can leverage the 
Parquet statistics to skip row groups and such.
   
   The task should be extended with delete files (for example, based on the 
upper and lower bounds we can efficiently prune unrelated positional deletes). 
Optional, but nice to have, would be a residual predicate. For example, if you 
filter on `date(created_at) = '2024-03-01' and user_id = 123`, then the first 
part of the predicate might already be satisfied by the partitioning of the 
table, and we just need to filter on `user_id`.
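
   To make the delete-file idea concrete, here is a rough sketch of the 
pruning I have in mind. `PositionalDeleteFile`, its fields, and 
`relevant_deletes` are hypothetical names for illustration, not existing 
iceberg-rust types. It assumes each positional delete file records inclusive 
lower and upper bounds for its `file_path` column and that paths compare 
lexicographically.

```rust
/// Hypothetical summary of a positional delete file (not an iceberg-rust type).
#[derive(Debug, Clone, PartialEq)]
pub struct PositionalDeleteFile {
    /// Location of the delete file itself.
    pub path: String,
    /// Assumed inclusive lower bound of the `file_path` column in this delete file.
    pub lower_file_path: String,
    /// Assumed inclusive upper bound of the `file_path` column in this delete file.
    pub upper_file_path: String,
}

/// Keep only the delete files whose `file_path` bounds could contain
/// `data_file_path`. Anything outside the bounds cannot reference the data
/// file, so it does not need to travel with the scan task.
pub fn relevant_deletes<'a>(
    data_file_path: &str,
    deletes: &'a [PositionalDeleteFile],
) -> Vec<&'a PositionalDeleteFile> {
    deletes
        .iter()
        .filter(|d| {
            d.lower_file_path.as_str() <= data_file_path
                && data_file_path <= d.upper_file_path.as_str()
        })
        .collect()
}
```

   A delete file that passes this check may still contain no rows for the 
data file, since the bounds only describe a range. The point is just that the 
planner can drop the clearly unrelated ones before the task is serialized.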



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

