peasee opened a new issue, #18211: URL: https://github.com/apache/datafusion/issues/18211
### Is your feature request related to a problem or challenge?

Currently, when a plan is produced for a `ListingTable` provider, it creates file groups based on the files as they existed when the plan was created. I have a use case involving an S3 bucket with object versioning enabled, where the files are frequently overwritten. In this scenario, if a file is updated between planning and execution (or during execution), the file ranges calculated for the file group scan no longer match the file, resulting in query failures.

It would be ideal if the plan could create file groups that match against specific etags or file versions.

### Describe the solution you'd like

The [`ListingOptions`](https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingOptions.html) struct for configuring a `ListingTable` could have a new property `object_versioning_type: Option<ObjectVersionType>`, where `ObjectVersionType` is:

```rust
enum ObjectVersionType {
    ETag,
    Version,
}
```

This value could be set with a builder method like `with_object_versioning_type()`. The default value is `None`. The value is passed through to the `ListingTable` when it is built. Each enum variant corresponds to matching against a specific property from the `object_store` object meta:

* `ETag`: stores the `e_tag` property, and performs a `get_opts` with `if_match: Some(e_tag)`
* `Version`: stores the `version` property, and performs a `get_opts` with `version: Some(version)`

Because this metadata must be encoded into the physical plan with the file groups (where the file scan ranges are stored), `DataSourceExec` must also be updated to support referencing a specific version or etag. Luckily, [`PartitionedFile`](https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.PartitionedFile.html) already stores the whole `ObjectMeta` object from `object_store`, so we only need to provide a configuration option to tell the `DataSourceExec` to respect the version/etag.
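As a rough illustration of the variant-to-option mapping described above, here is a minimal sketch. It assumes simplified stand-ins for `object_store`'s `ObjectMeta` and `GetOptions` (only the `e_tag`, `version`, `if_match` and `range` fields are modelled, and `range` is a plain `Range<u64>` rather than the real crate's type). The helper `pinned_get_options` is a hypothetical name, not an existing DataFusion or `object_store` API.

```rust
use std::ops::Range;

/// The enum proposed in this issue: which `ObjectMeta` property reads are pinned to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ObjectVersionType {
    ETag,
    Version,
}

/// Simplified stand-in for `object_store::ObjectMeta` (assumption: only the
/// two fields relevant to versioning are modelled here).
#[derive(Debug, Clone, Default)]
struct ObjectMeta {
    e_tag: Option<String>,
    version: Option<String>,
}

/// Simplified stand-in for `object_store::GetOptions` (assumption: only the
/// fields this sketch touches are modelled here).
#[derive(Debug, Clone, Default, PartialEq)]
struct GetOptions {
    if_match: Option<String>,
    version: Option<String>,
    range: Option<Range<u64>>,
}

/// Hypothetical helper: build the options for a ranged read that is pinned to
/// the object seen at planning time, according to the configured variant.
fn pinned_get_options(
    meta: &ObjectMeta,
    versioning: Option<ObjectVersionType>,
    range: Range<u64>,
) -> GetOptions {
    // Start from a plain ranged read, which is what `get_range` does today.
    let mut options = GetOptions {
        range: Some(range),
        ..Default::default()
    };
    match versioning {
        // `ETag`: the read should only succeed if the etag still matches.
        Some(ObjectVersionType::ETag) => options.if_match = meta.e_tag.clone(),
        // `Version`: read the exact object version recorded in the plan.
        Some(ObjectVersionType::Version) => options.version = meta.version.clone(),
        // `None` (the default): keep today's unpinned behaviour.
        None => {}
    }
    options
}
```

With this shape, a scan would call something like `pinned_get_options(&file.object_meta, config_value, range)` and hand the result to `get_opts`. When the store did not report an etag or version for the object, the corresponding option stays `None` and the read falls back to being unpinned. Whether that fallback should instead be an error is a design question this sketch does not settle.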
The setting from `with_object_versioning_type()` is therefore passed along from `FileScanConfigBuilder` -> `FileScanConfig`. The property can then be read from the `FileScanConfig` in the respective file reading locations, for example when constructing a [`ParquetOpener`](https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/opener.rs#L60).

When the scan then takes place, if the versioning property is supplied, it calls the respective `get_opts` object store functions instead of just `get_range` or `get_ranges`. For example, in `ParquetObjectReader::get_bytes`, `get_range` is replaced with `get_range_opts`.

`get_range_opts` doesn't currently exist though, so we must also create that! It should be relatively straightforward, [as `get_range` just pre-sets the `GetOptions`](https://docs.rs/object_store/latest/src/object_store/lib.rs.html#648):

```rust
async fn get_range(&self, location: &Path, range: Range<u64>) -> Result<Bytes> {
    let options = GetOptions {
        range: Some(range.into()),
        ..Default::default()
    };
    self.get_opts(location, options).await?.bytes().await
}
```

While I'm at it, I would also like to update `GetOptions` to support a builder pattern, although this is just a refactor.

### Describe alternatives you've considered

Instead of storing an enum for the object versioning type, we could support only one of the etag or version properties. I thought it would be easy to support both though, and doing so provides flexibility for object stores that support only one of them.

### Additional context

_No response_

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
