Re: [PR] fix: default values for native_datafusion scan [datafusion-comet]
andygrove merged PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: default values for native_datafusion scan [datafusion-comet]
parthchandra commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2105011816 ## native/core/src/parquet/mod.rs: ## @@ -715,6 +715,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat file_groups, None, data_filters, +None, Review Comment: From Spark javadoc - ``` org. apache. spark. sql. catalyst. util. ResolveDefaultColumns def constantFoldCurrentDefaultsToExistDefaults(tableSchema: StructType, statementType: String): StructType Finds "current default" expressions in CREATE/ REPLACE TABLE columns and constant-folds them. The results are stored in the "exists default" metadata of the same columns. For example, in the event of this statement: CREATE TABLE T(a INT, b INT DEFAULT 5 + 5) This method constant-folds the "current default" value, stored in the CURRENT_DEFAULT metadata of the "b" column, to "10", storing the result in the "exists default" value within the EXISTS_DEFAULT metadata of that same column. Meanwhile the "current default" metadata of this "b" column retains its original value of "5 + 5". The reason for constant-folding the EXISTS_DEFAULT is to make the end-user visible behavior the same, after executing an ALTER TABLE ADD COLUMNS command with DEFAULT value, as if the system had performed an exhaustive backfill of the provided value to all previously existing rows in the table instead. We choose to avoid doing such a backfill because it would be a time-consuming and costly operation. Instead, we elect to store the EXISTS_DEFAULT in the column metadata for future reference when querying data out of the data source. In turn, each data source then takes responsibility to provide the constant-folded value in the EXISTS_DEFAULT metadata for such columns where the value is not present in storage. ``` I'll assume that the default values you get are the 'existence' defaults -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: default values for native_datafusion scan [datafusion-comet]
parthchandra commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2101451928
##
native/core/src/parquet/parquet_exec.rs:
##
@@ -61,12 +63,14 @@ pub(crate) fn init_datasource_exec(
file_groups: Vec>,
projection_vector: Option>,
data_filters: Option>>,
+default_values: Option>,
session_timezone: &str,
) -> Result, ExecutionError> {
let (table_parquet_options, spark_parquet_options) =
get_options(session_timezone);
-let mut parquet_source =
ParquetSource::new(table_parquet_options).with_schema_adapter_factory(
-Arc::new(SparkSchemaAdapterFactory::new(spark_parquet_options)),
-);
+let mut parquet_source =
+
ParquetSource::new(table_parquet_options).with_schema_adapter_factory(Arc::new(
+SparkSchemaAdapterFactory::new(spark_parquet_options,
default_values),
Review Comment:
I don't think it makes sense to do that even though it might make the code.
a little bit simpler. `default_values` are not exactly options. But I'm not
going to argue if you choose to do it that way.
##
native/core/src/parquet/mod.rs:
##
@@ -715,6 +715,7 @@ pub unsafe extern "system" fn
Java_org_apache_comet_parquet_Native_initRecordBat
file_groups,
None,
data_filters,
+None,
Review Comment:
It's handled in the `ConstantColumnReader` which is shared between
native_comet and native_iceberg_compat.
Also see `ResolveDefaultColumns.getExistenceDefaultValues`. Not quite sure
what the difference between ExistenceDefaultValues and simply default values is.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
