Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
codecov-commenter commented on PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#issuecomment-2897811173 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `78.57143%` with `3 lines` in your changes missing coverage. Please review. > Project coverage is 56.19%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`75aa236`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/75aa236e625e0fe35225aafe5779c6b21eae907a?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 202 commits behind head on main. | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [.../scala/org/apache/comet/serde/QueryPlanSerde.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2FQueryPlanSerde.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9RdWVyeVBsYW5TZXJkZS5zY2FsYQ==) | 78.57% | [0 Missing and 3 partials :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1756 +/- ## + Coverage 56.12% 56.19% +0.06% - Complexity 976 1052 +76 Files 119 130 +11 Lines 1174312667 +924 Branches 2251 2350 +99 + Hits 6591 7118 +527 - Misses 4012 4358 +346 - Partials 1140 1191 +51 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :rocket: New features to boost your workflow: - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
mbutrovich commented on PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#issuecomment-2897762420 Something else to look at for Spark 3.4... ``` select column with default value (native_comet, native shuffle) *** FAILED *** (449 milliseconds) org.apache.spark.sql.AnalysisException: Failed to execute ALTER TABLE ADD COLUMNS command because the destination table column col2 has a DEFAULT value with type ByteType, but the statement provided a value of incompatible type IntegerType ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
mbutrovich commented on PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#issuecomment-2897657612 I accidentally pushed the int96 docs changes to the wrong branch :( I'll just leave it here since CI is already running. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
mbutrovich commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098980947 ## native/core/src/execution/planner.rs: ## @@ -1108,6 +1108,44 @@ impl PhysicalPlanner { .map(|expr| self.create_expr(expr, Arc::clone(&required_schema))) .collect(); +let default_values: Option> = if !scan +.default_values +.is_empty() +{ +// We have default values. Extract the two lists (same length) of values and +// indexes in the schema, and then create a HashMap to use in the SchemaMapper. +let default_values: Vec = scan +.default_values +.iter() +.map(|expr| { +let literal = self +.create_expr(expr, Arc::clone(&required_schema)) +.unwrap(); +literal +.as_any() +.downcast_ref::() +.ok_or_else(|| { +GeneralError("Expected literal of default value.".to_string()) +}) +.map(|literal| literal.value().clone()) +.unwrap() +}) +.collect::>(); Review Comment: Looks good, thank you for the suggestion! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
andygrove commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098863135 ## native/core/src/execution/planner.rs: ## @@ -1108,6 +1108,44 @@ impl PhysicalPlanner { .map(|expr| self.create_expr(expr, Arc::clone(&required_schema))) .collect(); +let default_values: Option> = if !scan +.default_values +.is_empty() +{ +// We have default values. Extract the two lists (same length) of values and +// indexes in the schema, and then create a HashMap to use in the SchemaMapper. +let default_values: Vec = scan +.default_values +.iter() +.map(|expr| { +let literal = self +.create_expr(expr, Arc::clone(&required_schema)) +.unwrap(); +literal +.as_any() +.downcast_ref::() +.ok_or_else(|| { +GeneralError("Expected literal of default value.".to_string()) +}) +.map(|literal| literal.value().clone()) +.unwrap() +}) +.collect::>(); Review Comment: nit: we can avoid unwrap here: ```suggestion let default_values: Result, DataFusionError> = scan .default_values .iter() .map(|expr| { let literal = self.create_expr(expr, Arc::clone(&required_schema))?; let df_literal = literal .as_any() .downcast_ref::() .ok_or_else(|| { GeneralError("Expected literal of default value.".to_string()) })?; Ok(df_literal.value().clone()) }) .collect(); let default_values = default_values?; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
mbutrovich commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098534086 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -2327,18 +2346,18 @@ object QueryPlanSerde extends Logging with CometExprShim { val requiredSchema = schema2Proto(scan.requiredSchema.fields) val dataSchema = schema2Proto(scan.relation.dataSchema.fields) - val data_schema_idxs = scan.requiredSchema.fields.map(field => { + val dataSchemaIndexes = scan.requiredSchema.fields.map(field => { Review Comment: Just fixing incorrectly formatted variables as I find them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
mbutrovich commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098532793 ## spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala: ## @@ -99,6 +100,56 @@ class CometFuzzTestSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } + test("select column with default value") { +// This test fails in Spark's vectorized Parquet reader for DECIMAL(36,18) or BINARY default values. +withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") { + // This test relies on two tables: 1) t1 the Parquet file generated by ParquetGenerator with random values, and + // 2) t2 is a new table created with one column which we add a second column with different types and random values. + // We use the schema and values of t1 to simplify random value generation for the default column value in t2. + val df = spark.read.parquet(filename) + df.createOrReplaceTempView("t1") + for (col <- df.columns + .slice(1, 14)) { // All the primitive columns based on ParquetGenerator.makeParquetFile. Review Comment: Thanks, I knew that was a hack and hadn't circled back to a more resilient way to do that. That looks great! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
andygrove commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098530916 ## spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala: ## @@ -99,6 +100,56 @@ class CometFuzzTestSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } + test("select column with default value") { +// This test fails in Spark's vectorized Parquet reader for DECIMAL(36,18) or BINARY default values. +withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") { + // This test relies on two tables: 1) t1 the Parquet file generated by ParquetGenerator with random values, and + // 2) t2 is a new table created with one column which we add a second column with different types and random values. + // We use the schema and values of t1 to simplify random value generation for the default column value in t2. + val df = spark.read.parquet(filename) + df.createOrReplaceTempView("t1") + for (col <- df.columns + .slice(1, 14)) { // All the primitive columns based on ParquetGenerator.makeParquetFile. Review Comment: this code could break if we change the parquet generator in the future. It would probably be best to filter instead. Something like: ```scala val columns = df.schema.fields.filter(f => !isComplexType(f.dataType)).map(_.name) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
mbutrovich commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098530199 ## native/core/src/parquet/schema_adapter.rs: ## @@ -196,15 +198,43 @@ impl SchemaMapper for SchemaMapping { // go through each field in the projected schema .fields() .iter() +.enumerate() // and zip it with the index that maps fields from the projected table schema to the // projected file schema in `batch` .zip(&self.field_mappings) // and for each one... -.map(|(field, file_idx)| { +.map(|((field_idx, field), file_idx)| { file_idx.map_or_else( -// If this field only exists in the table, and not in the file, then we know -// that it's null, so just return that. -|| Ok(new_null_array(field.data_type(), batch_rows)), Review Comment: Got rid of instantiating an entire null array in favor of a single null value for column. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
mbutrovich commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098529235 ## native/core/src/parquet/parquet_support.rs: ## @@ -60,9 +60,6 @@ pub struct SparkParquetOptions { pub allow_incompat: bool, /// Support casting unsigned ints to signed ints (used by Parquet SchemaAdapter) pub allow_cast_unsigned_ints: bool, -/// We also use the cast logic for adapting Parquet schemas, so this flag is used -/// for that use case -pub is_adapting_schema: bool, Review Comment: This is dead code from when we used the cast logic (and `CastOptions`) to handle Parquet type conversion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
mbutrovich commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098527578 ## native/core/src/parquet/mod.rs: ## @@ -715,6 +715,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat file_groups, None, data_filters, +None, Review Comment: As far as I can tell, missing columns for native_iceberg_compat are handled elsewhere and the DataSourceExec will never know about them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]
mbutrovich commented on code in PR #1756: URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098528439 ## native/core/src/parquet/parquet_exec.rs: ## @@ -61,12 +63,14 @@ pub(crate) fn init_datasource_exec( file_groups: Vec>, projection_vector: Option>, data_filters: Option>>, +default_values: Option>, session_timezone: &str, ) -> Result, ExecutionError> { let (table_parquet_options, spark_parquet_options) = get_options(session_timezone); -let mut parquet_source = ParquetSource::new(table_parquet_options).with_schema_adapter_factory( -Arc::new(SparkSchemaAdapterFactory::new(spark_parquet_options)), -); +let mut parquet_source = + ParquetSource::new(table_parquet_options).with_schema_adapter_factory(Arc::new( +SparkSchemaAdapterFactory::new(spark_parquet_options, default_values), Review Comment: We can discuss if it makes more sense to stick `default_values` inside of the `SparkParquetOptions` struct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org