Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-21 Thread via GitHub


codecov-commenter commented on PR #1756:
URL: 
https://github.com/apache/datafusion-comet/pull/1756#issuecomment-2897811173

   ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report
   Attention: Patch coverage is `78.57143%` with `3 lines` in your changes 
missing coverage. Please review.
   > Project coverage is 56.19%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`75aa236`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/75aa236e625e0fe35225aafe5779c6b21eae907a?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 202 commits behind head on main.
   
   | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines |
   |---|---|---|
   | [.../scala/org/apache/comet/serde/QueryPlanSerde.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2FQueryPlanSerde.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9RdWVyeVBsYW5TZXJkZS5zY2FsYQ==) | 78.57% | [0 Missing and 3 partials :warning:](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) |
   
   Additional details and impacted files
   
   
   ```diff
   @@             Coverage Diff              @@
   ##               main    #1756      +/-   ##
   ============================================
   + Coverage     56.12%   56.19%   +0.06%
   - Complexity      976     1052      +76
   ============================================
     Files           119      130      +11
     Lines         11743    12667     +924
     Branches       2251     2350      +99
   ============================================
   + Hits           6591     7118     +527
   - Misses         4012     4358     +346
   - Partials       1140     1191      +51
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1756?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   





Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-21 Thread via GitHub


mbutrovich commented on PR #1756:
URL: 
https://github.com/apache/datafusion-comet/pull/1756#issuecomment-2897762420

   Something else to look at for Spark 3.4...
   ```
   select column with default value (native_comet, native shuffle) *** FAILED *** (449 milliseconds)
     org.apache.spark.sql.AnalysisException: Failed to execute ALTER TABLE ADD COLUMNS command because the destination table column col2 has a DEFAULT value with type ByteType, but the statement provided a value of incompatible type IntegerType
   ```





Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-21 Thread via GitHub


mbutrovich commented on PR #1756:
URL: 
https://github.com/apache/datafusion-comet/pull/1756#issuecomment-2897657612

   I accidentally pushed the int96 docs changes to the wrong branch :(
   
   I'll just leave it here since CI is already running.





Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-20 Thread via GitHub


mbutrovich commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098980947


##
native/core/src/execution/planner.rs:
##
@@ -1108,6 +1108,44 @@ impl PhysicalPlanner {
                 .map(|expr| self.create_expr(expr, Arc::clone(&required_schema)))
                 .collect();
 
+            let default_values: Option<HashMap<usize, ScalarValue>> = if !scan
+                .default_values
+                .is_empty()
+            {
+                // We have default values. Extract the two lists (same length) of values and
+                // indexes in the schema, and then create a HashMap to use in the SchemaMapper.
+                let default_values: Vec<ScalarValue> = scan
+                    .default_values
+                    .iter()
+                    .map(|expr| {
+                        let literal = self
+                            .create_expr(expr, Arc::clone(&required_schema))
+                            .unwrap();
+                        literal
+                            .as_any()
+                            .downcast_ref::<Literal>()
+                            .ok_or_else(|| {
+                                GeneralError("Expected literal of default value.".to_string())
+                            })
+                            .map(|literal| literal.value().clone())
+                            .unwrap()
+                    })
+                    .collect::<Vec<ScalarValue>>();

Review Comment:
   Looks good, thank you for the suggestion!






Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-20 Thread via GitHub


andygrove commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098863135


##
native/core/src/execution/planner.rs:
##
@@ -1108,6 +1108,44 @@ impl PhysicalPlanner {
                 .map(|expr| self.create_expr(expr, Arc::clone(&required_schema)))
                 .collect();
 
+            let default_values: Option<HashMap<usize, ScalarValue>> = if !scan
+                .default_values
+                .is_empty()
+            {
+                // We have default values. Extract the two lists (same length) of values and
+                // indexes in the schema, and then create a HashMap to use in the SchemaMapper.
+                let default_values: Vec<ScalarValue> = scan
+                    .default_values
+                    .iter()
+                    .map(|expr| {
+                        let literal = self
+                            .create_expr(expr, Arc::clone(&required_schema))
+                            .unwrap();
+                        literal
+                            .as_any()
+                            .downcast_ref::<Literal>()
+                            .ok_or_else(|| {
+                                GeneralError("Expected literal of default value.".to_string())
+                            })
+                            .map(|literal| literal.value().clone())
+                            .unwrap()
+                    })
+                    .collect::<Vec<ScalarValue>>();

Review Comment:
   nit: we can avoid unwrap here:
   
   ```suggestion
            let default_values: Result<Vec<ScalarValue>, DataFusionError> = scan
                .default_values
                .iter()
                .map(|expr| {
                    let literal = self.create_expr(expr, Arc::clone(&required_schema))?;
                    let df_literal = literal
                        .as_any()
                        .downcast_ref::<Literal>()
                        .ok_or_else(|| {
                            GeneralError("Expected literal of default value.".to_string())
                        })?;
                    Ok(df_literal.value().clone())
                })
                .collect();
            let default_values = default_values?;
   ```
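   For context, a hedged sketch of the step described in the code comment above: building the map from schema index to default value that the SchemaMapper consumes. The helper name and the shape of the index list are illustrative assumptions, not code from this PR.

   ```rust
   use std::collections::HashMap;

   use datafusion::common::ScalarValue;

   // Hypothetical helper (not from the PR): pair each default value with its index in
   // the required schema so the SchemaMapper can look defaults up by field position.
   fn build_default_value_map(
       values: Vec<ScalarValue>,
       indexes: Vec<usize>, // assumed to be the same length as `values`
   ) -> Option<HashMap<usize, ScalarValue>> {
       if values.is_empty() {
           None
       } else {
           Some(indexes.into_iter().zip(values).collect())
       }
   }
   ```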






Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-20 Thread via GitHub


mbutrovich commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098534086


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -2327,18 +2346,18 @@ object QueryPlanSerde extends Logging with CometExprShim {
   val requiredSchema = schema2Proto(scan.requiredSchema.fields)
   val dataSchema = schema2Proto(scan.relation.dataSchema.fields)
 
-  val data_schema_idxs = scan.requiredSchema.fields.map(field => {
+  val dataSchemaIndexes = scan.requiredSchema.fields.map(field => {

Review Comment:
   Just fixing incorrectly formatted variable names as I find them.






Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-20 Thread via GitHub


mbutrovich commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098532793


##
spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala:
##
@@ -99,6 +100,56 @@ class CometFuzzTestSuite extends CometTestBase with AdaptiveSparkPlanHelper {
     }
   }
 
+  test("select column with default value") {
+    // This test fails in Spark's vectorized Parquet reader for DECIMAL(36,18) or BINARY default values.
+    withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
+      // This test relies on two tables: 1) t1, the Parquet file generated by ParquetGenerator with random values, and
+      // 2) t2, a new table created with one column, to which we add a second column with different types and random default values.
+      // We use the schema and values of t1 to simplify random value generation for the default column value in t2.
+      val df = spark.read.parquet(filename)
+      df.createOrReplaceTempView("t1")
+      for (col <- df.columns
+          .slice(1, 14)) { // All the primitive columns based on ParquetGenerator.makeParquetFile.

Review Comment:
   Thanks, I knew that was a hack and hadn't circled back to a more resilient 
way to do that. That looks great!






Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-20 Thread via GitHub


andygrove commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098530916


##
spark/src/test/scala/org/apache/comet/CometFuzzTestSuite.scala:
##
@@ -99,6 +100,56 @@ class CometFuzzTestSuite extends CometTestBase with AdaptiveSparkPlanHelper {
     }
   }
 
+  test("select column with default value") {
+    // This test fails in Spark's vectorized Parquet reader for DECIMAL(36,18) or BINARY default values.
+    withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
+      // This test relies on two tables: 1) t1, the Parquet file generated by ParquetGenerator with random values, and
+      // 2) t2, a new table created with one column, to which we add a second column with different types and random default values.
+      // We use the schema and values of t1 to simplify random value generation for the default column value in t2.
+      val df = spark.read.parquet(filename)
+      df.createOrReplaceTempView("t1")
+      for (col <- df.columns
+          .slice(1, 14)) { // All the primitive columns based on ParquetGenerator.makeParquetFile.

Review Comment:
   this code could break if we change the parquet generator in the future. It 
would probably be best to filter instead. Something like:
   
   ```scala
   val columns = df.schema.fields.filter(f => !isComplexType(f.dataType)).map(_.name)
   ```
   






Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-20 Thread via GitHub


mbutrovich commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098530199


##
native/core/src/parquet/schema_adapter.rs:
##
@@ -196,15 +198,43 @@ impl SchemaMapper for SchemaMapping {
             // go through each field in the projected schema
             .fields()
             .iter()
+            .enumerate()
             // and zip it with the index that maps fields from the projected table schema to the
             // projected file schema in `batch`
             .zip(&self.field_mappings)
             // and for each one...
-            .map(|(field, file_idx)| {
+            .map(|((field_idx, field), file_idx)| {
                 file_idx.map_or_else(
-                    // If this field only exists in the table, and not in the file, then we know
-                    // that it's null, so just return that.
-                    || Ok(new_null_array(field.data_type(), batch_rows)),

Review Comment:
   Got rid of instantiating an entire null array in favor of a single null value for the column.
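   For illustration, a minimal sketch of that approach, assuming a hypothetical `missing_column` helper rather than the PR's exact code: keep one `ScalarValue` per missing column (either the captured default or a typed null) and expand it to the batch size when the output column is built.

   ```rust
   use arrow::array::ArrayRef;
   use arrow::datatypes::Field;
   use datafusion::common::{Result, ScalarValue};

   // Hypothetical helper: materialize a column for a field that is missing from the
   // Parquet file, preferring a captured default value over a typed null scalar.
   fn missing_column(
       field: &Field,
       default: Option<&ScalarValue>,
       batch_rows: usize,
   ) -> Result<ArrayRef> {
       let scalar = match default {
           Some(v) => v.clone(),
           // No default recorded for this field: build a null scalar of the field's type.
           None => ScalarValue::try_from(field.data_type())?,
       };
       // Expand the single scalar to the batch size instead of pre-building a null array.
       scalar.to_array_of_size(batch_rows)
   }
   ```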






Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-20 Thread via GitHub


mbutrovich commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098529235


##
native/core/src/parquet/parquet_support.rs:
##
@@ -60,9 +60,6 @@ pub struct SparkParquetOptions {
     pub allow_incompat: bool,
     /// Support casting unsigned ints to signed ints (used by Parquet SchemaAdapter)
     pub allow_cast_unsigned_ints: bool,
-    /// We also use the cast logic for adapting Parquet schemas, so this flag is used
-    /// for that use case
-    pub is_adapting_schema: bool,

Review Comment:
   This is dead code from when we used the cast logic (and `CastOptions`) to 
handle Parquet type conversion.






Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-20 Thread via GitHub


mbutrovich commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098527578


##
native/core/src/parquet/mod.rs:
##
@@ -715,6 +715,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat
         file_groups,
         None,
         data_filters,
+        None,

Review Comment:
   As far as I can tell, missing columns for native_iceberg_compat are handled 
elsewhere and the DataSourceExec will never know about them.






Re: [PR] fix: default values for experimental native_datafusion scan [datafusion-comet]

2025-05-20 Thread via GitHub


mbutrovich commented on code in PR #1756:
URL: https://github.com/apache/datafusion-comet/pull/1756#discussion_r2098528439


##
native/core/src/parquet/parquet_exec.rs:
##
@@ -61,12 +63,14 @@ pub(crate) fn init_datasource_exec(
     file_groups: Vec<Vec<PartitionedFile>>,
     projection_vector: Option<Vec<usize>>,
     data_filters: Option<Vec<Arc<dyn PhysicalExpr>>>,
+    default_values: Option<HashMap<usize, ScalarValue>>,
     session_timezone: &str,
 ) -> Result<Arc<DataSourceExec>, ExecutionError> {
     let (table_parquet_options, spark_parquet_options) = get_options(session_timezone);
-    let mut parquet_source = ParquetSource::new(table_parquet_options).with_schema_adapter_factory(
-        Arc::new(SparkSchemaAdapterFactory::new(spark_parquet_options)),
-    );
+    let mut parquet_source =
+        ParquetSource::new(table_parquet_options).with_schema_adapter_factory(Arc::new(
+            SparkSchemaAdapterFactory::new(spark_parquet_options, default_values),

Review Comment:
   We can discuss if it makes more sense to stick `default_values` inside of 
the `SparkParquetOptions` struct.
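   A rough sketch of that alternative, assuming the `HashMap<usize, ScalarValue>` shape used elsewhere in this PR; existing fields other than the two shown in the diff above are elided:

   ```rust
   use std::collections::HashMap;

   use datafusion::common::ScalarValue;

   // Hypothetical shape if the default values lived on the options struct instead of
   // being a separate constructor argument to SparkSchemaAdapterFactory.
   pub struct SparkParquetOptions {
       pub allow_incompat: bool,
       pub allow_cast_unsigned_ints: bool,
       /// Default values keyed by index in the required schema (assumed shape).
       pub default_values: Option<HashMap<usize, ScalarValue>>,
       // ... remaining existing fields elided
   }
   ```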


