Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove merged PR #461: URL: https://github.com/apache/datafusion-comet/pull/461

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614789127

## core/src/execution/datafusion/expressions/cast.rs:

```diff
@@ -622,14 +590,89 @@ impl Cast {
             self.eval_mode,
             from_type,
             to_type,
-        )?
+        )
+        }
+        _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+            // use DataFusion cast only when we know that it is compatible with Spark
+            Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
         }
         _ => {
-            // when we have no Spark-specific casting we delegate to DataFusion
-            cast_with_options(&array, to_type, &CAST_OPTIONS)?
+            // we should never reach this code because the Scala code should be checking
+            // for supported cast operations and falling back to Spark for anything that
+            // is not yet supported
+            Err(CometError::Internal(format!(
+                "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+            )))
         }
     };
-    Ok(spark_cast(cast_result, from_type, to_type))
+    Ok(spark_cast(cast_result?, from_type, to_type))
+}
+
+/// Determines if DataFusion supports the given cast in a way that is
+/// compatible with Spark
+fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+    if from_type == to_type {
+        return true;
+    }
+    match from_type {
+        DataType::Boolean => matches!(
+            to_type,
+            DataType::Int8
+                | DataType::Int16
+                | DataType::Int32
+                | DataType::Int64
+                | DataType::Float32
+                | DataType::Float64
+                | DataType::Utf8
```

Review Comment: This test relies on a cast that we do not yet support and enables `COMET_CAST_ALLOW_INCOMPATIBLE` to allow it. I will revert the last change and add a comment about this.
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614788921

## core/src/execution/datafusion/expressions/cast.rs: @@ -622,14 +590,89 @@ (same diff context as in the comment above)

Review Comment: Removing that case causes a test failure:

```
- scalar subquery *** FAILED *** (8 seconds, 253 milliseconds)
  Cause: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 410.0 failed 1 times, most recent failure: Lost task 0.0 in stage 410.0 (TID 1286) (192.168.64.23 executor driver): org.apache.comet.CometNativeException: Execution error: Comet Internal Error: Native cast invoked for unsupported cast from Int32 to Decimal128(38, 10)
```
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614639502

## core/src/execution/datafusion/expressions/cast.rs:

```diff
@@ -622,14 +590,89 @@ impl Cast {
             self.eval_mode,
             from_type,
             to_type,
-        )?
+        )
+        }
+        _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+            // use DataFusion cast only when we know that it is compatible with Spark
+            Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
         }
         _ => {
-            // when we have no Spark-specific casting we delegate to DataFusion
-            cast_with_options(&array, to_type, &CAST_OPTIONS)?
+            // we should never reach this code because the Scala code should be checking
+            // for supported cast operations and falling back to Spark for anything that
+            // is not yet supported
+            Err(CometError::Internal(format!(
+                "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+            )))
         }
     };
-    Ok(spark_cast(cast_result, from_type, to_type))
+    Ok(spark_cast(cast_result?, from_type, to_type))
+}
+
+/// Determines if DataFusion supports the given cast in a way that is
+/// compatible with Spark
+fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+    if from_type == to_type {
+        return true;
+    }
+    match from_type {
+        DataType::Boolean => matches!(
+            to_type,
+            DataType::Int8
+                | DataType::Int16
+                | DataType::Int32
+                | DataType::Int64
+                | DataType::Float32
+                | DataType::Float64
+                | DataType::Utf8
+        ),
+        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
+            to_type,
+            DataType::Boolean
+                | DataType::Int8
+                | DataType::Int16
+                | DataType::Int32
+                | DataType::Int64
+                | DataType::Float32
+                | DataType::Float64
+                | DataType::Decimal128(_, _)
+                | DataType::Utf8
+        ),
+        DataType::Float32 | DataType::Float64 => matches!(
+            to_type,
+            DataType::Boolean
+                | DataType::Int8
+                | DataType::Int16
+                | DataType::Int32
+                | DataType::Int64
+                | DataType::Float32
+                | DataType::Float64
+        ),
+        DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(
+            to_type,
+            DataType::Int8
+                | DataType::Int16
+                | DataType::Int32
+                | DataType::Int64
+                | DataType::Float32
+                | DataType::Float64
+                | DataType::Decimal128(_, _)
+                | DataType::Decimal256(_, _)
+        ),
+        DataType::Utf8 => matches!(to_type, DataType::Binary),
+        DataType::Date32 => matches!(to_type, DataType::Utf8),
+        DataType::Timestamp(_, _) => {
+            matches!(
+                to_type,
+                DataType::Int64 | DataType::Date32 | DataType::Utf8 | DataType::Timestamp(_, _)
+            )
+        }
+        DataType::Binary => {
+            // note that this is not completely Spark compatible because
+            // DataFusion only supports binary data containing valid UTF-8 strings
+            matches!(to_type, DataType::Utf8)
+        }
+        _ => false,
```

Review Comment: Casting from `Int64` to `Int32` for `Try` is covered here:

```rust
DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
    to_type,
    DataType::Boolean
        | DataType::Int8
        | DataType::Int16
        | DataType::Int32
        | DataType::Int64
        | DataType::Float32
        | DataType::Float64
        | DataType::Decimal128(_, _)
        | DataType::Utf8
),
```
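The `Try` semantics referenced above can be illustrated with a minimal sketch. This is not Comet's implementation, just a standalone illustration of Spark TRY-mode narrowing: an `Int64` value outside the `Int32` range becomes NULL (`None`) instead of wrapping.

```rust
/// Sketch of Spark TRY-mode semantics for an Int64 -> Int32 cast:
/// out-of-range values yield NULL rather than a wrapped integer.
fn try_cast_i64_to_i32(v: i64) -> Option<i32> {
    // i32::try_from fails exactly when v is outside i32's range
    i32::try_from(v).ok()
}

fn main() {
    assert_eq!(try_cast_i64_to_i32(42), Some(42));
    // i32::MAX + 1 is out of range, so TRY cast produces NULL
    assert_eq!(try_cast_i64_to_i32(2_147_483_648), None);
}
```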
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614638361

## core/src/execution/datafusion/expressions/cast.rs: @@ -622,14 +590,89 @@ (same diff context as above, on the `DataType::Float32 | DataType::Float64` arm)

Review Comment: Yes, that is correct.
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1613955491

## core/src/execution/datafusion/expressions/cast.rs: @@ -622,14 +590,89 @@ (same diff context as above)

Review Comment: The current code says that DataFusion is compatible with Spark for all int types -> decimal:

```rust
DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
    to_type,
    DataType::Boolean
    ...
    | DataType::Decimal128(_, _)
```

However, this is actually not correct since DataFusion does not have overflow checks for int32 and int64 -> decimal and is not compatible with Spark. I will look at removing those.
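The incompatibility described above comes down to a bounds check. The following is a hedged, self-contained sketch (plain `i128` arithmetic, not Arrow's decimal kernels; the function name is hypothetical) of the check Spark performs when casting an integer to `Decimal128(p, s)`: the value scaled by `10^s` must still fit in `p` decimal digits, otherwise the cast overflows.

```rust
/// Sketch of a Spark-style overflow check for integer -> Decimal128(p, s).
/// Returns the unscaled decimal representation, or None on overflow.
fn int_to_decimal128(v: i64, precision: u8, scale: i8) -> Option<i128> {
    // scale the integer into the decimal's fixed-point representation
    let scaled = (v as i128).checked_mul(10i128.checked_pow(scale as u32)?)?;
    // the unscaled value must fit within `precision` decimal digits
    let max = 10i128.checked_pow(precision as u32)? - 1;
    if scaled.abs() <= max { Some(scaled) } else { None }
}

fn main() {
    // 123 fits in Decimal128(5, 2): stored as 12300
    assert_eq!(int_to_decimal128(123, 5, 2), Some(12300));
    // 1234 scaled to 123400 needs 6 digits, exceeding precision 5
    assert_eq!(int_to_decimal128(1234, 5, 2), None);
}
```

A cast that skips this check silently truncates or produces invalid decimals, which is the Spark incompatibility the comment is pointing at.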
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
kazuyukitanimura commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1613863992

## core/src/execution/datafusion/expressions/cast.rs: @@ -622,14 +590,89 @@ (same diff context as above)

Review Comment: So right now, there is no `Int8` to `Decimal128` cast supported, it looks like?
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
viirya commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612605807

## core/src/execution/datafusion/expressions/cast.rs:

```diff
@@ -644,14 +644,16 @@ impl Cast {
                 | DataType::Float32
                 | DataType::Float64
             ),
-            DataType::Decimal128(_, _) => matches!(
+            DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(
```

Review Comment: Ah, right, we temporarily upcast decimal 128 to decimal 256 to avoid the overflow issue.
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612492325

## core/src/execution/datafusion/expressions/cast.rs:

```diff
@@ -503,41 +503,37 @@ impl Cast {
     fn cast_array(&self, array: ArrayRef) -> DataFusionResult {
         let to_type = &self.data_type;
         let array = array_with_timezone(array, self.timezone.clone(), Some(to_type));
+        let from_type = array.data_type().clone();
+
+        // unpack dictionary string arrays first
+        // TODO: we are unpacking a dictionary-encoded array and then performing
+        // the cast. We could potentially improve performance here by casting the
+        // dictionary values directly without unpacking the array first, although this
+        // would add more complexity to the code
+        let array = match &from_type {
+            DataType::Dictionary(key_type, value_type)
+                if key_type.as_ref() == &DataType::Int32
+                    && (value_type.as_ref() == &DataType::Utf8
+                        || value_type.as_ref() == &DataType::LargeUtf8) =>
+            {
+                cast_with_options(&array, value_type.as_ref(), &CAST_OPTIONS)?
+            }
+            _ => array,
+        };
```

Review Comment: We were previously unpacking dictionary-encoded string arrays only for string to int and string to date. I just moved it earlier on so that we don't have to handle it specifically for certain casts from string.
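The unpacking step above can be sketched without Arrow. This is a simplified stand-in (plain `Vec`s instead of Arrow dictionary arrays) showing what "unpacking" means: each key indexes into the dictionary values, so downstream cast logic only ever sees a plain string array.

```rust
/// Sketch of unpacking a dictionary-encoded string column: keys index
/// into the values array, producing one materialized string per row.
fn unpack_dictionary(keys: &[usize], values: &[&str]) -> Vec<String> {
    keys.iter().map(|&k| values[k].to_string()).collect()
}

fn main() {
    // dictionary with 3 distinct values, 4 rows
    let values = ["1", "2", "3"];
    let keys = [0, 2, 2, 1];
    assert_eq!(unpack_dictionary(&keys, &values), vec!["1", "3", "3", "2"]);
}
```

The TODO in the diff notes the trade-off: casting the (usually much smaller) `values` array directly and keeping the keys would do less work, at the cost of more complex code.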
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#issuecomment-2128254128

This is ready for review now @viirya @parthchandra @kazuyukitanimura @huaxingao
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612478631

## core/src/execution/datafusion/expressions/cast.rs: @@ -622,14 +590,91 @@ (an earlier revision of the diff context above, whose timestamp arm reads:)

```diff
+        DataType::Utf8 => matches!(to_type, DataType::Binary),
+        DataType::Date32 => matches!(to_type, DataType::Utf8),
+        DataType::Timestamp(_, _) => {
+            // TODO need to add tests to see if we really do support all
+            // timestamp to timestamp conversions
```

Review Comment: I filed https://github.com/apache/datafusion-comet/issues/467 for adding timestamp to timestamp tests.
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612318862

## core/src/execution/datafusion/expressions/cast.rs:

```diff
@@ -644,14 +644,16 @@ impl Cast {
                 | DataType::Float32
                 | DataType::Float64
             ),
-            DataType::Decimal128(_, _) => matches!(
+            DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(
```

Review Comment: I think this is probably the root cause (in `planner.rs`):

```rust
// For some Decimal128 operations, we need wider internal digits.
// Cast left and right to Decimal256 and cast the result back to Decimal128
let left = Arc::new(Cast::new_without_timezone(
    left,
    DataType::Decimal256(p1, s1),
    EvalMode::Legacy,
));
let right = Arc::new(Cast::new_without_timezone(
    right,
    DataType::Decimal256(p2, s2),
    EvalMode::Legacy,
));
```
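The arithmetic behind the widening in `planner.rs` can be made concrete. As a sketch (using Spark's rule that a decimal multiplication needs roughly `p1 + p2 + 1` digits of precision; the constant and function here are illustrative, not Comet code), two full-precision `Decimal128` inputs overflow the 38-digit maximum, which is why intermediates are computed in `Decimal256`:

```rust
/// Sketch: required result precision for Decimal multiplication,
/// following Spark's p1 + p2 + 1 rule.
fn multiply_result_precision(p1: u8, p2: u8) -> u8 {
    p1 + p2 + 1
}

fn main() {
    // Decimal128 can hold at most 38 decimal digits
    const DECIMAL128_MAX_PRECISION: u8 = 38;
    // two full-precision inputs need 77 digits, so the product
    // cannot be computed safely in 128-bit decimals
    assert!(multiply_result_precision(38, 38) > DECIMAL128_MAX_PRECISION);
    // small precisions still fit without widening
    assert!(multiply_result_precision(5, 5) <= DECIMAL128_MAX_PRECISION);
}
```

This also explains why `Decimal128 -> Decimal256` casts reach the native cast path even though Spark itself has no 256-bit decimal type: the planner introduces them internally.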
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612317278

## core/src/execution/datafusion/expressions/cast.rs:

```diff
@@ -644,14 +644,16 @@ impl Cast {
                 | DataType::Float32
                 | DataType::Float64
             ),
-            DataType::Decimal128(_, _) => matches!(
+            DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(
```

Review Comment: If I remove the decimal 256 support here, I get this test failure:

```
- Fix NPE in partial decimal sum *** FAILED *** (119 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 190.0 failed 1 times, most recent failure: Lost task 0.0 in stage 190.0 (TID 468) (192.168.86.33 executor driver): org.apache.comet.CometNativeException: Execution error: Comet Internal Error: Native cast invoked for unsupported cast from Decimal128(5, 2) to Decimal256(5, 2)
```
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
viirya commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612162690

## core/src/execution/datafusion/expressions/cast.rs:

```diff
@@ -644,14 +644,16 @@ impl Cast {
                 | DataType::Float32
                 | DataType::Float64
             ),
-            DataType::Decimal128(_, _) => matches!(
+            DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(
```

Review Comment: Hmm, I think Spark doesn't have decimal 256.
[PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove opened a new pull request, #461: URL: https://github.com/apache/datafusion-comet/pull/461

## Which issue does this PR close?

N/A

## Rationale for this change

We have a catchall block in `cast.rs` that delegates to DataFusion for any cast that we don't have a specific match arm for. This is dangerous because it means we sometimes delegate to DataFusion for casts where DataFusion is not compatible with Spark, which can lead to data corruption and hard-to-debug issues such as https://github.com/apache/datafusion-comet/pull/383#issuecomment-2127417428

## What changes are included in this PR?

This PR introduces specific checks so that we only delegate to DataFusion for specific casts, and changes the catchall to return an error. I also improved the handling of dictionary-encoded string arrays so that they are unpacked early on; this would have prevented the issue in https://github.com/apache/datafusion-comet/pull/383#issuecomment-2127417428

## How are these changes tested?

Existing tests
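The dispatch pattern this PR describes can be sketched in miniature. This is a simplified stand-in (a toy `DataType` enum and string results instead of Comet's real types and Arrow kernels): Spark-specific arms would come first, a guard admits only known-compatible pairs into the DataFusion fallback, and everything else becomes an explicit internal error instead of a silent, possibly incompatible cast.

```rust
/// Toy stand-in for Arrow's DataType, limited to a few variants.
#[derive(Debug, PartialEq)]
enum DataType { Boolean, Int32, Utf8, Binary }

/// Guard: only these pairs are known to behave like Spark in DataFusion.
fn is_datafusion_spark_compatible(from: &DataType, to: &DataType) -> bool {
    matches!(
        (from, to),
        (DataType::Boolean, DataType::Int32) | (DataType::Utf8, DataType::Binary)
    )
}

/// Sketch of the guarded dispatch: compatible casts are delegated,
/// anything unexpected is surfaced as an internal error.
fn cast(from: &DataType, to: &DataType) -> Result<String, String> {
    match (from, to) {
        _ if from == to => Ok("no-op cast".to_string()),
        _ if is_datafusion_spark_compatible(from, to) => {
            Ok("delegate to DataFusion".to_string())
        }
        _ => Err(format!(
            "Native cast invoked for unsupported cast from {from:?} to {to:?}"
        )),
    }
}

fn main() {
    assert!(cast(&DataType::Boolean, &DataType::Int32).is_ok());
    // previously this would have silently delegated; now it is an error
    assert!(cast(&DataType::Binary, &DataType::Boolean).is_err());
}
```

Failing loudly here relies on the Scala planner having already fallen back to Spark for unsupported casts, so reaching the error arm indicates a planner bug rather than a user-facing failure.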