Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-25 Thread via GitHub


andygrove merged PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-25 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614789127


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -622,14 +590,89 @@ impl Cast {
                     self.eval_mode,
                     from_type,
                     to_type,
-                )?
+                )
+            }
+            _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+                // use DataFusion cast only when we know that it is compatible with Spark
+                Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
             }
             _ => {
-                // when we have no Spark-specific casting we delegate to DataFusion
-                cast_with_options(&array, to_type, &CAST_OPTIONS)?
+                // we should never reach this code because the Scala code should be checking
+                // for supported cast operations and falling back to Spark for anything that
+                // is not yet supported
+                Err(CometError::Internal(format!(
+                    "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+                )))
             }
         };
-        Ok(spark_cast(cast_result, from_type, to_type))
+        Ok(spark_cast(cast_result?, from_type, to_type))
+    }
+
+    /// Determines if DataFusion supports the given cast in a way that is
+    /// compatible with Spark
+    fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+        if from_type == to_type {
+            return true;
+        }
+        match from_type {
+            DataType::Boolean => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Utf8

Review Comment:
   This test relies on a cast that we do not yet support and enables `COMET_CAST_ALLOW_INCOMPATIBLE` to allow it. I will revert the last change and add a comment about this.






Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-25 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614788921


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -622,14 +590,89 @@ impl Cast {
                     self.eval_mode,
                     from_type,
                     to_type,
-                )?
+                )
+            }
+            _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+                // use DataFusion cast only when we know that it is compatible with Spark
+                Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
             }
             _ => {
-                // when we have no Spark-specific casting we delegate to DataFusion
-                cast_with_options(&array, to_type, &CAST_OPTIONS)?
+                // we should never reach this code because the Scala code should be checking
+                // for supported cast operations and falling back to Spark for anything that
+                // is not yet supported
+                Err(CometError::Internal(format!(
+                    "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+                )))
             }
         };
-        Ok(spark_cast(cast_result, from_type, to_type))
+        Ok(spark_cast(cast_result?, from_type, to_type))
+    }
+
+    /// Determines if DataFusion supports the given cast in a way that is
+    /// compatible with Spark
+    fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+        if from_type == to_type {
+            return true;
+        }
+        match from_type {
+            DataType::Boolean => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Utf8

Review Comment:
   Removing that case causes a test failure:
   
   ```
   - scalar subquery *** FAILED *** (8 seconds, 253 milliseconds)
     Cause: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 410.0 failed 1 times, most recent failure: Lost task 0.0 in stage 410.0 (TID 1286) (192.168.64.23 executor driver): org.apache.comet.CometNativeException: Execution error: Comet Internal Error: Native cast invoked for unsupported cast from Int32 to Decimal128(38, 10)
   ```






Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-25 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614639502


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -622,14 +590,89 @@ impl Cast {
                     self.eval_mode,
                     from_type,
                     to_type,
-                )?
+                )
+            }
+            _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+                // use DataFusion cast only when we know that it is compatible with Spark
+                Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
             }
             _ => {
-                // when we have no Spark-specific casting we delegate to DataFusion
-                cast_with_options(&array, to_type, &CAST_OPTIONS)?
+                // we should never reach this code because the Scala code should be checking
+                // for supported cast operations and falling back to Spark for anything that
+                // is not yet supported
+                Err(CometError::Internal(format!(
+                    "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+                )))
             }
         };
-        Ok(spark_cast(cast_result, from_type, to_type))
+        Ok(spark_cast(cast_result?, from_type, to_type))
+    }
+
+    /// Determines if DataFusion supports the given cast in a way that is
+    /// compatible with Spark
+    fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+        if from_type == to_type {
+            return true;
+        }
+        match from_type {
+            DataType::Boolean => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Utf8
+            ),
+            DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
+                to_type,
+                DataType::Boolean
+                    | DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Decimal128(_, _)
+                    | DataType::Utf8
+            ),
+            DataType::Float32 | DataType::Float64 => matches!(
+                to_type,
+                DataType::Boolean
+                    | DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+            ),
+            DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Decimal128(_, _)
+                    | DataType::Decimal256(_, _)
+            ),
+            DataType::Utf8 => matches!(to_type, DataType::Binary),
+            DataType::Date32 => matches!(to_type, DataType::Utf8),
+            DataType::Timestamp(_, _) => {
+                matches!(
+                    to_type,
+                    DataType::Int64 | DataType::Date32 | DataType::Utf8 | DataType::Timestamp(_, _)
+                )
+            }
+            DataType::Binary => {
+                // note that this is not completely Spark compatible because
+                // DataFusion only supports binary data containing valid UTF-8 strings
+                matches!(to_type, DataType::Utf8)
+            }
+            _ => false,

Review Comment:
   Casting from `Int64` to `Int32` for `Try` is covered here:
   
   ```rust
   DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
       to_type,
       DataType::Boolean
           | DataType::Int8
           | DataType::Int16
           | DataType::Int32
           | DataType::Int64
           | DataType::Float32
           | DataType::Float64
           | DataType::Decimal128(_, _)
           | DataType::Utf8
   ),
   ```




Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-25 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614638361


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -622,14 +590,89 @@ impl Cast {
                     self.eval_mode,
                     from_type,
                     to_type,
-                )?
+                )
+            }
+            _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+                // use DataFusion cast only when we know that it is compatible with Spark
+                Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
             }
             _ => {
-                // when we have no Spark-specific casting we delegate to DataFusion
-                cast_with_options(&array, to_type, &CAST_OPTIONS)?
+                // we should never reach this code because the Scala code should be checking
+                // for supported cast operations and falling back to Spark for anything that
+                // is not yet supported
+                Err(CometError::Internal(format!(
+                    "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+                )))
             }
         };
-        Ok(spark_cast(cast_result, from_type, to_type))
+        Ok(spark_cast(cast_result?, from_type, to_type))
+    }
+
+    /// Determines if DataFusion supports the given cast in a way that is
+    /// compatible with Spark
+    fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+        if from_type == to_type {
+            return true;
+        }
+        match from_type {
+            DataType::Boolean => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Utf8
+            ),
+            DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
+                to_type,
+                DataType::Boolean
+                    | DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Decimal128(_, _)
+                    | DataType::Utf8
+            ),
+            DataType::Float32 | DataType::Float64 => matches!(
+                to_type,
+                DataType::Boolean
+                    | DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64

Review Comment:
   Yes, that is correct.





Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-24 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1613955491


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -622,14 +590,89 @@ impl Cast {
                     self.eval_mode,
                     from_type,
                     to_type,
-                )?
+                )
+            }
+            _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+                // use DataFusion cast only when we know that it is compatible with Spark
+                Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
             }
             _ => {
-                // when we have no Spark-specific casting we delegate to DataFusion
-                cast_with_options(&array, to_type, &CAST_OPTIONS)?
+                // we should never reach this code because the Scala code should be checking
+                // for supported cast operations and falling back to Spark for anything that
+                // is not yet supported
+                Err(CometError::Internal(format!(
+                    "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+                )))
             }
         };
-        Ok(spark_cast(cast_result, from_type, to_type))
+        Ok(spark_cast(cast_result?, from_type, to_type))
+    }
+
+    /// Determines if DataFusion supports the given cast in a way that is
+    /// compatible with Spark
+    fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+        if from_type == to_type {
+            return true;
+        }
+        match from_type {
+            DataType::Boolean => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Utf8

Review Comment:
   The current code says that DataFusion is compatible with Spark for all int types -> decimal:
   
   ```rust
   DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
       to_type,
       DataType::Boolean
       ...
       | DataType::Decimal128(_, _)
   ```
   
   However, this is actually not correct, since DataFusion does not have overflow checks for int32 and int64 -> decimal and so is not compatible with Spark. I will look at removing those.
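   To illustrate the overflow concern (a sketch under assumptions, not Comet's or DataFusion's actual code): casting an integer to `Decimal128(p, s)` stores `v * 10^s` as an unscaled `i128`, which must still fit within `p` digits. The function name and the null-on-overflow convention below are illustrative stand-ins for Spark's behavior.

   ```rust
   // Why int -> decimal needs an overflow check to match Spark:
   // the rescaled value can exceed the declared precision even though
   // it still fits in the i128 backing store.
   fn int_to_decimal128(v: i64, precision: u32, scale: u32) -> Option<i128> {
       // unscaled representation: v * 10^scale
       let unscaled = (v as i128).checked_mul(10i128.pow(scale))?;
       // largest unscaled magnitude representable with `precision` digits,
       // e.g. precision 5 -> 99999
       let max = 10i128.pow(precision) - 1;
       if unscaled.abs() > max {
           None // Spark would produce null (or an error in ANSI mode)
       } else {
           Some(unscaled)
       }
   }
   ```

   For example, `1234` cast to `Decimal128(5, 2)` rescales to `123400`, which needs six digits and therefore overflows a precision of five.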







Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-24 Thread via GitHub


kazuyukitanimura commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1613863992


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -622,14 +590,89 @@ impl Cast {
                     self.eval_mode,
                     from_type,
                     to_type,
-                )?
+                )
+            }
+            _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+                // use DataFusion cast only when we know that it is compatible with Spark
+                Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
             }
             _ => {
-                // when we have no Spark-specific casting we delegate to DataFusion
-                cast_with_options(&array, to_type, &CAST_OPTIONS)?
+                // we should never reach this code because the Scala code should be checking
+                // for supported cast operations and falling back to Spark for anything that
+                // is not yet supported
+                Err(CometError::Internal(format!(
+                    "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+                )))
             }
         };
-        Ok(spark_cast(cast_result, from_type, to_type))
+        Ok(spark_cast(cast_result?, from_type, to_type))
+    }
+
+    /// Determines if DataFusion supports the given cast in a way that is
+    /// compatible with Spark
+    fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+        if from_type == to_type {
+            return true;
+        }
+        match from_type {
+            DataType::Boolean => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Utf8

Review Comment:
   So right now, there is no `Int8` to `Decimal128` cast supported, it looks like?



##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -622,14 +590,89 @@ impl Cast {
                     self.eval_mode,
                     from_type,
                     to_type,
-                )?
+                )
+            }
+            _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+                // use DataFusion cast only when we know that it is compatible with Spark
+                Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
             }
             _ => {
-                // when we have no Spark-specific casting we delegate to DataFusion
-                cast_with_options(&array, to_type, &CAST_OPTIONS)?
+                // we should never reach this code because the Scala code should be checking
+                // for supported cast operations and falling back to Spark for anything that
+                // is not yet supported
+                Err(CometError::Internal(format!(
+                    "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+                )))
             }
         };
-        Ok(spark_cast(cast_result, from_type, to_type))
+        Ok(spark_cast(cast_result?, from_type, to_type))
+    }
+
+    /// Determines if DataFusion supports the given cast in a way that is
+    /// compatible with Spark
+    fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+        if from_type == to_type {
+            return true;
+        }
+        match from_type {
+            DataType::Boolean => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Utf8
+            ),
+            DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
+                to_type,
+                DataType::Boolean
+                    | DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Decimal128(_, _)
+                    | DataType::Utf8
+            ),
+            DataType::Float32 | DataType::Float64 => matches!(
+                to_type,
+                DataType::Boolean
+                    | DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+            ),
+            DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(
+                to_type,
+ 

Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-23 Thread via GitHub


viirya commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612605807


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -644,14 +644,16 @@ impl Cast {
                 | DataType::Float32
                 | DataType::Float64
             ),
-            DataType::Decimal128(_, _) => matches!(
+            DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(

Review Comment:
   Ah, right, we temporarily upcast decimal 128 to decimal 256 to avoid an overflow issue.
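   A minimal sketch of why the temporary widening helps, with `i64` standing in for `Decimal128` storage and `i128` for `Decimal256` (illustrative types, not Comet's real decimal representation):

   ```rust
   // Multiply two unscaled decimal values at a wider width, then narrow
   // the result back, detecting overflow instead of wrapping around.
   fn multiply_widened(a: i64, b: i64) -> Option<i64> {
       // widening first: the product of two i64 values always fits in i128
       let wide = (a as i128) * (b as i128);
       // narrowing back fails cleanly if the result no longer fits
       i64::try_from(wide).ok()
   }
   ```

   The same shape applies to `Decimal128` operands widened to `Decimal256` before arithmetic and cast back afterwards.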





Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-23 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612492325


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -503,41 +503,37 @@ impl Cast {
     fn cast_array(&self, array: ArrayRef) -> DataFusionResult {
         let to_type = &self.data_type;
         let array = array_with_timezone(array, self.timezone.clone(), Some(to_type));
+        let from_type = array.data_type().clone();
+
+        // unpack dictionary string arrays first
+        // TODO: we are unpacking a dictionary-encoded array and then performing
+        // the cast. We could potentially improve performance here by casting the
+        // dictionary values directly without unpacking the array first, although this
+        // would add more complexity to the code
+        let array = match &from_type {
+            DataType::Dictionary(key_type, value_type)
+                if key_type.as_ref() == &DataType::Int32
+                    && (value_type.as_ref() == &DataType::Utf8
+                        || value_type.as_ref() == &DataType::LargeUtf8) =>
+            {
+                cast_with_options(&array, value_type.as_ref(), &CAST_OPTIONS)?
+            }
+            _ => array,
+        };

Review Comment:
   We were previously unpacking dictionary-encoded string arrays only for string to int and string to date. I just moved it earlier on so that we don't have to handle it specifically for certain casts from string.
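   The unpack step described above can be mocked with plain slices (illustrative only; the real code calls arrow's `cast_with_options` on the dictionary array, and the key and value containers below are stand-ins for Arrow arrays):

   ```rust
   // A dictionary-encoded column stores each row as an index into a small
   // table of distinct values. Unpacking materializes values[key] per row,
   // producing a flat column that later cast logic can treat uniformly.
   fn unpack_dictionary(keys: &[usize], values: &[&str]) -> Vec<String> {
       keys.iter().map(|&k| values[k].to_string()).collect()
   }
   ```

   The TODO in the diff notes the trade-off: casting the small `values` table directly and keeping the keys would do less work, at the cost of more complex code.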






Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-23 Thread via GitHub


andygrove commented on PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#issuecomment-2128254128

   This is ready for review now @viirya @parthchandra @kazuyukitanimura @huaxingao





Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-23 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612478631


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -622,14 +590,91 @@ impl Cast {
                     self.eval_mode,
                     from_type,
                     to_type,
-                )?
+                )
+            }
+            _ if Self::is_datafusion_spark_compatible(from_type, to_type) => {
+                // use DataFusion cast only when we know that it is compatible with Spark
+                Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?)
             }
             _ => {
-                // when we have no Spark-specific casting we delegate to DataFusion
-                cast_with_options(&array, to_type, &CAST_OPTIONS)?
+                // we should never reach this code because the Scala code should be checking
+                // for supported cast operations and falling back to Spark for anything that
+                // is not yet supported
+                Err(CometError::Internal(format!(
+                    "Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}"
+                )))
             }
         };
-        Ok(spark_cast(cast_result, from_type, to_type))
+        Ok(spark_cast(cast_result?, from_type, to_type))
+    }
+
+    /// Determines if DataFusion supports the given cast in a way that is
+    /// compatible with Spark
+    fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool {
+        if from_type == to_type {
+            return true;
+        }
+        match from_type {
+            DataType::Boolean => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Utf8
+            ),
+            DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!(
+                to_type,
+                DataType::Boolean
+                    | DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Decimal128(_, _)
+                    | DataType::Utf8
+            ),
+            DataType::Float32 | DataType::Float64 => matches!(
+                to_type,
+                DataType::Boolean
+                    | DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+            ),
+            DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(
+                to_type,
+                DataType::Int8
+                    | DataType::Int16
+                    | DataType::Int32
+                    | DataType::Int64
+                    | DataType::Float32
+                    | DataType::Float64
+                    | DataType::Decimal128(_, _)
+                    | DataType::Decimal256(_, _)
+            ),
+            DataType::Utf8 => matches!(to_type, DataType::Binary),
+            DataType::Date32 => matches!(to_type, DataType::Utf8),
+            DataType::Timestamp(_, _) => {
+                // TODO need to add tests to see if we really do support all
+                // timestamp to timestamp conversions

Review Comment:
   I filed https://github.com/apache/datafusion-comet/issues/467 for adding timestamp to timestamp tests.






Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-23 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612318862


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -644,14 +644,16 @@ impl Cast {
                 | DataType::Float32
                 | DataType::Float64
             ),
-            DataType::Decimal128(_, _) => matches!(
+            DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(

Review Comment:
   I think this is probably the root cause (in `planner.rs`):
   
   ```rust
   // For some Decimal128 operations, we need wider internal digits.
   // Cast left and right to Decimal256 and cast the result back to Decimal128
   let left = Arc::new(Cast::new_without_timezone(
       left,
       DataType::Decimal256(p1, s1),
       EvalMode::Legacy,
   ));
   let right = Arc::new(Cast::new_without_timezone(
       right,
       DataType::Decimal256(p2, s2),
       EvalMode::Legacy,
   ));
   ```






Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-23 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612318862


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -644,14 +644,16 @@ impl Cast {
 | DataType::Float32
 | DataType::Float64
 ),
-DataType::Decimal128(_, _) => matches!(
+DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => 
matches!(

Review Comment:
   I think this is probably the root cause (in `planner.rs`):
   
   ```rust
   // For some Decimal128 operations, we need wider internal digits.
   // Cast left and right to Decimal256 and cast the result back to Decimal128
   let left = Arc::new(Cast::new_without_timezone(
       left,
       DataType::Decimal256(p1, s1),
       EvalMode::Legacy,
   ));
   let right = Arc::new(Cast::new_without_timezone(
       right,
       DataType::Decimal256(p2, s2),
       EvalMode::Legacy,
   ));
   ```






Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-23 Thread via GitHub


andygrove commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612317278


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -644,14 +644,16 @@ impl Cast {
 | DataType::Float32
 | DataType::Float64
 ),
-DataType::Decimal128(_, _) => matches!(
+DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(

Review Comment:
   If I remove the decimal 256 support here, I get this test failure:
   
   ```
   - Fix NPE in partial decimal sum *** FAILED *** (119 milliseconds)
     org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 190.0 failed 1 times, most recent failure: Lost task 0.0 in stage 190.0 (TID 468) (192.168.86.33 executor driver): org.apache.comet.CometNativeException: Execution error: Comet Internal Error: Native cast invoked for unsupported cast from Decimal128(5, 2) to Decimal256(5, 2)
   ```






Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-23 Thread via GitHub


viirya commented on code in PR #461:
URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1612162690


##
core/src/execution/datafusion/expressions/cast.rs:
##
@@ -644,14 +644,16 @@ impl Cast {
 | DataType::Float32
 | DataType::Float64
 ),
-DataType::Decimal128(_, _) => matches!(
+DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!(

Review Comment:
   Hmm, I think Spark doesn't have a decimal 256 type.






[PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]

2024-05-23 Thread via GitHub


andygrove opened a new pull request, #461:
URL: https://github.com/apache/datafusion-comet/pull/461

   ## Which issue does this PR close?
   
   
   
   N/A
   
   ## Rationale for this change
   
   
   
   We have a catchall block in `cast.rs` that delegates to DataFusion for any cast that we don't have a specific match arm for. This is dangerous because it means we sometimes delegate to DataFusion for casts where DataFusion is not compatible with Spark, which can lead to data corruption and hard-to-debug issues such as https://github.com/apache/datafusion-comet/pull/383#issuecomment-2127417428
   
   ## What changes are included in this PR?
   
   
   
   This PR introduces specific checks so that we only delegate to DataFusion for specific casts and changes the catchall to return an error.
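   The guard-plus-failing-catchall pattern can be sketched in miniature (hypothetical, simplified types -- `DataType`, `is_spark_compatible`, and `cast` here are stand-ins, not the actual Comet code):
   
   ```rust
   // Simplified stand-in for Arrow's DataType.
   #[derive(Debug)]
   enum DataType {
       Int32,
       Int64,
       Utf8,
       Decimal128,
   }
   
   /// Allow-list: only (from, to) pairs verified to match Spark semantics.
   fn is_spark_compatible(from: &DataType, to: &DataType) -> bool {
       matches!(
           (from, to),
           (DataType::Int32, DataType::Int64) | (DataType::Int64, DataType::Utf8)
       )
   }
   
   /// Delegate only when the pair is known-compatible; the catchall now
   /// fails loudly instead of silently producing Spark-incompatible results.
   fn cast(from: DataType, to: DataType) -> Result<String, String> {
       if is_spark_compatible(&from, &to) {
           Ok(format!("delegated cast {:?} -> {:?}", from, to))
       } else {
           Err(format!(
               "Native cast invoked for unsupported cast from {:?} to {:?}",
               from, to
           ))
       }
   }
   
   fn main() {
       assert!(cast(DataType::Int32, DataType::Int64).is_ok());
       assert!(cast(DataType::Decimal128, DataType::Utf8).is_err());
       println!("ok");
   }
   ```
   
   An unlisted pair now surfaces as an explicit error (triggering the Scala-side fallback to Spark) rather than being delegated blindly.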
   
   
   I also improved handling of dictionary-encoded string arrays so that these 
are unpacked early on and this would have prevented the issue in 
https://github.com/apache/datafusion-comet/pull/383#issuecomment-2127417428
   
   ## How are these changes tested?
   
   
   
   Existing tests

