Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586956681 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -293,6 +314,32 @@ fn maybe_data_types( Some(new_type) } +/// Check if the current argument types can be coerced to match the given `valid_types`. +/// Unlike `maybe_data_types`, this function does not coerce the types. Review Comment: I think the role of `can_cast_types` is more like a validator that ensures the given `valid_types` are castable by arrow-cast. Maybe we don't even need a `maybe_data_types_without_coercion` or `maybe_data_types`-like function in the end 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
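A toy illustration of the validator role described here (hypothetical stand-in types, not arrow-rs's real `can_cast_types`): the check only answers whether a cast is possible, and performs no coercion itself.

```rust
// Minimal sketch with hypothetical types: a `can_cast_types`-style validator
// that answers "is this cast possible?" without coercing anything.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DataType {
    Null,
    Int32,
    Int64,
    Utf8,
}

/// Returns true if `from` can be cast to `to`; performs no coercion.
fn can_cast_types(from: DataType, to: DataType) -> bool {
    use DataType::*;
    match (from, to) {
        (a, b) if a == b => true,
        (Null, _) => true,                      // Null casts to anything
        (Int32, Int64) => true,                 // widening numeric cast
        (Int32, Utf8) | (Int64, Utf8) => true,  // numbers format as strings
        _ => false,
    }
}

/// Validate that every current argument type is castable to the chosen valid type.
fn validate(current: &[DataType], target: DataType) -> bool {
    current.iter().all(|&t| can_cast_types(t, target))
}

fn main() {
    assert!(validate(&[DataType::Int32, DataType::Null], DataType::Int64));
    assert!(!validate(&[DataType::Utf8], DataType::Int64));
}
```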
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1587073299 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -53,20 +54,31 @@ pub fn data_types( } } -let valid_types = get_valid_types(_signature, current_types)?; - +let mut valid_types = get_valid_types(_signature, current_types)?; if valid_types .iter() .any(|data_type| data_type == current_types) { return Ok(current_types.to_vec()); } -// Try and coerce the argument types to match the signature, returning the -// coerced types from the first matching signature. -for valid_types in valid_types { -if let Some(types) = maybe_data_types(_types, current_types) { -return Ok(types); +// Well-supported signature that returns exact valid types. +if !valid_types.is_empty() +&& matches!(signature.type_signature, TypeSignature::VariadicEqualOrNull) +{ +// exact valid types +assert_eq!(valid_types.len(), 1); +let valid_types = valid_types.swap_remove(0); Review Comment: I assume there is only one valid types vec, so we always get one. `swap_remove` is just a fancy way to take the first element out of `valid_types: Vec<Vec<DataType>>`.
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1587073726 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -53,20 +54,31 @@ pub fn data_types( } } -let valid_types = get_valid_types(_signature, current_types)?; - +let mut valid_types = get_valid_types(_signature, current_types)?; if valid_types .iter() .any(|data_type| data_type == current_types) { return Ok(current_types.to_vec()); } -// Try and coerce the argument types to match the signature, returning the -// coerced types from the first matching signature. -for valid_types in valid_types { -if let Some(types) = maybe_data_types(_types, current_types) { -return Ok(types); +// Well-supported signature that returns exact valid types. +if !valid_types.is_empty() +&& matches!(signature.type_signature, TypeSignature::VariadicEqualOrNull) +{ +// exact valid types +assert_eq!(valid_types.len(), 1); +let valid_types = valid_types.swap_remove(0); Review Comment: equivalent to `valid_types[0].clone()` but without the clone
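A small sketch of the `swap_remove(0)` trick being described: it takes ownership of the first element in O(1) by swapping the last element into its slot, so no clone is needed.

```rust
// Demonstrates Vec::swap_remove(0): move the first element out without cloning.
fn main() {
    let mut valid_types: Vec<Vec<i32>> = vec![vec![1, 2], vec![3], vec![4, 5]];

    // Take ownership of the first candidate list in O(1), no clone.
    let first = valid_types.swap_remove(0);
    assert_eq!(first, vec![1, 2]);

    // The last element was swapped into index 0; order is not preserved.
    assert_eq!(valid_types, vec![vec![4, 5], vec![3]]);
}
```

This is fine here because only one candidate list exists (`assert_eq!(valid_types.len(), 1)` in the diff), so the order disturbance `swap_remove` causes is irrelevant.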
[PR] feat: Switch to use Rust stable by default [datafusion-comet]
sunchao opened a new pull request, #373: URL: https://github.com/apache/datafusion-comet/pull/373 ## Which issue does this PR close? Closes #. ## Rationale for this change There is little reason for Comet to use nightly at the moment. The only feature we needed from nightly is `specialization`, which was used in a few places that no longer require it. This PR removes a few stale structs, `vector` and `mutable_vector`, which were added during an experiment in implementing Comet's own vector to be used by execution. However, as we've fully moved to using DataFusion, these are no longer needed. With these structs gone, we can now use `stable` as the default Rust toolchain for the project. ## What changes are included in this PR? ## How are these changes tested?
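Switching the default toolchain typically means updating the repository's `rust-toolchain.toml`; a minimal sketch of what that could look like (illustrative only, and the `components` list is an assumption, not taken from this PR):

```toml
# Hypothetical rust-toolchain.toml pinning the project to stable
[toolchain]
channel = "stable"
components = ["clippy", "rustfmt"]
```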
Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]
sunchao commented on issue #329: URL: https://github.com/apache/datafusion-comet/issues/329#issuecomment-2089524991 The original purpose of the `comet-common` module is to make it engine-agnostic so it can be used for other use cases like Iceberg. Unfortunately, we didn't have time to make it completely isolated, so it is still tightly coupled with Spark in several ways, like the Parquet -> catalyst schema conversion, ColumnVector, and later on a bunch of shuffle-related stuff, all of which is closely tied to Spark. If necessary, we can perhaps consider splitting the module further into `comet-parquet`, `comet-spark-shuffle`, etc. For the Parquet part, we may need to define something like `CometDataType` which gets converted from the Parquet schema, and from which we can derive the Spark catalyst data type or the Iceberg data type.
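The `CometDataType` idea could look roughly like the sketch below. Everything here is hypothetical: the name comes from the comment, the conversions are illustrative string mappings, and the real Comet common module lives on the JVM side.

```rust
// Hypothetical sketch of an engine-agnostic type derived from the Parquet
// schema, from which engine-specific types can be derived.
#[derive(Debug, PartialEq)]
enum CometDataType {
    Boolean,
    Int32,
    Int64,
    Utf8,
}

// Derive a Spark catalyst type name from the engine-agnostic type.
fn to_spark_type(t: &CometDataType) -> &'static str {
    match t {
        CometDataType::Boolean => "BooleanType",
        CometDataType::Int32 => "IntegerType",
        CometDataType::Int64 => "LongType",
        CometDataType::Utf8 => "StringType",
    }
}

// Derive an Iceberg primitive type name from the same source.
fn to_iceberg_type(t: &CometDataType) -> &'static str {
    match t {
        CometDataType::Boolean => "boolean",
        CometDataType::Int32 => "int",
        CometDataType::Int64 => "long",
        CometDataType::Utf8 => "string",
    }
}

fn main() {
    let t = CometDataType::Int64;
    assert_eq!(to_spark_type(&t), "LongType");
    assert_eq!(to_iceberg_type(&t), "long");
}
```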
Re: [PR] feat(7181): cascading loser tree merges [datafusion]
wiedld commented on PR #7379: URL: https://github.com/apache/datafusion/pull/7379#issuecomment-2089478253 Working on other things. If/when we circle back, we'll be recreating differently.
Re: [PR] feat(7181): cascading loser tree merges [datafusion]
wiedld closed pull request #7379: feat(7181): cascading loser tree merges URL: https://github.com/apache/datafusion/pull/7379
Re: [PR] Remove ScalarFunctionDefinition [datafusion]
jayzhan211 commented on code in PR #10325: URL: https://github.com/apache/datafusion/pull/10325#discussion_r1587001540 ## datafusion/expr/src/expr_schema.rs: ## @@ -138,25 +138,21 @@ impl ExprSchemable for Expr { .iter() .map(|e| e.get_type(schema)) .collect::<Result<Vec<_>>>()?; -match func_def { -ScalarFunctionDefinition::UDF(fun) => { // verify that function is invoked with correct number and type of arguments as defined in `TypeSignature` -data_types(_data_types, fun.signature()).map_err(|_| { +data_types(_data_types, func_def.signature()).map_err(|_| { Review Comment: I think it is fine to name it `func` or `fun`
Re: [PR] Minor: add a few more dictionary unwrap tests [datafusion]
jayzhan211 merged PR #10335: URL: https://github.com/apache/datafusion/pull/10335
Re: [PR] Minor: reuse the group key row buffer to avoid reallocation [datafusion]
github-actions[bot] commented on PR #7426: URL: https://github.com/apache/datafusion/pull/7426#issuecomment-2089392509 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Re: [PR] feat(7181): cascading loser tree merges [datafusion]
github-actions[bot] commented on PR #7379: URL: https://github.com/apache/datafusion/pull/7379#issuecomment-2089392557 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Re: [PR] Use make_array to handle SQLExpr::Array. [datafusion]
github-actions[bot] commented on PR #7427: URL: https://github.com/apache/datafusion/pull/7427#issuecomment-2089392469 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Re: [I] Prepare Spark 4.0 shims [datafusion-comet]
kazuyukitanimura commented on issue #372: URL: https://github.com/apache/datafusion-comet/issues/372#issuecomment-2089389739 I am on it
Re: [PR] Add `SessionContext`/`SessionState::create_physical_expr()` to create `PhysicalExpressions` from `Expr`s [datafusion]
phillipleblanc commented on code in PR #10330: URL: https://github.com/apache/datafusion/pull/10330#discussion_r1586971614 ## datafusion/common/src/dfschema.rs: ## @@ -806,6 +820,12 @@ impl From<DFSchema> for Schema { } } +impl AsRef<Schema> for DFSchema { Review Comment: Very nice!
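A minimal sketch of why an `AsRef` impl like this is handy (toy stand-in types, not the real arrow/DataFusion definitions): one generic function can then accept either schema type through a single `impl AsRef<Schema>` parameter.

```rust
// Toy stand-ins illustrating the AsRef pattern from the review.
struct Schema {
    fields: Vec<String>,
}

struct DFSchema {
    inner: Schema,
}

// A DFSchema can be viewed as a plain Schema...
impl AsRef<Schema> for DFSchema {
    fn as_ref(&self) -> &Schema {
        &self.inner
    }
}

// ...and a Schema trivially views as itself.
impl AsRef<Schema> for Schema {
    fn as_ref(&self) -> &Schema {
        self
    }
}

// One signature now serves both schema types (and references to them,
// via the std blanket impl of AsRef for &T).
fn field_count(schema: impl AsRef<Schema>) -> usize {
    schema.as_ref().fields.len()
}

fn main() {
    let s = Schema { fields: vec!["a".to_string(), "b".to_string()] };
    let df = DFSchema { inner: Schema { fields: vec!["c".to_string()] } };
    assert_eq!(field_count(&s), 2);
    assert_eq!(field_count(&df), 1);
}
```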
[PR] Introducing repartitioning outside [datafusion]
edmondop opened a new pull request, #10338: URL: https://github.com/apache/datafusion/pull/10338 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes?
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586960617 ## datafusion/expr/src/signature.rs: ## @@ -92,14 +92,22 @@ pub enum TypeSignature { /// A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])` Variadic(Vec<DataType>), /// One or more arguments of an arbitrary but equal type. -/// DataFusion attempts to coerce all argument types to match the first argument's type +/// DataFusion attempts to coerce all argument types to match the common type with comparison coercion. /// /// # Examples /// Given types in signature should be coercible to the same final type. /// A function such as `make_array` is `VariadicEqual`. /// /// `make_array(i32, i64) -> make_array(i64, i64)` VariadicEqual, +/// One or more arguments of an arbitrary but equal type or Null. +/// Non-comparison coercion is attempted to match the signatures. +/// +/// Functions like `coalesce` is `VariadicEqual`. +// TODO: Temporary Signature, to differentiate existing VariadicEqual. +// After we switch `make_array` to VariadicEqualOrNull, +// we can reuse VariadicEqual. +VariadicEqualOrNull, Review Comment: Nice idea!
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586955192 ## datafusion/expr/src/signature.rs: ## @@ -92,14 +92,22 @@ pub enum TypeSignature { /// A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])` Variadic(Vec<DataType>), /// One or more arguments of an arbitrary but equal type. -/// DataFusion attempts to coerce all argument types to match the first argument's type +/// DataFusion attempts to coerce all argument types to match the common type with comparison coercion. /// /// # Examples /// Given types in signature should be coercible to the same final type. /// A function such as `make_array` is `VariadicEqual`. /// /// `make_array(i32, i64) -> make_array(i64, i64)` VariadicEqual, +/// One or more arguments of an arbitrary but equal type or Null. +/// Non-comparison coercion is attempted to match the signatures. +/// +/// Functions like `coalesce` is `VariadicEqual`. Review Comment: Yes. `Non-comparison coercion` is `type_union_resolution`. Actually, I think `type_union_resolution` is what we really need for coercion. The current `comparison_coercion` may not exist in the long term, because I think it contains many `workaround` solutions, and those could be pulled out to their **correct places**. For example, the List coercion in `string_coercion`, or the Dict coercion that compares values only, could be handled in `unwrap cast in comparison`. Ideally, we could shape `comparison_coercion` to be more like `type_union_resolution`. But adjusting the logic inside it is error-prone if tests don't cover it. At least I can tell that the dict-coercion differs between these two now.
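The `VariadicEqualOrNull` behavior under discussion (nulls coerce to whatever common type the non-null arguments resolve to) can be sketched with toy types. This is a hypothetical illustration of the idea, not DataFusion's actual `type_union_resolution` logic:

```rust
// Toy sketch: resolve one common type for all arguments, letting Null
// coerce to whatever the other arguments agree on.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DataType {
    Null,
    Int32,
    Int64,
}

// Pairwise union of two types: Null yields the other side, and the two
// integer widths widen to Int64.
fn union_pair(a: DataType, b: DataType) -> Option<DataType> {
    use DataType::*;
    match (a, b) {
        (x, y) if x == y => Some(x),
        (Null, t) | (t, Null) => Some(t),
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        _ => None, // incompatible types
    }
}

// Fold the pairwise rule across all arguments, as a coalesce-style
// signature would.
fn resolve_common_type(args: &[DataType]) -> Option<DataType> {
    let mut iter = args.iter().copied();
    let first = iter.next()?;
    iter.try_fold(first, union_pair)
}

fn main() {
    use DataType::*;
    assert_eq!(resolve_common_type(&[Null, Int32, Int64]), Some(Int64));
    assert_eq!(resolve_common_type(&[Null, Null]), Some(Null));
}
```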
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586956681 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -293,6 +314,32 @@ fn maybe_data_types( Some(new_type) } +/// Check if the current argument types can be coerced to match the given `valid_types` +/// unlike `maybe_data_types`, this function does not coerce the types. Review Comment: I think the role of `can_cast_types` is more like validator to ensure the valid_types given are castable for arrow-cast. Maybe we don't even need this 樂 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586955192 ## datafusion/expr/src/signature.rs: ## @@ -92,14 +92,22 @@ pub enum TypeSignature { /// A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])` Variadic(Vec), /// One or more arguments of an arbitrary but equal type. -/// DataFusion attempts to coerce all argument types to match the first argument's type +/// DataFusion attempts to coerce all argument types to match to the common type with comparision coercion. /// /// # Examples /// Given types in signature should be coercible to the same final type. /// A function such as `make_array` is `VariadicEqual`. /// /// `make_array(i32, i64) -> make_array(i64, i64)` VariadicEqual, +/// One or more arguments of an arbitrary but equal type or Null. +/// Non-comparison coercion is attempted to match the signatures. +/// +/// Functions like `coalesce` is `VariadicEqual`. Review Comment: yes. `Non-comparison coercion` is `type_union_resolution` Actually, I think `type_union_resolution` is what we really need for coercion. Current `comparison_coercion` may not exist in the long term because I think there are many `workaround` solutions included in it, and those are possible to pull out to their **correct places**. Ideally, we should shape `comparison_coercion` to more like `type_union_resolution`. But, adjusting logic inside it is vulnerable if the test is not covered. At least I can tell the dict-coercion is different between these two now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
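The distinction drawn above between `comparison_coercion` and `type_union_resolution` can be illustrated with a toy model. This is a simplified sketch with an invented mini type lattice, not DataFusion's actual implementation: the idea is to fold a pairwise "common type" resolution over all arguments of a variadic function such as `coalesce`, with `Null` unifying with anything and no operator-specific workarounds.

```rust
// Toy model of union-style type resolution for a variadic function.
// The type set and widening rules below are assumptions for illustration.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DataType {
    Null,
    Int32,
    Int64,
    Float64,
    Utf8,
}

// Resolve a pair of types to a common type, if one exists.
// NULL unifies with anything; integers widen; ints widen to floats.
fn union_pair(a: DataType, b: DataType) -> Option<DataType> {
    use DataType::*;
    match (a, b) {
        (Null, t) | (t, Null) => Some(t),
        (x, y) if x == y => Some(x),
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        (Int32, Float64) | (Float64, Int32) => Some(Float64),
        (Int64, Float64) | (Float64, Int64) => Some(Float64),
        _ => None, // e.g. Utf8 vs Int32: no implicit common type
    }
}

// Fold the pairwise resolution over all argument types.
fn type_union_resolution(args: &[DataType]) -> Option<DataType> {
    let mut acc = *args.first()?;
    for &t in &args[1..] {
        acc = union_pair(acc, t)?;
    }
    Some(acc)
}
```

The key property is that the result is a single common type for the whole argument list (or a failure), which is the shape of coercion `coalesce` needs.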
Re: [PR] Minor: Add additional coalesce tests [datafusion]
jayzhan211 merged PR #10334: URL: https://github.com/apache/datafusion/pull/10334
Re: [I] Move `create_physical_expr` to `physical-expr-common` [datafusion]
jayzhan211 commented on issue #10074: URL: https://github.com/apache/datafusion/issues/10074#issuecomment-2089325105 @alamb First of all, we plan to pull the aggregate function out from `datafusion-physical-expr`. And, we also split `datafusion-physical-common` from `datafusion-physical`, **so the user can import only the minimum things they need**, but this is not enough. We want to handle `physical-expr` in UDFImpl, which is not possible given the current crate graph, because `UDFImpl` should be in `datafusion-expr` and we can't import `datafusion-physical-expr` into `datafusion-expr`. But, what if we allow `datafusion-expr` to import `physical-expr-common`? 🤔 We now differentiate **common** as the higher-level crate. `datafusion-common` is the highest. Introduce `expr-common` as the second, `datafusion-physical-common` as the third (importing common and expr-common), and `aggregate-common` as the fourth, which imports all the other **common** crates. Allow `datafusion-expr` to import `datafusion-physical-common`, so we can handle the `physical-expr` concept in UDFImpl. As long as we don't introduce things that depend on `datafusion-expr` into `physical-expr-common`, I think we don't need to strictly avoid porting physical-expr concepts into `datafusion-expr`. > Can you remind me why datafusion-functions-aggregate (the implementation of aggregate functions) shouldn't depend on datafusion-physical-expr and datafusion-expr `datafusion-functions-aggregate` can import `datafusion-expr` and `datafusion-physical-expr-common` but not `datafusion-physical-expr`, to keep the minimum dependencies for the user. > I thought the thing we are trying to do is avoid datafusion-expr depending on datafusion-physical-expr This idea is what's blocking us! Now, I would allow the `datafusion-physical-expr-common` dependency for `datafusion-expr`
Re: [PR] Fix: Sort Merge Join crashes on TPCH Q21 [datafusion]
comphead commented on PR #10304: URL: https://github.com/apache/datafusion/pull/10304#issuecomment-2089299005 Q21 contains both LeftSemi and LeftAnti joins. Hopefully LeftSemi works now; solving LeftAnti is next.
Re: [PR] Fix `coalesce`, `struct` and `named_strct` expr_fn function to take multiple arguments [datafusion]
jayzhan211 commented on code in PR #10321: URL: https://github.com/apache/datafusion/pull/10321#discussion_r1586925126 ## datafusion/functions/src/core/mod.rs: ## @@ -39,14 +42,68 @@ make_udf_function!(getfield::GetFieldFunc, GET_FIELD, get_field); make_udf_function!(coalesce::CoalesceFunc, COALESCE, coalesce); // Export the functions out of this package, both as expr_fn as well as a list of functions -export_functions!( -(nullif, arg_1 arg_2, "returns NULL if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the COALESCE expression."), -(arrow_cast, arg_1 arg_2, "returns arg_1 cast to the `arrow_type` given the second argument. This can be used to cast to a specific `arrow_type`."), -(nvl, arg_1 arg_2, "returns value2 if value1 is NULL; otherwise it returns value1"), -(nvl2, arg_1 arg_2 arg_3, "Returns value2 if value1 is not NULL; otherwise, it returns value3."), -(arrow_typeof, arg_1, "Returns the Arrow type of the input expression."), -(r#struct, args, "Returns a struct with the given arguments"), -(named_struct, args, "Returns a struct with the given names and arguments pairs"), -(get_field, arg_1 arg_2, "Returns the value of the field with the given name from the struct"), -(coalesce, args, "Returns `coalesce(args...)`, which evaluates to the value of the first expr which is not NULL") -); +pub mod expr_fn { Review Comment: https://github.com/apache/datafusion/blob/e3487eee1365a37b8651bf5a393ae8d2cc16e653/datafusion/functions-array/src/macros.rs#L47-L109 In array macro, we support two syntaxes, which is surprisingly easy -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] Fix `coalesce`, `struct` and `named_strct` expr_fn function to take multiple arguments [datafusion]
Omega359 commented on code in PR #10321: URL: https://github.com/apache/datafusion/pull/10321#discussion_r1586921438 ## datafusion/functions/src/core/mod.rs: ## @@ -39,14 +42,68 @@ make_udf_function!(getfield::GetFieldFunc, GET_FIELD, get_field); make_udf_function!(coalesce::CoalesceFunc, COALESCE, coalesce); // Export the functions out of this package, both as expr_fn as well as a list of functions -export_functions!( -(nullif, arg_1 arg_2, "returns NULL if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the COALESCE expression."), -(arrow_cast, arg_1 arg_2, "returns arg_1 cast to the `arrow_type` given the second argument. This can be used to cast to a specific `arrow_type`."), -(nvl, arg_1 arg_2, "returns value2 if value1 is NULL; otherwise it returns value1"), -(nvl2, arg_1 arg_2 arg_3, "Returns value2 if value1 is not NULL; otherwise, it returns value3."), -(arrow_typeof, arg_1, "Returns the Arrow type of the input expression."), -(r#struct, args, "Returns a struct with the given arguments"), -(named_struct, args, "Returns a struct with the given names and arguments pairs"), -(get_field, arg_1 arg_2, "Returns the value of the field with the given name from the struct"), -(coalesce, args, "Returns `coalesce(args...)`, which evaluates to the value of the first expr which is not NULL") -); +pub mod expr_fn { Review Comment: I spent a whole day trying to make it work as well and got lost in the depths of macro programming. I'm not sure it's worth the effort of trying to get it to work tbh - the alternative is not exactly hard work. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[I] Avoid inlining non deterministic CTE [datafusion]
tgujar opened a new issue, #10337: URL: https://github.com/apache/datafusion/issues/10337 ### Describe the bug Currently DataFusion will inline all CTEs, so a non-deterministic expression can be executed multiple times, producing different results ### To Reproduce Consider the following query which uses the `aggregate_test_100` data from datafusion-examples. Here, column c11 is a Float64 ``` WITH cte as ( SELECT sum(c4 * c11) as total FROM aggregate_test_100 GROUP BY c1) SELECT total FROM cte WHERE total = (select max(total) from cte) ``` The optimized plan generated will inline the CTE and thus execute it twice ``` Projection: cte.total Inner Join: cte.total = __scalar_sq_1.MAX(cte.total) SubqueryAlias: cte Projection: SUM(aggregate_test_100.c4 * aggregate_test_100.c11) AS total Aggregate: groupBy=[[aggregate_test_100.c1]], aggr=[[SUM(CAST(aggregate_test_100.c4 AS Float64) * aggregate_test_100.c11)]] TableScan: aggregate_test_100 projection=[c1, c4, c11] SubqueryAlias: __scalar_sq_1 Aggregate: groupBy=[[]], aggr=[[MAX(cte.total)]] SubqueryAlias: cte Projection: SUM(aggregate_test_100.c4 * aggregate_test_100.c11) AS total Aggregate: groupBy=[[aggregate_test_100.c1]], aggr=[[SUM(CAST(aggregate_test_100.c4 AS Float64) * aggregate_test_100.c11)]] TableScan: aggregate_test_100 projection=[c1, c4, c11] ``` ### Expected behavior Since floating-point summation here is dependent on ordering, I believe it is incorrect to inline the CTE here and execute it more than once. ### Additional context Related issue, which discusses possible advantages of not inlining CTEs in some cases: https://github.com/apache/datafusion/issues/8777
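The ordering concern behind this issue is that IEEE 754 `Float64` addition is not associative, so evaluating the inlined CTE twice with different row orders can legitimately produce two different `total` values. A minimal, DataFusion-independent illustration:

```rust
// Left-to-right summation of f64 values. The result depends on the order
// of the inputs because floating-point addition is not associative:
// summing the same multiset of values in a different order can round
// differently at intermediate steps.
fn sum_ltr(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, &v| acc + v)
}
```

For example, summing `[0.1, 0.2, 0.3]` left-to-right gives a different `f64` than summing `[0.3, 0.2, 0.1]`, even though the inputs are the same values.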
Re: [PR] feat: Implement Spark-compatible CAST between integer types [datafusion-comet]
andygrove commented on code in PR #340: URL: https://github.com/apache/datafusion-comet/pull/340#discussion_r1586886509 ## spark/src/test/scala/org/apache/comet/CometCastSuite.scala: ## @@ -807,11 +833,22 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { } else { // Spark 3.2 and 3.3 have a different error message format so we can't do a direct // comparison between Spark and Comet. + // In the case of CAST_INVALID_INPUT // Spark message is in format `invalid input syntax for type TYPE: VALUE` // Comet message is in format `The value 'VALUE' of the type FROM_TYPE cannot be cast to TO_TYPE` // We just check that the comet message contains the same invalid value as the Spark message - val sparkInvalidValue = sparkMessage.substring(sparkMessage.indexOf(':') + 2) - assert(cometMessage.contains(sparkInvalidValue)) + // In the case of CAST_OVERFLOW + // Spark message is in format `Casting VALUE to TO_TYPE causes overflow` + // Comet message is in format `The value 'VALUE' of the type FROM_TYPE cannot be cast to TO_TYPE + // due to an overflow` + // We check if the comet message contains 'overflow'. + val sparkInvalidValue = if (sparkMessage.indexOf(':') == -1) { +EMPTY_STRING + } else { +sparkMessage.substring(sparkMessage.indexOf(':') + 2) + } + assert( +cometMessage.contains(sparkInvalidValue) || cometMessage.contains("overflow")) Review Comment: If `sparkInvalidValue` is `EMPTY_STRING`, won't `cometMessage.contains(sparkInvalidValue)` always be true? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
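andygrove's observation generalizes beyond Java/Scala `String.contains`: an empty needle matches every haystack, so the assertion is vacuously true whenever `sparkInvalidValue` is empty. A quick demonstration (written in Rust for illustration; `java.lang.String.contains` behaves the same way):

```rust
// `contains("")` is true for every string, including the empty string,
// so a check like `cometMessage.contains(sparkInvalidValue)` passes
// unconditionally when sparkInvalidValue is the empty string.
fn vacuous_match(haystack: &str, needle: &str) -> bool {
    haystack.contains(needle)
}
```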
Re: [PR] Stop copying LogicalPlan and Exprs in EliminateNestedUnion [datafusion]
emgeee commented on PR #10319: URL: https://github.com/apache/datafusion/pull/10319#issuecomment-2089232074 I went ahead and fixed the formatting so all tests pass, and managed to remove one clone() call from within the `coerce_plan_expr_for_schema()` call stack. If we want to optimize `coerce_plan_expr_for_schema` further, I'd definitely agree a separate PR makes sense, as it is called from a number of other locations as well
Re: [PR] feat: Implement Spark-compatible CAST from string to timestamp types [datafusion-comet]
andygrove commented on PR #335: URL: https://github.com/apache/datafusion-comet/pull/335#issuecomment-2089227013 This is looking great @vaibhawvipul. I think this is close to being ready to merge, and then we can have some follow-on issues for the remaining items. The one thing I would like to see before merging is that we can run the fuzz test without causing any panics, so just replacing unwraps with error handling.
Re: [PR] Add `SessionContext`/`SessionState::create_physical_expr()` to create `PhysicalExpressions` from `Expr`s [datafusion]
westonpace commented on code in PR #10330: URL: https://github.com/apache/datafusion/pull/10330#discussion_r1586867994 ## datafusion/core/src/execution/context/mod.rs: ## @@ -510,6 +515,34 @@ impl SessionContext { } } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. Review Comment: ```suggestion /// coercion and function rewrites. ``` ## datafusion/core/src/execution/context/mod.rs: ## @@ -510,6 +515,34 @@ impl SessionContext { } } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. +/// +/// # Example +/// ``` +/// # use std::sync::Arc; +/// # use arrow::datatypes::{DataType, Field, Schema}; +/// # use datafusion::prelude::*; +/// # use datafusion_common::DFSchema; +/// // a = 1 (i64) +/// let expr = col("a").eq(lit(1i64)); +/// // provide type information that `a` is an Int32 +/// let schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]); +/// let df_schema = DFSchema::try_from(schema).unwrap(); +/// // Create a PhysicalExpr. Note DataFusion automatically coerces (casts) `1i64` to `1i32` Review Comment: Nice, we have some (probably duplicated) type coercion logic on our side we might be able to replace. ## datafusion/core/src/execution/context/mod.rs: ## @@ -1947,6 +1982,36 @@ impl SessionState { .await } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. +/// +/// Note: The expression is not [simplified] Review Comment: Why not? ## datafusion/core/src/execution/context/mod.rs: ## @@ -1947,6 +1982,36 @@ impl SessionState { .await } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. 
+/// +/// Note: The expression is not [simplified] +/// +/// # See Also: +/// * [`SessionContext::create_physical_expr`] for a higher-level API Review Comment: This is somewhat unrelated to this PR, but it's news to me that "SessionContext" is higher level than "SessionState". The only comparison between the two that I can find is https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#sessioncontext-sessionstate-and-taskcontext which says: > A [SessionContext](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html) can be created from a [SessionConfig](https://docs.rs/datafusion/latest/datafusion/prelude/struct.SessionConfig.html) and stores the state for a particular query session. A single [SessionContext](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html) can run multiple queries. > [SessionState](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionState.html) contains information available during query planning (creating [LogicalPlan](https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html)s and [ExecutionPlan](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html)s). From this it is not obvious that a "SessionContext" contains a "SessionState". It's also very confusing when I would use "SessionContext" and when I would use "SessionState" in my user application. Currently, I've been going off the philosophy of "if an API method needs a SessionState I'd better create one and if an API method needs a SessionContext I'd better create one." Some kind of long-form design documentation on these two types would be interesting (or if something already exists that I've missed or is in a blog post somewhere that would be cool too).
## datafusion/core/src/execution/context/mod.rs: ## @@ -2024,6 +2089,35 @@ impl SessionState { } } +struct SessionSimpifyProvider<'a> { +state: &'a SessionState, +df_schema: &'a DFSchema, +} + +impl<'a> SessionSimpifyProvider<'a> { Review Comment: ```suggestion impl<'a> SessionSimplifyProvider<'a> { ``` ## datafusion/common/src/dfschema.rs: ## @@ -806,6 +820,12 @@ impl From<> for Schema { } } +impl AsRef for DFSchema { Review Comment: :exploding_head: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
suremarc commented on code in PR #9593: URL: https://github.com/apache/datafusion/pull/9593#discussion_r1586860589 ## datafusion/core/src/datasource/physical_plan/mod.rs: ## @@ -473,15 +468,281 @@ fn get_projected_output_ordering( // since rest of the orderings are violated break; } + // do not push empty entries // otherwise we may have `Some(vec![])` at the output ordering. -if !new_ordering.is_empty() { -all_orderings.push(new_ordering); +if new_ordering.is_empty() { +continue; +} + +// Check if any file groups are not sorted +if base_config.file_groups.iter().any(|group| { +if group.len() <= 1 { +// File groups with <= 1 files are always sorted +return false; +} + +let statistics = match MinMaxStatistics::new_from_files( +&new_ordering, +projected_schema, +base_config.projection.as_deref(), +group, +) { +Ok(statistics) => statistics, +Err(e) => { +log::trace!("Error fetching statistics for file group: {e}"); +// we can't prove that it's ordered, so we have to reject it +return true; +} +}; + +!statistics.is_sorted() +}) { +debug!( +"Skipping specified output ordering {:?}. \ +Some file groups couldn't be determined to be sorted: {:?}", +base_config.output_ordering[0], base_config.file_groups +); +continue; } + +all_orderings.push(new_ordering); } all_orderings } +/// A normalized representation of file min/max statistics that allows for efficient sorting & comparison. +/// The min/max values are ordered by [`Self::sort_order`]. +/// Furthermore, any columns that are reversed in the sort order have their min/max values swapped.
+pub(crate) struct MinMaxStatistics { +min_by_sort_order: arrow::row::Rows, +max_by_sort_order: arrow::row::Rows, +sort_order: Vec<PhysicalSortExpr>, +} + +impl MinMaxStatistics { +#[allow(unused)] +fn sort_order(&self) -> &[PhysicalSortExpr] { +&self.sort_order +} + +fn new_from_files<'a>( +projected_sort_order: &[PhysicalSortExpr], // Sort order with respect to projected schema +projected_schema: &Schema, // Projected schema +projection: Option<&[usize]>, // Indices of projection in full table schema (None = all columns) Review Comment: Oh... I didn't even think about option 1. But I was assuming that the layout of the file statistics should match the table schema and not the individual file's schema. It seems that that's what DataFusion does currently.
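The sortedness test being discussed can be reduced, for a single ascending sort column, to checking that per-file (min, max) ranges do not overlap when the files are taken in group order. This is a simplified sketch only; the real `MinMaxStatistics` normalizes multi-column orderings via `arrow::row` and swaps min/max for descending columns, all of which is omitted here:

```rust
// Per-file statistics for one ascending sort column (toy representation).
struct FileMinMax {
    min: i64,
    max: i64,
}

// A file group is provably sorted if, in group order, each file's minimum
// is at least the previous file's maximum. Groups with 0 or 1 files are
// trivially sorted (windows(2) yields nothing, so all() returns true).
fn is_sorted(group: &[FileMinMax]) -> bool {
    group.windows(2).all(|w| w[0].max <= w[1].min)
}
```

Allowing equality at the boundary (`<=`) mirrors the fact that files may share a boundary value and still be mergeable into one sorted stream.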
Re: [PR] Stop copying LogicalPlan and Exprs in `DecorrelatePredicateSubquery` [datafusion]
alamb merged PR #10318: URL: https://github.com/apache/datafusion/pull/10318
Re: [I] Stop copying LogicalPlan and Exprs in `DecorrelatePredicateSubquery ` [datafusion]
alamb closed issue #10289: Stop copying LogicalPlan and Exprs in `DecorrelatePredicateSubquery ` URL: https://github.com/apache/datafusion/issues/10289
Re: [PR] Stop copying LogicalPlan and Exprs in `DecorrelatePredicateSubquery` [datafusion]
alamb commented on PR #10318: URL: https://github.com/apache/datafusion/pull/10318#issuecomment-2089182854 Thanks for the review @waynexia
Re: [PR] Fix `coalesce`, `struct` and `named_strct` expr_fn function to take multiple arguments [datafusion]
alamb commented on code in PR #10321: URL: https://github.com/apache/datafusion/pull/10321#discussion_r1586856895 ## datafusion/functions/src/core/mod.rs: ## @@ -39,14 +42,68 @@ make_udf_function!(getfield::GetFieldFunc, GET_FIELD, get_field); make_udf_function!(coalesce::CoalesceFunc, COALESCE, coalesce); // Export the functions out of this package, both as expr_fn as well as a list of functions -export_functions!( -(nullif, arg_1 arg_2, "returns NULL if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the COALESCE expression."), -(arrow_cast, arg_1 arg_2, "returns arg_1 cast to the `arrow_type` given the second argument. This can be used to cast to a specific `arrow_type`."), -(nvl, arg_1 arg_2, "returns value2 if value1 is NULL; otherwise it returns value1"), -(nvl2, arg_1 arg_2 arg_3, "Returns value2 if value1 is not NULL; otherwise, it returns value3."), -(arrow_typeof, arg_1, "Returns the Arrow type of the input expression."), -(r#struct, args, "Returns a struct with the given arguments"), -(named_struct, args, "Returns a struct with the given names and arguments pairs"), -(get_field, arg_1 arg_2, "Returns the value of the field with the given name from the struct"), -(coalesce, args, "Returns `coalesce(args...)`, which evaluates to the value of the first expr which is not NULL") -); +pub mod expr_fn { Review Comment: What I couldn't figure out how to do was make the macro take both syntaxes. It would have to look something like ```rust export_functions!( // create a function with arg_1 and arg_2 Expr arguments (arrow_cast, arg_1 arg_2, "returns arg_1 cast to the `arrow_type` given the second argument. This can be used to cast to a specific `arrow_type`."), /// create a function with a single Vec<Expr> arg (coalesce, args, "Returns `coalesce(args...)`, which evaluates to the value of the first expr which is not NULL") ...
} ``` the macro definition is https://github.com/apache/datafusion/blob/7c1c7941a078153173a6c82bf2c8742bbcbdefa8/datafusion/functions/src/macros.rs#L39-L60 There may be some way to do this in Rust, but I couldn't figure it out
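The two-syntax macro discussed above can in fact be expressed with two `macro_rules!` arms, provided the literal-`args` arm is listed first; otherwise the identifier repetition `$($arg:ident)+` happily matches the single identifier `args` and generates a one-argument function. This is a self-contained sketch with a stand-in `Expr` type and an invented `export_fn!` name, not the real `export_functions!` from `datafusion/functions`:

```rust
// Stand-in for datafusion_expr::Expr, purely for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Expr(String);

// Build a fake function-call expression so the generated fns do something.
fn call(name: &str, args: Vec<Expr>) -> Expr {
    let joined = args.iter().map(|e| e.0.as_str()).collect::<Vec<_>>().join(", ");
    Expr(format!("{}({})", name, joined))
}

macro_rules! export_fn {
    // Variadic form: `export_fn!(name, args, "doc")` -> fn name(args: Vec<Expr>).
    // This arm must come FIRST so the literal token `args` is not swallowed
    // by the ident repetition below. (Caveat: a fixed-arity function whose
    // single parameter is literally named `args` would also hit this arm.)
    ($name:ident, args, $doc:expr) => {
        fn $name(args: Vec<Expr>) -> Expr {
            call(stringify!($name), args)
        }
    };
    // Fixed-arity form: `export_fn!(name, a b, "doc")` -> fn name(a: Expr, b: Expr).
    ($name:ident, $($arg:ident)+, $doc:expr) => {
        fn $name($($arg: Expr),+) -> Expr {
            call(stringify!($name), vec![$($arg),+])
        }
    };
}

export_fn!(coalesce, args, "first non-null argument");
export_fn!(nvl, arg_1 arg_2, "arg_2 if arg_1 is null");
```

`coalesce` ends up taking a `Vec<Expr>` while `nvl` takes two `Expr` parameters, which is exactly the split the comment describes.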
Re: [I] Move `create_physical_expr` to `physical-expr-common` [datafusion]
alamb commented on issue #10074: URL: https://github.com/apache/datafusion/issues/10074#issuecomment-2089174413 > The overall idea is to enable us to import common things to Expr::AggregateFunction which lives in datafusion-expr. In the chart you have in https://github.com/apache/datafusion/issues/10074#issuecomment-2088004947 it seems like it will mean that `datafusion-expr` will depend (indirectly) on `datafusion-physical-expr`, which I thought was the dependency we are trying to avoid. I am sorry I am so slow to respond and am getting lost here. Can we take a step back and help figure out what problem we are trying to solve? Can you remind me why `datafusion-functions-aggregate` (the implementation of aggregate functions) shouldn't depend on `datafusion-physical-expr` and `datafusion-expr`? I thought the thing we are trying to do is avoid `datafusion-expr` depending on `datafusion-physical-expr` 🤔 What I believe we are trying to do is to pull the aggregate functions out of datafusion-physical-expr. I am sorry if you have explained this to me already.
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
alamb commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586823032 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +) -> Option<ScalarValue> { +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: > Actually I was mistaken, these errors weren't being ignored. I could add them back in in a new PR? I think that would be really nice if you had the time.
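The narrowing step at the heart of this rule, rewriting a predicate like `cast(int32_col AS Int64) = lit_i64` into `int32_col = lit_i32` only when the literal fits in the narrower type, can be sketched with toy scalar types. This is a simplified stand-in, not the real `try_cast_literal_to_type`:

```rust
// Toy scalar values for the two integer widths involved.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Scalar {
    Int32(i32),
    Int64(i64),
}

// Try to narrow an Int64 literal to Int32 so the comparison can be
// performed directly on the un-cast column. Returns None when the value
// does not fit, in which case the rewrite must leave the cast in place.
fn try_narrow_to_i32(lit: Scalar) -> Option<Scalar> {
    match lit {
        Scalar::Int32(_) => Some(lit), // already the target type
        Scalar::Int64(v) => i32::try_from(v).ok().map(Scalar::Int32),
    }
}
```

The `None` branch is the important one: an out-of-range literal means the predicate cannot be rewritten losslessly, so the optimizer must keep the original cast expression.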
Re: [I] [Epic] A Collection of Sort Based Optimizations [datafusion]
alamb commented on issue #10313: URL: https://github.com/apache/datafusion/issues/10313#issuecomment-2089129451 > Would this ticket be an appropriate place to add tickets related to pushing down sorts to federated query engines? I know that this was discussed previously (i.e. #7871) and it seems that writing a custom optimizer is the current way to handle that. I added #7871 to the list above -- thank you. Yes, I think this would be a good place to discuss. > I will need to do this soon (federated sort pushdown) and it initially wasn't clear to me how to make this work in DataFusion. I can volunteer to write some docs on how to do this once I have an implementation that works. That would be great, thanks @phillipleblanc. Right now, once `TableProvider::execute` gets called, the returned `ExecutionPlan` can report how it is already sorted. What we don't have is any way to have the optimizer tell an `ExecutionPlan` that it could reduce the work required in the DataFusion plan if the data was already sorted. I wonder if we could add something to the `ExecutionPlan` trait similar to [`ExecutionPlan::repartitioned`](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.repartitioned) like ```rust trait ExecutionPlan { ... /// return other possible orders that this ExecutionPlan could return /// (the DataFusion optimizer will use this information to potentially push Sorts /// into the Node) fn pushable_sorts(&self) -> Result<Option<Vec<LexOrdering>>> { return Ok(None) } /// return a node like this one except that its output is sorted according to `exprs` fn resorted(&self, exprs: &LexOrdering) -> Result<Option<Arc<dyn ExecutionPlan>>> { return Ok(None) } } ``` And then add a new optimizer pass that tries to push sorts into the plan nodes that report they can provide sorted data 🤔
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] docs: update docs CI to install python-311 requirements [datafusion-python]
andygrove merged PR #661: URL: https://github.com/apache/datafusion-python/pull/661 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
alamb commented on PR #9593: URL: https://github.com/apache/datafusion/pull/9593#issuecomment-2089122508 Filed https://github.com/apache/datafusion/issues/10336 to track enable this flag by default -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]
alamb commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2089121776 Example test coverage we should add I think: https://github.com/apache/datafusion/pull/9593#discussion_r1585517605 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [Epic] A Collection of Sort Based Optimizations [datafusion]
alamb commented on issue #10313: URL: https://github.com/apache/datafusion/issues/10313#issuecomment-2089116574 Update here: we merged https://github.com/apache/datafusion/pull/9593 and now will work on increasing the test coverage to enable it by default (tracked in https://github.com/apache/datafusion/issues/10336) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Enable `split_file_groups_by_statistics` by default [datafusion]
alamb opened a new issue, #10336: URL: https://github.com/apache/datafusion/issues/10336 ### Is your feature request related to a problem or challenge? Part of https://github.com/apache/datafusion/issues/10313 In https://github.com/apache/datafusion/pull/9593, @suremarc added a way to reorganize input files in a ListingTable to avoid a merge, if the sort key ranges do not overlap. This feature is behind a feature flag, `split_file_groups_by_statistics`, which defaults to `false`, as I think there need to be some more tests in place before we turn it on. ### Describe the solution you'd like Add additional tests and then enable `split_file_groups_by_statistics` by default ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
erratic-pattern commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586805024 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { +) -> Option<ScalarValue> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: Actually I was mistaken, these errors weren't being ignored. I could add them back in a new PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
alamb commented on code in PR #9593: URL: https://github.com/apache/datafusion/pull/9593#discussion_r1586804068 ## datafusion/core/src/datasource/physical_plan/mod.rs: ## @@ -473,15 +468,281 @@ fn get_projected_output_ordering( // since rest of the orderings are violated break; } + // do not push empty entries // otherwise we may have `Some(vec![])` at the output ordering. -if !new_ordering.is_empty() { -all_orderings.push(new_ordering); +if new_ordering.is_empty() { +continue; +} + +// Check if any file groups are not sorted +if base_config.file_groups.iter().any(|group| { +if group.len() <= 1 { +// File groups with <= 1 files are always sorted +return false; +} + +let statistics = match MinMaxStatistics::new_from_files( +&new_ordering, +projected_schema, +base_config.projection.as_deref(), +group, +) { +Ok(statistics) => statistics, +Err(e) => { +log::trace!("Error fetching statistics for file group: {e}"); +// we can't prove that it's ordered, so we have to reject it +return true; +} +}; + +!statistics.is_sorted() +}) { +debug!( +"Skipping specified output ordering {:?}. \ +Some file groups couldn't be determined to be sorted: {:?}", +base_config.output_ordering[0], base_config.file_groups +); +continue; } + +all_orderings.push(new_ordering); } all_orderings } +/// A normalized representation of file min/max statistics that allows for efficient sorting & comparison. +/// The min/max values are ordered by [`Self::sort_order`]. +/// Furthermore, any columns that are reversed in the sort order have their min/max values swapped.
+pub(crate) struct MinMaxStatistics { +min_by_sort_order: arrow::row::Rows, +max_by_sort_order: arrow::row::Rows, +sort_order: Vec<PhysicalSortExpr>, +} + +impl MinMaxStatistics { +#[allow(unused)] +fn sort_order(&self) -> &[PhysicalSortExpr] { +&self.sort_order +} + +fn new_from_files<'a>( +projected_sort_order: &[PhysicalSortExpr], // Sort order with respect to projected schema +projected_schema: &SchemaRef, // Projected schema +projection: Option<&[usize]>, // Indices of projection in full table schema (None = all columns) Review Comment: I guess I was thinking of subtle bugs related to when: 1. The schema of the files is different but compatible (e.g. one file has (`time`, `date`, `symbol`) but the other file has (`date`, `symbol`, `time`), for example) 2. The query orders by a subset of the columns (e.g. `ORDER BY time`) 3. The query orders by a subset of the columns that is not the sort order (`ORDER BY date`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
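The `is_sorted` check quoted above boils down to a simple invariant: for the files in a group, in their current order, each file's minimum sort key must not precede the previous file's maximum; only then can the files be concatenated without a merge. A dependency-free sketch of that invariant, using a single scalar key instead of `arrow::row::Rows` (names here are illustrative, not the PR's actual API):

```rust
/// Per-file min/max of the sort key, simplified to one i64 column.
struct FileStats {
    min: i64,
    max: i64,
}

/// Returns true if the files' [min, max] ranges never go backwards, i.e.
/// every file starts at or after the previous file ends. Touching ranges
/// (prev.max == next.min) are allowed; overlapping ones are not.
fn is_sorted(files: &[FileStats]) -> bool {
    files.windows(2).all(|w| w[0].max <= w[1].min)
}

fn main() {
    let ok = [FileStats { min: 0, max: 5 }, FileStats { min: 5, max: 9 }];
    let overlap = [FileStats { min: 0, max: 6 }, FileStats { min: 5, max: 9 }];
    assert!(is_sorted(&ok));
    assert!(!is_sorted(&overlap));
    println!("non-overlapping file ranges can be concatenated without a merge");
}
```

The real implementation compares full multi-column sort keys via the arrow Row format (which also handles descending columns by swapping min/max, as the doc comment notes), but the pairwise `max <= next.min` shape of the check is the same.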
Re: [I] Use file statistics in query planning to avoid sorting when unecessary [datafusion]
alamb closed issue #7490: Use file statistics in query planning to avoid sorting when unecessary URL: https://github.com/apache/datafusion/issues/7490 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
alamb merged PR #9593: URL: https://github.com/apache/datafusion/pull/9593 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
erratic-pattern commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586797735 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { +) -> Option<ScalarValue> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: see https://doc.rust-lang.org/std/macro.debug_assert.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
alamb commented on PR #9593: URL: https://github.com/apache/datafusion/pull/9593#issuecomment-2089091928 > @alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first. Sorry -- sounds good. I am going to give this PR another look and file some follow on tickets. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
erratic-pattern commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586795632 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { +) -> Option<ScalarValue> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: > I wonder if ignoring errors will make tracking down unsupported types harder (I am imagining when we try and add REEArray or StringViewArray). 🤔 yeah we may want to reintroduce a Result type, but as it stands currently the errors were already being ignored anyway, so they didn't serve any purpose. also the boolean is_supported_* checks mean that we'll just silently skip those cases anyway. so realistically most of those errors just don't ever occur, and if they did occur we wouldn't know about them. we would need to rethink the design of this code to surface those errors. another interesting possibility is debug assertions that panic, but only on 'debug' builds. 
I've always found this to be an interesting but underutilized tool for uncovering bugs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
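For readers unfamiliar with the tool being discussed: `debug_assert!` is checked only in builds with debug assertions enabled (the default for `cargo build` and `cargo test`) and compiles to nothing in release builds, so it can surface "this case should never happen" bugs during development at zero production cost. A minimal illustration (the function here is made up for the example):

```rust
/// Classify a byte, using a debug-only assertion to flag inputs the caller
/// promised never to pass. In release builds the check disappears entirely.
fn classify(byte: u8) -> &'static str {
    debug_assert!(byte.is_ascii(), "unexpected non-ASCII byte: {byte}");
    if byte.is_ascii_digit() {
        "digit"
    } else {
        "other"
    }
}

fn main() {
    assert_eq!(classify(b'7'), "digit");
    assert_eq!(classify(b'x'), "other");
    // classify(0xFF) would panic under `cargo test`, but be silently
    // accepted in a release build.
    println!("ok");
}
```

Applied to `try_cast_literal_to_type`, the idea would be to keep returning `None` in release builds while panicking in debug/test builds when an unexpected `DataType` slips through.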
Re: [I] Stop copying LogicalPlan and Exprs in `CommonSubexprEliminate` [datafusion]
alamb closed issue #10211: Stop copying LogicalPlan and Exprs in `CommonSubexprEliminate` URL: https://github.com/apache/datafusion/issues/10211 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Stop copying LogicalPlan and Exprs in `CommonSubexprEliminate` [datafusion]
alamb commented on issue #10211: URL: https://github.com/apache/datafusion/issues/10211#issuecomment-2089088807 > @alamb, why did you reopen this? I am not sure. Sorry about that -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]
alamb closed pull request #10221: fix: preserve more dictionaries when coercing types URL: https://github.com/apache/datafusion/pull/10221 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]
alamb commented on PR #10221: URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2089086256 I believe this is superceded by https://github.com/apache/datafusion/pull/10323 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Implement Spark-compatible CAST from string to timestamp types [datafusion-comet]
andygrove commented on code in PR #335: URL: https://github.com/apache/datafusion-comet/pull/335#discussion_r1586766148 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -510,9 +558,246 @@ impl PhysicalExpr for Cast { } } +fn timestamp_parser(value: &str, eval_mode: EvalMode) -> CometResult<Option<i64>> { +let value = value.trim(); +if value.is_empty() { +return Ok(None); +} + +// Define regex patterns and corresponding parsing functions +let patterns = &[ +( +Regex::new(r"^\d{4}$").unwrap(), +parse_str_to_year_timestamp as fn(&str) -> CometResult<Option<i64>>, +), +( +Regex::new(r"^\d{4}-\d{2}$").unwrap(), +parse_str_to_month_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap(), +parse_str_to_day_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}T\d{1,2}$").unwrap(), +parse_str_to_hour_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}$").unwrap(), +parse_str_to_minute_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$").unwrap(), +parse_str_to_second_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6}$").unwrap(), +parse_str_to_microsecond_timestamp, +), +( +Regex::new(r"^T\d{1,2}$").unwrap(), +parse_str_to_time_only_timestamp, +), +]; + +let mut timestamp = None; + +// Iterate through patterns and try matching +for (pattern, parse_func) in patterns { +if pattern.is_match(value) { +timestamp = parse_func(value)?; +break; +} +} + +if timestamp.is_none() { +if eval_mode == EvalMode::Ansi { +return Err(CometError::CastInvalidValue { +value: value.to_string(), +from_type: "STRING".to_string(), +to_type: "TIMESTAMP".to_string(), +}); +} else { +return Ok(None); +} +} +Ok(Some(timestamp.unwrap())) +} + +fn parse_ymd_timestamp(year: i32, month: u32, day: u32) -> CometResult<Option<i64>> { +let datetime = chrono::Utc +.with_ymd_and_hms(year, month, day, 0, 0, 0) +.unwrap() +.with_timezone(&chrono::Utc); +Ok(Some(datetime.timestamp_micros())) +} + +fn parse_hms_timestamp( +year: i32, +month: u32, +day: u32, +hour: u32, +minute: u32, +second: u32, +microsecond: u32, +) -> CometResult<Option<i64>> { +let datetime = chrono::Utc +.with_ymd_and_hms(year, month, day, hour, minute, second) +.unwrap() +.with_timezone(&chrono::Utc) +.with_nanosecond(microsecond * 1000); +Ok(Some(datetime.unwrap().timestamp_micros())) +} + +fn get_timestamp_values(value: &str, timestamp_type: &str) -> CometResult<Option<i64>> { +let values: Vec<_> = value +.split(|c| c == 'T' || c == '-' || c == ':' || c == '.') +.collect(); +let year = values[0].parse::<i32>().unwrap_or_default(); +let month = values.get(1).map_or(1, |m| m.parse::<u32>().unwrap_or(1)); +let day = values.get(2).map_or(1, |d| d.parse::<u32>().unwrap_or(1)); +let hour = values.get(3).map_or(0, |h| h.parse::<u32>().unwrap_or(0)); +let minute = values.get(4).map_or(0, |m| m.parse::<u32>().unwrap_or(0)); +let second = values.get(5).map_or(0, |s| s.parse::<u32>().unwrap_or(0)); +let microsecond = values.get(6).map_or(0, |ms| ms.parse::<u32>().unwrap_or(0)); + +match timestamp_type { +"year" => parse_ymd_timestamp(year, 1, 1), +"month" => parse_ymd_timestamp(year, month, 1), +"day" => parse_ymd_timestamp(year, month, day), +"hour" => parse_hms_timestamp(year, month, day, hour, 0, 0, 0), +"minute" => parse_hms_timestamp(year, month, day, hour, minute, 0, 0), +"second" => parse_hms_timestamp(year, month, day, hour, minute, second, 0), +"microsecond" => parse_hms_timestamp(year, month, day, hour, minute, second, microsecond), +_ => Err(CometError::CastInvalidValue { +value: value.to_string(), +from_type: "STRING".to_string(), +to_type: "TIMESTAMP".to_string(), +}), +} +} + +fn parse_str_to_year_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "year") +} + +fn parse_str_to_month_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "month") +} + +fn parse_str_to_day_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "day") +} + +fn parse_str_to_hour_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "hour") +} + +fn parse_str_to_minute_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "minute") +} + +fn parse_str_to_second_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "second") +} + +fn
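The core idea in the diff above is shape dispatch: classify the input string by its pattern, then parse only the fields that are present and default the rest (year-only means January 1st, and so on). A stdlib-only sketch of that idea, without the `regex` and `chrono` crates the real Comet code uses (`parse_date_prefix` is a simplified, hypothetical helper, not the PR's API):

```rust
/// Parse the leading "YYYY", "YYYY-MM", or "YYYY-MM-DD" portion of a
/// timestamp string, defaulting missing month/day to 1 as Spark does for
/// its year-only and year-month patterns. Returns None for invalid input.
fn parse_date_prefix(value: &str) -> Option<(i32, u32, u32)> {
    let mut parts = value.trim().splitn(3, '-');
    let year = parts.next()?.parse::<i32>().ok()?;
    let month = match parts.next() {
        Some(m) => m.parse::<u32>().ok().filter(|m| (1..=12).contains(m))?,
        None => 1, // "2024" -> 2024-01-01
    };
    let day = match parts.next() {
        Some(d) => d.parse::<u32>().ok().filter(|d| (1..=31).contains(d))?,
        None => 1, // "2024-05" -> 2024-05-01
    };
    Some((year, month, day))
}

fn main() {
    assert_eq!(parse_date_prefix("2024"), Some((2024, 1, 1)));
    assert_eq!(parse_date_prefix("2024-05"), Some((2024, 5, 1)));
    assert_eq!(parse_date_prefix("2024-05-02"), Some((2024, 5, 2)));
    assert_eq!(parse_date_prefix("not-a-date"), None);
    println!("ok");
}
```

The production code additionally validates the overall shape with regexes before parsing, handles time-of-day and fractional seconds, and converts the fields to epoch microseconds via chrono.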
Re: [PR] fix: limit with offset should return correct results [datafusion-comet]
viirya commented on code in PR #359: URL: https://github.com/apache/datafusion-comet/pull/359#discussion_r1586759865 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -340,7 +340,7 @@ class CometSparkSessionExtensions op } -case op: LocalLimitExec => +case op: LocalLimitExec if getOffset(op) == 0 => Review Comment: Oh, I missed it. I will create a follow up. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
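Background on the condition being discussed: a local, per-partition limit can only take the first `limit` rows of each partition, while an offset must be applied globally after partitions are merged, so a plan that carries a nonzero offset cannot be replaced by a purely local limit. The intended global semantics can be sketched in plain Rust (illustrative only, not the Comet implementation):

```rust
/// Global LIMIT-with-OFFSET semantics: skip `offset` rows of the merged
/// stream, then take at most `limit` rows.
fn limit_with_offset<T: Clone>(rows: &[T], offset: usize, limit: usize) -> Vec<T> {
    rows.iter().skip(offset).take(limit).cloned().collect()
}

fn main() {
    let rows: Vec<i32> = (1..=10).collect();
    // OFFSET 3 LIMIT 4 over rows 1..=10 yields rows 4..=7
    assert_eq!(limit_with_offset(&rows, 3, 4), vec![4, 5, 6, 7]);
    println!("{:?}", limit_with_offset(&rows, 3, 4));
}
```

Applying the same skip/take per partition instead of globally would drop `offset` rows from every partition, which is why the rewrite must fall back when an offset is present.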
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
alamb commented on PR #10323: URL: https://github.com/apache/datafusion/pull/10323#issuecomment-2089031318 I made a PR with a few more tests here: https://github.com/apache/datafusion/pull/10335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Minor: add a few more dictionary unwrap tests [datafusion]
alamb opened a new pull request, #10335: URL: https://github.com/apache/datafusion/pull/10335 ## Which issue does this PR close? Small follow on to https://github.com/apache/datafusion/pull/10323 ## Rationale for this change I noticed a few more cases I thought should be covered while reviewing https://github.com/apache/datafusion/pull/10323 ## What changes are included in this PR? Add a few more tests for unwrapping dictionaries (verify that the order of the column and literal does not matter) ## Are these changes tested? All tests ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
alamb commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586745915 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { +) -> Option<ScalarValue> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: I wonder if ignoring errors will make tracking down unsupported types harder (I am imagining when we try and add REEArray or StringViewArray). 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Slow comparisions to dictionary columns with type coercion [datafusion]
alamb closed issue #10220: Slow comparisions to dictionary columns with type coercion URL: https://github.com/apache/datafusion/issues/10220 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
alamb merged PR #10323: URL: https://github.com/apache/datafusion/pull/10323 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: limit with offset should return correct results [datafusion-comet]
parthchandra commented on code in PR #359: URL: https://github.com/apache/datafusion-comet/pull/359#discussion_r1586750015 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -340,7 +340,7 @@ class CometSparkSessionExtensions op } -case op: LocalLimitExec => +case op: LocalLimitExec if getOffset(op) == 0 => Review Comment: Add a `withInfo` for the case when `offset > 0` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Implement Spark-compatible CAST from string to timestamp types [datafusion-comet]
andygrove commented on code in PR #335: URL: https://github.com/apache/datafusion-comet/pull/335#discussion_r1586743530 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -86,6 +89,24 @@ macro_rules! cast_utf8_to_int { }}; } +macro_rules! cast_utf8_to_timestamp { +($array:expr, $eval_mode:expr, $array_type:ty, $cast_method:ident) => {{ +let len = $array.len(); +let mut cast_array = PrimitiveArray::<$array_type>::builder(len).with_timezone("UTC"); +for i in 0..len { +if $array.is_null(i) { +cast_array.append_null() +} else if let Ok(Some(cast_value)) = $cast_method($array.value(i).trim(), $eval_mode) { +cast_array.append_value(cast_value); +} else { +cast_array.append_null() +} +} +let result: CometResult<ArrayRef> = Ok(Arc::new(cast_array.finish()) as ArrayRef); +result.unwrap() Review Comment: We can remove the `unwrap` here: ```suggestion let result: ArrayRef = Arc::new(cast_array.finish()) as ArrayRef; result ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Add coalesce tests [datafusion]
alamb commented on code in PR #10334: URL: https://github.com/apache/datafusion/pull/10334#discussion_r1586740016 ## datafusion/sqllogictest/test_files/coalesce.slt: ## @@ -0,0 +1,374 @@ +# Licensed to the Apache Software Foundation (ASF) under one Review Comment: Given the size of these tests I opted to put them in their own file rather than `function.slt`
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
alamb commented on PR #10268: URL: https://github.com/apache/datafusion/pull/10268#issuecomment-2088998120 I made a PR with just your wonderful tests in https://github.com/apache/datafusion/pull/10334/files
Re: [I] Unable to Run tests of CometShuffleEncryptionSuite [datafusion-comet]
ganeshkumar269 commented on issue #367: URL: https://github.com/apache/datafusion-comet/issues/367#issuecomment-2088994122 ./mvnw test -Dsuites=org.apache.comet.exec.CometShuffleEncryptionSuite ran the above command, same result
Re: [PR] Coverage: Add a manual test to show what Spark built in expression the DF can support directly [datafusion-comet]
parthchandra commented on code in PR #331: URL: https://github.com/apache/datafusion-comet/pull/331#discussion_r1586727820 ## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -123,7 +134,7 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH Seq(("", "No examples found in spark.sessionState.functionRegistry" } -// TODO: convert results into HTML +// TODO: convert results into HTML or .md file Review Comment: Perhaps log an issue for this and refer to that instead? It would be great to link the output directly into the documentation.
[I] Write a guide on contributing a new expression [datafusion-comet]
andygrove opened a new issue, #370: URL: https://github.com/apache/datafusion-comet/issues/370 ### What is the problem the feature request solves? We should write a detailed guide showing new contributors how to add support for a new expression in Comet. This would cover the Scala, Protobuf, and Rust changes, as well as testing approaches. We could create this based on a recent PR that added a new expression. ### Describe the potential solution _No response_ ### Additional context _No response_
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
alamb commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586698773 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -53,20 +54,31 @@ pub fn data_types( } } -let valid_types = get_valid_types(_signature, current_types)?; - +let mut valid_types = get_valid_types(_signature, current_types)?; if valid_types .iter() .any(|data_type| data_type == current_types) { return Ok(current_types.to_vec()); } -// Try and coerce the argument types to match the signature, returning the -// coerced types from the first matching signature. -for valid_types in valid_types { -if let Some(types) = maybe_data_types(_types, current_types) { -return Ok(types); +// Well-supported signature that returns exact valid types. +if !valid_types.is_empty() +&& matches!(signature.type_signature, TypeSignature::VariadicEqualOrNull) +{ +// exact valid types +assert_eq!(valid_types.len(), 1); +let valid_types = valid_types.swap_remove(0); Review Comment: is it a problem that this may change the type ordering? So that the last type is now the first in the new Vec? ## datafusion/functions/src/core/coalesce.rs: ## @@ -60,9 +59,8 @@ impl ScalarUDFImpl for CoalesceFunc { } fn return_type(, arg_types: &[DataType]) -> Result { -// COALESCE has multiple args and they might get coerced, get a preview of this Review Comment: ❤️ ## datafusion/sqllogictest/test_files/functions.slt: ## @@ -1158,3 +1158,347 @@ drop table uuid_table statement ok drop table t + + +# Test coalesce function +query I +select coalesce(1, 2, 3); + +1 + +# test with first null +query ?T +select coalesce(null, 3, 2, 1), arrow_typeof(coalesce(null, 3, 2, 1)); + +3 Int64 + +# test with null values +query ? 
+select coalesce(null, null); + +NULL + +# cast to float +query IT +select + coalesce(1, 2.0), + arrow_typeof(coalesce(1, 2.0)) +; + +1 Float64 + +query RT +select + coalesce(2.0, 1), + arrow_typeof(coalesce(2.0, 1)) +; + +2 Float64 + +query IT +select + coalesce(1, arrow_cast(2.0, 'Float32')), + arrow_typeof(coalesce(1, arrow_cast(2.0, 'Float32'))) +; + +1 Float32 + +# test with empty args +query error +select coalesce(); + +# test with different types +query I +select coalesce(arrow_cast(1, 'Int32'), arrow_cast(1, 'Int64')); + +1 + +# test with nulls +query ?T +select coalesce(null, null, null), arrow_typeof(coalesce(null, null)); + +NULL Null + +# i32 and u32, cast to wider type i64 +query IT +select + coalesce(arrow_cast(2, 'Int32'), arrow_cast(3, 'UInt32')), + arrow_typeof(coalesce(arrow_cast(2, 'Int32'), arrow_cast(3, 'UInt32'))); + +2 Int64 + +query IT +select + coalesce(arrow_cast(2, 'Int16'), arrow_cast(3, 'UInt16')), + arrow_typeof(coalesce(arrow_cast(2, 'Int16'), arrow_cast(3, 'UInt16'))); + +2 Int32 + +query IT +select + coalesce(arrow_cast(2, 'Int8'), arrow_cast(3, 'UInt8')), + arrow_typeof(coalesce(arrow_cast(2, 'Int8'), arrow_cast(3, 'UInt8'))); + +2 Int16 + +# i64 and u32, cast to wider type i64 +query IT +select + coalesce(2, arrow_cast(3, 'UInt32')), + arrow_typeof(coalesce(2, arrow_cast(3, 'UInt32'))); + +2 Int64 + +# TODO: Got types (i64, u64), casting to decimal or double or even i128 if supported +query IT +select + coalesce(2, arrow_cast(3, 'UInt64')), + arrow_typeof(coalesce(2, arrow_cast(3, 'UInt64'))); + +2 Int64 + +statement ok +create table t1 (a bigint, b int, c int) as values (null, null, 1), (null, 2, null); + +# Follow Postgres and DuckDB behavior, since a is bigint, although the value is null, all args are coerced to bigint +query IT +select + coalesce(a, b, c), + arrow_typeof(coalesce(a, b, c)) +from t1; + +1 Int64 +2 Int64 + +# b, c has the same type int, so the result is int +query IT +select + coalesce(b, c), + 
arrow_typeof(coalesce(b, c)) +from t1; + +1 Int32 +2 Int32 + +statement ok +drop table t1; + +# test multi rows +statement ok +CREATE TABLE t1( + c1 int, + c2 int +) as VALUES +(1, 2), +(NULL, 2), +(1, NULL), +(NULL, NULL); + +query I +SELECT COALESCE(c1, c2) FROM t1 + +1 +2 +1 +NULL + +statement ok +drop table t1; + +# Decimal128(7, 2) and int64 are coerced to common wider type Decimal128(22, 2) +query RT +select + coalesce(arrow_cast(2, 'Decimal128(7, 2)'), 0), + arrow_typeof(coalesce(arrow_cast(2, 'Decimal128(7, 2)'), 0)) + +2 Decimal128(22, 2) + +query RT +select + coalesce(arrow_cast(2, 'Decimal256(7, 2)'), 0), + arrow_typeof(coalesce(arrow_cast(2, 'Decimal256(7, 2)'), 0)); + +2 Decimal256(22, 2) + +# coalesce string +query T? +select + coalesce('', 'test'), + coalesce(null, 'test'); + +(empty) test + +# coalesce utf8 and large utf8 +query TT +select + coalesce('a', arrow_cast('b', 'LargeUtf8')), + arrow_typeof(coalesce('a',
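On the `swap_remove(0)` ordering question raised in the review comments above: `Vec::swap_remove` backfills the removed slot with the *last* element, so relative order is not preserved in general; with a single element, as the `assert_eq!` on the length guarantees in the PR, no reordering can occur. A minimal stdlib demonstration:

```rust
fn main() {
    // swap_remove(0) is O(1): it removes index 0 and moves the *last*
    // element into that slot, so relative order changes.
    let mut valid_types = vec!["Int64", "Float64", "Utf8"];
    let first = valid_types.swap_remove(0);
    assert_eq!(first, "Int64");
    assert_eq!(valid_types, ["Utf8", "Float64"]); // last element moved to front

    // With exactly one element there is nothing to reorder.
    let mut single = vec!["Int64"];
    assert_eq!(single.swap_remove(0), "Int64");
    assert!(single.is_empty());
}
```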
[I] Plan first release [datafusion-comet]
andygrove opened a new issue, #369: URL: https://github.com/apache/datafusion-comet/issues/369 ### What is the problem the feature request solves? During the Comet public meeting this morning, there were questions about when the first official release would be. We do not really have an answer to that yet, but we can use this issue to discuss it. Here are some ideas for milestones that we may want to achieve before creating an official source release (note that we do not necessarily need to create binary releases right away). - Ensure that currently supported operators and expressions are fully compatible with all supported Spark versions - Achieve 100% coverage for TPC-H and/or TPC-DS benchmark with a clear performance advantage - Create release process and scripts for publishing a source release ### Describe the potential solution _No response_ ### Additional context _No response_
[I] Enable GitHub discussions [datafusion-comet]
andygrove opened a new issue, #368: URL: https://github.com/apache/datafusion-comet/issues/368 ### What is the problem the feature request solves? During the public Comet meeting this morning we discussed enabling GitHub discussions as a place for people to ask questions and also potentially as a place to record notes from meetings. We would be careful not to use discussions to make any decisions because discussions do not get sent to the mailing list (but all activity on issues and PRs do go to the mailing list, so this is where we make most decisions). ### Describe the potential solution _No response_ ### Additional context _No response_
[PR] remove expr node accumulation [datafusion]
MohamedAbdeen21 opened a new pull request, #10333: URL: https://github.com/apache/datafusion/pull/10333 ## Which issue does this PR close? Closes #10280. Thanks @JasonLi-cn for pointing out the relevant code snippet and the detailed description, saved me some time ❤️ ## Rationale for this change Making the plans human-readable when common sub expression elimination is applied (check example in ticket and edited tests in this PR). ## What changes are included in this PR? ## Are these changes tested? Changed tests to reflect the change. ## Are there any user-facing changes? Readable plans.
Re: [PR] feat: Add a test for supported types of SortMergeJoin in DataFusion [datafusion-comet]
planga82 commented on code in PR #365: URL: https://github.com/apache/datafusion-comet/pull/365#discussion_r1586685070 ## common/src/main/scala/org/apache/comet/CometConf.scala: ## @@ -383,6 +383,13 @@ object CometConf { .booleanConf .createWithDefault(false) + val COMET_SORTMERGEJOIN_CHECK_KEY_TYPES: ConfigEntry[Boolean] = Review Comment: thanks @comphead . Any idea how to do this in a cleaner way? We could move this constant to another place and avoid putting it in CometConf. That way we avoid exposing a testing configuration. What do you think?
Re: [I] Use non-comparison coercion for `Coalesce` or even avoid implicit casting for `Coalesce` [datafusion]
alamb commented on issue #10261: URL: https://github.com/apache/datafusion/issues/10261#issuecomment-2088944901 BTW this is a really nicely written ticket
Re: [I] Finalize SIGMOD 2024 paper ~(if accepted)~ [datafusion]
alamb commented on issue #8373: URL: https://github.com/apache/datafusion/issues/8373#issuecomment-2088935249 I think it is finally accepted ``` -- Forwarded message - From: Tim Pollitt Date: Wed, May 1, 2024 at 2:44 PM Subject: ACM Proceedings (sigmodpods TP) - Submission Follow Up, ID No: modIP001 To: Cc: , , , , , Dear Author(s), For the re-submission of "Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine" to the proceedings and ACM Digital Library. See below for info related to your final pdf. In addition, authors are requested by the chairs to submit an optional presentation video. (1) For more info, see instructions for details, and inquiry link to chairs: https://www.scomminc.com/pp/acmsig/sigmod-video.htm by May 6th. (2) You will need the DOI assigned to your submission: 10.1145/3626246.3653368 (3) Video descriptions over the 1,024 ACM character count will be deleted/removed by ACM. * We received notice of your completed ACM form, so everything appears to be in order with this submission, and will be utilized for the ACM DL publication. No revisions are needed, nor will be accepted. This submission is being moved onto the next stages of production (but kindly note the signed ACM forms still go through a double-check by ACM for various permissions). Thank you, Tim Pollitt Sheridan Communications ```
[I] Tracking Upgrade to Datafusion 37 [datafusion-python]
Michael-J-Ward opened a new issue, #663: URL: https://github.com/apache/datafusion-python/issues/663 I sketched out some of the upgrade in #662 and wanted to share what I encountered. Some major changes between datafusion 36 and 37. ## https://github.com/apache/datafusion/issues/9285 That means much of `datafusion-python::functions.rs` needs an update. An incomplete first pass is https://github.com/apache/datafusion-python/pull/662/commits/83249fe9a4e809bb6065671bcaee543065d2d356 ## `SessionContext::tables` was removed in https://github.com/apache/datafusion/pull/9627 Many of the tests rely on this method. Does each test get updated to a new API or do we implement our own version to start and then update the tests until it's not necessary? ## new trait methods for `ExecutionPlan` and `ExecutionPlanProperties` for `DatasetExec` need implementing
Re: [I] Unable to Run tests of CometShuffleEncryptionSuite [datafusion-comet]
viirya commented on issue #367: URL: https://github.com/apache/datafusion-comet/issues/367#issuecomment-2088932628 Have you tried to run it from the command line?
Re: [PR] feat: Only allow supported and tested cast operations to be converted to native [datafusion-comet]
andygrove commented on code in PR #362: URL: https://github.com/apache/datafusion-comet/pull/362#discussion_r158786 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -1042,10 +1042,10 @@ object CometSparkSessionExtensions extends Logging { * The node with information (if any) attached */ def withInfo[T <: TreeNode[_]](node: T, info: String, exprs: T*): T = { -val exprInfo = exprs - .flatMap { e => Seq(e.getTagValue(CometExplainInfo.EXTENSION_INFO)) } - .flatten - .mkString("\n") +val exprInfo = + (Seq(node.getTagValue(CometExplainInfo.EXTENSION_INFO)) ++ exprs +.flatMap(e => Seq(e.getTagValue(CometExplainInfo.EXTENSION_INFO)))) +.flatten +.mkString("\n") Review Comment: I updated this to avoid duplicates
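The duplicate-avoidance described in that comment can be sketched with standard-library types (the `with_info` function and its string-based signature are hypothetical stand-ins for the Scala `withInfo` helper):

```rust
use std::collections::HashSet;

// Stand-in for withInfo's accumulation: gather the node's own tag plus
// its child expressions' tags, skip missing ones, and drop duplicates
// while keeping the first occurrence of each string.
fn with_info(node_info: Option<&str>, child_infos: &[Option<&str>]) -> String {
    let mut seen = HashSet::new();
    std::iter::once(node_info)
        .chain(child_infos.iter().copied())
        .flatten()                   // drop Nones (untagged nodes)
        .filter(|s| seen.insert(*s)) // keep only the first occurrence
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let info = with_info(
        Some("unsupported expression"),
        &[None, Some("unsupported expression"), Some("offset > 0 not supported")],
    );
    assert_eq!(info, "unsupported expression\noffset > 0 not supported");
}
```

Deduplicating at collection time keeps the joined explain output readable when the same reason is tagged on both a node and its children.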
Re: [I] DataFusion `38.0.0` Release [datafusion]
alamb commented on issue #10217: URL: https://github.com/apache/datafusion/issues/10217#issuecomment-2088930080 Here is a PR with a proposed API https://github.com/apache/datafusion/pull/10330 that would close https://github.com/apache/datafusion/issues/10181
Re: [I] Error "entered unreachable code: NamedStructField should be rewritten in OperatorToFunction" after upgrade to 37 [datafusion]
alamb commented on issue #10181: URL: https://github.com/apache/datafusion/issues/10181#issuecomment-2088929490 Here is a PR with a proposed new API: https://github.com/apache/datafusion/pull/10330
Re: [PR] Add APIs to create `PhysicalExpressions` from `Expr`s: `SessionContext::create_physical_expr()` and `SessionState::create_physical_expr()` [datafusion]
alamb commented on code in PR #10330: URL: https://github.com/apache/datafusion/pull/10330#discussion_r1586649331 ## datafusion-examples/examples/expr_api.rs: ## @@ -92,7 +90,8 @@ fn evaluate_demo() -> Result<()> { let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8))); // First, you make a "physical expression" from the logical `Expr` -let physical_expr = physical_expr(&batch.schema(), expr)?; +let df_schema = DFSchema::try_from(batch.schema())?; Review Comment: This shows how the new API is used - `SessionContext::new().create_physical_expr` ## datafusion/common/src/dfschema.rs: ## @@ -806,6 +820,12 @@ impl From<DFSchema> for Schema { } } +impl AsRef<Schema> for DFSchema { Review Comment: This allows DFSchema to be treated like a `Schema`, which is now possible after https://github.com/apache/datafusion/pull/9595 ## datafusion-examples/examples/expr_api.rs: ## @@ -248,21 +251,6 @@ fn make_ts_field(name: &str) -> Field { make_field(name, DataType::Timestamp(TimeUnit::Nanosecond, tz)) } -/// Build a physical expression from a logical one, after applying simplification and type coercion -pub fn physical_expr(schema: &Schema, expr: Expr) -> Result<Arc<dyn PhysicalExpr>> { Review Comment: This PR basically ports this code out of the examples and into `SessionState` and adds documentation and tests ## datafusion/core/src/execution/context/mod.rs: ## @@ -510,6 +515,34 @@ impl SessionContext { } } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type Review Comment: New public API with example ## datafusion/core/src/execution/context/mod.rs: ## @@ -2024,6 +2089,35 @@ impl SessionState { } } +struct SessionSimpifyProvider<'a> { Review Comment: This avoids cloning schema / schema refs ## datafusion/core/tests/expr_api/mod.rs: ## @@ -0,0 +1,181 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership.
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use arrow::util::pretty::pretty_format_columns; +use arrow_array::builder::{ListBuilder, StringBuilder}; +use arrow_array::{ArrayRef, RecordBatch, StringArray, StructArray}; +use arrow_schema::{DataType, Field}; +use datafusion::prelude::*; +use datafusion_common::DFSchema; +/// Tests of using and evaluating `Expr`s outside the context of a LogicalPlan +use std::sync::{Arc, OnceLock}; + +#[test] +fn test_eq() { +// id = '2' +evaluate_expr_test( +col("id").eq(lit("2")), +vec![ +"+---+", +"| expr |", +"+---+", +"| false |", +"| true |", +"| false |", +"+---+", +], +); +} + +#[test] +fn test_eq_with_coercion() { +// id = 2 (need to coerce the 2 to '2' to evaluate) +evaluate_expr_test( +col("id").eq(lit(2i32)), +vec![ +"+---+", +"| expr |", +"+---+", +"| false |", +"| true |", +"| false |", +"+---+", +], +); +} + +#[test] +fn test_get_field() { Review Comment: this tests fail without function rewrites applied ## datafusion/core/src/execution/context/mod.rs: ## @@ -1947,6 +1982,36 @@ impl SessionState { .await } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. 
+/// +/// Note: The expression is not [simplified] +/// +/// # See Also: +/// * [`SessionContext::create_physical_expr`] for a higher-level API +/// * [`create_physical_expr`] for a lower-level API +/// +/// [simplified]: datafusion_optimizer::simplify_expressions +pub fn create_physical_expr( +&self, +expr: Expr, +df_schema: &DFSchema, +) -> Result<Arc<dyn PhysicalExpr>> { +let simplifier = +ExprSimplifier::new(SessionSimpifyProvider::new(self, df_schema)); +// apply type coercion here to ensure types match +let mut expr = simplifier.coerce(expr, df_schema)?; + +// rewrite Exprs to functions if necessary +let config_options = self.config_options(); Review Comment: Applying these rewrites is a new feature (and what
Re: [PR] feat: Implement Spark-compatible CAST from string to timestamp types [datafusion-comet]
andygrove commented on code in PR #335: URL: https://github.com/apache/datafusion-comet/pull/335#discussion_r1586652474 ## spark/src/test/scala/org/apache/comet/CometCastSuite.scala: ## @@ -528,14 +528,29 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { "spark.comet.cast.stringToTimestamp is disabled") } - ignore("cast StringType to TimestampType") { -// https://github.com/apache/datafusion-comet/issues/328 -withSQLConf((CometConf.COMET_CAST_STRING_TO_TIMESTAMP.key, "true")) { - val values = Seq("2020-01-01T12:34:56.123456", "T2") ++ generateStrings(timestampPattern, 8) - castTest(values.toDF("a"), DataTypes.TimestampType) + test("cast StringType to TimestampType") { +withSQLConf( + SQLConf.SESSION_LOCAL_TIMEZONE.key -> "UTC", + CometConf.COMET_CAST_STRING_TO_TIMESTAMP.key -> "true") { + val values = Seq( +"2020", +"2020-01", +"2020-01-01", +"2020-01-01T12", +"2020-01-01T12:34", +"2020-01-01T12:34:56", +"2020-01-01T12:34:56.123456", +"T2", +"-9?") + castTimestampTest(values.toDF("a"), DataTypes.TimestampType) } } + test("cast StringType to TimestampType with invalid timezone") { Review Comment: Thanks for adding this
Re: [PR] doc: Clean up supported JDKs in README [datafusion-comet]
andygrove merged PR #366: URL: https://github.com/apache/datafusion-comet/pull/366
Re: [PR] doc: Fix target typo in development.md [datafusion-comet]
andygrove merged PR #364: URL: https://github.com/apache/datafusion-comet/pull/364
Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]
parthchandra commented on code in PR #345: URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1586638726 ## docs/source/contributor-guide/plugin_overview.md: ## @@ -0,0 +1,50 @@ + + +# Comet Plugin Overview + +The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` class, which can be registered with Spark by adding the following setting to the Spark configuration when launching `spark-shell` or `spark-submit`: + +``` +--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions +``` + +On initialization, this class registers two physical plan optimization rules with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a query stage is being planned. + +## CometScanRule + +`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes. + +When the V1 data source API is being used, `FileSourceScanExec` is replaced with `CometScanExec`. + +When the V2 data source API is being used, `BatchScanExec` is replaced with `CometBatchScanExec`. + +## CometExecRule + +`CometExecRule` attempts to transform a Spark physical plan into a Comet plan. + +This rule traverses bottom-up from the original Spark plan and attempts to replace each node with a Comet equivalent. For example, a `ProjectExec` will be replaced by `CometProjectExec`. + +When replacing a node, various checks are performed to determine if Comet can support the operator and its expressions. If an operator or expression is not supported by Comet then the reason will be stored in a tag on the underlying Spark node. Running `explain` on a query will show any reasons that prevented the plan from being executed natively in Comet. If any part of the plan is not supported in Comet then the original Spark plan will be returned. Review Comment: Sorry, just seeing this today. The `explain` information will show up in the UI or `EXPLAIN` output only from Spark 4.0.0 onwards as the `ExtendedExplainGenerator` trait was only added in Spark 4.0. 
Internally, we can always call `ExtendedExplainInfo.generateExtendedInfo(plan)`
Re: [I] Setup regular meetups for Comet development [datafusion-comet]
parthchandra commented on issue #217: URL: https://github.com/apache/datafusion-comet/issues/217#issuecomment-2088897629 For those who might be reading this in the future: Meetup links: https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing
Re: [I] Setup regular meetups for Comet development [datafusion-comet]
parthchandra closed issue #217: Setup regular meetups for Comet development URL: https://github.com/apache/datafusion-comet/issues/217 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add a test for supported types of SortMergeJoin in DataFusion [datafusion-comet]
comphead commented on code in PR #365: URL: https://github.com/apache/datafusion-comet/pull/365#discussion_r1586622441 ## common/src/main/scala/org/apache/comet/CometConf.scala: ## @@ -383,6 +383,13 @@ object CometConf { .booleanConf .createWithDefault(false) + val COMET_SORTMERGEJOIN_CHECK_KEY_TYPES: ConfigEntry[Boolean] = Review Comment: Thanks @planga82. I'm not sure we need to change the conf just for unit-test purposes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Implement Spark-compatible CAST from string to integral types [datafusion-comet]
parthchandra commented on code in PR #307: URL: https://github.com/apache/datafusion-comet/pull/307#discussion_r1586606093 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -142,6 +233,202 @@ impl Cast { } } +/// Equivalent to org.apache.spark.unsafe.types.UTF8String.toByte +fn cast_string_to_i8(str: &str, eval_mode: EvalMode) -> CometResult<Option<i8>> { +Ok(cast_string_to_int_with_range_check( +str, +eval_mode, +"TINYINT", +i8::MIN as i32, +i8::MAX as i32, +)? +.map(|v| v as i8)) +} + +/// Equivalent to org.apache.spark.unsafe.types.UTF8String.toShort +fn cast_string_to_i16(str: &str, eval_mode: EvalMode) -> CometResult<Option<i16>> { +Ok(cast_string_to_int_with_range_check( +str, +eval_mode, +"SMALLINT", +i16::MIN as i32, +i16::MAX as i32, +)? +.map(|v| v as i16)) +} + +/// Equivalent to org.apache.spark.unsafe.types.UTF8String.toInt(IntWrapper intWrapper) +fn cast_string_to_i32(str: &str, eval_mode: EvalMode) -> CometResult<Option<i32>> { +do_cast_string_to_int::<i32>(str, eval_mode, "INT", i32::MIN) +} + +/// Equivalent to org.apache.spark.unsafe.types.UTF8String.toLong(LongWrapper intWrapper) +fn cast_string_to_i64(str: &str, eval_mode: EvalMode) -> CometResult<Option<i64>> { +do_cast_string_to_int::<i64>(str, eval_mode, "BIGINT", i64::MIN) +} + +fn cast_string_to_int_with_range_check( +str: &str, +eval_mode: EvalMode, +type_name: &str, +min: i32, +max: i32, +) -> CometResult<Option<i32>> { +match do_cast_string_to_int(str, eval_mode, type_name, i32::MIN)? 
{ +None => Ok(None), +Some(v) if v >= min && v <= max => Ok(Some(v)), +_ if eval_mode == EvalMode::Ansi => Err(invalid_value(str, "STRING", type_name)), +_ => Ok(None), +} +} + +#[derive(PartialEq)] +enum State { +SkipLeadingWhiteSpace, +SkipTrailingWhiteSpace, +ParseSignAndDigits, +ParseFractionalDigits, +} + +/// Equivalent to +/// - org.apache.spark.unsafe.types.UTF8String.toInt(IntWrapper intWrapper, boolean allowDecimal) +/// - org.apache.spark.unsafe.types.UTF8String.toLong(LongWrapper longWrapper, boolean allowDecimal) +fn do_cast_string_to_int< +T: Num + PartialOrd + Integer + CheckedSub + CheckedNeg + From<i32> + Copy, +>( +str: &str, +eval_mode: EvalMode, +type_name: &str, +min_value: T, +) -> CometResult<Option<T>> { +let len = str.len(); +if str.is_empty() { +return none_or_err(eval_mode, type_name, str); +} + +let mut result: T = T::zero(); +let mut negative = false; +let radix = T::from(10); +let stop_value = min_value / radix; +let mut state = State::SkipLeadingWhiteSpace; +let mut parsed_sign = false; + +for (i, ch) in str.char_indices() { +// skip leading whitespace +if state == State::SkipLeadingWhiteSpace { +if ch.is_whitespace() { +// consume this char +continue; +} +// change state and fall through to next section +state = State::ParseSignAndDigits; +} + +if state == State::ParseSignAndDigits { +if !parsed_sign { +negative = ch == '-'; +let positive = ch == '+'; +parsed_sign = true; +if negative || positive { +if i + 1 == len { +// input string is just "+" or "-" +return none_or_err(eval_mode, type_name, str); +} +// consume this char +continue; +} +} + +if ch == '.' { +if eval_mode == EvalMode::Legacy { +// truncate decimal in legacy mode +state = State::ParseFractionalDigits; +continue; +} else { +return none_or_err(eval_mode, type_name, str); +} +} + +let digit = if ch.is_ascii_digit() { +(ch as u32) - ('0' as u32) +} else { +return none_or_err(eval_mode, type_name, str); +}; + +// We are going to process the new digit and accumulate the result. 
However, before Review Comment: A comment to explain why we're using subtraction instead of addition would make it easier to understand this part of the code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
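The review comment asks why the parser uses subtraction instead of addition. A minimal standalone sketch (illustrative only, not Comet's actual implementation, and without the whitespace/sign/decimal handling of the real code) of why accumulating the magnitude as a *negative* number matters for `i32::MIN`:

```rust
// Hypothetical sketch: parse a decimal string into i32 by accumulating the
// NEGATIVE magnitude. |i32::MIN| = 2147483648 is not representable as a
// positive i32, so "-2147483648" would overflow if we accumulated positively
// and negated at the end; accumulating negatively makes i32::MIN reachable.
fn parse_i32(s: &str) -> Option<i32> {
    let mut chars = s.chars().peekable();
    let negative = match chars.peek().copied() {
        Some('-') => {
            chars.next();
            true
        }
        Some('+') => {
            chars.next();
            false
        }
        _ => false,
    };
    let mut result: i32 = 0;
    let mut saw_digit = false;
    for ch in chars {
        let digit = ch.to_digit(10)? as i32;
        saw_digit = true;
        // accumulate toward i32::MIN; checked ops surface overflow as None
        result = result.checked_mul(10)?.checked_sub(digit)?;
    }
    if !saw_digit {
        return None;
    }
    if negative {
        Some(result)
    } else {
        // negating the accumulated negative value overflows only for
        // i32::MIN, which is exactly the out-of-range positive case
        result.checked_neg()
    }
}

fn main() {
    assert_eq!(parse_i32("-2147483648"), Some(i32::MIN));
    assert_eq!(parse_i32("2147483647"), Some(i32::MAX));
    assert_eq!(parse_i32("2147483648"), None); // positive overflow
    println!("ok");
}
```

This mirrors the accumulation direction of `UTF8String.toInt`/`toLong`, which is presumably why the reviewer asked for a clarifying comment at that spot.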
[I] Unable to Run tests of CometShuffleEncryptionSuite [datafusion-comet]
ganeshkumar269 opened a new issue, #367: URL: https://github.com/apache/datafusion-comet/issues/367 ### Describe the bug Whenever I execute the test class CometShuffleEncryptionSuite, the tests fail/crash with no meaningful error message, though I get this warning message: WARNING: /Library/Java/JavaVirtualMachines/openjdk-11.jdk/Contents/Home/bin/java is loading libcrypto in an unsafe way In Intellij IDEA, I enabled "Log - Stack trace" for all the exceptions, I received the below error message, ``` Exception 'sun.nio.fs.UnixException' occurred in thread 'Executor task launch worker for task 0.0 in stage 1.0 (TID 1)' at sun.nio.fs.UnixNativeDispatcher.lstat(UnixNativeDispatcher.java:335) at sun.nio.fs.UnixNativeDispatcher.lstat(UnixNativeDispatcher.java:335) at sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:72) at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:232) at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:110) at java.nio.file.Files.deleteIfExists(Files.java:1181) at java.nio.file.Files.copy(Files.java:3055) at org.apache.commons.crypto.NativeCodeLoader.extractLibraryFile(NativeCodeLoader.java:149) at org.apache.commons.crypto.NativeCodeLoader.findNativeLibrary(NativeCodeLoader.java:237) at org.apache.commons.crypto.NativeCodeLoader.loadLibrary(NativeCodeLoader.java:279) at org.apache.commons.crypto.NativeCodeLoader.<clinit>(NativeCodeLoader.java:52) at org.apache.commons.crypto.Crypto.isNativeCodeLoaded(Crypto.java:140) at org.apache.commons.crypto.random.OpenSslCryptoRandom.<clinit>(OpenSslCryptoRandom.java:54) at java.lang.Class.forName0(Class.java:-1) at java.lang.Class.forName(Class.java:398) at org.apache.commons.crypto.utils.ReflectionUtils.getClassByNameOrNull(ReflectionUtils.java:134) at org.apache.commons.crypto.utils.ReflectionUtils.getClassByName(ReflectionUtils.java:101) at org.apache.commons.crypto.random.CryptoRandomFactory.getCryptoRandom(CryptoRandomFactory.java:197) at 
org.apache.spark.security.CryptoStreamUtils$.createInitializationVector(CryptoStreamUtils.scala:138) at org.apache.spark.security.CryptoStreamUtils$.createCryptoOutputStream(CryptoStreamUtils.scala:56) at org.apache.spark.serializer.SerializerManager.$anonfun$wrapForEncryption$3(SerializerManager.scala:151) at org.apache.spark.serializer.SerializerManager$$Lambda$2902.217692044.apply(Unknown Source:-1) at scala.Option.map(Option.scala:230) at org.apache.spark.serializer.SerializerManager.wrapForEncryption(SerializerManager.scala:151) at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:134) at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:163) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:306) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:171) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.executor.Executor$TaskRunner$$Lambda$1605.2076642299.apply(Unknown Source:-1) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.lang.Thread.run(Thread.java:829) Exception 'java.lang.ClassNotFoundException' occurred in thread 'driver-heartbeater' at sun.reflect.misc.MethodUtil.findClass(MethodUtil.java:328) at 
sun.reflect.misc.MethodUtil.findClass(MethodUtil.java:328) at sun.reflect.misc.MethodUtil.loadClass(MethodUtil.java:309) at java.lang.ClassLoader.loadClass(ClassLoader.java:527) at jdk.internal.misc.Unsafe.defineClass0(Unsafe.java:-1) at jdk.internal.misc.Unsafe.defineClass(Unsafe.java:1192) at jdk.internal.reflect.ClassDefiner.defineClass(ClassDefiner.java:63) at jdk.internal.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:400) at jdk.internal.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:394) at
Re: [PR] chore: add a utils method to getColumnReader with SQLConf [datafusion-comet]
viirya commented on PR #360: URL: https://github.com/apache/datafusion-comet/pull/360#issuecomment-203549 Hmm, I think @parthchandra meant that by adding SQLConf, `common` will depend on Spark. Does it conflict with the direction to make `common` not Spark dependent? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] explain plan disables some cases in ScanExecRule [datafusion-comet]
viirya commented on issue #322: URL: https://github.com/apache/datafusion-comet/issues/322#issuecomment-200855 Closed it now. Thanks @jc4x4 for reminding. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] explain plan disables some cases in ScanExecRule [datafusion-comet]
viirya closed issue #322: explain plan disables some cases in ScanExecRule URL: https://github.com/apache/datafusion-comet/issues/322 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Jdk 11 in readme [datafusion-comet]
viirya commented on code in PR #366: URL: https://github.com/apache/datafusion-comet/pull/366#discussion_r1586603568 ## README.md: ## @@ -63,7 +63,7 @@ Linux, Apple OSX (Intel and M1) ## Requirements - Apache Spark 3.2, 3.3, or 3.4 -- JDK 8 and up +- JDK 11 and up (JDK 8 should be supported, but development is on JDK 11) Review Comment: I think the safe option is to use the JDKs supported by the Spark version you are working on. It is good to make this clear here to reduce confusion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Jdk 11 in readme [datafusion-comet]
viirya commented on PR #366: URL: https://github.com/apache/datafusion-comet/pull/366#issuecomment-2088874522 Thanks @edmondop -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Implement Spark-compatible CAST between integer types [datafusion-comet]
ganeshkumar269 commented on PR #340: URL: https://github.com/apache/datafusion-comet/pull/340#issuecomment-2088848565 Hi @andygrove, I have added a check before we fetch sparkInvalidValue, defaulting it to EMPTY_STRING if ':' is not present. I also added comments explaining why we check for the presence of the 'overflow' string. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
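For context on the overflow handling being discussed in this PR, here is a minimal sketch (hypothetical names, not the PR's actual code) of how an ANSI-mode integer narrowing cast can differ from legacy truncation, which is why tests may match on an "overflow" substring in the error message:

```rust
// Hypothetical sketch of Spark-style CAST from i32 to i16: ANSI mode raises
// an error on overflow, while legacy mode wraps around like Java's (short)
// narrowing conversion.
#[derive(Debug, PartialEq)]
enum EvalMode {
    Legacy,
    Ansi,
}

fn cast_i32_to_i16(v: i32, eval_mode: EvalMode) -> Result<i16, String> {
    match i16::try_from(v) {
        Ok(x) => Ok(x),
        Err(_) if eval_mode == EvalMode::Ansi => {
            // the error text contains "overflow", so callers/tests can check for it
            Err(format!("{v} caused overflow when cast to SMALLINT"))
        }
        // legacy: two's-complement truncation, matching `(short) v` in Java
        Err(_) => Ok(v as i16),
    }
}

fn main() {
    assert_eq!(cast_i32_to_i16(123, EvalMode::Ansi), Ok(123i16));
    assert_eq!(cast_i32_to_i16(40000, EvalMode::Legacy), Ok(-25536));
    assert!(cast_i32_to_i16(40000, EvalMode::Ansi).is_err());
    println!("ok");
}
```

The exact wording of Spark's ANSI error differs between Spark versions, which is one reason matching on a substring such as "overflow" (as discussed above) can be more robust than comparing full messages.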