Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586956681 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -293,6 +314,32 @@ fn maybe_data_types( Some(new_type) } +/// Check if the current argument types can be coerced to match the given `valid_types`. +/// Unlike `maybe_data_types`, this function does not coerce the types. Review Comment: I think the role of `can_cast_types` is more like a validator that ensures the given `valid_types` are castable by arrow-cast. Maybe we don't even need a `maybe_data_types_without_coercion` or `maybe_data_types`-like function in the end 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
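A toy illustration of the validator role described here (hypothetical stand-in types, not arrow-rs's real `can_cast_types`): the check only answers whether a cast is possible, and performs no coercion itself.

```rust
// Minimal sketch with hypothetical types: a `can_cast_types`-style validator
// that answers "is this cast possible?" without coercing anything.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DataType {
    Null,
    Int32,
    Int64,
    Utf8,
}

/// Returns true if `from` can be cast to `to`; performs no coercion.
fn can_cast_types(from: DataType, to: DataType) -> bool {
    use DataType::*;
    match (from, to) {
        (a, b) if a == b => true,
        (Null, _) => true,                      // Null casts to anything
        (Int32, Int64) => true,                 // widening numeric cast
        (Int32, Utf8) | (Int64, Utf8) => true,  // numbers format as strings
        _ => false,
    }
}

/// Validate that every current argument type is castable to the chosen valid type.
fn validate(current: &[DataType], target: DataType) -> bool {
    current.iter().all(|&t| can_cast_types(t, target))
}

fn main() {
    assert!(validate(&[DataType::Int32, DataType::Null], DataType::Int64));
    assert!(!validate(&[DataType::Utf8], DataType::Int64));
}
```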
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1587073299 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -53,20 +54,31 @@ pub fn data_types( } } -let valid_types = get_valid_types(_signature, current_types)?; - +let mut valid_types = get_valid_types(_signature, current_types)?; if valid_types .iter() .any(|data_type| data_type == current_types) { return Ok(current_types.to_vec()); } -// Try and coerce the argument types to match the signature, returning the -// coerced types from the first matching signature. -for valid_types in valid_types { -if let Some(types) = maybe_data_types(_types, current_types) { -return Ok(types); +// Well-supported signature that returns exact valid types. +if !valid_types.is_empty() +&& matches!(signature.type_signature, TypeSignature::VariadicEqualOrNull) +{ +// exact valid types +assert_eq!(valid_types.len(), 1); +let valid_types = valid_types.swap_remove(0); Review Comment: I assume there is only one valid types vec, so we always get one. `swap_remove` is just a fancy way to take the first element out of `valid_types: Vec<Vec<DataType>>`.
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1587073726 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -53,20 +54,31 @@ pub fn data_types( } } -let valid_types = get_valid_types(_signature, current_types)?; - +let mut valid_types = get_valid_types(_signature, current_types)?; if valid_types .iter() .any(|data_type| data_type == current_types) { return Ok(current_types.to_vec()); } -// Try and coerce the argument types to match the signature, returning the -// coerced types from the first matching signature. -for valid_types in valid_types { -if let Some(types) = maybe_data_types(_types, current_types) { -return Ok(types); +// Well-supported signature that returns exact valid types. +if !valid_types.is_empty() +&& matches!(signature.type_signature, TypeSignature::VariadicEqualOrNull) +{ +// exact valid types +assert_eq!(valid_types.len(), 1); +let valid_types = valid_types.swap_remove(0); Review Comment: equivalent to `valid_types[0].clone()` but without the clone
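A small sketch of the `swap_remove(0)` trick being described: it takes ownership of the first element in O(1) by swapping the last element into its slot, so no clone is needed.

```rust
// Demonstrates Vec::swap_remove(0): move the first element out without cloning.
fn main() {
    let mut valid_types: Vec<Vec<i32>> = vec![vec![1, 2], vec![3], vec![4, 5]];

    // Take ownership of the first candidate list in O(1), no clone.
    let first = valid_types.swap_remove(0);
    assert_eq!(first, vec![1, 2]);

    // The last element was swapped into index 0; order is not preserved.
    assert_eq!(valid_types, vec![vec![4, 5], vec![3]]);
}
```

This is fine here because only one candidate list exists (`assert_eq!(valid_types.len(), 1)` in the diff), so the order disturbance `swap_remove` causes is irrelevant.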
[PR] feat: Switch to use Rust stable by default [datafusion-comet]
sunchao opened a new pull request, #373: URL: https://github.com/apache/datafusion-comet/pull/373 ## Which issue does this PR close? Closes #. ## Rationale for this change There is little reason for Comet to use nightly at the moment. The only feature we needed from nightly is `specialization`, which was used in a few places that no longer require it. This PR removes a few stale structs, `vector` and `mutable_vector`, which were added during an experiment in implementing Comet's own vector to be used by execution. However, as we've fully moved to using DataFusion, these are no longer needed. With these structs gone, we can now use `stable` as the default Rust toolchain for the project. ## What changes are included in this PR? ## How are these changes tested?
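Switching the default toolchain typically means updating the repository's `rust-toolchain.toml`; a minimal sketch of what that could look like (illustrative only, and the `components` list is an assumption, not taken from this PR):

```toml
# Hypothetical rust-toolchain.toml pinning the project to stable
[toolchain]
channel = "stable"
components = ["clippy", "rustfmt"]
```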
Re: [I] Making Comet Common Module Engine Independent [datafusion-comet]
sunchao commented on issue #329: URL: https://github.com/apache/datafusion-comet/issues/329#issuecomment-2089524991 The original purpose of the `comet-common` module is to make it engine-agnostic so it can be used for other use cases like Iceberg. Unfortunately, we didn't have time to make it completely isolated, so it is still tightly coupled with Spark in several ways, like the Parquet -> catalyst schema conversion, ColumnVector, and later on a bunch of shuffle-related stuff, all of which is closely tied to Spark. If necessary, we can perhaps consider splitting the module further into `comet-parquet`, `comet-spark-shuffle`, etc. For the Parquet part, we may need to define something like `CometDataType` which gets converted from the Parquet schema, and from which we can derive the Spark catalyst data type or the Iceberg data type.
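The `CometDataType` idea could look roughly like the sketch below. Everything here is hypothetical: the name comes from the comment, the conversions are illustrative string mappings, and the real Comet common module lives on the JVM side.

```rust
// Hypothetical sketch of an engine-agnostic type derived from the Parquet
// schema, from which engine-specific types can be derived.
#[derive(Debug, PartialEq)]
enum CometDataType {
    Boolean,
    Int32,
    Int64,
    Utf8,
}

// Derive a Spark catalyst type name from the engine-agnostic type.
fn to_spark_type(t: &CometDataType) -> &'static str {
    match t {
        CometDataType::Boolean => "BooleanType",
        CometDataType::Int32 => "IntegerType",
        CometDataType::Int64 => "LongType",
        CometDataType::Utf8 => "StringType",
    }
}

// Derive an Iceberg primitive type name from the same source.
fn to_iceberg_type(t: &CometDataType) -> &'static str {
    match t {
        CometDataType::Boolean => "boolean",
        CometDataType::Int32 => "int",
        CometDataType::Int64 => "long",
        CometDataType::Utf8 => "string",
    }
}

fn main() {
    let t = CometDataType::Int64;
    assert_eq!(to_spark_type(&t), "LongType");
    assert_eq!(to_iceberg_type(&t), "long");
}
```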
Re: [PR] feat(7181): cascading loser tree merges [datafusion]
wiedld commented on PR #7379: URL: https://github.com/apache/datafusion/pull/7379#issuecomment-2089478253 Working on other things. If/when we circle back, we'll be recreating differently.
Re: [PR] feat(7181): cascading loser tree merges [datafusion]
wiedld closed pull request #7379: feat(7181): cascading loser tree merges URL: https://github.com/apache/datafusion/pull/7379
Re: [PR] Remove ScalarFunctionDefinition [datafusion]
jayzhan211 commented on code in PR #10325: URL: https://github.com/apache/datafusion/pull/10325#discussion_r1587001540 ## datafusion/expr/src/expr_schema.rs: ## @@ -138,25 +138,21 @@ impl ExprSchemable for Expr { .iter() .map(|e| e.get_type(schema)) .collect::<Result<Vec<_>>>()?; -match func_def { -ScalarFunctionDefinition::UDF(fun) => { // verify that function is invoked with correct number and type of arguments as defined in `TypeSignature` -data_types(_data_types, fun.signature()).map_err(|_| { +data_types(_data_types, func_def.signature()).map_err(|_| { Review Comment: I think it is fine to name it `func` or `fun`
Re: [PR] Minor: add a few more dictionary unwrap tests [datafusion]
jayzhan211 merged PR #10335: URL: https://github.com/apache/datafusion/pull/10335
Re: [PR] Minor: reuse the group key row buffer to avoid reallocation [datafusion]
github-actions[bot] commented on PR #7426: URL: https://github.com/apache/datafusion/pull/7426#issuecomment-2089392509 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Re: [PR] feat(7181): cascading loser tree merges [datafusion]
github-actions[bot] commented on PR #7379: URL: https://github.com/apache/datafusion/pull/7379#issuecomment-2089392557 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Re: [PR] Use make_array to handle SQLExpr::Array. [datafusion]
github-actions[bot] commented on PR #7427: URL: https://github.com/apache/datafusion/pull/7427#issuecomment-2089392469 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Re: [I] Prepare Spark 4.0 shims [datafusion-comet]
kazuyukitanimura commented on issue #372: URL: https://github.com/apache/datafusion-comet/issues/372#issuecomment-2089389739 I am on it
Re: [PR] Add `SessionContext`/`SessionState::create_physical_expr()` to create `PhysicalExpressions` from `Expr`s [datafusion]
phillipleblanc commented on code in PR #10330: URL: https://github.com/apache/datafusion/pull/10330#discussion_r1586971614 ## datafusion/common/src/dfschema.rs: ## @@ -806,6 +820,12 @@ impl From<DFSchema> for Schema { } } +impl AsRef<Schema> for DFSchema { Review Comment: Very nice!
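A minimal sketch of why an `AsRef` impl like this is handy (toy stand-in types, not the real arrow/DataFusion definitions): one generic function can then accept either schema type through a single `impl AsRef<Schema>` parameter.

```rust
// Toy stand-ins illustrating the AsRef pattern from the review.
struct Schema {
    fields: Vec<String>,
}

struct DFSchema {
    inner: Schema,
}

// A DFSchema can be viewed as a plain Schema...
impl AsRef<Schema> for DFSchema {
    fn as_ref(&self) -> &Schema {
        &self.inner
    }
}

// ...and a Schema trivially views as itself.
impl AsRef<Schema> for Schema {
    fn as_ref(&self) -> &Schema {
        self
    }
}

// One signature now serves both schema types (and references to them,
// via the std blanket impl of AsRef for &T).
fn field_count(schema: impl AsRef<Schema>) -> usize {
    schema.as_ref().fields.len()
}

fn main() {
    let s = Schema { fields: vec!["a".to_string(), "b".to_string()] };
    let df = DFSchema { inner: Schema { fields: vec!["c".to_string()] } };
    assert_eq!(field_count(&s), 2);
    assert_eq!(field_count(&df), 1);
}
```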
[PR] Introducing repartitioning outside [datafusion]
edmondop opened a new pull request, #10338: URL: https://github.com/apache/datafusion/pull/10338 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes?
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586960617 ## datafusion/expr/src/signature.rs: ## @@ -92,14 +92,22 @@ pub enum TypeSignature { /// A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])` Variadic(Vec<DataType>), /// One or more arguments of an arbitrary but equal type. -/// DataFusion attempts to coerce all argument types to match the first argument's type +/// DataFusion attempts to coerce all argument types to match the common type with comparison coercion. /// /// # Examples /// Given types in signature should be coercible to the same final type. /// A function such as `make_array` is `VariadicEqual`. /// /// `make_array(i32, i64) -> make_array(i64, i64)` VariadicEqual, +/// One or more arguments of an arbitrary but equal type or Null. +/// Non-comparison coercion is attempted to match the signatures. +/// +/// Functions like `coalesce` is `VariadicEqual`. +// TODO: Temporary Signature, to differentiate existing VariadicEqual. +// After we switch `make_array` to VariadicEqualOrNull, +// we can reuse VariadicEqual. +VariadicEqualOrNull, Review Comment: Nice idea!
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586955192 ## datafusion/expr/src/signature.rs: ## @@ -92,14 +92,22 @@ pub enum TypeSignature { /// A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])` Variadic(Vec<DataType>), /// One or more arguments of an arbitrary but equal type. -/// DataFusion attempts to coerce all argument types to match the first argument's type +/// DataFusion attempts to coerce all argument types to match the common type with comparison coercion. /// /// # Examples /// Given types in signature should be coercible to the same final type. /// A function such as `make_array` is `VariadicEqual`. /// /// `make_array(i32, i64) -> make_array(i64, i64)` VariadicEqual, +/// One or more arguments of an arbitrary but equal type or Null. +/// Non-comparison coercion is attempted to match the signatures. +/// +/// Functions like `coalesce` is `VariadicEqual`. Review Comment: Yes. `Non-comparison coercion` is `type_union_resolution`. Actually, I think `type_union_resolution` is what we really need for coercion. The current `comparison_coercion` may not exist in the long term, because I think it contains many `workaround` solutions, and those could be pulled out to their **correct places**. For example, the List coercion in `string_coercion`, or the Dict coercion that compares values only, could be handled in `unwrap cast in comparison`. Ideally, we could shape `comparison_coercion` to be more like `type_union_resolution`. But adjusting the logic inside it is error-prone if tests don't cover it. At least I can tell that the dict-coercion differs between these two now.
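The `VariadicEqualOrNull` behavior under discussion (nulls coerce to whatever common type the non-null arguments resolve to) can be sketched with toy types. This is a hypothetical illustration of the idea, not DataFusion's actual `type_union_resolution` logic:

```rust
// Toy sketch: resolve one common type for all arguments, letting Null
// coerce to whatever the other arguments agree on.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DataType {
    Null,
    Int32,
    Int64,
}

// Pairwise union of two types: Null yields the other side, and the two
// integer widths widen to Int64.
fn union_pair(a: DataType, b: DataType) -> Option<DataType> {
    use DataType::*;
    match (a, b) {
        (x, y) if x == y => Some(x),
        (Null, t) | (t, Null) => Some(t),
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        _ => None, // incompatible types
    }
}

// Fold the pairwise rule across all arguments, as a coalesce-style
// signature would.
fn resolve_common_type(args: &[DataType]) -> Option<DataType> {
    let mut iter = args.iter().copied();
    let first = iter.next()?;
    iter.try_fold(first, union_pair)
}

fn main() {
    use DataType::*;
    assert_eq!(resolve_common_type(&[Null, Int32, Int64]), Some(Int64));
    assert_eq!(resolve_common_type(&[Null, Null]), Some(Null));
}
```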
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586956681 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -293,6 +314,32 @@ fn maybe_data_types( Some(new_type) } +/// Check if the current argument types can be coerced to match the given `valid_types` +/// unlike `maybe_data_types`, this function does not coerce the types. Review Comment: I think the role of `can_cast_types` is more like validator to ensure the valid_types given are castable for arrow-cast. Maybe we don't even need this 樂 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586955192 ## datafusion/expr/src/signature.rs: ## @@ -92,14 +92,22 @@ pub enum TypeSignature { /// A function such as `concat` is `Variadic(vec![DataType::Utf8, DataType::LargeUtf8])` Variadic(Vec), /// One or more arguments of an arbitrary but equal type. -/// DataFusion attempts to coerce all argument types to match the first argument's type +/// DataFusion attempts to coerce all argument types to match to the common type with comparision coercion. /// /// # Examples /// Given types in signature should be coercible to the same final type. /// A function such as `make_array` is `VariadicEqual`. /// /// `make_array(i32, i64) -> make_array(i64, i64)` VariadicEqual, +/// One or more arguments of an arbitrary but equal type or Null. +/// Non-comparison coercion is attempted to match the signatures. +/// +/// Functions like `coalesce` is `VariadicEqual`. Review Comment: yes. `Non-comparison coercion` is `type_union_resolution` Actually, I think `type_union_resolution` is what we really need for coercion. Current `comparison_coercion` may not exist in the long term because I think there are many `workaround` solutions included in it, and those are possible to pull out to their **correct places**. Ideally, we should shape `comparison_coercion` to more like `type_union_resolution`. But, adjusting logic inside it is vulnerable if the test is not covered. At least I can tell the dict-coercion is different between these two now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
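The distinction drawn above between `comparison_coercion` and `type_union_resolution` can be illustrated with a toy model. This is a simplified sketch with an invented mini type lattice, not DataFusion's actual implementation: the idea is to fold a pairwise "common type" resolution over all arguments of a variadic function such as `coalesce`, with `Null` unifying with anything and no operator-specific workarounds.

```rust
// Toy model of union-style type resolution for a variadic function.
// The type set and widening rules below are assumptions for illustration.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DataType {
    Null,
    Int32,
    Int64,
    Float64,
    Utf8,
}

// Resolve a pair of types to a common type, if one exists.
// NULL unifies with anything; integers widen; ints widen to floats.
fn union_pair(a: DataType, b: DataType) -> Option<DataType> {
    use DataType::*;
    match (a, b) {
        (Null, t) | (t, Null) => Some(t),
        (x, y) if x == y => Some(x),
        (Int32, Int64) | (Int64, Int32) => Some(Int64),
        (Int32, Float64) | (Float64, Int32) => Some(Float64),
        (Int64, Float64) | (Float64, Int64) => Some(Float64),
        _ => None, // e.g. Utf8 vs Int32: no implicit common type
    }
}

// Fold the pairwise resolution over all argument types.
fn type_union_resolution(args: &[DataType]) -> Option<DataType> {
    let mut acc = *args.first()?;
    for &t in &args[1..] {
        acc = union_pair(acc, t)?;
    }
    Some(acc)
}
```

The key property is that the result is a single common type for the whole argument list (or a failure), which is the shape of coercion `coalesce` needs.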
Re: [PR] Minor: Add additional coalesce tests [datafusion]
jayzhan211 merged PR #10334: URL: https://github.com/apache/datafusion/pull/10334
Re: [I] Move `create_physical_expr` to `physical-expr-common` [datafusion]
jayzhan211 commented on issue #10074: URL: https://github.com/apache/datafusion/issues/10074#issuecomment-2089325105 @alamb First of all, we plan to pull the aggregate function out from `datafusion-physical-expr`. And, we also split `datafusion-physical-common` from `datafusion-physical`, **so the user can import only the minimum things they need**, but this is not enough. We want to handle `physical-expr` in UDFImpl, which is not possible given the current crate graph, because `UDFImpl` should be in `datafusion-expr` and we can't import `datafusion-physical-expr` into `datafusion-expr`. But, what if we allow `datafusion-expr` to import `physical-expr-common`? 🤔 We now differentiate **common** as the higher-level crate. `datafusion-common` is the highest. Introduce `expr-common` as the second, `datafusion-physical-common` as the third (importing common and expr-common), and `aggregate-common` as the fourth, which imports all the other **common** crates. Allow `datafusion-expr` to import `datafusion-physical-common`, so we can handle the `physical-expr` concept in UDFImpl. As long as we don't introduce things that depend on `datafusion-expr` into `physical-expr-common`, I think we don't need to strictly avoid porting physical-expr concepts into `datafusion-expr`. > Can you remind me why datafusion-functions-aggregate (the implementation of aggregate functions) shouldn't depend on datafusion-physical-expr and datafusion-expr `datafusion-functions-aggregate` can import `datafusion-expr` and `datafusion-physical-expr-common` but not `datafusion-physical-expr`, to keep the minimum dependencies for the user. > I thought the thing we are trying to do is avoid datafusion-expr depending on datafusion-physical-expr This idea is what's blocking us! Now, I would allow the `datafusion-physical-expr-common` dependency for `datafusion-expr`
Re: [PR] Fix: Sort Merge Join crashes on TPCH Q21 [datafusion]
comphead commented on PR #10304: URL: https://github.com/apache/datafusion/pull/10304#issuecomment-2089299005 Q21 contains both LeftSemi and LeftAnti joins. Hopefully LeftSemi works now; solving LeftAnti is next.
Re: [PR] Fix `coalesce`, `struct` and `named_strct` expr_fn function to take multiple arguments [datafusion]
jayzhan211 commented on code in PR #10321: URL: https://github.com/apache/datafusion/pull/10321#discussion_r1586925126 ## datafusion/functions/src/core/mod.rs: ## @@ -39,14 +42,68 @@ make_udf_function!(getfield::GetFieldFunc, GET_FIELD, get_field); make_udf_function!(coalesce::CoalesceFunc, COALESCE, coalesce); // Export the functions out of this package, both as expr_fn as well as a list of functions -export_functions!( -(nullif, arg_1 arg_2, "returns NULL if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the COALESCE expression."), -(arrow_cast, arg_1 arg_2, "returns arg_1 cast to the `arrow_type` given the second argument. This can be used to cast to a specific `arrow_type`."), -(nvl, arg_1 arg_2, "returns value2 if value1 is NULL; otherwise it returns value1"), -(nvl2, arg_1 arg_2 arg_3, "Returns value2 if value1 is not NULL; otherwise, it returns value3."), -(arrow_typeof, arg_1, "Returns the Arrow type of the input expression."), -(r#struct, args, "Returns a struct with the given arguments"), -(named_struct, args, "Returns a struct with the given names and arguments pairs"), -(get_field, arg_1 arg_2, "Returns the value of the field with the given name from the struct"), -(coalesce, args, "Returns `coalesce(args...)`, which evaluates to the value of the first expr which is not NULL") -); +pub mod expr_fn { Review Comment: https://github.com/apache/datafusion/blob/e3487eee1365a37b8651bf5a393ae8d2cc16e653/datafusion/functions-array/src/macros.rs#L47-L109 In array macro, we support two syntaxes, which is surprisingly easy -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Re: [PR] Fix `coalesce`, `struct` and `named_strct` expr_fn function to take multiple arguments [datafusion]
Omega359 commented on code in PR #10321: URL: https://github.com/apache/datafusion/pull/10321#discussion_r1586921438 ## datafusion/functions/src/core/mod.rs: ## @@ -39,14 +42,68 @@ make_udf_function!(getfield::GetFieldFunc, GET_FIELD, get_field); make_udf_function!(coalesce::CoalesceFunc, COALESCE, coalesce); // Export the functions out of this package, both as expr_fn as well as a list of functions -export_functions!( -(nullif, arg_1 arg_2, "returns NULL if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the COALESCE expression."), -(arrow_cast, arg_1 arg_2, "returns arg_1 cast to the `arrow_type` given the second argument. This can be used to cast to a specific `arrow_type`."), -(nvl, arg_1 arg_2, "returns value2 if value1 is NULL; otherwise it returns value1"), -(nvl2, arg_1 arg_2 arg_3, "Returns value2 if value1 is not NULL; otherwise, it returns value3."), -(arrow_typeof, arg_1, "Returns the Arrow type of the input expression."), -(r#struct, args, "Returns a struct with the given arguments"), -(named_struct, args, "Returns a struct with the given names and arguments pairs"), -(get_field, arg_1 arg_2, "Returns the value of the field with the given name from the struct"), -(coalesce, args, "Returns `coalesce(args...)`, which evaluates to the value of the first expr which is not NULL") -); +pub mod expr_fn { Review Comment: I spent a whole day trying to make it work as well and got lost in the depths of macro programming. I'm not sure it's worth the effort of trying to get it to work tbh - the alternative is not exactly hard work. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[I] Avoid inlining non deterministic CTE [datafusion]
tgujar opened a new issue, #10337: URL: https://github.com/apache/datafusion/issues/10337 ### Describe the bug Currently DataFusion will inline all CTEs, so a non-deterministic expression can be executed multiple times, producing different results ### To Reproduce Consider the following query which uses the `aggregate_test_100` data from datafusion-examples. Here, column c11 is a Float64 ``` WITH cte as ( SELECT sum(c4 * c11) as total FROM aggregate_test_100 GROUP BY c1) SELECT total FROM cte WHERE total = (select max(total) from cte) ``` The optimized plan generated will inline the CTE and thus execute it twice ``` Projection: cte.total Inner Join: cte.total = __scalar_sq_1.MAX(cte.total) SubqueryAlias: cte Projection: SUM(aggregate_test_100.c4 * aggregate_test_100.c11) AS total Aggregate: groupBy=[[aggregate_test_100.c1]], aggr=[[SUM(CAST(aggregate_test_100.c4 AS Float64) * aggregate_test_100.c11)]] TableScan: aggregate_test_100 projection=[c1, c4, c11] SubqueryAlias: __scalar_sq_1 Aggregate: groupBy=[[]], aggr=[[MAX(cte.total)]] SubqueryAlias: cte Projection: SUM(aggregate_test_100.c4 * aggregate_test_100.c11) AS total Aggregate: groupBy=[[aggregate_test_100.c1]], aggr=[[SUM(CAST(aggregate_test_100.c4 AS Float64) * aggregate_test_100.c11)]] TableScan: aggregate_test_100 projection=[c1, c4, c11] ``` ### Expected behavior Since floating-point summation here is dependent on ordering, I believe it is incorrect to inline the CTE here and execute it more than once. ### Additional context Related issue, which discusses possible advantages of not inlining CTEs in some cases: https://github.com/apache/datafusion/issues/8777
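The ordering concern behind this issue is that IEEE 754 `Float64` addition is not associative, so evaluating the inlined CTE twice with different row orders can legitimately produce two different `total` values. A minimal, DataFusion-independent illustration:

```rust
// Left-to-right summation of f64 values. The result depends on the order
// of the inputs because floating-point addition is not associative:
// summing the same multiset of values in a different order can round
// differently at intermediate steps.
fn sum_ltr(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, &v| acc + v)
}
```

For example, summing `[0.1, 0.2, 0.3]` left-to-right gives a different `f64` than summing `[0.3, 0.2, 0.1]`, even though the inputs are the same values.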
Re: [PR] feat: Implement Spark-compatible CAST between integer types [datafusion-comet]
andygrove commented on code in PR #340: URL: https://github.com/apache/datafusion-comet/pull/340#discussion_r1586886509 ## spark/src/test/scala/org/apache/comet/CometCastSuite.scala: ## @@ -807,11 +833,22 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { } else { // Spark 3.2 and 3.3 have a different error message format so we can't do a direct // comparison between Spark and Comet. + // In the case of CAST_INVALID_INPUT // Spark message is in format `invalid input syntax for type TYPE: VALUE` // Comet message is in format `The value 'VALUE' of the type FROM_TYPE cannot be cast to TO_TYPE` // We just check that the comet message contains the same invalid value as the Spark message - val sparkInvalidValue = sparkMessage.substring(sparkMessage.indexOf(':') + 2) - assert(cometMessage.contains(sparkInvalidValue)) + // In the case of CAST_OVERFLOW + // Spark message is in format `Casting VALUE to TO_TYPE causes overflow` + // Comet message is in format `The value 'VALUE' of the type FROM_TYPE cannot be cast to TO_TYPE + // due to an overflow` + // We check if the comet message contains 'overflow'. + val sparkInvalidValue = if (sparkMessage.indexOf(':') == -1) { +EMPTY_STRING + } else { +sparkMessage.substring(sparkMessage.indexOf(':') + 2) + } + assert( +cometMessage.contains(sparkInvalidValue) || cometMessage.contains("overflow")) Review Comment: If `sparkInvalidValue` is `EMPTY_STRING`, won't `cometMessage.contains(sparkInvalidValue)` always be true? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
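andygrove's observation generalizes beyond Java/Scala `String.contains`: an empty needle matches every haystack, so the assertion is vacuously true whenever `sparkInvalidValue` is empty. A quick demonstration (written in Rust for illustration; `java.lang.String.contains` behaves the same way):

```rust
// `contains("")` is true for every string, including the empty string,
// so a check like `cometMessage.contains(sparkInvalidValue)` passes
// unconditionally when sparkInvalidValue is the empty string.
fn vacuous_match(haystack: &str, needle: &str) -> bool {
    haystack.contains(needle)
}
```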
Re: [PR] Stop copying LogicalPlan and Exprs in EliminateNestedUnion [datafusion]
emgeee commented on PR #10319: URL: https://github.com/apache/datafusion/pull/10319#issuecomment-2089232074 I went ahead and fixed the formatting so all tests pass, and managed to remove one clone() call from within the `coerce_plan_expr_for_schema()` call stack. If we want to optimize `coerce_plan_expr_for_schema` further, I'd definitely agree a separate PR makes sense, as it is called from a number of other locations as well
Re: [PR] feat: Implement Spark-compatible CAST from string to timestamp types [datafusion-comet]
andygrove commented on PR #335: URL: https://github.com/apache/datafusion-comet/pull/335#issuecomment-2089227013 This is looking great @vaibhawvipul. I think this is close to being ready to merge, and then we can have some follow-on issues for the remaining items. The one thing I would like to see before merging is that we can run the fuzz test without causing any panics, so just replacing unwraps with error handling.
Re: [PR] Add `SessionContext`/`SessionState::create_physical_expr()` to create `PhysicalExpressions` from `Expr`s [datafusion]
westonpace commented on code in PR #10330: URL: https://github.com/apache/datafusion/pull/10330#discussion_r1586867994 ## datafusion/core/src/execution/context/mod.rs: ## @@ -510,6 +515,34 @@ impl SessionContext { } } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. Review Comment: ```suggestion /// coercion and function rewrites. ``` ## datafusion/core/src/execution/context/mod.rs: ## @@ -510,6 +515,34 @@ impl SessionContext { } } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. +/// +/// # Example +/// ``` +/// # use std::sync::Arc; +/// # use arrow::datatypes::{DataType, Field, Schema}; +/// # use datafusion::prelude::*; +/// # use datafusion_common::DFSchema; +/// // a = 1 (i64) +/// let expr = col("a").eq(lit(1i64)); +/// // provide type information that `a` is an Int32 +/// let schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]); +/// let df_schema = DFSchema::try_from(schema).unwrap(); +/// // Create a PhysicalExpr. Note DataFusion automatically coerces (casts) `1i64` to `1i32` Review Comment: Nice, we have some (probably duplicated) type coercion logic on our side we might be able to replace. ## datafusion/core/src/execution/context/mod.rs: ## @@ -1947,6 +1982,36 @@ impl SessionState { .await } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. +/// +/// Note: The expression is not [simplified] Review Comment: Why not? ## datafusion/core/src/execution/context/mod.rs: ## @@ -1947,6 +1982,36 @@ impl SessionState { .await } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. 
+/// +/// Note: The expression is not [simplified] +/// +/// # See Also: +/// * [`SessionContext::create_physical_expr`] for a higher-level API Review Comment: This is somewhat unrelated to this PR, but it's news to me that "SessionContext" is higher level than "SessionState". The only comparison between the two that I can find is https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#sessioncontext-sessionstate-and-taskcontext which says: > A [SessionContext](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html) can be created from a [SessionConfig](https://docs.rs/datafusion/latest/datafusion/prelude/struct.SessionConfig.html) and stores the state for a particular query session. A single [SessionContext](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html) can run multiple queries. > [SessionState](https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionState.html) contains information available during query planning (creating [LogicalPlan](https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html)s and [ExecutionPlan](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html)s). From this it is not obvious that a "SessionContext" contains a "SessionState". It's also very confusing when I would use "SessionContext" and when I would use "SessionState" in my user application. Currently, I've been going off the philosophy of "if an API method needs a SessionState I'd better create one and if an API method needs a SessionContext I'd better create one." Some kind of long-form design documentation on these two types would be interesting (or if something already exists that I've missed or is in a blog post somewhere that would be cool too).
## datafusion/core/src/execution/context/mod.rs: ## @@ -2024,6 +2089,35 @@ impl SessionState { } } +struct SessionSimpifyProvider<'a> { +state: &'a SessionState, +df_schema: &'a DFSchema, +} + +impl<'a> SessionSimpifyProvider<'a> { Review Comment: ```suggestion impl<'a> SessionSimplifyProvider<'a> { ``` ## datafusion/common/src/dfschema.rs: ## @@ -806,6 +820,12 @@ impl From<> for Schema { } } +impl AsRef for DFSchema { Review Comment: :exploding_head: -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
suremarc commented on code in PR #9593: URL: https://github.com/apache/datafusion/pull/9593#discussion_r1586860589 ## datafusion/core/src/datasource/physical_plan/mod.rs: ## @@ -473,15 +468,281 @@ fn get_projected_output_ordering( // since rest of the orderings are violated break; } + // do not push empty entries // otherwise we may have `Some(vec![])` at the output ordering. -if !new_ordering.is_empty() { -all_orderings.push(new_ordering); +if new_ordering.is_empty() { +continue; +} + +// Check if any file groups are not sorted +if base_config.file_groups.iter().any(|group| { +if group.len() <= 1 { +// File groups with <= 1 files are always sorted +return false; +} + +let statistics = match MinMaxStatistics::new_from_files( +&new_ordering, +projected_schema, +base_config.projection.as_deref(), +group, +) { +Ok(statistics) => statistics, +Err(e) => { +log::trace!("Error fetching statistics for file group: {e}"); +// we can't prove that it's ordered, so we have to reject it +return true; +} +}; + +!statistics.is_sorted() +}) { +debug!( +"Skipping specified output ordering {:?}. \ +Some file groups couldn't be determined to be sorted: {:?}", +base_config.output_ordering[0], base_config.file_groups +); +continue; } + +all_orderings.push(new_ordering); } all_orderings } +/// A normalized representation of file min/max statistics that allows for efficient sorting & comparison. +/// The min/max values are ordered by [`Self::sort_order`]. +/// Furthermore, any columns that are reversed in the sort order have their min/max values swapped.
+pub(crate) struct MinMaxStatistics { +min_by_sort_order: arrow::row::Rows, +max_by_sort_order: arrow::row::Rows, +sort_order: Vec<PhysicalSortExpr>, +} + +impl MinMaxStatistics { +#[allow(unused)] +fn sort_order(&self) -> &[PhysicalSortExpr] { +&self.sort_order +} + +fn new_from_files<'a>( +projected_sort_order: &[PhysicalSortExpr], // Sort order with respect to projected schema +projected_schema: &Schema, // Projected schema +projection: Option<&[usize]>, // Indices of projection in full table schema (None = all columns) Review Comment: Oh... I didn't even think about option 1. But I was assuming that the layout of the file statistics should match the table schema and not the individual file's schema. It seems that that's what DataFusion does currently.
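The sortedness test being discussed can be reduced, for a single ascending sort column, to checking that per-file (min, max) ranges do not overlap when the files are taken in group order. This is a simplified sketch only; the real `MinMaxStatistics` normalizes multi-column orderings via `arrow::row` and swaps min/max for descending columns, all of which is omitted here:

```rust
// Per-file statistics for one ascending sort column (toy representation).
struct FileMinMax {
    min: i64,
    max: i64,
}

// A file group is provably sorted if, in group order, each file's minimum
// is at least the previous file's maximum. Groups with 0 or 1 files are
// trivially sorted (windows(2) yields nothing, so all() returns true).
fn is_sorted(group: &[FileMinMax]) -> bool {
    group.windows(2).all(|w| w[0].max <= w[1].min)
}
```

Allowing equality at the boundary (`<=`) mirrors the fact that files may share a boundary value and still be mergeable into one sorted stream.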
Re: [PR] Stop copying LogicalPlan and Exprs in `DecorrelatePredicateSubquery` [datafusion]
alamb merged PR #10318: URL: https://github.com/apache/datafusion/pull/10318
Re: [I] Stop copying LogicalPlan and Exprs in `DecorrelatePredicateSubquery ` [datafusion]
alamb closed issue #10289: Stop copying LogicalPlan and Exprs in `DecorrelatePredicateSubquery ` URL: https://github.com/apache/datafusion/issues/10289
Re: [PR] Stop copying LogicalPlan and Exprs in `DecorrelatePredicateSubquery` [datafusion]
alamb commented on PR #10318: URL: https://github.com/apache/datafusion/pull/10318#issuecomment-2089182854 Thanks for the review @waynexia
Re: [PR] Fix `coalesce`, `struct` and `named_strct` expr_fn function to take multiple arguments [datafusion]
alamb commented on code in PR #10321: URL: https://github.com/apache/datafusion/pull/10321#discussion_r1586856895 ## datafusion/functions/src/core/mod.rs: ## @@ -39,14 +42,68 @@ make_udf_function!(getfield::GetFieldFunc, GET_FIELD, get_field); make_udf_function!(coalesce::CoalesceFunc, COALESCE, coalesce); // Export the functions out of this package, both as expr_fn as well as a list of functions -export_functions!( -(nullif, arg_1 arg_2, "returns NULL if value1 equals value2; otherwise it returns value1. This can be used to perform the inverse operation of the COALESCE expression."), -(arrow_cast, arg_1 arg_2, "returns arg_1 cast to the `arrow_type` given the second argument. This can be used to cast to a specific `arrow_type`."), -(nvl, arg_1 arg_2, "returns value2 if value1 is NULL; otherwise it returns value1"), -(nvl2, arg_1 arg_2 arg_3, "Returns value2 if value1 is not NULL; otherwise, it returns value3."), -(arrow_typeof, arg_1, "Returns the Arrow type of the input expression."), -(r#struct, args, "Returns a struct with the given arguments"), -(named_struct, args, "Returns a struct with the given names and arguments pairs"), -(get_field, arg_1 arg_2, "Returns the value of the field with the given name from the struct"), -(coalesce, args, "Returns `coalesce(args...)`, which evaluates to the value of the first expr which is not NULL") -); +pub mod expr_fn { Review Comment: What I couldn't figure out how to do was make the macro take both syntaxes. It would have to look something like ```rust export_functions!( // create a function with arg_1 and arg_2 Expr arguments (arrow_cast, arg_1 arg_2, "returns arg_1 cast to the `arrow_type` given the second argument. This can be used to cast to a specific `arrow_type`."), /// create a function with a single Vec<Expr> arg (coalesce, args, "Returns `coalesce(args...)`, which evaluates to the value of the first expr which is not NULL") ...
} ``` the macro definition is https://github.com/apache/datafusion/blob/7c1c7941a078153173a6c82bf2c8742bbcbdefa8/datafusion/functions/src/macros.rs#L39-L60 There may be some way to do this in Rust, but I couldn't figure it out
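The two-syntax macro discussed above can in fact be expressed with two `macro_rules!` arms, provided the literal-`args` arm is listed first; otherwise the identifier repetition `$($arg:ident)+` happily matches the single identifier `args` and generates a one-argument function. This is a self-contained sketch with a stand-in `Expr` type and an invented `export_fn!` name, not the real `export_functions!` from `datafusion/functions`:

```rust
// Stand-in for datafusion_expr::Expr, purely for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Expr(String);

// Build a fake function-call expression so the generated fns do something.
fn call(name: &str, args: Vec<Expr>) -> Expr {
    let joined = args.iter().map(|e| e.0.as_str()).collect::<Vec<_>>().join(", ");
    Expr(format!("{}({})", name, joined))
}

macro_rules! export_fn {
    // Variadic form: `export_fn!(name, args, "doc")` -> fn name(args: Vec<Expr>).
    // This arm must come FIRST so the literal token `args` is not swallowed
    // by the ident repetition below. (Caveat: a fixed-arity function whose
    // single parameter is literally named `args` would also hit this arm.)
    ($name:ident, args, $doc:expr) => {
        fn $name(args: Vec<Expr>) -> Expr {
            call(stringify!($name), args)
        }
    };
    // Fixed-arity form: `export_fn!(name, a b, "doc")` -> fn name(a: Expr, b: Expr).
    ($name:ident, $($arg:ident)+, $doc:expr) => {
        fn $name($($arg: Expr),+) -> Expr {
            call(stringify!($name), vec![$($arg),+])
        }
    };
}

export_fn!(coalesce, args, "first non-null argument");
export_fn!(nvl, arg_1 arg_2, "arg_2 if arg_1 is null");
```

`coalesce` ends up taking a `Vec<Expr>` while `nvl` takes two `Expr` parameters, which is exactly the split the comment describes.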
Re: [I] Move `create_physical_expr` to `physical-expr-common` [datafusion]
alamb commented on issue #10074: URL: https://github.com/apache/datafusion/issues/10074#issuecomment-2089174413 > The overall idea is to enable us to import common things to Expr::AggregateFunction which lives in datafusion-expr. In the chart you have in https://github.com/apache/datafusion/issues/10074#issuecomment-2088004947 it seems like it will mean that `datafusion-expr` will depend (indirectly) on `datafusion-physical-expr`, which I thought was the dependency we are trying to avoid. I am sorry I am so slow to respond and am getting lost here. Can we take a step back and help figure out what problem we are trying to solve? Can you remind me why `datafusion-functions-aggregate` (the implementation of aggregate functions) shouldn't depend on `datafusion-physical-expr` and `datafusion-expr`? I thought the thing we are trying to do is avoid `datafusion-expr` depending on `datafusion-physical-expr` 🤔 What I believe we are trying to do is to pull the aggregate functions out of datafusion-physical-expr. I am sorry if you have explained this to me already.
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
alamb commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586823032 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +) -> Option<ScalarValue> { +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: > Actually I was mistaken, these errors weren't being ignored. I could add them back in in a new PR? I think that would be really nice if you had the time.
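The narrowing step at the heart of this rule, rewriting a predicate like `cast(int32_col AS Int64) = lit_i64` into `int32_col = lit_i32` only when the literal fits in the narrower type, can be sketched with toy scalar types. This is a simplified stand-in, not the real `try_cast_literal_to_type`:

```rust
// Toy scalar values for the two integer widths involved.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Scalar {
    Int32(i32),
    Int64(i64),
}

// Try to narrow an Int64 literal to Int32 so the comparison can be
// performed directly on the un-cast column. Returns None when the value
// does not fit, in which case the rewrite must leave the cast in place.
fn try_narrow_to_i32(lit: Scalar) -> Option<Scalar> {
    match lit {
        Scalar::Int32(_) => Some(lit), // already the target type
        Scalar::Int64(v) => i32::try_from(v).ok().map(Scalar::Int32),
    }
}
```

The `None` branch is the important one: an out-of-range literal means the predicate cannot be rewritten losslessly, so the optimizer must keep the original cast expression.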
Re: [I] [Epic] A Collection of Sort Based Optimizations [datafusion]
alamb commented on issue #10313: URL: https://github.com/apache/datafusion/issues/10313#issuecomment-2089129451 > Would this ticket be an appropriate place to add tickets related to pushing down sorts to federated query engines? I know that this was discussed previously (i.e. #7871) and it seems that writing a custom optimizer is the current way to handle that. I added #7871 to the list above -- thank you. Yes, I think this would be a good place to discuss. > I will need to do this soon (federated sort pushdown) and it initially wasn't clear to me how to make this work in DataFusion. I can volunteer to write some docs on how to do this once I have an implementation that works. That would be great, thanks @phillipleblanc. Right now, once `TableProvider::execute` gets called, the returned `ExecutionPlan` can report how it is already sorted. What we don't have is any way to have the optimizer tell an `ExecutionPlan` that it could reduce the work required in the DataFusion plan if the data was already sorted. I wonder if we could add something to the `ExecutionPlan` trait similar to [`ExecutionPlan::repartitioned`](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.repartitioned) like ```rust trait ExecutionPlan { ... /// return other possible orders that this ExecutionPlan could return /// (the DataFusion optimizer will use this information to potentially push Sorts /// into the Node) fn pushable_sorts(&self) -> Result<Option<Vec<LexOrdering>>> { return Ok(None) } /// return a node like this one except that its output is sorted according to `exprs` fn resorted(&self, exprs: &LexOrdering) -> Result<Option<Arc<dyn ExecutionPlan>>> { return Ok(None) } } ``` And then add a new optimizer pass that tries to push sorts into the plan nodes that report they can provide sorted data 🤔
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] docs: update docs CI to install python-311 requirements [datafusion-python]
andygrove merged PR #661: URL: https://github.com/apache/datafusion-python/pull/661 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
alamb commented on PR #9593: URL: https://github.com/apache/datafusion/pull/9593#issuecomment-2089122508 Filed https://github.com/apache/datafusion/issues/10336 to track enable this flag by default -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]
alamb commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2089121776 Example test coverage we should add I think: https://github.com/apache/datafusion/pull/9593#discussion_r1585517605 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [Epic] A Collection of Sort Based Optimizations [datafusion]
alamb commented on issue #10313: URL: https://github.com/apache/datafusion/issues/10313#issuecomment-2089116574 Update here: we merged https://github.com/apache/datafusion/pull/9593 and now will work on increasing the test coverage to enable it by default (tracked in https://github.com/apache/datafusion/issues/10336) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Enable `split_file_groups_by_statistics` by default [datafusion]
alamb opened a new issue, #10336: URL: https://github.com/apache/datafusion/issues/10336 ### Is your feature request related to a problem or challenge? Part of https://github.com/apache/datafusion/issues/10313 In https://github.com/apache/datafusion/pull/9593, @suremarc added a way to reorganize input files in a ListingTable to avoid a merge, if the sort key ranges do not overlap. This feature is behind a feature flag, `split_file_groups_by_statistics`, which defaults to `false`, as I think there need to be some more tests in place before we turn it on. ### Describe the solution you'd like Add additional tests and then enable `split_file_groups_by_statistics` by default ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
erratic-pattern commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586805024 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { +) -> Option<ScalarValue> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: Actually I was mistaken, these errors weren't being ignored. I could add them back in a new PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
alamb commented on code in PR #9593: URL: https://github.com/apache/datafusion/pull/9593#discussion_r1586804068 ## datafusion/core/src/datasource/physical_plan/mod.rs: ## @@ -473,15 +468,281 @@ fn get_projected_output_ordering( // since rest of the orderings are violated break; } + // do not push empty entries // otherwise we may have `Some(vec![])` at the output ordering. -if !new_ordering.is_empty() { -all_orderings.push(new_ordering); +if new_ordering.is_empty() { +continue; +} + +// Check if any file groups are not sorted +if base_config.file_groups.iter().any(|group| { +if group.len() <= 1 { +// File groups with <= 1 files are always sorted +return false; +} + +let statistics = match MinMaxStatistics::new_from_files( +&new_ordering, +projected_schema, +base_config.projection.as_deref(), +group, +) { +Ok(statistics) => statistics, +Err(e) => { +log::trace!("Error fetching statistics for file group: {e}"); +// we can't prove that it's ordered, so we have to reject it +return true; +} +}; + +!statistics.is_sorted() +}) { +debug!( +"Skipping specified output ordering {:?}. \ +Some file groups couldn't be determined to be sorted: {:?}", +base_config.output_ordering[0], base_config.file_groups +); +continue; } + +all_orderings.push(new_ordering); } all_orderings } +/// A normalized representation of file min/max statistics that allows for efficient sorting & comparison. +/// The min/max values are ordered by [`Self::sort_order`]. +/// Furthermore, any columns that are reversed in the sort order have their min/max values swapped.
+pub(crate) struct MinMaxStatistics { +min_by_sort_order: arrow::row::Rows, +max_by_sort_order: arrow::row::Rows, +sort_order: Vec<PhysicalSortExpr>, +} + +impl MinMaxStatistics { +#[allow(unused)] +fn sort_order(&self) -> &[PhysicalSortExpr] { +&self.sort_order +} + +fn new_from_files<'a>( +projected_sort_order: &[PhysicalSortExpr], // Sort order with respect to projected schema +projected_schema: &SchemaRef, // Projected schema +projection: Option<&[usize]>, // Indices of projection in full table schema (None = all columns) Review Comment: I guess I was thinking of subtle bugs related to when: 1. The schema of the files is different but compatible (e.g. one file has (`time`, `date`, `symbol`) but the other file has (`date`, `symbol`, `time`), for example) 2. The query orders by a subset of the columns (e.g. `ORDER BY time`) 3. The query orders by a subset of the columns that is not the sort order (`ORDER BY date`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
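The `is_sorted` check quoted above boils down to a simple invariant: for the files in a group, in their current order, each file's minimum sort key must not precede the previous file's maximum; only then can the files be concatenated without a merge. A dependency-free sketch of that invariant, using a single scalar key instead of `arrow::row::Rows` (names here are illustrative, not the PR's actual API):

```rust
/// Per-file min/max of the sort key, simplified to one i64 column.
struct FileStats {
    min: i64,
    max: i64,
}

/// Returns true if the files' [min, max] ranges never go backwards, i.e.
/// every file starts at or after the previous file ends. Touching ranges
/// (prev.max == next.min) are allowed; overlapping ones are not.
fn is_sorted(files: &[FileStats]) -> bool {
    files.windows(2).all(|w| w[0].max <= w[1].min)
}

fn main() {
    let ok = [FileStats { min: 0, max: 5 }, FileStats { min: 5, max: 9 }];
    let overlap = [FileStats { min: 0, max: 6 }, FileStats { min: 5, max: 9 }];
    assert!(is_sorted(&ok));
    assert!(!is_sorted(&overlap));
    println!("non-overlapping file ranges can be concatenated without a merge");
}
```

The real implementation compares full multi-column sort keys via the arrow Row format (which also handles descending columns by swapping min/max, as the doc comment notes), but the pairwise `max <= next.min` shape of the check is the same.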
Re: [I] Use file statistics in query planning to avoid sorting when unecessary [datafusion]
alamb closed issue #7490: Use file statistics in query planning to avoid sorting when unecessary URL: https://github.com/apache/datafusion/issues/7490 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
alamb merged PR #9593: URL: https://github.com/apache/datafusion/pull/9593 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
erratic-pattern commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586797735 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { +) -> Option<ScalarValue> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: see https://doc.rust-lang.org/std/macro.debug_assert.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Determine ordering of file groups [datafusion]
alamb commented on PR #9593: URL: https://github.com/apache/datafusion/pull/9593#issuecomment-2089091928 > @alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first. Sorry -- sounds good. I am going to give this PR another look and file some follow on tickets. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
erratic-pattern commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586795632 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { +) -> Option<ScalarValue> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: > I wonder if ignoring errors will make tracking down unsupported types harder (I am imagining when we try and add REEArray or StringViewArray). 🤔 yeah we may want to reintroduce a Result type, but as it stands currently the errors were already being ignored anyway, so they didn't serve any purpose. also the boolean is_supported_* checks mean that we'll just silently skip those cases anyway. so realistically most of those errors just don't ever occur, and if they did occur we wouldn't know about them. we would need to rethink the design of this code to surface those errors. another interesting possibility is debug assertions that panic, but only on 'debug' builds. 
I've always found this to be an interesting but underutilized tool for uncovering bugs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
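For readers unfamiliar with the tool being discussed: `debug_assert!` is checked only in builds with debug assertions enabled (the default for `cargo build` and `cargo test`) and compiles to nothing in release builds, so it can surface "this case should never happen" bugs during development at zero production cost. A minimal illustration (the function here is made up for the example):

```rust
/// Classify a byte, using a debug-only assertion to flag inputs the caller
/// promised never to pass. In release builds the check disappears entirely.
fn classify(byte: u8) -> &'static str {
    debug_assert!(byte.is_ascii(), "unexpected non-ASCII byte: {byte}");
    if byte.is_ascii_digit() {
        "digit"
    } else {
        "other"
    }
}

fn main() {
    assert_eq!(classify(b'7'), "digit");
    assert_eq!(classify(b'x'), "other");
    // classify(0xFF) would panic under `cargo test`, but be silently
    // accepted in a release build.
    println!("ok");
}
```

Applied to `try_cast_literal_to_type`, the idea would be to keep returning `None` in release builds while panicking in debug/test builds when an unexpected `DataType` slips through.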
Re: [I] Stop copying LogicalPlan and Exprs in `CommonSubexprEliminate` [datafusion]
alamb closed issue #10211: Stop copying LogicalPlan and Exprs in `CommonSubexprEliminate` URL: https://github.com/apache/datafusion/issues/10211 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Stop copying LogicalPlan and Exprs in `CommonSubexprEliminate` [datafusion]
alamb commented on issue #10211: URL: https://github.com/apache/datafusion/issues/10211#issuecomment-2089088807 > @alamb, why did you reopen this? I am not sure. Sorry about that -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]
alamb closed pull request #10221: fix: preserve more dictionaries when coercing types URL: https://github.com/apache/datafusion/pull/10221 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: preserve more dictionaries when coercing types [datafusion]
alamb commented on PR #10221: URL: https://github.com/apache/datafusion/pull/10221#issuecomment-2089086256 I believe this is superceded by https://github.com/apache/datafusion/pull/10323 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Implement Spark-compatible CAST from string to timestamp types [datafusion-comet]
andygrove commented on code in PR #335: URL: https://github.com/apache/datafusion-comet/pull/335#discussion_r1586766148 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -510,9 +558,246 @@ impl PhysicalExpr for Cast { } } +fn timestamp_parser(value: &str, eval_mode: EvalMode) -> CometResult<Option<i64>> { +let value = value.trim(); +if value.is_empty() { +return Ok(None); +} + +// Define regex patterns and corresponding parsing functions +let patterns = &[ +( +Regex::new(r"^\d{4}$").unwrap(), +parse_str_to_year_timestamp as fn(&str) -> CometResult<Option<i64>>, +), +( +Regex::new(r"^\d{4}-\d{2}$").unwrap(), +parse_str_to_month_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap(), +parse_str_to_day_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}T\d{1,2}$").unwrap(), +parse_str_to_hour_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}$").unwrap(), +parse_str_to_minute_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$").unwrap(), +parse_str_to_second_timestamp, +), +( +Regex::new(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6}$").unwrap(), +parse_str_to_microsecond_timestamp, +), +( +Regex::new(r"^T\d{1,2}$").unwrap(), +parse_str_to_time_only_timestamp, +), +]; + +let mut timestamp = None; + +// Iterate through patterns and try matching +for (pattern, parse_func) in patterns { +if pattern.is_match(value) { +timestamp = parse_func(value)?; +break; +} +} + +if timestamp.is_none() { +if eval_mode == EvalMode::Ansi { +return Err(CometError::CastInvalidValue { +value: value.to_string(), +from_type: "STRING".to_string(), +to_type: "TIMESTAMP".to_string(), +}); +} else { +return Ok(None); +} +} +Ok(Some(timestamp.unwrap())) +} + +fn parse_ymd_timestamp(year: i32, month: u32, day: u32) -> CometResult<Option<i64>> { +let datetime = chrono::Utc +.with_ymd_and_hms(year, month, day, 0, 0, 0) +.unwrap() +.with_timezone(&chrono::Utc); +Ok(Some(datetime.timestamp_micros())) +} + +fn parse_hms_timestamp( +year: i32, +month: u32, +day: u32, +hour: u32, +minute: u32, +second: u32, +microsecond: u32, +) -> CometResult<Option<i64>> { +let datetime = chrono::Utc +.with_ymd_and_hms(year, month, day, hour, minute, second) +.unwrap() +.with_timezone(&chrono::Utc) +.with_nanosecond(microsecond * 1000); +Ok(Some(datetime.unwrap().timestamp_micros())) +} + +fn get_timestamp_values(value: &str, timestamp_type: &str) -> CometResult<Option<i64>> { +let values: Vec<_> = value +.split(|c| c == 'T' || c == '-' || c == ':' || c == '.') +.collect(); +let year = values[0].parse::<i32>().unwrap_or_default(); +let month = values.get(1).map_or(1, |m| m.parse::<u32>().unwrap_or(1)); +let day = values.get(2).map_or(1, |d| d.parse::<u32>().unwrap_or(1)); +let hour = values.get(3).map_or(0, |h| h.parse::<u32>().unwrap_or(0)); +let minute = values.get(4).map_or(0, |m| m.parse::<u32>().unwrap_or(0)); +let second = values.get(5).map_or(0, |s| s.parse::<u32>().unwrap_or(0)); +let microsecond = values.get(6).map_or(0, |ms| ms.parse::<u32>().unwrap_or(0)); + +match timestamp_type { +"year" => parse_ymd_timestamp(year, 1, 1), +"month" => parse_ymd_timestamp(year, month, 1), +"day" => parse_ymd_timestamp(year, month, day), +"hour" => parse_hms_timestamp(year, month, day, hour, 0, 0, 0), +"minute" => parse_hms_timestamp(year, month, day, hour, minute, 0, 0), +"second" => parse_hms_timestamp(year, month, day, hour, minute, second, 0), +"microsecond" => parse_hms_timestamp(year, month, day, hour, minute, second, microsecond), +_ => Err(CometError::CastInvalidValue { +value: value.to_string(), +from_type: "STRING".to_string(), +to_type: "TIMESTAMP".to_string(), +}), +} +} + +fn parse_str_to_year_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "year") +} + +fn parse_str_to_month_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "month") +} + +fn parse_str_to_day_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "day") +} + +fn parse_str_to_hour_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "hour") +} + +fn parse_str_to_minute_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "minute") +} + +fn parse_str_to_second_timestamp(value: &str) -> CometResult<Option<i64>> { +get_timestamp_values(value, "second") +} + +fn
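The core idea in the diff above is shape dispatch: classify the input string by its pattern, then parse only the fields that are present and default the rest (year-only means January 1st, and so on). A stdlib-only sketch of that idea, without the `regex` and `chrono` crates the real Comet code uses (`parse_date_prefix` is a simplified, hypothetical helper, not the PR's API):

```rust
/// Parse the leading "YYYY", "YYYY-MM", or "YYYY-MM-DD" portion of a
/// timestamp string, defaulting missing month/day to 1 as Spark does for
/// its year-only and year-month patterns. Returns None for invalid input.
fn parse_date_prefix(value: &str) -> Option<(i32, u32, u32)> {
    let mut parts = value.trim().splitn(3, '-');
    let year = parts.next()?.parse::<i32>().ok()?;
    let month = match parts.next() {
        Some(m) => m.parse::<u32>().ok().filter(|m| (1..=12).contains(m))?,
        None => 1, // "2024" -> 2024-01-01
    };
    let day = match parts.next() {
        Some(d) => d.parse::<u32>().ok().filter(|d| (1..=31).contains(d))?,
        None => 1, // "2024-05" -> 2024-05-01
    };
    Some((year, month, day))
}

fn main() {
    assert_eq!(parse_date_prefix("2024"), Some((2024, 1, 1)));
    assert_eq!(parse_date_prefix("2024-05"), Some((2024, 5, 1)));
    assert_eq!(parse_date_prefix("2024-05-02"), Some((2024, 5, 2)));
    assert_eq!(parse_date_prefix("not-a-date"), None);
    println!("ok");
}
```

The production code additionally validates the overall shape with regexes before parsing, handles time-of-day and fractional seconds, and converts the fields to epoch microseconds via chrono.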
Re: [PR] fix: limit with offset should return correct results [datafusion-comet]
viirya commented on code in PR #359: URL: https://github.com/apache/datafusion-comet/pull/359#discussion_r1586759865 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -340,7 +340,7 @@ class CometSparkSessionExtensions op } -case op: LocalLimitExec => +case op: LocalLimitExec if getOffset(op) == 0 => Review Comment: Oh, I missed it. I will create a follow up. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
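Background on the condition being discussed: a local, per-partition limit can only take the first `limit` rows of each partition, while an offset must be applied globally after partitions are merged, so a plan that carries a nonzero offset cannot be replaced by a purely local limit. The intended global semantics can be sketched in plain Rust (illustrative only, not the Comet implementation):

```rust
/// Global LIMIT-with-OFFSET semantics: skip `offset` rows of the merged
/// stream, then take at most `limit` rows.
fn limit_with_offset<T: Clone>(rows: &[T], offset: usize, limit: usize) -> Vec<T> {
    rows.iter().skip(offset).take(limit).cloned().collect()
}

fn main() {
    let rows: Vec<i32> = (1..=10).collect();
    // OFFSET 3 LIMIT 4 over rows 1..=10 yields rows 4..=7
    assert_eq!(limit_with_offset(&rows, 3, 4), vec![4, 5, 6, 7]);
    println!("{:?}", limit_with_offset(&rows, 3, 4));
}
```

Applying the same skip/take per partition instead of globally would drop `offset` rows from every partition, which is why the rewrite must fall back when an offset is present.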
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
alamb commented on PR #10323: URL: https://github.com/apache/datafusion/pull/10323#issuecomment-2089031318 I made a PR with a few more tests here: https://github.com/apache/datafusion/pull/10335 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Minor: add a few more dictionary unwrap tests [datafusion]
alamb opened a new pull request, #10335: URL: https://github.com/apache/datafusion/pull/10335 ## Which issue does this PR close? Small follow on to https://github.com/apache/datafusion/pull/10323 ## Rationale for this change I noticed a few more cases I thought should be covered while reviewing https://github.com/apache/datafusion/pull/10323 ## What changes are included in this PR? Add a few more tests for unwrapping dictionaries (verify that the order of the column and literal does not matter) ## Are these changes tested? All tests ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
alamb commented on code in PR #10323: URL: https://github.com/apache/datafusion/pull/10323#discussion_r1586745915 ## datafusion/optimizer/src/unwrap_cast_in_comparison.rs: ## @@ -298,19 +306,47 @@ fn is_support_data_type(data_type: &DataType) -> bool { ) } +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a string +fn is_supported_string_type(data_type: &DataType) -> bool { +matches!(data_type, DataType::Utf8 | DataType::LargeUtf8) +} + +/// Returns true if [UnwrapCastExprRewriter] supports casting this value as a dictionary +fn is_supported_dictionary_type(data_type: &DataType) -> bool { +matches!(data_type, +DataType::Dictionary(_, inner) if is_supported_type(inner)) +} + +/// Convert a literal value from one data type to another fn try_cast_literal_to_type( lit_value: &ScalarValue, target_type: &DataType, -) -> Result<Option<ScalarValue>> { +) -> Option<ScalarValue> { let lit_data_type = lit_value.data_type(); -// the rule just support the signed numeric data type now -if !is_support_data_type(&lit_data_type) || !is_support_data_type(target_type) { -return Ok(None); +if !is_supported_type(&lit_data_type) || !is_supported_type(target_type) { +return None; } if lit_value.is_null() { // null value can be cast to any type of null value -return Ok(Some(ScalarValue::try_from(target_type)?)); +return ScalarValue::try_from(target_type).ok(); Review Comment: I wonder if ignoring errors will make tracking down unsupported types harder (I am imagining when we try and add REEArray or StringViewArray). 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Slow comparisions to dictionary columns with type coercion [datafusion]
alamb closed issue #10220: Slow comparisions to dictionary columns with type coercion URL: https://github.com/apache/datafusion/issues/10220 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: unwrap casts of string and dictionary columns [datafusion]
alamb merged PR #10323: URL: https://github.com/apache/datafusion/pull/10323 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: limit with offset should return correct results [datafusion-comet]
parthchandra commented on code in PR #359: URL: https://github.com/apache/datafusion-comet/pull/359#discussion_r1586750015 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -340,7 +340,7 @@ class CometSparkSessionExtensions op } -case op: LocalLimitExec => +case op: LocalLimitExec if getOffset(op) == 0 => Review Comment: Add a `withInfo` for the case when `offset > 0` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Implement Spark-compatible CAST from string to timestamp types [datafusion-comet]
andygrove commented on code in PR #335: URL: https://github.com/apache/datafusion-comet/pull/335#discussion_r1586743530 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -86,6 +89,24 @@ macro_rules! cast_utf8_to_int { }}; } +macro_rules! cast_utf8_to_timestamp { +($array:expr, $eval_mode:expr, $array_type:ty, $cast_method:ident) => {{ +let len = $array.len(); +let mut cast_array = PrimitiveArray::<$array_type>::builder(len).with_timezone("UTC"); +for i in 0..len { +if $array.is_null(i) { +cast_array.append_null() +} else if let Ok(Some(cast_value)) = $cast_method($array.value(i).trim(), $eval_mode) { +cast_array.append_value(cast_value); +} else { +cast_array.append_null() +} +} +let result: CometResult<ArrayRef> = Ok(Arc::new(cast_array.finish()) as ArrayRef); +result.unwrap() Review Comment: We can remove the `unwrap` here: ```suggestion let result: ArrayRef = Arc::new(cast_array.finish()) as ArrayRef; result ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Add coalesce tests [datafusion]
alamb commented on code in PR #10334: URL: https://github.com/apache/datafusion/pull/10334#discussion_r1586740016 ## datafusion/sqllogictest/test_files/coalesce.slt: ## @@ -0,0 +1,374 @@ +# Licensed to the Apache Software Foundation (ASF) under one Review Comment: Given the size of these tests I opted to put them in their own file rather than `function.slt`
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
alamb commented on PR #10268: URL: https://github.com/apache/datafusion/pull/10268#issuecomment-2088998120 I made a PR with just your wonderful tests in https://github.com/apache/datafusion/pull/10334/files
Re: [I] Unable to Run tests of CometShuffleEncryptionSuite [datafusion-comet]
ganeshkumar269 commented on issue #367: URL: https://github.com/apache/datafusion-comet/issues/367#issuecomment-2088994122 ./mvnw test -Dsuites=org.apache.comet.exec.CometShuffleEncryptionSuite ran the above command, same result
Re: [PR] Coverage: Add a manual test to show what Spark built in expression the DF can support directly [datafusion-comet]
parthchandra commented on code in PR #331: URL: https://github.com/apache/datafusion-comet/pull/331#discussion_r1586727820 ## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -123,7 +134,7 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH Seq(("", "No examples found in spark.sessionState.functionRegistry" } -// TODO: convert results into HTML +// TODO: convert results into HTML or .md file Review Comment: Perhaps log an issue for this and refer to that instead? It would be great to link the output directly into the documentation.
[I] Write a guide on contributing a new expression [datafusion-comet]
andygrove opened a new issue, #370: URL: https://github.com/apache/datafusion-comet/issues/370 ### What is the problem the feature request solves? We should write a detailed guide showing new contributors how to add support for a new expression in Comet. This would cover the Scala, Protobuf, and Rust changes, as well as testing approaches. We could create this based on a recent PR that added a new expression. ### Describe the potential solution _No response_ ### Additional context _No response_
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
alamb commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1586698773 ## datafusion/expr/src/type_coercion/functions.rs: ## @@ -53,20 +54,31 @@ pub fn data_types( } } -let valid_types = get_valid_types(_signature, current_types)?; - +let mut valid_types = get_valid_types(_signature, current_types)?; if valid_types .iter() .any(|data_type| data_type == current_types) { return Ok(current_types.to_vec()); } -// Try and coerce the argument types to match the signature, returning the -// coerced types from the first matching signature. -for valid_types in valid_types { -if let Some(types) = maybe_data_types(_types, current_types) { -return Ok(types); +// Well-supported signature that returns exact valid types. +if !valid_types.is_empty() +&& matches!(signature.type_signature, TypeSignature::VariadicEqualOrNull) +{ +// exact valid types +assert_eq!(valid_types.len(), 1); +let valid_types = valid_types.swap_remove(0); Review Comment: is it a problem that this may change the type ordering? So that the last type is now the first in the new Vec? ## datafusion/functions/src/core/coalesce.rs: ## @@ -60,9 +59,8 @@ impl ScalarUDFImpl for CoalesceFunc { } fn return_type(, arg_types: &[DataType]) -> Result { -// COALESCE has multiple args and they might get coerced, get a preview of this Review Comment: ❤️ ## datafusion/sqllogictest/test_files/functions.slt: ## @@ -1158,3 +1158,347 @@ drop table uuid_table statement ok drop table t + + +# Test coalesce function +query I +select coalesce(1, 2, 3); + +1 + +# test with first null +query ?T +select coalesce(null, 3, 2, 1), arrow_typeof(coalesce(null, 3, 2, 1)); + +3 Int64 + +# test with null values +query ? 
+select coalesce(null, null); + +NULL + +# cast to float +query IT +select + coalesce(1, 2.0), + arrow_typeof(coalesce(1, 2.0)) +; + +1 Float64 + +query RT +select + coalesce(2.0, 1), + arrow_typeof(coalesce(2.0, 1)) +; + +2 Float64 + +query IT +select + coalesce(1, arrow_cast(2.0, 'Float32')), + arrow_typeof(coalesce(1, arrow_cast(2.0, 'Float32'))) +; + +1 Float32 + +# test with empty args +query error +select coalesce(); + +# test with different types +query I +select coalesce(arrow_cast(1, 'Int32'), arrow_cast(1, 'Int64')); + +1 + +# test with nulls +query ?T +select coalesce(null, null, null), arrow_typeof(coalesce(null, null)); + +NULL Null + +# i32 and u32, cast to wider type i64 +query IT +select + coalesce(arrow_cast(2, 'Int32'), arrow_cast(3, 'UInt32')), + arrow_typeof(coalesce(arrow_cast(2, 'Int32'), arrow_cast(3, 'UInt32'))); + +2 Int64 + +query IT +select + coalesce(arrow_cast(2, 'Int16'), arrow_cast(3, 'UInt16')), + arrow_typeof(coalesce(arrow_cast(2, 'Int16'), arrow_cast(3, 'UInt16'))); + +2 Int32 + +query IT +select + coalesce(arrow_cast(2, 'Int8'), arrow_cast(3, 'UInt8')), + arrow_typeof(coalesce(arrow_cast(2, 'Int8'), arrow_cast(3, 'UInt8'))); + +2 Int16 + +# i64 and u32, cast to wider type i64 +query IT +select + coalesce(2, arrow_cast(3, 'UInt32')), + arrow_typeof(coalesce(2, arrow_cast(3, 'UInt32'))); + +2 Int64 + +# TODO: Got types (i64, u64), casting to decimal or double or even i128 if supported +query IT +select + coalesce(2, arrow_cast(3, 'UInt64')), + arrow_typeof(coalesce(2, arrow_cast(3, 'UInt64'))); + +2 Int64 + +statement ok +create table t1 (a bigint, b int, c int) as values (null, null, 1), (null, 2, null); + +# Follow Postgres and DuckDB behavior, since a is bigint, although the value is null, all args are coerced to bigint +query IT +select + coalesce(a, b, c), + arrow_typeof(coalesce(a, b, c)) +from t1; + +1 Int64 +2 Int64 + +# b, c has the same type int, so the result is int +query IT +select + coalesce(b, c), + 
arrow_typeof(coalesce(b, c)) +from t1; + +1 Int32 +2 Int32 + +statement ok +drop table t1; + +# test multi rows +statement ok +CREATE TABLE t1( + c1 int, + c2 int +) as VALUES +(1, 2), +(NULL, 2), +(1, NULL), +(NULL, NULL); + +query I +SELECT COALESCE(c1, c2) FROM t1 + +1 +2 +1 +NULL + +statement ok +drop table t1; + +# Decimal128(7, 2) and int64 are coerced to common wider type Decimal128(22, 2) +query RT +select + coalesce(arrow_cast(2, 'Decimal128(7, 2)'), 0), + arrow_typeof(coalesce(arrow_cast(2, 'Decimal128(7, 2)'), 0)) + +2 Decimal128(22, 2) + +query RT +select + coalesce(arrow_cast(2, 'Decimal256(7, 2)'), 0), + arrow_typeof(coalesce(arrow_cast(2, 'Decimal256(7, 2)'), 0)); + +2 Decimal256(22, 2) + +# coalesce string +query T? +select + coalesce('', 'test'), + coalesce(null, 'test'); + +(empty) test + +# coalesce utf8 and large utf8 +query TT +select + coalesce('a', arrow_cast('b', 'LargeUtf8')), + arrow_typeof(coalesce('a',
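On the `swap_remove(0)` ordering question raised in the review comments above: `Vec::swap_remove` backfills the removed slot with the *last* element, so relative order is not preserved in general; with a single element, as the `assert_eq!` on the length guarantees in the PR, no reordering can occur. A minimal stdlib demonstration:

```rust
fn main() {
    // swap_remove(0) is O(1): it removes index 0 and moves the *last*
    // element into that slot, so relative order changes.
    let mut valid_types = vec!["Int64", "Float64", "Utf8"];
    let first = valid_types.swap_remove(0);
    assert_eq!(first, "Int64");
    assert_eq!(valid_types, ["Utf8", "Float64"]); // last element moved to front

    // With exactly one element there is nothing to reorder.
    let mut single = vec!["Int64"];
    assert_eq!(single.swap_remove(0), "Int64");
    assert!(single.is_empty());
}
```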
[I] Plan first release [datafusion-comet]
andygrove opened a new issue, #369: URL: https://github.com/apache/datafusion-comet/issues/369 ### What is the problem the feature request solves? During the Comet public meeting this morning, there were questions about when the first official release would be. We do not really have an answer to that yet, but we can use this issue to discuss it. Here are some ideas for milestones that we may want to achieve before creating an official source release (note that we do not necessarily need to create binary releases right away). - Ensure that currently supported operators and expressions are fully compatible with all supported Spark versions - Achieve 100% coverage for TPC-H and/or TPC-DS benchmark with a clear performance advantage - Create release process and scripts for publishing a source release ### Describe the potential solution _No response_ ### Additional context _No response_
[I] Enable GitHub discussions [datafusion-comet]
andygrove opened a new issue, #368: URL: https://github.com/apache/datafusion-comet/issues/368 ### What is the problem the feature request solves? During the public Comet meeting this morning we discussed enabling GitHub discussions as a place for people to ask questions and also potentially as a place to record notes from meetings. We would be careful not to use discussions to make any decisions because discussions do not get sent to the mailing list (but all activity on issues and PRs do go to the mailing list, so this is where we make most decisions). ### Describe the potential solution _No response_ ### Additional context _No response_
[PR] remove expr node accumulation [datafusion]
MohamedAbdeen21 opened a new pull request, #10333: URL: https://github.com/apache/datafusion/pull/10333 ## Which issue does this PR close? Closes #10280. Thanks @JasonLi-cn for pointing out the relevant code snippet and the detailed description, saved me some time ❤️ ## Rationale for this change Making the plans human-readable when common sub expression elimination is applied (check example in ticket and edited tests in this PR). ## What changes are included in this PR? ## Are these changes tested? Changed tests to reflect the change. ## Are there any user-facing changes? Readable plans.
Re: [PR] feat: Add a test for supported types of SortMergeJoin in DataFusion [datafusion-comet]
planga82 commented on code in PR #365: URL: https://github.com/apache/datafusion-comet/pull/365#discussion_r1586685070 ## common/src/main/scala/org/apache/comet/CometConf.scala: ## @@ -383,6 +383,13 @@ object CometConf { .booleanConf .createWithDefault(false) + val COMET_SORTMERGEJOIN_CHECK_KEY_TYPES: ConfigEntry[Boolean] = Review Comment: thanks @comphead . Any idea how to do this in a cleaner way? We could move this constant to another place and avoid putting it in CometConf. That way we avoid exposing a testing configuration. What do you think?
Re: [I] Use non-comparison coercion for `Coalesce` or even avoid implicit casting for `Coalesce` [datafusion]
alamb commented on issue #10261: URL: https://github.com/apache/datafusion/issues/10261#issuecomment-2088944901 BTW this is a really nicely written ticket
Re: [I] Finalize SIGMOD 2024 paper ~(if accepted)~ [datafusion]
alamb commented on issue #8373: URL: https://github.com/apache/datafusion/issues/8373#issuecomment-2088935249 I think it is finally accepted ``` -- Forwarded message - From: Tim Pollitt Date: Wed, May 1, 2024 at 2:44 PM Subject: ACM Proceedings (sigmodpods TP) - Submission Follow Up, ID No: modIP001 To: Cc: , , , , , Dear Author(s), For the re-submission of "Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine" to the proceedings and ACM Digital Library. See below for info related to your final pdf. In addition, authors are requested by the chairs to submit an optional presentation video. (1) For more info, see instructions for details, and inquiry link to chairs: https://www.scomminc.com/pp/acmsig/sigmod-video.htm by May 6th. (2) You will need the DOI assigned to your submission: 10.1145/3626246.3653368 (3) Video descriptions over the 1,024 ACM character count will be deleted/removed by ACM. * We received notice of your completed ACM form, so everything appears to be in order with this submission, and will be utilized for the ACM DL publication. No revisions are needed, nor will be accepted. This submission is being moved onto the next stages of production (but kindly note the signed ACM forms still go through a double-check by ACM for various permissions). Thank you, Tim Pollitt Sheridan Communications ```
[I] Tracking Upgrade to Datafusion 37 [datafusion-python]
Michael-J-Ward opened a new issue, #663: URL: https://github.com/apache/datafusion-python/issues/663 I sketched out some of the upgrade in #662 and wanted to share what I encountered. Some major changes between datafusion 36 and 37. ## https://github.com/apache/datafusion/issues/9285 That means much of `datafusion-python::functions.rs` needs an update. An incomplete first pass is https://github.com/apache/datafusion-python/pull/662/commits/83249fe9a4e809bb6065671bcaee543065d2d356 ## `SessionContext::tables` was removed in https://github.com/apache/datafusion/pull/9627 Many of the tests rely on this method. Does each test get updated to a new API or do we implement our own version to start and then update the tests until it's not necessary? ## new trait methods for `ExecutionPlan` and `ExecutionPlanProperties` for `DatasetExec` need implementing
Re: [I] Unable to Run tests of CometShuffleEncryptionSuite [datafusion-comet]
viirya commented on issue #367: URL: https://github.com/apache/datafusion-comet/issues/367#issuecomment-2088932628 Have you tried to run it from the command line?
Re: [PR] feat: Only allow supported and tested cast operations to be converted to native [datafusion-comet]
andygrove commented on code in PR #362: URL: https://github.com/apache/datafusion-comet/pull/362#discussion_r158786 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -1042,10 +1042,10 @@ object CometSparkSessionExtensions extends Logging { * The node with information (if any) attached */ def withInfo[T <: TreeNode[_]](node: T, info: String, exprs: T*): T = { -val exprInfo = exprs - .flatMap { e => Seq(e.getTagValue(CometExplainInfo.EXTENSION_INFO)) } - .flatten - .mkString("\n") +val exprInfo = + (Seq(node.getTagValue(CometExplainInfo.EXTENSION_INFO)) ++ exprs +.flatMap(e => Seq(e.getTagValue(CometExplainInfo.EXTENSION_INFO)))) +.flatten +.mkString("\n") Review Comment: I updated this to avoid duplicates
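The duplicate-avoidance described in that comment can be sketched with standard-library types (the `with_info` function and its string-based signature are hypothetical stand-ins for the Scala `withInfo` helper):

```rust
use std::collections::HashSet;

// Stand-in for withInfo's accumulation: gather the node's own tag plus
// its child expressions' tags, skip missing ones, and drop duplicates
// while keeping the first occurrence of each string.
fn with_info(node_info: Option<&str>, child_infos: &[Option<&str>]) -> String {
    let mut seen = HashSet::new();
    std::iter::once(node_info)
        .chain(child_infos.iter().copied())
        .flatten()                   // drop Nones (untagged nodes)
        .filter(|s| seen.insert(*s)) // keep only the first occurrence
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let info = with_info(
        Some("unsupported expression"),
        &[None, Some("unsupported expression"), Some("offset > 0 not supported")],
    );
    assert_eq!(info, "unsupported expression\noffset > 0 not supported");
}
```

Deduplicating at collection time keeps the joined explain output readable when the same reason is tagged on both a node and its children.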
Re: [I] DataFusion `38.0.0` Release [datafusion]
alamb commented on issue #10217: URL: https://github.com/apache/datafusion/issues/10217#issuecomment-2088930080 Here is a PR with a proposed API https://github.com/apache/datafusion/pull/10330 that would close https://github.com/apache/datafusion/issues/10181
Re: [I] Error "entered unreachable code: NamedStructField should be rewritten in OperatorToFunction" after upgrade to 37 [datafusion]
alamb commented on issue #10181: URL: https://github.com/apache/datafusion/issues/10181#issuecomment-2088929490 Here is a PR with a proposed new API: https://github.com/apache/datafusion/pull/10330
Re: [PR] Add APIs to create `PhysicalExpressions` from `Expr`s: `SessionContext::create_physical_expr()` and `SessionState::create_physical_expr()` [datafusion]
alamb commented on code in PR #10330: URL: https://github.com/apache/datafusion/pull/10330#discussion_r1586649331 ## datafusion-examples/examples/expr_api.rs: ## @@ -92,7 +90,8 @@ fn evaluate_demo() -> Result<()> { let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8))); // First, you make a "physical expression" from the logical `Expr` -let physical_expr = physical_expr(&batch.schema(), expr)?; +let df_schema = DFSchema::try_from(batch.schema())?; Review Comment: This shows how the new API is used - `SessionContext::new().create_physical_expr` ## datafusion/common/src/dfschema.rs: ## @@ -806,6 +820,12 @@ impl From<DFSchema> for Schema { } } +impl AsRef<Schema> for DFSchema { Review Comment: This allows DFSchema to be treated like a `Schema`, which is now possible after https://github.com/apache/datafusion/pull/9595 ## datafusion-examples/examples/expr_api.rs: ## @@ -248,21 +251,6 @@ fn make_ts_field(name: &str) -> Field { make_field(name, DataType::Timestamp(TimeUnit::Nanosecond, tz)) } -/// Build a physical expression from a logical one, after applying simplification and type coercion -pub fn physical_expr(schema: &Schema, expr: Expr) -> Result<Arc<dyn PhysicalExpr>> { Review Comment: This PR basically ports this code out of the examples and into `SessionState` and adds documentation and tests ## datafusion/core/src/execution/context/mod.rs: ## @@ -510,6 +515,34 @@ impl SessionContext { } } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type Review Comment: New public API with example ## datafusion/core/src/execution/context/mod.rs: ## @@ -2024,6 +2089,35 @@ impl SessionState { } } +struct SessionSimpifyProvider<'a> { Review Comment: This avoids cloning schema / schema refs ## datafusion/core/tests/expr_api/mod.rs: ## @@ -0,0 +1,181 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership.
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use arrow::util::pretty::pretty_format_columns; +use arrow_array::builder::{ListBuilder, StringBuilder}; +use arrow_array::{ArrayRef, RecordBatch, StringArray, StructArray}; +use arrow_schema::{DataType, Field}; +use datafusion::prelude::*; +use datafusion_common::DFSchema; +/// Tests of using and evaluating `Expr`s outside the context of a LogicalPlan +use std::sync::{Arc, OnceLock}; + +#[test] +fn test_eq() { +// id = '2' +evaluate_expr_test( +col("id").eq(lit("2")), +vec![ +"+---+", +"| expr |", +"+---+", +"| false |", +"| true |", +"| false |", +"+---+", +], +); +} + +#[test] +fn test_eq_with_coercion() { +// id = 2 (need to coerce the 2 to '2' to evaluate) +evaluate_expr_test( +col("id").eq(lit(2i32)), +vec![ +"+---+", +"| expr |", +"+---+", +"| false |", +"| true |", +"| false |", +"+---+", +], +); +} + +#[test] +fn test_get_field() { Review Comment: this tests fail without function rewrites applied ## datafusion/core/src/execution/context/mod.rs: ## @@ -1947,6 +1982,36 @@ impl SessionState { .await } +/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type +/// coercion, and function rewrites. 
+/// +/// Note: The expression is not [simplified] +/// +/// # See Also: +/// * [`SessionContext::create_physical_expr`] for a higher-level API +/// * [`create_physical_expr`] for a lower-level API +/// +/// [simplified]: datafusion_optimizer::simplify_expressions +pub fn create_physical_expr( +&self, +expr: Expr, +df_schema: &DFSchema, +) -> Result<Arc<dyn PhysicalExpr>> { +let simplifier = +ExprSimplifier::new(SessionSimpifyProvider::new(self, df_schema)); +// apply type coercion here to ensure types match +let mut expr = simplifier.coerce(expr, df_schema)?; + +// rewrite Exprs to functions if necessary +let config_options = self.config_options(); Review Comment: Applying these rewrites is a new feature (and what
Re: [PR] feat: Implement Spark-compatible CAST from string to timestamp types [datafusion-comet]
andygrove commented on code in PR #335: URL: https://github.com/apache/datafusion-comet/pull/335#discussion_r1586652474 ## spark/src/test/scala/org/apache/comet/CometCastSuite.scala: ## @@ -528,14 +528,29 @@ class CometCastSuite extends CometTestBase with AdaptiveSparkPlanHelper { "spark.comet.cast.stringToTimestamp is disabled") } - ignore("cast StringType to TimestampType") { -// https://github.com/apache/datafusion-comet/issues/328 -withSQLConf((CometConf.COMET_CAST_STRING_TO_TIMESTAMP.key, "true")) { - val values = Seq("2020-01-01T12:34:56.123456", "T2") ++ generateStrings(timestampPattern, 8) - castTest(values.toDF("a"), DataTypes.TimestampType) + test("cast StringType to TimestampType") { +withSQLConf( + SQLConf.SESSION_LOCAL_TIMEZONE.key -> "UTC", + CometConf.COMET_CAST_STRING_TO_TIMESTAMP.key -> "true") { + val values = Seq( +"2020", +"2020-01", +"2020-01-01", +"2020-01-01T12", +"2020-01-01T12:34", +"2020-01-01T12:34:56", +"2020-01-01T12:34:56.123456", +"T2", +"-9?") + castTimestampTest(values.toDF("a"), DataTypes.TimestampType) } } + test("cast StringType to TimestampType with invalid timezone") { Review Comment: Thanks for adding this
Re: [PR] doc: Clean up supported JDKs in README [datafusion-comet]
andygrove merged PR #366: URL: https://github.com/apache/datafusion-comet/pull/366
Re: [PR] doc: Fix target typo in development.md [datafusion-comet]
andygrove merged PR #364: URL: https://github.com/apache/datafusion-comet/pull/364
Re: [PR] docs: Add a plugin overview page to the contributors guide [datafusion-comet]
parthchandra commented on code in PR #345: URL: https://github.com/apache/datafusion-comet/pull/345#discussion_r1586638726 ## docs/source/contributor-guide/plugin_overview.md: ## @@ -0,0 +1,50 @@ + + +# Comet Plugin Overview + +The entry point to Comet is the `org.apache.comet.CometSparkSessionExtensions` class, which can be registered with Spark by adding the following setting to the Spark configuration when launching `spark-shell` or `spark-submit`: + +``` +--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions +``` + +On initialization, this class registers two physical plan optimization rules with Spark: `CometScanRule` and `CometExecRule`. These rules run whenever a query stage is being planned. + +## CometScanRule + +`CometScanRule` replaces any Parquet scans with Comet Parquet scan classes. + +When the V1 data source API is being used, `FileSourceScanExec` is replaced with `CometScanExec`. + +When the V2 data source API is being used, `BatchScanExec` is replaced with `CometBatchScanExec`. + +## CometExecRule + +`CometExecRule` attempts to transform a Spark physical plan into a Comet plan. + +This rule traverses bottom-up from the original Spark plan and attempts to replace each node with a Comet equivalent. For example, a `ProjectExec` will be replaced by `CometProjectExec`. + +When replacing a node, various checks are performed to determine if Comet can support the operator and its expressions. If an operator or expression is not supported by Comet then the reason will be stored in a tag on the underlying Spark node. Running `explain` on a query will show any reasons that prevented the plan from being executed natively in Comet. If any part of the plan is not supported in Comet then the original Spark plan will be returned. Review Comment: Sorry, just seeing this today. The `explain` information will show up in the UI or `EXPLAIN` output only from Spark 4.0.0 onwards as the `ExtendedExplainGenerator` trait was only added in Spark 4.0. 
Internally, we can always call `ExtendedExplainInfo.generateExtendedInfo(plan)`
Re: [I] Setup regular meetups for Comet development [datafusion-comet]
parthchandra commented on issue #217: URL: https://github.com/apache/datafusion-comet/issues/217#issuecomment-2088897629 For those who might be reading this in the future: Meetup links: https://docs.google.com/document/d/1NBpkIAuU7O9h8Br5CbFksDhX-L9TyO9wmGLPMe0Plc8/edit?usp=sharing
Re: [I] Setup regular meetups for Comet development [datafusion-comet]
parthchandra closed issue #217: Setup regular meetups for Comet development URL: https://github.com/apache/datafusion-comet/issues/217 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add a test for supported types of SortMergeJoin in DataFusion [datafusion-comet]
comphead commented on code in PR #365: URL: https://github.com/apache/datafusion-comet/pull/365#discussion_r1586622441 ## common/src/main/scala/org/apache/comet/CometConf.scala: ## @@ -383,6 +383,13 @@ object CometConf { .booleanConf .createWithDefault(false) + val COMET_SORTMERGEJOIN_CHECK_KEY_TYPES: ConfigEntry[Boolean] = Review Comment: Thanks @planga82. I'm not sure we need to change the conf just for unit-test purposes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Implement Spark-compatible CAST from string to integral types [datafusion-comet]
parthchandra commented on code in PR #307: URL: https://github.com/apache/datafusion-comet/pull/307#discussion_r1586606093 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -142,6 +233,202 @@ impl Cast { } } +/// Equivalent to org.apache.spark.unsafe.types.UTF8String.toByte +fn cast_string_to_i8(str: &str, eval_mode: EvalMode) -> CometResult<Option<i8>> { +Ok(cast_string_to_int_with_range_check( +str, +eval_mode, +"TINYINT", +i8::MIN as i32, +i8::MAX as i32, +)? +.map(|v| v as i8)) +} + +/// Equivalent to org.apache.spark.unsafe.types.UTF8String.toShort +fn cast_string_to_i16(str: &str, eval_mode: EvalMode) -> CometResult<Option<i16>> { +Ok(cast_string_to_int_with_range_check( +str, +eval_mode, +"SMALLINT", +i16::MIN as i32, +i16::MAX as i32, +)? +.map(|v| v as i16)) +} + +/// Equivalent to org.apache.spark.unsafe.types.UTF8String.toInt(IntWrapper intWrapper) +fn cast_string_to_i32(str: &str, eval_mode: EvalMode) -> CometResult<Option<i32>> { +do_cast_string_to_int::<i32>(str, eval_mode, "INT", i32::MIN) +} + +/// Equivalent to org.apache.spark.unsafe.types.UTF8String.toLong(LongWrapper intWrapper) +fn cast_string_to_i64(str: &str, eval_mode: EvalMode) -> CometResult<Option<i64>> { +do_cast_string_to_int::<i64>(str, eval_mode, "BIGINT", i64::MIN) +} + +fn cast_string_to_int_with_range_check( +str: &str, +eval_mode: EvalMode, +type_name: &str, +min: i32, +max: i32, +) -> CometResult<Option<i32>> { +match do_cast_string_to_int(str, eval_mode, type_name, i32::MIN)? 
{ +None => Ok(None), +Some(v) if v >= min && v <= max => Ok(Some(v)), +_ if eval_mode == EvalMode::Ansi => Err(invalid_value(str, "STRING", type_name)), +_ => Ok(None), +} +} + +#[derive(PartialEq)] +enum State { +SkipLeadingWhiteSpace, +SkipTrailingWhiteSpace, +ParseSignAndDigits, +ParseFractionalDigits, +} + +/// Equivalent to +/// - org.apache.spark.unsafe.types.UTF8String.toInt(IntWrapper intWrapper, boolean allowDecimal) +/// - org.apache.spark.unsafe.types.UTF8String.toLong(LongWrapper longWrapper, boolean allowDecimal) +fn do_cast_string_to_int< +T: Num + PartialOrd + Integer + CheckedSub + CheckedNeg + From<i32> + Copy, +>( +str: &str, +eval_mode: EvalMode, +type_name: &str, +min_value: T, +) -> CometResult<Option<T>> { +let len = str.len(); +if str.is_empty() { +return none_or_err(eval_mode, type_name, str); +} + +let mut result: T = T::zero(); +let mut negative = false; +let radix = T::from(10); +let stop_value = min_value / radix; +let mut state = State::SkipLeadingWhiteSpace; +let mut parsed_sign = false; + +for (i, ch) in str.char_indices() { +// skip leading whitespace +if state == State::SkipLeadingWhiteSpace { +if ch.is_whitespace() { +// consume this char +continue; +} +// change state and fall through to next section +state = State::ParseSignAndDigits; +} + +if state == State::ParseSignAndDigits { +if !parsed_sign { +negative = ch == '-'; +let positive = ch == '+'; +parsed_sign = true; +if negative || positive { +if i + 1 == len { +// input string is just "+" or "-" +return none_or_err(eval_mode, type_name, str); +} +// consume this char +continue; +} +} + +if ch == '.' { +if eval_mode == EvalMode::Legacy { +// truncate decimal in legacy mode +state = State::ParseFractionalDigits; +continue; +} else { +return none_or_err(eval_mode, type_name, str); +} +} + +let digit = if ch.is_ascii_digit() { +(ch as u32) - ('0' as u32) +} else { +return none_or_err(eval_mode, type_name, str); +}; + +// We are going to process the new digit and accumulate the result. 
However, before Review Comment: A comment to explain why we're using subtraction instead of addition would make it easier to understand this part of the code. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
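The review comment asks why the parser uses subtraction instead of addition. A minimal standalone sketch (illustrative only, not Comet's actual implementation, and without the whitespace/sign/decimal handling of the real code) of why accumulating the magnitude as a *negative* number matters for `i32::MIN`:

```rust
// Hypothetical sketch: parse a decimal string into i32 by accumulating the
// NEGATIVE magnitude. |i32::MIN| = 2147483648 is not representable as a
// positive i32, so "-2147483648" would overflow if we accumulated positively
// and negated at the end; accumulating negatively makes i32::MIN reachable.
fn parse_i32(s: &str) -> Option<i32> {
    let mut chars = s.chars().peekable();
    let negative = match chars.peek().copied() {
        Some('-') => {
            chars.next();
            true
        }
        Some('+') => {
            chars.next();
            false
        }
        _ => false,
    };
    let mut result: i32 = 0;
    let mut saw_digit = false;
    for ch in chars {
        let digit = ch.to_digit(10)? as i32;
        saw_digit = true;
        // accumulate toward i32::MIN; checked ops surface overflow as None
        result = result.checked_mul(10)?.checked_sub(digit)?;
    }
    if !saw_digit {
        return None;
    }
    if negative {
        Some(result)
    } else {
        // negating the accumulated negative value overflows only for
        // i32::MIN, which is exactly the out-of-range positive case
        result.checked_neg()
    }
}

fn main() {
    assert_eq!(parse_i32("-2147483648"), Some(i32::MIN));
    assert_eq!(parse_i32("2147483647"), Some(i32::MAX));
    assert_eq!(parse_i32("2147483648"), None); // positive overflow
    println!("ok");
}
```

This mirrors the accumulation direction of `UTF8String.toInt`/`toLong`, which is presumably why the reviewer asked for a clarifying comment at that spot.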
[I] Unable to Run tests of CometShuffleEncryptionSuite [datafusion-comet]
ganeshkumar269 opened a new issue, #367: URL: https://github.com/apache/datafusion-comet/issues/367 ### Describe the bug Whenever I execute the test class CometShuffleEncryptionSuite, the tests fail/crash with no meaningful error message, though I get this warning message: WARNING: /Library/Java/JavaVirtualMachines/openjdk-11.jdk/Contents/Home/bin/java is loading libcrypto in an unsafe way In Intellij IDEA, I enabled "Log - Stack trace" for all the exceptions, I received the below error message, ``` Exception 'sun.nio.fs.UnixException' occurred in thread 'Executor task launch worker for task 0.0 in stage 1.0 (TID 1)' at sun.nio.fs.UnixNativeDispatcher.lstat(UnixNativeDispatcher.java:335) at sun.nio.fs.UnixNativeDispatcher.lstat(UnixNativeDispatcher.java:335) at sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:72) at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:232) at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:110) at java.nio.file.Files.deleteIfExists(Files.java:1181) at java.nio.file.Files.copy(Files.java:3055) at org.apache.commons.crypto.NativeCodeLoader.extractLibraryFile(NativeCodeLoader.java:149) at org.apache.commons.crypto.NativeCodeLoader.findNativeLibrary(NativeCodeLoader.java:237) at org.apache.commons.crypto.NativeCodeLoader.loadLibrary(NativeCodeLoader.java:279) at org.apache.commons.crypto.NativeCodeLoader.<clinit>(NativeCodeLoader.java:52) at org.apache.commons.crypto.Crypto.isNativeCodeLoaded(Crypto.java:140) at org.apache.commons.crypto.random.OpenSslCryptoRandom.<clinit>(OpenSslCryptoRandom.java:54) at java.lang.Class.forName0(Class.java:-1) at java.lang.Class.forName(Class.java:398) at org.apache.commons.crypto.utils.ReflectionUtils.getClassByNameOrNull(ReflectionUtils.java:134) at org.apache.commons.crypto.utils.ReflectionUtils.getClassByName(ReflectionUtils.java:101) at org.apache.commons.crypto.random.CryptoRandomFactory.getCryptoRandom(CryptoRandomFactory.java:197) at 
org.apache.spark.security.CryptoStreamUtils$.createInitializationVector(CryptoStreamUtils.scala:138) at org.apache.spark.security.CryptoStreamUtils$.createCryptoOutputStream(CryptoStreamUtils.scala:56) at org.apache.spark.serializer.SerializerManager.$anonfun$wrapForEncryption$3(SerializerManager.scala:151) at org.apache.spark.serializer.SerializerManager$$Lambda$2902.217692044.apply(Unknown Source:-1) at scala.Option.map(Option.scala:230) at org.apache.spark.serializer.SerializerManager.wrapForEncryption(SerializerManager.scala:151) at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:134) at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:163) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:306) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:171) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.executor.Executor$TaskRunner$$Lambda$1605.2076642299.apply(Unknown Source:-1) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.lang.Thread.run(Thread.java:829) Exception 'java.lang.ClassNotFoundException' occurred in thread 'driver-heartbeater' at sun.reflect.misc.MethodUtil.findClass(MethodUtil.java:328) at 
sun.reflect.misc.MethodUtil.findClass(MethodUtil.java:328) at sun.reflect.misc.MethodUtil.loadClass(MethodUtil.java:309) at java.lang.ClassLoader.loadClass(ClassLoader.java:527) at jdk.internal.misc.Unsafe.defineClass0(Unsafe.java:-1) at jdk.internal.misc.Unsafe.defineClass(Unsafe.java:1192) at jdk.internal.reflect.ClassDefiner.defineClass(ClassDefiner.java:63) at jdk.internal.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:400) at jdk.internal.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:394) at
Re: [PR] chore: add a utils method to getColumnReader with SQLConf [datafusion-comet]
viirya commented on PR #360: URL: https://github.com/apache/datafusion-comet/pull/360#issuecomment-203549 Hmm, I think @parthchandra meant that by adding SQLConf, `common` will depend on Spark. Does it conflict with the direction to make `common` not Spark dependent? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] explain plan disables some cases in ScanExecRule [datafusion-comet]
viirya commented on issue #322: URL: https://github.com/apache/datafusion-comet/issues/322#issuecomment-200855 Closed it now. Thanks @jc4x4 for reminding. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] explain plan disables some cases in ScanExecRule [datafusion-comet]
viirya closed issue #322: explain plan disables some cases in ScanExecRule URL: https://github.com/apache/datafusion-comet/issues/322 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Jdk 11 in readme [datafusion-comet]
viirya commented on code in PR #366: URL: https://github.com/apache/datafusion-comet/pull/366#discussion_r1586603568 ## README.md: ## @@ -63,7 +63,7 @@ Linux, Apple OSX (Intel and M1) ## Requirements - Apache Spark 3.2, 3.3, or 3.4 -- JDK 8 and up +- JDK 11 and up (JDK 8 should be supported, but development is on JDK 11) Review Comment: I think the safe option is to use the JDKs supported by the Spark version you are working on. It is good to make this clear here to reduce confusion. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Jdk 11 in readme [datafusion-comet]
viirya commented on PR #366: URL: https://github.com/apache/datafusion-comet/pull/366#issuecomment-2088874522 Thanks @edmondop -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Implement Spark-compatible CAST between integer types [datafusion-comet]
ganeshkumar269 commented on PR #340: URL: https://github.com/apache/datafusion-comet/pull/340#issuecomment-2088848565 Hi @andygrove, I have added a check before we fetch sparkInvalidValue, defaulting it to EMPTY_STRING if ':' is not present. I also added comments explaining why we check for the presence of the 'overflow' string. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
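For context on the overflow handling being discussed in this PR, here is a minimal sketch (hypothetical names, not the PR's actual code) of how an ANSI-mode integer narrowing cast can differ from legacy truncation, which is why tests may match on an "overflow" substring in the error message:

```rust
// Hypothetical sketch of Spark-style CAST from i32 to i16: ANSI mode raises
// an error on overflow, while legacy mode wraps around like Java's (short)
// narrowing conversion.
#[derive(Debug, PartialEq)]
enum EvalMode {
    Legacy,
    Ansi,
}

fn cast_i32_to_i16(v: i32, eval_mode: EvalMode) -> Result<i16, String> {
    match i16::try_from(v) {
        Ok(x) => Ok(x),
        Err(_) if eval_mode == EvalMode::Ansi => {
            // the error text contains "overflow", so callers/tests can check for it
            Err(format!("{v} caused overflow when cast to SMALLINT"))
        }
        // legacy: two's-complement truncation, matching `(short) v` in Java
        Err(_) => Ok(v as i16),
    }
}

fn main() {
    assert_eq!(cast_i32_to_i16(123, EvalMode::Ansi), Ok(123i16));
    assert_eq!(cast_i32_to_i16(40000, EvalMode::Legacy), Ok(-25536));
    assert!(cast_i32_to_i16(40000, EvalMode::Ansi).is_err());
    println!("ok");
}
```

The exact wording of Spark's ANSI error differs between Spark versions, which is one reason matching on a substring such as "overflow" (as discussed above) can be more robust than comparing full messages.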