Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2135163952 > > Shall we file a follow on ticket for it? I don't think we need to fix it in this PR (as this PR doesn't make the situation any better or worse, from what I can tell). > > Thanks @alamb, I agree with you. Fixing it in a follow-up ticket is better. Filed https://github.com/apache/datafusion/issues/10695 to track -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2135141343 Thanks again @alamb @jayzhan211 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
jayzhan211 merged PR #10661: URL: https://github.com/apache/datafusion/pull/10661 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
jayzhan211 commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2135103078 Thanks, @goldmedal and @alam. I think we can merge this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2133438677 > Shall we file a follow on ticket for it? I don't think we need to fix it in this PR (as this PR doesn't make the situation any better or worse, from what I can tell). Thanks @alamb, I aggress with you. Fix it in a follow-up ticket is better. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2133374806 > So, how about setting UDAF names to lowercase (e.g., median, first_value)? This way, we don't need another uppercase alias. What do you think? This makes sense to me Shall we file a follow on ticket for it? I don't think we need to fix it in this PR (as this PR doesn't make the situation any better or worse, from what I can tell). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2133291003 > `first_value` is not added in alias because the function is not fully converted to UDAF yet, so it will fail some tests. #10648 has alias for lower case > > so as `approx_median` since it is not UDAF yet. > Thanks for the information! @jayzhan211 > > I think adding a lowercase alias for MEDIAN is a workaround to handle the function name case insensitivity issue for UDAF > > Why adding a lowercase alias is a workaround for UDAF? Without alias, we will need to check the lowercase in many places. Since I thought we needed to list all case patterns for the function name (e.g., `Median`, `MEDian`), I called it a workaround. However, I found I was wrong. After some testing, I discovered something interesting. For DataFusion, if we query a function ``` SELECT Median(xx) FROM tt ``` The function call will be transferred to `median`. Even query it use ``` SELECT MEDIAN(xx) FROM tt ``` It will also be changed to `median`. The uppercase name is only be matched when invoke this function with qutoes ``` SELECT "MEDIAN"(xx) FROM tt ``` So, how about setting UDAF names to lowercase (e.g., `median`, `first_value`)? This way, we don't need another uppercase alias. What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
jayzhan211 commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2132459487 > That's why we don't need to add a lowercase alias for `FIRST_VALUE` in `first_last.rs`. It seems `FIRST_VALUE` has been redesigned as a UDAF but hasn't been removed from the AggregationFunction list. I'm not sure if this is an issue. `first_value` is not added in alias because the function is not fully converted to UDAF yet, so it will fail some tests. #10648 has alias for lower case so as `approx_median` since it is not UDAF yet. > I think adding a lowercase alias for MEDIAN is a workaround to handle the function name case insensitivity issue for UDAF Why adding a lowercase alias is a workaround for UDAF? Without alias, we will need to check the lowercase in many places. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2132256147 I think I got it now. In #10644, `MEDIAN` was removed from the AggregationFunction list and reimplemented as a UDAF. It looks like the list of aggregation functions handles some name mapping for case insensitivity. See: https://github.com/apache/datafusion/blob/52c4f3cce68afc45f25599b3917e06f0c6b63192/datafusion/expr/src/aggregate_function.rs#L117 https://github.com/apache/datafusion/blob/52c4f3cce68afc45f25599b3917e06f0c6b63192/datafusion/expr/src/aggregate_function.rs#L171 That's why we don't need to add a lowercase alias for `FIRST_VALUE` in `first_last.rs`. It seems `FIRST_VALUE` has been redesigned as a UDAF but hasn't been removed from the AggregationFunction list. I'm not sure if this is an issue. I reverted the change for `median` and removed the name testing for case insensitivity. However, I think adding a lowercase alias for `MEDIAN` is a workaround to handle the function name case insensitivity issue for UDAF. Maybe we need a formal way to handle this, but I'm not sure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2132241107 I'm confused about the new `median` function. I'm not sure why the name is case-sensitive. ``` Planning query: Plan("Invalid function 'median'.\nDid you mean 'MEDIAN'?\n\n ``` I removed the duplicate name as I did for other functions in 4086688. However, the planner can't resolve the lowercase function name. I have tried and confirmed that other functions (e.g., `first_value` and `approx_median`) are case-insensitive. Only `median` is case-sensitive. Do you have any idea? @alamb I found this function was created in #10644. Gentle ping @jayzhan211 , do you have any idea about it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1615197488 ## datafusion/functions-aggregate/src/lib.rs: ## @@ -96,3 +96,31 @@ pub fn register_all(registry: &mut dyn FunctionRegistry) -> Result<()> { Ok(()) } + +#[cfg(test)] +mod tests { +use crate::all_default_aggregate_functions; +use datafusion_common::Result; +use std::collections::HashSet; + +#[test] +fn test_no_duplicate_name() -> Result<()> { +let mut names = HashSet::new(); +for func in all_default_aggregate_functions() { +assert!( +names.insert(func.name().to_string().to_lowercase()), Review Comment: I think the function name and alias should be case-insensitive. Therefore, lowercase them here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1615170917 ## datafusion/core/src/execution/context/mod.rs: ## @@ -2860,6 +2863,57 @@ mod tests { Ok(()) } +#[tokio::test] +async fn test_register_default_functions() -> Result<()> { +let config = SessionConfig::new(); Review Comment: Good idea. I'll add this test and remove the test for SessionState::new. Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1615146751 ## datafusion/core/src/execution/context/mod.rs: ## @@ -2860,6 +2863,57 @@ mod tests { Ok(()) } +#[tokio::test] +async fn test_register_default_functions() -> Result<()> { +let config = SessionConfig::new(); Review Comment: While I understand the proposal to test with code like `SessionState::new` , I think it is problematic because if SessionState::new is changed, then this test will no longer match what it is doing. I think the core of this test is verifying that there are no duplicate names in the alias lists. Thus, perhaps we could write tests for `all_default_functions`, etc like ```rust let mut names = HashSet::new(); for func in all_default_functions() { assert(names.insert(func.name()).is_none(), "func.name duplicated") for alias in func.aliases() { assert(names.insert(alias).is_none(), "alias duplicated") } } ``` What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2132157444 > I'm not sure, but maybe we should also insert aliases in MemoryFunctionRegistry. What do you think? I think this would make sense -- though maybe we can do it as a follow on PR -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2131238078 Thanks, @alamb! Actually, I tried to add some tests for `MemoryFunctionRegistry` before I fixed it, but I found everything was fine. I guess the reason is that we don't insert aliases when using `MemoryFunctionRegistry#register_udf`. https://github.com/apache/datafusion/blob/4709fc65f7debc143696fa3a23ab6569ec8a383c/datafusion/execution/src/registry.rs#L174-L176 Then, I found that the aliases are only inserted by `SessionState` in https://github.com/apache/datafusion/blob/4709fc65f7debc143696fa3a23ab6569ec8a383c/datafusion/core/src/execution/context/mod.rs#L2278-L2283 That's why I added tests for `SessionState`. I'm not sure, but maybe we should also insert aliases in `MemoryFunctionRegistry`. What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2131195735 Thank you @goldmedal 🙏 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614546056 ## datafusion/core/src/execution/context/mod.rs: ## @@ -2860,6 +2863,57 @@ mod tests { Ok(()) } +#[tokio::test] +async fn test_register_default_functions() -> Result<()> { +let config = SessionConfig::new(); +let catalog_list = +Arc::new(MemoryCatalogProviderList::new()) as Arc; +let mut new_state = SessionState { +session_id: Uuid::new_v4().to_string(), +analyzer: Analyzer::new(), +optimizer: Optimizer::new(), +physical_optimizers: PhysicalOptimizer::new(), +query_planner: Arc::new(DefaultQueryPlanner {}), +catalog_list, +table_functions: HashMap::new(), +scalar_functions: HashMap::new(), +aggregate_functions: HashMap::new(), +window_functions: HashMap::new(), +serializer_registry: Arc::new(EmptySerializerRegistry), +table_option_namespace: TableOptions::default_from_session_config( +config.options(), +), +config, +execution_props: ExecutionProps::new(), +runtime_env: Arc::new(RuntimeEnv::default()), +table_factories: HashMap::new(), +function_factory: None, +}; + +for function in all_default_functions() { Review Comment: This is a nice test -- though it is basically a copy of SessionContext::new I wonder if we could make a test that is a bit simpler for example by creating a new https://docs.rs/datafusion/latest/datafusion/execution/registry/struct.MemoryFunctionRegistry.html Something like (untested) ```rust let registry = MemoryFunctionRegistry::new(); for function in all_default_array_functions() { let existing_function = new_state.register_udf(function); assert!(existing_function.is_none(), "{} was already registered", function.name() } // and similarly for the aggregate and array functins ``` 🤔 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614539989 ## datafusion/core/src/execution/context/mod.rs: ## @@ -2893,21 +2893,21 @@ mod tests { for function in all_default_functions() { let udf = new_state.register_udf(function).unwrap(); if let Some(udf) = udf { -assert!(false, "Function {} already registered", udf.name()); +unreachable!("Function {} already registered", udf.name()); } } for function in all_default_array_functions() { let udf = new_state.register_udf(function).unwrap(); if let Some(udf) = udf { -assert!(false, "Function {} already registered", udf.name()); +unreachable!("Function {} already registered", udf.name()); } } for function in all_default_aggregate_functions() { let udaf = new_state.register_udaf(function).unwrap(); if let Some(udaf) = udaf { -assert!(false, "Function {} already registered", udaf.name()); +unreachable!("Function {} already registered", udaf.name()); Review Comment: Clippy suggests that I use `panic` or `unreachable` instead of `assert`. I'm not sure which one is better. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614539989 ## datafusion/core/src/execution/context/mod.rs: ## @@ -2893,21 +2893,21 @@ mod tests { for function in all_default_functions() { let udf = new_state.register_udf(function).unwrap(); if let Some(udf) = udf { -assert!(false, "Function {} already registered", udf.name()); +unreachable!("Function {} already registered", udf.name()); } } for function in all_default_array_functions() { let udf = new_state.register_udf(function).unwrap(); if let Some(udf) = udf { -assert!(false, "Function {} already registered", udf.name()); +unreachable!("Function {} already registered", udf.name()); } } for function in all_default_aggregate_functions() { let udaf = new_state.register_udaf(function).unwrap(); if let Some(udaf) = udaf { -assert!(false, "Function {} already registered", udaf.name()); +unreachable!("Function {} already registered", udaf.name()); Review Comment: Clippy suggests me to use `panic` or `unreachable` instead of `assert`. I'm not sure which one is better. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614539494 ## datafusion/optimizer/src/push_down_limit.rs: ## @@ -65,7 +65,6 @@ impl OptimizerRule for PushDownLimit { }; let Limit { skip, fetch, input } = limit; -let input = input; Review Comment: I'm not sure why the clippy check in CI didn't detect this but it's showed when I ran clippy in my local. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal opened a new pull request, #10661: URL: https://github.com/apache/datafusion/pull/10661 ## Which issue does this PR close? Closes #10658 ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? yes Also checked starting the CLI with the DEBUG level. ``` RUST_LOG=DEBUG ./target/debug/datafusion-cli ``` ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org