Re: [PR] Introduce expr builder for aggregate function [datafusion]
jayzhan211 commented on code in PR #10560: URL: https://github.com/apache/datafusion/pull/10560#discussion_r1611638831 ## datafusion-examples/examples/udaf_expr.rs: ## @@ -0,0 +1,45 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use datafusion::{ +execution::{ +config::SessionConfig, +context::{SessionContext, SessionState}, +}, +functions_aggregate::{expr_fn::first_value, register_all}, +}; + +use datafusion_common::Result; +use datafusion_expr::col; + +#[tokio::main] +async fn main() -> Result<()> { +let ctx = SessionContext::new(); +let config = SessionConfig::new(); +let mut state = SessionState::new_with_config_rt(config, ctx.runtime_env()); +let _ = register_all(&mut state); + +let first_value_udaf = state.aggregate_functions().get("FIRST_VALUE").unwrap(); +let first_value_builder = first_value_udaf Review Comment: The downside (?) of this version is that it needs an internal function `first_value_udaf` (non-public). If the user can't get `first_value_udaf ` easily, it is not so trivial to use this expr style, and another small issue is `call` always expects vec of expr now. But, there are no builder functions like `first_value_builder`, so it may make less noise to the user (?). 
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
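The lookup-then-call flow being discussed can be sketched with a std-only model. All names below (`FunctionRegistry`, `AggregateUdf`, `call`) are illustrative stand-ins, not DataFusion's real API; the sketch only shows why the user needs a public way to obtain the UDAF before building the expression, and that `call` takes a `Vec` of expressions.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for DataFusion's `Expr` and `AggregateUDF` types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    AggregateCall { name: String, args: Vec<Expr> },
}

struct AggregateUdf {
    name: String,
}

impl AggregateUdf {
    // `call` takes a `Vec<Expr>`, mirroring the review's point that the
    // builder always expects a vec of expressions.
    fn call(&self, args: Vec<Expr>) -> Expr {
        Expr::AggregateCall {
            name: self.name.clone(),
            args,
        }
    }
}

struct FunctionRegistry {
    udafs: HashMap<String, AggregateUdf>,
}

impl FunctionRegistry {
    fn new() -> Self {
        let mut udafs = HashMap::new();
        udafs.insert(
            "FIRST_VALUE".to_string(),
            AggregateUdf {
                name: "FIRST_VALUE".to_string(),
            },
        );
        FunctionRegistry { udafs }
    }

    // The user must look the UDAF up by name before building the
    // expression, which is the ergonomic concern raised in the review.
    fn aggregate_function(&self, name: &str) -> Option<&AggregateUdf> {
        self.udafs.get(name)
    }
}

fn main() {
    let registry = FunctionRegistry::new();
    let first_value = registry.aggregate_function("FIRST_VALUE").unwrap();
    let expr = first_value.call(vec![Expr::Column("a".to_string())]);
    println!("{:?}", expr);
}
```

In DataFusion itself the lookup would go through the session state's function registry; the model only illustrates the two-step ergonomics under discussion.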
Re: [PR] refactor: simplify converting List DataTypes to `ScalarValue` [datafusion]
jonahgao commented on code in PR #10675: URL: https://github.com/apache/datafusion/pull/10675#discussion_r1615025038 ## datafusion/common/src/scalar/mod.rs: ## @@ -3089,46 +3089,21 @@ impl TryFrom<&DataType> for ScalarValue { Box::new(value_type.as_ref().try_into()?), ), // `ScalaValue::List` contains single element `ListArray`. -DataType::List(field) => ScalarValue::List( -new_null_array( -&DataType::List(Arc::new(Field::new( -"item", -field.data_type().clone(), -true, -))), -1, -) -.as_list::() -.to_owned() -.into(), -), -// 'ScalarValue::LargeList' contains single element `LargeListArray -DataType::LargeList(field) => ScalarValue::LargeList( -new_null_array( -&DataType::LargeList(Arc::new(Field::new( -"item", -field.data_type().clone(), -true, -))), -1, -) -.as_list::() -.to_owned() -.into(), -), +DataType::List(field_ref) => ScalarValue::List(Arc::new( +GenericListArray::new_null(field_ref.clone(), 1), +)), +// `ScalarValue::LargeList` contains single element `LargeListArray`. +DataType::LargeList(field_ref) => ScalarValue::LargeList(Arc::new( +GenericListArray::new_null(field_ref.clone(), 1), +)), // `ScalaValue::FixedSizeList` contains single element `FixedSizeList`. -DataType::FixedSizeList(field, _) => ScalarValue::FixedSizeList( -new_null_array( -&DataType::FixedSizeList( -Arc::new(Field::new("item", field.data_type().clone(), true)), -1, Review Comment: Here, it should be the fixed length instead of always being `1`.
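The review point — that a null scalar for a fixed-size list must use the length stored in the type, not a hardcoded `1` — can be modeled in a few lines. The enum below is an illustrative stand-in for Arrow's `DataType`, not the real arrow-rs API.

```rust
// Simplified stand-in for Arrow's `DataType` list variants.
#[derive(Debug)]
enum DataType {
    List,
    LargeList,
    FixedSizeList(i32), // carries the fixed element count
}

// How many child elements a single null value of this type occupies.
fn null_list_len(dt: &DataType) -> i32 {
    match dt {
        // Variable-size lists: a single null entry, so length 1 is correct.
        DataType::List | DataType::LargeList => 1,
        // Fixed-size lists: the stored length must be propagated,
        // instead of always being 1.
        DataType::FixedSizeList(len) => *len,
    }
}

fn main() {
    assert_eq!(null_list_len(&DataType::List), 1);
    assert_eq!(null_list_len(&DataType::LargeList), 1);
    assert_eq!(null_list_len(&DataType::FixedSizeList(3)), 3);
    println!("fixed-size list null uses its stored length");
}
```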
[PR] refactor: simplify converting List DataTypes to `ScalarValue` [datafusion]
jonahgao opened a new pull request, #10675: URL: https://github.com/apache/datafusion/pull/10675 ## Which issue does this PR close? N/A ## Rationale for this change `ScalarValue::try_from(dt: &DataType)` is commonly used to create a NULL `ScalarValue` for a given DataType. The processing of List DataTypes is a bit complex and can be simplified. ## What changes are included in this PR? ## Are these changes tested? Yes ## Are there any user-facing changes? No
Re: [I] Upgrade object_store to 0.10.1 [datafusion]
Samrose-Ahmed commented on issue #10673: URL: https://github.com/apache/datafusion/issues/10673#issuecomment-2132048170 I think `parquet` needs to be updated too. I tried it quickly locally and needed to update `pyo3` to 0.21, but got some weird type errors I couldn't understand.
Re: [PR] Introduce Sum UDAF [datafusion]
jayzhan211 commented on code in PR #10651: URL: https://github.com/apache/datafusion/pull/10651#discussion_r1614978844 ## datafusion/physical-expr-common/src/aggregate/mod.rs: ## @@ -292,16 +307,38 @@ impl AggregateExpr for AggregateFunctionExpr { is_distinct: self.is_distinct, input_type: &self.input_type, args_num: self.args.len(), +name: &self.name, }; self.fun.groups_accumulator_supported(args) } fn create_groups_accumulator(&self) -> Result> { -self.fun.create_groups_accumulator() +let args = AccumulatorArgs { +data_type: &self.data_type, +schema: &self.schema, +ignore_nulls: self.ignore_nulls, +sort_exprs: &self.sort_exprs, +is_distinct: self.is_distinct, +input_type: &self.input_type, +args_num: self.args.len(), +name: &self.name, +}; +self.fun.create_groups_accumulator(args) } fn order_bys(&self) -> Option<&[PhysicalSortExpr]> { -(!self.ordering_req.is_empty()).then_some(&self.ordering_req) +if self.fun.has_ordering_requirements() && !self.ordering_req.is_empty() { +return Some(&self.ordering_req); +} +None Review Comment: I found that most of the functions like Sum should not expect to get `ordering_req`, even if it is non-empty. If `ordering_req` is returned, I will get an unexpected expression here https://github.com/apache/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/physical-plan/src/aggregates/mod.rs#L1011-L1013
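The guard in the quoted diff can be sketched with a self-contained model: only report an ordering requirement when the function actually has one, so order-insensitive aggregates like Sum expose none even if `ordering_req` is non-empty. Here `has_ordering_requirements` is a plain field and `String` stands in for `PhysicalSortExpr`; both are simplifications of the real trait.

```rust
// Illustrative stand-in for the physical aggregate expression.
struct AggregateFunctionExpr {
    has_ordering_requirements: bool,
    ordering_req: Vec<String>, // stand-in for Vec<PhysicalSortExpr>
}

impl AggregateFunctionExpr {
    // Mirrors the guard in the diff: both conditions must hold
    // before any ordering requirement is reported.
    fn order_bys(&self) -> Option<&[String]> {
        if self.has_ordering_requirements && !self.ordering_req.is_empty() {
            return Some(&self.ordering_req);
        }
        // Functions like Sum ignore input ordering, so expose none.
        None
    }
}

fn main() {
    let sum = AggregateFunctionExpr {
        has_ordering_requirements: false,
        ordering_req: vec!["c1 ASC".to_string()],
    };
    // Even with a non-empty ordering_req, Sum reports no ordering.
    println!("sum order_bys: {:?}", sum.order_bys());

    let first_value = AggregateFunctionExpr {
        has_ordering_requirements: true,
        ordering_req: vec!["c1 ASC".to_string()],
    };
    println!("first_value order_bys: {:?}", first_value.order_bys());
}
```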
Re: [PR] feat: add hex scalar function [datafusion-comet]
tshauck commented on code in PR #449: URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1614977230 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1038,6 +1038,20 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } } + test("hex") { +Seq(true, false).foreach { dictionaryEnabled => + withTempDir { dir => +val path = new Path(dir.toURI.toString, "hex.parquet") +makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 1) + +withParquetTable(path.toString, "tbl") { + // _9 and _10 (uint8 and uint16) not supported + checkSparkAnswerAndOperator( +"SELECT hex(_1), hex(_2), hex(_3), hex(_4), hex(_5), hex(_6), hex(_7), hex(_8), hex(_11), hex(_12), hex(_13), hex(_14), hex(_15), hex(_16), hex(_17), hex(_18), hex(_19), hex(_20) FROM tbl") Review Comment: I tried that already and got test failures because the native function isn't being recognized w/ scalars. The same output I mentioned w/ `trim` above.
[I] `test_sort_1k_mem` failed with too many open file error when I run locally [datafusion]
jayzhan211 opened a new issue, #10674: URL: https://github.com/apache/datafusion/issues/10674 ### Describe the bug This has been annoying me for a while: I get this error when running locally, and the failure **is not shown in CI** ``` failures: fuzz_cases::sort_fuzz::test_sort_1k_mem stdout thread 'fuzz_cases::sort_fuzz::test_sort_1k_mem' panicked at datafusion/core/tests/fuzz_cases/sort_fuzz.rs:148:63: called `Result::unwrap()` on an `Err` value: Execution("Failed to create partition file at \"/var/folders/73/gkp858j550z_tt8bx1m8k67wgn/T/.tmpXo6QCe/.tmpv6tET2\": Os { code: 24, kind: Uncategorized, message: \"Too many open files\" }") note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace failures: fuzz_cases::sort_fuzz::test_sort_1k_mem test result: FAILED. 23 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 11.56s ``` ### To Reproduce `cargo test --lib --tests --bins --features avro,json` ### Expected behavior I expect the test to pass locally. What setting should I use? ### Additional context _No response_
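The `Os { code: 24, ... "Too many open files" }` error usually means the shell's per-process file-descriptor limit is lower than what the spilling sort test needs. Raising it with `ulimit` before running the tests is the usual workaround; this is a sketch assuming a POSIX shell, and 4096 is an arbitrary choice, not a documented requirement.

```shell
# Check the current per-process limit on open file descriptors
# (often as low as 256 on macOS).
ulimit -n

# Raise it for this shell session; 4096 is usually enough for the
# spill files the sort fuzz tests create.
ulimit -n 4096

# Then re-run the failing suite in the same shell:
#   cargo test --lib --tests --bins --features avro,json
```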
[I] Upgrade object_store to 0.10.1 [datafusion]
Samrose-Ahmed opened a new issue, #10673: URL: https://github.com/apache/datafusion/issues/10673 ### Is your feature request related to a problem or challenge? object_store 0.10.1 includes some updates for better retrying with multipart uploads. ### Describe the solution you'd like Upgrade object_store to 0.10.1 ### Describe alternatives you've considered _No response_ ### Additional context _No response_
Re: [PR] Move Median to `functions-aggregate` and Introduce Numeric signature [datafusion]
jayzhan211 commented on code in PR #10644: URL: https://github.com/apache/datafusion/pull/10644#discussion_r1614957716 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -871,6 +871,33 @@ select median(distinct c) from t; statement ok drop table t; +# optimize distinct median to group by +statement ok +create table t(c int) as values (1), (1), (1), (1), (2), (2), (3), (3); + +query TT +explain select median(distinct c) from t; + +logical_plan +01)Projection: MEDIAN(alias1) AS MEDIAN(DISTINCT t.c) +02)--Aggregate: groupBy=[[]], aggr=[[MEDIAN(alias1)]] +03)Aggregate: groupBy=[[t.c AS alias1]], aggr=[[]] Review Comment: This shows that the distinct is optimized into a group by.
Re: [I] ColumnReader.loadVector should initiate CometDictionary after re-import arrays [datafusion-comet]
viirya closed issue #474: ColumnReader.loadVector should initiate CometDictionary after re-import arrays URL: https://github.com/apache/datafusion-comet/issues/474
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya commented on PR #473: URL: https://github.com/apache/datafusion-comet/pull/473#issuecomment-2131456605 Merged. Thanks @andygrove @sunchao
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya merged PR #473: URL: https://github.com/apache/datafusion-comet/pull/473
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
codecov-commenter commented on PR #473: URL: https://github.com/apache/datafusion-comet/pull/473#issuecomment-2131417599 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `0%` with `4 lines` in your changes are missing coverage. Please review. > Project coverage is 34.05%. Comparing base [(`9125e6a`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/9125e6a04dce7c86bb0e4f4f297c8c2b70a39559?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`9a2a851`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/9a2a851dfd80d30242f323b9a7bd96385df183e9?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 2 commits behind head on main. 
| [Files](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [...in/java/org/apache/comet/parquet/ColumnReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FColumnReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbHVtblJlYWRlci5qYXZh) | 0.00% | [4 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #473 +/- ## = Coverage 34.05% 34.05% Complexity 859 859 = Files 116 116 Lines 38680 38679 -1 Branches 8568 8567 -1 = Hits 13171 13171 + Misses 22746 22745 -1 Partials 2763 2763 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
Re: [PR] build(deps): bump mimalloc from 0.1.39 to 0.1.41 [datafusion-python]
dependabot[bot] commented on PR #657: URL: https://github.com/apache/datafusion-python/pull/657#issuecomment-2131417589 Superseded by #719.
Re: [PR] build(deps): bump mimalloc from 0.1.39 to 0.1.41 [datafusion-python]
dependabot[bot] closed pull request #657: build(deps): bump mimalloc from 0.1.39 to 0.1.41 URL: https://github.com/apache/datafusion-python/pull/657
[PR] build(deps): bump mimalloc from 0.1.39 to 0.1.42 [datafusion-python]
dependabot[bot] opened a new pull request, #719: URL: https://github.com/apache/datafusion-python/pull/719 Bumps [mimalloc](https://github.com/purpleprotocol/mimalloc_rust) from 0.1.39 to 0.1.42. Release notes Sourced from https://github.com/purpleprotocol/mimalloc_rust/releases";>mimalloc's releases. Version 0.1.42 Changes MiMalloc v2.1.6 Expose usable_size and version. Credits https://github.com/nathaniel-daniel";>@nathaniel-daniel. Link with libatomic on armv6-linux. Credits https://github.com/notorca";>@notorca. Add no_thp option for Linux/Android. Credits https://github.com/devnexen";>@devnexen. Version 0.1.41 Changes Fix _mi_option_last Feature gate arena in extended Version 0.1.40 Changes Mimalloc v2.1.4. Add arena support. Commits https://github.com/purpleprotocol/mimalloc_rust/commit/a9c410c248859d3f55fbb7d29a88f28cdf296f6b";>a9c410c v0.1.42 https://github.com/purpleprotocol/mimalloc_rust/commit/3ba384fe63b9b4096bf427bc1cafd3a2059f1853";>3ba384f Merge pull request https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/117";>#117 from nathaniel-daniel/fix-pr-106 https://github.com/purpleprotocol/mimalloc_rust/commit/310ffbef61e6647e134b849a5616475ad7475c98";>310ffbe Gate usable_size behind extended feature https://github.com/purpleprotocol/mimalloc_rust/commit/bb3dce149307594f7bcfbbe7e564c9345db6c88f";>bb3dce1 Merge branch 'purpleprotocol:master' into fix-pr-106 https://github.com/purpleprotocol/mimalloc_rust/commit/679b170bdfdd0659df0e0fa8977a9b93222054af";>679b170 Merge pull request https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/115";>#115 from notorca/master https://github.com/purpleprotocol/mimalloc_rust/commit/b541013625430f3bfa193448470b27bee7b801c3";>b541013 Merge pull request https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/118";>#118 from nathaniel-daniel/fix-issue-114 https://github.com/purpleprotocol/mimalloc_rust/commit/4ae150e5fe8066f6e92c5ca79a6b167282325995";>4ae150e Merge pull request 
https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/119";>#119 from nathaniel-daniel/version https://github.com/purpleprotocol/mimalloc_rust/commit/7a15df134eeb87379bb67b65aa4fdf2d76935e16";>7a15df1 Test extended feature on ci https://github.com/purpleprotocol/mimalloc_rust/commit/e1b2306728ac67354d2301bc040d877bb31cd391";>e1b2306 Run cargo fmt https://github.com/purpleprotocol/mimalloc_rust/commit/7d1366d56e8b4857382740f293d8ae58fd97f1a4";>7d1366d Expose version function Additional commits viewable in https://github.com/purpleprotocol/mimalloc_rust/compare/v0.1.39...v0.1.42";>compare view [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=mimalloc&package-manager=cargo&previous-version=0.1.39&new-version=0.1.42)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
[PR] build(deps): bump parking_lot from 0.12.2 to 0.12.3 [datafusion-python]
dependabot[bot] opened a new pull request, #718: URL: https://github.com/apache/datafusion-python/pull/718 Bumps [parking_lot](https://github.com/Amanieu/parking_lot) from 0.12.2 to 0.12.3. Changelog Sourced from https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md";>parking_lot's changelog. parking_lot 0.12.3 (2024-05-24) Export types provided by arc_lock feature (https://redirect.github.com/Amanieu/parking_lot/issues/442";>#442) Commits https://github.com/Amanieu/parking_lot/commit/a29dd3d4ff661f73fd5cf638584bc4c5de2ad347";>a29dd3d Release parking_lot 0.12.3 https://github.com/Amanieu/parking_lot/commit/f7efcae51161c25f7839cddd20cc1b809441e5e8";>f7efcae Merge pull request https://redirect.github.com/Amanieu/parking_lot/issues/442";>#442 from iwanders/add-arc_lock-feature-top-level-exports https://github.com/Amanieu/parking_lot/commit/c357017decfa0f21ca597a4958745ad0999cf9eb";>c357017 Export types provided by arc_lock feature. See full diff in https://github.com/Amanieu/parking_lot/compare/0.12.2...0.12.3";>compare view [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=parking_lot&package-manager=cargo&previous-version=0.12.2&new-version=0.12.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
[//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Re: [PR] Improve `ParquetExec` and related documentation [datafusion]
comphead commented on code in PR #10647: URL: https://github.com/apache/datafusion/pull/10647#discussion_r1614830679 ## datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs: ## @@ -20,35 +20,38 @@ use arrow_schema::{Schema, SchemaRef}; use std::fmt::Debug; use std::sync::Arc; -/// Factory of schema adapters. +/// Factory for creating [`SchemaAdapter`] /// -/// Provides means to implement custom schema adaptation. +/// This interface provides a way to implement custom schema adaptation logic +/// for ParquetExec (for example, to fill missing columns with default value +/// other than null) pub trait SchemaAdapterFactory: Debug + Send + Sync + 'static { /// Provides `SchemaAdapter` for the ParquetExec. fn create(&self, schema: SchemaRef) -> Box; } -/// A utility which can adapt file-level record batches to a table schema which may have a schema +/// Adapt file-level [`RecordBatch`]es to a table schema, which may have a schema /// obtained from merging multiple file-level schemas. /// /// This is useful for enabling schema evolution in partitioned datasets. /// /// This has to be done in two stages. /// -/// 1. Before reading the file, we have to map projected column indexes from the table schema to -///the file schema. +/// 1. Before reading the file, we have to map projected column indexes from the +///table schema to the file schema. /// -/// 2. After reading a record batch we need to map the read columns back to the expected columns -///indexes and insert null-valued columns wherever the file schema was missing a colum present -///in the table schema. +/// 2. After reading a record batch map the read columns back to the expected +///columns indexes and insert null-valued columns wherever the file schema was +///missing a colum present in the table schema. Review Comment: ```suggestion ///missing a column present in the table schema. ``` -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya commented on code in PR #473: URL: https://github.com/apache/datafusion-comet/pull/473#discussion_r1614819511 ## common/src/main/java/org/apache/comet/parquet/ColumnReader.java: ## @@ -230,15 +230,15 @@ public CometDecodedVector loadVector() { // return plain vector. currentVector = cometVector; return currentVector; - } else if (dictionary == null) { -// There is dictionary from native side but the Java side dictionary hasn't been -// initialized yet. -Dictionary arrowDictionary = dictionaryProvider.lookup(dictionaryEncoding.getId()); -CometPlainVector dictionaryVector = -new CometPlainVector(arrowDictionary.getVector(), useDecimal128, isUuid); -dictionary = new CometDictionary(dictionaryVector); } + // We should already re-initiate `CometDictionary` here because `Data.importVector` API will + // release the previous dictionary vector and create a new one. + Dictionary arrowDictionary = dictionaryProvider.lookup(dictionaryEncoding.getId()); + CometPlainVector dictionaryVector = + new CometPlainVector(arrowDictionary.getVector(), useDecimal128, isUuid); + dictionary = new CometDictionary(dictionaryVector); + Review Comment: The test failure is in TPCDSQuerySuite in #437. I don't have a simple reproducible test case for it. But I think the change is straightforward and safe. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya commented on PR #473: URL: https://github.com/apache/datafusion-comet/pull/473#issuecomment-2131403965 cc @sunchao -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] ColumnReader.loadVector should initiate CometDictionary after re-import arrays [datafusion-comet]
viirya opened a new issue, #474: URL: https://github.com/apache/datafusion-comet/issues/474 ### Describe the bug While debugging some test failures in #437, I found that `CometScanExec` returns empty dictionary values after the first batch in the failed queries. This is because the `Data.importVector` API call clears the existing dictionary vector in the given `dictionaryProvider`. ### Steps to reproduce _No response_ ### Expected behavior _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya opened a new pull request, #473: URL: https://github.com/apache/datafusion-comet/pull/473 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove merged PR #461: URL: https://github.com/apache/datafusion-comet/pull/461 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Add quote-style parameter for CSV options [datafusion]
DDtKey commented on issue #10669: URL: https://github.com/apache/datafusion/issues/10669#issuecomment-2131388179 I think this might be labeled with `good first issue`, there are links to the code that needs to be changed and it is also possible to write `sqllogictest` similar to https://github.com/apache/datafusion/pull/10671. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] ArrayList data is leaked to outside unnest logical plan from the inner unnest logical plan that comes before a Limit [datafusion]
duongcongtoai commented on issue #10672: URL: https://github.com/apache/datafusion/issues/10672#issuecomment-2131375997 I'm taking it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] ArrayList data is leaked to outside unnest logical plan from the inner unnest logical plan with limit applied [datafusion]
duongcongtoai opened a new issue, #10672: URL: https://github.com/apache/datafusion/issues/10672 ### Describe the bug Given this query
```
select unnest(struct_c1c0), unnest(list_c2c0) from (
  select unnest(column1) as struct_c1c0, unnest(column2)['c0'] as list_c2c0 from temp limit 1
);
```
where `list_c2c0` is the result of an unnest logical plan followed by a limit
```
| logical_plan | Unnest: lists[unnest(list_c2c0)] structs[unnest(struct_c1c0)]                                                          |
|              |   Projection: unnest(temp.column1) AS unnest(struct_c1c0), get_field(unnest(column2), Utf8("c0")) AS unnest(list_c2c0) |
|              |     Limit: skip=0, fetch=1                                                                                             |
|              |       Unnest: lists[unnest(temp.column1), unnest(column2)] structs[]                                                   |
|              |         Projection: temp.column1 AS unnest(temp.column1), temp.column2 AS unnest(column2)                              |
|              |           TableScan: temp projection=[column1, column2]                                                                |
```
then unnesting on this column `list_c2c0` again will cause the error
```
Arrow error: Invalid argument error: all columns in a record batch must have the same length
```
### To Reproduce
```
> create table temp as values ([struct('a')], [struct([1,2]),struct([2,3])]);
0 row(s) fetched. Elapsed 0.008 seconds.

> select unnest(column1) as struct_c1c0, unnest(column2)['c0'] as list_c2c0 from temp limit 1;
+-------------+-----------+
| struct_c1c0 | list_c2c0 |
+-------------+-----------+
| {c0: a}     | [1, 2]    |
+-------------+-----------+
1 row(s) fetched. Elapsed 0.010 seconds.

> select unnest(struct_c1c0), unnest(list_c2c0) from (
  select unnest(column1) as struct_c1c0, unnest(column2)['c0'] as list_c2c0 from temp limit 1
);
Arrow error: Invalid argument error: all columns in a record batch must have the same length
```
### Expected behavior Correct result returned
```
> select unnest(struct_c1c0), unnest(list_c2c0) from (
  select unnest(column1) as struct_c1c0, unnest(column2)['c0'] as list_c2c0 from temp limit 1
);
+------------------------+-------------------+
| unnest(struct_c1c0).c0 | unnest(list_c2c0) |
+------------------------+-------------------+
| a                      | 1                 |
| a                      | 2                 |
+------------------------+-------------------+
```
### Additional context The problem may be somewhere around this line
```
https://github.com/apache/datafusion/blob/34eda15b73a9e278af8844b30ed2f1c21c10359c/datafusion/physical-plan/src/unnest.rs#L431
```
The context of this function is that it tries to downcast the generic Array into its underlying type and use the underlying values as the unnested column values. However, this downcast somehow leaks all the values which were previously skipped at the Limit operator, thus leaking the extra elements [2,3]. If we add these debug lines
```
println!(
    "dyn array length: {} and data {:?}",
    list_arrays[0].len(),
    list_arrays[0]
);
println!(
    "down casted array length {} and data {:?}",
    typed_arrays[0].values().len(),
    typed_arrays[0].values()
);
```
the result will be
```
dyn array length: 1 and data ListArray
[
  PrimitiveArray
  [
    1,
    2,
  ],
]
down casted array length 4 and data PrimitiveArray
[
  1,
  2,
  2,
  3,
]
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
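The offset-ignoring downcast described in this issue can be modeled without arrow-rs at all. The sketch below is a hypothetical, simplified stand-in for a `ListArray` (a flat child-values buffer plus offsets, with a logical length reduced by the limit), not the actual Arrow API; it shows why reading the child `values` buffer directly leaks the `[2, 3]` elements that the Limit operator skipped:

```rust
// Simplified model of an Arrow ListArray: a shared child-values buffer plus
// offsets. This is NOT the arrow-rs API, just a model of the memory layout.
struct ListArrayModel {
    values: Vec<i64>,    // flattened child values for all lists ever written
    offsets: Vec<usize>, // offsets[i]..offsets[i + 1] is the i-th list
    len: usize,          // logical number of lists (after LIMIT, may be < offsets.len() - 1)
}

impl ListArrayModel {
    /// Buggy path: returns the entire child buffer, leaking elements that
    /// belong to rows skipped by the limit.
    fn all_values(&self) -> &[i64] {
        &self.values
    }

    /// Correct path: respects the offsets of the logical rows only.
    fn unnested(&self) -> &[i64] {
        &self.values[self.offsets[0]..self.offsets[self.len]]
    }
}

fn main() {
    // Two lists were written ([1, 2] and [2, 3]) but LIMIT 1 keeps only the first.
    let arr = ListArrayModel {
        values: vec![1, 2, 2, 3],
        offsets: vec![0, 2, 4],
        len: 1,
    };
    // The raw buffer still holds four values, including the skipped [2, 3] ...
    assert_eq!(arr.all_values(), &[1, 2, 2, 3]);
    // ... while honoring the offsets yields only the surviving row's list.
    assert_eq!(arr.unnested(), &[1, 2]);
}
```

The mismatch between the four leaked child values and the single logical row is exactly what produces the "all columns in a record batch must have the same length" error above.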
[PR] fix: pass `quote` parameter to CSV writer [datafusion]
DDtKey opened a new pull request, #10671: URL: https://github.com/apache/datafusion/pull/10671 ## Which issue does this PR close? Closes #10670 ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? `sqllogictest` has been added which fails without these changes ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614789127 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -622,14 +590,89 @@ impl Cast { self.eval_mode, from_type, to_type, -)? +) +} +_ if Self::is_datafusion_spark_compatible(from_type, to_type) => { +// use DataFusion cast only when we know that it is compatible with Spark +Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?) } _ => { -// when we have no Spark-specific casting we delegate to DataFusion -cast_with_options(&array, to_type, &CAST_OPTIONS)? +// we should never reach this code because the Scala code should be checking +// for supported cast operations and falling back to Spark for anything that +// is not yet supported +Err(CometError::Internal(format!( +"Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}" +))) } }; -Ok(spark_cast(cast_result, from_type, to_type)) +Ok(spark_cast(cast_result?, from_type, to_type)) +} + +/// Determines if DataFusion supports the given cast in a way that is +/// compatible with Spark +fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool { +if from_type == to_type { +return true; +} +match from_type { +DataType::Boolean => matches!( +to_type, +DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Utf8 Review Comment: This test relies on a cast that we do not yet support and enables `COMET_CAST_ALLOW_INCOMPATIBLE` to allow it. I will revert the last change and add a comment about this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614788921 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -622,14 +590,89 @@ impl Cast { self.eval_mode, from_type, to_type, -)? +) +} +_ if Self::is_datafusion_spark_compatible(from_type, to_type) => { +// use DataFusion cast only when we know that it is compatible with Spark +Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?) } _ => { -// when we have no Spark-specific casting we delegate to DataFusion -cast_with_options(&array, to_type, &CAST_OPTIONS)? +// we should never reach this code because the Scala code should be checking +// for supported cast operations and falling back to Spark for anything that +// is not yet supported +Err(CometError::Internal(format!( +"Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}" +))) } }; -Ok(spark_cast(cast_result, from_type, to_type)) +Ok(spark_cast(cast_result?, from_type, to_type)) +} + +/// Determines if DataFusion supports the given cast in a way that is +/// compatible with Spark +fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool { +if from_type == to_type { +return true; +} +match from_type { +DataType::Boolean => matches!( +to_type, +DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Utf8 Review Comment: Removing that case causes a test failure: ``` - scalar subquery *** FAILED *** (8 seconds, 253 milliseconds) Cause: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 410.0 failed 1 times, most recent failure: Lost task 0.0 in stage 410.0 (TID 1286) (192.168.64.23 executor driver): org.apache.comet.CometNativeException: Execution error: Comet Internal Error: Native cast invoked for unsupported cast from Int32 to Decimal128(38, 10) ``` -- This is an automated message from the Apache Git 
Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
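The dispatch order discussed in this review can be sketched as follows. All type and function names here are illustrative placeholders rather than Comet's real API; the point is only the pattern of delegating to DataFusion's cast solely for casts known to be Spark-compatible, and returning an internal error otherwise (which is what surfaced as the `Int32 to Decimal128(38, 10)` failure quoted above):

```rust
// Placeholder type list for illustration; not Comet's actual DataType enum.
#[derive(Debug, PartialEq, Clone, Copy)]
enum DataType { Boolean, Int8, Int32, Int64, Float64, Utf8, Decimal128 }

/// Allow-list of casts where DataFusion's behavior matches Spark's.
fn is_datafusion_spark_compatible(from: DataType, to: DataType) -> bool {
    if from == to {
        return true;
    }
    match from {
        // Boolean -> numeric/string casts behave the same in both engines.
        DataType::Boolean => matches!(
            to,
            DataType::Int8 | DataType::Int32 | DataType::Int64 | DataType::Float64 | DataType::Utf8
        ),
        _ => false,
    }
}

/// Sketch of the fallback order: delegate only when known compatible,
/// otherwise surface an internal error (the Scala planner is expected to
/// have fallen back to Spark before this point).
fn cast(from: DataType, to: DataType) -> Result<String, String> {
    if is_datafusion_spark_compatible(from, to) {
        Ok(format!("datafusion cast {from:?} -> {to:?}"))
    } else {
        Err(format!(
            "Native cast invoked for unsupported cast from {from:?} to {to:?}"
        ))
    }
}

fn main() {
    assert!(cast(DataType::Boolean, DataType::Utf8).is_ok());
    // Mirrors the reported failure: Int32 -> Decimal128 is not on the allow-list.
    assert!(cast(DataType::Int32, DataType::Decimal128).is_err());
}
```

The design trade-off the review discusses is visible here: an allow-list fails loudly on any cast nobody has verified, instead of silently delegating a cast whose semantics may diverge from Spark's.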
[PR] feat: Add "Comet Fuzz" fuzz-testing utility [datafusion-comet]
andygrove opened a new pull request, #472: URL: https://github.com/apache/datafusion-comet/pull/472 ## Which issue does this PR close? N/A ## Rationale for this change Comet Fuzz is a standalone project that generates random data and queries, executes the queries against Spark with Comet disabled and enabled, and checks for incompatibilities. Although it is a simple tool, it has already been useful in finding many bugs. Comet Fuzz is inspired by the [SparkFuzz](https://ir.cwi.nl/pub/30222) paper from Databricks and CWI. ## What changes are included in this PR? New standalone `comet-fuzz` project. ## How are these changes tested? Manually tested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] CSV `quote` parameter is ignored [datafusion]
DDtKey opened a new issue, #10670: URL: https://github.com/apache/datafusion/issues/10670 ### Describe the bug The `quote` parameter of the CSV writer is not getting passed to the `arrow-csv` writer: https://github.com/apache/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/common/src/file_options/csv_writer.rs#L48-L75 ### To Reproduce
```sql
COPY (select '12,3', 'abc,d' ) TO 'test.csv' STORED AS csv OPTIONS ('quote' '|')
```
results in
```csv
"Utf8(""12,3"")","Utf8(""abc,d"")"
"12,3","abc,d"
```
The result is the same if we use `write_csv` with the option passed. ### Expected behavior The writer should use the appropriate quote char. ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Add quote-style parameter for CSV options [datafusion]
DDtKey opened a new issue, #10669: URL: https://github.com/apache/datafusion/issues/10669 ### Is your feature request related to a problem or challenge? CSV writers usually support configuration of the quote style/mode with the following options:
- `Always`
- `Necessary`
- `Never`
- `NonNumeric`

Sometimes this just needs to be controlled, and for now the only way to change it is to re-iterate through the result file(s) in order to store the content with the desired quote style. You can find such configs in many libraries:
- the `csv` crate ([`QuoteStyle`](https://docs.rs/csv/latest/csv/enum.QuoteStyle.html))
- Python's `csv` module (constants like [`QUOTE_ALL`](https://docs.python.org/3/library/csv.html#csv.QUOTE_ALL))
- Apache Commons CSV for Java ([`QuoteMode`](https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html))

### Describe the solution you'd like Expose a way to pass the `QuoteStyle` enum along with other properties like `quote`, `delimiter`, etc. (as part of `CsvOptions`). However, keep in mind that the configuration only makes sense for writers, not readers. That shouldn't be an issue to support, because `datafusion` relies on `arrow-csv`, which uses the `csv` crate under the hood.
- requires updating `arrow-csv` to accept a quote-style param (sub-issue for `arrow-rs`?)
  - add to `WriterBuilder`: https://github.com/apache/arrow-rs/blob/4b5d9bfc958c06fb1ff71d90ba58497e965eff40/arrow-csv/src/writer.rs#L191-L214
  - pass to `csv::Writer`: https://github.com/apache/arrow-rs/blob/4b5d9bfc958c06fb1ff71d90ba58497e965eff40/arrow-csv/src/writer.rs#L402-L408
- update `datafusion`
  - add parameter to `CsvOptions`: https://github.com/apache/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/common/src/config.rs#L1554-L1570
  - pass to `arrow-csv`: https://github.com/apache/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/common/src/file_options/csv_writer.rs#L48-L75

### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
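To illustrate the semantics of the four quote styles listed in the issue, here is a minimal, self-contained sketch of per-field quoting. It is not the `csv` crate's or `arrow-csv`'s implementation (the real change would thread a `QuoteStyle` through `WriterBuilder`); the function name and behavior details are assumptions for illustration only:

```rust
// Illustrative quote styles; mirrors the option names from the issue, not a real API.
#[derive(Clone, Copy)]
enum QuoteStyle { Always, Necessary, Never, NonNumeric }

/// Quote a single CSV field according to the chosen style. `delimiter` and
/// `quote` are configurable, as in the linked `WriterBuilder` APIs.
fn write_field(field: &str, style: QuoteStyle, delimiter: char, quote: char) -> String {
    // Quoting is required when the field contains the delimiter, the quote
    // char, or a newline.
    let needs = field.contains(delimiter) || field.contains(quote) || field.contains('\n');
    let is_numeric = !field.is_empty()
        && field.chars().all(|c| c.is_ascii_digit() || c == '.' || c == '-');
    let quoted = match style {
        QuoteStyle::Always => true,
        QuoteStyle::Necessary => needs,
        QuoteStyle::Never => false, // may emit ambiguous output; that is the caller's choice
        QuoteStyle::NonNumeric => !is_numeric,
    };
    if quoted {
        // Double any embedded quote chars, per the usual CSV convention.
        let escaped = field.replace(quote, &format!("{quote}{quote}"));
        format!("{quote}{escaped}{quote}")
    } else {
        field.to_string()
    }
}

fn main() {
    // With the custom quote char '|' from issue #10670's reproducer:
    assert_eq!(write_field("12,3", QuoteStyle::Necessary, ',', '|'), "|12,3|");
    assert_eq!(write_field("abc", QuoteStyle::Necessary, ',', '"'), "abc");
    assert_eq!(write_field("42", QuoteStyle::NonNumeric, ',', '"'), "42");
    assert_eq!(write_field("abc", QuoteStyle::Always, ',', '"'), "\"abc\"");
}
```

This also shows why the option belongs to writers only: readers can accept any of these outputs, but only the writer decides which form to emit.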
Re: [PR] Minor: Improve ObjectStoreUrl docs + examples [datafusion]
goldmedal commented on PR #10619: URL: https://github.com/apache/datafusion/pull/10619#issuecomment-2131302454 On a related note, I found that the usage flow of ObjectStore in DataFusion involves registering the source as a table and then querying the table. This approach makes sense. However, I think it might be possible to provide something like a table function or other methods to allow users to dynamically decide which path they want to scan. There are similar features in DuckDB and Spark: - DuckDB: `SELECT * FROM read_parquet('file-path/test.parquet')` - Spark: `spark.read.parquet('')` Does DataFusion already have this feature, or did I miss it? If not, I think it would be a valuable feature for users of object storage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Library Guide: Building LogicalPlans [datafusion]
edmondop commented on issue #7306: URL: https://github.com/apache/datafusion/issues/7306#issuecomment-2131289288 > > What do you think we should do? I can add that paragraph, but maybe we also want to link the examples? What else do you think it's needed to close this > > Those both sound like excellent ideas to close off this issue. Thank you @edmondop I noticed that the original work from @andygrove referenced examples in the docs, and these seem to be pretty "unique" in the sense that no other doc page seems to take this approach; see https://github.com/apache/datafusion/tree/main/docs/src What do you think about moving those into examples and linking them from the docs? Do we have a preprocessor macro to include the Rust code in a markdown file upon compilation, or maybe to ensure that the embedded snippets compile and execute correctly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] feat: Implement ANSI support for UnaryMinus [datafusion-comet]
vaibhawvipul opened a new pull request, #471: URL: https://github.com/apache/datafusion-comet/pull/471 ## Which issue does this PR close? Closes #465 . ## Rationale for this change ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support Substrait's VirtualTables [datafusion]
jonahgao commented on code in PR #10531: URL: https://github.com/apache/datafusion/pull/10531#discussion_r1614673020 ## datafusion/substrait/tests/cases/roundtrip_logical_plan.rs: ## @@ -685,6 +685,19 @@ async fn roundtrip_literal_struct() -> Result<()> { .await } +#[tokio::test] +async fn roundtrip_values() -> Result<()> { +assert_expected_plan( +"VALUES (1, 'a', [[-213.1, NULL, 5.5, 2.0, 1.0], []], STRUCT(true, 1, CAST(NULL AS STRING))), (NULL, NULL, NULL, NULL)", +"Values: (Int64(1), Utf8(\"a\"), List([[-213.1, , 5.5, 2.0, 1.0], []]), Struct({c0:true,c1:1,c2:})), (Int64(NULL), Utf8(NULL), List(), Struct({c0:,c1:,c2:}))") +.await +} + +#[tokio::test] +async fn roundtrip_empty_relation() -> Result<()> { +roundtrip("SELECT * FROM (VALUES (1)) LIMIT 0").await Review Comment: We might need a test to verify the schemas. For this test, the plan str on both sides would just be `EmptyRelation`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Convert Variance Population to UDAF [datafusion]
jayzhan211 opened a new issue, #10668: URL: https://github.com/apache/datafusion/issues/10668 ### Is your feature request related to a problem or challenge? Similar to #10667, but for the Variance Population function ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support Substrait's VirtualTables [datafusion]
jonahgao commented on code in PR #10531: URL: https://github.com/apache/datafusion/pull/10531#discussion_r1614671030 ## datafusion/substrait/tests/cases/roundtrip_logical_plan.rs: ## @@ -685,6 +685,19 @@ async fn roundtrip_literal_struct() -> Result<()> { .await } +#[tokio::test] +async fn roundtrip_values() -> Result<()> { +assert_expected_plan( +"VALUES (1, 'a', [[-213.1, NULL, 5.5, 2.0, 1.0], []], STRUCT(true, 1, CAST(NULL AS STRING))), (NULL, NULL, NULL, NULL)", +"Values: (Int64(1), Utf8(\"a\"), List([[-213.1, , 5.5, 2.0, 1.0], []]), Struct({c0:true,c1:1,c2:})), (Int64(NULL), Utf8(NULL), List(), Struct({c0:,c1:,c2:}))") +.await +} + +#[tokio::test] +async fn roundtrip_empty_relation() -> Result<()> { +roundtrip("SELECT * FROM (VALUES (1)) LIMIT 0").await Review Comment: I've slightly modified this test and printed out the schema on the consumer side, but it seems to be incorrect. ```rust #[tokio::test] async fn roundtrip_empty_relation() -> Result<()> { roundtrip("SELECT * FROM (VALUES ([STRUCT(1 as a)], 2)) LIMIT 0").await } ``` ```sh plan2 schema: DFSchema { inner: Schema { fields: [ Field { name: "column1", data_type: List(Field { name: "item", data_type: Struct([Field { name: "c0", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, # Its name should be "column2" Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, None], functional_dependencies: FunctionalDependencies { deps: [] } } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Convert `Variance Sample` to UDAF [datafusion]
jayzhan211 opened a new issue, #10667: URL: https://github.com/apache/datafusion/issues/10667 ### Is your feature request related to a problem or challenge? Part of #8708
1. Move the variance aggregate expression in `datafusion/physical-expr/src/aggregate/variance.rs` to `functions-aggregate`.
2. Add an expression API in `roundtrip_expr_api` in `datafusion/proto/tests/cases/roundtrip_logical_plan.rs`

Similar to #10372 and #10418 ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
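For context on what such a variance accumulator computes, below is a minimal sketch of the streaming (Welford) update that aggregate implementations of `VAR_SAMP` and `VAR_POP` commonly use. This is an illustrative sketch, not DataFusion's actual `variance.rs` code; the struct and method names are placeholders:

```rust
/// Streaming variance accumulator using Welford's online algorithm: the kind
/// of state a variance UDAF accumulator maintains across input batches.
#[derive(Default)]
struct VarianceAccumulator {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the current mean
}

impl VarianceAccumulator {
    /// Fold one value into the running state in O(1) time and memory.
    fn update(&mut self, value: f64) {
        self.count += 1;
        let delta = value - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (value - self.mean);
    }

    /// Sample variance (VAR_SAMP): divides by n - 1; undefined for n < 2.
    fn variance_sample(&self) -> Option<f64> {
        if self.count < 2 { None } else { Some(self.m2 / (self.count - 1) as f64) }
    }

    /// Population variance (VAR_POP): divides by n; undefined for n = 0.
    fn variance_pop(&self) -> Option<f64> {
        if self.count == 0 { None } else { Some(self.m2 / self.count as f64) }
    }
}

fn main() {
    let mut acc = VarianceAccumulator::default();
    for v in [1.0, 2.0, 3.0, 4.0] {
        acc.update(v);
    }
    // mean = 2.5, m2 = 5.0 for this input
    assert!((acc.variance_sample().unwrap() - 5.0 / 3.0).abs() < 1e-12);
    assert!((acc.variance_pop().unwrap() - 1.25).abs() < 1e-12);
}
```

The one-pass update is why such aggregates fit DataFusion's accumulator model: only `(count, mean, m2)` needs to be serialized as intermediate state when merging partial aggregates.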
Re: [PR] Add reference visitor `TreeNode` APIs, change `ExecutionPlan::children()` and `PhysicalExpr::children()` return references [datafusion]
alamb commented on PR #10543: URL: https://github.com/apache/datafusion/pull/10543#issuecomment-2131268529 Awesome! I plan to merge this PR tomorrow unless anyone else would like time to review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve `ParquetExec` and related documentation [datafusion]
alamb commented on code in PR #10647: URL: https://github.com/apache/datafusion/pull/10647#discussion_r1614666750 ## datafusion/core/src/datasource/physical_plan/parquet/mod.rs: ## @@ -642,11 +714,22 @@ fn should_enable_page_index( .unwrap_or(false) } -/// Factory of parquet file readers. +/// Interface for creating [`AsyncFileReader`]s to read parquet files. +/// +/// This interface is used by [`ParquetOpener`] in order to create readers for +/// parquet files. Implementations of this trait can be used to provide custom Review Comment: Excellent idea. I did so ## datafusion/core/src/datasource/physical_plan/parquet/mod.rs: ## @@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics; pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper}; pub use statistics::{RequestedStatistics, StatisticsConverter}; -/// Execution plan for scanning one or more Parquet partitions +/// Execution plan for reading one or more Parquet files. +/// +/// ```text +/// ▲ +/// │ +/// │ Produce a stream of +/// │ RecordBatches +/// │ +/// ┌───┐ +/// │ │ +/// │ ParquetExec │ +/// │ │ +/// └───┘ +/// ▲ +/// │ Asynchronously read from one +/// │ or more parquet files via +/// │ ObjectStore interface +/// │ +/// │ +/// .───. +/// │ ) +/// │`───'│ +/// │ObjectStore │ +/// │.───.│ +/// │ ) +/// `───' +/// +/// ``` +/// # Features +/// +/// Supports the following optimizations: +/// +/// * Multi-threaded (aka multi-partition): read from one or more files in +/// parallel. Can read concurrently from multiple row groups from a single file. +/// +/// * Predicate push down: skips row groups and pages based on +/// min/max/null_counts in the row group metadata, the page index and bloom +/// filters. +/// +/// * Projection pushdown: reads and decodes only the columns required. +/// +/// * Limit pushdown: stop execution early after some number of rows are read. +/// +/// * Custom readers: controls I/O for accessing pages. 
See Review Comment: good call -- updated
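The "predicate push down" feature listed in the documentation above can be sketched concretely. The struct and function names below are hypothetical illustrations, not DataFusion's API: a filter such as `col > literal` can skip an entire row group using only the min/max statistics stored in the parquet metadata, without decoding any data.

```rust
// Hypothetical stand-in for the per-row-group statistics that parquet
// metadata provides for a column.
struct RowGroupStats {
    min: i64,
    max: i64,
}

// Returns true if the row group can be skipped for the predicate
// `col > literal`: if every value in the group is <= literal, no row matches.
fn can_skip_greater_than(stats: &RowGroupStats, literal: i64) -> bool {
    stats.max <= literal
}

fn main() {
    let pruned = RowGroupStats { min: 0, max: 5 };
    let kept = RowGroupStats { min: 0, max: 9 };
    println!("{} {}", can_skip_greater_than(&pruned, 5), can_skip_greater_than(&kept, 5));
}
```

The same idea extends to page-level pruning and bloom filters; only the granularity of the statistics changes.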
Re: [PR] Support Substrait's VirtualTables [datafusion]
jonahgao commented on code in PR #10531: URL: https://github.com/apache/datafusion/pull/10531#discussion_r1614666495 ## datafusion/substrait/src/logical_plan/producer.rs: ## @@ -512,6 +572,62 @@ pub fn to_substrait_rel( } } +fn to_substrait_named_struct(schema: &DFSchemaRef) -> Result { +// Substrait wants a list of all field names, including nested fields from structs, +// also from within lists and maps. However, it does not want the list and map field names +// themselves - only structs are considered to have useful names. +fn names_dfs(dtype: &DataType) -> Result> { +match dtype { +DataType::Struct(fields) => { +let mut names = Vec::new(); +for field in fields { +names.push(field.name().to_string()); +names.extend(names_dfs(field.data_type())?); +} +Ok(names) +} +DataType::List(l) => names_dfs(l.data_type()), Review Comment: I think we also need to consider `DataType::LargeList` here because `to_substrait_type` supports it.
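A minimal, self-contained sketch of the reviewer's suggestion, using a hand-rolled stand-in for Arrow's `DataType` (the real code operates on `arrow_schema::DataType` and returns `Result`): handling `LargeList` in the same match arm as `List` keeps the depth-first name collection consistent with what `to_substrait_type` supports.

```rust
// Simplified mock of Arrow's DataType, for illustration only.
enum DataType {
    Int64,
    Struct(Vec<(String, DataType)>),
    List(Box<DataType>),
    LargeList(Box<DataType>),
}

// Depth-first collection of struct field names, skipping list/map names
// themselves, with the suggested `LargeList` arm alongside `List`.
fn names_dfs(dtype: &DataType) -> Vec<String> {
    match dtype {
        DataType::Struct(fields) => {
            let mut names = Vec::new();
            for (name, child) in fields {
                names.push(name.clone());
                names.extend(names_dfs(child));
            }
            names
        }
        // both list widths are treated identically: recurse into the element type
        DataType::List(elem) | DataType::LargeList(elem) => names_dfs(elem),
        _ => Vec::new(),
    }
}

fn main() {
    let t = DataType::LargeList(Box::new(DataType::Struct(vec![
        ("a".to_string(), DataType::Int64),
        ("b".to_string(), DataType::List(Box::new(DataType::Int64))),
    ])));
    println!("{:?}", names_dfs(&t));
}
```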
Re: [I] [Epic] Unify `AggregateFunction` Interface (remove built in list of `AggregateFunction` s), improve the system [datafusion]
jayzhan211 commented on issue #8708: URL: https://github.com/apache/datafusion/issues/8708#issuecomment-2131265763 > After #10648 and #10389 I think we have a pretty good set of examples of how to move aggregates out of the core (thanks to all the foundations laid by @jayzhan211 ) > > Would it be helpful to file a few "good first issue" type tickets for some of the more straightforward aggregates (I am thinking the statistical variance etc)? Before this, I think it would be nice to determine the API of aggregate expr first: https://github.com/apache/datafusion/pull/10560#discussion_r1611644593
Re: [PR] Add reference visitor `TreeNode` APIs, change `ExecutionPlan::children()` and `PhysicalExpr::children()` return references [datafusion]
peter-toth commented on code in PR #10543: URL: https://github.com/apache/datafusion/pull/10543#discussion_r1614657172 ## datafusion/physical-plan/src/work_table.rs: ## @@ -169,7 +169,7 @@ impl ExecutionPlan for WorkTableExec { &self.cache } -fn children(&self) -> Vec> { +fn children(&self) -> Vec<&Arc> { Review Comment: I opened a PR as returning a slice in `ConcreteTreeNode` is possible: https://github.com/apache/datafusion/pull/10666 But it will not work for `LogicalPlan` or `ExecutionPlan` or `PhysicalExpr`.
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1614656536 ## datafusion/sqllogictest/test_files/joins.slt: ## @@ -1778,7 +1778,7 @@ AS VALUES ('BB', 6, 1); query TII -select col1, col2, coalesce(sum_col3, 0) as sum_col3 +select col1, col2, coalesce(sum_col3, arrow_cast(0, 'UInt64')) as sum_col3 Review Comment: I think this will fail 🤔 since I disable the coercion between i64 and u64, but I'm not 100% sure
Re: [PR] Minor: Remove `GetFieldAccessSchema` [datafusion]
alamb merged PR #10665: URL: https://github.com/apache/datafusion/pull/10665
[PR] Minor: Use slice in `ConcreteTreeNode` [datafusion]
peter-toth opened a new pull request, #10666: URL: https://github.com/apache/datafusion/pull/10666 The idea of using slices to return a node's children came up here: https://github.com/apache/datafusion/pull/10543#discussion_r1614057653. While it is not possible in all `TreeNode` implementations, the 2 `ConcreteTreeNode` implementations (`ExprContext` and `PlanContext`) own their children in `Vec` form, so there is a way to return a slice like `&[Self]` (instead of the current `Vec<&Self>`).
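A hedged sketch of why the slice return works here but not for `LogicalPlan`, `ExecutionPlan`, or `PhysicalExpr`: a type that owns its children as `Vec<Self>` can hand out a borrowed `&[Self]` with no allocation, whereas children stored behind `Arc<dyn ...>` trait objects cannot be reborrowed that way. The struct below is a mock, not the real `PlanContext`.

```rust
// Mock of a ConcreteTreeNode-style context that owns its children directly.
struct Node {
    value: i32,
    children: Vec<Node>,
}

impl Node {
    // Borrowing `&[Self]` avoids allocating a fresh `Vec<&Self>` on every
    // call, which is what the current trait signature forces.
    fn children(&self) -> &[Node] {
        &self.children
    }
}

fn main() {
    let root = Node {
        value: 1,
        children: vec![
            Node { value: 2, children: vec![] },
            Node { value: 3, children: vec![] },
        ],
    };
    println!("{}", root.children().len());
}
```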
Re: [I] Support `date_bin` on timestamps with timezone, properly accounting for Daylight Savings Time [datafusion]
alamb commented on issue #10602: URL: https://github.com/apache/datafusion/issues/10602#issuecomment-2131254591 Thank you @tustvold and @Abdullahsab3 and @mhilton and @appletreeisyellow for the thoughts. From my perspective, the current (non timezone aware) `date_bin` function has the benefit of being (relatively) simple to implement and fast (as it is some arithmetic calculation on the underlying integer value, without having to do DST conversions per row). Given the differences in underlying timestamp representation between arrow and postgres, I do think some difference is likely inevitable and thus likely not a deal breaker. Here are my suggested next steps: @appletreeisyellow tries to prototype one or both proposals and see if we can get it to produce the desired results: 1. create a `to_local_time` function 2. Modify the `date_bin` function to make it timezone aware I think `to_local_time` might be the simplest approach: * A ScalarUDF * a `Signature` that takes `Timestamp(.., *)` * produces a `Timestamp(.., None)` * The `invoke` function would do the opposite of whatever `cast(Timestamp(.., None) --> Timestamp(.., TZ))` does
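A rough sketch of the "arithmetic calculation on the underlying integer value" that makes the non-timezone-aware `date_bin` fast. This is illustration only: the real DataFusion kernel also handles nanosecond units, month/year strides, and more origin handling.

```rust
// Bin a raw timestamp (e.g. seconds since the epoch) down to the start of its
// `stride`-sized bucket, measured from `origin`. Pure integer arithmetic --
// no per-row timezone or DST conversion is needed.
fn date_bin_arith(ts: i64, stride: i64, origin: i64) -> i64 {
    // floor division so timestamps before the origin still round downward
    origin + (ts - origin).div_euclid(stride) * stride
}

fn main() {
    // 3700s after the epoch falls in the [3600, 7200) hour bucket
    println!("{}", date_bin_arith(3700, 3600, 0));
}
```

The DST discussion above is exactly about when this arithmetic is insufficient: "one day" in local time is not a fixed number of seconds across a DST boundary, which is what motivates the `to_local_time` proposal.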
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
alamb commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1614654486 ## datafusion/sqllogictest/test_files/joins.slt: ## @@ -1778,7 +1778,7 @@ AS VALUES ('BB', 6, 1); query TII -select col1, col2, coalesce(sum_col3, 0) as sum_col3 +select col1, col2, coalesce(sum_col3, arrow_cast(0, 'UInt64')) as sum_col3 Review Comment: maybe also we can change this test back too
Re: [PR] Fix typo in Cargo.toml (unused manifest key: dependencies.regex.worksapce) [datafusion]
alamb commented on PR #10662: URL: https://github.com/apache/datafusion/pull/10662#issuecomment-2131244635 > oops.. my mistake. :P Many thanks. No worries!
Re: [PR] build(deps): bump syn from 2.0.63 to 2.0.64 [datafusion-python]
dependabot[bot] commented on PR #706: URL: https://github.com/apache/datafusion-python/pull/706#issuecomment-2131243514 Superseded by #717.
[PR] build(deps): bump syn from 2.0.63 to 2.0.66 [datafusion-python]
dependabot[bot] opened a new pull request, #717: URL: https://github.com/apache/datafusion-python/pull/717

Bumps [syn](https://github.com/dtolnay/syn) from 2.0.63 to 2.0.66.

Release notes (sourced from syn's releases):
- 2.0.66: Allow braced structs when parsing ExprLet (#1671)
- 2.0.65: Optimize the implementation of Fold to compile faster (#1666, #1667, #1668)
- 2.0.64: Support using ParseBuffer across catch_unwind (#1646); validate that the expression in a let-else ends in brace as required by rustc (#1648, #1649); legalize invalid const generic arguments by wrapping in braces (#1654, #1655); fix some expression precedence edge cases involving break and return in loop headers (#1656); always print closure bodies with a brace when the closure has an explicit return type (#1658); automatically insert necessary parentheses in ToTokens for Expr when required by expression precedence (#1659); support struct literal syntax in match guard expressions (#1662)

Commits:
- b992916 Release 2.0.66
- 4f0a23f Merge pull request #1671 from dtolnay/exprlet
- c6d87a7 Allow braced structs when parsing ExprLet
- 747f42f Update with proc-macro2 1.0.83's syntax tree sizes
- 9f2371e Release 2.0.65
- 4cd1813 Merge pull request #1668 from dtolnay/foldhelper
- ed54092 Eliminate gen::helper module
- eacc8ab Eliminate FoldHelper trait
- 6e20bb8 Merge pull request #1667 from dtolnay/punctuatedfold
- 9d95cab Optimize punctuated::fold

Additional commits viewable in the compare view (2.0.63...2.0.66).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options -- you can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabo
Re: [PR] build(deps): bump syn from 2.0.63 to 2.0.64 [datafusion-python]
dependabot[bot] closed pull request #706: build(deps): bump syn from 2.0.63 to 2.0.64 URL: https://github.com/apache/datafusion-python/pull/706
Re: [PR] Move Median to `functions-aggregate` and Introduce Numeric signature [datafusion]
jayzhan211 commented on code in PR #10644: URL: https://github.com/apache/datafusion/pull/10644#discussion_r1614640283 ## datafusion/functions-aggregate/Cargo.toml: ## @@ -39,6 +39,7 @@ path = "src/lib.rs" [dependencies] arrow = { workspace = true } +arrow-schema = { workspace = true } Review Comment: yes, some macros have an inevitable `arrow-schema` dependency: https://github.com/apache/arrow-rs/issues/5676.
Re: [PR] change version to 38.0.1 [datafusion-python]
andygrove merged PR #716: URL: https://github.com/apache/datafusion-python/pull/716
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614639502 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -622,14 +590,89 @@ impl Cast { self.eval_mode, from_type, to_type, -)? +) +} +_ if Self::is_datafusion_spark_compatible(from_type, to_type) => { +// use DataFusion cast only when we know that it is compatible with Spark +Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?) } _ => { -// when we have no Spark-specific casting we delegate to DataFusion -cast_with_options(&array, to_type, &CAST_OPTIONS)? +// we should never reach this code because the Scala code should be checking +// for supported cast operations and falling back to Spark for anything that +// is not yet supported +Err(CometError::Internal(format!( +"Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}" +))) } }; -Ok(spark_cast(cast_result, from_type, to_type)) +Ok(spark_cast(cast_result?, from_type, to_type)) +} + +/// Determines if DataFusion supports the given cast in a way that is +/// compatible with Spark +fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool { +if from_type == to_type { +return true; +} +match from_type { +DataType::Boolean => matches!( +to_type, +DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Utf8 +), +DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!( +to_type, +DataType::Boolean +| DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Decimal128(_, _) +| DataType::Utf8 +), +DataType::Float32 | DataType::Float64 => matches!( +to_type, +DataType::Boolean +| DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +), +DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!( +to_type, +DataType::Int8 
+| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Decimal128(_, _) +| DataType::Decimal256(_, _) +), +DataType::Utf8 => matches!(to_type, DataType::Binary), +DataType::Date32 => matches!(to_type, DataType::Utf8), +DataType::Timestamp(_, _) => { +matches!( +to_type, +DataType::Int64 | DataType::Date32 | DataType::Utf8 | DataType::Timestamp(_, _) +) +} +DataType::Binary => { +// note that this is not completely Spark compatible because +// DataFusion only supports binary data containing valid UTF-8 strings +matches!(to_type, DataType::Utf8) +} +_ => false, Review Comment: Casting from `Int64` to `Int32` for `Try` is covered here: ```rust DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!( to_type, DataType::Boolean | DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 | DataType::Float32 | DataType::Float64 | DataType::Decimal128(_, _) | DataType::Utf8 ), ```
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614638361 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -622,14 +590,89 @@ impl Cast { self.eval_mode, from_type, to_type, -)? +) +} +_ if Self::is_datafusion_spark_compatible(from_type, to_type) => { +// use DataFusion cast only when we know that it is compatible with Spark +Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?) } _ => { -// when we have no Spark-specific casting we delegate to DataFusion -cast_with_options(&array, to_type, &CAST_OPTIONS)? +// we should never reach this code because the Scala code should be checking +// for supported cast operations and falling back to Spark for anything that +// is not yet supported +Err(CometError::Internal(format!( +"Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}" +))) } }; -Ok(spark_cast(cast_result, from_type, to_type)) +Ok(spark_cast(cast_result?, from_type, to_type)) +} + +/// Determines if DataFusion supports the given cast in a way that is +/// compatible with Spark +fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool { +if from_type == to_type { +return true; +} +match from_type { +DataType::Boolean => matches!( +to_type, +DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Utf8 +), +DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!( +to_type, +DataType::Boolean +| DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Decimal128(_, _) +| DataType::Utf8 +), +DataType::Float32 | DataType::Float64 => matches!( +to_type, +DataType::Boolean +| DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 Review Comment: Yes, that is correct.
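The allowlist pattern discussed in this PR can be sketched standalone. The enum below is a small mock of Arrow's `DataType` and the compatibility pairs are illustrative only, not Comet's full matrix: the point is the shape of the dispatch, `if from == to` short-circuit plus `matches!` arms, with anything unlisted falling through to `false` so it is routed back to Spark.

```rust
// Mock of a few Arrow DataType variants, for illustration only.
#[derive(PartialEq)]
enum DataType {
    Boolean,
    Int32,
    Int64,
    Float64,
    Utf8,
    Binary,
}

// A guard like `_ if is_compatible(from, to)` can then route only the
// known-compatible pairs to DataFusion's cast kernel.
fn is_compatible(from: &DataType, to: &DataType) -> bool {
    if from == to {
        return true;
    }
    match from {
        DataType::Boolean => {
            matches!(to, DataType::Int32 | DataType::Int64 | DataType::Utf8)
        }
        DataType::Int32 | DataType::Int64 => matches!(
            to,
            DataType::Boolean
                | DataType::Int32
                | DataType::Int64
                | DataType::Float64
                | DataType::Utf8
        ),
        // gated explicitly: DataFusion only casts binary holding valid UTF-8
        DataType::Binary => matches!(to, DataType::Utf8),
        // anything unlisted is not known to be Spark-compatible
        _ => false,
    }
}

fn main() {
    println!("{}", is_compatible(&DataType::Int64, &DataType::Int32));
}
```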
[PR] Remove `GetFieldAccessSchema` [datafusion]
jayzhan211 opened a new pull request, #10665: URL: https://github.com/apache/datafusion/pull/10665 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes?
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2131238078 Thanks, @alamb! Actually, I tried to add some tests for `MemoryFunctionRegistry` before I fixed it, but I found everything was fine. I guess the reason is that we don't insert aliases when using `MemoryFunctionRegistry#register_udf`. https://github.com/apache/datafusion/blob/4709fc65f7debc143696fa3a23ab6569ec8a383c/datafusion/execution/src/registry.rs#L174-L176 Then, I found that the aliases are only inserted by `SessionState` in https://github.com/apache/datafusion/blob/4709fc65f7debc143696fa3a23ab6569ec8a383c/datafusion/core/src/execution/context/mod.rs#L2278-L2283 That's why I added tests for `SessionState`. I'm not sure, but maybe we should also insert aliases in `MemoryFunctionRegistry`. What do you think?
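A hypothetical sketch of the behavior being proposed, using hand-rolled types rather than DataFusion's real `FunctionRegistry`/`ScalarUDF`: `register_udf` inserts the function under its canonical name *and* every alias, which is what `SessionState` does today and what `MemoryFunctionRegistry` currently skips.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a UDF with a canonical name and aliases.
struct Udf {
    name: String,
    aliases: Vec<String>,
}

#[derive(Default)]
struct Registry {
    // maps every registered name (canonical or alias) to the canonical name
    udfs: HashMap<String, String>,
}

impl Registry {
    fn register_udf(&mut self, udf: &Udf) {
        self.udfs.insert(udf.name.clone(), udf.name.clone());
        // the step under discussion: also register every alias
        for alias in &udf.aliases {
            self.udfs.insert(alias.clone(), udf.name.clone());
        }
    }

    fn udf(&self, name: &str) -> Option<&String> {
        self.udfs.get(name)
    }
}

fn main() {
    let mut reg = Registry::default();
    reg.register_udf(&Udf {
        name: "to_unixtime".to_string(),
        aliases: vec!["to_unix_time".to_string()],
    });
    // lookup by alias resolves to the canonical function
    println!("{:?}", reg.udf("to_unix_time"));
}
```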
Re: [PR] Minor: Add tests showing aggregate behavior for NaNs [datafusion]
alamb commented on code in PR #10634: URL: https://github.com/apache/datafusion/pull/10634#discussion_r1614628009 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -4374,6 +4374,42 @@ GROUP BY dummy text1, text1, text1 +# Tests for aggregating with NaN values +statement ok +CREATE TABLE float_table ( +col_f32 FLOAT, +col_f32_nan FLOAT, +col_f64 DOUBLE, +col_f64_nan DOUBLE +) as VALUES +( -128.2, -128.2, -128.2, -128.2 ), +( 32768.3, arrow_cast('NAN','Float32'), 32768.3, 32768.3 ), +( 27.3,arrow_cast('NAN','Float64'), 27.3,27.3 ); Review Comment: Yes, excellent catch -- fixed in 43cb27e669d5ff28f76e70bf54447cbb76df16f2
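A plain-Rust illustration of why aggregate behavior on NaN deserves the explicit tests added in this PR: a naive comparison-based max and IEEE `f64::max` disagree once NaN is involved. This is about float semantics in general, not a claim about DataFusion's exact kernels.

```rust
// Naive max via `>`: every comparison with NaN is false, so a NaN that lands
// in the accumulator first is never replaced.
fn naive_max(values: &[f64]) -> Option<f64> {
    values.iter().copied().reduce(|acc, v| if v > acc { v } else { acc })
}

// IEEE-style max: `f64::max` returns the non-NaN operand when exactly one
// argument is NaN, so NaN inputs are effectively ignored.
fn nan_ignoring_max(values: &[f64]) -> Option<f64> {
    values.iter().copied().reduce(f64::max)
}

fn main() {
    let vals = [f64::NAN, 1.0, 3.0];
    println!("{:?} {:?}", naive_max(&vals), nan_ignoring_max(&vals));
}
```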
Re: [PR] More properly handle nullability of types/literals in Substrait [datafusion]
alamb commented on PR #10640: URL: https://github.com/apache/datafusion/pull/10640#issuecomment-2131235829 Thanks @Blizzara and @jonahgao
Re: [PR] Implement a dialect-specific rule for unparsing an identifier with or without quotes [datafusion]
alamb commented on PR #10573: URL: https://github.com/apache/datafusion/pull/10573#issuecomment-2131235504 > This shouldn't have passed checks. > > ``` > + cargo fmt --all -- --check > `cargo metadata` exited with an error: error: failed to load manifest for workspace member `/opt/dev/datafusion/datafusion/core` > referenced by workspace at `/opt/dev/datafusion/Cargo.toml` > > Caused by: > failed to load manifest for dependency `datafusion-functions` > > Caused by: > failed to parse manifest at `/opt/dev/datafusion/datafusion/functions/Cargo.toml` > > Caused by: > dependency (regex) specified without providing a local path, Git repository, version, or workspace dependency to use > ``` > > functions/Cargo.toml > > ``` > regex = { worksapce = true, optional = true } > ``` Yeah, I don't know why that is a warning and not an error -- here is a PR to fix it: https://github.com/apache/datafusion/pull/10662
Re: [PR] Minor: Csv Options Clean-up [datafusion]
alamb commented on PR #10650: URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2131235103

> Maybe we can do the following: Instead of making every TableProviderFactory responsible for implementing this behavior in create, we can add some general mechanism at the trait level (maybe via a default method implementation?) to make this fallback automatic for all TableProviderFactory implementations unless overridden. @alamb, what do you think? Any ideas come to mind?

Thank you for the explanation -- what you describe makes sense -- maybe we can update the documentation to add some of this context and that would be good enough. I haven't thought through all the nuances of general fallbacks, but given what you say perhaps it is a corner case we can worry about if/when someone actually hits it. From my perspective the most important thing is for the behavior to be clearly documented.
Re: [I] Library Guide: Building LogicalPlans [datafusion]
alamb commented on issue #7306: URL: https://github.com/apache/datafusion/issues/7306#issuecomment-2131234075

> What do you think we should do? I can add that paragraph, but maybe we also want to link the examples? What else do you think is needed to close this?

Those both sound like excellent ideas to close off this issue. Thank you @edmondop
Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]
timsaucer commented on issue #715: URL: https://github.com/apache/datafusion-python/issues/715#issuecomment-2131229316

My statement above about testing on the Rust side is likely incorrect. I ran the same test above, but loading the dataframe from a parquet file instead of creating it in memory, and the expected behavior is reproduced. If you append these lines to the bottom of the minimal example

```
df.write_parquet("save_out.parquet")
df_reread = ctx.read_parquet("save_out.parquet")
df_reread.show()
df_reread.select(col("a")["outer_1"]["inner_2"]).show()
```

you get the expected result

```
DataFrame()
+-------------------------------------+
| a                                   |
+-------------------------------------+
| {outer_1: {inner_1: 1, inner_2: 2}} |
| {outer_1: {inner_1: 1, inner_2: }}  |
| {outer_1: }                         |
+-------------------------------------+
DataFrame()
+-----------------------------+
| ?table?.a[outer_1][inner_2] |
+-----------------------------+
| 2                           |
|                             |
|                             |
+-----------------------------+
```

It also shows the original table is reproduced. I'll continue digging, but I am no longer convinced this is a Python binding issue.
Re: [I] Support `date_bin` on timestamps with timezone, properly accounting for Daylight Savings Time [datafusion]
Abdullahsab3 commented on issue #10602: URL: https://github.com/apache/datafusion/issues/10602#issuecomment-2131223669

Thanks for filing the ticket and for all the detailed explanations! Very enriching. I wonder whether the Postgres behavior is actually that bad. Though it looks weird, it is still generic enough to be widely applicable. The issue with making `date_bin` specifically timezone-aware, in my opinion, is that future or similar grouping functions (or rolling window functions) will also have to be implemented in a timezone-aware way, accounting for similar pitfalls and edge cases. The same problem will also occur in other functionality, for example if you would like to return local time. I personally think that the problem might be better tackled as an additional time functionality, which may be the same way that Postgres does it. The Postgres way of converting UTC to local time in a given timezone is:

```sql
select '2024-05-21T12:00:00Z'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels';

      timezone
---------------------
 2024-05-21 14:00:00
```

As already mentioned by Andrew, the `AT TIME ZONE` operator in Postgres converts a timezone-aware timestamp to local time (with no offset or timezone information), and local time to a timezone-aware timestamp. Though the overloaded functionalities of the `AT TIME ZONE` operator in Postgres are weird, they are definitely usable in my opinion. I also like the idea of having a separate `to_local_time`/`strip_timezone` function, though this would break DataFusion's compatibility with Postgres.
Re: [I] [EPIC] Improved support for nested / structured types (`Struct` , `List`, `ListArray`, and other Composite types) [datafusion]
alamb commented on issue #2326: URL: https://github.com/apache/datafusion/issues/2326#issuecomment-2131214534 > I added an issue to support recursive unnest: #10660, i think it shoul belong to this epic Added
Re: [PR] Add `FileScanConfig::new()` API [datafusion]
alamb commented on PR #10623: URL: https://github.com/apache/datafusion/pull/10623#issuecomment-2131213174 Thanks everyone for the reviews!
Re: [PR] Add `FileScanConfig::new()` API [datafusion]
alamb merged PR #10623: URL: https://github.com/apache/datafusion/pull/10623
Re: [I] Support convert LogicalPlan JOIN with `Using` constraint to SQL String [datafusion]
alamb commented on issue #10652: URL: https://github.com/apache/datafusion/issues/10652#issuecomment-2131211386

> Is there any plan to support them?

I think we should make plans to support them! I started collecting issues on https://github.com/apache/datafusion/issues/8661

I also filed https://github.com/apache/datafusion/issues/10663 and https://github.com/apache/datafusion/issues/10664 to track distinct and window

Looking at https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/src/unparser/plan.rs#L83-L91 it actually looks like there are several other types of plans not yet supported (like EXPLAIN, INSERT, etc...) Maybe we should file tickets for them as well
[I] Suport unparsing `LogicalPlan::Window` to SQL [datafusion]
alamb opened a new issue, #10664: URL: https://github.com/apache/datafusion/issues/10664

### Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/9726 to complete the LogicalPlan --> SQL conversion. Converting `LogicalPlan` back to `SQL` is valuable for several use cases, such as using DataFusion to programmatically create SQL -- see the [`plan_to_sql.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs) example

### Describe the solution you'd like

Support converting SQL with window functions like this:

```sql
SELECT first_value OVER (x ORDER BY y) FROM foo;
```

### Describe alternatives you've considered

The basic pattern is this (see https://github.com/apache/datafusion/pull/10371 for an example):

1. Implement the LogicalPlan --> AST reverse code in `Unparser::plan_to_sql` ([source link](https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/src/unparser/plan.rs#L411-L413))
2. Add a test in `roundtrip_statement` in `sql_integration.rs` ([source link](https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/tests/sql_integration.rs#L4619))

Note you can run the tests like

```shell
cargo test -p datafusion-sql -- roundtrip_statement
```

### Additional context

I think this is a good first issue as the pattern is well established and there are explicit instructions
[I] Suport unparsing `LogicalPlan::Distinct` to `DISTINCT` [datafusion]
alamb opened a new issue, #10663: URL: https://github.com/apache/datafusion/issues/10663

### Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/9726 to complete the LogicalPlan --> SQL conversion. Converting `LogicalPlan` back to `SQL` is valuable for several use cases, such as using DataFusion to programmatically create SQL -- see the [`plan_to_sql.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs) example

### Describe the solution you'd like

Support converting SQL like this:

```sql
SELECT DISTINCT x FROM foo;
```

### Describe alternatives you've considered

The basic pattern is this (see https://github.com/apache/datafusion/pull/10371 for an example):

1. Implement the LogicalPlan --> AST reverse code in `Unparser::plan_to_sql` ([source link](https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/src/unparser/plan.rs#L275))
2. Add a test in `roundtrip_statement` in `sql_integration.rs` ([source link](https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/tests/sql_integration.rs#L4619))

Note you can run the tests like

```shell
cargo test -p datafusion-sql -- roundtrip_statement
```

### Additional context

I think this is a good first issue as the pattern is well established and there are explicit instructions
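The LogicalPlan --> SQL direction in step 1 can be illustrated with a stdlib-only toy sketch (hypothetical types, not DataFusion's actual `Unparser` or `LogicalPlan` API): a miniature plan enum whose `Distinct` variant wraps its input and unparses by splicing `DISTINCT` into the generated `SELECT`:

```rust
// Toy plan: just enough structure to show the unparse recursion a
// Distinct variant needs. Names here are illustrative assumptions.
enum Plan {
    TableScan { table: String, column: String },
    Distinct(Box<Plan>),
}

fn plan_to_sql(plan: &Plan) -> String {
    match plan {
        Plan::TableScan { table, column } => {
            format!("SELECT {column} FROM {table}")
        }
        // Unparse the input first, then make the projection DISTINCT.
        Plan::Distinct(input) => {
            plan_to_sql(input).replacen("SELECT ", "SELECT DISTINCT ", 1)
        }
    }
}

fn main() {
    let plan = Plan::Distinct(Box::new(Plan::TableScan {
        table: "foo".to_string(),
        column: "x".to_string(),
    }));
    // Round-trip target from the issue: SELECT DISTINCT x FROM foo
    println!("{}", plan_to_sql(&plan));
}
```

The real implementation builds a sqlparser AST rather than strings, but the recursion shape (unparse the child, then wrap or annotate the result) is the same idea the roundtrip test exercises.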
Re: [PR] Fix typo in Cargo.toml (unused manifest key: dependencies.regex.worksapce) [datafusion]
jonahgao merged PR #10662: URL: https://github.com/apache/datafusion/pull/10662
Re: [I] Make TaskContext wrap SessionState [datafusion]
tustvold commented on issue #10631: URL: https://github.com/apache/datafusion/issues/10631#issuecomment-2131201943 IIRC SessionConfig is the static configuration used to create a SessionContext, which is an interior mutable wrapper around SessionState. The idea was that a query is planned against an immutable SessionState to avoid inconsistency during planning. RuntimeEnv is then for things shared across multiple sessions. As for TaskContext, I believe it was added for Ballista, where execution takes place on different nodes from planning.
Re: [I] Make TaskContext wrap SessionState [datafusion]
alamb commented on issue #10631: URL: https://github.com/apache/datafusion/issues/10631#issuecomment-2131197268 > I see TaskContext is just part of SessionState. Could we just make TaskContext wrap SessionState? I can't remember why TaskContext doesn't wrap SessionState -- maybe @tustvold does 🤔 At some point `SessionState` was in `datafusion` (the core crate) so it couldn't be referenced by subcrates.
Re: [I] Implement protobuf serialization for LogicalPlan::Unnest [datafusion]
alamb commented on issue #10639: URL: https://github.com/apache/datafusion/issues/10639#issuecomment-2131196191 Thanks @jamesmcm -- this would be a good first issue I think for someone with Rust experience looking to help with DataFusion. The protobuf serialization code is a bit hairy I find, but there are many examples of how to do it that we can follow
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2131195735 Thank you @goldmedal 🙏
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614546056

## datafusion/core/src/execution/context/mod.rs:

```diff
@@ -2860,6 +2863,57 @@ mod tests {
         Ok(())
     }
+
+    #[tokio::test]
+    async fn test_register_default_functions() -> Result<()> {
+        let config = SessionConfig::new();
+        let catalog_list =
+            Arc::new(MemoryCatalogProviderList::new()) as Arc;
+        let mut new_state = SessionState {
+            session_id: Uuid::new_v4().to_string(),
+            analyzer: Analyzer::new(),
+            optimizer: Optimizer::new(),
+            physical_optimizers: PhysicalOptimizer::new(),
+            query_planner: Arc::new(DefaultQueryPlanner {}),
+            catalog_list,
+            table_functions: HashMap::new(),
+            scalar_functions: HashMap::new(),
+            aggregate_functions: HashMap::new(),
+            window_functions: HashMap::new(),
+            serializer_registry: Arc::new(EmptySerializerRegistry),
+            table_option_namespace: TableOptions::default_from_session_config(
+                config.options(),
+            ),
+            config,
+            execution_props: ExecutionProps::new(),
+            runtime_env: Arc::new(RuntimeEnv::default()),
+            table_factories: HashMap::new(),
+            function_factory: None,
+        };
+
+        for function in all_default_functions() {
```

Review Comment: This is a nice test -- though it is basically a copy of `SessionContext::new`. I wonder if we could make a test that is a bit simpler, for example by creating a new https://docs.rs/datafusion/latest/datafusion/execution/registry/struct.MemoryFunctionRegistry.html

Something like (untested):

```rust
let mut registry = MemoryFunctionRegistry::new();
for function in all_default_array_functions() {
    let existing_function = registry.register_udf(function)?;
    assert!(
        existing_function.is_none(),
        "{} was already registered",
        function.name()
    );
}
// and similarly for the aggregate and array functions
```

🤔
Re: [PR] Update substrait requirement from 0.33.3 to 0.34.0 [datafusion]
alamb merged PR #10632: URL: https://github.com/apache/datafusion/pull/10632
Re: [PR] Move Median to `functions-aggregate` and Introduce Numeric signature [datafusion]
alamb commented on code in PR #10644: URL: https://github.com/apache/datafusion/pull/10644#discussion_r1614539590

## datafusion/functions-aggregate/src/median.rs:

```diff
@@ -15,71 +15,105 @@
 // specific language governing permissions and limitations
 // under the License.

-//! # Median
-
-use crate::aggregate::utils::{down_cast_any_ref, Hashable};
-use crate::expressions::format_state_name;
-use crate::{AggregateExpr, PhysicalExpr};
-use arrow::array::{Array, ArrayRef};
-use arrow::datatypes::{DataType, Field};
-use arrow_array::cast::AsArray;
-use arrow_array::{downcast_integer, ArrowNativeTypeOp, ArrowNumericType};
-use arrow_buffer::ArrowNativeType;
-use datafusion_common::{DataFusionError, Result, ScalarValue};
-use datafusion_expr::Accumulator;
-use std::any::Any;
 use std::collections::HashSet;
 use std::fmt::Formatter;
-use std::sync::Arc;
+use std::{fmt::Debug, sync::Arc};
+
+use arrow::array::{downcast_integer, ArrowNumericType};
+use arrow::{
+    array::{ArrayRef, AsArray},
+    datatypes::{
+        DataType, Decimal128Type, Decimal256Type, Field, Float16Type, Float32Type,
+        Float64Type,
+    },
+};
+
+use arrow::array::Array;
+use arrow::array::ArrowNativeTypeOp;
+use arrow::datatypes::ArrowNativeType;
+
+use datafusion_common::{DataFusionError, Result, ScalarValue};
+use datafusion_expr::function::StateFieldsArgs;
+use datafusion_expr::{
+    function::AccumulatorArgs, utils::format_state_name, Accumulator, AggregateUDFImpl,
+    Signature, Volatility,
+};
+use datafusion_physical_expr_common::aggregate::utils::Hashable;
+
+make_udaf_expr_and_func!(
+    Median,
+    median,
+    expression,
+    "Computes the median of a set of numbers",
+    median_udaf
+);

-/// MEDIAN aggregate expression. If using the non-distinct variation, then this uses a
```

Review Comment: I think this comment still holds and would be valuable to bring to the new `Median` UDF -- specifically, that median uses substantial memory.

## datafusion/expr/src/signature.rs:

```diff
@@ -119,6 +119,8 @@ pub enum TypeSignature {
     OneOf(Vec),
     /// Specifies Signatures for array functions
     ArraySignature(ArrayFunctionSignature),
+    /// Fixed number of arguments of numeric types
```

Review Comment: Could we please document (or link to documentation) what types are "numeric"? Is it https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#method.is_numeric ?

## datafusion/optimizer/src/single_distinct_to_groupby.rs:

```diff
@@ -257,6 +257,56 @@ impl OptimizerRule for SingleDistinctToGroupBy {
             )))
         }
     }
+    Expr::AggregateFunction(AggregateFunction {
```

Review Comment: Can we please add a test (if not already done) showing that the distinct aggregate is rewritten? Or perhaps there is already a test for median 🤔

## datafusion/functions-aggregate/Cargo.toml:

```diff
@@ -39,6 +39,7 @@ path = "src/lib.rs"

 [dependencies]
 arrow = { workspace = true }
+arrow-schema = { workspace = true }
```

Review Comment: is the issue that this is not re-exported in arrow?

## datafusion/optimizer/src/single_distinct_to_groupby.rs:

```diff
@@ -257,6 +257,56 @@ impl OptimizerRule for SingleDistinctToGroupBy {
             )))
         }
     }
+    Expr::AggregateFunction(AggregateFunction {
```

Review Comment: From what I can tell this is now a general optimization (as in it will rewrite *any* distinct user defined aggregate as well). If so, that is quite cool. Is it correct?
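The memory point in the first review comment is the crux: a non-distinct median cannot be computed incrementally, so an accumulator must buffer every input value before the final sort-and-pick step. A stdlib-only toy sketch of that (not DataFusion's actual `Median` accumulator, which works generically over Arrow numeric types):

```rust
// Toy median over i64: every value must be retained until the end,
// which is why memory use grows linearly with input size.
fn median(values: &mut Vec<i64>) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_unstable();
    let n = values.len();
    Some(if n % 2 == 1 {
        // Odd count: the middle element is the median.
        values[n / 2] as f64
    } else {
        // Even count: average the two middle elements.
        (values[n / 2 - 1] + values[n / 2]) as f64 / 2.0
    })
}

fn main() {
    let mut buffered = vec![5, 1, 4, 2, 3];
    // All five values had to be buffered to find the middle one.
    assert_eq!(median(&mut buffered), Some(3.0));
    println!("median ok");
}
```

Contrast with sum or count, whose accumulators need only O(1) state; this asymmetry is what the old doc comment on `MEDIAN aggregate expression` was warning about.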
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614539989

## datafusion/core/src/execution/context/mod.rs:

```diff
@@ -2893,21 +2893,21 @@ mod tests {
     for function in all_default_functions() {
         let udf = new_state.register_udf(function).unwrap();
         if let Some(udf) = udf {
-            assert!(false, "Function {} already registered", udf.name());
+            unreachable!("Function {} already registered", udf.name());
         }
     }
     for function in all_default_array_functions() {
         let udf = new_state.register_udf(function).unwrap();
         if let Some(udf) = udf {
-            assert!(false, "Function {} already registered", udf.name());
+            unreachable!("Function {} already registered", udf.name());
         }
     }
     for function in all_default_aggregate_functions() {
         let udaf = new_state.register_udaf(function).unwrap();
         if let Some(udaf) = udaf {
-            assert!(false, "Function {} already registered", udaf.name());
+            unreachable!("Function {} already registered", udaf.name());
```

Review Comment: Clippy suggests that I use `panic!` or `unreachable!` instead of `assert!`. I'm not sure which one is better.
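The behavior both versions test (registration returns the previously stored entry, so a duplicate surfaces as `Some`) can be sketched with a plain `HashMap`, which also shows the third option clippy would accept: asserting directly on the `Option` instead of `if let Some + unreachable!`. This is a stdlib-only toy, not DataFusion's registry API:

```rust
use std::collections::HashMap;

// Toy function registry: like HashMap::insert, registering a name that
// already exists returns the previously registered entry.
fn register(
    reg: &mut HashMap<String, &'static str>,
    name: &str,
    f: &'static str,
) -> Option<&'static str> {
    reg.insert(name.to_string(), f)
}

fn main() {
    let mut reg = HashMap::new();
    // First registration: no prior entry, so a single assert on the
    // Option replaces the if-let + unreachable! pattern.
    let prev = register(&mut reg, "median", "median_impl");
    assert!(prev.is_none(), "median was already registered");
    // A duplicate registration surfaces as Some(previous entry).
    let dup = register(&mut reg, "median", "median_impl_v2");
    assert_eq!(dup, Some("median_impl"));
    println!("registry ok");
}
```

`assert!(prev.is_none(), ...)` keeps the message and the check in one expression, which is why clippy flags `assert!(false, ...)` inside a conditional as redundant.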
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614539494

## datafusion/optimizer/src/push_down_limit.rs:

```diff
@@ -65,7 +65,6 @@ impl OptimizerRule for PushDownLimit {
     };
     let Limit { skip, fetch, input } = limit;
-    let input = input;
```

Review Comment: I'm not sure why the clippy check in CI didn't detect this, but it showed up when I ran clippy locally.
Re: [PR] Fix `NULL["field"]` for expr_API [datafusion]
alamb commented on PR #10655: URL: https://github.com/apache/datafusion/pull/10655#issuecomment-2131186851 Thank you for the speedy review @jayzhan211
Re: [PR] Fix `NULL["field"]` for expr_API [datafusion]
alamb merged PR #10655: URL: https://github.com/apache/datafusion/pull/10655
Re: [I] Error on `NULL["field_name"]`: The expression to get an indexed field is only valid for `List`, `Struct`, or `Map` types, got Null [datafusion]
alamb closed issue #10654: Error on `NULL["field_name"]`: The expression to get an indexed field is only valid for `List`, `Struct`, or `Map` types, got Null URL: https://github.com/apache/datafusion/issues/10654
Re: [PR] Fix `NULL["field"]` for expr_API [datafusion]
alamb commented on code in PR #10655: URL: https://github.com/apache/datafusion/pull/10655#discussion_r1614536504

## datafusion/functions/src/core/getfield.rs:

```diff
@@ -106,6 +106,9 @@ impl ScalarUDFImpl for GetFieldFunc {
     };
     let access_schema = GetFieldAccessSchema::NamedStructField { name: name.clone() };
     let arg_dt = args[0].get_type(schema)?;
+    if arg_dt.is_null() {
+        return Ok(DataType::Null);
+    }
     access_schema
```

Review Comment: Thanks @jayzhan211 I agree the change in https://github.com/apache/datafusion/commit/6cb45b5a9f2b53688a9754b1a7d06665ebd73a6d to inline things looks nice -- shall we make another PR?
Re: [I] Incorrect statistics read for binary columns in parquet [datafusion]
alamb closed issue #10605: Incorrect statistics read for binary columns in parquet URL: https://github.com/apache/datafusion/issues/10605
Re: [PR] Fix incorrect statistics read for binary columns in parquet [datafusion]
alamb merged PR #10645: URL: https://github.com/apache/datafusion/pull/10645
Re: [PR] feat: add substrait support for Interval types and literals [datafusion]
alamb commented on code in PR #10646: URL: https://github.com/apache/datafusion/pull/10646#discussion_r1614534288 ## datafusion/substrait/src/variation_const.rs: ## @@ -37,3 +38,58 @@ pub const DEFAULT_CONTAINER_TYPE_REF: u32 = 0; pub const LARGE_CONTAINER_TYPE_REF: u32 = 1; pub const DECIMAL_128_TYPE_REF: u32 = 0; pub const DECIMAL_256_TYPE_REF: u32 = 1; + +// For custom types +/// For [`DataType::Interval`] with [`IntervalUnit::YearMonth`]. +/// +/// An `i32` for elapsed whole months. See also [`ScalarValue::IntervalYearMonth`] +/// for the literal definition in DataFusion. +/// +/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval +/// [`IntervalUnit::YearMonth`]: datafusion::arrow::datatypes::IntervalUnit::YearMonth +/// [`ScalarValue::IntervalYearMonth`]: datafusion::common::ScalarValue::IntervalYearMonth +pub const INTERVAL_YEAR_MONTH_TYPE_REF: u32 = 1; + +/// For [`DataType::Interval`] with [`IntervalUnit::DayTime`]. +/// +/// An `i64` as: +/// - days: `i32` +/// - milliseconds: `i32` +/// +/// See also [`ScalarValue::IntervalDayTime`] for the literal definition in DataFusion. +/// +/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval +/// [`IntervalUnit::DayTime`]: datafusion::arrow::datatypes::IntervalUnit::DayTime +/// [`ScalarValue::IntervalDayTime`]: datafusion::common::ScalarValue::IntervalDayTime +pub const INTERVAL_DAY_TIME_TYPE_REF: u32 = 2; + +/// For [`DataType::Interval`] with [`IntervalUnit::MonthDayNano`]. +/// +/// An `i128` as: +/// - months: `i32` +/// - days: `i32` +/// - nanoseconds: `i64` +/// +/// See also [`ScalarValue::IntervalMonthDayNano`] for the literal definition in DataFusion. Review Comment: FYI, I think the interval implementation is changing in the next arrow release: https://github.com/apache/arrow-rs/pull/5769
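The packed layouts described in those doc comments can be illustrated with plain bit manipulation. The helpers below are illustrative only: the field ordering follows the comments above and is an assumption, not the arrow-rs API, and arrow-rs #5769 replaces these packed primitives with dedicated structs:

```rust
// IntervalDayTime: an i64 holding days (high i32) and milliseconds (low i32).
fn pack_day_time(days: i32, millis: i32) -> i64 {
    ((days as i64) << 32) | (millis as u32 as i64)
}

fn unpack_day_time(v: i64) -> (i32, i32) {
    ((v >> 32) as i32, v as i32)
}

// IntervalMonthDayNano: an i128 holding months (i32), days (i32), nanoseconds (i64).
fn pack_month_day_nano(months: i32, days: i32, nanos: i64) -> i128 {
    ((months as i128) << 96) | ((days as u32 as i128) << 64) | (nanos as u64 as i128)
}

fn unpack_month_day_nano(v: i128) -> (i32, i32, i64) {
    ((v >> 96) as i32, (v >> 64) as i32, v as i64)
}

fn main() {
    // 3 days and 1500 ms round-trip through the packed i64.
    let v = pack_day_time(3, 1500);
    assert_eq!(unpack_day_time(v), (3, 1500));
    // 14 months, 2 days, 5 seconds round-trip through the packed i128.
    let w = pack_month_day_nano(14, 2, 5_000_000_000);
    assert_eq!(unpack_month_day_nano(w), (14, 2, 5_000_000_000));
    println!("ok");
}
```

The separate fields are why these intervals cannot be compared as plain integers: a value with more days but fewer milliseconds may pack to a smaller i64 yet represent a longer duration.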
Re: [PR] fix wrong type validation on unnest expr [datafusion]
alamb merged PR #10657: URL: https://github.com/apache/datafusion/pull/10657
Re: [I] Wrong error thrown when unnesting a list of struct [datafusion]
alamb closed issue #10656: Wrong error thrown when unnesting a list of struct URL: https://github.com/apache/datafusion/issues/10656
Re: [PR] Minor: Generate the supported Spark builtin expression list into MD file [datafusion-comet]
advancedxy commented on code in PR #455: URL: https://github.com/apache/datafusion-comet/pull/455#discussion_r1614532699 ## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -217,6 +325,25 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH str shouldBe s"${getLicenseHeader()}\n# Supported Spark Expressions\n\n### group1\n - [x] f1\n - [ ] f2\n\n### group2\n - [x] f3\n - [ ] f4\n\n### group3\n - [x] f5" } + test("get sql function arguments") { +// getSqlFunctionArguments("SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'))") shouldBe Seq("TIMESTAMP('1970-01-01 00:00:01Z')") +// getSqlFunctionArguments("SELECT decode(unhex('537061726B2053514C'), 'UTF-8')") shouldBe Seq("unhex('537061726B2053514C')", "'UTF-8'") +// getSqlFunctionArguments("SELECT extract(YEAR FROM TIMESTAMP '2019-08-12 01:00:00.123456')") shouldBe Seq("'YEAR'", "TIMESTAMP '2019-08-12 01:00:00.123456'") +// getSqlFunctionArguments("SELECT exists(array(1, 2, 3), x -> x % 2 == 0)") shouldBe Seq("array(1, 2, 3)") +getSqlFunctionArguments("select to_char(454, '999')") shouldBe Seq("array(1, 2, 3)") Review Comment: This test looks wrong: the expected arguments do not match the query.
## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -54,16 +57,79 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH private val valuesPattern = """(?i)FROM VALUES(.+?);""".r private val selectPattern = """(i?)SELECT(.+?)FROM""".r + // exclude funcs Comet has no plans to support streaming in near future + // like spark streaming functions, java calls + private val outofRoadmapFuncs = +List("window", "session_window", "window_time", "java_method", "reflect") + private val sqlConf = Seq( +"spark.comet.exec.shuffle.enabled" -> "true", +"spark.sql.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding", +"spark.sql.adaptive.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding") + + // Tests to run manually as its syntax is different from usual or nested + val manualTests: Map[String, (String, String)] = Map( +"!" -> ("select true a", "select ! true from tbl"), +"%" -> ("select 1 a, 2 b", "select a + b from tbl"), Review Comment: The mapped query should be `select a % b from tbl`?
## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -54,16 +57,79 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH private val valuesPattern = """(?i)FROM VALUES(.+?);""".r private val selectPattern = """(i?)SELECT(.+?)FROM""".r + // exclude funcs Comet has no plans to support streaming in near future + // like spark streaming functions, java calls + private val outofRoadmapFuncs = +List("window", "session_window", "window_time", "java_method", "reflect") + private val sqlConf = Seq( +"spark.comet.exec.shuffle.enabled" -> "true", +"spark.sql.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding", +"spark.sql.adaptive.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding") + + // Tests to run manually as its syntax is different from usual or nested + val manualTests: Map[String, (String, String)] = Map( +"!" -> ("select true a", "select ! true from tbl"), +"%" -> ("select 1 a, 2 b", "select a + b from tbl"), Review Comment: Or maybe you can just generate the binary operators and their mappings programmatically? Such as: ```scala Seq("%", "&", ..., "|").map(x => x -> ("select 1 a, 2 b", s"select a $x b from tbl")) ``` ## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -116,20 +182,62 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH // ConstantFolding is a operator optimization rule in Catalyst that replaces expressions // that can be statically evaluated with their equivalent literal values.
dfMessage = runDatafusionCli(q) - testSingleLineQuery( -"select 'dummy' x", -s"${q.dropRight(1)}, x from tbl", -excludedOptimizerRules = - Some("org.apache.spark.sql.catalyst.optimizer.ConstantFolding")) + + manualTests.get(func.name) match { +// the test is manual query +case Some(test) => testSingleLineQuery(test._1, test._2, sqlConf = sqlConf) +case None => + // extract function arguments as a sql text + // example: + // cos(0) -> 0 + // explode_outer(array(10, 20)) -> array(10, 20) + val args = getSqlFunctionArguments(q.dropRight(1)) + val (alias
Re: [PR] Add reference visitor `TreeNode` APIs, change `ExecutionPlan::children()` and `PhysicalExpr::children()` return references [datafusion]
peter-toth commented on code in PR #10543: URL: https://github.com/apache/datafusion/pull/10543#discussion_r1614522462 ## datafusion/physical-plan/src/work_table.rs: ## @@ -169,7 +169,7 @@ impl ExecutionPlan for WorkTableExec { &self.cache } -fn children(&self) -> Vec> { +fn children(&self) -> Vec<&Arc> { Review Comment: Actually, I don't think we can return a slice of references. Returning an empty slice here would be ok, but at other places where there are children to return (e.g. in `BinaryExpr`) we need to build a temporary container (vec or array) to store the references to the children and then return a slice of the container, but who will own the container?
Re: [PR] Add reference visitor `TreeNode` APIs, change `ExecutionPlan::children()` and `PhysicalExpr::children()` return references [datafusion]
peter-toth commented on code in PR #10543: URL: https://github.com/apache/datafusion/pull/10543#discussion_r1614503459 ## datafusion/physical-plan/src/work_table.rs: ## @@ -169,7 +169,7 @@ impl ExecutionPlan for WorkTableExec { &self.cache } -fn children(&self) -> Vec> { +fn children(&self) -> Vec<&Arc> { Review Comment: ~I think we could return a slice, but the current `Vec` is in sync with other implementations how we usually return children. Like `LogicalPlan::inputs()` or `ConcreteTreeNode::children()`. Or shall we change those too?~
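The ownership problem discussed in this thread can be sketched with a toy trait; `Plan`, `Leaf`, and `Binary` below are simplified stand-ins for DataFusion's `ExecutionPlan` implementations, not the real API:

```rust
use std::sync::Arc;

// Stand-in for ExecutionPlan: children are handed back as an owned Vec of
// references, the signature adopted in PR #10543.
trait Plan {
    fn children(&self) -> Vec<&Arc<dyn Plan>>;
}

struct Leaf;
impl Plan for Leaf {
    fn children(&self) -> Vec<&Arc<dyn Plan>> {
        vec![] // a leaf such as WorkTableExec has no children
    }
}

struct Binary {
    left: Arc<dyn Plan>,
    right: Arc<dyn Plan>,
}
impl Plan for Binary {
    // A `&[&Arc<dyn Plan>]` return type would not compile here: the slice
    // would borrow from a temporary container built inside the method, and
    // nothing would own that container once the call returns. Returning the
    // Vec itself sidesteps the ownership question.
    fn children(&self) -> Vec<&Arc<dyn Plan>> {
        vec![&self.left, &self.right]
    }
}

fn main() {
    let node = Binary {
        left: Arc::new(Leaf),
        right: Arc::new(Leaf),
    };
    assert_eq!(node.children().len(), 2);
    assert_eq!(node.children()[0].children().len(), 0);
    println!("ok");
}
```

The Vec is cheap relative to the old `Vec<Arc<dyn Plan>>` signature because it holds references rather than cloned `Arc`s, so no reference counts are touched.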
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
alamb commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1614530863 ## datafusion/sqllogictest/test_files/select.slt: ## @@ -1473,7 +1473,7 @@ DROP TABLE t; # related to https://github.com/apache/datafusion/issues/8814 statement ok -create table t(x int, y int) as values (1,1), (2,2), (3,3), (0,0), (4,0); +create table t(x bigint, y bigint) as values (1,1), (2,2), (3,3), (0,0), (4,0); Review Comment: Could you please keep the existing test and add a new test that uses `bigint` so it is clearer what behavior is changing?
Re: [PR] fix: substring with negative indices should produce correct result [datafusion-comet]
sonhmai commented on PR #470: URL: https://github.com/apache/datafusion-comet/pull/470#issuecomment-2131173613 @viirya @andygrove can you take a look? Thanks
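For context on the semantics at stake in this PR: Spark's `substring` is 1-based, and a negative start position counts from the end of the string. The sketch below is an illustrative reimplementation of that behavior (loosely modeled on Spark's `UTF8String.substringSQL`), not Comet's actual code, and it assumes a non-negative length:

```rust
// Illustrative Spark-style substring: 1-based position, negative position
// counts from the end of the string. Assumes len >= 0.
fn spark_substring(s: &str, pos: i64, len: i64) -> String {
    let chars: Vec<char> = s.chars().collect();
    let n = chars.len() as i64;
    // pos > 0: 1-based from the front; pos < 0: from the end; pos == 0: like 1.
    let start = if pos > 0 {
        pos - 1
    } else if pos < 0 {
        n + pos
    } else {
        0
    };
    let end = start.saturating_add(len);
    // Clamp into the valid range; a start before the beginning still
    // consumes part of the length budget, matching Spark.
    let begin = start.clamp(0, n) as usize;
    let until = end.clamp(begin as i64, n) as usize;
    chars[begin..until].iter().collect()
}

fn main() {
    assert_eq!(spark_substring("Spark SQL", -3, 3), "SQL"); // from the end
    assert_eq!(spark_substring("Spark SQL", 1, 5), "Spark"); // 1-based start
    assert_eq!(spark_substring("abc", -5, 4), "ab"); // start before the beginning
    println!("ok");
}
```

The last case is the kind of edge the PR title refers to: a naive 0-based translation of negative indices produces a different (wrong) result there.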