Re: [PR] Introduce expr builder for aggregate function [datafusion]
jayzhan211 commented on code in PR #10560: URL: https://github.com/apache/datafusion/pull/10560#discussion_r1611638831 ## datafusion-examples/examples/udaf_expr.rs: ## @@ -0,0 +1,45 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use datafusion::{ +execution::{ +config::SessionConfig, +context::{SessionContext, SessionState}, +}, +functions_aggregate::{expr_fn::first_value, register_all}, +}; + +use datafusion_common::Result; +use datafusion_expr::col; + +#[tokio::main] +async fn main() -> Result<()> { +let ctx = SessionContext::new(); +let config = SessionConfig::new(); +let mut state = SessionState::new_with_config_rt(config, ctx.runtime_env()); +let _ = register_all(&mut state); + +let first_value_udaf = state.aggregate_functions().get("FIRST_VALUE").unwrap(); +let first_value_builder = first_value_udaf Review Comment: The downside (?) of this version is that it needs an internal function `first_value_udaf` (non-public). If the user can't get `first_value_udaf ` easily, it is not so trivial to use this expr style, and another small issue is `call` always expects vec of expr now. But, there are no builder functions like `first_value_builder`, so it may make less noise to the user (?). 
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
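The lookup-then-call flow being discussed can be sketched with a std-only model. All names below (`FunctionRegistry`, `AggregateUdf`, `call`) are illustrative stand-ins, not DataFusion's real API; the sketch only shows why the user needs a public way to obtain the UDAF before building the expression, and that `call` takes a `Vec` of expressions.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for DataFusion's `Expr` and `AggregateUDF` types.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    AggregateCall { name: String, args: Vec<Expr> },
}

struct AggregateUdf {
    name: String,
}

impl AggregateUdf {
    // `call` takes a `Vec<Expr>`, mirroring the review's point that the
    // builder always expects a vec of expressions.
    fn call(&self, args: Vec<Expr>) -> Expr {
        Expr::AggregateCall {
            name: self.name.clone(),
            args,
        }
    }
}

struct FunctionRegistry {
    udafs: HashMap<String, AggregateUdf>,
}

impl FunctionRegistry {
    fn new() -> Self {
        let mut udafs = HashMap::new();
        udafs.insert(
            "FIRST_VALUE".to_string(),
            AggregateUdf {
                name: "FIRST_VALUE".to_string(),
            },
        );
        FunctionRegistry { udafs }
    }

    // The user must look the UDAF up by name before building the
    // expression, which is the ergonomic concern raised in the review.
    fn aggregate_function(&self, name: &str) -> Option<&AggregateUdf> {
        self.udafs.get(name)
    }
}

fn main() {
    let registry = FunctionRegistry::new();
    let first_value = registry.aggregate_function("FIRST_VALUE").unwrap();
    let expr = first_value.call(vec![Expr::Column("a".to_string())]);
    println!("{:?}", expr);
}
```

In DataFusion itself the lookup would go through the session state's function registry; the model only illustrates the two-step ergonomics under discussion.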
Re: [PR] refactor: simplify converting List DataTypes to `ScalarValue` [datafusion]
jonahgao commented on code in PR #10675: URL: https://github.com/apache/datafusion/pull/10675#discussion_r1615025038 ## datafusion/common/src/scalar/mod.rs: ## @@ -3089,46 +3089,21 @@ impl TryFrom<&DataType> for ScalarValue { Box::new(value_type.as_ref().try_into()?), ), // `ScalaValue::List` contains single element `ListArray`. -DataType::List(field) => ScalarValue::List( -new_null_array( -&DataType::List(Arc::new(Field::new( -"item", -field.data_type().clone(), -true, -))), -1, -) -.as_list::() -.to_owned() -.into(), -), -// 'ScalarValue::LargeList' contains single element `LargeListArray -DataType::LargeList(field) => ScalarValue::LargeList( -new_null_array( -&DataType::LargeList(Arc::new(Field::new( -"item", -field.data_type().clone(), -true, -))), -1, -) -.as_list::() -.to_owned() -.into(), -), +DataType::List(field_ref) => ScalarValue::List(Arc::new( +GenericListArray::new_null(field_ref.clone(), 1), +)), +// `ScalarValue::LargeList` contains single element `LargeListArray`. +DataType::LargeList(field_ref) => ScalarValue::LargeList(Arc::new( +GenericListArray::new_null(field_ref.clone(), 1), +)), // `ScalaValue::FixedSizeList` contains single element `FixedSizeList`. -DataType::FixedSizeList(field, _) => ScalarValue::FixedSizeList( -new_null_array( -&DataType::FixedSizeList( -Arc::new(Field::new("item", field.data_type().clone(), true)), -1, Review Comment: Here, it should be the fixed length instead of always being `1`.
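The review point — that a null scalar for a fixed-size list must use the length stored in the type, not a hardcoded `1` — can be modeled in a few lines. The enum below is an illustrative stand-in for Arrow's `DataType`, not the real arrow-rs API.

```rust
// Simplified stand-in for Arrow's `DataType` list variants.
#[derive(Debug)]
enum DataType {
    List,
    LargeList,
    FixedSizeList(i32), // carries the fixed element count
}

// How many child elements a single null value of this type occupies.
fn null_list_len(dt: &DataType) -> i32 {
    match dt {
        // Variable-size lists: a single null entry, so length 1 is correct.
        DataType::List | DataType::LargeList => 1,
        // Fixed-size lists: the stored length must be propagated,
        // instead of always being 1.
        DataType::FixedSizeList(len) => *len,
    }
}

fn main() {
    assert_eq!(null_list_len(&DataType::List), 1);
    assert_eq!(null_list_len(&DataType::LargeList), 1);
    assert_eq!(null_list_len(&DataType::FixedSizeList(3)), 3);
    println!("fixed-size list null uses its stored length");
}
```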
[PR] refactor: simplify converting List DataTypes to `ScalarValue` [datafusion]
jonahgao opened a new pull request, #10675: URL: https://github.com/apache/datafusion/pull/10675 ## Which issue does this PR close? N/A ## Rationale for this change `ScalarValue::try_from(dt: &DataType)` is commonly used to create a NULL `ScalarValue` for a given DataType. The processing of List DataTypes is a bit complex and can be simplified. ## What changes are included in this PR? ## Are these changes tested? Yes ## Are there any user-facing changes? No
Re: [I] Upgrade object_store to 0.10.1 [datafusion]
Samrose-Ahmed commented on issue #10673: URL: https://github.com/apache/datafusion/issues/10673#issuecomment-2132048170 I think `parquet` needs to be updated too. I tried it quickly locally and needed to update `pyo3` to 0.21, but got some weird type errors I couldn't understand.
Re: [PR] Introduce Sum UDAF [datafusion]
jayzhan211 commented on code in PR #10651: URL: https://github.com/apache/datafusion/pull/10651#discussion_r1614978844 ## datafusion/physical-expr-common/src/aggregate/mod.rs: ## @@ -292,16 +307,38 @@ impl AggregateExpr for AggregateFunctionExpr { is_distinct: self.is_distinct, input_type: &self.input_type, args_num: self.args.len(), +name: &self.name, }; self.fun.groups_accumulator_supported(args) } fn create_groups_accumulator(&self) -> Result> { -self.fun.create_groups_accumulator() +let args = AccumulatorArgs { +data_type: &self.data_type, +schema: &self.schema, +ignore_nulls: self.ignore_nulls, +sort_exprs: &self.sort_exprs, +is_distinct: self.is_distinct, +input_type: &self.input_type, +args_num: self.args.len(), +name: &self.name, +}; +self.fun.create_groups_accumulator(args) } fn order_bys(&self) -> Option<&[PhysicalSortExpr]> { -(!self.ordering_req.is_empty()).then_some(&self.ordering_req) +if self.fun.has_ordering_requirements() && !self.ordering_req.is_empty() { +return Some(&self.ordering_req); +} +None Review Comment: I found that most of the functions like Sum should not expect to get `ordering_req`, even if it is non-empty. If `ordering_req` is returned, I will get an unexpected expression here https://github.com/apache/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/physical-plan/src/aggregates/mod.rs#L1011-L1013
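The guard in the quoted diff can be sketched with a self-contained model: only report an ordering requirement when the function actually has one, so order-insensitive aggregates like Sum expose none even if `ordering_req` is non-empty. Here `has_ordering_requirements` is a plain field and `String` stands in for `PhysicalSortExpr`; both are simplifications of the real trait.

```rust
// Illustrative stand-in for the physical aggregate expression.
struct AggregateFunctionExpr {
    has_ordering_requirements: bool,
    ordering_req: Vec<String>, // stand-in for Vec<PhysicalSortExpr>
}

impl AggregateFunctionExpr {
    // Mirrors the guard in the diff: both conditions must hold
    // before any ordering requirement is reported.
    fn order_bys(&self) -> Option<&[String]> {
        if self.has_ordering_requirements && !self.ordering_req.is_empty() {
            return Some(&self.ordering_req);
        }
        // Functions like Sum ignore input ordering, so expose none.
        None
    }
}

fn main() {
    let sum = AggregateFunctionExpr {
        has_ordering_requirements: false,
        ordering_req: vec!["c1 ASC".to_string()],
    };
    // Even with a non-empty ordering_req, Sum reports no ordering.
    println!("sum order_bys: {:?}", sum.order_bys());

    let first_value = AggregateFunctionExpr {
        has_ordering_requirements: true,
        ordering_req: vec!["c1 ASC".to_string()],
    };
    println!("first_value order_bys: {:?}", first_value.order_bys());
}
```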
Re: [PR] feat: add hex scalar function [datafusion-comet]
tshauck commented on code in PR #449: URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1614977230 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1038,6 +1038,20 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper { } } } + test("hex") { +Seq(true, false).foreach { dictionaryEnabled => + withTempDir { dir => +val path = new Path(dir.toURI.toString, "hex.parquet") +makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 1) + +withParquetTable(path.toString, "tbl") { + // _9 and _10 (uint8 and uint16) not supported + checkSparkAnswerAndOperator( +"SELECT hex(_1), hex(_2), hex(_3), hex(_4), hex(_5), hex(_6), hex(_7), hex(_8), hex(_11), hex(_12), hex(_13), hex(_14), hex(_15), hex(_16), hex(_17), hex(_18), hex(_19), hex(_20) FROM tbl") Review Comment: I tried that already and got test failures because the native function isn't being recognized w/ scalars. The same output I mentioned w/ `trim` above.
[I] `test_sort_1k_mem` failed with too many open file error when I run locally [datafusion]
jayzhan211 opened a new issue, #10674: URL: https://github.com/apache/datafusion/issues/10674 ### Describe the bug This has been annoying me for a while: I get this error when running locally, and the failure **is not shown in CI** ``` failures: fuzz_cases::sort_fuzz::test_sort_1k_mem stdout thread 'fuzz_cases::sort_fuzz::test_sort_1k_mem' panicked at datafusion/core/tests/fuzz_cases/sort_fuzz.rs:148:63: called `Result::unwrap()` on an `Err` value: Execution("Failed to create partition file at \"/var/folders/73/gkp858j550z_tt8bx1m8k67wgn/T/.tmpXo6QCe/.tmpv6tET2\": Os { code: 24, kind: Uncategorized, message: \"Too many open files\" }") note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace failures: fuzz_cases::sort_fuzz::test_sort_1k_mem test result: FAILED. 23 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 11.56s ``` ### To Reproduce `cargo test --lib --tests --bins --features avro,json` ### Expected behavior I expect the test to pass locally. What setting should I use? ### Additional context _No response_
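The `Os { code: 24, ... "Too many open files" }` error usually means the shell's per-process file-descriptor limit is lower than what the spilling sort test needs. Raising it with `ulimit` before running the tests is the usual workaround; this is a sketch assuming a POSIX shell, and 4096 is an arbitrary choice, not a documented requirement.

```shell
# Check the current per-process limit on open file descriptors
# (often as low as 256 on macOS).
ulimit -n

# Raise it for this shell session; 4096 is usually enough for the
# spill files the sort fuzz tests create.
ulimit -n 4096

# Then re-run the failing suite in the same shell:
#   cargo test --lib --tests --bins --features avro,json
```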
[I] Upgrade object_store to 0.10.1 [datafusion]
Samrose-Ahmed opened a new issue, #10673: URL: https://github.com/apache/datafusion/issues/10673 ### Is your feature request related to a problem or challenge? object_store 0.10.1 includes some updates for better retrying with multipart uploads. ### Describe the solution you'd like Upgrade object_store to 0.10.1 ### Describe alternatives you've considered _No response_ ### Additional context _No response_
Re: [PR] Move Median to `functions-aggregate` and Introduce Numeric signature [datafusion]
jayzhan211 commented on code in PR #10644: URL: https://github.com/apache/datafusion/pull/10644#discussion_r1614957716 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -871,6 +871,33 @@ select median(distinct c) from t; statement ok drop table t; +# optimize distinct median to group by +statement ok +create table t(c int) as values (1), (1), (1), (1), (2), (2), (3), (3); + +query TT +explain select median(distinct c) from t; + +logical_plan +01)Projection: MEDIAN(alias1) AS MEDIAN(DISTINCT t.c) +02)--Aggregate: groupBy=[[]], aggr=[[MEDIAN(alias1)]] +03)Aggregate: groupBy=[[t.c AS alias1]], aggr=[[]] Review Comment: This shows that the distinct is optimized into a group by.
Re: [I] ColumnReader.loadVector should initiate CometDictionary after re-import arrays [datafusion-comet]
viirya closed issue #474: ColumnReader.loadVector should initiate CometDictionary after re-import arrays URL: https://github.com/apache/datafusion-comet/issues/474
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya commented on PR #473: URL: https://github.com/apache/datafusion-comet/pull/473#issuecomment-2131456605 Merged. Thanks @andygrove @sunchao
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya merged PR #473: URL: https://github.com/apache/datafusion-comet/pull/473
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
codecov-commenter commented on PR #473: URL: https://github.com/apache/datafusion-comet/pull/473#issuecomment-2131417599 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `0%` with `4 lines` in your changes are missing coverage. Please review. > Project coverage is 34.05%. Comparing base [(`9125e6a`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/9125e6a04dce7c86bb0e4f4f297c8c2b70a39559?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`9a2a851`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/9a2a851dfd80d30242f323b9a7bd96385df183e9?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 2 commits behind head on main. 
| [Files](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [...in/java/org/apache/comet/parquet/ColumnReader.java](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?src=pr&el=tree&filepath=common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fcomet%2Fparquet%2FColumnReader.java&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-Y29tbW9uL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9jb21ldC9wYXJxdWV0L0NvbHVtblJlYWRlci5qYXZh) | 0.00% | [4 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #473 +/- ## = Coverage 34.05% 34.05% Complexity 859 859 = Files 116 116 Lines 38680 38679 -1 Branches 8568 8567 -1 = Hits 13171 13171 + Misses 22746 22745 -1 Partials 2763 2763 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/473?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
Re: [PR] build(deps): bump mimalloc from 0.1.39 to 0.1.41 [datafusion-python]
dependabot[bot] commented on PR #657: URL: https://github.com/apache/datafusion-python/pull/657#issuecomment-2131417589 Superseded by #719.
Re: [PR] build(deps): bump mimalloc from 0.1.39 to 0.1.41 [datafusion-python]
dependabot[bot] closed pull request #657: build(deps): bump mimalloc from 0.1.39 to 0.1.41 URL: https://github.com/apache/datafusion-python/pull/657
[PR] build(deps): bump mimalloc from 0.1.39 to 0.1.42 [datafusion-python]
dependabot[bot] opened a new pull request, #719: URL: https://github.com/apache/datafusion-python/pull/719 Bumps [mimalloc](https://github.com/purpleprotocol/mimalloc_rust) from 0.1.39 to 0.1.42. Release notes Sourced from https://github.com/purpleprotocol/mimalloc_rust/releases";>mimalloc's releases. Version 0.1.42 Changes MiMalloc v2.1.6 Expose usable_size and version. Credits https://github.com/nathaniel-daniel";>@nathaniel-daniel. Link with libatomic on armv6-linux. Credits https://github.com/notorca";>@notorca. Add no_thp option for Linux/Android. Credits https://github.com/devnexen";>@devnexen. Version 0.1.41 Changes Fix _mi_option_last Feature gate arena in extended Version 0.1.40 Changes Mimalloc v2.1.4. Add arena support. Commits https://github.com/purpleprotocol/mimalloc_rust/commit/a9c410c248859d3f55fbb7d29a88f28cdf296f6b";>a9c410c v0.1.42 https://github.com/purpleprotocol/mimalloc_rust/commit/3ba384fe63b9b4096bf427bc1cafd3a2059f1853";>3ba384f Merge pull request https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/117";>#117 from nathaniel-daniel/fix-pr-106 https://github.com/purpleprotocol/mimalloc_rust/commit/310ffbef61e6647e134b849a5616475ad7475c98";>310ffbe Gate usable_size behind extended feature https://github.com/purpleprotocol/mimalloc_rust/commit/bb3dce149307594f7bcfbbe7e564c9345db6c88f";>bb3dce1 Merge branch 'purpleprotocol:master' into fix-pr-106 https://github.com/purpleprotocol/mimalloc_rust/commit/679b170bdfdd0659df0e0fa8977a9b93222054af";>679b170 Merge pull request https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/115";>#115 from notorca/master https://github.com/purpleprotocol/mimalloc_rust/commit/b541013625430f3bfa193448470b27bee7b801c3";>b541013 Merge pull request https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/118";>#118 from nathaniel-daniel/fix-issue-114 https://github.com/purpleprotocol/mimalloc_rust/commit/4ae150e5fe8066f6e92c5ca79a6b167282325995";>4ae150e Merge pull request 
https://redirect.github.com/purpleprotocol/mimalloc_rust/issues/119";>#119 from nathaniel-daniel/version https://github.com/purpleprotocol/mimalloc_rust/commit/7a15df134eeb87379bb67b65aa4fdf2d76935e16";>7a15df1 Test extended feature on ci https://github.com/purpleprotocol/mimalloc_rust/commit/e1b2306728ac67354d2301bc040d877bb31cd391";>e1b2306 Run cargo fmt https://github.com/purpleprotocol/mimalloc_rust/commit/7d1366d56e8b4857382740f293d8ae58fd97f1a4";>7d1366d Expose version function Additional commits viewable in https://github.com/purpleprotocol/mimalloc_rust/compare/v0.1.39...v0.1.42";>compare view [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=mimalloc&package-manager=cargo&previous-version=0.1.39&new-version=0.1.42)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
[PR] build(deps): bump parking_lot from 0.12.2 to 0.12.3 [datafusion-python]
dependabot[bot] opened a new pull request, #718: URL: https://github.com/apache/datafusion-python/pull/718 Bumps [parking_lot](https://github.com/Amanieu/parking_lot) from 0.12.2 to 0.12.3. Changelog Sourced from https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md";>parking_lot's changelog. parking_lot 0.12.3 (2024-05-24) Export types provided by arc_lock feature (https://redirect.github.com/Amanieu/parking_lot/issues/442";>#442) Commits https://github.com/Amanieu/parking_lot/commit/a29dd3d4ff661f73fd5cf638584bc4c5de2ad347";>a29dd3d Release parking_lot 0.12.3 https://github.com/Amanieu/parking_lot/commit/f7efcae51161c25f7839cddd20cc1b809441e5e8";>f7efcae Merge pull request https://redirect.github.com/Amanieu/parking_lot/issues/442";>#442 from iwanders/add-arc_lock-feature-top-level-exports https://github.com/Amanieu/parking_lot/commit/c357017decfa0f21ca597a4958745ad0999cf9eb";>c357017 Export types provided by arc_lock feature. See full diff in https://github.com/Amanieu/parking_lot/compare/0.12.2...0.12.3";>compare view [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=parking_lot&package-manager=cargo&previous-version=0.12.2&new-version=0.12.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
[//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Re: [PR] Improve `ParquetExec` and related documentation [datafusion]
comphead commented on code in PR #10647: URL: https://github.com/apache/datafusion/pull/10647#discussion_r1614830679 ## datafusion/core/src/datasource/physical_plan/parquet/schema_adapter.rs: ## @@ -20,35 +20,38 @@ use arrow_schema::{Schema, SchemaRef}; use std::fmt::Debug; use std::sync::Arc; -/// Factory of schema adapters. +/// Factory for creating [`SchemaAdapter`] /// -/// Provides means to implement custom schema adaptation. +/// This interface provides a way to implement custom schema adaptation logic +/// for ParquetExec (for example, to fill missing columns with default value +/// other than null) pub trait SchemaAdapterFactory: Debug + Send + Sync + 'static { /// Provides `SchemaAdapter` for the ParquetExec. fn create(&self, schema: SchemaRef) -> Box; } -/// A utility which can adapt file-level record batches to a table schema which may have a schema +/// Adapt file-level [`RecordBatch`]es to a table schema, which may have a schema /// obtained from merging multiple file-level schemas. /// /// This is useful for enabling schema evolution in partitioned datasets. /// /// This has to be done in two stages. /// -/// 1. Before reading the file, we have to map projected column indexes from the table schema to -///the file schema. +/// 1. Before reading the file, we have to map projected column indexes from the +///table schema to the file schema. /// -/// 2. After reading a record batch we need to map the read columns back to the expected columns -///indexes and insert null-valued columns wherever the file schema was missing a colum present -///in the table schema. +/// 2. After reading a record batch map the read columns back to the expected +///columns indexes and insert null-valued columns wherever the file schema was +///missing a colum present in the table schema. Review Comment: ```suggestion ///missing a column present in the table schema. ``` -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya commented on code in PR #473: URL: https://github.com/apache/datafusion-comet/pull/473#discussion_r1614819511 ## common/src/main/java/org/apache/comet/parquet/ColumnReader.java: ## @@ -230,15 +230,15 @@ public CometDecodedVector loadVector() { // return plain vector. currentVector = cometVector; return currentVector; - } else if (dictionary == null) { -// There is dictionary from native side but the Java side dictionary hasn't been -// initialized yet. -Dictionary arrowDictionary = dictionaryProvider.lookup(dictionaryEncoding.getId()); -CometPlainVector dictionaryVector = -new CometPlainVector(arrowDictionary.getVector(), useDecimal128, isUuid); -dictionary = new CometDictionary(dictionaryVector); } + // We should already re-initiate `CometDictionary` here because `Data.importVector` API will + // release the previous dictionary vector and create a new one. + Dictionary arrowDictionary = dictionaryProvider.lookup(dictionaryEncoding.getId()); + CometPlainVector dictionaryVector = + new CometPlainVector(arrowDictionary.getVector(), useDecimal128, isUuid); + dictionary = new CometDictionary(dictionaryVector); + Review Comment: The test failure is in TPCDSQuerySuite in #437. I don't have a simple reproducible test case for it. But I think the change is straightforward and safe. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya commented on PR #473: URL: https://github.com/apache/datafusion-comet/pull/473#issuecomment-2131403965 cc @sunchao -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] ColumnReader.loadVector should initiate CometDictionary after re-import arrays [datafusion-comet]
viirya opened a new issue, #474: URL: https://github.com/apache/datafusion-comet/issues/474 ### Describe the bug While debugging some test failures in #437, I found that `CometScanExec` returns empty dictionary values after the first batch in the failed queries. This is because the `Data.importVector` API call clears the existing dictionary vector in the given `dictionaryProvider`. ### Steps to reproduce _No response_ ### Expected behavior _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: `ColumnReader.loadVector` should initiate `CometDictionary` after re-import arrays [datafusion-comet]
viirya opened a new pull request, #473: URL: https://github.com/apache/datafusion-comet/pull/473 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove merged PR #461: URL: https://github.com/apache/datafusion-comet/pull/461 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Add quote-style parameter for CSV options [datafusion]
DDtKey commented on issue #10669: URL: https://github.com/apache/datafusion/issues/10669#issuecomment-2131388179 I think this might be labeled with `good first issue`, there are links to the code that needs to be changed and it is also possible to write `sqllogictest` similar to https://github.com/apache/datafusion/pull/10671. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] ArrayList data is leaked to outside unnest logical plan from the inner unnest logical plan that comes before a Limit [datafusion]
duongcongtoai commented on issue #10672: URL: https://github.com/apache/datafusion/issues/10672#issuecomment-2131375997 I'm taking it -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] ArrayList data is leaked to outside unnest logical plan from the inner unnest logical plan with limit applied [datafusion]
duongcongtoai opened a new issue, #10672: URL: https://github.com/apache/datafusion/issues/10672 ### Describe the bug Given this query
```
select unnest(struct_c1c0), unnest(list_c2c0) from (
  select unnest(column1) as struct_c1c0, unnest(column2)['c0'] as list_c2c0 from temp limit 1
);
```
where `list_c2c0` is the result of an unnest logical plan followed by a limit
```
| logical_plan | Unnest: lists[unnest(list_c2c0)] structs[unnest(struct_c1c0)]                                                          |
|              |   Projection: unnest(temp.column1) AS unnest(struct_c1c0), get_field(unnest(column2), Utf8("c0")) AS unnest(list_c2c0) |
|              |     Limit: skip=0, fetch=1                                                                                             |
|              |       Unnest: lists[unnest(temp.column1), unnest(column2)] structs[]                                                   |
|              |         Projection: temp.column1 AS unnest(temp.column1), temp.column2 AS unnest(column2)                              |
|              |           TableScan: temp projection=[column1, column2]                                                                |
```
then unnesting on this column `list_c2c0` again will cause the error
```
Arrow error: Invalid argument error: all columns in a record batch must have the same length
```
### To Reproduce
```
> create table temp as values ([struct('a')], [struct([1,2]),struct([2,3])]);
0 row(s) fetched. Elapsed 0.008 seconds.

> select unnest(column1) as struct_c1c0, unnest(column2)['c0'] as list_c2c0 from temp limit 1;
+-------------+-----------+
| struct_c1c0 | list_c2c0 |
+-------------+-----------+
| {c0: a}     | [1, 2]    |
+-------------+-----------+
1 row(s) fetched. Elapsed 0.010 seconds.

> select unnest(struct_c1c0), unnest(list_c2c0) from (
  select unnest(column1) as struct_c1c0, unnest(column2)['c0'] as list_c2c0 from temp limit 1
);
Arrow error: Invalid argument error: all columns in a record batch must have the same length
```
### Expected behavior Correct result returned
```
> select unnest(struct_c1c0), unnest(list_c2c0) from (
  select unnest(column1) as struct_c1c0, unnest(column2)['c0'] as list_c2c0 from temp limit 1
);
+------------------------+-------------------+
| unnest(struct_c1c0).c0 | unnest(list_c2c0) |
+------------------------+-------------------+
| a                      | 1                 |
| a                      | 2                 |
+------------------------+-------------------+
```
### Additional context The problem may be somewhere around this line
```
https://github.com/apache/datafusion/blob/34eda15b73a9e278af8844b30ed2f1c21c10359c/datafusion/physical-plan/src/unnest.rs#L431
```
The context of this function is that it tries to downcast the generic Array into its underlying type and use the underlying values as the unnested column values. However, this downcast somehow leaks all the values which were previously skipped at the Limit operator, thus leaking the extra elements [2,3]. If we add these debug lines
```
println!(
    "dyn array length: {} and data {:?}",
    list_arrays[0].len(),
    list_arrays[0]
);
println!(
    "down casted array length {} and data {:?}",
    typed_arrays[0].values().len(),
    typed_arrays[0].values()
);
```
the result will be
```
dyn array length: 1 and data ListArray
[
  PrimitiveArray
  [
    1,
    2,
  ],
]
down casted array length 4 and data PrimitiveArray
[
  1,
  2,
  2,
  3,
]
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
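The offset-ignoring downcast described in this issue can be modeled without arrow-rs at all. The sketch below is a hypothetical, simplified stand-in for a `ListArray` (a flat child-values buffer plus offsets, with a logical length reduced by the limit), not the actual Arrow API; it shows why reading the child `values` buffer directly leaks the `[2, 3]` elements that the Limit operator skipped:

```rust
// Simplified model of an Arrow ListArray: a shared child-values buffer plus
// offsets. This is NOT the arrow-rs API, just a model of the memory layout.
struct ListArrayModel {
    values: Vec<i64>,    // flattened child values for all lists ever written
    offsets: Vec<usize>, // offsets[i]..offsets[i + 1] is the i-th list
    len: usize,          // logical number of lists (after LIMIT, may be < offsets.len() - 1)
}

impl ListArrayModel {
    /// Buggy path: returns the entire child buffer, leaking elements that
    /// belong to rows skipped by the limit.
    fn all_values(&self) -> &[i64] {
        &self.values
    }

    /// Correct path: respects the offsets of the logical rows only.
    fn unnested(&self) -> &[i64] {
        &self.values[self.offsets[0]..self.offsets[self.len]]
    }
}

fn main() {
    // Two lists were written ([1, 2] and [2, 3]) but LIMIT 1 keeps only the first.
    let arr = ListArrayModel {
        values: vec![1, 2, 2, 3],
        offsets: vec![0, 2, 4],
        len: 1,
    };
    // The raw buffer still holds four values, including the skipped [2, 3] ...
    assert_eq!(arr.all_values(), &[1, 2, 2, 3]);
    // ... while honoring the offsets yields only the surviving row's list.
    assert_eq!(arr.unnested(), &[1, 2]);
}
```

The mismatch between the four leaked child values and the single logical row is exactly what produces the "all columns in a record batch must have the same length" error above.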
[PR] fix: pass `quote` parameter to CSV writer [datafusion]
DDtKey opened a new pull request, #10671: URL: https://github.com/apache/datafusion/pull/10671 ## Which issue does this PR close? Closes #10670 ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? `sqllogictest` has been added which fails without these changes ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614789127 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -622,14 +590,89 @@ impl Cast { self.eval_mode, from_type, to_type, -)? +) +} +_ if Self::is_datafusion_spark_compatible(from_type, to_type) => { +// use DataFusion cast only when we know that it is compatible with Spark +Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?) } _ => { -// when we have no Spark-specific casting we delegate to DataFusion -cast_with_options(&array, to_type, &CAST_OPTIONS)? +// we should never reach this code because the Scala code should be checking +// for supported cast operations and falling back to Spark for anything that +// is not yet supported +Err(CometError::Internal(format!( +"Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}" +))) } }; -Ok(spark_cast(cast_result, from_type, to_type)) +Ok(spark_cast(cast_result?, from_type, to_type)) +} + +/// Determines if DataFusion supports the given cast in a way that is +/// compatible with Spark +fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool { +if from_type == to_type { +return true; +} +match from_type { +DataType::Boolean => matches!( +to_type, +DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Utf8 Review Comment: This test relies on a cast that we do not yet support and enables `COMET_CAST_ALLOW_INCOMPATIBLE` to allow it. I will revert the last change and add a comment about this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614788921 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -622,14 +590,89 @@ impl Cast { self.eval_mode, from_type, to_type, -)? +) +} +_ if Self::is_datafusion_spark_compatible(from_type, to_type) => { +// use DataFusion cast only when we know that it is compatible with Spark +Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?) } _ => { -// when we have no Spark-specific casting we delegate to DataFusion -cast_with_options(&array, to_type, &CAST_OPTIONS)? +// we should never reach this code because the Scala code should be checking +// for supported cast operations and falling back to Spark for anything that +// is not yet supported +Err(CometError::Internal(format!( +"Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}" +))) } }; -Ok(spark_cast(cast_result, from_type, to_type)) +Ok(spark_cast(cast_result?, from_type, to_type)) +} + +/// Determines if DataFusion supports the given cast in a way that is +/// compatible with Spark +fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool { +if from_type == to_type { +return true; +} +match from_type { +DataType::Boolean => matches!( +to_type, +DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Utf8 Review Comment: Removing that case causes a test failure: ``` - scalar subquery *** FAILED *** (8 seconds, 253 milliseconds) Cause: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 410.0 failed 1 times, most recent failure: Lost task 0.0 in stage 410.0 (TID 1286) (192.168.64.23 executor driver): org.apache.comet.CometNativeException: Execution error: Comet Internal Error: Native cast invoked for unsupported cast from Int32 to Decimal128(38, 10) ``` -- This is an automated message from the Apache Git 
Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
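The dispatch order discussed in this review can be sketched as follows. All type and function names here are illustrative placeholders rather than Comet's real API; the point is only the pattern of delegating to DataFusion's cast solely for casts known to be Spark-compatible, and returning an internal error otherwise (which is what surfaced as the `Int32 to Decimal128(38, 10)` failure quoted above):

```rust
// Placeholder type list for illustration; not Comet's actual DataType enum.
#[derive(Debug, PartialEq, Clone, Copy)]
enum DataType { Boolean, Int8, Int32, Int64, Float64, Utf8, Decimal128 }

/// Allow-list of casts where DataFusion's behavior matches Spark's.
fn is_datafusion_spark_compatible(from: DataType, to: DataType) -> bool {
    if from == to {
        return true;
    }
    match from {
        // Boolean -> numeric/string casts behave the same in both engines.
        DataType::Boolean => matches!(
            to,
            DataType::Int8 | DataType::Int32 | DataType::Int64 | DataType::Float64 | DataType::Utf8
        ),
        _ => false,
    }
}

/// Sketch of the fallback order: delegate only when known compatible,
/// otherwise surface an internal error (the Scala planner is expected to
/// have fallen back to Spark before this point).
fn cast(from: DataType, to: DataType) -> Result<String, String> {
    if is_datafusion_spark_compatible(from, to) {
        Ok(format!("datafusion cast {from:?} -> {to:?}"))
    } else {
        Err(format!(
            "Native cast invoked for unsupported cast from {from:?} to {to:?}"
        ))
    }
}

fn main() {
    assert!(cast(DataType::Boolean, DataType::Utf8).is_ok());
    // Mirrors the reported failure: Int32 -> Decimal128 is not on the allow-list.
    assert!(cast(DataType::Int32, DataType::Decimal128).is_err());
}
```

The design trade-off the review discusses is visible here: an allow-list fails loudly on any cast nobody has verified, instead of silently delegating a cast whose semantics may diverge from Spark's.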
[PR] feat: Add "Comet Fuzz" fuzz-testing utility [datafusion-comet]
andygrove opened a new pull request, #472: URL: https://github.com/apache/datafusion-comet/pull/472 ## Which issue does this PR close? N/A ## Rationale for this change Comet Fuzz is a standalone project that generates random data and queries, executes the queries against Spark with Comet disabled and enabled, and checks for incompatibilities. Although it is a simple tool, it has already been useful in finding many bugs. Comet Fuzz is inspired by the [SparkFuzz](https://ir.cwi.nl/pub/30222) paper from Databricks and CWI. ## What changes are included in this PR? New standalone `comet-fuzz` project. ## How are these changes tested? Manually tested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] CSV `quote` parameter is ignored [datafusion]
DDtKey opened a new issue, #10670: URL: https://github.com/apache/datafusion/issues/10670 ### Describe the bug The `quote` parameter of the CSV writer is not getting passed to the `arrow-csv` writer: https://github.com/apache/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/common/src/file_options/csv_writer.rs#L48-L75 ### To Reproduce
```sql
COPY (select '12,3', 'abc,d' ) TO 'test.csv' STORED AS csv OPTIONS ('quote' '|')
```
results in
```csv
"Utf8(""12,3"")","Utf8(""abc,d"")"
"12,3","abc,d"
```
The result is the same if we use `write_csv` with the option passed. ### Expected behavior The writer should use the appropriate quote char. ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Add quote-style parameter for CSV options [datafusion]
DDtKey opened a new issue, #10669: URL: https://github.com/apache/datafusion/issues/10669 ### Is your feature request related to a problem or challenge? CSV writers usually support configuration of the quote style/mode with the following options:
- `Always`
- `Necessary`
- `Never`
- `NonNumeric`

Sometimes this just needs to be controlled, and for now the only way to change it is to re-iterate through the result file(s) in order to store the content with the desired quote style. You can find such configs in many libraries:
- the `csv` crate ([`QuoteStyle`](https://docs.rs/csv/latest/csv/enum.QuoteStyle.html))
- Python's `csv` module (constants like [`QUOTE_ALL`](https://docs.python.org/3/library/csv.html#csv.QUOTE_ALL))
- Apache Commons CSV for Java ([`QuoteMode`](https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html))

### Describe the solution you'd like Expose a way to pass the `QuoteStyle` enum along with other properties like `quote`, `delimiter`, etc. (as part of `CsvOptions`). However, keep in mind that the configuration only makes sense for writers, not readers. That shouldn't be an issue to support, because `datafusion` relies on `arrow-csv`, which uses the `csv` crate under the hood.
- requires updating `arrow-csv` to accept a quote-style param (sub-issue for `arrow-rs`?)
  - add to `WriterBuilder`: https://github.com/apache/arrow-rs/blob/4b5d9bfc958c06fb1ff71d90ba58497e965eff40/arrow-csv/src/writer.rs#L191-L214
  - pass to `csv::Writer`: https://github.com/apache/arrow-rs/blob/4b5d9bfc958c06fb1ff71d90ba58497e965eff40/arrow-csv/src/writer.rs#L402-L408
- update `datafusion`
  - add parameter to `CsvOptions`: https://github.com/apache/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/common/src/config.rs#L1554-L1570
  - pass to `arrow-csv`: https://github.com/apache/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/common/src/file_options/csv_writer.rs#L48-L75

### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
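To illustrate the semantics of the four quote styles listed in the issue, here is a minimal, self-contained sketch of per-field quoting. It is not the `csv` crate's or `arrow-csv`'s implementation (the real change would thread a `QuoteStyle` through `WriterBuilder`); the function name and behavior details are assumptions for illustration only:

```rust
// Illustrative quote styles; mirrors the option names from the issue, not a real API.
#[derive(Clone, Copy)]
enum QuoteStyle { Always, Necessary, Never, NonNumeric }

/// Quote a single CSV field according to the chosen style. `delimiter` and
/// `quote` are configurable, as in the linked `WriterBuilder` APIs.
fn write_field(field: &str, style: QuoteStyle, delimiter: char, quote: char) -> String {
    // Quoting is required when the field contains the delimiter, the quote
    // char, or a newline.
    let needs = field.contains(delimiter) || field.contains(quote) || field.contains('\n');
    let is_numeric = !field.is_empty()
        && field.chars().all(|c| c.is_ascii_digit() || c == '.' || c == '-');
    let quoted = match style {
        QuoteStyle::Always => true,
        QuoteStyle::Necessary => needs,
        QuoteStyle::Never => false, // may emit ambiguous output; that is the caller's choice
        QuoteStyle::NonNumeric => !is_numeric,
    };
    if quoted {
        // Double any embedded quote chars, per the usual CSV convention.
        let escaped = field.replace(quote, &format!("{quote}{quote}"));
        format!("{quote}{escaped}{quote}")
    } else {
        field.to_string()
    }
}

fn main() {
    // With the custom quote char '|' from issue #10670's reproducer:
    assert_eq!(write_field("12,3", QuoteStyle::Necessary, ',', '|'), "|12,3|");
    assert_eq!(write_field("abc", QuoteStyle::Necessary, ',', '"'), "abc");
    assert_eq!(write_field("42", QuoteStyle::NonNumeric, ',', '"'), "42");
    assert_eq!(write_field("abc", QuoteStyle::Always, ',', '"'), "\"abc\"");
}
```

This also shows why the option belongs to writers only: readers can accept any of these outputs, but only the writer decides which form to emit.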
Re: [PR] Minor: Improve ObjectStoreUrl docs + examples [datafusion]
goldmedal commented on PR #10619: URL: https://github.com/apache/datafusion/pull/10619#issuecomment-2131302454 On a related note, I found that the usage flow of ObjectStore in DataFusion involves registering the source as a table and then querying the table. This approach makes sense. However, I think it might be possible to provide something like a table function or other methods to allow users to dynamically decide which path they want to scan. There are similar features in DuckDB and Spark: - DuckDB: `SELECT * FROM read_parquet('file-path/test.parquet')` - Spark: `spark.read.parquet('')` Does DataFusion already have this feature, or did I miss it? If not, I think it would be a valuable feature for users of object storage. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Library Guide: Building LogicalPlans [datafusion]
edmondop commented on issue #7306: URL: https://github.com/apache/datafusion/issues/7306#issuecomment-2131289288 > > What do you think we should do? I can add that paragraph, but maybe we also want to link the examples? What else do you think it's needed to close this > > Those both sound like excellent ideas to close off this issue. Thank you @edmondop I noticed that the original work from @andygrove referenced examples in the docs, and these seem to be pretty "unique" in the sense that no other doc page seems to take this approach; see https://github.com/apache/datafusion/tree/main/docs/src What do you think about moving those into examples and linking them from the docs? Do we have a preprocessor macro to include the Rust code in a markdown file upon compilation, or maybe to ensure that the embedded snippets compile and execute correctly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] feat: Implement ANSI support for UnaryMinus [datafusion-comet]
vaibhawvipul opened a new pull request, #471: URL: https://github.com/apache/datafusion-comet/pull/471 ## Which issue does this PR close? Closes #465 . ## Rationale for this change ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support Substrait's VirtualTables [datafusion]
jonahgao commented on code in PR #10531: URL: https://github.com/apache/datafusion/pull/10531#discussion_r1614673020 ## datafusion/substrait/tests/cases/roundtrip_logical_plan.rs: ## @@ -685,6 +685,19 @@ async fn roundtrip_literal_struct() -> Result<()> { .await } +#[tokio::test] +async fn roundtrip_values() -> Result<()> { +assert_expected_plan( +"VALUES (1, 'a', [[-213.1, NULL, 5.5, 2.0, 1.0], []], STRUCT(true, 1, CAST(NULL AS STRING))), (NULL, NULL, NULL, NULL)", +"Values: (Int64(1), Utf8(\"a\"), List([[-213.1, , 5.5, 2.0, 1.0], []]), Struct({c0:true,c1:1,c2:})), (Int64(NULL), Utf8(NULL), List(), Struct({c0:,c1:,c2:}))") +.await +} + +#[tokio::test] +async fn roundtrip_empty_relation() -> Result<()> { +roundtrip("SELECT * FROM (VALUES (1)) LIMIT 0").await Review Comment: We might need a test to verify the schemas. For this test, the plan str on both sides would just be `EmptyRelation`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Convert Variance Population to UDAF [datafusion]
jayzhan211 opened a new issue, #10668: URL: https://github.com/apache/datafusion/issues/10668 ### Is your feature request related to a problem or challenge? Similar to #10667, but for the Variance Population function ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support Substrait's VirtualTables [datafusion]
jonahgao commented on code in PR #10531: URL: https://github.com/apache/datafusion/pull/10531#discussion_r1614671030 ## datafusion/substrait/tests/cases/roundtrip_logical_plan.rs: ## @@ -685,6 +685,19 @@ async fn roundtrip_literal_struct() -> Result<()> { .await } +#[tokio::test] +async fn roundtrip_values() -> Result<()> { +assert_expected_plan( +"VALUES (1, 'a', [[-213.1, NULL, 5.5, 2.0, 1.0], []], STRUCT(true, 1, CAST(NULL AS STRING))), (NULL, NULL, NULL, NULL)", +"Values: (Int64(1), Utf8(\"a\"), List([[-213.1, , 5.5, 2.0, 1.0], []]), Struct({c0:true,c1:1,c2:})), (Int64(NULL), Utf8(NULL), List(), Struct({c0:,c1:,c2:}))") +.await +} + +#[tokio::test] +async fn roundtrip_empty_relation() -> Result<()> { +roundtrip("SELECT * FROM (VALUES (1)) LIMIT 0").await Review Comment: I've slightly modified this test and printed out the schema on the consumer side, but it seems to be incorrect. ```rust #[tokio::test] async fn roundtrip_empty_relation() -> Result<()> { roundtrip("SELECT * FROM (VALUES ([STRUCT(1 as a)], 2)) LIMIT 0").await } ``` ```sh plan2 schema: DFSchema { inner: Schema { fields: [ Field { name: "column1", data_type: List(Field { name: "item", data_type: Struct([Field { name: "c0", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, # Its name should be "column2" Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None, None], functional_dependencies: FunctionalDependencies { deps: [] } } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Convert `Variance Sample` to UDAF [datafusion]
jayzhan211 opened a new issue, #10667: URL: https://github.com/apache/datafusion/issues/10667 ### Is your feature request related to a problem or challenge? Part of #8708
1. Move the variance aggregate expression in `datafusion/physical-expr/src/aggregate/variance.rs` to `functions-aggregate`.
2. Add an expression API in `roundtrip_expr_api` in `datafusion/proto/tests/cases/roundtrip_logical_plan.rs`

Similar to #10372 and #10418 ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
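For context on what such a variance accumulator computes, below is a minimal sketch of the streaming (Welford) update that aggregate implementations of `VAR_SAMP` and `VAR_POP` commonly use. This is an illustrative sketch, not DataFusion's actual `variance.rs` code; the struct and method names are placeholders:

```rust
/// Streaming variance accumulator using Welford's online algorithm: the kind
/// of state a variance UDAF accumulator maintains across input batches.
#[derive(Default)]
struct VarianceAccumulator {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the current mean
}

impl VarianceAccumulator {
    /// Fold one value into the running state in O(1) time and memory.
    fn update(&mut self, value: f64) {
        self.count += 1;
        let delta = value - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (value - self.mean);
    }

    /// Sample variance (VAR_SAMP): divides by n - 1; undefined for n < 2.
    fn variance_sample(&self) -> Option<f64> {
        if self.count < 2 { None } else { Some(self.m2 / (self.count - 1) as f64) }
    }

    /// Population variance (VAR_POP): divides by n; undefined for n = 0.
    fn variance_pop(&self) -> Option<f64> {
        if self.count == 0 { None } else { Some(self.m2 / self.count as f64) }
    }
}

fn main() {
    let mut acc = VarianceAccumulator::default();
    for v in [1.0, 2.0, 3.0, 4.0] {
        acc.update(v);
    }
    // mean = 2.5, m2 = 5.0 for this input
    assert!((acc.variance_sample().unwrap() - 5.0 / 3.0).abs() < 1e-12);
    assert!((acc.variance_pop().unwrap() - 1.25).abs() < 1e-12);
}
```

The one-pass update is why such aggregates fit DataFusion's accumulator model: only `(count, mean, m2)` needs to be serialized as intermediate state when merging partial aggregates.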
Re: [PR] Add reference visitor `TreeNode` APIs, change `ExecutionPlan::children()` and `PhysicalExpr::children()` return references [datafusion]
alamb commented on PR #10543: URL: https://github.com/apache/datafusion/pull/10543#issuecomment-2131268529 Awesome! I plan to merge this PR tomorrow unless anyone else would like time to review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve `ParquetExec` and related documentation [datafusion]
alamb commented on code in PR #10647: URL: https://github.com/apache/datafusion/pull/10647#discussion_r1614666750 ## datafusion/core/src/datasource/physical_plan/parquet/mod.rs: ## @@ -642,11 +714,22 @@ fn should_enable_page_index( .unwrap_or(false) } -/// Factory of parquet file readers. +/// Interface for creating [`AsyncFileReader`]s to read parquet files. +/// +/// This interface is used by [`ParquetOpener`] in order to create readers for +/// parquet files. Implementations of this trait can be used to provide custom Review Comment: Excellent idea. I did so ## datafusion/core/src/datasource/physical_plan/parquet/mod.rs: ## @@ -75,7 +75,79 @@ pub use metrics::ParquetFileMetrics; pub use schema_adapter::{SchemaAdapter, SchemaAdapterFactory, SchemaMapper}; pub use statistics::{RequestedStatistics, StatisticsConverter}; -/// Execution plan for scanning one or more Parquet partitions +/// Execution plan for reading one or more Parquet files. +/// +/// ```text +/// ▲ +/// │ +/// │ Produce a stream of +/// │ RecordBatches +/// │ +/// ┌───┐ +/// │ │ +/// │ ParquetExec │ +/// │ │ +/// └───┘ +/// ▲ +/// │ Asynchronously read from one +/// │ or more parquet files via +/// │ ObjectStore interface +/// │ +/// │ +/// .───. +/// │ ) +/// │`───'│ +/// │ObjectStore │ +/// │.───.│ +/// │ ) +/// `───' +/// +/// ``` +/// # Features +/// +/// Supports the following optimizations: +/// +/// * Multi-threaded (aka multi-partition): read from one or more files in +/// parallel. Can read concurrently from multiple row groups from a single file. +/// +/// * Predicate push down: skips row groups and pages based on +/// min/max/null_counts in the row group metadata, the page index and bloom +/// filters. +/// +/// * Projection pushdown: reads and decodes only the columns required. +/// +/// * Limit pushdown: stop execution early after some number of rows are read. +/// +/// * Custom readers: controls I/O for accessing pages. 
See Review Comment: good call -- updated
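The "predicate push down" feature listed in the documentation above can be sketched concretely. The struct and function names below are hypothetical illustrations, not DataFusion's API: a filter such as `col > literal` can skip an entire row group using only the min/max statistics stored in the parquet metadata, without decoding any data.

```rust
// Hypothetical stand-in for the per-row-group statistics that parquet
// metadata provides for a column.
struct RowGroupStats {
    min: i64,
    max: i64,
}

// Returns true if the row group can be skipped for the predicate
// `col > literal`: if every value in the group is <= literal, no row matches.
fn can_skip_greater_than(stats: &RowGroupStats, literal: i64) -> bool {
    stats.max <= literal
}

fn main() {
    let pruned = RowGroupStats { min: 0, max: 5 };
    let kept = RowGroupStats { min: 0, max: 9 };
    println!("{} {}", can_skip_greater_than(&pruned, 5), can_skip_greater_than(&kept, 5));
}
```

The same idea extends to page-level pruning and bloom filters; only the granularity of the statistics changes.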
Re: [PR] Support Substrait's VirtualTables [datafusion]
jonahgao commented on code in PR #10531: URL: https://github.com/apache/datafusion/pull/10531#discussion_r1614666495 ## datafusion/substrait/src/logical_plan/producer.rs: ## @@ -512,6 +572,62 @@ pub fn to_substrait_rel( } } +fn to_substrait_named_struct(schema: &DFSchemaRef) -> Result { +// Substrait wants a list of all field names, including nested fields from structs, +// also from within lists and maps. However, it does not want the list and map field names +// themselves - only structs are considered to have useful names. +fn names_dfs(dtype: &DataType) -> Result> { +match dtype { +DataType::Struct(fields) => { +let mut names = Vec::new(); +for field in fields { +names.push(field.name().to_string()); +names.extend(names_dfs(field.data_type())?); +} +Ok(names) +} +DataType::List(l) => names_dfs(l.data_type()), Review Comment: I think we also need to consider `DataType::LargeList` here because `to_substrait_type` supports it.
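A minimal, self-contained sketch of the reviewer's suggestion, using a hand-rolled stand-in for Arrow's `DataType` (the real code operates on `arrow_schema::DataType` and returns `Result`): handling `LargeList` in the same match arm as `List` keeps the depth-first name collection consistent with what `to_substrait_type` supports.

```rust
// Simplified mock of Arrow's DataType, for illustration only.
enum DataType {
    Int64,
    Struct(Vec<(String, DataType)>),
    List(Box<DataType>),
    LargeList(Box<DataType>),
}

// Depth-first collection of struct field names, skipping list/map names
// themselves, with the suggested `LargeList` arm alongside `List`.
fn names_dfs(dtype: &DataType) -> Vec<String> {
    match dtype {
        DataType::Struct(fields) => {
            let mut names = Vec::new();
            for (name, child) in fields {
                names.push(name.clone());
                names.extend(names_dfs(child));
            }
            names
        }
        // both list widths are treated identically: recurse into the element type
        DataType::List(elem) | DataType::LargeList(elem) => names_dfs(elem),
        _ => Vec::new(),
    }
}

fn main() {
    let t = DataType::LargeList(Box::new(DataType::Struct(vec![
        ("a".to_string(), DataType::Int64),
        ("b".to_string(), DataType::List(Box::new(DataType::Int64))),
    ])));
    println!("{:?}", names_dfs(&t));
}
```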
Re: [I] [Epic] Unify `AggregateFunction` Interface (remove built in list of `AggregateFunction` s), improve the system [datafusion]
jayzhan211 commented on issue #8708: URL: https://github.com/apache/datafusion/issues/8708#issuecomment-2131265763 > After #10648 and #10389 I think we have a pretty good set of examples of how to move aggregates out of the core (thanks to all the foundations laid by @jayzhan211 ) > > Would it be helpful to file a few "good first issue" type tickets for some of the more straightforward aggregates (I am thinking the statistical variance etc)? Before this, I think it would be nice to determine the API of aggregate expr first: https://github.com/apache/datafusion/pull/10560#discussion_r1611644593
Re: [PR] Add reference visitor `TreeNode` APIs, change `ExecutionPlan::children()` and `PhysicalExpr::children()` return references [datafusion]
peter-toth commented on code in PR #10543: URL: https://github.com/apache/datafusion/pull/10543#discussion_r1614657172 ## datafusion/physical-plan/src/work_table.rs: ## @@ -169,7 +169,7 @@ impl ExecutionPlan for WorkTableExec { &self.cache } -fn children(&self) -> Vec> { +fn children(&self) -> Vec<&Arc> { Review Comment: I opened a PR as returning a slice in `ConcreteTreeNode` is possible: https://github.com/apache/datafusion/pull/10666 But it will not work for `LogicalPlan` or `ExecutionPlan` or `PhysicalExpr`.
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
jayzhan211 commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1614656536 ## datafusion/sqllogictest/test_files/joins.slt: ## @@ -1778,7 +1778,7 @@ AS VALUES ('BB', 6, 1); query TII -select col1, col2, coalesce(sum_col3, 0) as sum_col3 +select col1, col2, coalesce(sum_col3, arrow_cast(0, 'UInt64')) as sum_col3 Review Comment: I think this will fail 🤔 since I disable the coercion between i64 and u64, but I'm not 100% sure
Re: [PR] Minor: Remove `GetFieldAccessSchema` [datafusion]
alamb merged PR #10665: URL: https://github.com/apache/datafusion/pull/10665
[PR] Minor: Use slice in `ConcreteTreeNode` [datafusion]
peter-toth opened a new pull request, #10666: URL: https://github.com/apache/datafusion/pull/10666 The idea of using slices to return a node's children came up here: https://github.com/apache/datafusion/pull/10543#discussion_r1614057653. While it is not possible in all `TreeNode` implementations, the 2 `ConcreteTreeNode` implementations (`ExprContext` and `PlanContext`) own their children in `Vec` form, so there is a way to return a slice like `&[Self]` (instead of the current `Vec<&Self>`).
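A hedged sketch of why the slice return works here but not for `LogicalPlan`, `ExecutionPlan`, or `PhysicalExpr`: a type that owns its children as `Vec<Self>` can hand out a borrowed `&[Self]` with no allocation, whereas children stored behind `Arc<dyn ...>` trait objects cannot be reborrowed that way. The struct below is a mock, not the real `PlanContext`.

```rust
// Mock of a ConcreteTreeNode-style context that owns its children directly.
struct Node {
    value: i32,
    children: Vec<Node>,
}

impl Node {
    // Borrowing `&[Self]` avoids allocating a fresh `Vec<&Self>` on every
    // call, which is what the current trait signature forces.
    fn children(&self) -> &[Node] {
        &self.children
    }
}

fn main() {
    let root = Node {
        value: 1,
        children: vec![
            Node { value: 2, children: vec![] },
            Node { value: 3, children: vec![] },
        ],
    };
    println!("{}", root.children().len());
}
```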
Re: [I] Support `date_bin` on timestamps with timezone, properly accounting for Daylight Savings Time [datafusion]
alamb commented on issue #10602: URL: https://github.com/apache/datafusion/issues/10602#issuecomment-2131254591 Thank you @tustvold and @Abdullahsab3 and @mhilton and @appletreeisyellow for the thoughts. From my perspective, the current (non timezone aware) `date_bin` function has the benefit of being (relatively) simple to implement and fast (as it is some arithmetic calculation on the underlying integer value, without having to do DST conversions per row). Given the differences in underlying timestamp representation between arrow and postgres, I do think some difference is likely inevitable and thus likely not a deal breaker. Here are my suggested next steps: @appletreeisyellow tries to prototype one or both proposals and see if we can get it to produce the desired results: 1. create a `to_local_time` function 2. Modify the `date_bin` function to make it timezone aware I think `to_local_time` might be the simplest approach: * A ScalarUDF * a `Signature` that takes `Timestamp(.., *)` * produces a `Timestamp(.., None)` * The `invoke` function would do the opposite of whatever `cast(Timestamp(.., None) --> Timestamp(.., TZ))` does
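A rough sketch of the "arithmetic calculation on the underlying integer value" that makes the non-timezone-aware `date_bin` fast. This is illustration only: the real DataFusion kernel also handles nanosecond units, month/year strides, and more origin handling.

```rust
// Bin a raw timestamp (e.g. seconds since the epoch) down to the start of its
// `stride`-sized bucket, measured from `origin`. Pure integer arithmetic --
// no per-row timezone or DST conversion is needed.
fn date_bin_arith(ts: i64, stride: i64, origin: i64) -> i64 {
    // floor division so timestamps before the origin still round downward
    origin + (ts - origin).div_euclid(stride) * stride
}

fn main() {
    // 3700s after the epoch falls in the [3600, 7200) hour bucket
    println!("{}", date_bin_arith(3700, 3600, 0));
}
```

The DST discussion above is exactly about when this arithmetic is insufficient: "one day" in local time is not a fixed number of seconds across a DST boundary, which is what motivates the `to_local_time` proposal.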
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
alamb commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1614654486 ## datafusion/sqllogictest/test_files/joins.slt: ## @@ -1778,7 +1778,7 @@ AS VALUES ('BB', 6, 1); query TII -select col1, col2, coalesce(sum_col3, 0) as sum_col3 +select col1, col2, coalesce(sum_col3, arrow_cast(0, 'UInt64')) as sum_col3 Review Comment: maybe also we can change this test back too
Re: [PR] Fix typo in Cargo.toml (unused manifest key: dependencies.regex.worksapce) [datafusion]
alamb commented on PR #10662: URL: https://github.com/apache/datafusion/pull/10662#issuecomment-2131244635 > oops.. my mistake. :P Many thanks. No worries!
Re: [PR] build(deps): bump syn from 2.0.63 to 2.0.64 [datafusion-python]
dependabot[bot] commented on PR #706: URL: https://github.com/apache/datafusion-python/pull/706#issuecomment-2131243514 Superseded by #717.
[PR] build(deps): bump syn from 2.0.63 to 2.0.66 [datafusion-python]
dependabot[bot] opened a new pull request, #717: URL: https://github.com/apache/datafusion-python/pull/717

Bumps [syn](https://github.com/dtolnay/syn) from 2.0.63 to 2.0.66.

Release notes (sourced from syn's releases):
- 2.0.66: Allow braced structs when parsing ExprLet (#1671)
- 2.0.65: Optimize the implementation of Fold to compile faster (#1666, #1667, #1668)
- 2.0.64: Support using ParseBuffer across catch_unwind (#1646); validate that the expression in a let-else ends in brace as required by rustc (#1648, #1649); legalize invalid const generic arguments by wrapping in braces (#1654, #1655); fix some expression precedence edge cases involving break and return in loop headers (#1656); always print closure bodies with a brace when the closure has an explicit return type (#1658); automatically insert necessary parentheses in ToTokens for Expr when required by expression precedence (#1659); support struct literal syntax in match guard expressions (#1662)

Commits:
- b992916 Release 2.0.66
- 4f0a23f Merge pull request #1671 from dtolnay/exprlet
- c6d87a7 Allow braced structs when parsing ExprLet
- 747f42f Update with proc-macro2 1.0.83's syntax tree sizes
- 9f2371e Release 2.0.65
- 4cd1813 Merge pull request #1668 from dtolnay/foldhelper
- ed54092 Eliminate gen::helper module
- eacc8ab Eliminate FoldHelper trait
- 6e20bb8 Merge pull request #1667 from dtolnay/punctuatedfold
- 9d95cab Optimize punctuated::fold

Additional commits viewable in the compare view (2.0.63...2.0.66).

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Dependabot commands and options -- you can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabo
Re: [PR] build(deps): bump syn from 2.0.63 to 2.0.64 [datafusion-python]
dependabot[bot] closed pull request #706: build(deps): bump syn from 2.0.63 to 2.0.64 URL: https://github.com/apache/datafusion-python/pull/706
Re: [PR] Move Median to `functions-aggregate` and Introduce Numeric signature [datafusion]
jayzhan211 commented on code in PR #10644: URL: https://github.com/apache/datafusion/pull/10644#discussion_r1614640283 ## datafusion/functions-aggregate/Cargo.toml: ## @@ -39,6 +39,7 @@ path = "src/lib.rs" [dependencies] arrow = { workspace = true } +arrow-schema = { workspace = true } Review Comment: yes, some macros have an inevitable `arrow-schema` dependency: https://github.com/apache/arrow-rs/issues/5676.
Re: [PR] change version to 38.0.1 [datafusion-python]
andygrove merged PR #716: URL: https://github.com/apache/datafusion-python/pull/716
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614639502 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -622,14 +590,89 @@ impl Cast { self.eval_mode, from_type, to_type, -)? +) +} +_ if Self::is_datafusion_spark_compatible(from_type, to_type) => { +// use DataFusion cast only when we know that it is compatible with Spark +Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?) } _ => { -// when we have no Spark-specific casting we delegate to DataFusion -cast_with_options(&array, to_type, &CAST_OPTIONS)? +// we should never reach this code because the Scala code should be checking +// for supported cast operations and falling back to Spark for anything that +// is not yet supported +Err(CometError::Internal(format!( +"Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}" +))) } }; -Ok(spark_cast(cast_result, from_type, to_type)) +Ok(spark_cast(cast_result?, from_type, to_type)) +} + +/// Determines if DataFusion supports the given cast in a way that is +/// compatible with Spark +fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool { +if from_type == to_type { +return true; +} +match from_type { +DataType::Boolean => matches!( +to_type, +DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Utf8 +), +DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!( +to_type, +DataType::Boolean +| DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Decimal128(_, _) +| DataType::Utf8 +), +DataType::Float32 | DataType::Float64 => matches!( +to_type, +DataType::Boolean +| DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +), +DataType::Decimal128(_, _) | DataType::Decimal256(_, _) => matches!( +to_type, +DataType::Int8 
+| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Decimal128(_, _) +| DataType::Decimal256(_, _) +), +DataType::Utf8 => matches!(to_type, DataType::Binary), +DataType::Date32 => matches!(to_type, DataType::Utf8), +DataType::Timestamp(_, _) => { +matches!( +to_type, +DataType::Int64 | DataType::Date32 | DataType::Utf8 | DataType::Timestamp(_, _) +) +} +DataType::Binary => { +// note that this is not completely Spark compatible because +// DataFusion only supports binary data containing valid UTF-8 strings +matches!(to_type, DataType::Utf8) +} +_ => false, Review Comment: Casting from `Int64` to `Int32` for `Try` is covered here: ```rust DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!( to_type, DataType::Boolean | DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 | DataType::Float32 | DataType::Float64 | DataType::Decimal128(_, _) | DataType::Utf8 ), ```
Re: [PR] fix: Only delegate to DataFusion cast when we know that it is compatible with Spark [datafusion-comet]
andygrove commented on code in PR #461: URL: https://github.com/apache/datafusion-comet/pull/461#discussion_r1614638361 ## core/src/execution/datafusion/expressions/cast.rs: ## @@ -622,14 +590,89 @@ impl Cast { self.eval_mode, from_type, to_type, -)? +) +} +_ if Self::is_datafusion_spark_compatible(from_type, to_type) => { +// use DataFusion cast only when we know that it is compatible with Spark +Ok(cast_with_options(&array, to_type, &CAST_OPTIONS)?) } _ => { -// when we have no Spark-specific casting we delegate to DataFusion -cast_with_options(&array, to_type, &CAST_OPTIONS)? +// we should never reach this code because the Scala code should be checking +// for supported cast operations and falling back to Spark for anything that +// is not yet supported +Err(CometError::Internal(format!( +"Native cast invoked for unsupported cast from {from_type:?} to {to_type:?}" +))) } }; -Ok(spark_cast(cast_result, from_type, to_type)) +Ok(spark_cast(cast_result?, from_type, to_type)) +} + +/// Determines if DataFusion supports the given cast in a way that is +/// compatible with Spark +fn is_datafusion_spark_compatible(from_type: &DataType, to_type: &DataType) -> bool { +if from_type == to_type { +return true; +} +match from_type { +DataType::Boolean => matches!( +to_type, +DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Utf8 +), +DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => matches!( +to_type, +DataType::Boolean +| DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 +| DataType::Decimal128(_, _) +| DataType::Utf8 +), +DataType::Float32 | DataType::Float64 => matches!( +to_type, +DataType::Boolean +| DataType::Int8 +| DataType::Int16 +| DataType::Int32 +| DataType::Int64 +| DataType::Float32 +| DataType::Float64 Review Comment: Yes, that is correct.
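The allowlist pattern discussed in this PR can be sketched standalone. The enum below is a small mock of Arrow's `DataType` and the compatibility pairs are illustrative only, not Comet's full matrix: the point is the shape of the dispatch, `if from == to` short-circuit plus `matches!` arms, with anything unlisted falling through to `false` so it is routed back to Spark.

```rust
// Mock of a few Arrow DataType variants, for illustration only.
#[derive(PartialEq)]
enum DataType {
    Boolean,
    Int32,
    Int64,
    Float64,
    Utf8,
    Binary,
}

// A guard like `_ if is_compatible(from, to)` can then route only the
// known-compatible pairs to DataFusion's cast kernel.
fn is_compatible(from: &DataType, to: &DataType) -> bool {
    if from == to {
        return true;
    }
    match from {
        DataType::Boolean => {
            matches!(to, DataType::Int32 | DataType::Int64 | DataType::Utf8)
        }
        DataType::Int32 | DataType::Int64 => matches!(
            to,
            DataType::Boolean
                | DataType::Int32
                | DataType::Int64
                | DataType::Float64
                | DataType::Utf8
        ),
        // gated explicitly: DataFusion only casts binary holding valid UTF-8
        DataType::Binary => matches!(to, DataType::Utf8),
        // anything unlisted is not known to be Spark-compatible
        _ => false,
    }
}

fn main() {
    println!("{}", is_compatible(&DataType::Int64, &DataType::Int32));
}
```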
[PR] Remove `GetFieldAccessSchema` [datafusion]
jayzhan211 opened a new pull request, #10665: URL: https://github.com/apache/datafusion/pull/10665 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes?
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2131238078 Thanks, @alamb! Actually, I tried to add some tests for `MemoryFunctionRegistry` before I fixed it, but I found everything was fine. I guess the reason is that we don't insert aliases when using `MemoryFunctionRegistry#register_udf`. https://github.com/apache/datafusion/blob/4709fc65f7debc143696fa3a23ab6569ec8a383c/datafusion/execution/src/registry.rs#L174-L176 Then, I found that the aliases are only inserted by `SessionState` in https://github.com/apache/datafusion/blob/4709fc65f7debc143696fa3a23ab6569ec8a383c/datafusion/core/src/execution/context/mod.rs#L2278-L2283 That's why I added tests for `SessionState`. I'm not sure, but maybe we should also insert aliases in `MemoryFunctionRegistry`. What do you think?
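A hypothetical sketch of the behavior being proposed, using hand-rolled types rather than DataFusion's real `FunctionRegistry`/`ScalarUDF`: `register_udf` inserts the function under its canonical name *and* every alias, which is what `SessionState` does today and what `MemoryFunctionRegistry` currently skips.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a UDF with a canonical name and aliases.
struct Udf {
    name: String,
    aliases: Vec<String>,
}

#[derive(Default)]
struct Registry {
    // maps every registered name (canonical or alias) to the canonical name
    udfs: HashMap<String, String>,
}

impl Registry {
    fn register_udf(&mut self, udf: &Udf) {
        self.udfs.insert(udf.name.clone(), udf.name.clone());
        // the step under discussion: also register every alias
        for alias in &udf.aliases {
            self.udfs.insert(alias.clone(), udf.name.clone());
        }
    }

    fn udf(&self, name: &str) -> Option<&String> {
        self.udfs.get(name)
    }
}

fn main() {
    let mut reg = Registry::default();
    reg.register_udf(&Udf {
        name: "to_unixtime".to_string(),
        aliases: vec!["to_unix_time".to_string()],
    });
    // lookup by alias resolves to the canonical function
    println!("{:?}", reg.udf("to_unix_time"));
}
```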
Re: [PR] Minor: Add tests showing aggregate behavior for NaNs [datafusion]
alamb commented on code in PR #10634: URL: https://github.com/apache/datafusion/pull/10634#discussion_r1614628009 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -4374,6 +4374,42 @@ GROUP BY dummy text1, text1, text1 +# Tests for aggregating with NaN values +statement ok +CREATE TABLE float_table ( +col_f32 FLOAT, +col_f32_nan FLOAT, +col_f64 DOUBLE, +col_f64_nan DOUBLE +) as VALUES +( -128.2, -128.2, -128.2, -128.2 ), +( 32768.3, arrow_cast('NAN','Float32'), 32768.3, 32768.3 ), +( 27.3,arrow_cast('NAN','Float64'), 27.3,27.3 ); Review Comment: Yes, excellent catch -- fixed in 43cb27e669d5ff28f76e70bf54447cbb76df16f2
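A plain-Rust illustration of why aggregate behavior on NaN deserves the explicit tests added in this PR: a naive comparison-based max and IEEE `f64::max` disagree once NaN is involved. This is about float semantics in general, not a claim about DataFusion's exact kernels.

```rust
// Naive max via `>`: every comparison with NaN is false, so a NaN that lands
// in the accumulator first is never replaced.
fn naive_max(values: &[f64]) -> Option<f64> {
    values.iter().copied().reduce(|acc, v| if v > acc { v } else { acc })
}

// IEEE-style max: `f64::max` returns the non-NaN operand when exactly one
// argument is NaN, so NaN inputs are effectively ignored.
fn nan_ignoring_max(values: &[f64]) -> Option<f64> {
    values.iter().copied().reduce(f64::max)
}

fn main() {
    let vals = [f64::NAN, 1.0, 3.0];
    println!("{:?} {:?}", naive_max(&vals), nan_ignoring_max(&vals));
}
```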
Re: [PR] More properly handle nullability of types/literals in Substrait [datafusion]
alamb commented on PR #10640: URL: https://github.com/apache/datafusion/pull/10640#issuecomment-2131235829 Thanks @Blizzara and @jonahgao
Re: [PR] Implement a dialect-specific rule for unparsing an identifier with or without quotes [datafusion]
alamb commented on PR #10573: URL: https://github.com/apache/datafusion/pull/10573#issuecomment-2131235504 > This shouldn't have passed checks. > > ``` > + cargo fmt --all -- --check > `cargo metadata` exited with an error: error: failed to load manifest for workspace member `/opt/dev/datafusion/datafusion/core` > referenced by workspace at `/opt/dev/datafusion/Cargo.toml` > > Caused by: > failed to load manifest for dependency `datafusion-functions` > > Caused by: > failed to parse manifest at `/opt/dev/datafusion/datafusion/functions/Cargo.toml` > > Caused by: > dependency (regex) specified without providing a local path, Git repository, version, or workspace dependency to use > ``` > > functions/Cargo.toml > > ``` > regex = { worksapce = true, optional = true } > ``` Yeah, I don't know why that is a warning and not an error -- here is a PR to fix it: https://github.com/apache/datafusion/pull/10662
Re: [PR] Minor: Csv Options Clean-up [datafusion]
alamb commented on PR #10650: URL: https://github.com/apache/datafusion/pull/10650#issuecomment-2131235103

> Maybe we can do the following: Instead of making every TableProviderFactory responsible for implementing this behavior in create, we can add some general mechanism at the trait level (maybe via a default method implementation?) to make this fallback automatic for all TableProviderFactory implementations unless overridden. @alamb, what do you think? Any ideas come to mind?

Thank you for the explanation -- what you describe makes sense -- maybe we can update the documentation to add some of this context and that would be good enough. I haven't thought through all the nuances of general fallbacks, but given what you say perhaps it is a corner case we can worry about if/when someone actually hits it. From my perspective the most important thing is for the behavior to be clearly documented.
Re: [I] Library Guide: Building LogicalPlans [datafusion]
alamb commented on issue #7306: URL: https://github.com/apache/datafusion/issues/7306#issuecomment-2131234075

> What do you think we should do? I can add that paragraph, but maybe we also want to link the examples? What else do you think is needed to close this?

Those both sound like excellent ideas to close off this issue. Thank you @edmondop
Re: [I] Selecting struct field within field produces unexpected results [datafusion-python]
timsaucer commented on issue #715: URL: https://github.com/apache/datafusion-python/issues/715#issuecomment-2131229316

My statement above about testing on the Rust side is likely incorrect. I ran the same test above, but loading the dataframe from a parquet file instead of creating it in memory, and the expected behavior is reproduced. If you append these lines to the bottom of the minimal example

```
df.write_parquet("save_out.parquet")
df_reread = ctx.read_parquet("save_out.parquet")
df_reread.show()
df_reread.select(col("a")["outer_1"]["inner_2"]).show()
```

you get the expected result

```
DataFrame()
+-------------------------------------+
| a                                   |
+-------------------------------------+
| {outer_1: {inner_1: 1, inner_2: 2}} |
| {outer_1: {inner_1: 1, inner_2: }}  |
| {outer_1: }                         |
+-------------------------------------+
DataFrame()
+-----------------------------+
| ?table?.a[outer_1][inner_2] |
+-----------------------------+
| 2                           |
|                             |
|                             |
+-----------------------------+
```

It also shows the original table is reproduced. I'll continue digging, but I am no longer convinced this is a Python binding issue.
Re: [I] Support `date_bin` on timestamps with timezone, properly accounting for Daylight Savings Time [datafusion]
Abdullahsab3 commented on issue #10602: URL: https://github.com/apache/datafusion/issues/10602#issuecomment-2131223669

Thanks for filing the ticket and for all the detailed explanations! Very enriching. I wonder whether the Postgres behavior is actually that bad. Though it looks weird, it is still generic enough to be widely applicable. The issue with making `date_bin` specifically timezone-aware, in my opinion, is that future or similar grouping functions (or rolling window functions) will also have to be implemented in a timezone-aware way, accounting for similar pitfalls and edge cases. The same problem will also occur in other functionality, for example if you would like to return local time. I personally think that the problem might be better tackled as an additional time functionality, which may be the same way that Postgres does it. The Postgres way of converting UTC to local time in a given timezone is:

```sql
select '2024-05-21T12:00:00Z'::timestamp AT TIME ZONE 'UTC' AT TIME ZONE 'Europe/Brussels';

      timezone
---------------------
 2024-05-21 14:00:00
```

As already mentioned by Andrew, the `AT TIME ZONE` operator in Postgres converts a timezone-aware timestamp to local time (with no offset or timezone information), and local time to a timezone-aware timestamp. Though the overloaded functionalities of the `AT TIME ZONE` operator in Postgres are weird, they are definitely usable in my opinion. I also like the idea of having a separate `to_local_time`/`strip_timezone` function, though this would break DataFusion's compatibility with Postgres.
Re: [I] [EPIC] Improved support for nested / structured types (`Struct` , `List`, `ListArray`, and other Composite types) [datafusion]
alamb commented on issue #2326: URL: https://github.com/apache/datafusion/issues/2326#issuecomment-2131214534 > I added an issue to support recursive unnest: #10660, i think it shoul belong to this epic Added
Re: [PR] Add `FileScanConfig::new()` API [datafusion]
alamb commented on PR #10623: URL: https://github.com/apache/datafusion/pull/10623#issuecomment-2131213174 Thanks everyone for the reviews!
Re: [PR] Add `FileScanConfig::new()` API [datafusion]
alamb merged PR #10623: URL: https://github.com/apache/datafusion/pull/10623
Re: [I] Support convert LogicalPlan JOIN with `Using` constraint to SQL String [datafusion]
alamb commented on issue #10652: URL: https://github.com/apache/datafusion/issues/10652#issuecomment-2131211386

> Is there any plan to support them?

I think we should make plans to support them! I started collecting issues on https://github.com/apache/datafusion/issues/8661

I also filed https://github.com/apache/datafusion/issues/10663 and https://github.com/apache/datafusion/issues/10664 to track distinct and window

Looking at https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/src/unparser/plan.rs#L83-L91 it actually looks like there are several other types of plans not yet supported (like EXPLAIN, INSERT, etc...) Maybe we should file tickets for them as well
[I] Suport unparsing `LogicalPlan::Window` to SQL [datafusion]
alamb opened a new issue, #10664: URL: https://github.com/apache/datafusion/issues/10664

### Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/9726 to complete the LogicalPlan --> SQL conversion. Converting `LogicalPlan` back to `SQL` is valuable for several use cases, such as using DataFusion to programmatically create SQL -- see the [`plan_to_sql.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs) example

### Describe the solution you'd like

Support converting SQL with window functions like this:

```sql
SELECT first_value OVER (x ORDER BY y) FROM foo;
```

### Describe alternatives you've considered

The basic pattern is this (see https://github.com/apache/datafusion/pull/10371 for an example):

1. Implement the LogicalPlan --> AST reverse code in `Unparser::plan_to_sql` ([source link](https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/src/unparser/plan.rs#L411-L413))
2. Add a test in `roundtrip_statement` in `sql_integration.rs` ([source link](https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/tests/sql_integration.rs#L4619))

Note you can run the tests like

```shell
cargo test -p datafusion-sql -- roundtrip_statement
```

### Additional context

I think this is a good first issue as the pattern is well established and there are explicit instructions
[I] Suport unparsing `LogicalPlan::Distinct` to `DISTINCT` [datafusion]
alamb opened a new issue, #10663: URL: https://github.com/apache/datafusion/issues/10663

### Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/9726 to complete the LogicalPlan --> SQL conversion. Converting `LogicalPlan` back to `SQL` is valuable for several use cases, such as using DataFusion to programmatically create SQL -- see the [`plan_to_sql.rs`](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/plan_to_sql.rs) example

### Describe the solution you'd like

Support converting SQL like this:

```sql
SELECT DISTINCT x FROM foo;
```

### Describe alternatives you've considered

The basic pattern is this (see https://github.com/apache/datafusion/pull/10371 for an example):

1. Implement the LogicalPlan --> AST reverse code in `Unparser::plan_to_sql` ([source link](https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/src/unparser/plan.rs#L275))
2. Add a test in `roundtrip_statement` in `sql_integration.rs` ([source link](https://github.com/apache/datafusion/blob/8bedecc00b2f1f04d7b1a907152ce0d19b7046a5/datafusion/sql/tests/sql_integration.rs#L4619))

Note you can run the tests like

```shell
cargo test -p datafusion-sql -- roundtrip_statement
```

### Additional context

I think this is a good first issue as the pattern is well established and there are explicit instructions
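The LogicalPlan --> SQL direction in step 1 can be illustrated with a stdlib-only toy sketch (hypothetical types, not DataFusion's actual `Unparser` or `LogicalPlan` API): a miniature plan enum whose `Distinct` variant wraps its input and unparses by splicing `DISTINCT` into the generated `SELECT`:

```rust
// Toy plan: just enough structure to show the unparse recursion a
// Distinct variant needs. Names here are illustrative assumptions.
enum Plan {
    TableScan { table: String, column: String },
    Distinct(Box<Plan>),
}

fn plan_to_sql(plan: &Plan) -> String {
    match plan {
        Plan::TableScan { table, column } => {
            format!("SELECT {column} FROM {table}")
        }
        // Unparse the input first, then make the projection DISTINCT.
        Plan::Distinct(input) => {
            plan_to_sql(input).replacen("SELECT ", "SELECT DISTINCT ", 1)
        }
    }
}

fn main() {
    let plan = Plan::Distinct(Box::new(Plan::TableScan {
        table: "foo".to_string(),
        column: "x".to_string(),
    }));
    // Round-trip target from the issue: SELECT DISTINCT x FROM foo
    println!("{}", plan_to_sql(&plan));
}
```

The real implementation builds a sqlparser AST rather than strings, but the recursion shape (unparse the child, then wrap or annotate the result) is the same idea the roundtrip test exercises.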
Re: [PR] Fix typo in Cargo.toml (unused manifest key: dependencies.regex.worksapce) [datafusion]
jonahgao merged PR #10662: URL: https://github.com/apache/datafusion/pull/10662
Re: [I] Make TaskContext wrap SessionState [datafusion]
tustvold commented on issue #10631: URL: https://github.com/apache/datafusion/issues/10631#issuecomment-2131201943 IIRC SessionConfig is the static configuration used to create a SessionContext, which is an interior mutable wrapper around SessionState. The idea was that a query is planned against an immutable SessionState to avoid inconsistency during planning. RuntimeEnv is then for things shared across multiple sessions. As for TaskContext, I believe it was added for Ballista, where execution takes place on different nodes from planning.
Re: [I] Make TaskContext wrap SessionState [datafusion]
alamb commented on issue #10631: URL: https://github.com/apache/datafusion/issues/10631#issuecomment-2131197268 > I see TaskContext is just part of SessionState. Could we just make TaskContext wrap SessionState? I can't remember why TaskContext doesn't wrap SessionState -- maybe @tustvold does 🤔 At some point `SessionState` was in `datafusion` (the core crate) so it couldn't be referenced by subcrates.
Re: [I] Implement protobuf serialization for LogicalPlan::Unnest [datafusion]
alamb commented on issue #10639: URL: https://github.com/apache/datafusion/issues/10639#issuecomment-2131196191 Thanks @jamesmcm -- this would be a good first issue I think for someone with Rust experience looking to help with DataFusion. The protobuf serialization code is a bit hairy I find, but there are many examples of how to do it that we can follow
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on PR #10661: URL: https://github.com/apache/datafusion/pull/10661#issuecomment-2131195735 Thank you @goldmedal 🙏
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
alamb commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614546056

## datafusion/core/src/execution/context/mod.rs:

```diff
@@ -2860,6 +2863,57 @@ mod tests {
         Ok(())
     }
+
+    #[tokio::test]
+    async fn test_register_default_functions() -> Result<()> {
+        let config = SessionConfig::new();
+        let catalog_list =
+            Arc::new(MemoryCatalogProviderList::new()) as Arc;
+        let mut new_state = SessionState {
+            session_id: Uuid::new_v4().to_string(),
+            analyzer: Analyzer::new(),
+            optimizer: Optimizer::new(),
+            physical_optimizers: PhysicalOptimizer::new(),
+            query_planner: Arc::new(DefaultQueryPlanner {}),
+            catalog_list,
+            table_functions: HashMap::new(),
+            scalar_functions: HashMap::new(),
+            aggregate_functions: HashMap::new(),
+            window_functions: HashMap::new(),
+            serializer_registry: Arc::new(EmptySerializerRegistry),
+            table_option_namespace: TableOptions::default_from_session_config(
+                config.options(),
+            ),
+            config,
+            execution_props: ExecutionProps::new(),
+            runtime_env: Arc::new(RuntimeEnv::default()),
+            table_factories: HashMap::new(),
+            function_factory: None,
+        };
+
+        for function in all_default_functions() {
```

Review Comment: This is a nice test -- though it is basically a copy of `SessionContext::new`. I wonder if we could make a test that is a bit simpler, for example by creating a new https://docs.rs/datafusion/latest/datafusion/execution/registry/struct.MemoryFunctionRegistry.html

Something like (untested):

```rust
let mut registry = MemoryFunctionRegistry::new();
for function in all_default_array_functions() {
    let existing_function = registry.register_udf(function)?;
    assert!(
        existing_function.is_none(),
        "{} was already registered",
        function.name()
    );
}
// and similarly for the aggregate and array functions
```

🤔
Re: [PR] Update substrait requirement from 0.33.3 to 0.34.0 [datafusion]
alamb merged PR #10632: URL: https://github.com/apache/datafusion/pull/10632
Re: [PR] Move Median to `functions-aggregate` and Introduce Numeric signature [datafusion]
alamb commented on code in PR #10644: URL: https://github.com/apache/datafusion/pull/10644#discussion_r1614539590

## datafusion/functions-aggregate/src/median.rs:

```diff
@@ -15,71 +15,105 @@
 // specific language governing permissions and limitations
 // under the License.

-//! # Median
-
-use crate::aggregate::utils::{down_cast_any_ref, Hashable};
-use crate::expressions::format_state_name;
-use crate::{AggregateExpr, PhysicalExpr};
-use arrow::array::{Array, ArrayRef};
-use arrow::datatypes::{DataType, Field};
-use arrow_array::cast::AsArray;
-use arrow_array::{downcast_integer, ArrowNativeTypeOp, ArrowNumericType};
-use arrow_buffer::ArrowNativeType;
-use datafusion_common::{DataFusionError, Result, ScalarValue};
-use datafusion_expr::Accumulator;
-use std::any::Any;
 use std::collections::HashSet;
 use std::fmt::Formatter;
-use std::sync::Arc;
+use std::{fmt::Debug, sync::Arc};
+
+use arrow::array::{downcast_integer, ArrowNumericType};
+use arrow::{
+    array::{ArrayRef, AsArray},
+    datatypes::{
+        DataType, Decimal128Type, Decimal256Type, Field, Float16Type, Float32Type,
+        Float64Type,
+    },
+};
+
+use arrow::array::Array;
+use arrow::array::ArrowNativeTypeOp;
+use arrow::datatypes::ArrowNativeType;
+
+use datafusion_common::{DataFusionError, Result, ScalarValue};
+use datafusion_expr::function::StateFieldsArgs;
+use datafusion_expr::{
+    function::AccumulatorArgs, utils::format_state_name, Accumulator, AggregateUDFImpl,
+    Signature, Volatility,
+};
+use datafusion_physical_expr_common::aggregate::utils::Hashable;
+
+make_udaf_expr_and_func!(
+    Median,
+    median,
+    expression,
+    "Computes the median of a set of numbers",
+    median_udaf
+);

-/// MEDIAN aggregate expression. If using the non-distinct variation, then this uses a
```

Review Comment: I think this comment still holds and would be valuable to bring to the new `Median` UDF -- specifically, that median uses substantial memory.

## datafusion/expr/src/signature.rs:

```diff
@@ -119,6 +119,8 @@ pub enum TypeSignature {
     OneOf(Vec),
     /// Specifies Signatures for array functions
     ArraySignature(ArrayFunctionSignature),
+    /// Fixed number of arguments of numeric types
```

Review Comment: Could we please document (or link to documentation) what types are "numeric"? Is it https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#method.is_numeric ?

## datafusion/optimizer/src/single_distinct_to_groupby.rs:

```diff
@@ -257,6 +257,56 @@ impl OptimizerRule for SingleDistinctToGroupBy {
             )))
         }
     }
+    Expr::AggregateFunction(AggregateFunction {
```

Review Comment: Can we please add a test (if not already done) showing that the distinct aggregate is rewritten? Or perhaps there is already a test for median 🤔

## datafusion/functions-aggregate/Cargo.toml:

```diff
@@ -39,6 +39,7 @@ path = "src/lib.rs"

 [dependencies]
 arrow = { workspace = true }
+arrow-schema = { workspace = true }
```

Review Comment: is the issue that this is not re-exported in arrow?

## datafusion/optimizer/src/single_distinct_to_groupby.rs:

```diff
@@ -257,6 +257,56 @@ impl OptimizerRule for SingleDistinctToGroupBy {
             )))
         }
     }
+    Expr::AggregateFunction(AggregateFunction {
```

Review Comment: From what I can tell this is now a general optimization (as in it will rewrite *any* distinct user defined aggregate as well). If so, that is quite cool. Is it correct?
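The memory point in the first review comment is the crux: a non-distinct median cannot be computed incrementally, so an accumulator must buffer every input value before the final sort-and-pick step. A stdlib-only toy sketch of that (not DataFusion's actual `Median` accumulator, which works generically over Arrow numeric types):

```rust
// Toy median over i64: every value must be retained until the end,
// which is why memory use grows linearly with input size.
fn median(values: &mut Vec<i64>) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_unstable();
    let n = values.len();
    Some(if n % 2 == 1 {
        // Odd count: the middle element is the median.
        values[n / 2] as f64
    } else {
        // Even count: average the two middle elements.
        (values[n / 2 - 1] + values[n / 2]) as f64 / 2.0
    })
}

fn main() {
    let mut buffered = vec![5, 1, 4, 2, 3];
    // All five values had to be buffered to find the middle one.
    assert_eq!(median(&mut buffered), Some(3.0));
    println!("median ok");
}
```

Contrast with sum or count, whose accumulators need only O(1) state; this asymmetry is what the old doc comment on `MEDIAN aggregate expression` was warning about.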
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614539989

## datafusion/core/src/execution/context/mod.rs:

```diff
@@ -2893,21 +2893,21 @@ mod tests {
     for function in all_default_functions() {
         let udf = new_state.register_udf(function).unwrap();
         if let Some(udf) = udf {
-            assert!(false, "Function {} already registered", udf.name());
+            unreachable!("Function {} already registered", udf.name());
         }
     }
     for function in all_default_array_functions() {
         let udf = new_state.register_udf(function).unwrap();
         if let Some(udf) = udf {
-            assert!(false, "Function {} already registered", udf.name());
+            unreachable!("Function {} already registered", udf.name());
         }
     }
     for function in all_default_aggregate_functions() {
         let udaf = new_state.register_udaf(function).unwrap();
         if let Some(udaf) = udaf {
-            assert!(false, "Function {} already registered", udaf.name());
+            unreachable!("Function {} already registered", udaf.name());
```

Review Comment: Clippy suggests that I use `panic!` or `unreachable!` instead of `assert!`. I'm not sure which one is better.
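The behavior both versions test (registration returns the previously stored entry, so a duplicate surfaces as `Some`) can be sketched with a plain `HashMap`, which also shows the third option clippy would accept: asserting directly on the `Option` instead of `if let Some + unreachable!`. This is a stdlib-only toy, not DataFusion's registry API:

```rust
use std::collections::HashMap;

// Toy function registry: like HashMap::insert, registering a name that
// already exists returns the previously registered entry.
fn register(
    reg: &mut HashMap<String, &'static str>,
    name: &str,
    f: &'static str,
) -> Option<&'static str> {
    reg.insert(name.to_string(), f)
}

fn main() {
    let mut reg = HashMap::new();
    // First registration: no prior entry, so a single assert on the
    // Option replaces the if-let + unreachable! pattern.
    let prev = register(&mut reg, "median", "median_impl");
    assert!(prev.is_none(), "median was already registered");
    // A duplicate registration surfaces as Some(previous entry).
    let dup = register(&mut reg, "median", "median_impl_v2");
    assert_eq!(dup, Some("median_impl"));
    println!("registry ok");
}
```

`assert!(prev.is_none(), ...)` keeps the message and the check in one expression, which is why clippy flags `assert!(false, ...)` inside a conditional as redundant.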
Re: [PR] Remove duplicate function name in its aliases list [datafusion]
goldmedal commented on code in PR #10661: URL: https://github.com/apache/datafusion/pull/10661#discussion_r1614539494

## datafusion/optimizer/src/push_down_limit.rs:

```diff
@@ -65,7 +65,6 @@ impl OptimizerRule for PushDownLimit {
     };
     let Limit { skip, fetch, input } = limit;
-    let input = input;
```

Review Comment: I'm not sure why the clippy check in CI didn't detect this, but it showed up when I ran clippy locally.
Re: [PR] Fix `NULL["field"]` for expr_API [datafusion]
alamb commented on PR #10655: URL: https://github.com/apache/datafusion/pull/10655#issuecomment-2131186851 Thank you for the speedy review @jayzhan211
Re: [PR] Fix `NULL["field"]` for expr_API [datafusion]
alamb merged PR #10655: URL: https://github.com/apache/datafusion/pull/10655
Re: [I] Error on `NULL["field_name"]`: The expression to get an indexed field is only valid for `List`, `Struct`, or `Map` types, got Null [datafusion]
alamb closed issue #10654: Error on `NULL["field_name"]`: The expression to get an indexed field is only valid for `List`, `Struct`, or `Map` types, got Null URL: https://github.com/apache/datafusion/issues/10654
Re: [PR] Fix `NULL["field"]` for expr_API [datafusion]
alamb commented on code in PR #10655: URL: https://github.com/apache/datafusion/pull/10655#discussion_r1614536504

## datafusion/functions/src/core/getfield.rs:

```diff
@@ -106,6 +106,9 @@ impl ScalarUDFImpl for GetFieldFunc {
     };
     let access_schema = GetFieldAccessSchema::NamedStructField { name: name.clone() };
     let arg_dt = args[0].get_type(schema)?;
+    if arg_dt.is_null() {
+        return Ok(DataType::Null);
+    }
     access_schema
```

Review Comment: Thanks @jayzhan211 I agree the change in https://github.com/apache/datafusion/commit/6cb45b5a9f2b53688a9754b1a7d06665ebd73a6d to inline things looks nice -- shall we make another PR?
Re: [I] Incorrect statistics read for binary columns in parquet [datafusion]
alamb closed issue #10605: Incorrect statistics read for binary columns in parquet URL: https://github.com/apache/datafusion/issues/10605
Re: [PR] Fix incorrect statistics read for binary columns in parquet [datafusion]
alamb merged PR #10645: URL: https://github.com/apache/datafusion/pull/10645
Re: [PR] feat: add substrait support for Interval types and literals [datafusion]
alamb commented on code in PR #10646: URL: https://github.com/apache/datafusion/pull/10646#discussion_r1614534288 ## datafusion/substrait/src/variation_const.rs: ## @@ -37,3 +38,58 @@ pub const DEFAULT_CONTAINER_TYPE_REF: u32 = 0; pub const LARGE_CONTAINER_TYPE_REF: u32 = 1; pub const DECIMAL_128_TYPE_REF: u32 = 0; pub const DECIMAL_256_TYPE_REF: u32 = 1; + +// For custom types +/// For [`DataType::Interval`] with [`IntervalUnit::YearMonth`]. +/// +/// An `i32` for elapsed whole months. See also [`ScalarValue::IntervalYearMonth`] +/// for the literal definition in DataFusion. +/// +/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval +/// [`IntervalUnit::YearMonth`]: datafusion::arrow::datatypes::IntervalUnit::YearMonth +/// [`ScalarValue::IntervalYearMonth`]: datafusion::common::ScalarValue::IntervalYearMonth +pub const INTERVAL_YEAR_MONTH_TYPE_REF: u32 = 1; + +/// For [`DataType::Interval`] with [`IntervalUnit::DayTime`]. +/// +/// An `i64` as: +/// - days: `i32` +/// - milliseconds: `i32` +/// +/// See also [`ScalarValue::IntervalDayTime`] for the literal definition in DataFusion. +/// +/// [`DataType::Interval`]: datafusion::arrow::datatypes::DataType::Interval +/// [`IntervalUnit::DayTime`]: datafusion::arrow::datatypes::IntervalUnit::DayTime +/// [`ScalarValue::IntervalDayTime`]: datafusion::common::ScalarValue::IntervalDayTime +pub const INTERVAL_DAY_TIME_TYPE_REF: u32 = 2; + +/// For [`DataType::Interval`] with [`IntervalUnit::MonthDayNano`]. +/// +/// An `i128` as: +/// - months: `i32` +/// - days: `i32` +/// - nanoseconds: `i64` +/// +/// See also [`ScalarValue::IntervalMonthDayNano`] for the literal definition in DataFusion. Review Comment: FYI, I think the interval implementation is changing in the next arrow release: https://github.com/apache/arrow-rs/pull/5769
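The packed layouts described in those doc comments can be illustrated with plain bit manipulation. The helpers below are illustrative only: the field ordering follows the comments above and is an assumption, not the arrow-rs API, and arrow-rs #5769 replaces these packed primitives with dedicated structs:

```rust
// IntervalDayTime: an i64 holding days (high i32) and milliseconds (low i32).
fn pack_day_time(days: i32, millis: i32) -> i64 {
    ((days as i64) << 32) | (millis as u32 as i64)
}

fn unpack_day_time(v: i64) -> (i32, i32) {
    ((v >> 32) as i32, v as i32)
}

// IntervalMonthDayNano: an i128 holding months (i32), days (i32), nanoseconds (i64).
fn pack_month_day_nano(months: i32, days: i32, nanos: i64) -> i128 {
    ((months as i128) << 96) | ((days as u32 as i128) << 64) | (nanos as u64 as i128)
}

fn unpack_month_day_nano(v: i128) -> (i32, i32, i64) {
    ((v >> 96) as i32, (v >> 64) as i32, v as i64)
}

fn main() {
    // 3 days and 1500 ms round-trip through the packed i64.
    let v = pack_day_time(3, 1500);
    assert_eq!(unpack_day_time(v), (3, 1500));
    // 14 months, 2 days, 5 seconds round-trip through the packed i128.
    let w = pack_month_day_nano(14, 2, 5_000_000_000);
    assert_eq!(unpack_month_day_nano(w), (14, 2, 5_000_000_000));
    println!("ok");
}
```

The separate fields are why these intervals cannot be compared as plain integers: a value with more days but fewer milliseconds may pack to a smaller i64 yet represent a longer duration.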
Re: [PR] fix wrong type validation on unnest expr [datafusion]
alamb merged PR #10657: URL: https://github.com/apache/datafusion/pull/10657
Re: [I] Wrong error thrown when unnesting a list of struct [datafusion]
alamb closed issue #10656: Wrong error thrown when unnesting a list of struct URL: https://github.com/apache/datafusion/issues/10656
Re: [PR] Minor: Generate the supported Spark builtin expression list into MD file [datafusion-comet]
advancedxy commented on code in PR #455: URL: https://github.com/apache/datafusion-comet/pull/455#discussion_r1614532699 ## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -217,6 +325,25 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH str shouldBe s"${getLicenseHeader()}\n# Supported Spark Expressions\n\n### group1\n - [x] f1\n - [ ] f2\n\n### group2\n - [x] f3\n - [ ] f4\n\n### group3\n - [x] f5" } + test("get sql function arguments") { +// getSqlFunctionArguments("SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z'))") shouldBe Seq("TIMESTAMP('1970-01-01 00:00:01Z')") +// getSqlFunctionArguments("SELECT decode(unhex('537061726B2053514C'), 'UTF-8')") shouldBe Seq("unhex('537061726B2053514C')", "'UTF-8'") +// getSqlFunctionArguments("SELECT extract(YEAR FROM TIMESTAMP '2019-08-12 01:00:00.123456')") shouldBe Seq("'YEAR'", "TIMESTAMP '2019-08-12 01:00:00.123456'") +// getSqlFunctionArguments("SELECT exists(array(1, 2, 3), x -> x % 2 == 0)") shouldBe Seq("array(1, 2, 3)") +getSqlFunctionArguments("select to_char(454, '999')") shouldBe Seq("array(1, 2, 3)") Review Comment: This test looks wrong: the expected arguments do not match the query.
## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -54,16 +57,79 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH private val valuesPattern = """(?i)FROM VALUES(.+?);""".r private val selectPattern = """(i?)SELECT(.+?)FROM""".r + // exclude funcs Comet has no plans to support streaming in near future + // like spark streaming functions, java calls + private val outofRoadmapFuncs = +List("window", "session_window", "window_time", "java_method", "reflect") + private val sqlConf = Seq( +"spark.comet.exec.shuffle.enabled" -> "true", +"spark.sql.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding", +"spark.sql.adaptive.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding") + + // Tests to run manually as its syntax is different from usual or nested + val manualTests: Map[String, (String, String)] = Map( +"!" -> ("select true a", "select ! true from tbl"), +"%" -> ("select 1 a, 2 b", "select a + b from tbl"), Review Comment: The mapped query should be `select a % b from tbl`?
## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -54,16 +57,79 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH private val valuesPattern = """(?i)FROM VALUES(.+?);""".r private val selectPattern = """(i?)SELECT(.+?)FROM""".r + // exclude funcs Comet has no plans to support streaming in near future + // like spark streaming functions, java calls + private val outofRoadmapFuncs = +List("window", "session_window", "window_time", "java_method", "reflect") + private val sqlConf = Seq( +"spark.comet.exec.shuffle.enabled" -> "true", +"spark.sql.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding", +"spark.sql.adaptive.optimizer.excludedRules" -> "org.apache.spark.sql.catalyst.optimizer.ConstantFolding") + + // Tests to run manually as its syntax is different from usual or nested + val manualTests: Map[String, (String, String)] = Map( +"!" -> ("select true a", "select ! true from tbl"), +"%" -> ("select 1 a, 2 b", "select a + b from tbl"), Review Comment: Or maybe you can just generate the binary operators and their mappings programmatically? Such as: ```scala Seq("%", "&", ..., "|").map(x => x -> ("select 1 a, 2 b", s"select a $x b from tbl")) ``` ## spark/src/test/scala/org/apache/comet/CometExpressionCoverageSuite.scala: ## @@ -116,20 +182,62 @@ class CometExpressionCoverageSuite extends CometTestBase with AdaptiveSparkPlanH // ConstantFolding is a operator optimization rule in Catalyst that replaces expressions // that can be statically evaluated with their equivalent literal values.
dfMessage = runDatafusionCli(q) - testSingleLineQuery( -"select 'dummy' x", -s"${q.dropRight(1)}, x from tbl", -excludedOptimizerRules = - Some("org.apache.spark.sql.catalyst.optimizer.ConstantFolding")) + + manualTests.get(func.name) match { +// the test is manual query +case Some(test) => testSingleLineQuery(test._1, test._2, sqlConf = sqlConf) +case None => + // extract function arguments as a sql text + // example: + // cos(0) -> 0 + // explode_outer(array(10, 20)) -> array(10, 20) + val args = getSqlFunctionArguments(q.dropRight(1)) + val (alias
Re: [PR] Add reference visitor `TreeNode` APIs, change `ExecutionPlan::children()` and `PhysicalExpr::children()` return references [datafusion]
peter-toth commented on code in PR #10543: URL: https://github.com/apache/datafusion/pull/10543#discussion_r1614522462 ## datafusion/physical-plan/src/work_table.rs: ## @@ -169,7 +169,7 @@ impl ExecutionPlan for WorkTableExec { &self.cache } -fn children(&self) -> Vec> { +fn children(&self) -> Vec<&Arc> { Review Comment: Actually, I don't think we can return a slice of references. Returning an empty slice here would be ok, but at other places where there are children to return (e.g. in `BinaryExpr`) we need to build a temporary container (vec or array) to store the references to the children and then return a slice of the container, but who will own the container?
Re: [PR] Add reference visitor `TreeNode` APIs, change `ExecutionPlan::children()` and `PhysicalExpr::children()` return references [datafusion]
peter-toth commented on code in PR #10543: URL: https://github.com/apache/datafusion/pull/10543#discussion_r1614503459 ## datafusion/physical-plan/src/work_table.rs: ## @@ -169,7 +169,7 @@ impl ExecutionPlan for WorkTableExec { &self.cache } -fn children(&self) -> Vec> { +fn children(&self) -> Vec<&Arc> { Review Comment: ~I think we could return a slice, but the current `Vec` is in sync with other implementations how we usually return children. Like `LogicalPlan::inputs()` or `ConcreteTreeNode::children()`. Or shall we change those too?~
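The ownership problem discussed in this thread can be sketched with a toy trait; `Plan`, `Leaf`, and `Binary` below are simplified stand-ins for DataFusion's `ExecutionPlan` implementations, not the real API:

```rust
use std::sync::Arc;

// Stand-in for ExecutionPlan: children are handed back as an owned Vec of
// references, the signature adopted in PR #10543.
trait Plan {
    fn children(&self) -> Vec<&Arc<dyn Plan>>;
}

struct Leaf;
impl Plan for Leaf {
    fn children(&self) -> Vec<&Arc<dyn Plan>> {
        vec![] // a leaf such as WorkTableExec has no children
    }
}

struct Binary {
    left: Arc<dyn Plan>,
    right: Arc<dyn Plan>,
}
impl Plan for Binary {
    // A `&[&Arc<dyn Plan>]` return type would not compile here: the slice
    // would borrow from a temporary container built inside the method, and
    // nothing would own that container once the call returns. Returning the
    // Vec itself sidesteps the ownership question.
    fn children(&self) -> Vec<&Arc<dyn Plan>> {
        vec![&self.left, &self.right]
    }
}

fn main() {
    let node = Binary {
        left: Arc::new(Leaf),
        right: Arc::new(Leaf),
    };
    assert_eq!(node.children().len(), 2);
    assert_eq!(node.children()[0].children().len(), 0);
    println!("ok");
}
```

The Vec is cheap relative to the old `Vec<Arc<dyn Plan>>` signature because it holds references rather than cloned `Arc`s, so no reference counts are touched.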
Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]
alamb commented on code in PR #10268: URL: https://github.com/apache/datafusion/pull/10268#discussion_r1614530863 ## datafusion/sqllogictest/test_files/select.slt: ## @@ -1473,7 +1473,7 @@ DROP TABLE t; # related to https://github.com/apache/datafusion/issues/8814 statement ok -create table t(x int, y int) as values (1,1), (2,2), (3,3), (0,0), (4,0); +create table t(x bigint, y bigint) as values (1,1), (2,2), (3,3), (0,0), (4,0); Review Comment: Could you please keep the existing test and add a new test that uses `bigint` so it is clearer what behavior is changing?
Re: [PR] fix: substring with negative indices should produce correct result [datafusion-comet]
sonhmai commented on PR #470: URL: https://github.com/apache/datafusion-comet/pull/470#issuecomment-2131173613 @viirya @andygrove can you take a look? Thanks
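For context on the semantics at stake in this PR: Spark's `substring` is 1-based, and a negative start position counts from the end of the string. The sketch below is an illustrative reimplementation of that behavior (loosely modeled on Spark's `UTF8String.substringSQL`), not Comet's actual code, and it assumes a non-negative length:

```rust
// Illustrative Spark-style substring: 1-based position, negative position
// counts from the end of the string. Assumes len >= 0.
fn spark_substring(s: &str, pos: i64, len: i64) -> String {
    let chars: Vec<char> = s.chars().collect();
    let n = chars.len() as i64;
    // pos > 0: 1-based from the front; pos < 0: from the end; pos == 0: like 1.
    let start = if pos > 0 {
        pos - 1
    } else if pos < 0 {
        n + pos
    } else {
        0
    };
    let end = start.saturating_add(len);
    // Clamp into the valid range; a start before the beginning still
    // consumes part of the length budget, matching Spark.
    let begin = start.clamp(0, n) as usize;
    let until = end.clamp(begin as i64, n) as usize;
    chars[begin..until].iter().collect()
}

fn main() {
    assert_eq!(spark_substring("Spark SQL", -3, 3), "SQL"); // from the end
    assert_eq!(spark_substring("Spark SQL", 1, 5), "Spark"); // 1-based start
    assert_eq!(spark_substring("abc", -5, 4), "ab"); // start before the beginning
    println!("ok");
}
```

The last case is the kind of edge the PR title refers to: a naive 0-based translation of negative indices produces a different (wrong) result there.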