Re: [PR] add catalog as part of the table path in plan_to_sql [datafusion]

2024-05-21 Thread via GitHub


phillipleblanc commented on code in PR #10612:
URL: https://github.com/apache/datafusion/pull/10612#discussion_r1609299261


##
datafusion/sql/src/unparser/plan.rs:
##
@@ -502,3 +505,35 @@ impl From for DataFusionError {
 DataFusionError::External(Box::new(e))
 }
 }
+
+#[cfg(test)]
+mod test {
+use crate::unparser::plan_to_sql;
+use arrow::datatypes::{DataType, Field, Schema};
+use datafusion_expr::{col, logical_plan::table_scan};
+#[test]
+fn test_plan_to_sql() {

Review Comment:
   ```suggestion
   fn test_table_references_in_plan_to_sql() {
   ```
   
   This test technically just covers the table references part.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] add catalog as part of the table path in plan_to_sql [datafusion]

2024-05-21 Thread via GitHub


y-f-u opened a new pull request, #10612:
URL: https://github.com/apache/datafusion/pull/10612

   ## Which issue does this PR close?
   
   `TableReference` carries catalog information, but it is not used in `plan_to_sql`.
   
   ## Rationale for this change
   
   It's common for a DBMS to support the `[catalog.][schema.]table_name` pattern in 
SQL statements, e.g. Snowflake.
   
   ## What changes are included in this PR?
   
   - Add catalog information if it exists in table reference.
   
   ## Are these changes tested?
   
   Yes
   
   ## Are there any user-facing changes?
   
   No
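
   The `[catalog.][schema.]table_name` rendering described above can be sketched as 
follows (a minimal illustration, not DataFusion's actual implementation — the real 
code works on `TableReference`):

```rust
// Minimal sketch of rendering a table reference as
// `[catalog.][schema.]table_name`: include each optional qualifier
// only when it is present.
fn table_ref_to_sql(catalog: Option<&str>, schema: Option<&str>, table: &str) -> String {
    let mut parts: Vec<&str> = Vec::new();
    if let Some(c) = catalog {
        parts.push(c);
    }
    if let Some(s) = schema {
        parts.push(s);
    }
    parts.push(table);
    parts.join(".")
}

fn main() {
    assert_eq!(
        table_ref_to_sql(Some("sf"), Some("public"), "orders"),
        "sf.public.orders"
    );
    assert_eq!(table_ref_to_sql(None, Some("public"), "orders"), "public.orders");
    assert_eq!(table_ref_to_sql(None, None, "orders"), "orders");
}
```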





Re: [PR] Implement a dialect-specific rule for unparsing an identifier with or without quotes [datafusion]

2024-05-21 Thread via GitHub


goldmedal commented on code in PR #10573:
URL: https://github.com/apache/datafusion/pull/10573#discussion_r1609280746


##
datafusion/sql/src/unparser/dialect.rs:
##
@@ -15,19 +15,33 @@
 // specific language governing permissions and limitations
 // under the License.
 
-/// Dialect is used to capture dialect specific syntax.
+use regex::Regex;
+use sqlparser::keywords::ALL_KEYWORDS;
+
+/// `Dialect` to usse for Unparsing

Review Comment:
   Thanks @phillipleblanc 






Re: [PR] Implement a dialect-specific rule for unparsing an identifier with or without quotes [datafusion]

2024-05-21 Thread via GitHub


phillipleblanc commented on code in PR #10573:
URL: https://github.com/apache/datafusion/pull/10573#discussion_r1609277646


##
datafusion/sql/src/unparser/dialect.rs:
##
@@ -15,19 +15,33 @@
 // specific language governing permissions and limitations
 // under the License.
 
-/// Dialect is used to capture dialect specific syntax.
+use regex::Regex;
+use sqlparser::keywords::ALL_KEYWORDS;
+
+/// `Dialect` to usse for Unparsing

Review Comment:
   ```suggestion
   /// `Dialect` to use for Unparsing
   ```
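
   The dialect rule under review can be sketched roughly like this (a simplified 
stand-in for the real implementation, which uses `regex` and 
`sqlparser::keywords::ALL_KEYWORDS`; the keyword list below is a hypothetical 
stand-in): an identifier is written bare only if it looks like a plain lowercase 
identifier and is not a reserved keyword, otherwise it is quoted.

```rust
// Simplified sketch of a dialect-specific quoting rule; the keyword slice and
// character test stand in for `ALL_KEYWORDS` and the regex used in the PR.
fn needs_quotes(ident: &str, keywords: &[&str]) -> bool {
    let mut chars = ident.chars();
    let plain = match chars.next() {
        Some(c) if c.is_ascii_lowercase() || c == '_' => {
            chars.all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
        }
        _ => false, // empty, or starts with an uppercase/odd character
    };
    !plain || keywords.iter().any(|k| k.eq_ignore_ascii_case(ident))
}

fn quote_identifier(ident: &str, keywords: &[&str]) -> String {
    if needs_quotes(ident, keywords) {
        // Double embedded quotes, per the usual SQL escaping convention.
        format!("\"{}\"", ident.replace('"', "\"\""))
    } else {
        ident.to_string()
    }
}

fn main() {
    let keywords = ["select", "from", "table"];
    assert_eq!(quote_identifier("my_col", &keywords), "my_col");
    assert_eq!(quote_identifier("Select", &keywords), "\"Select\"");
    assert_eq!(quote_identifier("order id", &keywords), "\"order id\"");
}
```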






Re: [PR] Implement a dialect-specific rule for unparsing an identifier with or without quotes [datafusion]

2024-05-21 Thread via GitHub


goldmedal commented on PR #10573:
URL: https://github.com/apache/datafusion/pull/10573#issuecomment-2123883816

   > Thank you @goldmedal -- I think this looks really nice
   > 
   > Thank you for the reviews @comphead
   > 
   > I left some suggestions for improvement but I think they could be done as 
follow on PRs as well.
   > 
   > cc @phillipleblanc and @devinjdangelo and @backkem
   
   Thanks @alamb !
   I think the suggestions are simple and reasonable, so I just fixed them 
in this PR right away.





Re: [PR] Implement a dialect-specific rule for unparsing an identifier with or without quotes [datafusion]

2024-05-21 Thread via GitHub


goldmedal commented on code in PR #10573:
URL: https://github.com/apache/datafusion/pull/10573#discussion_r1609270235


##
datafusion/sql/src/unparser/dialect.rs:
##
@@ -15,19 +15,30 @@
 // specific language governing permissions and limitations
 // under the License.
 
+use regex::Regex;
+use sqlparser::keywords::ALL_KEYWORDS;
+
 /// Dialect is used to capture dialect specific syntax.

Review Comment:
   Thanks. Looks nice.






Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]

2024-05-21 Thread via GitHub


appletreeisyellow commented on PR #10268:
URL: https://github.com/apache/datafusion/pull/10268#issuecomment-2123829586

   > I am not sure I have the time to do that in the next week -- maybe 
@appletreeisyellow does 🤔
   
   @alamb I'm happy to coordinate with our performance team and run 
influxdb_iox regression tests if needed





Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-21 Thread via GitHub


tshauck commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1609228014


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,26 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+Seq(true, false).foreach { dictionaryEnabled =>
+  withTempDir { dir =>
+val path = new Path(dir.toURI.toString, "hex.parquet")
+makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 
1)
+
+withParquetTable(path.toString, "tbl") {
+  // ints
+  checkSparkAnswerAndOperator("SELECT hex(_2), hex(_3), hex(_4), 
hex(_5) FROM tbl")
+
+  // uints, uint8 and uint16 not working yet
+  // checkSparkAnswerAndOperator("SELECT hex(_9), hex(_10), hex(_11), 
hex(_12) FROM tbl")
+  checkSparkAnswerAndOperator("SELECT hex(_11), hex(_12) FROM tbl")
+
+  // strings, binary
+  checkSparkAnswerAndOperator("SELECT hex(_8), hex(_14) FROM tbl")

Review Comment:
   I updated dictionary handling, so I've added _13 back in. I think only the 
smaller uints are omitted now.
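
   For context, Spark's `hex` maps integers to their 64-bit two's-complement 
uppercase hex form and strings/binary to two hex digits per byte. A rough sketch 
of these semantics (assumed and simplified, not Comet's actual implementation):

```rust
// Sketch of Spark-like `hex` semantics: integers render as their 64-bit
// two's-complement uppercase hex form; bytes render as two digits each.
fn hex_i64(v: i64) -> String {
    // Rust's `{:X}` formats signed integers by their two's-complement bits,
    // which matches Spark's behavior for negative inputs.
    format!("{:X}", v)
}

fn hex_bytes(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02X}", b)).collect()
}

fn main() {
    assert_eq!(hex_i64(17), "11");
    assert_eq!(hex_i64(-1), "FFFFFFFFFFFFFFFF");
    assert_eq!(hex_bytes(b"Spark"), "537061726B");
}
```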






Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-21 Thread via GitHub


tshauck commented on PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#issuecomment-2123824670

   @kazuyukitanimura @advancedxy ... re-requesting reviews from you two, 
please. I've updated the code to support dictionaries and removed some of the 
finer int types. Barring something happening on intel mac, the tests look to 
pass here: 
https://github.com/tshauck/arrow-datafusion-comet/actions/runs/9183637020 
thanks





Re: [PR] Minor: Generate the supported Spark builtin expression list into MD file [datafusion-comet]

2024-05-21 Thread via GitHub


advancedxy commented on code in PR #455:
URL: https://github.com/apache/datafusion-comet/pull/455#discussion_r1609204351


##
docs/spark_expressions_support.md:
##
@@ -0,0 +1,477 @@
+
+
+# Supported Spark Expressions
+
+### agg_funcs
+ - [ ] any
+ - [ ] any_value
+ - [ ] approx_count_distinct
+ - [ ] approx_percentile
+ - [ ] array_agg
+ - [ ] avg
+ - [ ] bit_and
+ - [ ] bit_or
+ - [ ] bit_xor
+ - [ ] bool_and
+ - [ ] bool_or
+ - [ ] collect_list
+ - [ ] collect_set
+ - [ ] corr
+ - [ ] count
+ - [ ] count_if
+ - [ ] count_min_sketch
+ - [ ] covar_pop
+ - [ ] covar_samp
+ - [ ] every
+ - [ ] first
+ - [ ] first_value
+ - [ ] grouping
+ - [ ] grouping_id
+ - [ ] histogram_numeric
+ - [ ] kurtosis
+ - [ ] last
+ - [ ] last_value
+ - [ ] max
+ - [ ] max_by
+ - [ ] mean
+ - [ ] median
+ - [ ] min
+ - [ ] min_by
+ - [ ] mode
+ - [ ] percentile
+ - [ ] percentile_approx
+ - [ ] regr_avgx
+ - [ ] regr_avgy
+ - [ ] regr_count
+ - [ ] regr_intercept
+ - [ ] regr_r2
+ - [ ] regr_slope
+ - [ ] regr_sxx
+ - [ ] regr_sxy
+ - [ ] regr_syy
+ - [ ] skewness
+ - [ ] some
+ - [ ] std
+ - [ ] stddev
+ - [ ] stddev_pop
+ - [ ] stddev_samp

Review Comment:
   Yeah.. Maybe we need to enable Comet Shuffle to re-run the 
CometExpressionCoverageSuite.



##
docs/spark_expressions_support.md:
##
@@ -0,0 +1,477 @@
+
+
+# Supported Spark Expressions
+
+### agg_funcs
+ - [ ] any
+ - [ ] any_value
+ - [ ] approx_count_distinct
+ - [ ] approx_percentile
+ - [ ] array_agg
+ - [ ] avg
+ - [ ] bit_and
+ - [ ] bit_or
+ - [ ] bit_xor
+ - [ ] bool_and
+ - [ ] bool_or
+ - [ ] collect_list
+ - [ ] collect_set
+ - [ ] corr
+ - [ ] count
+ - [ ] count_if
+ - [ ] count_min_sketch
+ - [ ] covar_pop
+ - [ ] covar_samp
+ - [ ] every
+ - [ ] first
+ - [ ] first_value
+ - [ ] grouping
+ - [ ] grouping_id
+ - [ ] histogram_numeric
+ - [ ] kurtosis
+ - [ ] last
+ - [ ] last_value
+ - [ ] max
+ - [ ] max_by
+ - [ ] mean
+ - [ ] median
+ - [ ] min
+ - [ ] min_by
+ - [ ] mode
+ - [ ] percentile
+ - [ ] percentile_approx
+ - [ ] regr_avgx
+ - [ ] regr_avgy
+ - [ ] regr_count
+ - [ ] regr_intercept
+ - [ ] regr_r2
+ - [ ] regr_slope
+ - [ ] regr_sxx
+ - [ ] regr_sxy
+ - [ ] regr_syy
+ - [ ] skewness
+ - [ ] some
+ - [ ] std
+ - [ ] stddev
+ - [ ] stddev_pop
+ - [ ] stddev_samp
+ - [ ] sum
+ - [ ] try_avg
+ - [ ] try_sum
+ - [ ] var_pop
+ - [ ] var_samp
+ - [ ] variance
+
+### array_funcs
+ - [ ] array
+ - [ ] array_append
+ - [ ] array_compact
+ - [ ] array_contains
+ - [ ] array_distinct
+ - [ ] array_except
+ - [ ] array_insert
+ - [ ] array_intersect
+ - [ ] array_join
+ - [ ] array_max
+ - [ ] array_min
+ - [ ] array_position
+ - [ ] array_remove
+ - [ ] array_repeat
+ - [ ] array_union
+ - [ ] arrays_overlap
+ - [ ] arrays_zip
+ - [ ] flatten
+ - [x] get
+ - [ ] sequence
+ - [ ] shuffle
+ - [ ] slice
+ - [ ] sort_array
+
+### bitwise_funcs
+ - [x] &
+ - [x] ^
+ - [ ] bit_count
+ - [ ] bit_get
+ - [ ] getbit
+ - [x] shiftright
+ - [ ] shiftrightunsigned
+ - [x] |
+ - [ ] ~
+
+### collection_funcs
+ - [ ] array_size
+ - [ ] cardinality
+ - [ ] concat
+ - [x] reverse
+ - [ ] size
+
+### conditional_funcs
+ - [x] coalesce
+ - [x] if
+ - [ ] ifnull
+ - [ ] nanvl
+ - [x] nullif
+ - [ ] nvl

Review Comment:
   Hmm, it should be supported? It's essentially the same as `coalesce`, which is 
replaced during the analysis phase.
   
   Maybe we should file an issue to track this kind of problem.
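
   The equivalence behind this comment — `nvl(a, b)` being rewritten to 
`coalesce(a, b)` during analysis — amounts to returning the first non-null 
argument; a minimal sketch:

```rust
// Minimal sketch of the `nvl(a, b)` == `coalesce(a, b)` equivalence:
// both return the first non-null argument.
fn coalesce2<T>(a: Option<T>, b: Option<T>) -> Option<T> {
    a.or(b) // first non-null wins
}

fn nvl<T>(a: Option<T>, b: Option<T>) -> Option<T> {
    coalesce2(a, b)
}

fn main() {
    assert_eq!(nvl(Some(1), Some(2)), Some(1));
    assert_eq!(nvl(None, Some(2)), Some(2));
    assert_eq!(nvl::<i32>(None, None), None);
}
```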



Re: [PR] Minor: Move group accumulator for aggregate function to physical-expr-common, and add ahash physical-expr-common [datafusion]

2024-05-21 Thread via GitHub


jayzhan211 merged PR #10574:
URL: https://github.com/apache/datafusion/pull/10574





Re: [PR] Minor: Move group accumulator for aggregate function to physical-expr-common, and add ahash physical-expr-common [datafusion]

2024-05-21 Thread via GitHub


jayzhan211 commented on PR #10574:
URL: https://github.com/apache/datafusion/pull/10574#issuecomment-2123785496

   Thanks @alamb 





[PR] Minor: Move median test [datafusion]

2024-05-21 Thread via GitHub


jayzhan211 opened a new pull request, #10611:
URL: https://github.com/apache/datafusion/pull/10611

   ## Which issue does this PR close?
   
   
   
   Part of #10384 .
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] Support Substrait's VirtualTables [datafusion]

2024-05-21 Thread via GitHub


jonahgao commented on code in PR #10531:
URL: https://github.com/apache/datafusion/pull/10531#discussion_r1609182533


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -505,7 +507,35 @@ pub async fn from_substrait_rel(
 _ => Ok(t),
 }
 }
-_ => not_impl_err!("Only NamedTable reads are supported"),
+Some(ReadType::VirtualTable(vt)) => {
+let base_schema = read
+.base_schema
+.clone()
+.expect("expected schema to exist for virtual table");
+
+let fields = from_substrait_named_struct(&base_schema)?;
+let schema = 
DFSchemaRef::new(DFSchema::try_from(Schema::new(fields))?);

Review Comment:
   I once wanted to remove the schema, but we need to use it to pass the 
struct's names, so the current approach should be fine.






Re: [PR] Support Substrait's VirtualTables [datafusion]

2024-05-21 Thread via GitHub


jonahgao commented on code in PR #10531:
URL: https://github.com/apache/datafusion/pull/10531#discussion_r1609180143


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -1277,7 +1404,84 @@ pub(crate) fn from_substrait_literal(lit: &Literal) -> 
Result {
 s,
 )
 }
-Some(LiteralType::Null(ntype)) => from_substrait_null(ntype)?,
+Some(LiteralType::List(l)) => {
+let element_dt: Option<&DataType> = match dt {
+Some(DataType::List(l)) => Ok(Some(l.data_type())),

Review Comment:
   Thank you for your explanation @Blizzara . My concern is that using two 
types for List simultaneously is a bit verbose unless it is necessary.
   
   I tested the following two cases: the first one panicked, and the second one 
also failed.
   ```rust
   
   #[tokio::test]
   async fn roundtrip_list_iteral() -> Result<()> {
   roundtrip("SELECT [] from data").await
   }
   
   #[tokio::test]
   async fn roundtrip_struct_iteral() -> Result<()> {
   roundtrip("select struct(1 as name0, 3.14 as name1, 'e', true as name3) 
from data")
   .await
   }
   ``` 
   ```sh
    cases::roundtrip_logical_plan::roundtrip_list_iteral stdout 
   thread 'cases::roundtrip_logical_plan::roundtrip_list_iteral' panicked at 
consumer.rs:1425:41:
   index out of bounds: the len is 0 but the index is 0
   ```
   I think lists and structs are more complex than the virtual table, so it's 
better for them to have separate PRs.






Re: [I] PR build for Linux Java 11 with Spark 3.4 is not running [datafusion-comet]

2024-05-21 Thread via GitHub


advancedxy commented on issue #389:
URL: 
https://github.com/apache/datafusion-comet/issues/389#issuecomment-2123752077

   There's a quota limit per repo for GitHub runners (it might not be a problem for 
an Apache project, but it is for forks), and we thought it would be sufficient 
to cover Java 8 (the oldest) and Java 17 (close to the latest) with Spark 3.4 on 
Linux runners for pull requests. That way, fewer runners are required to run PR 
actions. The power consumption is reduced as well 😁.
   
   Is there any specific need to run Java 11, such as it being the Java version 
used in your production? If so, I think we can replace Java 17/8 with Java 11.





Re: [PR] test: show stats in explain of two representative queries [datafusion]

2024-05-21 Thread via GitHub


github-actions[bot] commented on PR #8173:
URL: https://github.com/apache/datafusion/pull/8173#issuecomment-2123722891

   Thank you for your contribution. Unfortunately, this pull request is stale 
because it has been open 60 days with no activity. Please remove the stale 
label or comment or this will be closed in 7 days.





Re: [PR] Draft: Optimize to_timestamp [datafusion]

2024-05-21 Thread via GitHub


github-actions[bot] commented on PR #9694:
URL: https://github.com/apache/datafusion/pull/9694#issuecomment-2123722853

   Thank you for your contribution. Unfortunately, this pull request is stale 
because it has been open 60 days with no activity. Please remove the stale 
label or comment or this will be closed in 7 days.





[PR] adding benchmark for extracting arrow statistics from parquet [datafusion]

2024-05-21 Thread via GitHub


Lordworms opened a new pull request, #10610:
URL: https://github.com/apache/datafusion/pull/10610

   ## Which issue does this PR close?
   cargo bench --bench parquet_statistic   
   
   
   Closes #10606 
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [I] Make it easier to create WindowFunctions with the Expr API [datafusion]

2024-05-21 Thread via GitHub


shanretoo commented on issue #6747:
URL: https://github.com/apache/datafusion/issues/6747#issuecomment-2123697634

   You can check it in the unit test: 
[`test_fn_lead`](https://github.com/shanretoo/datafusion/blob/65a8895d948152845dda1934b404399919ebe23c/datafusion/core/tests/dataframe/dataframe_functions.rs#L1207).





Re: [PR] Start setting up new StreamTable config [datafusion]

2024-05-21 Thread via GitHub


matthewmturner commented on PR #10600:
URL: https://github.com/apache/datafusion/pull/10600#issuecomment-2123694427

   I'm hoping to get to an API similar to `ListingTable`.
   
   `ListingTable` => `ListingTableConfig` => `FileFormat`
   
   Where `ListingTable` and `ListingTableConfig` are provided by datafusion and 
if you want to extend then you can just implement a `FileFormat` and benefit 
from the `ListingTable` and `ListingTableConfig` machinery.
   
   In the case of streams I had in mind:
   
   `StreamTable` => `StreamTableConfig` => `StreamProvider` (maybe I'll rename it 
to `StreamFormat` for consistency?)
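
   The layering described above can be sketched as follows; the type names come 
from the comment, but every field and method is an assumption for illustration, 
not DataFusion's actual API:

```rust
// Hypothetical sketch of the `StreamTable` => `StreamTableConfig` =>
// `StreamProvider` layering; only the three type names come from the comment.
trait StreamProvider {
    fn format_name(&self) -> &'static str;
}

// A user-supplied format plugs in here, mirroring how a custom `FileFormat`
// benefits from the `ListingTable`/`ListingTableConfig` machinery.
struct CsvStream;
impl StreamProvider for CsvStream {
    fn format_name(&self) -> &'static str {
        "csv"
    }
}

struct StreamTableConfig {
    provider: Box<dyn StreamProvider>,
}

struct StreamTable {
    config: StreamTableConfig,
}

impl StreamTable {
    fn describe(&self) -> String {
        format!("stream table ({})", self.config.provider.format_name())
    }
}

fn main() {
    let config = StreamTableConfig { provider: Box::new(CsvStream) };
    let table = StreamTable { config };
    assert_eq!(table.describe(), "stream table (csv)");
}
```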





Re: [PR] Introduce expr builder for aggregate function [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10560:
URL: https://github.com/apache/datafusion/pull/10560#issuecomment-2123673946

   Thanks @jayzhan211 -- I plan to review this tomorrow.





Re: [PR] feat: correlation support [datafusion-comet]

2024-05-21 Thread via GitHub


huaxingao commented on PR #456:
URL: https://github.com/apache/datafusion-comet/pull/456#issuecomment-2123660352

   cc @andygrove @viirya 





Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]

2024-05-21 Thread via GitHub


jayzhan211 commented on PR #10268:
URL: https://github.com/apache/datafusion/pull/10268#issuecomment-2123600422

   I believe the coercion rules are quite messy as they currently stand. It would 
be more understandable and maintainable to move the coercion rules from 
`coerced_from` to each individual function. That way, it would be clear which 
coercion rule applies to each function or signature. Having a single function 
handle multiple coercion rules makes the code harder to reason about and could 
cause downstream issues, as you have pointed out. I've also filed #10507 to 
track this.
   
   Would it be better to start by removing the rules from `coerced_from` and 
moving them to the specific function or signature?





Re: [I] Implement a benchmark for extracting arrow statistics from parquet [datafusion]

2024-05-21 Thread via GitHub


Lordworms commented on issue #10606:
URL: https://github.com/apache/datafusion/issues/10606#issuecomment-2123596547

   I'll take this one.





Re: [I] PR build for Linux Java 11 with Spark 3.4 is not running [datafusion-comet]

2024-05-21 Thread via GitHub


kazuyukitanimura commented on issue #389:
URL: 
https://github.com/apache/datafusion-comet/issues/389#issuecomment-2123578608

   Thanks @advancedxy. Do you know, by any chance, the original reason why this 
specific combination was excluded?





Re: [PR] feat: RewriteCycle API for short-circuiting optimizer loops [datafusion]

2024-05-21 Thread via GitHub


erratic-pattern commented on PR #10386:
URL: https://github.com/apache/datafusion/pull/10386#issuecomment-2123560931

   @alamb  I will try to update this today or tomorrow. I've been putting this 
off a bit.





Re: [PR] refactor: reduce allocations in push down filter [datafusion]

2024-05-21 Thread via GitHub


erratic-pattern commented on PR #10567:
URL: https://github.com/apache/datafusion/pull/10567#issuecomment-2123559213

   Yes, the `Arc::clone` is not a performance improvement; it was just a way 
for me to keep track of which clones were "zero copy" while working on this. It 
is also the recommended convention, so I think it's worth keeping.
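
   For readers unfamiliar with the convention, the distinction can be 
illustrated with plain `std::sync::Arc` (a standalone sketch, not DataFusion 
code; `plan` is just a placeholder value):

   ```rust
   use std::sync::Arc;

   fn main() {
       let plan = Arc::new(String::from("TableScan: t1"));

       // Arc::clone only bumps the reference count; the underlying String is
       // not copied. Writing Arc::clone(&plan) instead of plan.clone() makes
       // the cheap "zero copy" clone visually distinct from a deep clone.
       let shared = Arc::clone(&plan);

       assert_eq!(Arc::strong_count(&plan), 2);
       assert!(Arc::ptr_eq(&plan, &shared)); // same allocation, no data copied
   }
   ```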





Re: [PR] Pass BigQuery options to the ArrowSchema [datafusion]

2024-05-21 Thread via GitHub


ozankabak commented on PR #10590:
URL: https://github.com/apache/datafusion/pull/10590#issuecomment-2123531925

   I agree with @alamb. Maybe we can do a quick survey of how different systems 
take per-column schema metadata and try to elucidate the best DataFusion syntax 
from that.





Re: [I] Excessive memory consumption on sorting [datafusion]

2024-05-21 Thread via GitHub


samuelcolvin commented on issue #10511:
URL: https://github.com/apache/datafusion/issues/10511#issuecomment-2123523912

   Sorry for the delay, here we go:
   
   ### `logical_plan`
   
   ```
Projection: records_store.span_name
 Limit: skip=0, fetch=20
   Sort: bit_length(records_store.attributes) DESC NULLS FIRST, fetch=20
 TableScan: records_store projection=[span_name, attributes]
   ```
   
   ### `physical_plan`
   
   ```
ProjectionExec: expr=[span_name@0 as span_name]
 GlobalLimitExec: skip=0, fetch=20
   SortPreservingMergeExec: [bit_length(attributes@1) DESC], fetch=20
 SortExec: TopK(fetch=20), expr=[bit_length(attributes@1) DESC], 
preserve_partitioning=[true]
   ParquetExec: file_groups={14 groups: 
[[Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_400_0001_2139d404-1e6c-4903-90f1-775727315226.parquet:0..4619991,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_400_0001_41744ee3-e741-43cf-ab17-6b986d785a58.parquet:0..4631041,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_401_0001_91993b2d-9555-4d85-8dd9-71ad352a7066.parquet:0..16945773,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_401_0001_f611b930-13d2-4311-acf2-489585ca7e2a.parquet:0..16920185,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_402_0001_3f208d74-4b67-44ec-aa4a-1ed3
 0d881b7f.parquet:0..18130069, ...], 
[Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_403_0001_780473ad-7867-4bcf-b63d-a52460f466cf.parquet:16513148..17675405,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_404_0001_d11bf279-599e-4e2c-9a60-14d13045c8dd.parquet:0..16597007,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_405_0001_456e9ee4-35d7-4483-813e-73c29d8023d2.parquet:0..15207780,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_406_0001_5becfa29-66d8-44ff-b61c-b82987f3d060.parquet:0..16885514,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_407_0001_1a81e83c-a226-4f29-b00b-54f849e
 b46cd.parquet:0..15583994, ...], 
[Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_408_0001_ec8d95f7-1464-4adf-a975-6857937592e1.parquet:12323655..15398417,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_409_0001_f27ecf09-e64b-4fc3-9cb9-cdd53b2200a6.parquet:0..15601539,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_410_0001_bd93a8e8-dd6f-48ae-8622-fd439f9d27f9.parquet:0..15515518,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_411_0001_5ec01d68-8a3a-4d21-b0df-3d507bea2d1c.parquet:0..15347993,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_412_0001_3611c025-d788-4d26-b77d-424e421fcd
 98.parquet:0..16151170, ...], 
[Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_413_0001_17decbd1-13d0-4a01-9213-8fee2ba46b7b.parquet:12069225..16928575,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_414_0001_d7924295-f735-4b34-af7f-3975ad3f91b6.parquet:0..16952022,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_415_0001_7a7e0d3e-de3e-462b-9fb5-264277a2c74b.parquet:0..16826232,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_416_0001_da93e867-a912-454a-a758-c9e27dcf74df.parquet:0..17230023,
 
Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_417_0001_577d3fa8-ea7d-471f-a33a-0d65faa54902.
 parquet:0..17533201, ...], 
[Users/samuel/code/pydantic-data-platform/src/services/fusionfire/object_store/{org}/records/project_id={col}/day=2024-05-13/data_418_0001_b5f013f2-d3ed-4f98-bb9c-cb58b059b5fc.parquet:4359379..17560075,
 
Users/samu

Re: [PR] Improve `UserDefinedLogicalNodeCore::from_template` API to return Result [datafusion]

2024-05-21 Thread via GitHub


alamb merged PR #10597:
URL: https://github.com/apache/datafusion/pull/10597





Re: [PR] build: Add spark-4.0 profile and shims [datafusion-comet]

2024-05-21 Thread via GitHub


kazuyukitanimura commented on code in PR #407:
URL: https://github.com/apache/datafusion-comet/pull/407#discussion_r1608978203


##
spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleExchangeExec.scala:
##
@@ -454,12 +458,13 @@ class CometShuffleWriteProcessor(
 new 
SQLShuffleWriteMetricsReporter(context.taskMetrics().shuffleWriteMetrics, 
metrics)
   }
 
+  // FIXME: we need actually prev of rdd?
   override def write(
-  rdd: RDD[_],
+  inputs: Iterator[_],

Review Comment:
   I implemented the partitionId removal.
   Thanks @viirya 






Re: [PR] Start setting up new StreamTable config [datafusion]

2024-05-21 Thread via GitHub


matthewmturner commented on PR #10600:
URL: https://github.com/apache/datafusion/pull/10600#issuecomment-2123477465

   @alamb I would be happy to add an example - for this PR it would likely just 
be copying from 
https://github.com/apache/datafusion/blob/main/datafusion/core/tests/fifo.rs.  
I will get to that shortly.
   
   Ultimately, the motivation here is to start working towards a more generic 
interface where `StreamTable` can be used for multiple stream types as opposed 
to just files (such as websockets, Kafka, etc. - or maybe that's a non-goal, in 
which case I can just close this). I've seen that it can require lots of custom 
code (formats and providers) to implement streaming tables, and I'm hoping to 
simplify that.





Re: [PR] Test for reading read statistics from parquet files without statistics and boolean & struct data type [datafusion]

2024-05-21 Thread via GitHub


NGA-TRAN commented on code in PR #10608:
URL: https://github.com/apache/datafusion/pull/10608#discussion_r1608970404


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -73,6 +73,22 @@ pub fn parquet_file_one_column(
 no_null_values_start: i64,
 no_null_values_end: i64,
 row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder<File> {
+parquet_file_one_column_stats(
+num_null,
+no_null_values_start,
+no_null_values_end,
+row_per_group,
+EnabledStatistics::Chunk,
+)
+}
+
+pub fn parquet_file_one_column_stats(

Review Comment:
   If we want to do this, we need to think about `parquet_file_many_columns`, 
too. I have to sign off now; how about we do this in a follow-up PR?






Re: [I] [EPIC] Efficiently and correctly extract parquet statistics into ArrayRefs [datafusion]

2024-05-21 Thread via GitHub


NGA-TRAN commented on issue #10453:
URL: https://github.com/apache/datafusion/issues/10453#issuecomment-2123465859

   @alamb Another bug: https://github.com/apache/datafusion/issues/10609





[I] Incorrect statistics read for struct array in parquet [datafusion]

2024-05-21 Thread via GitHub


NGA-TRAN opened a new issue, #10609:
URL: https://github.com/apache/datafusion/issues/10609

   ### Describe the bug
   
   I found this while adding tests in 
https://github.com/apache/datafusion/pull/10608. The statistics for a struct 
array return nothing.
   
   ### To Reproduce
   
   See `test_struct` in https://github.com/apache/datafusion/pull/10608
   
   
   ### Expected behavior
   
   Return some values for the statistics
   
   ### Additional context
   
   _No response_





Re: [PR] feat: Add eliminate group by constant optimizer rule [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10591:
URL: https://github.com/apache/datafusion/pull/10591#discussion_r1608953945


##
datafusion/optimizer/src/eliminate_group_by_constant.rs:
##
@@ -0,0 +1,318 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [`EliminateGroupByConstant`] removes constant expressions from `GROUP BY` 
clause
+use crate::optimizer::ApplyOrder;
+use crate::{OptimizerConfig, OptimizerRule};
+
+use datafusion_common::tree_node::Transformed;
+use datafusion_common::{internal_err, Result};
+use datafusion_expr::{Aggregate, Expr, LogicalPlan, LogicalPlanBuilder, 
Volatility};
+
+/// Optimizer rule that removes constant expressions from `GROUP BY` clause
+/// and places additional projection on top of aggregation, to preserve
+/// original schema
+#[derive(Default)]
+pub struct EliminateGroupByConstant {}
+
+impl EliminateGroupByConstant {
+pub fn new() -> Self {
+Self {}
+}
+}
+
+impl OptimizerRule for EliminateGroupByConstant {
+fn supports_rewrite(&self) -> bool {
+true
+}
+
+fn rewrite(
+&self,
+plan: LogicalPlan,
+_config: &dyn OptimizerConfig,
+) -> Result<Transformed<LogicalPlan>> {
+match plan {
+LogicalPlan::Aggregate(aggregate) => {
+let (const_group_expr, nonconst_group_expr): (Vec<_>, Vec<_>) 
= aggregate
+.group_expr
+.iter()
+.partition(|expr| is_constant_expression(expr));
+
+// If no constant expressions found (nothing to optimize) or
+// constant expression is the only expression in aggregate,
+// optimization is skipped
+if const_group_expr.is_empty()
+|| (!const_group_expr.is_empty()
+&& nonconst_group_expr.is_empty()
+&& aggregate.aggr_expr.is_empty())
+{
+return 
Ok(Transformed::no(LogicalPlan::Aggregate(aggregate)));
+}
+
+let simplified_aggregate = 
LogicalPlan::Aggregate(Aggregate::try_new(
+aggregate.input,
+nonconst_group_expr.into_iter().cloned().collect(),
+aggregate.aggr_expr.clone(),
+)?);
+
+let projection_expr =
+
aggregate.group_expr.into_iter().chain(aggregate.aggr_expr);
+
+let projection = LogicalPlanBuilder::from(simplified_aggregate)
+.project(projection_expr)?
+.build()?;
+
+Ok(Transformed::yes(projection))
+}
+_ => Ok(Transformed::no(plan)),
+}
+}
+
+fn try_optimize(
+&self,
+_plan: &LogicalPlan,
+_config: &dyn OptimizerConfig,
+) -> Result<Option<LogicalPlan>> {
+internal_err!("Should have called EliminateGroupByConstant::rewrite")
+}
+
+fn name(&self) -> &str {
+"eliminate_group_by_constant"
+}
+
+fn apply_order(&self) -> Option<ApplyOrder> {
+Some(ApplyOrder::BottomUp)
+}
+}
+
+/// Checks if expression is constant, and can be eliminated from group by.
+///
+/// Intended to be used only within this rule; a helper function that relies
+/// heavily on the result of `SimplifyExpressions`.
+fn is_constant_expression(expr: &Expr) -> bool {

Review Comment:
   I think you could use `TreeNode::Exists` here: 
https://github.com/apache/datafusion/blob/045e8fcaeac2f2c29afe5017e5efe5a3a2560080/datafusion/common/src/tree_node.rs#L408
   
   Which handles all types of expressions and would be general
   
   Something like this, perhaps (untested)
   
   ```rust
   fn is_constant_expression(expr: &Expr) -> bool {
       !expr.exists(|expr| {
           match expr {
               Expr::Column(_) => true,
               Expr::ScalarFunction(e) if matches!(
                   e.func.signature().volatility,
                   Volatility::Volatile
               ) => true,
               _ => false,
           }
       })
   }
   ```
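
   For illustration, the short-circuiting `exists` traversal can be sketched on 
a toy expression tree (this is not DataFusion's actual `Expr`/`TreeNode` API; 
the types below are made up to show the pattern):

   ```rust
   // Toy expression tree illustrating the short-circuiting `exists` traversal.
   #[derive(Debug)]
   enum Expr {
       Column(String),
       Literal(i64),
       BinaryOp(Box<Expr>, Box<Expr>),
   }

   impl Expr {
       // Returns true if `f` matches this node or any descendant,
       // stopping at the first match.
       fn exists(&self, f: &mut impl FnMut(&Expr) -> bool) -> bool {
           if f(self) {
               return true;
           }
           match self {
               Expr::BinaryOp(l, r) => l.exists(f) || r.exists(f),
               _ => false,
           }
       }
   }

   // An expression is "constant" if no column reference appears anywhere in it.
   fn is_constant_expression(expr: &Expr) -> bool {
       !expr.exists(&mut |e| matches!(e, Expr::Column(_)))
   }

   fn main() {
       let constant = Expr::BinaryOp(
           Box::new(Expr::Literal(1)),
           Box::new(Expr::Literal(2)),
       );
       let with_column = Expr::BinaryOp(
           Box::new(Expr::Column("a".into())),
           Box::new(Expr::Literal(2)),
       );
       assert!(is_constant_expression(&constant));
       assert!(!is_constant_expression(&with_column));
   }
   ```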



##
datafusion/sqllogictest/test_files/subquery.slt:
##
@@ -768,8 +766,8 @@ logical_plan
 02)--Left Join: t1.t1_int = __scalar_sq_1.t2_int
 03)TableScan: t1 projection=[t1_id, t1_int]
 04)SubqueryAlias: __scala

Re: [PR] Test for reading read statistics from parquet files without statistics [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10608:
URL: https://github.com/apache/datafusion/pull/10608#issuecomment-2123435559

   This PR also appears to have a conflict now 





Re: [PR] Improve `UserDefinedLogicalNodeCore::from_template` API to return Result [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10597:
URL: https://github.com/apache/datafusion/pull/10597#issuecomment-2123433357

   I took the liberty of merging this branch up from main to resolve a 
conflict. 





Re: [PR] Improve `UserDefinedLogicalNodeCore::from_template` API to return Result [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10597:
URL: https://github.com/apache/datafusion/pull/10597#discussion_r1608941760


##
datafusion/expr/src/logical_plan/extension.rs:
##
@@ -248,23 +248,27 @@ pub trait UserDefinedLogicalNodeCore:
 /// For example: `TopK: k=10`
 fn fmt_for_explain(&self, f: &mut fmt::Formatter) -> fmt::Result;
 
-/// Create a new `ExtensionPlanNode` with the specified children
+#[deprecated(since = "39.0.0", note = "use with_exprs_and_inputs instead")]

Review Comment:
   ❤️ 






Re: [PR] feat: extend `unnest` to support Struct datatype [datafusion]

2024-05-21 Thread via GitHub


duongcongtoai commented on code in PR #10429:
URL: https://github.com/apache/datafusion/pull/10429#discussion_r1608935540


##
datafusion/expr/src/expr_schema.rs:
##
@@ -123,7 +123,8 @@ impl ExprSchemable for Expr {
 Ok(field.data_type().clone())
 }
 DataType::Struct(_) => {
-not_impl_err!("unnest() does not support struct yet")
+// TODO: this is not correct, because unnest(struct) will result in multiple data types

Review Comment:
   If we return an error here, it will stop query planning from continuing.
   Initially, unnest enters the projection step as an Expr here: 
https://github.com/apache/datafusion/blob/d58bae487329b7a7078429f083bffc611f42c8c7/datafusion/expr/src/utils.rs#L734,
 where the data type of each expr is checked; later on we call 
[try_process_unnest](https://github.com/apache/datafusion/blob/ae4b3a0f8366ab18be5ef5cfa2b3cc3aca12baf1/datafusion/sql/src/select.rs#L226)
 to do the transformation from unnest(struct) -> struct.field1, struct.field2, ...
   
   I think unnest is the only expr so far that represents a set of fields 
instead of a singular value here, so I'm thinking of adding an extra check in 
this function: 
https://github.com/apache/datafusion/blob/d58bae487329b7a7078429f083bffc611f42c8c7/datafusion/expr/src/utils.rs#L734






Re: [PR] Implement a dialect-specific rule for unparsing an identifier with or without quotes [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10573:
URL: https://github.com/apache/datafusion/pull/10573#discussion_r1608928216


##
datafusion-examples/examples/plan_to_sql.rs:
##
@@ -52,7 +52,7 @@ fn simple_expr_to_sql_demo() -> Result<()> {
 let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
 let ast = expr_to_sql(&expr)?;
 let sql = format!("{}", ast);
-assert_eq!(sql, r#"(("a" < 5) OR ("a" = 8))"#);
+assert_eq!(sql, r#"((a < 5) OR (a = 8))"#);

Review Comment:
   ❤️ 



##
datafusion/core/Cargo.toml:
##
@@ -145,7 +145,7 @@ postgres-protocol = "0.6.4"
 postgres-types = { version = "0.2.4", features = ["derive", "with-chrono-0_4"] 
}
 rand = { workspace = true, features = ["small_rng"] }
 rand_distr = "0.4.3"
-regex = "1.5.4"
+regex = { workspace = true }

Review Comment:
   It is certainly nice to use the same version of regex everywhere 👍 



##
datafusion/sql/src/unparser/dialect.rs:
##
@@ -15,19 +15,30 @@
 // specific language governing permissions and limitations
 // under the License.
 
+use regex::Regex;
+use sqlparser::keywords::ALL_KEYWORDS;
+
 /// Dialect is used to capture dialect specific syntax.

Review Comment:
   ```suggestion
   /// `Dialect` to use for unparsing
   ///
   /// The default dialect tries to avoid quoting identifiers unless necessary 
(e.g. `a` instead of `"a"`)
   /// but this behavior can be overridden as needed
   ```



##
datafusion-examples/examples/plan_to_sql.rs:
##
@@ -52,7 +52,7 @@ fn simple_expr_to_sql_demo() -> Result<()> {
 let expr = col("a").lt(lit(5)).or(col("a").eq(lit(8)));
 let ast = expr_to_sql(&expr)?;
 let sql = format!("{}", ast);
-assert_eq!(sql, r#"(("a" < 5) OR ("a" = 8))"#);
+assert_eq!(sql, r#"((a < 5) OR (a = 8))"#);

Review Comment:
   Given this change, perhaps we can remove the next example in the file 
`simple_expr_to_sql_demo_no_escape` as I don't think it serves any purpose






Re: [PR] feat: RewriteCycle API for short-circuiting optimizer loops [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10386:
URL: https://github.com/apache/datafusion/pull/10386#issuecomment-2123403077

   Is this PR ready for the next round of review @erratic-pattern ? Or do you 
plan to make further changes to it?





Re: [PR] feat: extend unnest to support Struct datatype [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10429:
URL: https://github.com/apache/datafusion/pull/10429#discussion_r1608917787


##
datafusion/sqllogictest/test_files/unnest.slt:
##
@@ -288,6 +308,18 @@ select unnest(array_remove(column1, 12)) from unnest_table;
 5
 6
 
+## unnest struct-typed column and list-typed column at the same time
+query I?II?
+select unnest(column1), column1, unnest(column5), column5 from unnest_table;
+
+1 [1, 2, 3] 1 2 {c0: 1, c1: 2}
+2 [1, 2, 3] 1 2 {c0: 1, c1: 2}
+3 [1, 2, 3] 1 2 {c0: 1, c1: 2}
+4 [4, 5] 3 4 {c0: 3, c1: 4}
+5 [4, 5] 3 4 {c0: 3, c1: 4}
+6 [6] NULL NULL NULL
+12 [12] 7 8 {c0: 7, c1: 8}
+

Review Comment:
   Can we please add some other tests:
   
   1. Test the output type (e.g. `select arrow_typeof(unnest(column1))`)
   2. Test nested structs (I tested it works well, but I think we should have 
some coverage)
   
   
   Nested structs
   ```sql
   > create or replace table t as values (struct('a', 'b', struct('c'))), 
(struct('d', 'e', struct('f')));
   0 row(s) fetched.
   Elapsed 0.031 seconds.
   
   > select * from t;
   +-----------------------------+
   | column1                     |
   +-----------------------------+
   | {c0: a, c1: b, c2: {c0: c}} |
   | {c0: d, c1: e, c2: {c0: f}} |
   +-----------------------------+
   2 row(s) fetched.
   Elapsed 0.006 seconds.
   
   > select unnest(column1) from t;
   +----------------------+----------------------+----------------------+
   | unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
   +----------------------+----------------------+----------------------+
   | a                    | b                    | {c0: c}              |
   | d                    | e                    | {c0: f}              |
   +----------------------+----------------------+----------------------+
   2 row(s) fetched.
   Elapsed 0.013 seconds.
   ```
   
   And 
   
   ```sql
   > create or replace table t as values (struct('a', 'b', [1,2,3])), 
(struct('x', 'y', [10,20]));
   0 row(s) fetched.
   Elapsed 0.010 seconds.
   
   > select unnest(column1) from t;
   +----------------------+----------------------+----------------------+
   | unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
   +----------------------+----------------------+----------------------+
   | a                    | b                    | [1, 2, 3]            |
   | x                    | y                    | [10, 20]             |
   +----------------------+----------------------+----------------------+
   2 row(s) fetched.
   Elapsed 0.008 seconds.
   ```



##
datafusion/expr/src/expr_schema.rs:
##
@@ -123,7 +123,8 @@ impl ExprSchemable for Expr {
 Ok(field.data_type().clone())
 }
 DataType::Struct(_) => {
-not_impl_err!("unnest() does not support struct yet")
+// TODO: this is not correct, because unnest(struct) will result in multiple data types

Review Comment:
   When do we end up with an `unnest` **Expr** (vs. a `LogicalPlan::Unnest`)?
   
   Could we simply return a `not_yet_implemented` error here?



##
datafusion/sqllogictest/test_files/unnest.slt:
##
@@ -288,6 +308,18 @@ select unnest(array_remove(column1, 12)) from unnest_table;
 5
 6
 
+## unnest struct-typed column and list-typed column at the same time
+query I?II?
+select unnest(column1), column1, unnest(column5), column5 from unnest_table;
+
+1 [1, 2, 3] 1 2 {c0: 1, c1: 2}
+2 [1, 2, 3] 1 2 {c0: 1, c1: 2}
+3 [1, 2, 3] 1 2 {c0: 1, c1: 2}
+4 [4, 5] 3 4 {c0: 3, c1: 4}
+5 [4, 5] 3 4 {c0: 3, c1: 4}
+6 [6] NULL NULL NULL
+12 [12] 7 8 {c0: 7, c1: 8}
+

Review Comment:
   Interestingly the output type doesn't seem to work (which probably makes 
sense as unnest needs to be converted to a LogicalPlan). I don't think we have 
to fix this in the PR (but adding coverage would be good)
   
   ```sql
   > select unnest(column1) from t;
   +----------------------+----------------------+----------------------+
   | unnest(t.column1).c0 | unnest(t.column1).c1 | unnest(t.column1).c2 |
   +----------------------+----------------------+----------------------+
   | a                    | b                    | [1, 2, 3]            |
   | x                    | y                    | [10, 20]             |
   +----------------------+----------------------+----------------------+
   2 row(s) fetched.
   
   > select arrow_typeof(unnest(column1)) from t;
   Internal error: unnest on struct can ony be applied at the root level of 
select expression.
   This was likely caused by a bug in DataFusion's code and we would welcome 
that you file an bug report in our issue tracker
   ```



##
datafusion/expr/src/logical_plan/builder.rs:
##
@@ -1592,7 +1592,47 @@ impl TableSource for LogicalTableSource {
 
 /// Create a [`LogicalPlan::Unnest`] plan
 pub fn unnest(input: LogicalPlan, columns: Vec<Column>) -> Result<LogicalPlan> {
-

Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10268:
URL: https://github.com/apache/datafusion/pull/10268#issuecomment-2123378506

   Basically, I worry this is just changing behavior rather than fixing a bug, 
and will result in churn for no benefit downstream. I may be misunderstanding 
the change and rationale, however.





Re: [PR] Fix `Coalesce` casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10268:
URL: https://github.com/apache/datafusion/pull/10268#issuecomment-2123377440

   In general I am concerned about the potential downstream effects of this 
change. I don't fully understand them 
   
   What I would ideally like to do is to run the influxdb_iox regression tests 
with this change and see what happens. 
   
   I am not sure I have the time to do that in the next week -- maybe 
@appletreeisyellow  does 🤔  





Re: [PR] feat: support `grouping` aggregate function [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10208:
URL: https://github.com/apache/datafusion/pull/10208#discussion_r1608905520


##
datafusion/physical-expr/src/aggregate/grouping.rs:
##
@@ -96,8 +113,172 @@ impl PartialEq for Grouping {
 self.name == x.name
 && self.data_type == x.data_type
 && self.nullable == x.nullable
-&& self.expr.eq(&x.expr)
+&& self.exprs.len() == x.exprs.len()
+&& self
+.exprs
+.iter()
+.zip(x.exprs.iter())
+.all(|(expr1, expr2)| expr1.eq(expr2))
 })
 .unwrap_or(false)
 }
 }
+
+#[derive(Debug)]
+struct GroupingGroupsAccumulator {
+/// Grouping columns' indices in grouping set
+indices: Vec<usize>,
+
+/// Mask per group.
+///
+/// Note this is an i32 and not a u32 (or usize) because the
+/// output type of grouping is `DataType::Int32`. Thus by using `i32`
+/// for the grouping, the output [`Int32Array`] can be created
+/// without copy.
+masks: Vec<i32>,
+}
+
+impl GroupingGroupsAccumulator {
+pub fn new(
+grouping_exprs: &[Arc<dyn PhysicalExpr>],
+group_by_exprs: &[(Arc<dyn PhysicalExpr>, String)],
+) -> Result<Self> {
+macro_rules! downcast_column {
+($EXPR:expr) => {{
+if let Some(column) = $EXPR.as_any().downcast_ref::<Column>() {
+column
+} else {
+return Err(DataFusionError::Execution(
+"Grouping only supports grouping set which only 
contains Column Expr".to_string(),
+));
+}
+}}
+}
+
+// collect column indices of group_by_exprs, only Column Expr
+let mut group_by_column_indices = 
Vec::with_capacity(group_by_exprs.len());
+for (group_by_expr, _) in group_by_exprs.iter() {
+let column = downcast_column!(group_by_expr);
+group_by_column_indices.push(column.index());
+}
+
+// collect grouping_exprs' indices in group_by_exprs list, eg:
+// SQL: SELECT c1, c2, grouping(c2, c1) FROM t GROUP BY ROLLUP(c1, c2);
+// group_by_exprs: [c1, c2]
+// grouping_exprs: [c2, c1]
+// indices: [1, 0]
+let mut indices = Vec::with_capacity(grouping_exprs.len());
+for grouping_expr in grouping_exprs {
+let column = downcast_column!(grouping_expr);
+indices.push(find_grouping_column_index(
+&group_by_column_indices,
+column.index(),
+)?);
+}
+
+Ok(Self {
+indices,
+masks: vec![],
+})
+}
+}
+
+fn find_grouping_column_index(
+group_by_column_indices: &[usize],
+grouping_column_index: usize,
+) -> Result<usize> {
+for (i, group_by_column_index) in 
group_by_column_indices.iter().enumerate() {
+if grouping_column_index == *group_by_column_index {
+return Ok(i);
+}
+}
+Err(DataFusionError::Execution(
+"Not found grouping column in group by columns".to_string(),
+))
+}
+
+fn compute_mask(indices: &[usize], grouping_set: &[bool]) -> i32 {
+let mut mask = 0;
+for (i, index) in indices.iter().rev().enumerate() {
+if grouping_set[*index] {
+mask |= 1 << i;
+}
+}
+mask
+}
+
+impl GroupsAccumulator for GroupingGroupsAccumulator {

Review Comment:
   I agree having some special case simply for the `grouping` aggregate that 
forces changes on all other aggregates isn't ideal
   
   > When calling update_batch, we need to know the information of the current 
grouping set, so we need to add a parameter to update_batch
   
   After reading  https://www.postgresql.org/docs/9.5/functions-aggregate.html 
I see that `grouping` is basically a special case that only makes sense in the 
context of grouping set (it provides some context into the grouping set).
   
   Given it is so special, I wonder if we could special case it somehow 🤔 
   
   One thing maybe we could do is to add another signature?
   
   ```
   trait `GroupsAccumulator`  {
   ...
   /// Called with the information with what grouping set this batch 
belongs to.
   /// The default implementation calls `Self::update_batch` and ignores the grouping_set
   fn update_grouping_batch(
   &mut self,
   _values: &[ArrayRef],
   group_indices: &[usize],
   opt_filter: Option<&arrow_array::BooleanArray>,
   total_num_groups: usize,
   grouping_set: &[bool],
   ) -> Result<()> {
 self.update_batch(_values, group_indices, opt_filter, total_num_groups)
   }
   ...
   ```
   
   
   And then we could make it clear in the documentation that the aggregator 
calls `update_grouping_batch` but that most implementations can just implement 
`update_batch` 
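For context on the mask semantics discussed here, the `compute_mask` helper quoted in the diff can be exercised standalone. The interpretation of `grouping_set` (true meaning the group-by column is rolled up / aggregated away in the current grouping set) is an assumption read from the surrounding code, not stated in the PR:

```rust
// `compute_mask` reproduced from the diff above. `indices` maps each
// grouping() argument to its position in the GROUP BY list; the result
// is an i32 so it can back the Int32 output array without copying.
fn compute_mask(indices: &[usize], grouping_set: &[bool]) -> i32 {
    let mut mask = 0;
    // Iterating in reverse puts the first grouping() argument in the
    // most significant bit, matching SQL GROUPING() conventions.
    for (i, index) in indices.iter().rev().enumerate() {
        if grouping_set[*index] {
            mask |= 1 << i;
        }
    }
    mask
}

fn main() {
    // grouping(c2, c1) over ROLLUP(c1, c2) => indices = [1, 0].
    assert_eq!(compute_mask(&[1, 0], &[false, false]), 0b00); // nothing rolled up
    assert_eq!(compute_mask(&[1, 0], &[false, true]), 0b10);  // c2 rolled up
    assert_eq!(compute_mask(&[1, 0], &[true, true]), 0b11);   // both rolled up
    println!("ok");
}
```

With `grouping(c2, c1)` over `ROLLUP(c1, c2)`, `indices = [1, 0]`, so the first `grouping()` argument lands in the most significant bit.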
   
   




Re: [I] Support `date_bin` on timestamps with timezone, properly accounting for Daylight Savings Time [datafusion]

2024-05-21 Thread via GitHub


appletreeisyellow commented on issue #10602:
URL: https://github.com/apache/datafusion/issues/10602#issuecomment-2123367235

   I'd like to work on this issue 🙋‍♀️





Re: [PR] feat: correlation support [datafusion-comet]

2024-05-21 Thread via GitHub


huaxingao commented on code in PR #456:
URL: https://github.com/apache/datafusion-comet/pull/456#discussion_r1608882278


##
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala:
##
@@ -1212,6 +1212,157 @@ class CometAggregateSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
+  test("correlation") {
+withSQLConf(CometConf.COMET_EXEC_SHUFFLE_ENABLED.key -> "true") {
+  Seq(false).foreach { cometColumnShuffleEnabled =>
+withSQLConf(
+  CometConf.COMET_COLUMNAR_SHUFFLE_ENABLED.key -> 
cometColumnShuffleEnabled.toString) {
+  Seq(false).foreach { dictionary =>
+withSQLConf("parquet.enable.dictionary" -> dictionary.toString) {
+  Seq(true).foreach { nullOnDivideByZero =>
+withSQLConf(
+  "spark.sql.legacy.statisticalAggregate" -> 
nullOnDivideByZero.toString) {
+  val table = "test"
+  withTable(table) {
+sql(
+  s"create table $table(col1 double, col2 double, col3 
double) using parquet")
+sql(s"insert into $table values(1, 4, 1), (2, 5, 1), (3, 
6, 2)")
+val expectedNumOfCometAggregates = 2
+
+sql("SELECT corr(col1, col2) FROM test GROUP BY col3").show

Review Comment:
   Done.






Re: [PR] feat: correlation support [datafusion-comet]

2024-05-21 Thread via GitHub


huaxingao commented on code in PR #456:
URL: https://github.com/apache/datafusion-comet/pull/456#discussion_r1608882001


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -547,6 +547,26 @@ object QueryPlanSerde extends Logging with 
ShimQueryPlanSerde with CometExprShim
   withInfo(aggExpr, child)
   None
 }
+  case corr @ Corr(child1, child2, nullOnDivideByZero) =>
+val child1Expr = exprToProto(child1, inputs, binding)
+val child2Expr = exprToProto(child2, inputs, binding)
+val dataType = serializeDataType(corr.dataType)
+
+if (child1Expr.isDefined && child2Expr.isDefined && 
dataType.isDefined) {
+  val corrBuilder = ExprOuterClass.Correlation.newBuilder()
+  corrBuilder.setChild1(child1Expr.get)
+  corrBuilder.setChild2(child2Expr.get)
+  corrBuilder.setNullOnDivideByZero(nullOnDivideByZero)
+  corrBuilder.setDatatype(dataType.get)
+
+  Some(
+ExprOuterClass.AggExpr
+  .newBuilder()
+  .setCorrelation(corrBuilder)
+  .build())
+} else {
+  None

Review Comment:
   Yes, I forgot this. Added.






Re: [PR] Minor: Move group accumulator for aggregate function to physical-expr-common, and add ahash physical-expr-common [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10574:
URL: https://github.com/apache/datafusion/pull/10574#discussion_r1608876810


##
datafusion/physical-expr-common/src/aggregate/groups_accumulator/mod.rs:
##
@@ -0,0 +1,20 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+pub mod accumulate;

Review Comment:
   Perhaps we could add a note that this module is for implementing 
`GroupsAccumulator`
   
   ```suggestion
   //! Utilities for implementing [`GroupsAccumulator`]
   
   pub mod accumulate;
   ```



##
Cargo.toml:
##
@@ -59,6 +59,9 @@ version = "38.0.0"
 # for the inherited dependency but cannot do the reverse (override from true 
to false).
 #
 # See for more detaiils: https://github.com/rust-lang/cargo/issues/11329
+ahash = { version = "0.8", default-features = false, features = [

Review Comment:
   👍 



##
datafusion/physical-expr/src/aggregate/groups_accumulator/mod.rs:
##
@@ -15,10 +15,18 @@
 // specific language governing permissions and limitations
 // under the License.
 
-pub(crate) mod accumulate;
 mod adapter;
-pub use accumulate::NullState;
+

Review Comment:
   👍  for backwards compatibility
   
   Maybe we can leave a comment explaining it is for backwards compatibility
   
   






Re: [I] docs.rs build fails for datafusion-proto `37.0.0` [datafusion]

2024-05-21 Thread via GitHub


alamb commented on issue #10163:
URL: https://github.com/apache/datafusion/issues/10163#issuecomment-2123330179

   FWIW it seems to be working on 38.0.0: 
https://docs.rs/datafusion-proto/latest/datafusion_proto/
   
   ![Screenshot 2024-05-21 at 3 47 44 
PM](https://github.com/apache/datafusion/assets/490673/9f10e665-480c-4416-970d-1b4d339c07d3)
   





Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-21 Thread via GitHub


tshauck commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1608872166


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,26 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+Seq(true, false).foreach { dictionaryEnabled =>
+  withTempDir { dir =>
+val path = new Path(dir.toURI.toString, "hex.parquet")
+makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 
1)
+
+withParquetTable(path.toString, "tbl") {
+  // ints
+  checkSparkAnswerAndOperator("SELECT hex(_2), hex(_3), hex(_4), 
hex(_5) FROM tbl")
+
+  // uints, uint8 and uint16 not working yet
+  // checkSparkAnswerAndOperator("SELECT hex(_9), hex(_10), hex(_11), 
hex(_12) FROM tbl")
+  checkSparkAnswerAndOperator("SELECT hex(_11), hex(_12) FROM tbl")
+
+  // strings, binary
+  checkSparkAnswerAndOperator("SELECT hex(_8), hex(_14) FROM tbl")

Review Comment:
   1caa6fd adds the other columns... `_13` also fails, as the rust code is 
provided with a `Dictionary(Int32, Utf8)` that isn't matched on.
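Since this thread is about which column types `hex` handles, a hedged standalone sketch of the Spark-style `hex()` semantics these tests compare against may help: integers are formatted as two's-complement 64-bit uppercase hex with leading zeros trimmed, strings as the hex of their UTF-8 bytes. `hex_i64` and `hex_str` are illustrative names, not Comet APIs:

```rust
// Integer case: reinterpret as u64 so negatives come out as their
// 64-bit two's-complement pattern; `{:X}` drops leading zeros.
fn hex_i64(v: i64) -> String {
    format!("{:X}", v as u64)
}

// String case: uppercase hex of each UTF-8 byte, zero-padded to 2 digits.
fn hex_str(s: &str) -> String {
    s.bytes().map(|b| format!("{:02X}", b)).collect()
}

fn main() {
    assert_eq!(hex_i64(17), "11");
    assert_eq!(hex_i64(-1), "FFFFFFFFFFFFFFFF");
    assert_eq!(hex_str("Spark SQL"), "537061726B2053514C");
    println!("ok");
}
```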






Re: [PR] Test for reading read statistics from parquet files without statistics [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10608:
URL: https://github.com/apache/datafusion/pull/10608#discussion_r1608873185


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -73,6 +73,22 @@ pub fn parquet_file_one_column(
 no_null_values_start: i64,
 no_null_values_end: i64,
 row_per_group: usize,
+) -> ParquetRecordBatchReaderBuilder {
+parquet_file_one_column_stats(
+num_null,
+no_null_values_start,
+no_null_values_end,
+row_per_group,
+EnabledStatistics::Chunk,
+)
+}
+
+pub fn parquet_file_one_column_stats(

Review Comment:
   Can we please add some doc comments about what this function does / how it 
is different? It is clear from this PR that it adds `enable_stats` but that 
might not be obvious in the future
   
   While we are on this topic, I personally find a struct with named parameters 
easier to read than something like the following, where you have to look up 
`parquet_file_one_column_stats` to know what the `0`, `4` and `7` mean.
   
   ```rust
   let reader =
   parquet_file_one_column_stats(0, 4, 7, row_per_group, 
EnabledStatistics::None);
   ```
   
   I wonder if we could change the code to something like this to make it 
easier to read 🤔 
   
   ```rust
 let reader = TestFile {
   num_null: 0,
   no_null_values_start: 4,
   no_null_values_end: 7,
   row_per_group,
   enable_stats: EnabledStatistics::None,
 }
 .build();
   ```
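The named-field pattern suggested above can be sketched standalone. `TestFile`, its fields, and `build` are taken from the suggestion; the `String` return is a stand-in for the real parquet-reader construction:

```rust
// Hypothetical mirror of the EnabledStatistics levels used in the tests.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum EnabledStatistics { None, Chunk, Page }

// Named fields make each value self-describing at the call site,
// unlike positional arguments such as (0, 4, 7, row_per_group, ...).
struct TestFile {
    num_null: i64,
    no_null_values_start: i64,
    no_null_values_end: i64,
    row_per_group: usize,
    enable_stats: EnabledStatistics,
}

impl TestFile {
    // Placeholder for "write the parquet file and return a reader":
    // here it just summarizes the configuration.
    fn build(self) -> String {
        format!(
            "nulls={} values={}..{} rows/group={} stats={:?}",
            self.num_null, self.no_null_values_start, self.no_null_values_end,
            self.row_per_group, self.enable_stats
        )
    }
}

fn main() {
    let desc = TestFile {
        num_null: 0,
        no_null_values_start: 4,
        no_null_values_end: 7,
        row_per_group: 5,
        enable_stats: EnabledStatistics::None,
    }
    .build();
    assert!(desc.contains("stats=None"));
    println!("{desc}");
}
```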
   
   
   






Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-21 Thread via GitHub


tshauck commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1608872166


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,26 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+Seq(true, false).foreach { dictionaryEnabled =>
+  withTempDir { dir =>
+val path = new Path(dir.toURI.toString, "hex.parquet")
+makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 
1)
+
+withParquetTable(path.toString, "tbl") {
+  // ints
+  checkSparkAnswerAndOperator("SELECT hex(_2), hex(_3), hex(_4), 
hex(_5) FROM tbl")
+
+  // uints, uint8 and uint16 not working yet
+  // checkSparkAnswerAndOperator("SELECT hex(_9), hex(_10), hex(_11), 
hex(_12) FROM tbl")
+  checkSparkAnswerAndOperator("SELECT hex(_11), hex(_12) FROM tbl")
+
+  // strings, binary
+  checkSparkAnswerAndOperator("SELECT hex(_8), hex(_14) FROM tbl")

Review Comment:
   1caa6fd adds the other columns... `_13` also fails, as the rust code is 
provided with a `Dictionary(Int32, Utf8)` that isn't matched on.






Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-21 Thread via GitHub


tshauck commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1608872166


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,26 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+Seq(true, false).foreach { dictionaryEnabled =>
+  withTempDir { dir =>
+val path = new Path(dir.toURI.toString, "hex.parquet")
+makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 
1)
+
+withParquetTable(path.toString, "tbl") {
+  // ints
+  checkSparkAnswerAndOperator("SELECT hex(_2), hex(_3), hex(_4), 
hex(_5) FROM tbl")
+
+  // uints, uint8 and uint16 not working yet
+  // checkSparkAnswerAndOperator("SELECT hex(_9), hex(_10), hex(_11), 
hex(_12) FROM tbl")
+  checkSparkAnswerAndOperator("SELECT hex(_11), hex(_12) FROM tbl")
+
+  // strings, binary
+  checkSparkAnswerAndOperator("SELECT hex(_8), hex(_14) FROM tbl")

Review Comment:
   1caa6fd adds the other columns... `_13` also fails.






Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-21 Thread via GitHub


tshauck commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1608871374


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+val str_table = "string_hex_table"
+withTable(str_table) {

Review Comment:
   Yeah, I think that's the case. I'll pare down the rust code.






Re: [PR] test: add more tests for statistics reading [datafusion]

2024-05-21 Thread via GitHub


NGA-TRAN commented on code in PR #10592:
URL: https://github.com/apache/datafusion/pull/10592#discussion_r1608870254


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -624,20 +624,281 @@ async fn test_dates_64_diff_rg_sizes() {
 .run("date64");
 }
 
+// BUG:
+// https://github.com/apache/datafusion/issues/10604
+#[tokio::test]
+async fn test_uint() {
+let row_per_group = 4;
+
+// This creates a parquet file of 4 columns named "u8", "u16", "u32", 
"u64"
+// "u8" --> UInt8Array
+// "u16" --> UInt16Array
+// "u32" --> UInt32Array
+// "u64" --> UInt64Array
+
+// The file is created from 4 record batches of 5 rows each (each with a null 
row), then split into 5 row groups of size 4
+let reader = parquet_file_many_columns(Scenario::UInt, 
row_per_group).await;
+
+// u8
+// BUG: expect UInt8Array but returns Int32Array
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0, 1, 4, 7, 251])), // should be UInt8Array
+expected_max: Arc::new(Int32Array::from(vec![3, 4, 6, 250, 254])), // should be UInt8Array
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![4, 4, 4, 4, 4]),
+}
+.run("u8");
+
+// u16
+// BUG: expect UInt16Array but returns Int32Array
+let reader = parquet_file_many_columns(Scenario::UInt, 
row_per_group).await;
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0, 1, 4, 7, 251])), // should be UInt16Array
+expected_max: Arc::new(Int32Array::from(vec![3, 4, 6, 250, 254])), // should be UInt16Array
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![4, 4, 4, 4, 4]),
+}
+.run("u16");
+
+// u32
+// BUG: expect UInt32Array but returns Int32Array
+let reader = parquet_file_many_columns(Scenario::UInt, 
row_per_group).await;
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0, 1, 4, 7, 251])), // should be UInt32Array
+expected_max: Arc::new(Int32Array::from(vec![3, 4, 6, 250, 254])), // should be UInt32Array
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![4, 4, 4, 4, 4]),
+}
+.run("u32");
+
+// u64
+// BUG: expect UInt64Array but returns Int64Array
+let reader = parquet_file_many_columns(Scenario::UInt, 
row_per_group).await;
+Test {
+reader,
+expected_min: Arc::new(Int64Array::from(vec![0, 1, 4, 7, 251])), // should be UInt64Array
+expected_max: Arc::new(Int64Array::from(vec![3, 4, 6, 250, 254])), // should be UInt64Array
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![4, 4, 4, 4, 4]),
+}
+.run("u64");
+}
+
+#[tokio::test]
+async fn test_int32_range() {
+let row_per_group = 5;
+// This creates a parquet file of 1 column "i"
+// file has 2 record batches, each has 2 rows. They will be saved into one 
row group
+let reader = parquet_file_many_columns(Scenario::Int32Range, 
row_per_group).await;
+
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0])),
+expected_max: Arc::new(Int32Array::from(vec![30])),
+expected_null_counts: UInt64Array::from(vec![0]),
+expected_row_counts: UInt64Array::from(vec![4]),
+}
+.run("i");
+}
+
+// BUG: returns Int32Array instead of UInt32Array
+// https://github.com/apache/datafusion/issues/10604
+#[tokio::test]
+async fn test_uint32_range() {
+let row_per_group = 5;
+// This creates a parquet file of 1 column "u"
+// file has 2 record batches, each has 2 rows. They will be saved into one 
row group
+let reader = parquet_file_many_columns(Scenario::UInt32Range, 
row_per_group).await;
+
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0])), // should be UInt32Array
+expected_max: Arc::new(Int32Array::from(vec![30])), // should be UInt32Array
+expected_null_counts: UInt64Array::from(vec![0]),
+expected_row_counts: UInt64Array::from(vec![4]),
+}
+.run("u");
+}
+
+#[tokio::test]
+async fn test_float64() {
+let row_per_group = 5;
+// This creates a parquet file of 1 column "f"
+// file has 4 record batches, each has 5 rows. They will be saved into 4 
row groups
+let reader = parquet_file_many_columns(Scenario::Float64, 
row_per_group).await;
+
+Test {
+reader,
+expected_min: Arc::new(Float64Array::from(vec![-5.0, -4.0, -0.0, 
5.0])),
+expected_max: Arc::new(Float64Array::from(vec![-1.0, 0.0, 4.0, 9.0])),
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![5, 5, 5, 5]),
+}
+.run("f");
+}
+
+#[tok

Re: [PR] Start setting up new StreamTable config [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10600:
URL: https://github.com/apache/datafusion/pull/10600#discussion_r1608866960


##
datafusion/core/src/datasource/stream.rs:
##
@@ -103,19 +105,29 @@ impl FromStr for StreamEncoding {
 }
 }
 
-/// The configuration for a [`StreamTable`]
+pub trait StreamSource: std::fmt::Debug + Send + Sync {

Review Comment:
   Perhaps some documentation about what this trait is meant to represent would 
help






Re: [PR] feat: correlation support [datafusion-comet]

2024-05-21 Thread via GitHub


kazuyukitanimura commented on code in PR #456:
URL: https://github.com/apache/datafusion-comet/pull/456#discussion_r1608866121


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -547,6 +547,26 @@ object QueryPlanSerde extends Logging with 
ShimQueryPlanSerde with CometExprShim
   withInfo(aggExpr, child)
   None
 }
+  case corr @ Corr(child1, child2, nullOnDivideByZero) =>
+val child1Expr = exprToProto(child1, inputs, binding)
+val child2Expr = exprToProto(child2, inputs, binding)
+val dataType = serializeDataType(corr.dataType)
+
+if (child1Expr.isDefined && child2Expr.isDefined && 
dataType.isDefined) {
+  val corrBuilder = ExprOuterClass.Correlation.newBuilder()
+  corrBuilder.setChild1(child1Expr.get)
+  corrBuilder.setChild2(child2Expr.get)
+  corrBuilder.setNullOnDivideByZero(nullOnDivideByZero)
+  corrBuilder.setDatatype(dataType.get)
+
+  Some(
+ExprOuterClass.AggExpr
+  .newBuilder()
+  .setCorrelation(corrBuilder)
+  .build())
+} else {
+  None

Review Comment:
   Should we add `withInfo()`?



##
spark/src/test/scala/org/apache/comet/exec/CometAggregateSuite.scala:
##
@@ -1212,6 +1212,157 @@ class CometAggregateSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
 }
   }
 
+  test("correlation") {
+withSQLConf(CometConf.COMET_EXEC_SHUFFLE_ENABLED.key -> "true") {
+  Seq(false).foreach { cometColumnShuffleEnabled =>
+withSQLConf(
+  CometConf.COMET_COLUMNAR_SHUFFLE_ENABLED.key -> 
cometColumnShuffleEnabled.toString) {
+  Seq(false).foreach { dictionary =>
+withSQLConf("parquet.enable.dictionary" -> dictionary.toString) {
+  Seq(true).foreach { nullOnDivideByZero =>
+withSQLConf(
+  "spark.sql.legacy.statisticalAggregate" -> 
nullOnDivideByZero.toString) {
+  val table = "test"
+  withTable(table) {
+sql(
+  s"create table $table(col1 double, col2 double, col3 
double) using parquet")
+sql(s"insert into $table values(1, 4, 1), (2, 5, 1), (3, 
6, 2)")
+val expectedNumOfCometAggregates = 2
+
+sql("SELECT corr(col1, col2) FROM test GROUP BY col3").show

Review Comment:
   nit: we can reuse `$table` here instead of hardcoding `test`
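As background for the `nullOnDivideByZero` flag serialized in the diff above, a hedged sketch of Pearson correlation with that behavior. The interpretation of the flag (return SQL NULL rather than NaN when either variance is zero) is an assumption about its meaning, and this is an illustration, not the Comet/DataFusion implementation:

```rust
// Pearson correlation: cov(x, y) / sqrt(var(x) * var(y)), with the
// divide-by-zero case controlled by `null_on_divide_by_zero`.
fn corr(xs: &[f64], ys: &[f64], null_on_divide_by_zero: bool) -> Option<f64> {
    let n = xs.len() as f64;
    let mx = xs.iter().sum::<f64>() / n;
    let my = ys.iter().sum::<f64>() / n;
    let cov: f64 = xs.iter().zip(ys).map(|(x, y)| (x - mx) * (y - my)).sum();
    let vx: f64 = xs.iter().map(|x| (x - mx).powi(2)).sum();
    let vy: f64 = ys.iter().map(|y| (y - my).powi(2)).sum();
    let denom = (vx * vy).sqrt();
    if denom == 0.0 {
        if null_on_divide_by_zero { None } else { Some(f64::NAN) }
    } else {
        Some(cov / denom)
    }
}

fn main() {
    // The test's group col3 = 1 holds (1, 4) and (2, 5): perfectly linear,
    // so corr is 1.
    assert!((corr(&[1.0, 2.0], &[4.0, 5.0], true).unwrap() - 1.0).abs() < 1e-12);
    // A constant column has zero variance -> NULL when the flag is set.
    assert_eq!(corr(&[1.0, 1.0], &[4.0, 5.0], true), None);
    println!("ok");
}
```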






Re: [PR] Implement Unparser for `UNION ALL` [datafusion]

2024-05-21 Thread via GitHub


alamb merged PR #10603:
URL: https://github.com/apache/datafusion/pull/10603





Re: [PR] Implement Unparser for `UNION ALL` [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10603:
URL: https://github.com/apache/datafusion/pull/10603#discussion_r1608860177


##
datafusion/sql/src/unparser/plan.rs:
##
@@ -347,8 +373,33 @@ impl Unparser<'_> {
 
 Ok(())
 }
-LogicalPlan::Union(_union) => {
-not_impl_err!("Unsupported operator: {plan:?}")
+LogicalPlan::Union(union) => {

Review Comment:
   I double checked that Union is UNION ALL 
https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.Union.html



##
datafusion/sql/tests/sql_integration.rs:
##
@@ -4641,6 +4641,14 @@ fn roundtrip_statement() -> Result<()> {
 group by "Last Name", p.id 
 having count_first_name>5 and count_first_name<10
 order by count_first_name, "Last Name""#,
+r#"SELECT j1_string as string FROM j1

Review Comment:
   Perhaps we should also add a test (as a follow-on PR) for `UNION` (not 
`UNION ALL`).






Re: [PR] Rename monotonicity as output_ordering in ScalarUDF's [datafusion]

2024-05-21 Thread via GitHub


alamb merged PR #10596:
URL: https://github.com/apache/datafusion/pull/10596





Re: [PR] Add to_date function to scalar functions doc [datafusion]

2024-05-21 Thread via GitHub


alamb merged PR #10601:
URL: https://github.com/apache/datafusion/pull/10601





Re: [I] Add to_date function to scalar functions doc [datafusion]

2024-05-21 Thread via GitHub


alamb closed issue #10461: Add to_date function to scalar functions doc
URL: https://github.com/apache/datafusion/issues/10461





Re: [I] error: this arithmetic operation will overflow (on i386) [datafusion]

2024-05-21 Thread via GitHub


alamb closed issue #10552: error: this arithmetic operation will overflow (on 
i386)
URL: https://github.com/apache/datafusion/issues/10552





Re: [PR] Fix compilation of datafusion-cli on 32bit targets [datafusion]

2024-05-21 Thread via GitHub


alamb merged PR #10594:
URL: https://github.com/apache/datafusion/pull/10594





Re: [PR] Fix compilation of datafusion-cli on 32bit targets [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10594:
URL: https://github.com/apache/datafusion/pull/10594#issuecomment-2123297797

   Thank you @nathaniel-daniel  and @jonahgao 





Re: [PR] refactor: reduce allocations in push down filter [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10567:
URL: https://github.com/apache/datafusion/pull/10567#issuecomment-2123296914

   Thanks again @erratic-pattern 





Re: [PR] refactor: reduce allocations in push down filter [datafusion]

2024-05-21 Thread via GitHub


alamb merged PR #10567:
URL: https://github.com/apache/datafusion/pull/10567





[PR] read statistics from parquet without statistics [datafusion]

2024-05-21 Thread via GitHub


NGA-TRAN opened a new pull request, #10608:
URL: https://github.com/apache/datafusion/pull/10608

   ## Which issue does this PR close?
   
   
   
   More tests for https://github.com/apache/datafusion/issues/10453
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] test: add more tests for statistics reading [datafusion]

2024-05-21 Thread via GitHub


alamb merged PR #10592:
URL: https://github.com/apache/datafusion/pull/10592





Re: [PR] test: add more tests for statistics reading [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10592:
URL: https://github.com/apache/datafusion/pull/10592#issuecomment-2123281857

   Since this PR is just tests, merging it in





Re: [PR] test: add more tests for statistics reading [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10592:
URL: https://github.com/apache/datafusion/pull/10592#discussion_r1608840230


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -624,20 +624,281 @@ async fn test_dates_64_diff_rg_sizes() {
 .run("date64");
 }
 
+// BUG:
+// https://github.com/apache/datafusion/issues/10604
+#[tokio::test]
+async fn test_uint() {
+let row_per_group = 4;
+
+// This creates a parquet file of 4 columns named "u8", "u16", "u32", 
"u64"
+// "u8" --> UInt8Array
+// "u16" --> UInt16Array
+// "u32" --> UInt32Array
+// "u64" --> UInt64Array
+
+// The file is created by 4 record batches (each has a null row), each has 
5 rows but then will be split into 5 row groups with size 4
+let reader = parquet_file_many_columns(Scenario::UInt, 
row_per_group).await;
+
+// u8
+// BUG: expect UInt8Array but returns Int32Array
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0, 1, 4, 7, 251])), // 
should be UInt8Array
+expected_max: Arc::new(Int32Array::from(vec![3, 4, 6, 250, 254])), // 
should be UInt8Array
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![4, 4, 4, 4, 4]),
+}
+.run("u8");
+
+// u16
+// BUG: expect UInt16Array but returns Int32Array
+let reader = parquet_file_many_columns(Scenario::UInt, 
row_per_group).await;
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0, 1, 4, 7, 251])), // 
should be UInt16Array
+expected_max: Arc::new(Int32Array::from(vec![3, 4, 6, 250, 254])), // 
should be UInt16Array
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![4, 4, 4, 4, 4]),
+}
+.run("u16");
+
+// u32
+// BUG: expect UInt32Array but returns Int32Array
+let reader = parquet_file_many_columns(Scenario::UInt, 
row_per_group).await;
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0, 1, 4, 7, 251])), // 
should be UInt32Array
+expected_max: Arc::new(Int32Array::from(vec![3, 4, 6, 250, 254])), // 
should be UInt32Array
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![4, 4, 4, 4, 4]),
+}
+.run("u32");
+
+// u64
+// BUG: expect UInt64Array but returns Int64Array
+let reader = parquet_file_many_columns(Scenario::UInt, 
row_per_group).await;
+Test {
+reader,
+expected_min: Arc::new(Int64Array::from(vec![0, 1, 4, 7, 251])), // 
should be UInt64Array
+expected_max: Arc::new(Int64Array::from(vec![3, 4, 6, 250, 254])), // 
should be UInt64Array
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![4, 4, 4, 4, 4]),
+}
+.run("u64");
+}
+
+#[tokio::test]
+async fn test_int32_range() {
+let row_per_group = 5;
+// This creates a parquet file of 1 column "i"
+// file has 2 record batches, each has 2 rows. They will be saved into one 
row group
+let reader = parquet_file_many_columns(Scenario::Int32Range, 
row_per_group).await;
+
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0])),
+expected_max: Arc::new(Int32Array::from(vec![30])),
+expected_null_counts: UInt64Array::from(vec![0]),
+expected_row_counts: UInt64Array::from(vec![4]),
+}
+.run("i");
+}
+
+// BUG: returns Int32Array instead of UInt32Array
+// https://github.com/apache/datafusion/issues/10604
+#[tokio::test]
+async fn test_uint32_range() {
+let row_per_group = 5;
+// This creates a parquet file of 1 column "u"
+// file has 2 record batches, each has 2 rows. They will be saved into one 
row group
+let reader = parquet_file_many_columns(Scenario::UInt32Range, 
row_per_group).await;
+
+Test {
+reader,
+expected_min: Arc::new(Int32Array::from(vec![0])), // should be 
UInt32Array
+expected_max: Arc::new(Int32Array::from(vec![30])), // should be 
UInt32Array
+expected_null_counts: UInt64Array::from(vec![0]),
+expected_row_counts: UInt64Array::from(vec![4]),
+}
+.run("u");
+}
+
+#[tokio::test]
+async fn test_float64() {
+let row_per_group = 5;
+// This creates a parquet file of 1 column "f"
+// file has 4 record batches, each has 5 rows. They will be saved into 4 
row groups
+let reader = parquet_file_many_columns(Scenario::Float64, 
row_per_group).await;
+
+Test {
+reader,
+expected_min: Arc::new(Float64Array::from(vec![-5.0, -4.0, -0.0, 
5.0])),
+expected_max: Arc::new(Float64Array::from(vec![-1.0, 0.0, 4.0, 9.0])),
+expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]),
+expected_row_counts: UInt64Array::from(vec![5, 5, 5, 5]),
+}
+.run("f");
+}
+
+#[tokio:

Re: [PR] fix: Specify schema when converting TPC-H csv to parquet [datafusion-benchmarks]

2024-05-21 Thread via GitHub


andygrove merged PR #3:
URL: https://github.com/apache/datafusion-benchmarks/pull/3





Re: [PR] Fixes bug expect `Date32Array` but returns Int32Array [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10593:
URL: https://github.com/apache/datafusion/pull/10593#discussion_r1608838018


##
datafusion/core/tests/parquet/arrow_statistics.rs:
##
@@ -593,10 +593,16 @@ async fn test_dates_32_diff_rg_sizes() {
 
 Test {
 reader,
-// mins are [18262, 18565,]
-expected_min: Arc::new(Int32Array::from(vec![18262, 18565])),
-// maxes are [18564, 21865,]
-expected_max: Arc::new(Int32Array::from(vec![18564, 21865])),
+// mins are [2020-01-01, 2020-10-30]
+expected_min: Arc::new(Date32Array::from(vec![

Review Comment:
   ❤️  that is much easier to read



##
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##
@@ -75,6 +75,12 @@ macro_rules! get_statistic {
 *scale,
 ))
 }
+Some(DataType::Date32) => {
+Some(ScalarValue::Date32(Some(*s.$func())))
+}
+Some(DataType::Date64) => {

Review Comment:
   I don't think a Date64 be stored as an `Int32` in parquet. Is this branch 
required to get the tests to pass? I would be surprised if it were covered



##
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##
@@ -75,6 +75,12 @@ macro_rules! get_statistic {
 *scale,
 ))
 }
+Some(DataType::Date32) => {
+Some(ScalarValue::Date32(Some(*s.$func())))
+}
+Some(DataType::Date64) => {
+Some(ScalarValue::Date64(Some(i64::from(*s.$func()) * 
24 * 60 * 60 * 1000)))
+}
 _ => Some(ScalarValue::Int32(Some(*s.$func()))),

Review Comment:
I agree it is likely what is masking all the bugs
   
   I am not sure if it is worth explicitly listing all the types out here to be 
honest 🤔  I am hoping in some future PR to revamp how these statistics are done 
entirely (basically match on the arrow target type in the outer loop rather 
than the inner loop)
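
   For readers skimming the quoted diff: the `Date64` branch above is a plain 
days-to-milliseconds conversion. A stdlib-only sketch of that arithmetic 
(the constant and function names here are illustrative, not DataFusion's):

   ```rust
   // Illustrative sketch of the conversion in the quoted Date64 branch:
   // parquet stores Date32 as i32 days since the Unix epoch, while arrow's
   // Date64 carries milliseconds since the epoch.
   const MS_PER_DAY: i64 = 24 * 60 * 60 * 1000;

   fn date32_days_to_date64_ms(days: i32) -> i64 {
       i64::from(days) * MS_PER_DAY
   }

   fn main() {
       // 18262 days after 1970-01-01 is 2020-01-01, the min used in the tests.
       assert_eq!(date32_days_to_date64_ms(18262), 1_577_836_800_000);
       println!("ok");
   }
   ```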






Re: [PR] Fixes bug expect `Date32Array` but returns Int32Array [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10593:
URL: https://github.com/apache/datafusion/pull/10593#issuecomment-2123268926

   > Not a PMC or a full-time committer, my review is not very valuable, but 
looks good to me.
   
   Your review is valuable in my opinion -- thank you @edmondop 





Re: [PR] Refactor parquet row group pruning into a struct (use new statistics API, part 1) [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10607:
URL: https://github.com/apache/datafusion/pull/10607#discussion_r1608821292


##
datafusion/core/src/datasource/physical_plan/parquet/mod.rs:
##
@@ -556,32 +557,36 @@ impl FileOpener for ParquetOpener {
 };
 };
 
-// Row group pruning by statistics: attempt to skip entire 
row_groups
-// using metadata on the row groups
+// Determine which row groups to actually read. The idea is to skip
+// as many row groups as possible based on the metadata and query
 let file_metadata = builder.metadata().clone();
 let predicate = pruning_predicate.as_ref().map(|p| p.as_ref());
-let mut row_groups = row_groups::prune_row_groups_by_statistics(
-&file_schema,
-builder.parquet_schema(),
-file_metadata.row_groups(),
-file_range,
-predicate,
-&file_metrics,
-);
+let rg_metadata = file_metadata.row_groups();
+// track which row groups to actually read
+let mut row_groups = RowGroupSet::new(rg_metadata.len());

Review Comment:
   The whole point of this PR is to move the set of RowGroups that are still 
under consideration into a `RowGroupSet` rather than a `Vec`
   
   This makes it clearer what is going on in my opinion, as well as sets us up 
for vectorized pruning evaluation in the next PR
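
   As a usage-level illustration of that design, here is a hypothetical, 
stdlib-only mock of the `RowGroupSet` idea (names mirror the review, but 
this is a sketch, not the PR's code): pruning starts from "scan everything" 
and each pass may only turn row groups off.

   ```rust
   // Hypothetical sketch of the RowGroupSet design discussed above: a
   // bitmask over row groups instead of a Vec of surviving indexes. All
   // groups start as "scan"; pruning passes may only clear entries.
   struct RowGroupSet {
       row_groups: Vec<bool>,
   }

   impl RowGroupSet {
       fn new(num_row_groups: usize) -> Self {
           Self {
               row_groups: vec![true; num_row_groups],
           }
       }
       fn do_not_scan(&mut self, idx: usize) {
           self.row_groups[idx] = false;
       }
       fn indexes(&self) -> Vec<usize> {
           self.row_groups
               .iter()
               .enumerate()
               .filter_map(|(idx, &scan)| scan.then_some(idx))
               .collect()
       }
   }

   fn main() {
       let mut set = RowGroupSet::new(4);
       set.do_not_scan(1); // e.g. pruned by file range
       set.do_not_scan(3); // e.g. pruned by statistics
       assert_eq!(set.indexes(), vec![0, 2]);
       println!("ok");
   }
   ```

   Because each pass mutates the same set, later passes never revisit a row 
group an earlier pass already excluded.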



##
datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs:
##
@@ -1196,14 +1223,60 @@ mod tests {
 .await
 }
 
+// What row groups are expected to be left after pruning

Review Comment:
   I was finding it hard to tell from `vec![0]` what was being tested, so I made 
this structure to encapsulate it



##
datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs:
##
@@ -38,42 +38,94 @@ use crate::physical_optimizer::pruning::{PruningPredicate, 
PruningStatistics};
 
 use super::ParquetFileMetrics;
 
-/// Prune row groups based on statistics
+/// Tracks which RowGroups within a parquet file should be scanned.
 ///
-/// Returns a vector of indexes into `groups` which should be scanned.
-///
-/// If an index is NOT present in the returned Vec it means the
-/// predicate filtered all the row group.
-///
-/// If an index IS present in the returned Vec it means the predicate
-/// did not filter out that row group.
-///
-/// Note: This method currently ignores ColumnOrder
-/// 
-pub(crate) fn prune_row_groups_by_statistics(
-arrow_schema: &Schema,
-parquet_schema: &SchemaDescriptor,
-groups: &[RowGroupMetaData],
-range: Option<FileRange>,
-predicate: Option<&PruningPredicate>,
-metrics: &ParquetFileMetrics,
-) -> Vec<usize> {
-let mut filtered = Vec::with_capacity(groups.len());
-for (idx, metadata) in groups.iter().enumerate() {
-if let Some(range) = &range {
-// figure out where the first dictionary page (or first data page 
are)
+/// This struct encapsulates the various types of pruning that can be applied 
to
+/// a set of row groups within a parquet file.
+#[derive(Debug, PartialEq)]
+pub(crate) struct RowGroupSet {
+/// row_groups[i] is true if the i-th row group should be scanned
+row_groups: Vec<bool>,
+}
+
+impl RowGroupSet {
+/// Create a new RowGroupSet with all row groups set to true (will be 
scanned)
+pub fn new(num_row_groups: usize) -> Self {
+Self {
+row_groups: vec![true; num_row_groups],
+}
+}
+
+/// Set the i-th row group to false (should not be scanned)
+pub fn do_not_scan(&mut self, idx: usize) {
+self.row_groups[idx] = false;
+}
+
+/// return true if the i-th row group should be scanned
+fn should_scan(&self, idx: usize) -> bool {
+self.row_groups[idx]
+}
+
+pub fn len(&self) -> usize {
+self.row_groups.len()
+}
+
+pub fn is_empty(&self) -> bool {
+self.row_groups.is_empty()
+}
+
+/// Return an iterator over the row group indexes that should be scanned
+pub fn iter(&self) -> impl Iterator<Item = usize> + '_ {
+self.row_groups
+.iter()
+.enumerate()
+.filter_map(|(idx, &b)| if b { Some(idx) } else { None })
+}
+
+/// Return a vector with the row group indexes that should be scanned
+pub fn indexes(&self) -> Vec<usize> {
+self.iter().collect()
+}
+
+/// Prune remaining row groups so that only those row groups within the
+/// specified range are scanned.
+///
+/// Updates this set to mark row groups that should not be scanned
+pub fn prune_by_range(&mut self, groups: &[RowGroupMetaData], range: 
&FileRange) {
+for (idx, metadata) in groups.iter().enumerate() {

Review Comment:
   I moved the range filtering into its own function to make it clearer what is 
going on (it

Re: [PR] feat: Add HashJoin support for BuildRight [datafusion-comet]

2024-05-21 Thread via GitHub


viirya commented on code in PR #437:
URL: https://github.com/apache/datafusion-comet/pull/437#discussion_r1608831021


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -2417,11 +2417,12 @@ object QueryPlanSerde extends Logging with 
ShimQueryPlanSerde with CometExprShim
   return None
 }
 
-if (join.buildSide == BuildRight) {
-  // DataFusion HashJoin assumes build side is always left.
-  // TODO: support BuildRight
-  withInfo(join, "BuildRight is not supported")
-  return None
+join match {
+  case b: BroadcastHashJoinExec if b.isNullAwareAntiJoin =>
+// DataFusion HashJoin LeftAnti has bugs on null keys.
+withInfo(join, "DataFusion doesn't support null-aware anti join")
+return None
+  case _ => // no-op

Review Comment:
   https://github.com/apache/datafusion-comet/issues/457






[I] Comet doesn't support Spark BroadcastHashJoinExec if it is null-aware anti-join [datafusion-comet]

2024-05-21 Thread via GitHub


viirya opened a new issue, #457:
URL: https://github.com/apache/datafusion-comet/issues/457

   ### What is the problem the feature request solves?
   
   DataFusion HashJoin LeftAnti doesn't support null-aware anti join.
   
   See https://github.com/apache/datafusion/issues/10583
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_
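
   For context on the semantics involved: a null-aware anti join is what 
`NOT IN` requires under SQL three-valued logic. A hedged, stdlib-only model 
of those semantics (an illustration, not the engine's implementation):

   ```rust
   // Sketch of null-aware anti-join semantics (what Spark's
   // BroadcastHashJoinExec with isNullAwareAntiJoin implements for NOT IN).
   // Under three-valued logic, a NULL key on the build side means
   // `x NOT IN (...)` can never be true, so the result is empty; a NULL
   // probe key is likewise never emitted.
   fn null_aware_anti_join(probe: &[Option<i32>], build: &[Option<i32>]) -> Vec<i32> {
       if build.iter().any(|k| k.is_none()) {
           return Vec::new(); // NULL on the build side: NOT IN is never true
       }
       probe
           .iter()
           .filter_map(|k| *k) // NULL probe keys are never emitted
           .filter(|k| !build.contains(&Some(*k)))
           .collect()
   }

   fn main() {
       // Plain anti join: 1 has no match, 2 does.
       assert_eq!(null_aware_anti_join(&[Some(1), Some(2)], &[Some(2)]), vec![1]);
       // A NULL build-side key empties the result entirely.
       assert!(null_aware_anti_join(&[Some(1), Some(2)], &[Some(2), None]).is_empty());
       println!("ok");
   }
   ```

   A LeftAnti join that ignores the NULL cases returns `[1]` in both 
situations, which is the divergence the issue describes.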





[PR] fix: Specify schema when converting TPC-H csv to parquet [datafusion-benchmarks]

2024-05-21 Thread via GitHub


andygrove opened a new pull request, #3:
URL: https://github.com/apache/datafusion-benchmarks/pull/3

   (no comment)





Re: [PR] feat: correlation support [datafusion-comet]

2024-05-21 Thread via GitHub


codecov-commenter commented on PR #456:
URL: https://github.com/apache/datafusion-comet/pull/456#issuecomment-2123243529

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/456?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   Attention: Patch coverage is `85.71429%` with `2 lines` in your changes are 
missing coverage. Please review.
   > Project coverage is 34.02%. Comparing base 
[(`14494d3`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/14494d3a06338b28ce8ad31d032ac60b75f4c227?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`fead995`)](https://app.codecov.io/gh/apache/datafusion-comet/pull/456?dropdown=coverage&src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 26 commits behind head on main.
   
   | 
[Files](https://app.codecov.io/gh/apache/datafusion-comet/pull/456?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[.../scala/org/apache/comet/serde/QueryPlanSerde.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/456?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2FQueryPlanSerde.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9RdWVyeVBsYW5TZXJkZS5zY2FsYQ==)
 | 85.71% | [1 Missing and 1 partial :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/456?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main #456  +/-   ##
   
   - Coverage 34.02%   34.02%   -0.01% 
   - Complexity  857  859   +2 
   
 Files   116  116  
 Lines 3856538671 +106 
 Branches   8517 8564  +47 
   
   + Hits  1312013156  +36 
   - Misses2269122753  +62 
   - Partials   2754 2762   +8 
   ```
   
   
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/456?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   





[PR] Refactor parquet row group pruning into a struct [datafusion]

2024-05-21 Thread via GitHub


alamb opened a new pull request, #10607:
URL: https://github.com/apache/datafusion/pull/10607

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   





Re: [PR] PhysicalExpr Orderings with Range Information [datafusion]

2024-05-21 Thread via GitHub


alamb commented on code in PR #10504:
URL: https://github.com/apache/datafusion/pull/10504#discussion_r1608748477


##
datafusion/functions/src/math/monotonicity.rs:
##
@@ -0,0 +1,241 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::datatypes::DataType;

Review Comment:
   Ideally I would recommend .slt tests (that show sorts/not sorts for 
example), but I am not sure if you have sufficient bandwidth to do so. Maybe 
unit tests would be best. 






Re: [I] DataFusion to run SQL queries on Parquet files with error No suitable object store found for file [datafusion]

2024-05-21 Thread via GitHub


alamb commented on issue #9280:
URL: https://github.com/apache/datafusion/issues/9280#issuecomment-2123183351

   > I'd be happy to contribute some docs / examples if you point me at 
something similar.
   
   Thanks @aditanase 🙏 
   
   I would recommend two things:
   
   # Suggestion 1: Change the error message 
   
   Change the error message to include instructions on how to fix the issue
   
   For example, instead of 
   > No suitable object store found for file://
   
   Perhaps it could be something more like
   
   > No object store registered for urls starting with 'file://'. See 
`RuntimeEnv::register_object_store`
   
   # Suggestion 2: Add an example to the docs
   
   Once the error message directs to  `RuntimeEnv::register_object_store` then 
it would be ideal if the docs
   
   
https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnv.html#method.register_object_store
   
   actually had an example
   
   So that would perhaps take the form of adapting the example above as a doc 
example
   
   Does that make sense?





Re: [I] [EPIC] Efficiently and correctly extract parquet statistics into ArrayRefs [datafusion]

2024-05-21 Thread via GitHub


alamb commented on issue #10453:
URL: https://github.com/apache/datafusion/issues/10453#issuecomment-2123173647

   > @alamb I have created 2 more bug tickets but I cannot edit the description 
to add them in the subtasks. Can you help with that?
   
   Done





Re: [PR] test: add more tests for statistics reading [datafusion]

2024-05-21 Thread via GitHub


alamb commented on PR #10592:
URL: https://github.com/apache/datafusion/pull/10592#issuecomment-2123171863

   > What I still cannot understand is this a regression test for the bug we 
missed earlier?
   
   My strong suspicion is that the bugs @NGA-TRAN is finding would manifest 
themselves as potentially incorrect results when reading parquet files with 
predicates on these types of columns. 
   
   It may also simply manifest as not being able to prune row groups based on 
such predicates
   
   I haven't worked to make a reproducer as it would likely require creating 
parquet files with multiple row groups with carefully chosen data patterns
   





Re: [PR] feat: correlation support [datafusion-comet]

2024-05-21 Thread via GitHub


huaxingao closed pull request #456: feat: correlation support
URL: https://github.com/apache/datafusion-comet/pull/456





Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-21 Thread via GitHub


kazuyukitanimura commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1608714060


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,26 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+Seq(true, false).foreach { dictionaryEnabled =>
+  withTempDir { dir =>
+val path = new Path(dir.toURI.toString, "hex.parquet")
+makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 
1)
+
+withParquetTable(path.toString, "tbl") {
+  // ints
+  checkSparkAnswerAndOperator("SELECT hex(_2), hex(_3), hex(_4), hex(_5) FROM tbl")
+
+  // uints, uint8 and uint16 not working yet

Review Comment:
   I remember unsigned ints currently do not work in other places: 
https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala#L97
   
   I think it is OK to leave this as a TODO



##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,26 @@ class CometExpressionSuite extends CometTestBase with AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+Seq(true, false).foreach { dictionaryEnabled =>
+  withTempDir { dir =>
+val path = new Path(dir.toURI.toString, "hex.parquet")
+makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 1)
+
+withParquetTable(path.toString, "tbl") {
+  // ints
+  checkSparkAnswerAndOperator("SELECT hex(_2), hex(_3), hex(_4), hex(_5) FROM tbl")
+
+  // uints, uint8 and uint16 not working yet
+  // checkSparkAnswerAndOperator("SELECT hex(_9), hex(_10), hex(_11), hex(_12) FROM tbl")
+  checkSparkAnswerAndOperator("SELECT hex(_11), hex(_12) FROM tbl")
+
+  // strings, binary
+  checkSparkAnswerAndOperator("SELECT hex(_8), hex(_14) FROM tbl")

Review Comment:
   What about other types like _1, _6, _7, _13, _15+?






Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-21 Thread via GitHub


kazuyukitanimura commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1608699758


##
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala:
##
@@ -1038,6 +1038,46 @@ class CometExpressionSuite extends CometTestBase with 
AdaptiveSparkPlanHelper {
   }
 }
   }
+  test("hex") {
+val str_table = "string_hex_table"
+withTable(str_table) {

Review Comment:
   @advancedxy got a good point. Spark may be casting ints to `i64`. Then we 
only need to handle `i64` in Rust as long as it passes the tests.






Re: [I] Ensure examples stay updated in CI. [datafusion-python]

2024-05-21 Thread via GitHub


andygrove closed issue #696: Ensure examples stay updated in CI.
URL: https://github.com/apache/datafusion-python/issues/696





Re: [I] Regression in `substr` performance from 37.1.0 to 38.0.0 [datafusion-python]

2024-05-21 Thread via GitHub


andygrove closed issue #712: Regression in `substr` performance from 37.1.0 to 
38.0.0
URL: https://github.com/apache/datafusion-python/issues/712





Re: [PR] tsaucer/run TPC-H examples in CI [datafusion-python]

2024-05-21 Thread via GitHub


andygrove merged PR #711:
URL: https://github.com/apache/datafusion-python/pull/711





Re: [I] Make it easier to create WindowFunctions with the Expr API [datafusion]

2024-05-21 Thread via GitHub


timsaucer commented on issue #6747:
URL: https://github.com/apache/datafusion/issues/6747#issuecomment-2123031753

   Oh, great. Have you been able to run the [example code 
above](https://github.com/apache/datafusion/issues/6747#issuecomment-2090260284)
 using the new easy interface?





Re: [PR] test: add more tests for statistics reading [datafusion]

2024-05-21 Thread via GitHub


NGA-TRAN commented on PR #10592:
URL: https://github.com/apache/datafusion/pull/10592#issuecomment-2123026061

   @comphead 
   
   > What I still cannot understand is this a regression test for the bug we 
missed earlier?
   
   I am working on the new arrow statistics API 
https://github.com/apache/datafusion/issues/10453, and @alamb suggested that I 
add more coverage tests as well as move the tests for private functions in 
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs to this 
file. Sorry, I was not aware of any related bug tickets.





Re: [PR] feat: add hex scalar function [datafusion-comet]

2024-05-21 Thread via GitHub


advancedxy commented on code in PR #449:
URL: https://github.com/apache/datafusion-comet/pull/449#discussion_r1608635628


##
core/src/execution/datafusion/expressions/scalar_funcs/hex.rs:
##
@@ -0,0 +1,191 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use arrow::array::as_string_array;
+use arrow_array::StringArray;
+use arrow_schema::DataType;
+use datafusion::logical_expr::ColumnarValue;
+use datafusion_common::{
+cast::{as_binary_array, as_int64_array},
+exec_err, DataFusionError, ScalarValue,
+};
+use std::fmt::Write;
+
+fn hex_bytes(bytes: &[u8]) -> Vec<u8> {
+let length = bytes.len();
+let mut value = vec![0; length * 2];
+let mut i = 0;
+while i < length {
+value[i * 2] = (bytes[i] & 0xF0) >> 4;
+value[i * 2 + 1] = bytes[i] & 0x0F;
+i += 1;
+}
+value
+}
+
+fn hex_int64(num: i64) -> String {
+if num >= 0 {
+format!("{:X}", num)
+} else {
+format!("{:016X}", num as u64)
+}
+}
+
+fn hex_string(s: &str) -> Vec<u8> {
+hex_bytes(s.as_bytes())
+}
+
+fn hex_bytes_to_string(bytes: &[u8]) -> Result<String, std::fmt::Error> {
+let mut hex_string = String::with_capacity(bytes.len() * 2);
+for byte in bytes {
+write!(&mut hex_string, "{:01X}", byte)?;

Review Comment:
   I think I can move sha2/md5 related to scalar_funcs in a follow-up pr. It’s 
indeed a better place. For the hex_encode, maybe we need to extract it to 
hex_utils or simply in hex.rs, and make that crate public. I don’t think it’s a 
blocker to this PR. Just saying it so that it can be addressed later.
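   As a side note on the `i64` path quoted above: `hex_int64` follows Spark's `hex` semantics, printing non-negative values without padding and negative values as the full 16-digit two's-complement form. A standalone sketch of just that helper (copied from the diff for illustration, not the final API):

```rust
// Mirrors the hex_int64 helper from the diff above: Spark's hex() prints
// non-negative i64 values without zero-padding, and negative values as the
// full 16-digit two's-complement representation of the 64-bit pattern.
fn hex_int64(num: i64) -> String {
    if num >= 0 {
        format!("{:X}", num)
    } else {
        format!("{:016X}", num as u64)
    }
}

fn main() {
    assert_eq!(hex_int64(0), "0");
    assert_eq!(hex_int64(255), "FF");
    assert_eq!(hex_int64(-1), "FFFFFFFFFFFFFFFF");
    println!("ok");
}
```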






Re: [PR] Minor: Generate the supported Spark builtin expression list into MD file [datafusion-comet]

2024-05-21 Thread via GitHub


comphead commented on code in PR #455:
URL: https://github.com/apache/datafusion-comet/pull/455#discussion_r1608634043


##
docs/spark_expressions_support.md:
##
@@ -0,0 +1,477 @@
+
+
+# Supported Spark Expressions
+
+### agg_funcs
+ - [ ] any
+ - [ ] any_value
+ - [ ] approx_count_distinct
+ - [ ] approx_percentile
+ - [ ] array_agg
+ - [ ] avg
+ - [ ] bit_and
+ - [ ] bit_or
+ - [ ] bit_xor
+ - [ ] bool_and
+ - [ ] bool_or
+ - [ ] collect_list
+ - [ ] collect_set
+ - [ ] corr
+ - [ ] count
+ - [ ] count_if
+ - [ ] count_min_sketch
+ - [ ] covar_pop
+ - [ ] covar_samp
+ - [ ] every
+ - [ ] first
+ - [ ] first_value
+ - [ ] grouping
+ - [ ] grouping_id
+ - [ ] histogram_numeric
+ - [ ] kurtosis
+ - [ ] last
+ - [ ] last_value
+ - [ ] max
+ - [ ] max_by
+ - [ ] mean
+ - [ ] median
+ - [ ] min
+ - [ ] min_by
+ - [ ] mode
+ - [ ] percentile
+ - [ ] percentile_approx
+ - [ ] regr_avgx
+ - [ ] regr_avgy
+ - [ ] regr_count
+ - [ ] regr_intercept
+ - [ ] regr_r2
+ - [ ] regr_slope
+ - [ ] regr_sxx
+ - [ ] regr_sxy
+ - [ ] regr_syy
+ - [ ] skewness
+ - [ ] some
+ - [ ] std
+ - [ ] stddev
+ - [ ] stddev_pop
+ - [ ] stddev_samp

Review Comment:
   good point






[I] Implement a benchmark for extracting arrow statistics from parquet [datafusion]

2024-05-21 Thread via GitHub


alamb opened a new issue, #10606:
URL: https://github.com/apache/datafusion/issues/10606

   ### Is your feature request related to a problem or challenge?
   
   Part of https://github.com/apache/datafusion/issues/10453, which aims to 
"efficiently" extract statistics from parquet files. The current code is likely 
quite inefficient, but we have no benchmarks that actually show that or allow 
us to measure improvements.
   
   ### Describe the solution you'd like
   
   I would like to create a benchmark, run like `cargo bench ...`, that does:
   1. Creates a parquet file that has 100 row groups (once, outside the benchmark loop)
   2. Reads the parquet metadata out
   3. In the loop, extracts the statistics from the metadata into Arrow 
statistics (using the API introduced in 
https://github.com/apache/datafusion/pull/10537); note it should not include 
reading the metadata from the parquet file
   
   
   I suggest four benchmarks with different types:
   1. UInt64
   2. f64
   3. String
   4. Dictionary(Int32, String) 
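   The loop structure described above can be sketched with a std-only harness. Note `mock_metadata` and `extract_statistics` here are hypothetical stand-ins for the real parquet metadata decoding and the extraction API from #10537; a real benchmark would use criterion via `cargo bench`:

```rust
use std::time::Instant;

// Hypothetical stand-in for decoded parquet metadata: one (min, max) pair
// per row group. The real benchmark would read this from a parquet file once.
fn mock_metadata(row_groups: u64) -> Vec<(u64, u64)> {
    (0..row_groups).map(|i| (i * 100, i * 100 + 99)).collect()
}

// Hypothetical stand-in for the extraction step under test: fold the
// per-row-group stats into file-level min/max statistics.
fn extract_statistics(meta: &[(u64, u64)]) -> (u64, u64) {
    meta.iter()
        .fold((u64::MAX, u64::MIN), |(lo, hi), &(mn, mx)| (lo.min(mn), hi.max(mx)))
}

fn main() {
    // Steps 1 + 2: build the file and read its metadata once, outside the loop.
    let meta = mock_metadata(100);
    // Step 3: only the extraction step is timed.
    let start = Instant::now();
    let stats = (0..10_000)
        .map(|_| extract_statistics(&meta))
        .last()
        .unwrap();
    println!("min={} max={} elapsed={:?}", stats.0, stats.1, start.elapsed());
}
```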
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_





Re: [PR] Support Substrait's VirtualTables [datafusion]

2024-05-21 Thread via GitHub


Blizzara commented on code in PR #10531:
URL: https://github.com/apache/datafusion/pull/10531#discussion_r1608623644


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -1277,7 +1404,84 @@ pub(crate) fn from_substrait_literal(lit: &Literal) -> Result<ScalarValue> {
 s,
 )
 }
-Some(LiteralType::Null(ntype)) => from_substrait_null(ntype)?,
+Some(LiteralType::List(l)) => {
+let element_dt: Option<&DataType> = match dt {
+Some(DataType::List(l)) => Ok(Some(l.data_type())),

Review Comment:
   Yeah, we already do that (look at the first element) if `dt` is not given. I 
think the reason I included `dt` was because Substrait types don't include the 
field names for structs, so not using `dt` would mean using something like 
`names` and `name_idx` instead (like in from_substrait_type_with_names).






Re: [PR] Support Substrait's VirtualTables [datafusion]

2024-05-21 Thread via GitHub


Blizzara commented on code in PR #10531:
URL: https://github.com/apache/datafusion/pull/10531#discussion_r1608622303


##
datafusion/substrait/src/logical_plan/consumer.rs:
##
@@ -505,7 +507,35 @@ pub async fn from_substrait_rel(
 _ => Ok(t),
 }
 }
-_ => not_impl_err!("Only NamedTable reads are supported"),
+Some(ReadType::VirtualTable(vt)) => {
+let base_schema = read
+.base_schema
+.clone()
+.expect("expected schema to exist for virtual table");
+
+let fields = from_substrait_named_struct(&base_schema)?;
+let schema = DFSchemaRef::new(DFSchema::try_from(Schema::new(fields))?);

Review Comment:
   It probably could, except for the column names - LogicalPlanBuilder::values 
names everything to just `column1` etc. Do you see some benefit to using that 
(and then handling the column names somehow) instead of passing in the schema 
explicitly?





