date:20240616

Re: [I] Add example for writing an `AnalyzerRule` [datafusion]

2024-06-16 Thread via GitHub

goldmedal commented on issue #10855:
URL: https://github.com/apache/datafusion/issues/10855#issuecomment-2171124147

Hi @alamb, I just want to share how I use user-defined AnalyzerRule. As you
know, I'm working on reimplementing the semantic layer engine,
[wren-engine](https://github.com/Canner/wren-engine), for LLM using DataFusion.
The project is still a work in progress. However, I think I have finished the
part of integration with DataFusion. I think it could be a nice use case for
`AnalyzerRule`.

The basic concept of wren-engine is that user can define a virtual modeling
layer to apply on their physical data.
For example, given a csv file registered in DataFusion context called,
`datafusion.public.customers`. We can wrap a virtual table for it called,
`wrenai.default.customers_model`.
User can query it like
```sql
select * from wrenai.default.customers_model
```
Then , wren engine translate it to a physical query:
```sql
SELECT
"customers_model"."city",
"customers_model"."id",
"customers_model"."state"
FROM
(
SELECT
"customers_model"."city",
"customers_model"."id",
"customers_model"."state"
FROM
(
SELECT
"datafusion"."public"."customers"."city" AS "city",
"datafusion"."public"."customers"."id" AS "id",
"datafusion"."public"."customers"."state" AS "state"
FROM
"public"."customers"
) AS "customers_model"
) AS "customers_model"
```
It's a simple use case to show how it works. We have other features, such as
`relationship` between models or calculated fields in the model. I don't go
into detail about them here. However, I implemented all of them through
`AnalyzerRule` and `UserDefinedLogicalNode`.

The related PR, https://github.com/Canner/wren-engine/pull/613, is still
under review. However, you can find the usage of `AnalyzerRule` in
[rule.rs](https://github.com/goldmedal/wren-engine/blob/bugfix/fix-to-one-relation-refactor-chain/wren-modeling-rs/core/src/logical_plan/analyze/rule.rs)
and the plan node in
[plan.rs](https://github.com/goldmedal/wren-engine/blob/bugfix/fix-to-one-relation-refactor-chain/wren-modeling-rs/core/src/logical_plan/analyze/plan.rs).
I also added [some
examples](https://github.com/goldmedal/wren-engine/tree/bugfix/fix-to-one-relation-refactor-chain/wren-modeling-rs/wren-example)
to show how to use the API or integrate with the DataFusion query flow.

Thanks to DataFusion for providing such amazing features, allowing me to
implement a more stable and structured converter for the semantic layer.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



goldmedal commented on code in PR #10917:
URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641668286


##
datafusion/proto/src/physical_plan/mod.rs:
##
@@ -496,11 +496,14 @@ impl AsExecutionPlan for protobuf::PhysicalPlanNode {
 }
 
AggregateFunction::UserDefinedAggrFunction(udaf_name) => {
 let agg_udf = 
registry.udaf(udaf_name)?;
+// TODO: 'logical_exprs' is not 
supported for UDAF yet.
+// approx_percentile_cont and 
approx_percentile_cont_weight are not supported for UDAF from protobuf yet.
+let logical_exprs = &[];

Review Comment:
   I agree. I thought Flight SQL might be a use case for `from_proto`. However, 
I tried to query `approx_percentile_cont` through Flight SQL, and it worked 
well. Maybe we can just ignore it and leave this TODO comment until we find any 
`from_proto` use case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Remove `prefer_hash_join` env variable for clickbench [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 opened a new pull request, #10933:
URL: https://github.com/apache/datafusion/pull/10933

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   clickbench does not expect `--prefer_hash_join`, remove it. It is introduce 
in #10809 
   
   ```
   ***
   DataFusion Benchmark Script
   COMMAND: run
   BENCHMARK: clickbench_partitioned
   DATAFUSION_DIR: /Users/bytedance/arrow-datafusion/benchmarks/..
   BRANCH_NAME: rm-hash-join-env-clickbench
   DATA_DIR: /Users/bytedance/arrow-datafusion/benchmarks/data
   RESULTS_DIR: 
/Users/bytedance/arrow-datafusion/benchmarks/results/rm-hash-join-env-clickbench
   CARGO_COMMAND: cargo run --profile release-nonlto
   PREFER_HASH_JOIN: true
   ***
   RESULTS_FILE: 
/Users/bytedance/arrow-datafusion/benchmarks/results/rm-hash-join-env-clickbench/clickbench_partitioned.json
   Running clickbench (partitioned, 100 files) benchmark...
   Finished `release-nonlto` profile [optimized] target(s) in 0.68s
Running 
`/Users/bytedance/arrow-datafusion/target/release-nonlto/dfbench clickbench 
--iterations 5 --path 
/Users/bytedance/arrow-datafusion/benchmarks/data/hits_partitioned 
--prefer_hash_join true --queries-path 
/Users/bytedance/arrow-datafusion/benchmarks/queries/clickbench/queries.sql -o 
/Users/bytedance/arrow-datafusion/benchmarks/results/rm-hash-join-env-clickbench/clickbench_partitioned.json`
   error: Found argument '--prefer_hash_join' which wasn't expected, or isn't 
valid in this context
   
   USAGE:
   dfbench clickbench --iterations  --path 
   
   For more information try --help
   
   ```
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] feat: Support CartesianProductExec in comet [datafusion-comet]

2024-06-16 Thread via GitHub



leoluan2009 commented on PR #442:
URL: https://github.com/apache/datafusion-comet/pull/442#issuecomment-2171428443

   > @leoluan2009 Would you like to update the plan stability results? Thanks.
   
   Hi @viirya  sorry for delay reply, I think this pr do not change the query 
plan for tpcds?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] `StatisticsConverter::row_group_null_counts` incorrect for missing column [datafusion]

2024-06-16 Thread via GitHub



alamb commented on issue #10926:
URL: https://github.com/apache/datafusion/issues/10926#issuecomment-2171460701

   
   
   > Should we also change the signature and return Result?
   
   Yes I think so
   
   > Should we also change the signature and return Result?
   
   I actually think using `new_null_array` is wrong here. It returns the wrong 
type -- for example if we are getting row counts for a `Utf8` column, using 
new_null_arrray will return a `StringArray` rather than a `UInt64Array`. 
   
   I actually hit this issue when working on 
https://github.com/apache/datafusion/pull/10924 
   
   And I had to change it to 
   
   ```rust
   let Some(parquet_index) = self.parquet_index else {
   let num_row_groups = metadatas.into_iter().count();
   return Ok(Arc::new(UInt64Array::from_iter(
   std::iter::repeat(None).take(num_row_groups),
   )));
   };
   ```
   
   To make a test pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Minor: Improve `arrow_statistics` tests [datafusion]

2024-06-16 Thread via GitHub



alamb merged PR #10927:
URL: https://github.com/apache/datafusion/pull/10927


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Minor: Improve `arrow_statistics` tests [datafusion]

2024-06-16 Thread via GitHub



alamb commented on PR #10927:
URL: https://github.com/apache/datafusion/pull/10927#issuecomment-2171460836

   Thanks everyone for the reviews


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Minor: Remove `prefer_hash_join` env variable for clickbench [datafusion]

2024-06-16 Thread via GitHub



alamb merged PR #10933:
URL: https://github.com/apache/datafusion/pull/10933


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Add a benchmark for extracting data page statistics [datafusion]

2024-06-16 Thread via GitHub



alamb opened a new issue, #10934:
URL: https://github.com/apache/datafusion/issues/10934

   ### Is your feature request related to a problem or challenge?
   
   As we work to make extracting statistics from parquet data pages more 
correct and performant in  https://github.com/apache/datafusion/issues/10922 
one thing that would be good is to have benchmark overage
   
   
   
   ### Describe the solution you'd like
   
   
   
   Add a benchmark for extracting page statistics 
   
   ### Describe alternatives you've considered
   
   Add a benchmark 
([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L152))
 for extracting data page statistics
   
   These are run via
   
   ```shell
   cargo bench --bench parquet_statistic
   ```
   
   
   In order to create a reasonable number of data page staistics, it would be 
good to configure the parquet writer to limit the sizez of data pages
   
   
https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L75
   
   
   And use 
https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit
 to set the the limit to 1 and then send the data in row by row as we did in 
the test:
   
   
https://github.com/apache/datafusion/blob/d175163ef6442056d8210de9b0e28e264c39ca2c/datafusion/core/tests/parquet/arrow_statistics.rs#L105-L130
   
   ### Additional context
   
   The need for a benchmark also came up in 
https://github.com/apache/datafusion/pull/10932


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: Improve performance of Parquet statistics conversion [datafusion]

2024-06-16 Thread via GitHub



alamb commented on PR #10932:
URL: https://github.com/apache/datafusion/pull/10932#issuecomment-2171467776

   cc @xinlifoobar and @marvinlanhenke 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



goldmedal commented on code in PR #10917:
URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641808518


##
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##
@@ -15,19 +15,254 @@
 // specific language governing permissions and limitations
 // under the License.
 
+use std::any::Any;
+use std::fmt::{Debug, Formatter};
+
 use arrow::{
 array::{
 ArrayRef, Float32Array, Float64Array, Int16Array, Int32Array, 
Int64Array,
 Int8Array, UInt16Array, UInt32Array, UInt64Array, UInt8Array,
 },
 datatypes::DataType,
 };
+use arrow_schema::Field;
 
-use datafusion_common::{downcast_value, internal_err, DataFusionError, 
ScalarValue};
-use datafusion_expr::Accumulator;
+use datafusion_common::{
+downcast_value, internal_err, not_impl_err, plan_err, DataFusionError, 
ScalarValue,
+};
+use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs};
+use datafusion_expr::type_coercion::aggregates::{INTEGERS, NUMERICS};
+use datafusion_expr::utils::format_state_name;
+use datafusion_expr::{
+Accumulator, AggregateUDFImpl, Expr, Signature, TypeSignature, Volatility,
+};
 use datafusion_physical_expr_common::aggregate::tdigest::{
 TDigest, TryIntoF64, DEFAULT_MAX_SIZE,
 };
+use datafusion_physical_expr_common::aggregate::utils::down_cast_any_ref;
+
+make_udaf_expr_and_func!(
+ApproxPercentileCont,
+approx_percentile_cont,
+expression percentile,
+"Computes the approximate percentile continuous of a set of numbers",
+approx_percentile_cont_udaf
+);
+
+pub struct ApproxPercentileCont {
+signature: Signature,
+}
+
+impl Debug for ApproxPercentileCont {
+fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+f.debug_struct("ApproxPercentileCont")
+.field("name", &self.name())
+.field("signature", &self.signature)
+.finish()
+}
+}
+
+impl Default for ApproxPercentileCont {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl ApproxPercentileCont {
+/// Create a new [`ApproxPercentileCont`] aggregate function.
+pub fn new() -> Self {
+let mut variants = Vec::with_capacity(NUMERICS.len() * (INTEGERS.len() 
+ 1));
+// Accept any numeric value paired with a float64 percentile
+for num in NUMERICS {
+variants.push(TypeSignature::Exact(vec![num.clone(), 
DataType::Float64]));
+// Additionally accept an integer number of centroids for T-Digest
+for int in INTEGERS {
+variants.push(TypeSignature::Exact(vec![
+num.clone(),
+DataType::Float64,
+int.clone(),
+]))
+}
+}
+Self {
+signature: Signature::one_of(variants, Volatility::Immutable),

Review Comment:
   I found we can't use `Signature::numeric`. I tried to use it, but I 
discovered that all the input arguments will have the same type as the first 
argument. For example, in a SQL query like:
   ```
   select approx_percentile_cont(float64_col, 0.5, 100) from t
   ```
   the input arguments will be `vec![Float64, Float64, Float64]`. However, the 
last argument should be `int64`. This will cause a failure.
   
   I'm not sure if it's related, but I found something weird. Why is 
`AccumulatorArgs#input_type` not a `vec` but a single datatype? If the number 
of arguments is greater than 1, how do we handle the other arguments?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



goldmedal commented on code in PR #10917:
URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641808587


##
datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs:
##
@@ -15,91 +15,139 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use crate::expressions::ApproxPercentileCont;
-use crate::{AggregateExpr, PhysicalExpr};
+use std::any::Any;
+use std::fmt::{Debug, Formatter};
+
 use arrow::{
 array::ArrayRef,
 datatypes::{DataType, Field},
 };
-use 
datafusion_functions_aggregate::approx_percentile_cont::ApproxPercentileAccumulator;
+
+use datafusion_common::ScalarValue;
+use datafusion_common::{not_impl_err, plan_err, Result};
+use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs};
+use datafusion_expr::type_coercion::aggregates::NUMERICS;
+use datafusion_expr::{
+Accumulator, AggregateUDFImpl, Signature, TypeSignature, Volatility,
+};
 use datafusion_physical_expr_common::aggregate::tdigest::{
 Centroid, TDigest, DEFAULT_MAX_SIZE,
 };
+use datafusion_physical_expr_common::physical_expr::down_cast_any_ref;
 
-use datafusion_common::Result;
-use datafusion_common::ScalarValue;
-use datafusion_expr::Accumulator;
+use crate::approx_percentile_cont::{ApproxPercentileAccumulator, 
ApproxPercentileCont};
 
-use crate::aggregate::utils::down_cast_any_ref;
-use std::{any::Any, sync::Arc};
+make_udaf_expr_and_func!(
+ApproxPercentileContWithWeight,
+approx_percentile_cont_with_weight,
+expression weight percentile,
+"Computes the approximate percentile continuous with weight of a set of 
numbers",
+approx_percentile_cont_with_weight_udaf
+);
 
 /// APPROX_PERCENTILE_CONT_WITH_WEIGTH aggregate expression
-#[derive(Debug)]
 pub struct ApproxPercentileContWithWeight {
+signature: Signature,
 approx_percentile_cont: ApproxPercentileCont,
-column_expr: Arc,
-weight_expr: Arc,
-percentile_expr: Arc,
+}
+
+impl Debug for ApproxPercentileContWithWeight {
+fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
+f.debug_struct("ApproxPercentileContWithWeight")
+.field("signature", &self.signature)
+.finish()
+}
+}
+
+impl Default for ApproxPercentileContWithWeight {
+fn default() -> Self {
+Self::new()
+}
 }
 
 impl ApproxPercentileContWithWeight {
 /// Create a new [`ApproxPercentileContWithWeight`] aggregate function.
-pub fn new(
-expr: Vec>,
-name: impl Into,
-return_type: DataType,
-) -> Result {
-// Arguments should be [ColumnExpr, WeightExpr, 
DesiredPercentileLiteral]
-debug_assert_eq!(expr.len(), 3);
-
-let sub_expr = vec![expr[0].clone(), expr[2].clone()];
-let approx_percentile_cont =
-ApproxPercentileCont::new(sub_expr, name, return_type)?;
-
-Ok(Self {
-approx_percentile_cont,
-column_expr: expr[0].clone(),
-weight_expr: expr[1].clone(),
-percentile_expr: expr[2].clone(),
-})
+pub fn new() -> Self {
+Self {
+signature: Signature::one_of(

Review Comment:
   Same as 
https://github.com/apache/datafusion/pull/10917#discussion_r1641808518



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] CSE shorthand alias [datafusion]

2024-06-16 Thread via GitHub



MohamedAbdeen21 commented on PR #10868:
URL: https://github.com/apache/datafusion/pull/10868#issuecomment-2171468852

   Hey @peter-toth sorry for the late reply.
   
   Looks good, and I like the Alias generator. But I have a couple of comments:
   
   1. Thought we agreed on the `#` prefix, is there a reason for choosing the 
`__cse_` prefix? DuckDB uses `#XXX`, SQLServer uses `ExprXXX`, Spark uses 
`_expr#XXX`, all of which look cleaner than `__cse_XXX`
   2. Tiny nit, but can you use `into_values` instead of using `into_iter` then 
ignoring the key?
   
   I don't mind either options, a PR against this one keeps the entire review 
history and can make reviews easier, but a new PR can be easier to  maintain as 
you'll be able to directly push commits.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



goldmedal commented on PR #10917:
URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171470071

   > > External error: statement is expected to fail with error:
   > >  (regex) DataFusion error: Error during planning: .* No function matches 
the given name and argument types 
   > I usually just remove the error message, although it is not an idea 
solution, but I don't know why the CI message is different from local too.
   
   I modified the expected pattern of error message to match the message thrown 
in CI flow. The full message is
   ```
   DataFusion error: Error during planning: Error during planning: 
Coercion from [Utf8, Int8, Float64] to the signature OneOf([Exact([Int8, Int8, 
Float64]), Exact([Int16, Int16, Float64]), Exact([Int32, Int32, Float64]), 
Exact([Int64, Int64, Float64]), Exact([UInt8, UInt8, Float64]), Exact([UInt16, 
UInt16, Float64]), Exact([UInt32, UInt32, Float64]), Exact([UInt64, UInt64, 
Float64]), Exact([Float32, Float32, Float64]), Exact([Float64, Float64, 
Float64])]) failed. No function matches the given name and argument types 
'approx_percentile_cont_with_weight(Utf8, Int8, Float64)'. You might need to 
add explicit type casts.
   Candidate functions:
   approx_percentile_cont_with_weight(Int8, Int8, Float64)
   approx_percentile_cont_with_weight(Int16, Int16, Float64)
   approx_percentile_cont_with_weight(Int32, Int32, Float64)
   approx_percentile_cont_with_weight(Int64, Int64, Float64)
   approx_percentile_cont_with_weight(UInt8, UInt8, Float64)
   approx_percentile_cont_with_weight(UInt16, UInt16, Float64)
   approx_percentile_cont_with_weight(UInt32, UInt32, Float64)
   approx_percentile_cont_with_weight(UInt64, UInt64, Float64)
   approx_percentile_cont_with_weight(Float32, Float32, Float64)
   approx_percentile_cont_with_weight(Float64, Float64, Float64)
   ```
   
   I noticed that the CI flow is missing the tail part of the error message: 
`You might need to add explicit type casts`. There might be a length 
limitation for the error message, but I'm not sure. Just making a note of it 
here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] `StatisticsConverter::row_group_null_counts` incorrect for missing column [datafusion]

2024-06-16 Thread via GitHub



marvinlanhenke commented on issue #10926:
URL: https://github.com/apache/datafusion/issues/10926#issuecomment-2171471803

   > I actually think using `new_null_array` is wrong here. It returns the 
wrong type
   
   I passed `&DataType::UInt64` into `new_null_array(...)`. However, your 
approach solves the issue with downcasting the result from new_null_array so it 
makes more sense here. Thank you.
   
   Just to confirm my understanding (regarding my second question) 
`row_group_row_counts` should also return a null_array when we reference a 
non-existing column correct?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Push down filters below `Unnest` in sub queries [datafusion]

2024-06-16 Thread via GitHub



ahirner opened a new issue, #10935:
URL: https://github.com/apache/datafusion/issues/10935

   ### Is your feature request related to a problem or challenge?
   
   Currently where clauses in a sub query are not pushed below `Unnest` 
expressions. In conjunction with some other limitations, it composing views 
with `unnest()` not ideal.
   
   In `0.39.0` cli:
   ```
   > CREATE TABLE IF NOT EXISTS v AS VALUES(1,[1,2,3]),(2,[3,4,5]);
   0 row(s) fetched.
   Elapsed 0.008 seconds.
   ```
   
   We get equal results with or w/o a subquery:
   ```
   > select unnest(column2) from v where column1=2;
   +---+
   | unnest(v.column2) |
   +---+
   | 3 |
   | 4 |
   | 5 |
   +---+
   3 row(s) fetched.
   Elapsed 0.011 seconds.
   ```
   
   ```
   > select "unnest(v.column2)" from (select unnest(column2), column1 from v) 
where column1=2;
   +---+
   | unnest(v.column2) |
   +---+
   | 3 |
   | 4 |
   | 5 |
   +---+
   3 row(s) fetched.
   Elapsed 0.015 seconds.
   ```
   
   `Filter` and `FilterExec` are above projections with `unnest`, instead of 
below.
   Ok:
   ```
   > explain select unnest(column2) from v where column1=2;
   
+---++
   | plan_type | plan   
|
   
+---++
   | logical_plan  | Unnest: lists[unnest(v.column2)] structs[] 
|
   |   |   Projection: v.column2 AS unnest(v.column2)   
|
   |   | Filter: v.column1 = Int64(2)   
|
   |   |   TableScan: v projection=[column1, column2]   
|
   | physical_plan | UnnestExec 
|
   |   |   RepartitionExec: partitioning=RoundRobinBatch(8), 
input_partitions=1 |
   |   | ProjectionExec: expr=[column2@1 as unnest(v.column2)]  
|
   |   |   CoalesceBatchesExec: target_batch_size=8192  
|
   |   | FilterExec: column1@0 = 2  
|
   |   |   MemoryExec: partitions=1, partition_sizes=[1]
|
   |   |
|
   
+---++
   ```
   
   Problematic:
   ```
   > explain select "unnest(v.column2)" from (select unnest(column2), column1 
from v) where column1=2;
   
+---+---+
   | plan_type | plan   
   |
   
+---+---+
   | logical_plan  | Projection: unnest(v.column2)  
   |
   |   |   Filter: v.column1 = Int64(2) 
   |
   |   | Unnest: lists[unnest(v.column2)] structs[] 
   |
   |   |   Projection: v.column2 AS unnest(v.column2), 
v.column1   |
   |   | TableScan: v projection=[column1, column2] 
   |
   | physical_plan | ProjectionExec: expr=[unnest(v.column2)@0 as 
unnest(v.column2)]   |
   |   |   CoalesceBatchesExec: target_batch_size=8192  
   |
   |   | FilterExec: column1@1 = 2  
   |
   |   |   RepartitionExec: partitioning=RoundRobinBatch(8), 
input_partitions=1|
   |   | UnnestExec 
   |
   |   |   ProjectionExec: expr=[column2@1 as 
unnest(v.column2), column1@0 as column1] |
   |   | MemoryExec: partitions=1, partition_sizes=[1]  
   |
   |   |
   |
   
+---+---+
   ```
   
   
   ### Describe the solution you'd like
   
   `FilterExec` is pushed down to tables that support them.
   
   ### Describe alternatives you've considered
   
   A workaround is trying to not use subqueries. However, t

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 commented on code in PR #10917:
URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641813923


##
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##
@@ -15,19 +15,254 @@
 // specific language governing permissions and limitations
 // under the License.
 
+use std::any::Any;
+use std::fmt::{Debug, Formatter};
+
 use arrow::{
 array::{
 ArrayRef, Float32Array, Float64Array, Int16Array, Int32Array, 
Int64Array,
 Int8Array, UInt16Array, UInt32Array, UInt64Array, UInt8Array,
 },
 datatypes::DataType,
 };
+use arrow_schema::Field;
 
-use datafusion_common::{downcast_value, internal_err, DataFusionError, 
ScalarValue};
-use datafusion_expr::Accumulator;
+use datafusion_common::{
+downcast_value, internal_err, not_impl_err, plan_err, DataFusionError, 
ScalarValue,
+};
+use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs};
+use datafusion_expr::type_coercion::aggregates::{INTEGERS, NUMERICS};
+use datafusion_expr::utils::format_state_name;
+use datafusion_expr::{
+Accumulator, AggregateUDFImpl, Expr, Signature, TypeSignature, Volatility,
+};
 use datafusion_physical_expr_common::aggregate::tdigest::{
 TDigest, TryIntoF64, DEFAULT_MAX_SIZE,
 };
+use datafusion_physical_expr_common::aggregate::utils::down_cast_any_ref;
+
+make_udaf_expr_and_func!(
+ApproxPercentileCont,
+approx_percentile_cont,
+expression percentile,
+"Computes the approximate percentile continuous of a set of numbers",
+approx_percentile_cont_udaf
+);
+
+pub struct ApproxPercentileCont {
+signature: Signature,
+}
+
+impl Debug for ApproxPercentileCont {
+fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+f.debug_struct("ApproxPercentileCont")
+.field("name", &self.name())
+.field("signature", &self.signature)
+.finish()
+}
+}
+
+impl Default for ApproxPercentileCont {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl ApproxPercentileCont {
+/// Create a new [`ApproxPercentileCont`] aggregate function.
+pub fn new() -> Self {
+let mut variants = Vec::with_capacity(NUMERICS.len() * (INTEGERS.len() 
+ 1));
+// Accept any numeric value paired with a float64 percentile
+for num in NUMERICS {
+variants.push(TypeSignature::Exact(vec![num.clone(), 
DataType::Float64]));
+// Additionally accept an integer number of centroids for T-Digest
+for int in INTEGERS {
+variants.push(TypeSignature::Exact(vec![
+num.clone(),
+DataType::Float64,
+int.clone(),
+]))
+}
+}
+Self {
+signature: Signature::one_of(variants, Volatility::Immutable),

Review Comment:
   It is single type because we have not yet meet any function that takes 
multiple input to build accumulator. If there is, then we might need to extend 
it to Vec



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 commented on code in PR #10917:
URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641818096


##
datafusion/functions-aggregate/src/approx_percentile_cont.rs:
##
@@ -15,19 +15,254 @@
 // specific language governing permissions and limitations
 // under the License.
 
+use std::any::Any;
+use std::fmt::{Debug, Formatter};
+
 use arrow::{
 array::{
 ArrayRef, Float32Array, Float64Array, Int16Array, Int32Array, 
Int64Array,
 Int8Array, UInt16Array, UInt32Array, UInt64Array, UInt8Array,
 },
 datatypes::DataType,
 };
+use arrow_schema::Field;
 
-use datafusion_common::{downcast_value, internal_err, DataFusionError, 
ScalarValue};
-use datafusion_expr::Accumulator;
+use datafusion_common::{
+downcast_value, internal_err, not_impl_err, plan_err, DataFusionError, 
ScalarValue,
+};
+use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs};
+use datafusion_expr::type_coercion::aggregates::{INTEGERS, NUMERICS};
+use datafusion_expr::utils::format_state_name;
+use datafusion_expr::{
+Accumulator, AggregateUDFImpl, Expr, Signature, TypeSignature, Volatility,
+};
 use datafusion_physical_expr_common::aggregate::tdigest::{
 TDigest, TryIntoF64, DEFAULT_MAX_SIZE,
 };
+use datafusion_physical_expr_common::aggregate::utils::down_cast_any_ref;
+
+make_udaf_expr_and_func!(
+ApproxPercentileCont,
+approx_percentile_cont,
+expression percentile,
+"Computes the approximate percentile continuous of a set of numbers",
+approx_percentile_cont_udaf
+);
+
+pub struct ApproxPercentileCont {
+signature: Signature,
+}
+
+impl Debug for ApproxPercentileCont {
+fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+f.debug_struct("ApproxPercentileCont")
+.field("name", &self.name())
+.field("signature", &self.signature)
+.finish()
+}
+}
+
+impl Default for ApproxPercentileCont {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl ApproxPercentileCont {
+/// Create a new [`ApproxPercentileCont`] aggregate function.
+pub fn new() -> Self {
+let mut variants = Vec::with_capacity(NUMERICS.len() * (INTEGERS.len() 
+ 1));
+// Accept any numeric value paired with a float64 percentile
+for num in NUMERICS {
+variants.push(TypeSignature::Exact(vec![num.clone(), 
DataType::Float64]));
+// Additionally accept an integer number of centroids for T-Digest
+for int in INTEGERS {
+variants.push(TypeSignature::Exact(vec![
+num.clone(),
+DataType::Float64,
+int.clone(),
+]))
+}
+}
+Self {
+signature: Signature::one_of(variants, Volatility::Immutable),

Review Comment:
   > the input arguments will be vec![Float64, Float64, Float64]
   
   Oh, yes. It will coerce types to the unify one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Push down filters below `Unnest` in sub queries [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 commented on issue #10935:
URL: https://github.com/apache/datafusion/issues/10935#issuecomment-2171495707

   For this kind of query `select unnest(unnest([[1],[1,2],[1,2,3]]));`, I 
think we can follow what Duckdb is.
   
   Unnest with recursive, and max_depth
   
   `select unnest([[1],[1,2],[1,2,3]], recursive:= true);`
   or `select unnest([[1],[1,2],[1,2,3]], max_depth := 2);`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] chore: Improve performance of Parquet statistics conversion [datafusion]

2024-06-16 Thread via GitHub



marvinlanhenke commented on PR #10932:
URL: https://github.com/apache/datafusion/pull/10932#issuecomment-2171497286

   > I think this is a nice code impeovement ❤️
   
   It is indeed. Thanks @Weijun-H really nice refactor.
   
   > I am not sure this will improve the performance
   
   I'll guess in terms of performance it won't make any difference. 
   However, it is more readable and thats always great, so thanks again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] x NOT IN y works but NOT (x IN y) doesn't [datafusion]

2024-06-16 Thread via GitHub



akoshchiy commented on issue #10830:
URL: https://github.com/apache/datafusion/issues/10830#issuecomment-2171512487

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] fix: rewriting NOT IN () to anti join [datafusion]

2024-06-16 Thread via GitHub



akoshchiy opened a new pull request, #10936:
URL: https://github.com/apache/datafusion/pull/10936

   ## Which issue does this PR close?
   
   
   
   Closes #10830 .
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   Yes.
   
   
   ## Are there any user-facing changes?
   No.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Improve performance for grouping by variable length columns (strings) [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 commented on issue #9403:
URL: https://github.com/apache/datafusion/issues/9403#issuecomment-2171579091

   I try to an approach that take multiple columns into consideration but found 
that the time spend on `insert_accounted` largely increase.
   
   ```rust
   self.map.insert_accounted(
   new_header,
   |header| header.hash,
   &mut self.map_size,
   );
   ```
   
   The hash entry includes vector like 
   ```rust
   struct Entry
   where
   O: OffsetSizeTrait,
   {
   /// hash of the value (stored to avoid recomputing it in hash table 
check)
   hash: u64,
   /// if len =< [`SHORT_VALUE_LEN`]: the data inlined
   /// if len > [`SHORT_VALUE_LEN`], the offset of where the data starts
   offset_or_inline: Vec,
   /// length of the value, in bytes (use O here so we use only i32 for
   /// strings, rather 64 bit usize)
   len: Vec>,
   group_id: usize,
   }
   ```
   
   It seems that we need avoid adding `Vec` into hash entry. It seems 
`ArrowBytesMap`'s idea couldn't  help much :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] [DRAFT] Multiple group by optimization [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 opened a new pull request, #10937:
URL: https://github.com/apache/datafusion/pull/10937

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub



vaibhawvipul commented on PR #10930:
URL: https://github.com/apache/datafusion/pull/10930#issuecomment-2171678704

   looks nice, please let me know when it is ready for review. 
   
   Note, can you also run the relevant scala test? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Add example for writing SQL analysis using DataFusion structures [datafusion]

2024-06-16 Thread via GitHub



LorrensP-2158466 opened a new pull request, #10938:
URL: https://github.com/apache/datafusion/pull/10938

   ## Which issue does this PR close?
   
   
   Closes #10871 .
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   Added an example of an SQL analysis where we count joins in a SQL query.
   
   ## Are these changes tested?
   
   
   Ran it on some other queries and the join counts where right.
   
   ## Are there any user-facing changes?
   
   
   No
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



goldmedal commented on PR #10917:
URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171687704

   Thanks @jayzhan211 approved it. : )
   If no more comments, how about merge this PR first? I guess it may cause 
some conflicts for other UDAF-related PRs. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Move `DynamicFileCatalog` back to core [datafusion]

2024-06-16 Thread via GitHub



goldmedal commented on PR #10745:
URL: https://github.com/apache/datafusion/pull/10745#issuecomment-2171692026

   Hi @alamb, is there any update on this topic? Or maybe I can help by 
splitting out some traits into the new crate `datafusion-catalog` first?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 commented on PR #10917:
URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171694652

   > Thanks @jayzhan211 approved it. : ) If no more comments, how about merge 
this PR first? I guess it may cause some conflicts for other UDAF-related PRs.
   
   Sure!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 merged PR #10917:
URL: https://github.com/apache/datafusion/pull/10917


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Convert `ApproxPercentileCont` and ``ApproxPercentileContWithWeight` to UDAF [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 closed issue #10870: Convert `ApproxPercentileCont` and 
``ApproxPercentileContWithWeight` to UDAF
URL: https://github.com/apache/datafusion/issues/10870


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]

2024-06-16 Thread via GitHub



goldmedal commented on PR #10917:
URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171695849

   Thanks again @jayzhan211 :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] use ScalarValue::to_pyarrow to convert to python object [datafusion-python]

2024-06-16 Thread via GitHub



andygrove merged PR #731:
URL: https://github.com/apache/datafusion-python/pull/731


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Simpify `PyExpr::python_value` by using `ScalarValue::into_py` [datafusion-python]

2024-06-16 Thread via GitHub



andygrove closed issue #729: Simpify `PyExpr::python_value` by using 
`ScalarValue::into_py`
URL: https://github.com/apache/datafusion-python/issues/729


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Simpify `PyExpr::python_value` by using `ScalarValue::into_py` [datafusion-python]

2024-06-16 Thread via GitHub



andygrove closed issue #729: Simpify `PyExpr::python_value` by using 
`ScalarValue::into_py`
URL: https://github.com/apache/datafusion-python/issues/729


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] CSE shorthand alias [datafusion]

2024-06-16 Thread via GitHub



peter-toth commented on PR #10868:
URL: https://github.com/apache/datafusion/pull/10868#issuecomment-2171754117

   > Hey @peter-toth sorry for the late reply.
   > 
   > Looks good, and I like the Alias generator. But I have a couple of 
comments:
   > 
   > 1. Thought we agreed on the `#` prefix. Is there a reason for choosing the 
`__cse_` prefix? DuckDB uses `#XXX`, SQLServer uses `ExprXXX` (can't verify 
this atm), Spark uses `_expr#XXX`, all of which look cleaner than `__cse_XXX`
   > 2. Tiny nit, but can you use `into_values` instead of using `into_iter` 
then ignoring the key?
   > 
   > I don't mind either options, a PR against this one keeps the entire review 
history and can make reviews easier, but a new PR can be easier to maintain as 
you'll be able to directly push commits.
   
   No problem, I've opened https://github.com/apache/datafusion/pull/10939. Yes 
and actually I'm fine with that as well, but I found a few `AliasGenerator` 
usecases in the existing code and wanted to follow the conventions of 
https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/decorrelate_predicate_subquery.rs#L249
 and 
https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/scalar_subquery_to_join.rs#L245
 with the `__cse` prefix. I've also changed `into_iter()` to  `into_values ()` 
in 
https://github.com/apache/datafusion/pull/10939/commits/6b1d1e34abd4c9da4a550db99471c09f5add09ff.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Use shorter aliases in CSE [datafusion]

2024-06-16 Thread via GitHub



peter-toth commented on PR #10939:
URL: https://github.com/apache/datafusion/pull/10939#issuecomment-2171754406

   cc @MohamedAbdeen21, @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] CSE shorthand alias [datafusion]

2024-06-16 Thread via GitHub



MohamedAbdeen21 closed pull request #10868: CSE shorthand alias
URL: https://github.com/apache/datafusion/pull/10868


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] CSE shorthand alias [datafusion]

2024-06-16 Thread via GitHub



MohamedAbdeen21 commented on PR #10868:
URL: https://github.com/apache/datafusion/pull/10868#issuecomment-2171757651

   Became part of #10939.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Push down filters below `Unnest` in sub queries [datafusion]

2024-06-16 Thread via GitHub



duongcongtoai commented on issue #10935:
URL: https://github.com/apache/datafusion/issues/10935#issuecomment-2171768257

   I think this is related https://github.com/apache/datafusion/issues/10660
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] fix: rewriting NOT IN () to anti join [datafusion]

2024-06-16 Thread via GitHub



viirya commented on code in PR #10936:
URL: https://github.com/apache/datafusion/pull/10936#discussion_r1641921370


##
datafusion/optimizer/src/decorrelate_predicate_subquery.rs:
##
@@ -1099,6 +1126,29 @@ mod tests {
 Ok(())
 }
 
+#[test]
+fn wrapped_not_in_subquery() -> Result<()> {
+let table_scan = test_table_scan()?;
+let plan = LogicalPlanBuilder::from(table_scan)
+.filter(not(in_subquery(col("c"), 
test_subquery_with_name("sq")?)))?

Review Comment:
   Could you add another test for `not(not_in_subquery)`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Push down filters below `Unnest` in sub queries [datafusion]

2024-06-16 Thread via GitHub



ahirner commented on issue #10935:
URL: https://github.com/apache/datafusion/issues/10935#issuecomment-2171783792

   Thanks for the pointers @duongcongtoai
   I think the recursive and max_depth options can be a special workaround, but 
introduce a good amount of complexity. Fixing the push down makes single 
`unnest` composable in general.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub



dharanad commented on PR #10930:
URL: https://github.com/apache/datafusion/pull/10930#issuecomment-2171794774

   @jayzhan211  / @alamb can you please help me with a review here.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] refactor: dont include fallback in match on mem_pool_type [datafusion]

2024-06-16 Thread via GitHub



tshauck opened a new pull request, #10940:
URL: https://github.com/apache/datafusion/pull/10940

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   Using the wildcard pattern has resulted in some minor bugs in the past in 
DataFusion. So this PR makes a minor tweak in the CLI the `args.mem_pool_type` 
to explicitly handle all the enum variants and `None` case.
   
   ## What changes are included in this PR?
   
   Just update a match statement.
   
   ## Are these changes tested?
   
   Compiled the CLI and ran it to check, otherwise tests remain the same.
   
   ## Are there any user-facing changes?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] refactor: remove extra default in max rows [datafusion]

2024-06-16 Thread via GitHub



tshauck opened a new pull request, #10941:
URL: https://github.com/apache/datafusion/pull/10941

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   Currently `-h` in the datafusion CLI shows the default for max rows twice.
   
   ## What changes are included in this PR?
   
   This PR removes the value from the argument help, so the default is only 
shown once.
   
   ## Are these changes tested?
   
   Yes, compiled and looked at the help.
   
   ## Are there any user-facing changes?
   
   Minor, CLI help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 commented on code in PR #10930:
URL: https://github.com/apache/datafusion/pull/10930#discussion_r1642021123


##
datafusion/functions-aggregate/src/bit_and_or_xor.rs:
##
@@ -0,0 +1,400 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines `BitAnd`, `BitOr`, `BitXor` and `BitXor DISTINCT` aggregate 
accumulators
+
+use std::any::Any;
+use std::collections::HashSet;
+use std::fmt::{Display, Formatter};
+
+use ahash::RandomState;
+use arrow::array::{Array, ArrayRef, AsArray};
+use arrow::datatypes::{
+ArrowNativeType, ArrowNumericType, DataType, Int16Type, Int32Type, 
Int64Type,
+Int8Type, UInt16Type, UInt32Type, UInt64Type, UInt8Type,
+};
+use arrow_schema::Field;
+
+use datafusion_common::cast::as_list_array;
+use datafusion_common::{exec_err, not_impl_err, Result, ScalarValue};
+use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs};
+use datafusion_expr::type_coercion::aggregates::INTEGERS;
+use datafusion_expr::utils::format_state_name;
+use datafusion_expr::{Accumulator, AggregateUDFImpl, Signature, Volatility};
+
+/// `accumulator_helper` is a macro accepting (ArrowPrimitiveType, 
BitwiseOperationType)
+macro_rules! accumulator_helper {
+($t:ty, $opr:expr) => {
+match $opr {
+BitwiseOperationType::And => 
Ok(Box::>::default()),
+BitwiseOperationType::Or => 
Ok(Box::>::default()),
+BitwiseOperationType::Xor => 
Ok(Box::>::default()),
+BitwiseOperationType::XorDistinct => {
+Ok(Box::>::default())
+}
+}
+};
+}
+
+/// AND, OR and XOR only supports a subset of numeric types
+///
+/// `args` is [AccumulatorArgs]
+/// `opr` is [BitwiseOperationType]
+macro_rules! downcast_bitwise_accumulator {
+($args:ident, $opr:expr) => {
+match $args.data_type {
+DataType::Int8 => accumulator_helper!(Int8Type, $opr),
+DataType::Int16 => accumulator_helper!(Int16Type, $opr),
+DataType::Int32 => accumulator_helper!(Int32Type, $opr),
+DataType::Int64 => accumulator_helper!(Int64Type, $opr),
+DataType::UInt8 => accumulator_helper!(UInt8Type, $opr),
+DataType::UInt16 => accumulator_helper!(UInt16Type, $opr),
+DataType::UInt32 => accumulator_helper!(UInt32Type, $opr),
+DataType::UInt64 => accumulator_helper!(UInt64Type, $opr),
+_ => {
+not_impl_err!(
+"{} not supported for {}: {}",
+stringify!($opr),
+$args.name,
+$args.data_type
+)
+}
+}
+};
+}
+
+/// Simplifies the creation of User-Defined Aggregate Functions (UDAFs) for 
performing bitwise operations in a declarative manner.
+///
+/// `EXPR_FN` identifier used to name the generated expression function.
+/// `AGGREGATE_UDF_FN` is an identifier used to name the underlying UDAF 
function.
+/// `OPR_TYPE` is an expression that evaluates to the type of bitwise 
operation to be performed.
+macro_rules! make_bitwise_udaf_expr_and_func {
+($EXPR_FN:ident, $AGGREGATE_UDF_FN:ident, $OPR_TYPE:expr) => {
+make_udaf_expr!($EXPR_FN, expr_y expr_x, concat!("Returns the 
bitwise", stringify!($OPR_TYPE), "of a group of values"), $AGGREGATE_UDF_FN);
+create_func!($EXPR_FN, $AGGREGATE_UDF_FN, 
BitwiseOperation::new($OPR_TYPE, stringify!($EXPR_FN)));
+}
+}
+
+make_bitwise_udaf_expr_and_func!(bit_and, bit_and_udaf, 
BitwiseOperationType::And);
+make_bitwise_udaf_expr_and_func!(bit_or, bit_or_udaf, 
BitwiseOperationType::Or);
+make_bitwise_udaf_expr_and_func!(bit_xor, bit_xor_udaf, 
BitwiseOperationType::Xor);
+
+/// The different types of bitwise operations that can be performed.
+#[derive(Debug, Clone, Eq, PartialEq)]
+enum BitwiseOperationType {
+And,
+Or,
+Xor,
+/// `XorDistinct` is a variation of the bitwise XOR operation specifically 
for the scenario of BitXor DISTINCT
+XorDistinct,
+}
+
+impl Display for BitwiseOperationType {
+fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
+write!(f, "{:?}", self)
+}
+}
+
+/// [BitwiseOperation] struct encapsulates informa

[I] Convert `Avg` to UDAF [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 opened a new issue, #10942:
URL: https://github.com/apache/datafusion/issues/10942

   ### Is your feature request related to a problem or challenge?
   
   Similar to others issue in #8708 
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Convert `Min` and `Max` to UDAF [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 opened a new issue, #10943:
URL: https://github.com/apache/datafusion/issues/10943

   ### Is your feature request related to a problem or challenge?
   
   Similar to other issues in #8708 
   
   I think the change of this might be large, could split it in several PR.
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[I] Convert `Hyperloglog` to UDAF [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 opened a new issue, #10944:
URL: https://github.com/apache/datafusion/issues/10944

   ### Is your feature request related to a problem or challenge?
   
   Similar to other issues in #8708 
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] refactor: remove extra default in max rows [datafusion]

2024-06-16 Thread via GitHub



jonahgao merged PR #10941:
URL: https://github.com/apache/datafusion/pull/10941


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] feat: RewriteCycle API for short-circuiting optimizer loops [datafusion]

2024-06-16 Thread via GitHub



jayzhan211 commented on code in PR #10386:
URL: https://github.com/apache/datafusion/pull/10386#discussion_r1642086691


##
datafusion/optimizer/src/rewrite_cycle.rs:
##
@@ -0,0 +1,262 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+/// An API for executing a sequence of [TreeNodeRewriter]s in multiple passes.
+///
+/// See [RewriteCycle] for more information.
+///
+use std::ops::ControlFlow;
+
+use datafusion_common::{
+tree_node::{Transformed, TreeNode, TreeNodeRewriter},
+Result,
+};
+
+/// A builder with methods for executing a "rewrite cycle".
+///
+/// Often the results of one optimization rule can uncover more optimizations 
in other optimization
+/// rules. A sequence of optimization rules can be ran in multiple "passes" 
until there are no
+/// more optmizations to make.
+///
+/// The [RewriteCycle] handles logic for running these multi-pass loops.
+/// It applies a sequence of [TreeNodeRewriter]s to a [TreeNode] by calling
+/// [TreeNode::rewrite] in a loop - passing the output of one rewrite as the 
input to the next
+/// rewrite - until [RewriteCycle::max_cycles] is reached or until every 
[TreeNode::rewrite]
+/// returns a [Transformed::no] result in a consecutive sequence.
+#[derive(Debug)]
+pub struct RewriteCycle {
+max_cycles: usize,
+}
+
+impl Default for RewriteCycle {
+fn default() -> Self {
+Self::new()
+}
+}
+
+impl RewriteCycle {
+/// The default maximum number of completed cycles to run before 
terminating the rewrite loop.
+/// You can override this default with [Self::with_max_cycles]
+pub const DEFAULT_MAX_CYCLES: usize = 3;
+
+/// Creates a new [RewriteCycle] with default options.
+pub fn new() -> Self {
+Self {
+max_cycles: Self::DEFAULT_MAX_CYCLES,
+}
+}
+/// Sets the [Self::max_cycles] to run before terminating the rewrite loop.
+pub fn with_max_cycles(mut self, max_cycles: usize) -> Self {
+self.max_cycles = max_cycles;
+self
+}
+
+/// The maximum number of completed cycles to run before terminating the 
rewrite loop.
+/// Defaults to [Self::DEFAULT_MAX_CYCLES].
+pub fn max_cycles(&self) -> usize {
+self.max_cycles
+}
+
+/// Runs a rewrite cycle on the given [TreeNode] using the given callback 
function to
+/// explicitly handle the cycle iterations.
+///
+/// The callback function is given a [RewriteCycleState], which manages 
the short-circuiting
+/// logic of the loop. The function is expected to call 
[RewriteCycleState::rewrite] for each
+/// individual [TreeNodeRewriter] in the cycle. 
[RewriteCycleState::rewrite] returns a [RewriteCycleControlFlow]
+/// result, indicating whether the loop should break or continue.
+///
+/// ```rust
+///
+/// use arrow::datatypes::{Schema, Field, DataType};
+/// use datafusion_expr::{col, lit};
+/// use datafusion_common::{DataFusionError, ToDFSchema};
+/// use datafusion_expr::execution_props::ExecutionProps;
+/// use datafusion_expr::simplify::SimplifyContext;
+/// use datafusion_optimizer::rewrite_cycle::RewriteCycle;
+/// use datafusion_optimizer::simplify_expressions::{Simplifier, 
ConstEvaluator};
+///
+/// // Create the schema
+/// let schema = Schema::new(vec![
+/// Field::new("c1", DataType::Int64, true),
+/// Field::new("c2", DataType::UInt32, true),
+/// ]).to_dfschema_ref().unwrap();
+///
+/// // Create the rewriters
+/// let props = ExecutionProps::new();
+/// let context = SimplifyContext::new(&props)
+///.with_schema(schema);
+/// let mut simplifier = Simplifier::new(&context);
+/// let mut const_evaluator = ConstEvaluator::try_new(&props).unwrap();
+///
+/// // ((c4 < 1 or c3 < 2) and c3 < 3) and false
+/// let expr = col("c2")
+/// .lt(lit(1))
+/// .or(col("c1").lt(lit(2)))
+/// .and(col("c1").lt(lit(3)))
+/// .and(lit(false));
+///
+/// // run the rewrite cycle loop
+/// let (expr, info) = RewriteCycle::new()
+/// .with_max_cycles(4)
+/// .each_cycle(expr, |cycle_state| {
+/// cycle_state
+///

Re: [I] Convert `Min` and `Max` to UDAF [datafusion]

2024-06-16 Thread via GitHub



dharanad commented on issue #10943:
URL: https://github.com/apache/datafusion/issues/10943#issuecomment-2172176854

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Convert `Hyperloglog` to UDAF [datafusion]

2024-06-16 Thread via GitHub



dharanad commented on issue #10944:
URL: https://github.com/apache/datafusion/issues/10944#issuecomment-2172177142

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub



dharanad commented on code in PR #10930:
URL: https://github.com/apache/datafusion/pull/10930#discussion_r1642130995


##
datafusion/functions-aggregate/src/bit_and_or_xor.rs:
##
@@ -0,0 +1,400 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines `BitAnd`, `BitOr`, `BitXor` and `BitXor DISTINCT` aggregate 
accumulators
+
+use std::any::Any;
+use std::collections::HashSet;
+use std::fmt::{Display, Formatter};
+
+use ahash::RandomState;
+use arrow::array::{Array, ArrayRef, AsArray};
+use arrow::datatypes::{
+ArrowNativeType, ArrowNumericType, DataType, Int16Type, Int32Type, 
Int64Type,
+Int8Type, UInt16Type, UInt32Type, UInt64Type, UInt8Type,
+};
+use arrow_schema::Field;
+
+use datafusion_common::cast::as_list_array;
+use datafusion_common::{exec_err, not_impl_err, Result, ScalarValue};
+use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs};
+use datafusion_expr::type_coercion::aggregates::INTEGERS;
+use datafusion_expr::utils::format_state_name;
+use datafusion_expr::{Accumulator, AggregateUDFImpl, Signature, Volatility};
+
+/// `accumulator_helper` is a macro accepting (ArrowPrimitiveType, 
BitwiseOperationType)
+macro_rules! accumulator_helper {
+($t:ty, $opr:expr) => {
+match $opr {
+BitwiseOperationType::And => 
Ok(Box::>::default()),
+BitwiseOperationType::Or => 
Ok(Box::>::default()),
+BitwiseOperationType::Xor => 
Ok(Box::>::default()),
+BitwiseOperationType::XorDistinct => {
+Ok(Box::>::default())
+}
+}
+};
+}
+
+/// AND, OR and XOR only supports a subset of numeric types
+///
+/// `args` is [AccumulatorArgs]
+/// `opr` is [BitwiseOperationType]
+macro_rules! downcast_bitwise_accumulator {
+($args:ident, $opr:expr) => {
+match $args.data_type {
+DataType::Int8 => accumulator_helper!(Int8Type, $opr),
+DataType::Int16 => accumulator_helper!(Int16Type, $opr),
+DataType::Int32 => accumulator_helper!(Int32Type, $opr),
+DataType::Int64 => accumulator_helper!(Int64Type, $opr),
+DataType::UInt8 => accumulator_helper!(UInt8Type, $opr),
+DataType::UInt16 => accumulator_helper!(UInt16Type, $opr),
+DataType::UInt32 => accumulator_helper!(UInt32Type, $opr),
+DataType::UInt64 => accumulator_helper!(UInt64Type, $opr),
+_ => {
+not_impl_err!(
+"{} not supported for {}: {}",
+stringify!($opr),
+$args.name,
+$args.data_type
+)
+}
+}
+};
+}
+
+/// Simplifies the creation of User-Defined Aggregate Functions (UDAFs) for 
performing bitwise operations in a declarative manner.
+///
+/// `EXPR_FN` identifier used to name the generated expression function.
+/// `AGGREGATE_UDF_FN` is an identifier used to name the underlying UDAF 
function.
+/// `OPR_TYPE` is an expression that evaluates to the type of bitwise 
operation to be performed.
+macro_rules! make_bitwise_udaf_expr_and_func {
+($EXPR_FN:ident, $AGGREGATE_UDF_FN:ident, $OPR_TYPE:expr) => {
+make_udaf_expr!($EXPR_FN, expr_y expr_x, concat!("Returns the 
bitwise", stringify!($OPR_TYPE), "of a group of values"), $AGGREGATE_UDF_FN);
+create_func!($EXPR_FN, $AGGREGATE_UDF_FN, 
BitwiseOperation::new($OPR_TYPE, stringify!($EXPR_FN)));
+}
+}
+
+make_bitwise_udaf_expr_and_func!(bit_and, bit_and_udaf, 
BitwiseOperationType::And);
+make_bitwise_udaf_expr_and_func!(bit_or, bit_or_udaf, 
BitwiseOperationType::Or);
+make_bitwise_udaf_expr_and_func!(bit_xor, bit_xor_udaf, 
BitwiseOperationType::Xor);
+
+/// The different types of bitwise operations that can be performed.
+#[derive(Debug, Clone, Eq, PartialEq)]
+enum BitwiseOperationType {
+And,
+Or,
+Xor,
+/// `XorDistinct` is a variation of the bitwise XOR operation specifically 
for the scenario of BitXor DISTINCT
+XorDistinct,
+}
+
+impl Display for BitwiseOperationType {
+fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
+write!(f, "{:?}", self)
+}
+}
+
+/// [BitwiseOperation] struct encapsulates informati

Re: [I] Modulus can produce incorrect results in some cases [datafusion-comet]

2024-06-16 Thread via GitHub



vaibhawvipul commented on issue #521:
URL: 
https://github.com/apache/datafusion-comet/issues/521#issuecomment-2172247315

   working on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Support dictionary data type in array_to_string [datafusion]

2024-06-16 Thread via GitHub



Weijun-H commented on code in PR #10908:
URL: https://github.com/apache/datafusion/pull/10908#discussion_r1642217621


##
datafusion/functions-array/src/string.rs:
##
@@ -281,6 +281,31 @@ pub(super) fn array_to_string_inner(args: &[ArrayRef]) -> 
Result {
 
 Ok(arg)
 }
+Dictionary(..) => {
+let any_dict_array = arr
+.as_any_dictionary_opt()
+.ok_or_else( || {
+DataFusionError::Internal(
+format!("could not cast {} to AnyDictionaryArray", 
arr.data_type())
+)
+})?;
+macro_rules! array_function {
+($ARRAY_TYPE:ident) => {
+to_string!(
+arg,
+any_dict_array.values(),
+&delimiter,
+&null_string,
+with_null_string,
+$ARRAY_TYPE
+)
+};
+}
+call_array_function!(
+any_dict_array.values().data_type(),
+false
+)

Review Comment:
   Why not use
   ```suggestion
   compute_array_to_string(
   arg,
   any_dict_array.values().clone(),
   delimiter.clone(),
   null_string.clone(),
   with_null_string,
   )?;
   
   Ok(arg)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]

2024-06-16 Thread via GitHub



dharanad commented on PR #10930:
URL: https://github.com/apache/datafusion/pull/10930#issuecomment-2172374561

   cc: @jayzhan211 I have made the changes and also updated the UT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Convert string agg to udaf [datafusion]

2024-06-16 Thread via GitHub



lewiszlw opened a new pull request, #10945:
URL: https://github.com/apache/datafusion/pull/10945

   ## Which issue does this PR close?
   
   
   
   part of https://github.com/apache/datafusion/issues/8708.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

[PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]

2024-06-16 Thread via GitHub



marvinlanhenke opened a new pull request, #10946:
URL: https://github.com/apache/datafusion/pull/10946

   ## Which issue does this PR close?
   
   
   
   Closes #10926.
   
   ## Rationale for this change
   
   
   - row_group_null_counts, data_page_null_counts
   - data_page_row_counts
   
   return an ArrayRef instead of UInt64Array.
   
   ## What changes are included in this PR?
   
   
   - fixed methods to return correct array type
   - changed data_page_row_counts to fall back on row_page_row_counts if column 
is missing
   - added a test case in `arrow_statistics`
   
   ## Are these changes tested?
   Yes.
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]

2024-06-16 Thread via GitHub



marvinlanhenke commented on PR #10946:
URL: https://github.com/apache/datafusion/pull/10946#issuecomment-2172417984

   @alamb PTAL.
   
   I'm not sure about the assumptions I made in `data_page_row_counts` 
[here](https://github.com/apache/datafusion/pull/10946/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40R1053-R1057).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] Convert `Hyperloglog` to UDAF [datafusion]

2024-06-16 Thread via GitHub



dharanad commented on issue #10944:
URL: https://github.com/apache/datafusion/issues/10944#issuecomment-2172431668

   Hey @jayzhan211 
   I noticed the `approx_distinct` UDAF implemented in 
`datafusion/functions-aggregate/src/approx_distinct.rs`. It appears to be using 
a `HyperLogLog` (HLL) structure defined in 
`datafusion/functions-aggregate/src/hyperloglog.rs`.
   
   What is the expected outcome of this issue. Could you provide some insights 
into the intended use of this UDAF?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org