Re: [I] Add example for writing an `AnalyzerRule` [datafusion]
goldmedal commented on issue #10855: URL: https://github.com/apache/datafusion/issues/10855#issuecomment-2171124147 Hi @alamb, I just want to share how I use user-defined AnalyzerRule. As you know, I'm working on reimplementing the semantic layer engine, [wren-engine](https://github.com/Canner/wren-engine), for LLM using DataFusion. The project is still a work in progress. However, I think I have finished the part of integration with DataFusion. I think it could be a nice use case for `AnalyzerRule`. The basic concept of wren-engine is that user can define a virtual modeling layer to apply on their physical data. For example, given a csv file registered in DataFusion context called, `datafusion.public.customers`. We can wrap a virtual table for it called, `wrenai.default.customers_model`. User can query it like ```sql select * from wrenai.default.customers_model ``` Then , wren engine translate it to a physical query: ```sql SELECT "customers_model"."city", "customers_model"."id", "customers_model"."state" FROM ( SELECT "customers_model"."city", "customers_model"."id", "customers_model"."state" FROM ( SELECT "datafusion"."public"."customers"."city" AS "city", "datafusion"."public"."customers"."id" AS "id", "datafusion"."public"."customers"."state" AS "state" FROM "public"."customers" ) AS "customers_model" ) AS "customers_model" ``` It's a simple use case to show how it works. We have other features, such as `relationship` between models or calculated fields in the model. I don't go into detail about them here. However, I implemented all of them through `AnalyzerRule` and `UserDefinedLogicalNode`. The related PR, https://github.com/Canner/wren-engine/pull/613, is still under review. However, you can find the usage of `AnalyzerRule` in [rule.rs](https://github.com/goldmedal/wren-engine/blob/bugfix/fix-to-one-relation-refactor-chain/wren-modeling-rs/core/src/logical_plan/analyze/rule.rs) and the plan node in [plan.rs](https://github.com/goldmedal/wren-engine/blob/bugfix/fix-to-one-relation-refactor-chain/wren-modeling-rs/core/src/logical_plan/analyze/plan.rs). I also added [some examples](https://github.com/goldmedal/wren-engine/tree/bugfix/fix-to-one-relation-refactor-chain/wren-modeling-rs/wren-example) to show how to use the API or integrate with the DataFusion query flow. Thanks to DataFusion for providing such amazing features, allowing me to implement a more stable and structured converter for the semantic layer. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
goldmedal commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641668286 ## datafusion/proto/src/physical_plan/mod.rs: ## @@ -496,11 +496,14 @@ impl AsExecutionPlan for protobuf::PhysicalPlanNode { } AggregateFunction::UserDefinedAggrFunction(udaf_name) => { let agg_udf = registry.udaf(udaf_name)?; +// TODO: 'logical_exprs' is not supported for UDAF yet. +// approx_percentile_cont and approx_percentile_cont_weight are not supported for UDAF from protobuf yet. +let logical_exprs = &[]; Review Comment: I agree. I thought Flight SQL might be a use case for `from_proto`. However, I tried to query `approx_percentile_cont` through Flight SQL, and it worked well. Maybe we can just ignore it and leave this TODO comment until we find any `from_proto` use case. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Remove `prefer_hash_join` env variable for clickbench [datafusion]
jayzhan211 opened a new pull request, #10933: URL: https://github.com/apache/datafusion/pull/10933 ## Which issue does this PR close? Closes #. ## Rationale for this change clickbench does not expect `--prefer_hash_join`, remove it. It is introduce in #10809 ``` *** DataFusion Benchmark Script COMMAND: run BENCHMARK: clickbench_partitioned DATAFUSION_DIR: /Users/bytedance/arrow-datafusion/benchmarks/.. BRANCH_NAME: rm-hash-join-env-clickbench DATA_DIR: /Users/bytedance/arrow-datafusion/benchmarks/data RESULTS_DIR: /Users/bytedance/arrow-datafusion/benchmarks/results/rm-hash-join-env-clickbench CARGO_COMMAND: cargo run --profile release-nonlto PREFER_HASH_JOIN: true *** RESULTS_FILE: /Users/bytedance/arrow-datafusion/benchmarks/results/rm-hash-join-env-clickbench/clickbench_partitioned.json Running clickbench (partitioned, 100 files) benchmark... Finished `release-nonlto` profile [optimized] target(s) in 0.68s Running `/Users/bytedance/arrow-datafusion/target/release-nonlto/dfbench clickbench --iterations 5 --path /Users/bytedance/arrow-datafusion/benchmarks/data/hits_partitioned --prefer_hash_join true --queries-path /Users/bytedance/arrow-datafusion/benchmarks/queries/clickbench/queries.sql -o /Users/bytedance/arrow-datafusion/benchmarks/results/rm-hash-join-env-clickbench/clickbench_partitioned.json` error: Found argument '--prefer_hash_join' which wasn't expected, or isn't valid in this context USAGE: dfbench clickbench --iterations --path For more information try --help ``` ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Support CartesianProductExec in comet [datafusion-comet]
leoluan2009 commented on PR #442: URL: https://github.com/apache/datafusion-comet/pull/442#issuecomment-2171428443 > @leoluan2009 Would you like to update the plan stability results? Thanks. Hi @viirya sorry for delay reply, I think this pr do not change the query plan for tpcds? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] `StatisticsConverter::row_group_null_counts` incorrect for missing column [datafusion]
alamb commented on issue #10926: URL: https://github.com/apache/datafusion/issues/10926#issuecomment-2171460701 > Should we also change the signature and return Result? Yes I think so > Should we also change the signature and return Result? I actually think using `new_null_array` is wrong here. It returns the wrong type -- for example if we are getting row counts for a `Utf8` column, using new_null_arrray will return a `StringArray` rather than a `UInt64Array`. I actually hit this issue when working on https://github.com/apache/datafusion/pull/10924 And I had to change it to ```rust let Some(parquet_index) = self.parquet_index else { let num_row_groups = metadatas.into_iter().count(); return Ok(Arc::new(UInt64Array::from_iter( std::iter::repeat(None).take(num_row_groups), ))); }; ``` To make a test pass -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Improve `arrow_statistics` tests [datafusion]
alamb merged PR #10927: URL: https://github.com/apache/datafusion/pull/10927 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Improve `arrow_statistics` tests [datafusion]
alamb commented on PR #10927: URL: https://github.com/apache/datafusion/pull/10927#issuecomment-2171460836 Thanks everyone for the reviews -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Remove `prefer_hash_join` env variable for clickbench [datafusion]
alamb merged PR #10933: URL: https://github.com/apache/datafusion/pull/10933 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Add a benchmark for extracting data page statistics [datafusion]
alamb opened a new issue, #10934: URL: https://github.com/apache/datafusion/issues/10934 ### Is your feature request related to a problem or challenge? As we work to make extracting statistics from parquet data pages more correct and performant in https://github.com/apache/datafusion/issues/10922 one thing that would be good is to have benchmark overage ### Describe the solution you'd like Add a benchmark for extracting page statistics ### Describe alternatives you've considered Add a benchmark ([source](https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L152)) for extracting data page statistics These are run via ```shell cargo bench --bench parquet_statistic ``` In order to create a reasonable number of data page staistics, it would be good to configure the parquet writer to limit the sizez of data pages https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L75 And use https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit to set the the limit to 1 and then send the data in row by row as we did in the test: https://github.com/apache/datafusion/blob/d175163ef6442056d8210de9b0e28e264c39ca2c/datafusion/core/tests/parquet/arrow_statistics.rs#L105-L130 ### Additional context The need for a benchmark also came up in https://github.com/apache/datafusion/pull/10932 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Improve performance of Parquet statistics conversion [datafusion]
alamb commented on PR #10932: URL: https://github.com/apache/datafusion/pull/10932#issuecomment-2171467776 cc @xinlifoobar and @marvinlanhenke -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
goldmedal commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641808518 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -15,19 +15,254 @@ // specific language governing permissions and limitations // under the License. +use std::any::Any; +use std::fmt::{Debug, Formatter}; + use arrow::{ array::{ ArrayRef, Float32Array, Float64Array, Int16Array, Int32Array, Int64Array, Int8Array, UInt16Array, UInt32Array, UInt64Array, UInt8Array, }, datatypes::DataType, }; +use arrow_schema::Field; -use datafusion_common::{downcast_value, internal_err, DataFusionError, ScalarValue}; -use datafusion_expr::Accumulator; +use datafusion_common::{ +downcast_value, internal_err, not_impl_err, plan_err, DataFusionError, ScalarValue, +}; +use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs}; +use datafusion_expr::type_coercion::aggregates::{INTEGERS, NUMERICS}; +use datafusion_expr::utils::format_state_name; +use datafusion_expr::{ +Accumulator, AggregateUDFImpl, Expr, Signature, TypeSignature, Volatility, +}; use datafusion_physical_expr_common::aggregate::tdigest::{ TDigest, TryIntoF64, DEFAULT_MAX_SIZE, }; +use datafusion_physical_expr_common::aggregate::utils::down_cast_any_ref; + +make_udaf_expr_and_func!( +ApproxPercentileCont, +approx_percentile_cont, +expression percentile, +"Computes the approximate percentile continuous of a set of numbers", +approx_percentile_cont_udaf +); + +pub struct ApproxPercentileCont { +signature: Signature, +} + +impl Debug for ApproxPercentileCont { +fn fmt(&self, f: &mut Formatter) -> std::fmt::Result { +f.debug_struct("ApproxPercentileCont") +.field("name", &self.name()) +.field("signature", &self.signature) +.finish() +} +} + +impl Default for ApproxPercentileCont { +fn default() -> Self { +Self::new() +} +} + +impl ApproxPercentileCont { +/// Create a new [`ApproxPercentileCont`] aggregate function. +pub fn new() -> Self { +let mut variants = Vec::with_capacity(NUMERICS.len() * (INTEGERS.len() + 1)); +// Accept any numeric value paired with a float64 percentile +for num in NUMERICS { +variants.push(TypeSignature::Exact(vec![num.clone(), DataType::Float64])); +// Additionally accept an integer number of centroids for T-Digest +for int in INTEGERS { +variants.push(TypeSignature::Exact(vec![ +num.clone(), +DataType::Float64, +int.clone(), +])) +} +} +Self { +signature: Signature::one_of(variants, Volatility::Immutable), Review Comment: I found we can't use `Signature::numeric`. I tried to use it, but I discovered that all the input arguments will have the same type as the first argument. For example, in a SQL query like: ``` select approx_percentile_cont(float64_col, 0.5, 100) from t ``` the input arguments will be `vec![Float64, Float64, Float64]`. However, the last argument should be `int64`. This will cause a failure. I'm not sure if it's related, but I found something weird. Why is `AccumulatorArgs#input_type` not a `vec` but a single datatype? If the number of arguments is greater than 1, how do we handle the other arguments? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
goldmedal commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641808587 ## datafusion/functions-aggregate/src/approx_percentile_cont_with_weight.rs: ## @@ -15,91 +15,139 @@ // specific language governing permissions and limitations // under the License. -use crate::expressions::ApproxPercentileCont; -use crate::{AggregateExpr, PhysicalExpr}; +use std::any::Any; +use std::fmt::{Debug, Formatter}; + use arrow::{ array::ArrayRef, datatypes::{DataType, Field}, }; -use datafusion_functions_aggregate::approx_percentile_cont::ApproxPercentileAccumulator; + +use datafusion_common::ScalarValue; +use datafusion_common::{not_impl_err, plan_err, Result}; +use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs}; +use datafusion_expr::type_coercion::aggregates::NUMERICS; +use datafusion_expr::{ +Accumulator, AggregateUDFImpl, Signature, TypeSignature, Volatility, +}; use datafusion_physical_expr_common::aggregate::tdigest::{ Centroid, TDigest, DEFAULT_MAX_SIZE, }; +use datafusion_physical_expr_common::physical_expr::down_cast_any_ref; -use datafusion_common::Result; -use datafusion_common::ScalarValue; -use datafusion_expr::Accumulator; +use crate::approx_percentile_cont::{ApproxPercentileAccumulator, ApproxPercentileCont}; -use crate::aggregate::utils::down_cast_any_ref; -use std::{any::Any, sync::Arc}; +make_udaf_expr_and_func!( +ApproxPercentileContWithWeight, +approx_percentile_cont_with_weight, +expression weight percentile, +"Computes the approximate percentile continuous with weight of a set of numbers", +approx_percentile_cont_with_weight_udaf +); /// APPROX_PERCENTILE_CONT_WITH_WEIGTH aggregate expression -#[derive(Debug)] pub struct ApproxPercentileContWithWeight { +signature: Signature, approx_percentile_cont: ApproxPercentileCont, -column_expr: Arc, -weight_expr: Arc, -percentile_expr: Arc, +} + +impl Debug for ApproxPercentileContWithWeight { +fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result { +f.debug_struct("ApproxPercentileContWithWeight") +.field("signature", &self.signature) +.finish() +} +} + +impl Default for ApproxPercentileContWithWeight { +fn default() -> Self { +Self::new() +} } impl ApproxPercentileContWithWeight { /// Create a new [`ApproxPercentileContWithWeight`] aggregate function. -pub fn new( -expr: Vec>, -name: impl Into, -return_type: DataType, -) -> Result { -// Arguments should be [ColumnExpr, WeightExpr, DesiredPercentileLiteral] -debug_assert_eq!(expr.len(), 3); - -let sub_expr = vec![expr[0].clone(), expr[2].clone()]; -let approx_percentile_cont = -ApproxPercentileCont::new(sub_expr, name, return_type)?; - -Ok(Self { -approx_percentile_cont, -column_expr: expr[0].clone(), -weight_expr: expr[1].clone(), -percentile_expr: expr[2].clone(), -}) +pub fn new() -> Self { +Self { +signature: Signature::one_of( Review Comment: Same as https://github.com/apache/datafusion/pull/10917#discussion_r1641808518 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] CSE shorthand alias [datafusion]
MohamedAbdeen21 commented on PR #10868: URL: https://github.com/apache/datafusion/pull/10868#issuecomment-2171468852 Hey @peter-toth sorry for the late reply. Looks good, and I like the Alias generator. But I have a couple of comments: 1. Thought we agreed on the `#` prefix, is there a reason for choosing the `__cse_` prefix? DuckDB uses `#XXX`, SQLServer uses `ExprXXX`, Spark uses `_expr#XXX`, all of which look cleaner than `__cse_XXX` 2. Tiny nit, but can you use `into_values` instead of using `into_iter` then ignoring the key? I don't mind either options, a PR against this one keeps the entire review history and can make reviews easier, but a new PR can be easier to maintain as you'll be able to directly push commits. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
goldmedal commented on PR #10917: URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171470071 > > External error: statement is expected to fail with error: > > (regex) DataFusion error: Error during planning: .* No function matches the given name and argument types > I usually just remove the error message, although it is not an idea solution, but I don't know why the CI message is different from local too. I modified the expected pattern of error message to match the message thrown in CI flow. The full message is ``` DataFusion error: Error during planning: Error during planning: Coercion from [Utf8, Int8, Float64] to the signature OneOf([Exact([Int8, Int8, Float64]), Exact([Int16, Int16, Float64]), Exact([Int32, Int32, Float64]), Exact([Int64, Int64, Float64]), Exact([UInt8, UInt8, Float64]), Exact([UInt16, UInt16, Float64]), Exact([UInt32, UInt32, Float64]), Exact([UInt64, UInt64, Float64]), Exact([Float32, Float32, Float64]), Exact([Float64, Float64, Float64])]) failed. No function matches the given name and argument types 'approx_percentile_cont_with_weight(Utf8, Int8, Float64)'. You might need to add explicit type casts. Candidate functions: approx_percentile_cont_with_weight(Int8, Int8, Float64) approx_percentile_cont_with_weight(Int16, Int16, Float64) approx_percentile_cont_with_weight(Int32, Int32, Float64) approx_percentile_cont_with_weight(Int64, Int64, Float64) approx_percentile_cont_with_weight(UInt8, UInt8, Float64) approx_percentile_cont_with_weight(UInt16, UInt16, Float64) approx_percentile_cont_with_weight(UInt32, UInt32, Float64) approx_percentile_cont_with_weight(UInt64, UInt64, Float64) approx_percentile_cont_with_weight(Float32, Float32, Float64) approx_percentile_cont_with_weight(Float64, Float64, Float64) ``` I noticed that the CI flow is missing the tail part of the error message: `You might need to add explicit type casts`. There might be a length limitation for the error message, but I'm not sure. Just making a note of it here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] `StatisticsConverter::row_group_null_counts` incorrect for missing column [datafusion]
marvinlanhenke commented on issue #10926: URL: https://github.com/apache/datafusion/issues/10926#issuecomment-2171471803 > I actually think using `new_null_array` is wrong here. It returns the wrong type I passed `&DataType::UInt64` into `new_null_array(...)`. However, your approach solves the issue with downcasting the result from new_null_array so it makes more sense here. Thank you. Just to confirm my understanding (regarding my second question) `row_group_row_counts` should also return a null_array when we reference a non-existing column correct? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Push down filters below `Unnest` in sub queries [datafusion]
ahirner opened a new issue, #10935: URL: https://github.com/apache/datafusion/issues/10935 ### Is your feature request related to a problem or challenge? Currently where clauses in a sub query are not pushed below `Unnest` expressions. In conjunction with some other limitations, it composing views with `unnest()` not ideal. In `0.39.0` cli: ``` > CREATE TABLE IF NOT EXISTS v AS VALUES(1,[1,2,3]),(2,[3,4,5]); 0 row(s) fetched. Elapsed 0.008 seconds. ``` We get equal results with or w/o a subquery: ``` > select unnest(column2) from v where column1=2; +---+ | unnest(v.column2) | +---+ | 3 | | 4 | | 5 | +---+ 3 row(s) fetched. Elapsed 0.011 seconds. ``` ``` > select "unnest(v.column2)" from (select unnest(column2), column1 from v) where column1=2; +---+ | unnest(v.column2) | +---+ | 3 | | 4 | | 5 | +---+ 3 row(s) fetched. Elapsed 0.015 seconds. ``` `Filter` and `FilterExec` are above projections with `unnest`, instead of below. Ok: ``` > explain select unnest(column2) from v where column1=2; +---++ | plan_type | plan | +---++ | logical_plan | Unnest: lists[unnest(v.column2)] structs[] | | | Projection: v.column2 AS unnest(v.column2) | | | Filter: v.column1 = Int64(2) | | | TableScan: v projection=[column1, column2] | | physical_plan | UnnestExec | | | RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1 | | | ProjectionExec: expr=[column2@1 as unnest(v.column2)] | | | CoalesceBatchesExec: target_batch_size=8192 | | | FilterExec: column1@0 = 2 | | | MemoryExec: partitions=1, partition_sizes=[1] | | | | +---++ ``` Problematic: ``` > explain select "unnest(v.column2)" from (select unnest(column2), column1 from v) where column1=2; +---+---+ | plan_type | plan | +---+---+ | logical_plan | Projection: unnest(v.column2) | | | Filter: v.column1 = Int64(2) | | | Unnest: lists[unnest(v.column2)] structs[] | | | Projection: v.column2 AS unnest(v.column2), v.column1 | | | TableScan: v projection=[column1, column2] | | physical_plan | ProjectionExec: expr=[unnest(v.column2)@0 as unnest(v.column2)] | | | CoalesceBatchesExec: target_batch_size=8192 | | | FilterExec: column1@1 = 2 | | | RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1| | | UnnestExec | | | ProjectionExec: expr=[column2@1 as unnest(v.column2), column1@0 as column1] | | | MemoryExec: partitions=1, partition_sizes=[1] | | | | +---+---+ ``` ### Describe the solution you'd like `FilterExec` is pushed down to tables that support them. ### Describe alternatives you've considered A workaround is trying to not use subqueries. However, t
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
jayzhan211 commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641813923 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -15,19 +15,254 @@ // specific language governing permissions and limitations // under the License. +use std::any::Any; +use std::fmt::{Debug, Formatter}; + use arrow::{ array::{ ArrayRef, Float32Array, Float64Array, Int16Array, Int32Array, Int64Array, Int8Array, UInt16Array, UInt32Array, UInt64Array, UInt8Array, }, datatypes::DataType, }; +use arrow_schema::Field; -use datafusion_common::{downcast_value, internal_err, DataFusionError, ScalarValue}; -use datafusion_expr::Accumulator; +use datafusion_common::{ +downcast_value, internal_err, not_impl_err, plan_err, DataFusionError, ScalarValue, +}; +use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs}; +use datafusion_expr::type_coercion::aggregates::{INTEGERS, NUMERICS}; +use datafusion_expr::utils::format_state_name; +use datafusion_expr::{ +Accumulator, AggregateUDFImpl, Expr, Signature, TypeSignature, Volatility, +}; use datafusion_physical_expr_common::aggregate::tdigest::{ TDigest, TryIntoF64, DEFAULT_MAX_SIZE, }; +use datafusion_physical_expr_common::aggregate::utils::down_cast_any_ref; + +make_udaf_expr_and_func!( +ApproxPercentileCont, +approx_percentile_cont, +expression percentile, +"Computes the approximate percentile continuous of a set of numbers", +approx_percentile_cont_udaf +); + +pub struct ApproxPercentileCont { +signature: Signature, +} + +impl Debug for ApproxPercentileCont { +fn fmt(&self, f: &mut Formatter) -> std::fmt::Result { +f.debug_struct("ApproxPercentileCont") +.field("name", &self.name()) +.field("signature", &self.signature) +.finish() +} +} + +impl Default for ApproxPercentileCont { +fn default() -> Self { +Self::new() +} +} + +impl ApproxPercentileCont { +/// Create a new [`ApproxPercentileCont`] aggregate function. +pub fn new() -> Self { +let mut variants = Vec::with_capacity(NUMERICS.len() * (INTEGERS.len() + 1)); +// Accept any numeric value paired with a float64 percentile +for num in NUMERICS { +variants.push(TypeSignature::Exact(vec![num.clone(), DataType::Float64])); +// Additionally accept an integer number of centroids for T-Digest +for int in INTEGERS { +variants.push(TypeSignature::Exact(vec![ +num.clone(), +DataType::Float64, +int.clone(), +])) +} +} +Self { +signature: Signature::one_of(variants, Volatility::Immutable), Review Comment: It is single type because we have not yet meet any function that takes multiple input to build accumulator. If there is, then we might need to extend it to Vec -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
jayzhan211 commented on code in PR #10917: URL: https://github.com/apache/datafusion/pull/10917#discussion_r1641818096 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -15,19 +15,254 @@ // specific language governing permissions and limitations // under the License. +use std::any::Any; +use std::fmt::{Debug, Formatter}; + use arrow::{ array::{ ArrayRef, Float32Array, Float64Array, Int16Array, Int32Array, Int64Array, Int8Array, UInt16Array, UInt32Array, UInt64Array, UInt8Array, }, datatypes::DataType, }; +use arrow_schema::Field; -use datafusion_common::{downcast_value, internal_err, DataFusionError, ScalarValue}; -use datafusion_expr::Accumulator; +use datafusion_common::{ +downcast_value, internal_err, not_impl_err, plan_err, DataFusionError, ScalarValue, +}; +use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs}; +use datafusion_expr::type_coercion::aggregates::{INTEGERS, NUMERICS}; +use datafusion_expr::utils::format_state_name; +use datafusion_expr::{ +Accumulator, AggregateUDFImpl, Expr, Signature, TypeSignature, Volatility, +}; use datafusion_physical_expr_common::aggregate::tdigest::{ TDigest, TryIntoF64, DEFAULT_MAX_SIZE, }; +use datafusion_physical_expr_common::aggregate::utils::down_cast_any_ref; + +make_udaf_expr_and_func!( +ApproxPercentileCont, +approx_percentile_cont, +expression percentile, +"Computes the approximate percentile continuous of a set of numbers", +approx_percentile_cont_udaf +); + +pub struct ApproxPercentileCont { +signature: Signature, +} + +impl Debug for ApproxPercentileCont { +fn fmt(&self, f: &mut Formatter) -> std::fmt::Result { +f.debug_struct("ApproxPercentileCont") +.field("name", &self.name()) +.field("signature", &self.signature) +.finish() +} +} + +impl Default for ApproxPercentileCont { +fn default() -> Self { +Self::new() +} +} + +impl ApproxPercentileCont { +/// Create a new [`ApproxPercentileCont`] aggregate function. +pub fn new() -> Self { +let mut variants = Vec::with_capacity(NUMERICS.len() * (INTEGERS.len() + 1)); +// Accept any numeric value paired with a float64 percentile +for num in NUMERICS { +variants.push(TypeSignature::Exact(vec![num.clone(), DataType::Float64])); +// Additionally accept an integer number of centroids for T-Digest +for int in INTEGERS { +variants.push(TypeSignature::Exact(vec![ +num.clone(), +DataType::Float64, +int.clone(), +])) +} +} +Self { +signature: Signature::one_of(variants, Volatility::Immutable), Review Comment: > the input arguments will be vec![Float64, Float64, Float64] Oh, yes. It will coerce types to the unify one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Push down filters below `Unnest` in sub queries [datafusion]
jayzhan211 commented on issue #10935: URL: https://github.com/apache/datafusion/issues/10935#issuecomment-2171495707 For this kind of query `select unnest(unnest([[1],[1,2],[1,2,3]]));`, I think we can follow what Duckdb is. Unnest with recursive, and max_depth `select unnest([[1],[1,2],[1,2,3]], recursive:= true);` or `select unnest([[1],[1,2],[1,2,3]], max_depth := 2);` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Improve performance of Parquet statistics conversion [datafusion]
marvinlanhenke commented on PR #10932: URL: https://github.com/apache/datafusion/pull/10932#issuecomment-2171497286 > I think this is a nice code impeovement ❤️ It is indeed. Thanks @Weijun-H really nice refactor. > I am not sure this will improve the performance I'll guess in terms of performance it won't make any difference. However, it is more readable and thats always great, so thanks again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] x NOT IN y works but NOT (x IN y) doesn't [datafusion]
akoshchiy commented on issue #10830: URL: https://github.com/apache/datafusion/issues/10830#issuecomment-2171512487 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: rewriting NOT IN () to anti join [datafusion]
akoshchiy opened a new pull request, #10936: URL: https://github.com/apache/datafusion/pull/10936 ## Which issue does this PR close? Closes #10830 . ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? Yes. ## Are there any user-facing changes? No. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Improve performance for grouping by variable length columns (strings) [datafusion]
jayzhan211 commented on issue #9403: URL: https://github.com/apache/datafusion/issues/9403#issuecomment-2171579091 I try to an approach that take multiple columns into consideration but found that the time spend on `insert_accounted` largely increase. ```rust self.map.insert_accounted( new_header, |header| header.hash, &mut self.map_size, ); ``` The hash entry includes vector like ```rust struct Entry where O: OffsetSizeTrait, { /// hash of the value (stored to avoid recomputing it in hash table check) hash: u64, /// if len =< [`SHORT_VALUE_LEN`]: the data inlined /// if len > [`SHORT_VALUE_LEN`], the offset of where the data starts offset_or_inline: Vec, /// length of the value, in bytes (use O here so we use only i32 for /// strings, rather 64 bit usize) len: Vec>, group_id: usize, } ``` It seems that we need avoid adding `Vec` into hash entry. It seems `ArrowBytesMap`'s idea couldn't help much :( -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] [DRAFT] Multiple group by optimization [datafusion]
jayzhan211 opened a new pull request, #10937: URL: https://github.com/apache/datafusion/pull/10937 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]
vaibhawvipul commented on PR #10930: URL: https://github.com/apache/datafusion/pull/10930#issuecomment-2171678704 looks nice, please let me know when it is ready for review. Note, can you also run the relevant scala test? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Add example for writing SQL analysis using DataFusion structures [datafusion]
LorrensP-2158466 opened a new pull request, #10938: URL: https://github.com/apache/datafusion/pull/10938 ## Which issue does this PR close? Closes #10871 . ## Rationale for this change ## What changes are included in this PR? Added an example of an SQL analysis where we count joins in a SQL query. ## Are these changes tested? Ran it on some other queries and the join counts where right. ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
goldmedal commented on PR #10917: URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171687704 Thanks @jayzhan211 approved it. : ) If no more comments, how about merge this PR first? I guess it may cause some conflicts for other UDAF-related PRs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Move `DynamicFileCatalog` back to core [datafusion]
goldmedal commented on PR #10745: URL: https://github.com/apache/datafusion/pull/10745#issuecomment-2171692026 Hi @alamb, is there any update on this topic? Or maybe I can help by splitting out some traits into the new crate `datafusion-catalog` first? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
jayzhan211 commented on PR #10917: URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171694652 > Thanks @jayzhan211 approved it. : ) If no more comments, how about merge this PR first? I guess it may cause some conflicts for other UDAF-related PRs. Sure! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
jayzhan211 merged PR #10917: URL: https://github.com/apache/datafusion/pull/10917 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Convert `ApproxPercentileCont` and ``ApproxPercentileContWithWeight` to UDAF [datafusion]
jayzhan211 closed issue #10870: Convert `ApproxPercentileCont` and ``ApproxPercentileContWithWeight` to UDAF URL: https://github.com/apache/datafusion/issues/10870 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF [datafusion]
goldmedal commented on PR #10917: URL: https://github.com/apache/datafusion/pull/10917#issuecomment-2171695849 Thanks again @jayzhan211 :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] use ScalarValue::to_pyarrow to convert to python object [datafusion-python]
andygrove merged PR #731: URL: https://github.com/apache/datafusion-python/pull/731 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Simpify `PyExpr::python_value` by using `ScalarValue::into_py` [datafusion-python]
andygrove closed issue #729: Simpify `PyExpr::python_value` by using `ScalarValue::into_py` URL: https://github.com/apache/datafusion-python/issues/729 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Simpify `PyExpr::python_value` by using `ScalarValue::into_py` [datafusion-python]
andygrove closed issue #729: Simpify `PyExpr::python_value` by using `ScalarValue::into_py` URL: https://github.com/apache/datafusion-python/issues/729 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] CSE shorthand alias [datafusion]
peter-toth commented on PR #10868: URL: https://github.com/apache/datafusion/pull/10868#issuecomment-2171754117 > Hey @peter-toth sorry for the late reply. > > Looks good, and I like the Alias generator. But I have a couple of comments: > > 1. Thought we agreed on the `#` prefix. Is there a reason for choosing the `__cse_` prefix? DuckDB uses `#XXX`, SQLServer uses `ExprXXX` (can't verify this atm), Spark uses `_expr#XXX`, all of which look cleaner than `__cse_XXX` > 2. Tiny nit, but can you use `into_values` instead of using `into_iter` then ignoring the key? > > I don't mind either options, a PR against this one keeps the entire review history and can make reviews easier, but a new PR can be easier to maintain as you'll be able to directly push commits. No problem, I've opened https://github.com/apache/datafusion/pull/10939. Yes and actually I'm fine with that as well, but I found a few `AliasGenerator` usecases in the existing code and wanted to follow the conventions of https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/decorrelate_predicate_subquery.rs#L249 and https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/scalar_subquery_to_join.rs#L245 with the `__cse` prefix. I've also changed `into_iter()` to `into_values ()` in https://github.com/apache/datafusion/pull/10939/commits/6b1d1e34abd4c9da4a550db99471c09f5add09ff. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Use shorter aliases in CSE [datafusion]
peter-toth commented on PR #10939: URL: https://github.com/apache/datafusion/pull/10939#issuecomment-2171754406 cc @MohamedAbdeen21, @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] CSE shorthand alias [datafusion]
MohamedAbdeen21 closed pull request #10868: CSE shorthand alias URL: https://github.com/apache/datafusion/pull/10868 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] CSE shorthand alias [datafusion]
MohamedAbdeen21 commented on PR #10868: URL: https://github.com/apache/datafusion/pull/10868#issuecomment-2171757651 Became part of #10939. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Push down filters below `Unnest` in sub queries [datafusion]
duongcongtoai commented on issue #10935: URL: https://github.com/apache/datafusion/issues/10935#issuecomment-2171768257 I think this is related https://github.com/apache/datafusion/issues/10660 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: rewriting NOT IN () to anti join [datafusion]
viirya commented on code in PR #10936: URL: https://github.com/apache/datafusion/pull/10936#discussion_r1641921370 ## datafusion/optimizer/src/decorrelate_predicate_subquery.rs: ## @@ -1099,6 +1126,29 @@ mod tests { Ok(()) } +#[test] +fn wrapped_not_in_subquery() -> Result<()> { +let table_scan = test_table_scan()?; +let plan = LogicalPlanBuilder::from(table_scan) +.filter(not(in_subquery(col("c"), test_subquery_with_name("sq")?)))? Review Comment: Could you add another test for `not(not_in_subquery)`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Push down filters below `Unnest` in sub queries [datafusion]
ahirner commented on issue #10935: URL: https://github.com/apache/datafusion/issues/10935#issuecomment-2171783792 Thanks for the pointers @duongcongtoai I think the recursive and max_depth options can be a special workaround, but introduce a good amount of complexity. Fixing the push down makes single `unnest` composable in general. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]
dharanad commented on PR #10930: URL: https://github.com/apache/datafusion/pull/10930#issuecomment-2171794774 @jayzhan211 / @alamb can you please help me with a review here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] refactor: dont include fallback in match on mem_pool_type [datafusion]
tshauck opened a new pull request, #10940: URL: https://github.com/apache/datafusion/pull/10940 ## Which issue does this PR close? Closes #. ## Rationale for this change Using the wildcard pattern has resulted in some minor bugs in the past in DataFusion. So this PR makes a minor tweak in the CLI the `args.mem_pool_type` to explicitly handle all the enum variants and `None` case. ## What changes are included in this PR? Just update a match statement. ## Are these changes tested? Compiled the CLI and ran it to check, otherwise tests remain the same. ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] refactor: remove extra default in max rows [datafusion]
tshauck opened a new pull request, #10941: URL: https://github.com/apache/datafusion/pull/10941 ## Which issue does this PR close? Closes #. ## Rationale for this change Currently `-h` in the datafusion CLI shows the default for max rows twice. ## What changes are included in this PR? This PR removes the value from the argument help, so the default is only shown once. ## Are these changes tested? Yes, compiled and looked at the help. ## Are there any user-facing changes? Minor, CLI help. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]
jayzhan211 commented on code in PR #10930: URL: https://github.com/apache/datafusion/pull/10930#discussion_r1642021123 ## datafusion/functions-aggregate/src/bit_and_or_xor.rs: ## @@ -0,0 +1,400 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Defines `BitAnd`, `BitOr`, `BitXor` and `BitXor DISTINCT` aggregate accumulators + +use std::any::Any; +use std::collections::HashSet; +use std::fmt::{Display, Formatter}; + +use ahash::RandomState; +use arrow::array::{Array, ArrayRef, AsArray}; +use arrow::datatypes::{ +ArrowNativeType, ArrowNumericType, DataType, Int16Type, Int32Type, Int64Type, +Int8Type, UInt16Type, UInt32Type, UInt64Type, UInt8Type, +}; +use arrow_schema::Field; + +use datafusion_common::cast::as_list_array; +use datafusion_common::{exec_err, not_impl_err, Result, ScalarValue}; +use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs}; +use datafusion_expr::type_coercion::aggregates::INTEGERS; +use datafusion_expr::utils::format_state_name; +use datafusion_expr::{Accumulator, AggregateUDFImpl, Signature, Volatility}; + +/// `accumulator_helper` is a macro accepting (ArrowPrimitiveType, BitwiseOperationType) +macro_rules! accumulator_helper { +($t:ty, $opr:expr) => { +match $opr { +BitwiseOperationType::And => Ok(Box::>::default()), +BitwiseOperationType::Or => Ok(Box::>::default()), +BitwiseOperationType::Xor => Ok(Box::>::default()), +BitwiseOperationType::XorDistinct => { +Ok(Box::>::default()) +} +} +}; +} + +/// AND, OR and XOR only supports a subset of numeric types +/// +/// `args` is [AccumulatorArgs] +/// `opr` is [BitwiseOperationType] +macro_rules! downcast_bitwise_accumulator { +($args:ident, $opr:expr) => { +match $args.data_type { +DataType::Int8 => accumulator_helper!(Int8Type, $opr), +DataType::Int16 => accumulator_helper!(Int16Type, $opr), +DataType::Int32 => accumulator_helper!(Int32Type, $opr), +DataType::Int64 => accumulator_helper!(Int64Type, $opr), +DataType::UInt8 => accumulator_helper!(UInt8Type, $opr), +DataType::UInt16 => accumulator_helper!(UInt16Type, $opr), +DataType::UInt32 => accumulator_helper!(UInt32Type, $opr), +DataType::UInt64 => accumulator_helper!(UInt64Type, $opr), +_ => { +not_impl_err!( +"{} not supported for {}: {}", +stringify!($opr), +$args.name, +$args.data_type +) +} +} +}; +} + +/// Simplifies the creation of User-Defined Aggregate Functions (UDAFs) for performing bitwise operations in a declarative manner. +/// +/// `EXPR_FN` identifier used to name the generated expression function. +/// `AGGREGATE_UDF_FN` is an identifier used to name the underlying UDAF function. +/// `OPR_TYPE` is an expression that evaluates to the type of bitwise operation to be performed. +macro_rules! make_bitwise_udaf_expr_and_func { +($EXPR_FN:ident, $AGGREGATE_UDF_FN:ident, $OPR_TYPE:expr) => { +make_udaf_expr!($EXPR_FN, expr_y expr_x, concat!("Returns the bitwise", stringify!($OPR_TYPE), "of a group of values"), $AGGREGATE_UDF_FN); +create_func!($EXPR_FN, $AGGREGATE_UDF_FN, BitwiseOperation::new($OPR_TYPE, stringify!($EXPR_FN))); +} +} + +make_bitwise_udaf_expr_and_func!(bit_and, bit_and_udaf, BitwiseOperationType::And); +make_bitwise_udaf_expr_and_func!(bit_or, bit_or_udaf, BitwiseOperationType::Or); +make_bitwise_udaf_expr_and_func!(bit_xor, bit_xor_udaf, BitwiseOperationType::Xor); + +/// The different types of bitwise operations that can be performed. +#[derive(Debug, Clone, Eq, PartialEq)] +enum BitwiseOperationType { +And, +Or, +Xor, +/// `XorDistinct` is a variation of the bitwise XOR operation specifically for the scenario of BitXor DISTINCT +XorDistinct, +} + +impl Display for BitwiseOperationType { +fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result { +write!(f, "{:?}", self) +} +} + +/// [BitwiseOperation] struct encapsulates informa
[I] Convert `Avg` to UDAF [datafusion]
jayzhan211 opened a new issue, #10942: URL: https://github.com/apache/datafusion/issues/10942 ### Is your feature request related to a problem or challenge? Similar to others issue in #8708 ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Convert `Min` and `Max` to UDAF [datafusion]
jayzhan211 opened a new issue, #10943: URL: https://github.com/apache/datafusion/issues/10943 ### Is your feature request related to a problem or challenge? Similar to other issues in #8708 I think the change of this might be large, could split it in several PR. ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Convert `Hyperloglog` to UDAF [datafusion]
jayzhan211 opened a new issue, #10944: URL: https://github.com/apache/datafusion/issues/10944 ### Is your feature request related to a problem or challenge? Similar to other issues in #8708 ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] refactor: remove extra default in max rows [datafusion]
jonahgao merged PR #10941: URL: https://github.com/apache/datafusion/pull/10941 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: RewriteCycle API for short-circuiting optimizer loops [datafusion]
jayzhan211 commented on code in PR #10386: URL: https://github.com/apache/datafusion/pull/10386#discussion_r1642086691 ## datafusion/optimizer/src/rewrite_cycle.rs: ## @@ -0,0 +1,262 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +/// An API for executing a sequence of [TreeNodeRewriter]s in multiple passes. +/// +/// See [RewriteCycle] for more information. +/// +use std::ops::ControlFlow; + +use datafusion_common::{ +tree_node::{Transformed, TreeNode, TreeNodeRewriter}, +Result, +}; + +/// A builder with methods for executing a "rewrite cycle". +/// +/// Often the results of one optimization rule can uncover more optimizations in other optimization +/// rules. A sequence of optimization rules can be ran in multiple "passes" until there are no +/// more optmizations to make. +/// +/// The [RewriteCycle] handles logic for running these multi-pass loops. +/// It applies a sequence of [TreeNodeRewriter]s to a [TreeNode] by calling +/// [TreeNode::rewrite] in a loop - passing the output of one rewrite as the input to the next +/// rewrite - until [RewriteCycle::max_cycles] is reached or until every [TreeNode::rewrite] +/// returns a [Transformed::no] result in a consecutive sequence. +#[derive(Debug)] +pub struct RewriteCycle { +max_cycles: usize, +} + +impl Default for RewriteCycle { +fn default() -> Self { +Self::new() +} +} + +impl RewriteCycle { +/// The default maximum number of completed cycles to run before terminating the rewrite loop. +/// You can override this default with [Self::with_max_cycles] +pub const DEFAULT_MAX_CYCLES: usize = 3; + +/// Creates a new [RewriteCycle] with default options. +pub fn new() -> Self { +Self { +max_cycles: Self::DEFAULT_MAX_CYCLES, +} +} +/// Sets the [Self::max_cycles] to run before terminating the rewrite loop. +pub fn with_max_cycles(mut self, max_cycles: usize) -> Self { +self.max_cycles = max_cycles; +self +} + +/// The maximum number of completed cycles to run before terminating the rewrite loop. +/// Defaults to [Self::DEFAULT_MAX_CYCLES]. +pub fn max_cycles(&self) -> usize { +self.max_cycles +} + +/// Runs a rewrite cycle on the given [TreeNode] using the given callback function to +/// explicitly handle the cycle iterations. +/// +/// The callback function is given a [RewriteCycleState], which manages the short-circuiting +/// logic of the loop. The function is expected to call [RewriteCycleState::rewrite] for each +/// individual [TreeNodeRewriter] in the cycle. [RewriteCycleState::rewrite] returns a [RewriteCycleControlFlow] +/// result, indicating whether the loop should break or continue. +/// +/// ```rust +/// +/// use arrow::datatypes::{Schema, Field, DataType}; +/// use datafusion_expr::{col, lit}; +/// use datafusion_common::{DataFusionError, ToDFSchema}; +/// use datafusion_expr::execution_props::ExecutionProps; +/// use datafusion_expr::simplify::SimplifyContext; +/// use datafusion_optimizer::rewrite_cycle::RewriteCycle; +/// use datafusion_optimizer::simplify_expressions::{Simplifier, ConstEvaluator}; +/// +/// // Create the schema +/// let schema = Schema::new(vec![ +/// Field::new("c1", DataType::Int64, true), +/// Field::new("c2", DataType::UInt32, true), +/// ]).to_dfschema_ref().unwrap(); +/// +/// // Create the rewriters +/// let props = ExecutionProps::new(); +/// let context = SimplifyContext::new(&props) +///.with_schema(schema); +/// let mut simplifier = Simplifier::new(&context); +/// let mut const_evaluator = ConstEvaluator::try_new(&props).unwrap(); +/// +/// // ((c4 < 1 or c3 < 2) and c3 < 3) and false +/// let expr = col("c2") +/// .lt(lit(1)) +/// .or(col("c1").lt(lit(2))) +/// .and(col("c1").lt(lit(3))) +/// .and(lit(false)); +/// +/// // run the rewrite cycle loop +/// let (expr, info) = RewriteCycle::new() +/// .with_max_cycles(4) +/// .each_cycle(expr, |cycle_state| { +/// cycle_state +///
Re: [I] Convert `Min` and `Max` to UDAF [datafusion]
dharanad commented on issue #10943: URL: https://github.com/apache/datafusion/issues/10943#issuecomment-2172176854 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Convert `Hyperloglog` to UDAF [datafusion]
dharanad commented on issue #10944: URL: https://github.com/apache/datafusion/issues/10944#issuecomment-2172177142 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]
dharanad commented on code in PR #10930: URL: https://github.com/apache/datafusion/pull/10930#discussion_r1642130995 ## datafusion/functions-aggregate/src/bit_and_or_xor.rs: ## @@ -0,0 +1,400 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +//! Defines `BitAnd`, `BitOr`, `BitXor` and `BitXor DISTINCT` aggregate accumulators + +use std::any::Any; +use std::collections::HashSet; +use std::fmt::{Display, Formatter}; + +use ahash::RandomState; +use arrow::array::{Array, ArrayRef, AsArray}; +use arrow::datatypes::{ +ArrowNativeType, ArrowNumericType, DataType, Int16Type, Int32Type, Int64Type, +Int8Type, UInt16Type, UInt32Type, UInt64Type, UInt8Type, +}; +use arrow_schema::Field; + +use datafusion_common::cast::as_list_array; +use datafusion_common::{exec_err, not_impl_err, Result, ScalarValue}; +use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs}; +use datafusion_expr::type_coercion::aggregates::INTEGERS; +use datafusion_expr::utils::format_state_name; +use datafusion_expr::{Accumulator, AggregateUDFImpl, Signature, Volatility}; + +/// `accumulator_helper` is a macro accepting (ArrowPrimitiveType, BitwiseOperationType) +macro_rules! accumulator_helper { +($t:ty, $opr:expr) => { +match $opr { +BitwiseOperationType::And => Ok(Box::>::default()), +BitwiseOperationType::Or => Ok(Box::>::default()), +BitwiseOperationType::Xor => Ok(Box::>::default()), +BitwiseOperationType::XorDistinct => { +Ok(Box::>::default()) +} +} +}; +} + +/// AND, OR and XOR only supports a subset of numeric types +/// +/// `args` is [AccumulatorArgs] +/// `opr` is [BitwiseOperationType] +macro_rules! downcast_bitwise_accumulator { +($args:ident, $opr:expr) => { +match $args.data_type { +DataType::Int8 => accumulator_helper!(Int8Type, $opr), +DataType::Int16 => accumulator_helper!(Int16Type, $opr), +DataType::Int32 => accumulator_helper!(Int32Type, $opr), +DataType::Int64 => accumulator_helper!(Int64Type, $opr), +DataType::UInt8 => accumulator_helper!(UInt8Type, $opr), +DataType::UInt16 => accumulator_helper!(UInt16Type, $opr), +DataType::UInt32 => accumulator_helper!(UInt32Type, $opr), +DataType::UInt64 => accumulator_helper!(UInt64Type, $opr), +_ => { +not_impl_err!( +"{} not supported for {}: {}", +stringify!($opr), +$args.name, +$args.data_type +) +} +} +}; +} + +/// Simplifies the creation of User-Defined Aggregate Functions (UDAFs) for performing bitwise operations in a declarative manner. +/// +/// `EXPR_FN` identifier used to name the generated expression function. +/// `AGGREGATE_UDF_FN` is an identifier used to name the underlying UDAF function. +/// `OPR_TYPE` is an expression that evaluates to the type of bitwise operation to be performed. +macro_rules! make_bitwise_udaf_expr_and_func { +($EXPR_FN:ident, $AGGREGATE_UDF_FN:ident, $OPR_TYPE:expr) => { +make_udaf_expr!($EXPR_FN, expr_y expr_x, concat!("Returns the bitwise", stringify!($OPR_TYPE), "of a group of values"), $AGGREGATE_UDF_FN); +create_func!($EXPR_FN, $AGGREGATE_UDF_FN, BitwiseOperation::new($OPR_TYPE, stringify!($EXPR_FN))); +} +} + +make_bitwise_udaf_expr_and_func!(bit_and, bit_and_udaf, BitwiseOperationType::And); +make_bitwise_udaf_expr_and_func!(bit_or, bit_or_udaf, BitwiseOperationType::Or); +make_bitwise_udaf_expr_and_func!(bit_xor, bit_xor_udaf, BitwiseOperationType::Xor); + +/// The different types of bitwise operations that can be performed. +#[derive(Debug, Clone, Eq, PartialEq)] +enum BitwiseOperationType { +And, +Or, +Xor, +/// `XorDistinct` is a variation of the bitwise XOR operation specifically for the scenario of BitXor DISTINCT +XorDistinct, +} + +impl Display for BitwiseOperationType { +fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result { +write!(f, "{:?}", self) +} +} + +/// [BitwiseOperation] struct encapsulates informati
Re: [I] Modulus can produce incorrect results in some cases [datafusion-comet]
vaibhawvipul commented on issue #521: URL: https://github.com/apache/datafusion-comet/issues/521#issuecomment-2172247315 working on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support dictionary data type in array_to_string [datafusion]
Weijun-H commented on code in PR #10908: URL: https://github.com/apache/datafusion/pull/10908#discussion_r1642217621 ## datafusion/functions-array/src/string.rs: ## @@ -281,6 +281,31 @@ pub(super) fn array_to_string_inner(args: &[ArrayRef]) -> Result { Ok(arg) } +Dictionary(..) => { +let any_dict_array = arr +.as_any_dictionary_opt() +.ok_or_else( || { +DataFusionError::Internal( +format!("could not cast {} to AnyDictionaryArray", arr.data_type()) +) +})?; +macro_rules! array_function { +($ARRAY_TYPE:ident) => { +to_string!( +arg, +any_dict_array.values(), +&delimiter, +&null_string, +with_null_string, +$ARRAY_TYPE +) +}; +} +call_array_function!( +any_dict_array.values().data_type(), +false +) Review Comment: Why not use ```suggestion compute_array_to_string( arg, any_dict_array.values().clone(), delimiter.clone(), null_string.clone(), with_null_string, )?; Ok(arg) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Convert BitAnd, BitOr, BitXor to UDAF [datafusion]
dharanad commented on PR #10930: URL: https://github.com/apache/datafusion/pull/10930#issuecomment-2172374561 cc: @jayzhan211 I have made the changes and also updated the UT. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Convert string agg to udaf [datafusion]
lewiszlw opened a new pull request, #10945: URL: https://github.com/apache/datafusion/pull/10945 ## Which issue does this PR close? part of https://github.com/apache/datafusion/issues/8708. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]
marvinlanhenke opened a new pull request, #10946: URL: https://github.com/apache/datafusion/pull/10946 ## Which issue does this PR close? Closes #10926. ## Rationale for this change - row_group_null_counts, data_page_null_counts - data_page_row_counts return an ArrayRef instead of UInt64Array. ## What changes are included in this PR? - fixed methods to return correct array type - changed data_page_row_counts to fall back on row_page_row_counts if column is missing - added a test case in `arrow_statistics` ## Are these changes tested? Yes. ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix: StatisticsConverter `counts` for missing columns [datafusion]
marvinlanhenke commented on PR #10946: URL: https://github.com/apache/datafusion/pull/10946#issuecomment-2172417984 @alamb PTAL. I'm not sure about the assumptions I made in `data_page_row_counts` [here](https://github.com/apache/datafusion/pull/10946/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40R1053-R1057). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Convert `Hyperloglog` to UDAF [datafusion]
dharanad commented on issue #10944: URL: https://github.com/apache/datafusion/issues/10944#issuecomment-2172431668 Hey @jayzhan211 I noticed the `approx_distinct` UDAF implemented in `datafusion/functions-aggregate/src/approx_distinct.rs`. It appears to be using a `HyperLogLog` (HLL) structure defined in `datafusion/functions-aggregate/src/hyperloglog.rs`. What is the expected outcome of this issue. Could you provide some insights into the intended use of this UDAF? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org