[I] Allow fine-tune planner steps [datafusion]

2024-07-30 Thread via GitHub
waynexia opened a new issue, #11715: URL: https://github.com/apache/datafusion/issues/11715 ### Is your feature request related to a problem or challenge? The planning phase generates and optimizes both logical and physical plans. IIRC we only provide one method to "optimize and creat

[PR] doc: Update outdated spark.comet.columnar.shuffle.enabled configuration doc [datafusion-comet]

2024-07-30 Thread via GitHub
wForget opened a new pull request, #738: URL: https://github.com/apache/datafusion-comet/pull/738 ## Which issue does this PR close? Closes #737 ## Rationale for this change ## What changes are included in this PR? Use `spark.comet.exec.shuf

[PR] expose some fields on session state [datafusion]

2024-07-30 Thread via GitHub
waynexia opened a new pull request, #11716: URL: https://github.com/apache/datafusion/pull/11716 ## Which issue does this PR close? Closes https://github.com/apache/datafusion/issues/11715 ## Rationale for this change As stated in #11715, this patch implem

[I] Check saved hash first during probing bucket in hash map [datafusion]

2024-07-30 Thread via GitHub
Rachelint opened a new issue, #11717: URL: https://github.com/apache/datafusion/issues/11717 ### Is your feature request related to a problem or challenge? Now two part aggregate hash table is used in datafusion, and we actually saved the `hashes` of `groups` in the `hash table` part.

[PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint opened a new pull request, #11718: URL: https://github.com/apache/datafusion/pull/11718 ## Which issue does this PR close? Closes #11717 ## Rationale for this change See #11717 ## What changes are included in this PR? See title. ##

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2257725002 The clickbench result ``` ┏━━┳┳━━┳━━━┓ ┃ Query┃ main ┃ check-hash-first ┃Change ┃ ┡━

Re: [PR] Extract `CoalesceBatchesStream` to a struct [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11610: URL: https://github.com/apache/datafusion/pull/11610#issuecomment-2257897586 This PR is now ready for review when someone has a chance -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the U

Re: [PR] Extract `CoalesceBatchesStream` to a struct [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11610: URL: https://github.com/apache/datafusion/pull/11610#discussion_r1696642951 ## datafusion/physical-plan/src/coalesce_batches.rs: ## @@ -249,84 +270,178 @@ impl CoalesceBatchesStream { let input_batch = self.input.poll_next_unpin

Re: [PR] [Draft; Benchmark shows No change] Reuse hash [datafusion]

2024-07-30 Thread via GitHub
jayzhan211 commented on PR #11708: URL: https://github.com/apache/datafusion/pull/11708#issuecomment-2257951807 https://github.com/user-attachments/assets/2636524b-a827-483c-883a-ed6910facd4e";> -- This is an automated message from the Apache Git Service. To respond to the message, ple

Re: [I] Update ClickBench benchmarks with DataFusion 40 [datafusion]

2024-07-30 Thread via GitHub
alamb commented on issue #11567: URL: https://github.com/apache/datafusion/issues/11567#issuecomment-2257970511 I expect StringView to help as well as @korowa 's https://github.com/apache/datafusion/pull/11627 In case anyone else is interested, I did an analysis of the various benchm

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696707728 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_rows,

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
jayzhan211 commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696720506 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_r

[I] Improve parquet ListingTable speed with parquet metadata (short clickbench queries) [datafusion]

2024-07-30 Thread via GitHub
alamb opened a new issue, #11719: URL: https://github.com/apache/datafusion/issues/11719 ### Is your feature request related to a problem or challenge? I spent some time looking at the ClickBench results with DataFusion 40.0.0 https://github.com/apache/datafusion/issues/11567#i

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696734828 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696741435 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

Re: [I] Update ClickBench benchmarks with DataFusion 40 [datafusion]

2024-07-30 Thread via GitHub
alamb commented on issue #11567: URL: https://github.com/apache/datafusion/issues/11567#issuecomment-2258049992 I looked at some short queries and found one potential improvement https://github.com/apache/datafusion/issues/11719 I also looked at Q38 ```sql SELECT "URL", COUNT(*)

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696734828 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

Re: [I] Update ClickBench benchmarks with DataFusion 40 [datafusion]

2024-07-30 Thread via GitHub
alamb commented on issue #11567: URL: https://github.com/apache/datafusion/issues/11567#issuecomment-2258053998 I am pretty sure Q18 would be helped with https://github.com/apache/datafusion/issues/9403 -- maybe we'll find a way to do that shortly ```sql SELECT "UserID", extract(m

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Dandandan commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696758426 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Dandandan commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696758426 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696734828 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

Re: [PR] Initial changes to support using udaf min/max for statistics and opti… [datafusion]

2024-07-30 Thread via GitHub
edmondop commented on code in PR #11696: URL: https://github.com/apache/datafusion/pull/11696#discussion_r1696769348 ## datafusion/expr/src/udaf.rs: ## @@ -249,6 +249,14 @@ impl AggregateUDF { pub fn simplify(&self) -> Option { self.inner.simplify() } + +/

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696770853 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

Re: [PR] rfc: optional skipping partial aggregation [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11627: URL: https://github.com/apache/datafusion/pull/11627#issuecomment-225809 This seems to repeatably help TPCH sf1 queries ``` Benchmark tpch_sf1.json ┏━━┳━━━┳━━┳━━━┓

[I] `APPROX_PERCENTILE_CONT` doesn't respect nulls and gives incorrect answer [datafusion]

2024-07-30 Thread via GitHub
Dandandan opened a new issue, #11720: URL: https://github.com/apache/datafusion/issues/11720 ### Describe the bug ```sql select APPROX_PERCENTILE_CONT(v, 0.5) FROM (VALUES (1), (2), (3), (NULL), (NULL), (NULL)) as t (v); +--+ | APPROX_P

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258092675 I am running benchmarks against this PR now -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Dandandan commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696758426 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

[PR] Respect nulls in approx_percentile_cont [datafusion]

2024-07-30 Thread via GitHub
Dandandan opened a new pull request, #11721: URL: https://github.com/apache/datafusion/pull/11721 ## Which issue does this PR close? Closes #11720 ## Rationale for this change ## What changes are included in this PR? ## Are these changes tes

Re: [PR] Initial changes to support using udaf min/max for statistics and opti… [datafusion]

2024-07-30 Thread via GitHub
jayzhan211 commented on code in PR #11696: URL: https://github.com/apache/datafusion/pull/11696#discussion_r1696791166 ## datafusion/expr/src/udaf.rs: ## @@ -249,6 +249,14 @@ impl AggregateUDF { pub fn simplify(&self) -> Option { self.inner.simplify() } + +

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696791284 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_rows,

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696836354 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

Re: [PR] Reduce repetition in try_process_group_by_unnest and try_process_unnest [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11714: URL: https://github.com/apache/datafusion/pull/11714#discussion_r1696852266 ## datafusion/sql/src/utils.rs: ## @@ -287,6 +287,28 @@ pub(crate) fn value_to_string(value: &Value) -> Option { } } +pub(crate) fn transform_bottom_unnests

Re: [PR] Handle nulls in approx_percentile_cont [datafusion]

2024-07-30 Thread via GitHub
Dandandan commented on code in PR #11721: URL: https://github.com/apache/datafusion/pull/11721#discussion_r1696853665 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -104,6 +105,12 @@ impl ApproxPercentileCont { None }; +if

Re: [PR] Handle nulls in approx_percentile_cont [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11721: URL: https://github.com/apache/datafusion/pull/11721#discussion_r1696853918 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -104,6 +105,12 @@ impl ApproxPercentileCont { None }; +if args

Re: [PR] Handle nulls in approx_percentile_cont [datafusion]

2024-07-30 Thread via GitHub
Dandandan commented on code in PR #11721: URL: https://github.com/apache/datafusion/pull/11721#discussion_r1696853665 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -104,6 +105,12 @@ impl ApproxPercentileCont { None }; +if

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258204358 Interestingly my benchmarks against this PR showed more mixed results: ``` Benchmark clickbench_1.json ┏━━┳━━

[PR] Use `cargo release` in benchmarks [datafusion]

2024-07-30 Thread via GitHub
alamb opened a new pull request, #11722: URL: https://github.com/apache/datafusion/pull/11722 ## Which issue does this PR close? N/A ## Rationale for this change As we get into serious benchmarking / optimization mode, ensuring the benchmark results are as close to what

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258215227 > Interestingly my benchmarks against this PR showed more mixed results: > > ``` > > Benchmark clickbench_1.json > > ┏━━━

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
jayzhan211 commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696872197 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_r

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258228500 > @alamb Possible due to my main branch is not the latest? I am running it again in local after pulling. Thanks @Rachelint -- I am also trying again to see if I did something wr

Re: [I] CoalesceBatchesStream poll_next_inner function bug [datafusion]

2024-07-30 Thread via GitHub
alamb closed issue #1879: CoalesceBatchesStream poll_next_inner function bug URL: https://github.com/apache/datafusion/issues/1879 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment

Re: [I] CoalesceBatchesStream poll_next_inner function bug [datafusion]

2024-07-30 Thread via GitHub
alamb commented on issue #1879: URL: https://github.com/apache/datafusion/issues/1879#issuecomment-2258230649 Closing this one -- please feel free to reopen it @JasonLi-cn if you think it is still an issue -- This is an automated message from the Apache Git Service. To respond to the mes

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258232534 > > @alamb Possible due to my main branch is not the latest? I am running it again in local after pulling. > > Thanks @Rachelint -- I am also trying again to see if I did som

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258239919 > > @alamb Possible due to my main branch is not the latest? I am running it again in local after pulling. > > Thanks @Rachelint -- I am also trying again to see if I did som

Re: [PR] Initial changes to support using udaf min/max for statistics and opti… [datafusion]

2024-07-30 Thread via GitHub
jayzhan211 merged PR #11696: URL: https://github.com/apache/datafusion/pull/11696 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dat

Re: [PR] Initial changes to support using udaf min/max for statistics and opti… [datafusion]

2024-07-30 Thread via GitHub
jayzhan211 commented on PR #11696: URL: https://github.com/apache/datafusion/pull/11696#issuecomment-2258238514 Thanks @edmondop -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comme

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
jayzhan211 commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258245095 I can run the benchmark too -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

[PR] Use `StringView` by default when reading from parquet [datafusion]

2024-07-30 Thread via GitHub
alamb opened a new pull request, #11723: URL: https://github.com/apache/datafusion/pull/11723 Draft until: - [ ] get tests passing - [ ] Run benchmarks and fix any regressions - [ ] Close https://github.com/apache/datafusion/issues/10918 and file remaining follow on work for StringV

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
jayzhan211 commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258274014 `release-nonlto` mode ```rust ┏━━┳┳━━┳━━━┓ ┃ Query┃ main ┃ check-hash-first ┃

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258287047 @jayzhan211 interesting... q33 1.27x faster... Maybe it is really related to hardware... -- This is an automated message from the Apache Git Service. To respond to the message,

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
2010YOUY01 commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696934383 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_r

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
jayzhan211 commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258315701 `release mode` ```rust Benchmark clickbench_1.json ┏━━┳┳━━┳━━━┓ ┃

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258338448 @jayzhan211 Maybe hash table size should be taken into consider, too. If few collision exist, this pr added an extract u64 comparison actually :think: . -- This is an automate

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on code in PR #11718: URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696979055 ## datafusion/physical-plan/src/aggregates/group_values/row.rs: ## @@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows { batch_hashes.resize(n_ro

[PR] Enhance the formatting for Column [datafusion]

2024-07-30 Thread via GitHub
goldmedal opened a new pull request, #11724: URL: https://github.com/apache/datafusion/pull/11724 ## Which issue does this PR close? No corresponding issue. ## Rationale for this change Consider the following case: ```rust let plan = LogicalPlanBuilder::table_sc

Re: [PR] Custom planning behavior for selecting wildcard expression [datafusion]

2024-07-30 Thread via GitHub
goldmedal closed pull request #11673: Custom planning behavior for selecting wildcard expression URL: https://github.com/apache/datafusion/pull/11673 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[PR] chore: Bump DataFusion to rev 35c2e7e [datafusion-comet]

2024-07-30 Thread via GitHub
andygrove opened a new pull request, #740: URL: https://github.com/apache/datafusion-comet/pull/740 ## Which issue does this PR close? N/A ## Rationale for this change Upgrade to latest DF so that we can catch any regressions before the next release.

Re: [PR] Ease restrictions on cross-timezone comparison [datafusion]

2024-07-30 Thread via GitHub
jeffreyssmith2nd commented on code in PR #11711: URL: https://github.com/apache/datafusion/pull/11711#discussion_r1697161978 ## datafusion/expr/src/type_coercion/binary.rs: ## @@ -1725,6 +1774,27 @@ mod tests { DataType::LargeBinary ); +// Timesta

Re: [PR] Reduce repetition in try_process_group_by_unnest and try_process_unnest [datafusion]

2024-07-30 Thread via GitHub
JasonLi-cn commented on code in PR #11714: URL: https://github.com/apache/datafusion/pull/11714#discussion_r1697166685 ## datafusion/sql/src/utils.rs: ## @@ -287,6 +287,28 @@ pub(crate) fn value_to_string(value: &Value) -> Option { } } +pub(crate) fn transform_bottom_un

Re: [PR] Reduce repetition in try_process_group_by_unnest and try_process_unnest [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11714: URL: https://github.com/apache/datafusion/pull/11714#discussion_r1697170500 ## datafusion/sql/src/utils.rs: ## @@ -287,6 +287,28 @@ pub(crate) fn value_to_string(value: &Value) -> Option { } } +pub(crate) fn transform_bottom_unnests

Re: [I] Generate GroupByHash output in multiple `RecordBatch`es rather than one large one [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on issue #9562: URL: https://github.com/apache/datafusion/issues/9562#issuecomment-2258640507 Actually, I found the `slice` function is not trivial because the eager computation for the null count. One way I could think to solve it is make the computation lazy, see arr

Re: [I] Generate GroupByHash output in multiple `RecordBatch`es rather than one large one [datafusion]

2024-07-30 Thread via GitHub
Rachelint commented on issue #9562: URL: https://github.com/apache/datafusion/issues/9562#issuecomment-2258651637 Maybe I can try it and compare the effect with `lazy null count computation`. If @guojidan has not worked for this. -- This is an automated message from the Apache Git Servi

Re: [PR] Ease restrictions on cross-timezone comparison [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11711: URL: https://github.com/apache/datafusion/pull/11711#discussion_r1697213926 ## datafusion/expr/src/type_coercion/binary.rs: ## @@ -1725,6 +1774,27 @@ mod tests { DataType::LargeBinary ); +// Timestamps +

Re: [PR] Handle nulls in approx_percentile_cont [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11721: URL: https://github.com/apache/datafusion/pull/11721#discussion_r1697215919 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -104,6 +105,12 @@ impl ApproxPercentileCont { None }; +if args

Re: [I] Read Iceberg table is not support [datafusion-comet]

2024-07-30 Thread via GitHub
huaxingao commented on issue #694: URL: https://github.com/apache/datafusion-comet/issues/694#issuecomment-2258704477 @dpengpeng We still need to have a Comet binary release for Iceberg to use Comet. The iceberg integration PR has not been merged yet. I will keep you updated on this. --

Re: [PR] feat: Add GetStructField expression [datafusion-comet]

2024-07-30 Thread via GitHub
andygrove commented on code in PR #731: URL: https://github.com/apache/datafusion-comet/pull/731#discussion_r1697229086 ## native/spark-expr/src/structs.rs: ## @@ -125,3 +125,103 @@ impl PartialEq for CreateNamedStruct { .unwrap_or(false) } } + +#[derive(Debug

Re: [PR] Provide actionable error messaging due to resource exhaustion. [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11665: URL: https://github.com/apache/datafusion/pull/11665#discussion_r1697242963 ## datafusion/execution/src/memory_pool/pool.rs: ## @@ -240,6 +249,149 @@ fn insufficient_capacity_err( resources_datafusion_err!("Failed to allocate additional

[PR] WIP: Add PyExpr to_variant conversions [datafusion-python]

2024-07-30 Thread via GitHub
Michael-J-Ward opened a new pull request, #793: URL: https://github.com/apache/datafusion-python/pull/793 # Which issue does this PR close? Closes #781. # Rationale for this change The user bug report filed > When a SQL query contains a InList Expr, I can't get the In

[PR] fix: Remove castting on decimals with a small precision to decimal256 [datafusion-comet]

2024-07-30 Thread via GitHub
kazuyukitanimura opened a new pull request, #741: URL: https://github.com/apache/datafusion-comet/pull/741 ## Which issue does this PR close? Part of #670 ## Rationale for this change This PR improves the native execution performance on decimals with a small precision

Re: [PR] fix: Remove castting on decimals with a small precision to decimal256 [datafusion-comet]

2024-07-30 Thread via GitHub
kazuyukitanimura commented on PR #741: URL: https://github.com/apache/datafusion-comet/pull/741#issuecomment-2258776596 ``` OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Mac OS X 14.5 Apple M1 Max TPCDS Micro Benchmarks: Best Time(ms) Avg Time(ms) Stdev(ms)Rate

Re: [PR] fix: Remove `skip.surefire.tests` mvn property [datafusion-comet]

2024-07-30 Thread via GitHub
viirya commented on PR #739: URL: https://github.com/apache/datafusion-comet/pull/739#issuecomment-2258781288 cc @edmondop -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] fix: skip negative scale checks for creating decimals [datafusion-comet]

2024-07-30 Thread via GitHub
parthchandra commented on code in PR #723: URL: https://github.com/apache/datafusion-comet/pull/723#discussion_r1697308894 ## common/src/main/java/org/apache/comet/vector/CometVector.java: ## @@ -73,31 +86,35 @@ public boolean isFixedLength() { @Override public Decimal get

Re: [PR] fix: optimize isNullAt [datafusion-comet]

2024-07-30 Thread via GitHub
parthchandra commented on code in PR #732: URL: https://github.com/apache/datafusion-comet/pull/732#discussion_r1697327613 ## common/src/main/java/org/apache/comet/vector/CometPlainVector.java: ## @@ -153,11 +153,7 @@ public CDataDictionaryProvider getDictionaryProvider() {

Re: [PR] fix: optimize isNullAt [datafusion-comet]

2024-07-30 Thread via GitHub
kazuyukitanimura commented on PR #732: URL: https://github.com/apache/datafusion-comet/pull/732#issuecomment-2258883057 cc @andygrove @comphead @huaxingao @viirya -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the UR

Re: [PR] feat: Support count AggregateUDF for window function [datafusion-comet]

2024-07-30 Thread via GitHub
huaxingao commented on PR #736: URL: https://github.com/apache/datafusion-comet/pull/736#issuecomment-2258890024 cc @andygrove @kazuyukitanimura @viirya -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] chore: Bump DataFusion to rev 35c2e7e [datafusion-comet]

2024-07-30 Thread via GitHub
andygrove merged PR #740: URL: https://github.com/apache/datafusion-comet/pull/740 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@da

Re: [PR] Rename `input_type` --> `input_types` on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11666: URL: https://github.com/apache/datafusion/pull/11666#issuecomment-2258901715 > Sure, but it is better to remove `input_types` before next release @lewiszlw Filed https://github.com/apache/datafusion/issues/11725 to track the cleanup Thanks agin -

Re: [PR] fix: skip negative scale checks for creating decimals [datafusion-comet]

2024-07-30 Thread via GitHub
kazuyukitanimura commented on code in PR #723: URL: https://github.com/apache/datafusion-comet/pull/723#discussion_r1697361683 ## common/src/main/java/org/apache/comet/vector/CometVector.java: ## @@ -73,31 +86,35 @@ public boolean isFixedLength() { @Override public Decimal

Re: [PR] Rename `input_type` --> `input_types` on AggregateFunctionExpr / AccumulatorArgs / StateFieldsArgs [datafusion]

2024-07-30 Thread via GitHub
alamb merged PR #11666: URL: https://github.com/apache/datafusion/pull/11666 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Optimize `struct` and `named_struct` functions [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11688: URL: https://github.com/apache/datafusion/pull/11688#discussion_r1697363091 ## datafusion/functions/src/core/named_struct.rs: ## @@ -165,3 +162,73 @@ impl ScalarUDFImpl for NamedStructFunc { named_struct_expr(args) } } + +#[cf

Re: [PR] Optimize `struct` and `named_struct` functions [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11688: URL: https://github.com/apache/datafusion/pull/11688#issuecomment-2258904316 Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look -- This is an automated message from the A

Re: [PR] Change `FileSinkConfig.object_store_url` from `ObjectStoreUrl` to `Url` [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11705: URL: https://github.com/apache/datafusion/pull/11705#issuecomment-2258904995 Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look -- This is an automated message from the A

Re: [I] Expression Simplifier doesn't consider associativity (`(i + 1) + 2)` is not simplified to `i + 3`) [datafusion]

2024-07-30 Thread via GitHub
tinfoil-knight commented on issue #11594: URL: https://github.com/apache/datafusion/issues/11594#issuecomment-2258914588 DuckDB v1.0.0 (https://shell.duckdb.org/) behaves the same as Apache Datafusion. Doesn't consider associativity in expressions with a column reference. ``` duck

Re: [PR] fix: Remove `skip.surefire.tests` mvn property [datafusion-comet]

2024-07-30 Thread via GitHub
edmondop commented on PR #739: URL: https://github.com/apache/datafusion-comet/pull/739#issuecomment-2258921658 I am not sure what's the problem, can you see the original comment from @andygrove on the merged Pr? -- This is an automated message from the Apache Git Service. To respond to t

Re: [PR] fix: optimize isNullAt [datafusion-comet]

2024-07-30 Thread via GitHub
viirya commented on code in PR #732: URL: https://github.com/apache/datafusion-comet/pull/732#discussion_r1697381153 ## common/src/main/java/org/apache/comet/vector/CometPlainVector.java: ## @@ -153,11 +153,7 @@ public CDataDictionaryProvider getDictionaryProvider() { @Over

Re: [PR] Fix #11692: Improve doc comments within macros [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11694: URL: https://github.com/apache/datafusion/pull/11694#issuecomment-2258929222 I played around with this PR a bit (as I think docs are quite important) -- thanks @Rafferty97 I found I could fix the link in the docs like this: ```rust #[doc =

[I] `regr_count` should return an int [datafusion]

2024-07-30 Thread via GitHub
Michael-J-Ward opened a new issue, #11726: URL: https://github.com/apache/datafusion/issues/11726 ### Describe the bug `regr_count` currently returns a `Float64`. ### To Reproduce ``` DataFusion CLI v40.0.0 > select regr_count(1, 1); +

[PR] Minor: Add example for `ScalarUDF::call` [datafusion]

2024-07-30 Thread via GitHub
alamb opened a new pull request, #11727: URL: https://github.com/apache/datafusion/pull/11727 ## Which issue does this PR close? N/A ## Rationale for this change While reviewing https://github.com/apache/datafusion/pull/11694 I found that this function could be better docume

Re: [PR] fix: optimize isNullAt [datafusion-comet]

2024-07-30 Thread via GitHub
andygrove merged PR #732: URL: https://github.com/apache/datafusion-comet/pull/732 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@da

[I] doc warnings about private items [datafusion]

2024-07-30 Thread via GitHub
alamb opened a new issue, #11728: URL: https://github.com/apache/datafusion/issues/11728 ### Describe the bug When I run `cargo doc` locally I get a bunch of warnings ### To Reproduce ```shell cargo doc --document-private-items --no-deps --workspace ``` War

[PR] Fix documentation warnings, make CsvExecBuilder and Unparsed pub [datafusion]

2024-07-30 Thread via GitHub
alamb opened a new pull request, #11729: URL: https://github.com/apache/datafusion/pull/11729 ## Which issue does this PR close? Closes https://github.com/apache/datafusion/issues/11728 ## Rationale for this change Warnings in `cargo doc` sometimes mask real bugs as well

Re: [PR] Fix documentation warnings, make CsvExecBuilder and Unparsed pub [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11729: URL: https://github.com/apache/datafusion/pull/11729#discussion_r1697415298 ## .github/workflows/rust.yml: ## @@ -234,7 +234,7 @@ jobs: rust-version: stable - name: Run cargo doc run: | - export RUSTDOCFLA

Re: [PR] Fix documentation warnings, make CsvExecBuilder and Unparsed pub [datafusion]

2024-07-30 Thread via GitHub
alamb commented on code in PR #11729: URL: https://github.com/apache/datafusion/pull/11729#discussion_r1697416102 ## datafusion/core/src/datasource/physical_plan/parquet/mod.rs: ## @@ -191,9 +191,9 @@ pub use writer::plan_to_parquet; /// # Execution Overview /// /// * Step 1:

Re: [PR] Check hashes first during probing the aggr hash table [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11718: URL: https://github.com/apache/datafusion/pull/11718#issuecomment-2258989626 Here is another run showing improvement: Details ``` Comparing main_base and check-hash-first Benchmark clickbench_1.json -

Re: [I] Blog post with DataFusion July - Sep 2024 [datafusion]

2024-07-30 Thread via GitHub
Blizzara commented on issue #11631: URL: https://github.com/apache/datafusion/issues/11631#issuecomment-2258994326 @alamb for Substrait - maybe the work @Lordworms has been doing to add the TPC-H tests would be good at least? From my side, I don't know if there's any precise milestone as su

Re: [PR] Use upstream `StatisticsConverter` from arrow-rs in DataFusion [datafusion]

2024-07-30 Thread via GitHub
alamb commented on PR #11479: URL: https://github.com/apache/datafusion/pull/11479#issuecomment-2259011724 @Ted-Jiang I wonder if you have a moment to review this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use th

[PR] fix: regr_count now returns Uint64 [datafusion]

2024-07-30 Thread via GitHub
Michael-J-Ward opened a new pull request, #11731: URL: https://github.com/apache/datafusion/pull/11731 ## Which issue does this PR close? Closes #11726. ## Rationale for this change Consistent with `count` expectations and postgres, `regr_count` aggregate functio

Re: [PR] Handle nulls in approx_percentile_cont [datafusion]

2024-07-30 Thread via GitHub
Dandandan commented on code in PR #11721: URL: https://github.com/apache/datafusion/pull/11721#discussion_r1697459557 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -104,6 +105,12 @@ impl ApproxPercentileCont { None }; +if

Re: [PR] Handle nulls in approx_percentile_cont [datafusion]

2024-07-30 Thread via GitHub
Dandandan commented on code in PR #11721: URL: https://github.com/apache/datafusion/pull/11721#discussion_r1697460058 ## datafusion/functions-aggregate/src/approx_percentile_cont.rs: ## @@ -104,6 +105,12 @@ impl ApproxPercentileCont { None }; +if

Re: [PR] Handle nulls in approx_percentile_cont [datafusion]

2024-07-30 Thread via GitHub
Dandandan commented on code in PR #11721: URL: https://github.com/apache/datafusion/pull/11721#discussion_r1697462050 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -1237,6 +1237,12 @@ SELECT (ABS(1 - CAST(approx_percentile_cont(c11, 0.9) AS DOUBLE) / 0.834) < 0.05

  1   2   >