Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-15 Thread via GitHub
berkaysynnada commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2808569749 > @berkaysynnada any luck? I can take a look tomorrow but don't want to duplicate effort I was interrupted, sorry. I'll send my part in a few hours, and ping you --

Re: [I] [DISCUSS] Add open table format support. [datafusion-ballista]

2025-04-15 Thread via GitHub
milenkovicm commented on issue #1241: URL: https://github.com/apache/datafusion-ballista/issues/1241#issuecomment-2808534387 I agree that adding support for popular table formats would be a valuable addition to Ballista, but I don't believe Ballista is the main hurdle in achieving that sup

[I] Maybe session memory leak [datafusion-ballista]

2025-04-15 Thread via GitHub
mmooyyii opened a new issue, #1242: URL: https://github.com/apache/datafusion-ballista/issues/1242 **Describe the bug** add a print at InMemoryJobState.create_session ```rust ballista/scheduler/src/cluster/memory.rs line 427 async fn create_session( &self,

[PR] Update version to 47.0.0, add CHANGELOG [datafusion]

2025-04-15 Thread via GitHub
xudong963 opened a new pull request, #15731: URL: https://github.com/apache/datafusion/pull/15731 ## Which issue does this PR close? - Part of https://github.com/apache/datafusion/issues/15072 ## Rationale for this change ## What changes are included in th

Re: [PR] Add support for `PRINT` statement for SQL Server [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
iffyio commented on code in PR #1811: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1811#discussion_r2046111411 ## tests/sqlparser_mssql.rs: ## @@ -2053,3 +2053,37 @@ fn parse_drop_trigger() { } ); } + +#[test] +fn parse_print() { +let print_str

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
iffyio commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2046107175 ## src/parser/mod.rs: ## @@ -5135,6 +5146,63 @@ impl<'a> Parser<'a> { })) } +/// Parse `CREATE FUNCTION` for [SQL Server] +/// +

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
iffyio commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2046079970 ## src/ast/spans.rs: ## @@ -777,11 +778,9 @@ impl Spanned for ConditionalStatements { ConditionalStatements::Sequence { statements } => {

Re: [PR] feat: add multi level merge sort that will always fit in memory [datafusion]

2025-04-15 Thread via GitHub
2010YOUY01 commented on code in PR #15700: URL: https://github.com/apache/datafusion/pull/15700#discussion_r2046041614 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -431,12 +422,16 @@ impl ExternalSorter { let batches_to_spill = std::mem::take(globally_sorted_bat

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-15 Thread via GitHub
2010YOUY01 commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2808240189 > > Benchmark results: (I think there is no significant regression for an extra round of re-spill, if it's running on a machine with fast SSDs) > > It seems to me that there

[PR] Minor: remove unused logic for limit pushdown [datafusion]

2025-04-15 Thread via GitHub
zhuqi-lucas opened a new pull request, #15730: URL: https://github.com/apache/datafusion/pull/15730 ## Which issue does this PR close? This PR removes the unused logic. Why is it unused? Because now we have ## Rationale for this change Because n

[I] [DISCUSS] Add open table format support. [datafusion-ballista]

2025-04-15 Thread via GitHub
liurenjie1024 opened a new issue, #1241: URL: https://github.com/apache/datafusion-ballista/issues/1241 Currently, popular open table formats such as Iceberg and Delta Lake all have Rust implementations. Should we consider adding native support for them in Ballista so that users could access thes

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub
kosiew commented on code in PR #15694: URL: https://github.com/apache/datafusion/pull/15694#discussion_r2045951368 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -811,58 +822,164 @@ impl BinaryExpr { } } +enum ShortCircuitStrategy<'a> { +None, +Retu

Re: [PR] fix: miss output ordering during projection [datafusion]

2025-04-15 Thread via GitHub
xudong963 commented on PR #15683: URL: https://github.com/apache/datafusion/pull/15683#issuecomment-2808130612 Also, I noticed the `output_ordering` here: https://github.com/apache/datafusion/blob/main/datafusion/datasource/src/file_scan_config.rs#L807 isn't checked. Maybe we can ext

Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-04-15 Thread via GitHub
zhuqi-lucas commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2808117814 > 🤖: Benchmark completed > > Details > > ``` > Comparing HEAD and concat_batches_for_sort > > Benchmark clickbench_1.json > -

Re: [PR] Perf: Support automatically concat_batches for sort which will improve performance [datafusion]

2025-04-15 Thread via GitHub
zhuqi-lucas commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2808114536 > > I think the most efficient way would be to sort the indices to the arrays in one step followed by interleave, without either concat or sort followed by merge which would bene

Re: [D] DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? [datafusion]

2025-04-15 Thread via GitHub
GitHub user phillipleblanc added a comment to the discussion: DISCUSSION: Anyone around for the Databricks Data & AI Summit in San Francisco June 9–12? That time works for me, but ideally we could meet somewhere closer to the venue (Moscone Center). Perhaps we could meet in the conference lobb

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub
acking-you commented on code in PR #15694: URL: https://github.com/apache/datafusion/pull/15694#discussion_r2045899326 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -811,58 +822,199 @@ impl BinaryExpr { } } +enum ShortCircuitStrategy<'a> { +None, +

[PR] Avoid computing unnecessary statistics [datafusion]

2025-04-15 Thread via GitHub
xudong963 opened a new pull request, #15729: URL: https://github.com/apache/datafusion/pull/15729 ## Which issue does this PR close? - Follow up https://github.com/apache/datafusion/pull/15671 ## Rationale for this change ## What changes are includ

Re: [I] When `datafusion.execution.parquet.coerce_int96` is set, timestamp type is still reported as Timestamp(nanoseconds) [datafusion]

2025-04-15 Thread via GitHub
andygrove commented on issue #15721: URL: https://github.com/apache/datafusion/issues/15721#issuecomment-2808043188 Thanks @alamb. @mbutrovich fyi -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] feat: add `with_group_indices_order_mode` function for `GroupsAccumulator` to help create specialized impl [datafusion]

2025-04-15 Thread via GitHub
jayzhan211 commented on code in PR #15022: URL: https://github.com/apache/datafusion/pull/15022#discussion_r2045819769 ## datafusion/functions-aggregate-common/src/aggregate/groups_accumulator.rs: ## @@ -163,6 +177,50 @@ impl GroupsAccumulatorAdapter { /// invokes f(accum

Re: [PR] Refactor regexp slt tests [datafusion]

2025-04-15 Thread via GitHub
comphead commented on PR #15709: URL: https://github.com/apache/datafusion/pull/15709#issuecomment-2807835159 > @comphead my understanding is that when a test file includes another file using the `include` directive (like `include ./init_data.slt.part`), the sqllogictest runner would pr

Re: [PR] chore: correct name of pipelines for native_datafusion ci workflow [datafusion-comet]

2025-04-15 Thread via GitHub
codecov-commenter commented on PR #1653: URL: https://github.com/apache/datafusion-comet/pull/1653#issuecomment-2807808240 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1653?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] feat: transfer Apache Spark runtime conf to native engine [datafusion-comet]

2025-04-15 Thread via GitHub
comphead commented on PR #1649: URL: https://github.com/apache/datafusion-comet/pull/1649#issuecomment-2807790879 Some parts were rolled back in #1101

Re: [PR] feat: Emit warning with Diagnostic when doing = Null [datafusion]

2025-04-15 Thread via GitHub
changsun20 commented on PR #15696: URL: https://github.com/apache/datafusion/pull/15696#issuecomment-2807774575 > Thanks @changsun20 if I understood correctly #14434 is for emitting events for the users, the same way it is done for Errors, but without halting the query. Thank you for

[PR] chore: correct name of pipelines for native_datafusion ci workflow [datafusion-comet]

2025-04-15 Thread via GitHub
parthchandra opened a new pull request, #1653: URL: https://github.com/apache/datafusion-comet/pull/1653 The pipelines in the native_datafusion ci workflow were confusingly named native_iceberg_compat

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub
alamb commented on code in PR #15694: URL: https://github.com/apache/datafusion/pull/15694#discussion_r2045649141 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -811,58 +822,199 @@ impl BinaryExpr { } } +enum ShortCircuitStrategy<'a> { +None, +Retur

Re: [PR] Add support for parenthesized subquery as `IN` predicate [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
adamchainz commented on PR #1793: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1793#issuecomment-2807691385 ![image](https://github.com/user-attachments/assets/826c335e-53c1-462f-9fc1-232fa4606994) choo choo!

Re: [PR] Attach Diagnostic to syntax errors [datafusion]

2025-04-15 Thread via GitHub
alamb commented on code in PR #15680: URL: https://github.com/apache/datafusion/pull/15680#discussion_r2045643084 ## datafusion/sql/src/parser.rs: ## @@ -561,17 +577,13 @@ impl<'a> DFParser<'a> { if token == Token::EOF || token == Token::SemiColon {

Re: [PR] Add support for parenthesized subquery as `IN` predicate [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
alamb commented on PR #1793: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1793#issuecomment-2807686630 The code train keeps on moving in this repo. Very impressive

Re: [PR] Add support for `GO` batch delimiter in SQL Server [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
aharpervc commented on code in PR #1809: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1809#discussion_r2045640591 ## src/parser/mod.rs: ## @@ -15058,6 +15069,57 @@ impl<'a> Parser<'a> { } } +/// Parse [Statement::Go] +fn parse_go(&mut self

Re: [PR] Refactor regexp slt tests [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15709: URL: https://github.com/apache/datafusion/pull/15709#issuecomment-2807670836 > Seems like most recent pipeline failure is probably related to #15725. Waiting for this to merge. Merged!

Re: [PR] Attach Diagnostic to syntax errors [datafusion]

2025-04-15 Thread via GitHub
logan-keede commented on code in PR #15680: URL: https://github.com/apache/datafusion/pull/15680#discussion_r2045629257 ## datafusion/sql/src/parser.rs: ## @@ -561,17 +577,13 @@ impl<'a> DFParser<'a> { if token == Token::EOF || token == Token::SemiColon {

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-15 Thread via GitHub
rluvaton commented on code in PR #15610: URL: https://github.com/apache/datafusion/pull/15610#discussion_r2045620030 ## datafusion/common/src/config.rs: ## @@ -337,6 +337,13 @@ config_namespace! { /// batches and merged. pub sort_in_place_threshold_bytes: usize

Re: [PR] Enable setting default values for target_partitions and planning_concurrency [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15712: URL: https://github.com/apache/datafusion/pull/15712#issuecomment-2807645196 Merged up from main to try and get a clean CI run

Re: [PR] Attach Diagnostic to syntax errors [datafusion]

2025-04-15 Thread via GitHub
alamb commented on code in PR #15680: URL: https://github.com/apache/datafusion/pull/15680#discussion_r2045605667 ## datafusion/sql/src/parser.rs: ## @@ -356,9 +355,12 @@ impl<'a> DFParserBuilder<'a> { self } -pub fn build(self) -> Result, ParserError> { +

Re: [PR] feat: add multi level merge sort that will always fit in memory [datafusion]

2025-04-15 Thread via GitHub
rluvaton commented on code in PR #15700: URL: https://github.com/apache/datafusion/pull/15700#discussion_r2045582560 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -431,12 +422,16 @@ impl ExternalSorter { let batches_to_spill = std::mem::take(globally_sorted_batch

Re: [PR] Attach Diagnostic to syntax errors [datafusion]

2025-04-15 Thread via GitHub
logan-keede commented on code in PR #15680: URL: https://github.com/apache/datafusion/pull/15680#discussion_r2045577390 ## datafusion/sql/tests/cases/diagnostic.rs: ## @@ -20,16 +20,18 @@ use insta::assert_snapshot; use std::{collections::HashMap, sync::Arc}; use datafusion_

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-15 Thread via GitHub
rluvaton commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2807586373 Tested my fuzz tests with this PR and all of them are currently failing

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
aharpervc commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2045572622 ## src/ast/mod.rs: ## @@ -4050,6 +4051,13 @@ pub enum Statement { arguments: Vec, options: Vec, }, +/// Return (Mssql) +

Re: [PR] Add support for parenthesized subquery as `IN` predicate [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
adamchainz commented on PR #1793: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1793#issuecomment-2807571169 Yay, thank you!

[PR] Return field instead of datatype for `return_type_from_args` [datafusion]

2025-04-15 Thread via GitHub
timsaucer opened a new pull request, #15728: URL: https://github.com/apache/datafusion/pull/15728 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes teste

Re: [I] We do not respect ignoreNulls in first_value / last_value aggregates [datafusion-comet]

2025-04-15 Thread via GitHub
andygrove commented on issue #1630: URL: https://github.com/apache/datafusion-comet/issues/1630#issuecomment-2807519498 > I will take a look at this one. Thanks @anuragmantri. There is a closely related issue https://github.com/apache/datafusion-comet/issues/1646 that you may want to

Re: [I] CI is failing on main: error: the lock file /__w/datafusion/datafusion/Cargo.lock needs to be updated [datafusion]

2025-04-15 Thread via GitHub
alamb closed issue #15724: CI is failing on main: error: the lock file /__w/datafusion/datafusion/Cargo.lock needs to be updated URL: https://github.com/apache/datafusion/issues/15724

Re: [PR] Update checked in Cargo.lock file to get clean CI [datafusion]

2025-04-15 Thread via GitHub
alamb merged PR #15725: URL: https://github.com/apache/datafusion/pull/15725

Re: [PR] Update checked in Cargo.lock file to get clean CI [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15725: URL: https://github.com/apache/datafusion/pull/15725#issuecomment-2807462466 Thanks @Dandandan

Re: [I] We do not respect ignoreNulls in first_value / last_value aggregates [datafusion-comet]

2025-04-15 Thread via GitHub
anuragmantri commented on issue #1630: URL: https://github.com/apache/datafusion-comet/issues/1630#issuecomment-2807456576 I will take a look at this one.

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
aharpervc commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2045450414 ## src/parser/mod.rs: ## @@ -5135,6 +5142,69 @@ impl<'a> Parser<'a> { })) } +/// Parse `CREATE FUNCTION` for [MsSql] +/// +

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
aharpervc commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2045437472 ## src/parser/mod.rs: ## @@ -5135,6 +5142,69 @@ impl<'a> Parser<'a> { })) } +/// Parse `CREATE FUNCTION` for [MsSql] +/// +

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
aharpervc commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2045435742 ## src/ast/mod.rs: ## @@ -8368,6 +8387,22 @@ pub enum CreateFunctionBody { /// /// [BigQuery]: https://cloud.google.com/bigquery/docs/refe

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-15 Thread via GitHub
rluvaton commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2807435301 > > Thank you, can you please take the fuzz test that I created in my pr and add it to yours, making sure it will pass (it will require you updating row_hash.rs file > > @rlu

[PR] test: add fuzz test for doing aggregation with larger than memory groups [datafusion]

2025-04-15 Thread via GitHub
rluvaton opened a new pull request, #15727: URL: https://github.com/apache/datafusion/pull/15727 ## Which issue does this PR close? N/A ## Rationale for this change Adding failing tests to show the current problem with aggregation when there is a need for spilling and we

Re: [PR] Refactor regexp slt tests [datafusion]

2025-04-15 Thread via GitHub
kumarlokesh commented on PR #15709: URL: https://github.com/apache/datafusion/pull/15709#issuecomment-2807425699 Seems like the most recent pipeline failure is probably related to https://github.com/apache/datafusion/pull/15725. Waiting for this to merge.

Re: [PR] Refactor regexp slt tests [datafusion]

2025-04-15 Thread via GitHub
kumarlokesh commented on PR #15709: URL: https://github.com/apache/datafusion/pull/15709#issuecomment-2807420910 > thanks @kumarlokesh I like it. However my understanding was sqllogictest runner starts all files in parallel? do we have a guarantee data will be initiated before tests run?

[PR] Coerce and simplify FixedSizeBinary equality to literal binary [datafusion]

2025-04-15 Thread via GitHub
leoyvens opened a new pull request, #15726: URL: https://github.com/apache/datafusion/pull/15726 ## Which issue does this PR close? - Closes #15686. - Alternative to #15687. ## Rationale for this change See https://github.com/apache/datafusion/issues/15686#issuecommen

Re: [PR] Refactor regexp slt tests [datafusion]

2025-04-15 Thread via GitHub
kumarlokesh commented on PR #15709: URL: https://github.com/apache/datafusion/pull/15709#issuecomment-2807387628 > Beyond the addition of null into the test data as mentioned by @goldmedal I think this is a nice and clean refactor - thank you @kumarlokesh @Omega359 @goldmedal addresse

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2807288470 🤖: Benchmark completed Details ``` Comparing HEAD and improve_topk Benchmark clickbench_extended.json ┏

Re: [PR] Add `CREATE FUNCTION` support for SQL Server [datafusion-sqlparser-rs]

2025-04-15 Thread via GitHub
aharpervc commented on code in PR #1808: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1808#discussion_r2045317219 ## src/parser/mod.rs: ## @@ -15017,6 +15075,13 @@ impl<'a> Parser<'a> { } } +fn parse_return(&mut self) -> Result { +let

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2807182655 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2807179626 I'll run it again to see if the results are repeatable (for the queries that run in very short times, I think the benchmarks are noisy)

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2807176187 > Would maybe nice to have the benchmarking code to automatically apply the changes onto master to avoid this? My [benchmarking script](https://github.com/alamb/datafusion-bench

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2807165302 > Thank you, can you please take the fuzz test that I created in my pr and add it to yours, making sure it will pass (it will require you updating row_hash.rs file @rluvaton

Re: [PR] Cascaded spill merge and re-spill [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2807165186 > Benchmark results: (I think there is no significant regression for an extra round of re-spill, if it's running on a machine with fast SSDs) It seems to me that there is a 30% r

Re: [PR] Enable setting default values for target_partitions and planning_concurrency [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15712: URL: https://github.com/apache/datafusion/pull/15712#issuecomment-2807154053 CI failure seems unrelated - https://github.com/apache/datafusion/issues/15724

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-15 Thread via GitHub
Dandandan commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2807153175 This slowdown in the queries probably is because the branch didn't have the upgrade to arrow 55 in it https://github.com/apache/datafusion/pull/15466 > 🤖: Benchmark completed

Re: [PR] doc : update RepartitionExec display tree [datafusion]

2025-04-15 Thread via GitHub
alamb merged PR #15710: URL: https://github.com/apache/datafusion/pull/15710

Re: [PR] Fix internal error in sort when hitting memory limit [datafusion]

2025-04-15 Thread via GitHub
alamb merged PR #15692: URL: https://github.com/apache/datafusion/pull/15692

Re: [I] Internal error in ExternalSorter when running with memory limit [datafusion]

2025-04-15 Thread via GitHub
alamb closed issue #15675: Internal error in ExternalSorter when running with memory limit URL: https://github.com/apache/datafusion/issues/15675

Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]

2025-04-15 Thread via GitHub
Dandandan commented on issue #15720: URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2807141666 FYI: It probably involves some hackery or redesign as a `RowValues` currently takes an owned `Rows`.

Re: [PR] Replace `return_type` and `return_type_with_args` with `return_field` [datafusion]

2025-04-15 Thread via GitHub
timsaucer closed pull request #15722: Replace `return_type` and `return_type_with_args` with `return_field` URL: https://github.com/apache/datafusion/pull/15722

Re: [PR] Set DataFusion runtime configurations through SQL interface [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15594: URL: https://github.com/apache/datafusion/pull/15594#issuecomment-2807139683 Thanks -- I recommend we merge this after the 47.0.0 release is created (likely tomorrow)

Re: [I] CI is failing on main: error: the lock file /__w/datafusion/datafusion/Cargo.lock needs to be updated [datafusion]

2025-04-15 Thread via GitHub
alamb commented on issue #15724: URL: https://github.com/apache/datafusion/issues/15724#issuecomment-2807136484 This appears to have started after the following PR for some reason - https://github.com/apache/datafusion/pull/15703

[I] CI is failing on main: error: the lock file /__w/datafusion/datafusion/Cargo.lock needs to be updated [datafusion]

2025-04-15 Thread via GitHub
alamb opened a new issue, #15724: URL: https://github.com/apache/datafusion/issues/15724 ### Describe the bug CI is failing on main: error: the lock file /__w/datafusion/datafusion/Cargo.lock needs to be updated ### To Reproduce https://github.com/apache/datafusion/actio

[PR] Update checked in Cargo.lock file to get clean CI [datafusion]

2025-04-15 Thread via GitHub
alamb opened a new pull request, #15725: URL: https://github.com/apache/datafusion/pull/15725 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/15724 ## Rationale for this change Get a clean CI run ## What changes are i

Re: [I] When `datafusion.execution.parquet.coerce_int96` is set, timestamp type is still reported as Timestamp(nanoseconds) [datafusion]

2025-04-15 Thread via GitHub
alamb commented on issue #15721: URL: https://github.com/apache/datafusion/issues/15721#issuecomment-2807128903 - FWIW I don't think this is a blocker for #15072

Re: [PR] Add coerce int96 option for Parquet to support different TimeUnits, test int96_from_spark.parquet from parquet-testing [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15537: URL: https://github.com/apache/datafusion/pull/15537#issuecomment-2807126281 I created a PR with some SLT tests for this feature - https://github.com/apache/datafusion/pull/15723

[PR] Add slt tests for `datafusion.execution.parquet.coerce_int96` setting [datafusion]

2025-04-15 Thread via GitHub
alamb opened a new pull request, #15723: URL: https://github.com/apache/datafusion/pull/15723 ## Which issue does this PR close? - Follow on to https://github.com/apache/datafusion/pull/15537 ## Rationale for this change Testing a feature using slt tests ensures i

Re: [PR] Add coerce int96 option for Parquet to support different TimeUnits, test int96_from_spark.parquet from parquet-testing [datafusion]

2025-04-15 Thread via GitHub
alamb commented on code in PR #15537: URL: https://github.com/apache/datafusion/pull/15537#discussion_r2045205455 ## datafusion/sqllogictest/test_files/information_schema.slt: ## @@ -296,6 +297,7 @@ datafusion.execution.parquet.bloom_filter_fpp NULL (writing) Sets bloom filter

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2807124292 🤖: Benchmark completed Details ``` Comparing HEAD and improve_topk Benchmark clickbench_extended.json ┏

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2807123847 > 🤖: Benchmark completed > > Details > > > > ``` > Comparing HEAD and short-and-optmize > > Benchmark clickbench_extended.json

[PR] chore: update python deps to 45 [datafusion-ballista]

2025-04-15 Thread via GitHub
milenkovicm opened a new pull request, #1240: URL: https://github.com/apache/datafusion-ballista/pull/1240 # Which issue does this PR close? Closes #. # Rationale for this change Update python deps to 45 # What changes are included in this PR? # Are there a

[PR] Replace `return_type` and `return_type_with_args` with `return_field` [datafusion]

2025-04-15 Thread via GitHub
timsaucer opened a new pull request, #15722: URL: https://github.com/apache/datafusion/pull/15722 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes teste

[I] When `datafusion.execution.parquet.coerce_int96` is set, timestamp type is still reported as Timestamp(nanoseconds) [datafusion]

2025-04-15 Thread via GitHub
alamb opened a new issue, #15721: URL: https://github.com/apache/datafusion/issues/15721 ### Describe the bug `datafusion.execution.parquet.coerce_int96` is supposed to > If true, parquet reader will read columns of physical type int96 as originating from a different resoluti

Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]

2025-04-15 Thread via GitHub
acking-you commented on issue #15720: URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2807093566 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [I] Reuse Rows allocation in SortPreservingMergeStream / `RowCursorStream` [datafusion]

2025-04-15 Thread via GitHub
acking-you commented on issue #15720: URL: https://github.com/apache/datafusion/issues/15720#issuecomment-2807093065 It looks pretty interesting. I'll give it a try after I wake up tomorrow.

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-15 Thread via GitHub
Dandandan commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2807089262 > I think this would be nicer (and tie in better with future work 😉) if we essentially followed the structure of #15301 but do the filtering in `TopK` or `SortExec`: > > 1.

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-15 Thread via GitHub
adriangb commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2807089857 > I'm working on the failures now @berkaysynnada any luck? I can take a look tomorrow but don't want to duplicate effort

[I] Reuse Rows allocation in SortPreservingMergeExec [datafusion]

2025-04-15 Thread via GitHub
Dandandan opened a new issue, #15720: URL: https://github.com/apache/datafusion/issues/15720 ### Is your feature request related to a problem or challenge? While reviewing our Sort code, I found `Rows` is being allocated within `RowCursorStream` for each batch (via `RowConverter::conv

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2807039992 🤖: Benchmark completed Details ``` Comparing HEAD and short-and-optmize Benchmark clickbench_extended.json

Re: [PR] Optimize TopK with threshold filter ~1.4x speedup [datafusion]

2025-04-15 Thread via GitHub
alamb commented on PR #15697: URL: https://github.com/apache/datafusion/pull/15697#issuecomment-2807040133 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_

Re: [I] get error value if timestamp represented by the INT96 in the parquet file [datafusion]

2025-04-15 Thread via GitHub
alamb commented on issue #9981: URL: https://github.com/apache/datafusion/issues/9981#issuecomment-2807039266 I believe this is now closed with the following PR from @mbutrovich - https://github.com/apache/datafusion/pull/15537

Re: [I] get error value if timestamp represented by the INT96 in the parquet file [datafusion]

2025-04-15 Thread via GitHub
alamb closed issue #9981: get error value if timestamp represented by the INT96 in the parquet file URL: https://github.com/apache/datafusion/issues/9981

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-15 Thread via GitHub
alamb commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2807027482 > @alamb I am hoping that we can merge https://github.com/apache/datafusion/pull/15537 for this release. It was just rebased now that the arrow-rs upgrade is merged. > Tha

Re: [PR] STRING_AGG missing functionality [datafusion]

2025-04-15 Thread via GitHub
alamb merged PR #14412: URL: https://github.com/apache/datafusion/pull/14412

Re: [I] Implement `distinct` and `order by` clause for `string_agg` aggregate function [datafusion]

2025-04-15 Thread via GitHub
alamb closed issue #8260: Implement `distinct` and `order by` clause for `string_agg` aggregate function URL: https://github.com/apache/datafusion/issues/8260

Re: [PR] chore(deps): bump flate2 from 1.1.0 to 1.1.1 [datafusion]

2025-04-15 Thread via GitHub
comphead merged PR #15703: URL: https://github.com/apache/datafusion/pull/15703

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub
Dandandan commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2800863670 Very cool. It would be nice to run some e2e benchmarks (TPC-H, clickbench) with this to see the impact here.

Re: [PR] Apply pre-selection and computation skipping to short-circuit optimization [datafusion]

2025-04-15 Thread via GitHub
acking-you commented on PR #15694: URL: https://github.com/apache/datafusion/pull/15694#issuecomment-2801015314 > | However, one point needs to be confirmed: [filter_record_batch](https://docs.rs/arrow-select/54.2.1/src/arrow_select/filter.rs.html#202-205) will retain rows that are null.

[I] Partitioned `ListingTable` read files after logical plan ser/de [datafusion]

2025-04-15 Thread via GitHub
milenkovicm opened a new issue, #15718: URL: https://github.com/apache/datafusion/issues/15718 ### Describe the bug Partitioned `ListingTable` logical plan round trip fails to produce valid schema after deserialisation ### To Reproduce ```rust let session_sta

Re: [PR] User defined window functions blog post [datafusion-site]

2025-04-15 Thread via GitHub
getChan commented on code in PR #66: URL: https://github.com/apache/datafusion-site/pull/66#discussion_r2045138173 ## content/blog/2025-04-17-user-defined-window-functions.md: ## @@ -0,0 +1,427 @@ +--- +layout: post +title: User defined Window Functions in DataFusion +date: 202
