Re: [PR] feat: Make parquet_encryption a non-default feature [datafusion]

2025-08-14 Thread via GitHub
miroim commented on PR #17137: URL: https://github.com/apache/datafusion/pull/17137#issuecomment-3190751439 Should we also enable `parquet_encryption` for the `cargo test (macos-aarch64)` here? https://github.com/apache/datafusion/blob/5c370fa620eb05d07ad9ef70b5a8a959c46cefe6/.github/wor

Re: [PR] feat: implement_ansi_eval_mode_arithmetic [datafusion-comet]

2025-08-14 Thread via GitHub
coderfender commented on PR #2136: URL: https://github.com/apache/datafusion-comet/pull/2136#issuecomment-3190700221 There is a test failure in Spark 4.0 where division operation is failing (as expected in ANSI mode) -- This is an automated message from the Apache Git Service. To respon

Re: [PR] feat: implement QUALIFY clause [datafusion]

2025-08-14 Thread via GitHub
haohuaijin commented on PR #16933: URL: https://github.com/apache/datafusion/pull/16933#issuecomment-3190663166 Thanks @alamb @jonahgao for the reviews. I added documentation similar to that for the `HAVING` clause.

[PR] Fix dynamic filter pushdown in HashJoinExec::swap_inputs [datafusion]

2025-08-14 Thread via GitHub
adriangb opened a new pull request, #17201: URL: https://github.com/apache/datafusion/pull/17201 Fixes https://github.com/apache/datafusion/issues/17196

Re: [PR] Fix HashJoinExec sideways information passing for partitioned queries [datafusion]

2025-08-14 Thread via GitHub
jonathanc-n commented on PR #17197: URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3190633209 It's a bit late for me so I'll take a look at the code tomorrow. It'd be interesting if we can somehow have partition-aware expressions, even though it doesn't quite make sense, m

Re: [PR] Fix HashJoinExec sideways information passing for partitioned queries [datafusion]

2025-08-14 Thread via GitHub
jonathanc-n commented on PR #17197: URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3190624755 In the case of hash join spilling this might be a bit difficult. I'm planning on putting out a proposal for hash join spilling in the next few days. To give you a quick rundown t

Re: [PR] Fix HashJoinExec sideways information passing for partitioned queries [datafusion]

2025-08-14 Thread via GitHub
adriangb commented on PR #17197: URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3190521073 @nuno-faria @jonathanc-n would you mind reviewing this PR?

Re: [PR] Testing: Try test optimize performance for coalesce [datafusion]

2025-08-14 Thread via GitHub
zhuqi-lucas commented on PR #17193: URL: https://github.com/apache/datafusion/pull/17193#issuecomment-3190510145 > 🤖: Benchmark completed > > Details > > ``` > Comparing HEAD and test_optimize_performance > > Benchmark tpch_mem_sf1.json > ---

[PR] Allow `generate_series` to be serialized via protobuf [datafusion]

2025-08-14 Thread via GitHub
cetra3 opened a new pull request, #17200: URL: https://github.com/apache/datafusion/pull/17200 ## Which issue does this PR close? No issues raised for this one, just something we're bumping into with another change. ## Rationale for this change Allows `generate_series` t

[PR] fix: add_cast_compatibility_comet_long_decimal [datafusion-comet]

2025-08-14 Thread via GitHub
coderfender opened a new pull request, #2160: URL: https://github.com/apache/datafusion-comet/pull/2160 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these chang

Re: [PR] Use pager and allow configuration via `\pset` [datafusion]

2025-08-14 Thread via GitHub
github-actions[bot] commented on PR #15597: URL: https://github.com/apache/datafusion/pull/15597#issuecomment-3190411093 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] feat(benchmark): collect benchmarks for last 5 versions in line protocol format [datafusion]

2025-08-14 Thread via GitHub
github-actions[bot] closed pull request #15846: feat(benchmark): collect benchmarks for last 5 versions in line protocol format URL: https://github.com/apache/datafusion/pull/15846

Re: [PR] Implement `partition_statistics` API for `RepartitionExec` [datafusion]

2025-08-14 Thread via GitHub
xudong963 commented on code in PR #17061: URL: https://github.com/apache/datafusion/pull/17061#discussion_r2278054549 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -755,10 +756,43 @@ impl ExecutionPlan for RepartitionExec { } fn partition_statistics(&self

Re: [PR] chore: CometExecRule code cleanup [datafusion-comet]

2025-08-14 Thread via GitHub
codecov-commenter commented on PR #2159: URL: https://github.com/apache/datafusion-comet/pull/2159#issuecomment-3190306758 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2159?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] docs: Update to support try arithmetic functions [datafusion-comet]

2025-08-14 Thread via GitHub
coderfender commented on code in PR #2143: URL: https://github.com/apache/datafusion-comet/pull/2143#discussion_r2277997609 ## docs/source/user-guide/expressions.md: ## @@ -44,6 +44,15 @@ The following Spark expressions are currently available. Any known compatibility | Integr

[PR] chore: CometExecRule code cleanup [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove opened a new pull request, #2159: URL: https://github.com/apache/datafusion-comet/pull/2159 ## Which issue does this PR close? N/A ## Rationale for this change Cleaning up some convoluted code ## What changes are included in this PR?

Re: [PR] feat: init_ansi_mode_enabled [datafusion-comet]

2025-08-14 Thread via GitHub
codecov-commenter commented on PR #2136: URL: https://github.com/apache/datafusion-comet/pull/2136#issuecomment-3190145187 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2136?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] feat: Make parquet_encryption a non-default feature [datafusion]

2025-08-14 Thread via GitHub
adamreeve commented on PR #17137: URL: https://github.com/apache/datafusion/pull/17137#issuecomment-3190083005 :+1: makes sense to me too. There is already a `cargo check` run in CI that just enables the `parquet_encryption` feature: https://github.com/apache/datafusion/blob/5c370

Re: [PR] feat: Make parquet_encryption a non-default feature [datafusion]

2025-08-14 Thread via GitHub
corwinjoy commented on PR #17137: URL: https://github.com/apache/datafusion/pull/17137#issuecomment-3190052755 This change looks reasonable to me.

Re: [PR] support filter pushdown with left-semi joins in HashJoinExec [datafusion]

2025-08-14 Thread via GitHub
jonathanc-n commented on code in PR #17153: URL: https://github.com/apache/datafusion/pull/17153#discussion_r2277796066 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -369,6 +370,106 @@ pub struct HashJoinExec { dynamic_filter: Arc, } +/// Check if a physical

Re: [I] Disproportionate memory use for `DISTINCT ON` query [datafusion]

2025-08-14 Thread via GitHub
Dandandan commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3189945322 * Another thing is that DataFusion will store the data for each unique combination, rather than only once for each unique value. For columns with larger data and multiple colum

Re: [I] Improve performance of `datafusion-cli` when reading from remote storage [datafusion]

2025-08-14 Thread via GitHub
BlakeOrth commented on issue #16365: URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3189937566 @alamb I think additional observability tooling is almost always a positive development. That being said, I have to be completely honest with you and note that I'm ultimately

Re: [PR] chore: Simplify approach to avoiding memory corruption due to buffer reuse [datafusion-comet]

2025-08-14 Thread via GitHub
codecov-commenter commented on PR #2156: URL: https://github.com/apache/datafusion-comet/pull/2156#issuecomment-3189887768 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2156?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [I] Disproportionate memory use for `DISTINCT ON` query [datafusion]

2025-08-14 Thread via GitHub
rluvaton commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3189876593 One performance improvement we can make is to calculate the ordering once, rather than once per accumulator, when there is an `ORDER BY`.
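
A minimal sketch of the idea in the comment above, assuming a simplified setting: the sort permutation is computed once and handed to every accumulator instead of each accumulator sorting on its own. The names `sort_indices`, `FirstValue`, and `evaluate_all` are hypothetical and are not DataFusion APIs.

```rust
/// Returns the row indices of `keys` in ascending key order (stable sort).
fn sort_indices(keys: &[i64]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..keys.len()).collect();
    idx.sort_by_key(|&i| keys[i]);
    idx
}

/// Hypothetical accumulator: yields the first value of one column in key order.
struct FirstValue<'a> {
    column: &'a [&'a str],
}

impl<'a> FirstValue<'a> {
    /// Uses an ordering supplied by the caller, so no per-accumulator sort is needed.
    fn evaluate(&self, order: &[usize]) -> Option<&'a str> {
        order.first().map(|&i| self.column[i])
    }
}

/// Sorts once and reuses the permutation for every accumulator.
fn evaluate_all<'a>(keys: &[i64], accumulators: &[FirstValue<'a>]) -> Vec<Option<&'a str>> {
    let order = sort_indices(keys); // computed once, not once per accumulator
    accumulators.iter().map(|acc| acc.evaluate(&order)).collect()
}
```

With N accumulators over the same `ORDER BY` keys, this does one sort instead of N; how that maps onto DataFusion's actual accumulator interfaces is not covered by this sketch.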

Re: [PR] Fix: ListingTableFactory hive column detection [datafusion]

2025-08-14 Thread via GitHub
BlakeOrth commented on code in PR #17050: URL: https://github.com/apache/datafusion/pull/17050#discussion_r2277696974 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -802,6 +802,7 @@ impl ListingOptions { .rev() .skip(1) // get

Re: [PR] Fix: ListingTableFactory hive column detection [datafusion]

2025-08-14 Thread via GitHub
BlakeOrth commented on PR #17050: URL: https://github.com/apache/datafusion/pull/17050#issuecomment-3189831540 > Thank you for your patience No worries at all, it's pretty obvious to me you have quite a few plates spinning. I wanted to ensure you had time to review, but simultaneously

Re: [PR] fix: Add CopyExec to inputs to SortMergeJoinExec [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove commented on PR #2155: URL: https://github.com/apache/datafusion-comet/pull/2155#issuecomment-3189824597 Thanks for the review @comphead

Re: [I] Inputs to SortMergeJoin should be wrapped in a CopyExec with UnpackOrDeepCopy if the inputs can reuse batches [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove closed issue #2151: Inputs to SortMergeJoin should be wrapped in a CopyExec with UnpackOrDeepCopy if the inputs can reuse batches URL: https://github.com/apache/datafusion-comet/issues/2151

Re: [PR] fix: Add CopyExec to inputs to SortMergeJoinExec [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove merged PR #2155: URL: https://github.com/apache/datafusion-comet/pull/2155

Re: [PR] Fix: ListingTableFactory hive column detection [datafusion]

2025-08-14 Thread via GitHub
BlakeOrth commented on code in PR #17050: URL: https://github.com/apache/datafusion/pull/17050#discussion_r2277674825 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -802,6 +802,7 @@ impl ListingOptions { .rev() .skip(1) // get

Re: [PR] feat: init_ansi_mode_enabled [datafusion-comet]

2025-08-14 Thread via GitHub
coderfender commented on PR #2136: URL: https://github.com/apache/datafusion-comet/pull/2136#issuecomment-3189809289 Added test cases for addition, subtraction, and multiplication use cases. Given that Spark inherently converts decimal operands to doubles (and decimals in case of integral

Re: [I] Improve performance of `datafusion-cli` when reading from remote storage [datafusion]

2025-08-14 Thread via GitHub
alamb commented on issue #16365: URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3189798794 I have been thinking about this @BlakeOrth -- and when I was trying to write up this plan, I realized it wasn't actually clear to me what was happening and thus I didn't know wha

Re: [PR] Fix: ListingTableFactory hive column detection [datafusion]

2025-08-14 Thread via GitHub
alamb commented on code in PR #17050: URL: https://github.com/apache/datafusion/pull/17050#discussion_r2277629705 ## datafusion/core/src/datasource/listing_table_factory.rs: ## @@ -63,16 +63,33 @@ impl TableProviderFactory for ListingTableFactory { ))?

Re: [PR] Fix: ListingTableFactory hive column detection [datafusion]

2025-08-14 Thread via GitHub
alamb commented on PR #17050: URL: https://github.com/apache/datafusion/pull/17050#issuecomment-3189702547 I am looking at this issue a bit more carefully

Re: [I] Disproportionate memory use for `DISTINCT ON` query [datafusion]

2025-08-14 Thread via GitHub
Omega359 commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3189698266 FWIW, I use DuckDB for dedup, and while it's noticeably better than DF memory-wise, memory is still a concern for it as well.

Re: [I] Dynamic Filter Pushdown causes JOIN to return incorrect results [datafusion]

2025-08-14 Thread via GitHub
adriangb commented on issue #17188: URL: https://github.com/apache/datafusion/issues/17188#issuecomment-3189690982 I think we could do something like: 1. Keep track of how many output partitions we have in HashJoinExec. 2. Keep track of the overall bounds amongst all partitions. 3. O
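
The first two steps listed above can be sketched roughly as follows. This is an illustration only: `Bounds` and `BoundsTracker` are hypothetical names, keys are assumed to be plain integers, and the third step is cut off in the original, so it is not represented here. A real operator would also need synchronization (for example a mutex) because partitions run concurrently.

```rust
/// Min/max bounds of a join key (hypothetical; integer keys only).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bounds {
    min: i64,
    max: i64,
}

/// Tracks how many partitions have yet to report and the bounds seen so far.
struct BoundsTracker {
    remaining: usize,
    overall: Option<Bounds>,
}

impl BoundsTracker {
    /// Step 1: remember how many output partitions exist.
    fn new(partitions: usize) -> Self {
        Self { remaining: partitions, overall: None }
    }

    /// Step 2: fold one partition's bounds into the overall bounds.
    fn report(&mut self, b: Bounds) {
        self.overall = Some(match self.overall {
            None => b,
            Some(o) => Bounds { min: o.min.min(b.min), max: o.max.max(b.max) },
        });
        self.remaining = self.remaining.saturating_sub(1);
    }
}
```

Merging by widening (smallest minimum, largest maximum) keeps the combined bounds valid for rows from any partition, which is the property the comment appears to be after.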

Re: [PR] perf: Only perform deep copies for Parquet scans [ignore] [datafusion-comet]

2025-08-14 Thread via GitHub
codecov-commenter commented on PR #2158: URL: https://github.com/apache/datafusion-comet/pull/2158#issuecomment-3189659004 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2158?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [I] Avoid schema deep clone in PruningExpressionBuilder [datafusion]

2025-08-14 Thread via GitHub
etolbakov commented on issue #17198: URL: https://github.com/apache/datafusion/issues/17198#issuecomment-3189616411 Hey Piotr @findepi If that's not an urgent one I'd like to give it a go.

Re: [PR] feat: support Utf8View for more args of `regexp_replace` [datafusion]

2025-08-14 Thread via GitHub
mbutrovich commented on code in PR #17195: URL: https://github.com/apache/datafusion/pull/17195#discussion_r2277380381 ## datafusion/functions/src/regex/regexpreplace.rs: ## @@ -125,13 +121,13 @@ impl ScalarUDFImpl for RegexpReplaceFunc { fn return_type(&self, arg_types: &[

Re: [I] Dynamic Filter Pushdown is being applied to the wrong table [datafusion]

2025-08-14 Thread via GitHub
nuno-faria commented on issue #17196: URL: https://github.com/apache/datafusion/issues/17196#issuecomment-3189605659 Left side is always the build side, but the relation assigned to the left side can swap with the one on the right. In that case, the join won't use a `collectLeft` mode.

Re: [I] Disproportionate memory use for `DISTINCT ON` query [datafusion]

2025-08-14 Thread via GitHub
rluvaton commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3189595746 I'm surprised we got the error and we did not spill

Re: [I] Add a way to get what takes memory [datafusion]

2025-08-14 Thread via GitHub
rluvaton commented on issue #16904: URL: https://github.com/apache/datafusion/issues/16904#issuecomment-3189588724 With this feature, issues like this are much easier to debug: - https://github.com/apache/datafusion/issues/17169

[PR] perf: Only perform deep copies for Parquet scans [ignore] [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove opened a new pull request, #2158: URL: https://github.com/apache/datafusion-comet/pull/2158 ## Which issue does this PR close? N/A Follows on from https://github.com/apache/datafusion-comet/pull/2156 ## Rationale for this change Stop perfo

Re: [PR] perf: Avoid deep copies for ScanExecs that do not use mutable Parquet buffers [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove closed pull request #2149: perf: Avoid deep copies for ScanExecs that do not use mutable Parquet buffers URL: https://github.com/apache/datafusion-comet/pull/2149

Re: [I] Dynamic Filter Pushdown is being applied to the wrong table [datafusion]

2025-08-14 Thread via GitHub
adriangb commented on issue #17196: URL: https://github.com/apache/datafusion/issues/17196#issuecomment-3189576101 But won't that still result in the left side being the build side and the right side being the probe side -> left side is the dynamic filter source and we push them down into th

Re: [I] Dynamic Filter Pushdown causes JOIN to return incorrect results [datafusion]

2025-08-14 Thread via GitHub
adriangb commented on issue #17188: URL: https://github.com/apache/datafusion/issues/17188#issuecomment-3189570326 Looking at this a bit and I now question the implementation in #16445 altogether: it seems that in `Partitioned` mode we build a hash table for each partition -> we would need

Re: [I] Dynamic Filter Pushdown is being applied to the wrong table [datafusion]

2025-08-14 Thread via GitHub
nuno-faria commented on issue #17196: URL: https://github.com/apache/datafusion/issues/17196#issuecomment-3189563371 > Hmm I thought DataFusion didn't do smart switching of sides i.e. left side is always build side and right is always probe side? When there are statistics the planner

Re: [PR] Miscellaneous cleanups [datafusion]

2025-08-14 Thread via GitHub
findepi commented on code in PR #17189: URL: https://github.com/apache/datafusion/pull/17189#discussion_r2277488346 ## datafusion/expr/src/logical_plan/extension.rs: ## @@ -57,7 +57,7 @@ pub trait UserDefinedLogicalNode: fmt::Debug + Send + Sync { fn schema(&self) -> &DFSch

[PR] Remove redundant `plan` from extension's check_invariants [datafusion]

2025-08-14 Thread via GitHub
findepi opened a new pull request, #17199: URL: https://github.com/apache/datafusion/pull/17199 In `UserDefinedLogicalNode::check_invariants`, the actual plan to check for invariants is `self`. The `plan` is always `LogicalPlan::Extension` and provides no further information. It's confusin

Re: [PR] Miscellaneous cleanups [datafusion]

2025-08-14 Thread via GitHub
findepi commented on code in PR #17189: URL: https://github.com/apache/datafusion/pull/17189#discussion_r2277484931 ## datafusion/pruning/src/pruning_predicate.rs: ## @@ -978,6 +978,7 @@ impl<'a> PruningExpressionBuilder<'a> { } }; +// TOD

Re: [PR] Miscellaneous cleanups [datafusion]

2025-08-14 Thread via GitHub
findepi commented on code in PR #17189: URL: https://github.com/apache/datafusion/pull/17189#discussion_r2277485443 ## datafusion-examples/examples/pruning.rs: ## @@ -187,10 +187,10 @@ impl PruningStatistics for MyCatalog { } fn create_pruning_predicate(expr: Expr, schema: &

[I] Avoid schema deep clone in PruningExpressionBuilder [datafusion]

2025-08-14 Thread via GitHub
findepi opened a new issue, #17198: URL: https://github.com/apache/datafusion/issues/17198 Avoid schema deep clone in https://github.com/apache/datafusion/blob/6870cc180f3dd72583919d01bf983fabb700434c/datafusion/pruning/src/pruning_predicate.rs#L981 by passing SchemaRef to that function.
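
A minimal sketch of why passing a `SchemaRef` avoids the deep clone, assuming a stand-in schema type. In Arrow, `SchemaRef` is an alias for `Arc<Schema>`; the `Schema`, `Builder`, `build_from_borrowed`, and `build_from_shared` definitions below are hypothetical and do not reproduce the `PruningExpressionBuilder` code.

```rust
use std::sync::Arc;

/// Stand-in for a schema that is expensive to clone (hypothetical, not Arrow's `Schema`).
#[derive(Clone, Debug)]
struct Schema {
    fields: Vec<String>,
}

/// Shared, reference-counted handle to a schema.
type SchemaRef = Arc<Schema>;

/// Hypothetical builder that stores a schema handle.
struct Builder {
    schema: SchemaRef,
}

/// Takes `&Schema`, so it must deep-clone every field to obtain an owned handle.
fn build_from_borrowed(schema: &Schema) -> Builder {
    Builder { schema: Arc::new(schema.clone()) }
}

/// Takes `&SchemaRef`, so it only bumps the reference count.
fn build_from_shared(schema: &SchemaRef) -> Builder {
    Builder { schema: Arc::clone(schema) }
}
```

The shared variant points at the same allocation as the caller's handle, while the borrowed variant allocates a full copy of the field list on every call.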

Re: [PR] Miscellaneous cleanups [datafusion]

2025-08-14 Thread via GitHub
findepi commented on code in PR #17189: URL: https://github.com/apache/datafusion/pull/17189#discussion_r2277481985 ## datafusion/expr/src/logical_plan/extension.rs: ## @@ -57,7 +57,7 @@ pub trait UserDefinedLogicalNode: fmt::Debug + Send + Sync { fn schema(&self) -> &DFSch

Re: [I] Dynamic Filter Pushdown is being applied to the wrong table [datafusion]

2025-08-14 Thread via GitHub
adriangb commented on issue #17196: URL: https://github.com/apache/datafusion/issues/17196#issuecomment-3189512211 Hmm I thought DataFusion didn't do smart switching of sides i.e. left side is always build side and right is always probe side?

Re: [I] `string_agg` does not respect `ORDER BY` on `49.0.0` [datafusion]

2025-08-14 Thread via GitHub
LiaCastaneda commented on issue #17011: URL: https://github.com/apache/datafusion/issues/17011#issuecomment-3189507953 > > We just verified it gets fixed in [#17129](https://github.com/apache/datafusion/pull/17129) 👍 > > Note we (well really [@AdamGS](https://github.com/AdamGS) )

Re: [PR] fix: Add CopyExec to inputs to SortMergeJoinExec [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove commented on PR #2155: URL: https://github.com/apache/datafusion-comet/pull/2155#issuecomment-3189494802 > oh so do we use SMJ by default? > > ``` > val COMET_REPLACE_SMJ: ConfigEntry[Boolean] = > conf(s"$COMET_EXEC_CONFIG_PREFIX.replaceSortMergeJoin") >

[I] feat: Support nested Array literals [datafusion-comet]

2025-08-14 Thread via GitHub
comphead opened a new issue, #2157: URL: https://github.com/apache/datafusion-comet/issues/2157 ### What is the problem the feature request solves? _No response_ ### Describe the potential solution _No response_ ### Additional context _No response_ -- Thi

Re: [PR] [main] Update version to 49.0.1 and add changelog (#17175) [datafusion]

2025-08-14 Thread via GitHub
alamb merged PR #17191: URL: https://github.com/apache/datafusion/pull/17191

Re: [PR] [main] Update version to 49.0.1 and add changelog (#17175) [datafusion]

2025-08-14 Thread via GitHub
alamb commented on PR #17191: URL: https://github.com/apache/datafusion/pull/17191#issuecomment-3189476965 Thank you @comphead for the review

Re: [PR] Rewrite Nested Loop Join executor for 5× speed and 1% memory usage [datafusion]

2025-08-14 Thread via GitHub
comphead merged PR #16996: URL: https://github.com/apache/datafusion/pull/16996

Re: [PR] Rewrite Nested Loop Join executor for 5× speed and 1% memory usage [datafusion]

2025-08-14 Thread via GitHub
comphead commented on PR #16996: URL: https://github.com/apache/datafusion/pull/16996#issuecomment-3189449241 > > @2010YOUY01 I think you can run extended tests on your forked repo from `Actions` providing a branch? > > Ah, yes. I forget pushing to private clone's main branch can also

Re: [PR] support filter pushdown with left-semi joins in HashJoinExec [datafusion]

2025-08-14 Thread via GitHub
nuno-faria commented on code in PR #17153: URL: https://github.com/apache/datafusion/pull/17153#discussion_r2277409592 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -369,6 +370,106 @@ pub struct HashJoinExec { dynamic_filter: Arc, } +/// Check if a physical e

Re: [PR] support filter pushdown with left-semi joins in HashJoinExec [datafusion]

2025-08-14 Thread via GitHub
nuno-faria commented on PR #17153: URL: https://github.com/apache/datafusion/pull/17153#issuecomment-3189444721 Thanks @adriangb for working on this. I think it would be nice to add a sqllogictest example showing the pushdown. Here is an example: ```sql copy (select i as k f

Re: [PR] disable HashJoinExec sideways information passing for partitioned queries [datafusion]

2025-08-14 Thread via GitHub
adriangb commented on PR #17197: URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3189436991 I think we can support this but need to do some thinking / writing code

[PR] disable HashJoinExec sideways information passing for partitioned queries [datafusion]

2025-08-14 Thread via GitHub
adriangb opened a new pull request, #17197: URL: https://github.com/apache/datafusion/pull/17197 (no comment)

[I] Dynamic Filter Pushdown is being applied to the wrong table [datafusion]

2025-08-14 Thread via GitHub
nuno-faria opened a new issue, #17196: URL: https://github.com/apache/datafusion/issues/17196 ### Describe the bug In the following example, the Dynamic Filter Pushdown is built from `t2 (k, v)` and correctly pushed to `t1 (k)`, to reduce the number of scanned rows from `t1`:

Re: [PR] feat: better Utf8View support for `regexp_replace`, signature cleanup [datafusion]

2025-08-14 Thread via GitHub
mbutrovich commented on code in PR #17195: URL: https://github.com/apache/datafusion/pull/17195#discussion_r2277380381 ## datafusion/functions/src/regex/regexpreplace.rs: ## @@ -125,13 +121,13 @@ impl ScalarUDFImpl for RegexpReplaceFunc { fn return_type(&self, arg_types: &[

Re: [PR] feat: better Utf8View support for `regexp_replace`, signature cleanup [datafusion]

2025-08-14 Thread via GitHub
mbutrovich commented on code in PR #17195: URL: https://github.com/apache/datafusion/pull/17195#discussion_r2277379010 ## datafusion/functions/src/regex/regexpreplace.rs: ## @@ -650,8 +665,8 @@ mod tests { vec!["afooc", "acd", "afoocd1234567890123", "123456

Re: [PR] feat: better Utf8View support for `regexp_replace`, signature cleanup [datafusion]

2025-08-14 Thread via GitHub
mbutrovich commented on code in PR #17195: URL: https://github.com/apache/datafusion/pull/17195#discussion_r2277377451 ## datafusion/functions/src/regex/regexpreplace.rs: ## @@ -398,12 +394,37 @@ fn _regexp_replace_early_abort( /// Note: If the array is empty or the first argum

[PR] feat: better Utf8View support for `regexp_replace`, signature cleanup [datafusion]

2025-08-14 Thread via GitHub
mbutrovich opened a new pull request, #17195: URL: https://github.com/apache/datafusion/pull/17195 ## Which issue does this PR close? - Closes #. ## Rationale for this change #11667 added some `Utf8View` support for `regexp_replace`, but only for the firs

[PR] chore: Simplify approach to avoiding memory corruption due to buffer reuse [wip] [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove opened a new pull request, #2156: URL: https://github.com/apache/datafusion-comet/pull/2156 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes

Re: [PR] fix: Add CopyExec to inputs to SortMergeJoinExec [datafusion-comet]

2025-08-14 Thread via GitHub
codecov-commenter commented on PR #2155: URL: https://github.com/apache/datafusion-comet/pull/2155#issuecomment-3189319131 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2155?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Testing: Try test optimize performance for coalesce [datafusion]

2025-08-14 Thread via GitHub
alamb commented on PR #17193: URL: https://github.com/apache/datafusion/pull/17193#issuecomment-3189279156 🤖: Benchmark completed Details ``` Comparing HEAD and test_optimize_performance Benchmark tpch_mem_sf1.json

Re: [PR] fix: Add CopyExec to inputs to SortMergeJoinExec [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove commented on code in PR #2155: URL: https://github.com/apache/datafusion-comet/pull/2155#discussion_r2277229709 ## native/core/src/execution/planner.rs: ## @@ -2611,8 +2614,10 @@ impl From for DataFusionError { /// data corruption from reusing the input batch. fn can

[PR] fix: Add CopyExec to inputs to SortMergeJoinExec [datafusion-comet]

2025-08-14 Thread via GitHub
andygrove opened a new pull request, #2155: URL: https://github.com/apache/datafusion-comet/pull/2155 ## Which issue does this PR close? Closes https://github.com/apache/datafusion-comet/issues/2151 ## Rationale for this change ## What changes are included

Re: [I] Tracking PR to update useDecimal128 in Iceberg [datafusion-comet]

2025-08-14 Thread via GitHub
hsiang-c commented on issue #2095: URL: https://github.com/apache/datafusion-comet/issues/2095#issuecomment-3189182920 Pending in https://github.com/apache/iceberg/pull/13665

Re: [PR] Testing: Try test optimize performance for coalesce [datafusion]

2025-08-14 Thread via GitHub
alamb commented on PR #17193: URL: https://github.com/apache/datafusion/pull/17193#issuecomment-3189179385 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [I] [iceberg] Tracking PR to update Iceberg to enable Comet native execution with Iceberg [datafusion-comet]

2025-08-14 Thread via GitHub
hsiang-c commented on issue #2094: URL: https://github.com/apache/datafusion-comet/issues/2094#issuecomment-3189173250 Pending in https://github.com/apache/iceberg/pull/13793

Re: [PR] feat: implement QUALIFY clause [datafusion]

2025-08-14 Thread via GitHub
alamb commented on PR #16933: URL: https://github.com/apache/datafusion/pull/16933#issuecomment-3189170759 > @alamb @jayzhan211 , could you please take a look at this PR? If everything looks good, perhaps we can merge this as the initial version of qualify, and then @Vedin can follow up wit

Re: [PR] feat: implement QUALIFY clause [datafusion]

2025-08-14 Thread via GitHub
alamb commented on code in PR #16933: URL: https://github.com/apache/datafusion/pull/16933#discussion_r2277174123 ## datafusion/sql/tests/sql_integration.rs: ## @@ -4186,6 +4187,47 @@ fn test_select_distinct_order_by() { ); } +#[test] +fn test_select_qualify_basic() { +

Re: [I] [iceberg] Error loading in-memory sorter check class path [datafusion-comet]

2025-08-14 Thread via GitHub
parthchandra closed issue #1982: [iceberg] Error loading in-memory sorter check class path URL: https://github.com/apache/datafusion-comet/issues/1982

Re: [I] Add support for registering files in the Arrow IPC stream format as tables using `register_arrow` or similar [datafusion]

2025-08-14 Thread via GitHub
alamb commented on issue #16688: URL: https://github.com/apache/datafusion/issues/16688#issuecomment-3189115601 That sounds right to me

Re: [I] `string_agg` does not respect `ORDER BY` on `49.0.0` [datafusion]

2025-08-14 Thread via GitHub
alamb commented on issue #17011: URL: https://github.com/apache/datafusion/issues/17011#issuecomment-3189112531 > We just verified it gets fixed in [#17129](https://github.com/apache/datafusion/pull/17129) 👍 Note we (well really @AdamGS ) also backported the fix to 49.0.1: https://gi

Re: [PR] Migrate core test to insta part 3 [datafusion]

2025-08-14 Thread via GitHub
alamb commented on PR #16978: URL: https://github.com/apache/datafusion/pull/16978#issuecomment-3189092136 Thank you so much @Chen-Yuan-Lai

Re: [I] `string_agg` does not respect `ORDER BY` on `49.0.0` [datafusion]

2025-08-14 Thread via GitHub
LiaCastaneda commented on issue #17011: URL: https://github.com/apache/datafusion/issues/17011#issuecomment-3189090373 We just verified it gets fixed in #17129 👍

Re: [I] Disproportionate memory use for `DISTINCT ON` query [datafusion]

2025-08-14 Thread via GitHub
alamb commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3189084096 I think there are two issues we should explore: 1. Keeping within the memory budget (basically make the accounting better when possible) 2. Improving DataFusion's ability to

Re: [I] Disproportionate memory use for `DISTINCT ON` query [datafusion]

2025-08-14 Thread via GitHub
alamb commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3189076433 I believe @Omega359 has also spoken about this use case being a challenge (basically deduplicating large datasets). While DataFusion's grouping operator is already pretty o

[I] `cargo test --test fuzz` fails with "Too many open files" [datafusion]

2025-08-14 Thread via GitHub
alamb opened a new issue, #17194: URL: https://github.com/apache/datafusion/issues/17194 ### Describe the bug - While verifying 49.0.1: https://github.com/apache/datafusion/issues/17036 @milenkovicm reported ([email](https://lists.apache.org/thread/g94ztqk5hlgxd0lnmhofsm5lowchn

Re: [PR] Rewrite Nested Loop Join executor for 5× speed and 1% memory usage [datafusion]

2025-08-14 Thread via GitHub
2010YOUY01 commented on PR #16996: URL: https://github.com/apache/datafusion/pull/16996#issuecomment-3189039109 > @2010YOUY01 I think you can run extended tests on your forked repo from `Actions` providing a branch? Ah, yes. I forgot that pushing to a private clone's main branch can also tri

Re: [PR] feat: Add config option to log fallback reasons [datafusion-comet]

2025-08-14 Thread via GitHub
codecov-commenter commented on PR #2154: URL: https://github.com/apache/datafusion-comet/pull/2154#issuecomment-3189035731 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2154?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [I] Disproportionate memory use for `DISTINCT ON` query [datafusion]

2025-08-14 Thread via GitHub
alamb commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3188990678 BTW I can easily reproduce the reported symptoms using `datafusion-cli` locally: ```shell > SELECT DISTINCT ON ("ADDRESS1", "ADDRESS2", "ADDRESS3", "POSTCODE") *

Re: [I] Disproportionate memory use for `DISTINCT ON` query [datafusion]

2025-08-14 Thread via GitHub
alamb commented on issue #17169: URL: https://github.com/apache/datafusion/issues/17169#issuecomment-3188983831 It may also be related to the fact that the group by hash code can allocate large single swaths of memory @Rachelint was working on improvements in this area - https://

Re: [PR] Migrate core test to insta part 3 [datafusion]

2025-08-14 Thread via GitHub
Chen-Yuan-Lai commented on PR #16978: URL: https://github.com/apache/datafusion/pull/16978#issuecomment-3188982652 > Maybe we can try to port a few tests in one of the files to use insta to make sure we are good with the pattern before applying the pattern to the entire thing > > F

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-08-14 Thread via GitHub
alamb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-3188967321 @nuno-faria has reported a bug related to this PR: - https://github.com/apache/datafusion/issues/17188

Re: [I] Dynamic Filter Pushdown causes JOIN to return incorrect results [datafusion]

2025-08-14 Thread via GitHub
alamb commented on issue #17188: URL: https://github.com/apache/datafusion/issues/17188#issuecomment-3188966302 Added to list of bugs we need to fix for 50: - https://github.com/apache/datafusion/issues/16799

Re: [PR] Rewrite Nested Loop Join executor for 5× speed and 1% memory usage [datafusion]

2025-08-14 Thread via GitHub
comphead commented on PR #16996: URL: https://github.com/apache/datafusion/pull/16996#issuecomment-3188943040 @2010YOUY01 I think you can run extended tests on your forked repo from `Actions` providing a branch?

Re: [I] Release DataFusion `49.0.1` (patch) [datafusion]

2025-08-14 Thread via GitHub
alamb commented on issue #17036: URL: https://github.com/apache/datafusion/issues/17036#issuecomment-3188933262 And we have a release candidate out for voting: https://lists.apache.org/thread/x95ffh7m3m91qg31sd38155pfsdntnsb

Re: [PR] Miscellaneous cleanups [datafusion]

2025-08-14 Thread via GitHub
alamb commented on code in PR #17189: URL: https://github.com/apache/datafusion/pull/17189#discussion_r2277004846 ## datafusion/expr/src/logical_plan/extension.rs: ## @@ -57,7 +57,7 @@ pub trait UserDefinedLogicalNode: fmt::Debug + Send + Sync { fn schema(&self) -> &DFSchem
