Re: [PR] feat: Support RightMark join for NestedLoop and Hash join [datafusion]

2025-05-31 Thread via GitHub
Dandandan commented on code in PR #16083: URL: https://github.com/apache/datafusion/pull/16083#discussion_r2118781458 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -975,6 +996,12 @@ pub(crate) fn adjust_indices_by_join_type( // the left_indices will not be u

Re: [PR] fix: Fix `EquivalenceClass` calculation for Union queries [datafusion]

2025-05-31 Thread via GitHub
chenkovsky closed pull request #16185: fix: Fix `EquivalenceClass` calculation for Union queries URL: https://github.com/apache/datafusion/pull/16185 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] Always add parentheses when formatting`BinaryExpr` with `SchemaDisplay` [datafusion]

2025-05-31 Thread via GitHub
hendrikmakait commented on code in PR #16209: URL: https://github.com/apache/datafusion/pull/16209#discussion_r2118214706 ## datafusion/expr/src/expr.rs: ## @@ -2500,7 +2500,7 @@ impl Display for SchemaDisplay<'_> { } } Expr::BinaryExpr

Re: [PR] Add change to VARCHAR in the upgrade guide [datafusion]

2025-05-31 Thread via GitHub
comphead commented on code in PR #16216: URL: https://github.com/apache/datafusion/pull/16216#discussion_r2118165176 ## docs/source/library-user-guide/upgrading.md: ## @@ -21,6 +21,55 @@ ## DataFusion `48.0.0` +### `VARCHAR` SQL type changed to map to `Utf8View` Arrow type

Re: [PR] Handle dicts for distinct count [datafusion]

2025-05-31 Thread via GitHub
blaginin commented on PR #15871: URL: https://github.com/apache/datafusion/pull/15871#issuecomment-2925530843 thank you for the review! i really like your fuzzy testing idea - will push soon (and respond to the new comments) -- This is an automated message from the Apache Git Service. To

Re: [PR] Handle dicts for distinct count [datafusion]

2025-05-31 Thread via GitHub
alamb commented on code in PR #15871: URL: https://github.com/apache/datafusion/pull/15871#discussion_r2118139282 ## datafusion/functions-aggregate/src/count.rs: ## @@ -179,6 +179,107 @@ impl Count { } } } +fn get_primitive_type_accumulator(data_type: &DataType) -

Re: [PR] fix: Fix `EquivalenceClass` calculation for Union queries [datafusion]

2025-05-31 Thread via GitHub
alamb commented on PR #16185: URL: https://github.com/apache/datafusion/pull/16185#issuecomment-2925525241 - This PR might also be superceded by https://github.com/apache/datafusion/pull/16217 -- This is an automated message from the Apache Git Service. To respond to the message, please l

[PR] Chore: implement bit_not as ScalarUDFImpl [datafusion-comet]

2025-05-31 Thread via GitHub
kazantsev-maksim opened a new pull request, #1825: URL: https://github.com/apache/datafusion-comet/pull/1825 ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/1819 Closes #. See https://github.com/apache/datafusion-comet/issues/1

Re: [PR] Minor: remove unused IPCWriter [datafusion]

2025-05-31 Thread via GitHub
alamb commented on PR #16215: URL: https://github.com/apache/datafusion/pull/16215#issuecomment-2925498438 Thanks @comphead and @2010YOUY01 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

Re: [I] Treat truncated parquet stats as inexact [datafusion]

2025-05-31 Thread via GitHub
CookiePieWw commented on issue #15976: URL: https://github.com/apache/datafusion/issues/15976#issuecomment-2925153931 Thanks for your feedback! I found `ValueStatistics` has already have [`max_is_exact`](https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html#metho

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2925152088 > This is exactly what I want to avoid -- there are too many execs and we shouldn't have a structure where this is repeated everywhere in code. Isn't that somewhat unavoidable

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2925042496 > I just updated the PR also support sort exec, also added end to end testing. This is exactly what I want to avoid -- there are too many execs and we shouldn't have a struc

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924937197 > This is an interesting find, and I think it would be great if we could somehow utilize this (or something similar to this). I was just studying the implementation. It would

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924931929 > I am currently against merging this as is, unless we get stuck in our search to find a proper way to solve this. Let's keep hacking at finding a good solution -- we are making

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924751830 I am currently against merging this as is, unless we get stuck in our search to find a proper way to solve this. Let's keep hacking at finding a good solution -- we are making prog

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924752646 > https://docs.rs/tokio/latest/tokio/task/coop/fn.consume_budget.html This is an interesting find, and I think it would be great if we could somehow utilize this (or somethin

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-31 Thread via GitHub
ding-young commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117541964 ## benchmarks/src/sort_tpch.rs: ## @@ -294,7 +297,7 @@ impl RunOpt { let mut stream = execute_stream(physical_plan.clone(), state.task_ctx())?;

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-31 Thread via GitHub
ding-young commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117536617 ## benchmarks/compare.py: ## @@ -156,8 +173,8 @@ def compare( console.print(table) # Calculate averages -avg_baseline_time = total_baseline_time

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924691085 > but it's kind of similar to the count-batches-and-yield solution. Thank you @pepijnve , this is really helpful, so our choice seems like good enough: 1. For performanc

Re: [PR] feat: Translate Hadoop S3A configurations to object_store configurations [datafusion-comet]

2025-05-31 Thread via GitHub
Kontinuation commented on PR #1817: URL: https://github.com/apache/datafusion-comet/pull/1817#issuecomment-2924690797 > I have another thought on this. Any number of users have developed custom `AWSCredentialsProvider`s in Java but we would not have corresponding implementations in Rust (t

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-31 Thread via GitHub
ding-young commented on PR #16182: URL: https://github.com/apache/datafusion/pull/16182#issuecomment-2924686568 Thank you for review. I’ve applied your suggestions, and I’ll run it a few more times and check the output display once again before requesting a final review. If there are any ot

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924684413 Just as an FYI, I was reading through the tokio docs yesterday and found https://docs.rs/tokio/latest/tokio/task/coop/index.html#cooperative-scheduling which seems very similar to t

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-05-31 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2924675924 > @ozankabak > > > Pipeline-breaking operators creating cancellation issues is a universal problem -- this is not an `AggregateExec` issue. Therefore, I firmly believe tha

Re: [PR] Update tpch, clickbench, sort_tpch to mark failed queries [datafusion]

2025-05-31 Thread via GitHub
ding-young commented on code in PR #16182: URL: https://github.com/apache/datafusion/pull/16182#discussion_r2117532342 ## benchmarks/src/clickbench.rs: ## @@ -128,36 +128,70 @@ impl RunOpt { let ctx = SessionContext::new_with_config_rt(config, rt_builder.build_arc()?);

Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-31 Thread via GitHub
Dandandan commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2924659989 So if we're to change `RecordBatch` to `Vec` in the hashjoin, I think this is what's roughly needed: * Change `batch` to `Vec` in `JoinLeftData` * Initialize `JoinLef

Re: [PR] feat: Translate Hadoop S3A configurations to object_store configurations [datafusion-comet]

2025-05-31 Thread via GitHub
Kontinuation commented on code in PR #1817: URL: https://github.com/apache/datafusion-comet/pull/1817#discussion_r2117529719 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadFromS3Suite.scala: ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (AS