Re: [PR] [Test] Upgrade to edition 2024 [datafusion]
Dandandan commented on PR #15805: URL: https://github.com/apache/datafusion/pull/15805#issuecomment-2820649623 Nice experiment, closing. No compilation improvement just from `cargo check` that I observed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] doc: Adding Feldera as known user [datafusion]
xudong963 merged PR #15799: URL: https://github.com/apache/datafusion/pull/15799 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] add support for XMLTABLE(...) [datafusion-sqlparser-rs]
lovasoa commented on PR #1817: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1817#issuecomment-2820406341 Hi @alamb ! I'd love to get your feedback on this :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: AQE creating a non-supported Final HashAggregate post-shuffle [datafusion-comet]
EmilyMatt commented on PR #1390: URL: https://github.com/apache/datafusion-comet/pull/1390#issuecomment-2820434223 > @EmilyMatt I see that plan stability tests are failing. Do you plan to update the golden files? I was unable to, I tried following the guide, as well as tried various changes, none of them worked, the compilation process was weird using the command in the docs, and did not contain my changes nor lots of other recent changes in main I think I'll close this PR and open it when it is more relevant, as at the moment it is only a performance boost, not a crash fix -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Eliminate the function call in `xxx_or (e.g. unwrap_or("".to_string())` [datafusion]
Rachelint opened a new issue, #15803: URL: https://github.com/apache/datafusion/issues/15803 ### Is your feature request related to a problem or challenge? I found some unnecessary functions are called, due to using `unwrap_or` rather than `unwrap_or_else` in funcation call case (some fucntion calls may be not really cheap). I have eliminate some of them in `SessionStateBuilder::build`(#15800 ), but I think we should use lints to ensure it. ### Describe the solution you'd like - Add `or_fun_call` lint to crates - Make clippy again by eliminating the function call in `unwrap_or` ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Eliminate the function call in `xxx_or (e.g. unwrap_or("".to_string())` [datafusion]
Rachelint opened a new issue, #15802: URL: https://github.com/apache/datafusion/issues/15802 ### Is your feature request related to a problem or challenge? I found some unnecessary functions are called, due to using `unwrap_or` rather than `unwrap_or_else` in funcation call case (some fucntion calls may be not really cheap). I have eliminate some of them in `SessionStateBuilder::build`(#15800 ), but I think we should use lints to ensure it. ### Describe the solution you'd like - Add `or_fun_call` lint to crates - Make clippy again by eliminating the function call in `unwrap_or` ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: AQE creating a non-supported Final HashAggregate post-shuffle [datafusion-comet]
EmilyMatt closed pull request #1390: fix: AQE creating a non-supported Final HashAggregate post-shuffle URL: https://github.com/apache/datafusion-comet/pull/1390 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Deprecate ExprSchemable functions [datafusion]
ajita-asthana commented on issue #15798: URL: https://github.com/apache/datafusion/issues/15798#issuecomment-2821104895 I want to work on this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Fix: fetch is missing in `replace_order_preserving_variants` method during `EnforceDistribution` optimizer [datafusion]
xudong963 opened a new pull request, #15808: URL: https://github.com/apache/datafusion/pull/15808 ## Which issue does this PR close? - Closes #. ## Rationale for this change If SPM contains `fetch`, then fetch will be missing in the `replace_order_preserving_variants` method during the `EnforceDistribution` optimizer. ## What changes are included in this PR? To avoid missing `fetch` ## Are these changes tested? Yes, ut ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Update to 2024 edition [datafusion]
Dandandan opened a new issue, #15804: URL: https://github.com/apache/datafusion/issues/15804 ### Is your feature request related to a problem or challenge? At some point we would like to migrate to the 2024 which contains some improvements to the language / compiler syntax and libraries, compile times (mainly doctests), etc. ### Describe the solution you'd like ``` > cargo update > cargo fix --edition ``` Followed by updating Rust version / edition and fixing issues ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Eliminate the function call in `xxx_or (e.g. unwrap_or("".to_string())` [datafusion]
Rachelint closed issue #15803: Eliminate the function call in `xxx_or (e.g. unwrap_or("".to_string())` URL: https://github.com/apache/datafusion/issues/15803 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] [Test] Upgrade to edition 2024 [datafusion]
Dandandan opened a new pull request, #15805: URL: https://github.com/apache/datafusion/pull/15805 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Eliminate the function call in `xxx_or (e.g. unwrap_or("".to_string())` [datafusion]
Rachelint commented on issue #15803: URL: https://github.com/apache/datafusion/issues/15803#issuecomment-2820522360 Sorry, due to network problem, this issue are created twice, just close this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore(deps): bump half from 2.5.0 to 2.6.0 [datafusion]
xudong963 merged PR #15806: URL: https://github.com/apache/datafusion/pull/15806 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] minor: `executor_shutdown_while_running` test has race condition [datafusion-ballista]
milenkovicm merged PR #1248: URL: https://github.com/apache/datafusion-ballista/pull/1248 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Suspicious slow test in Ballista [datafusion-ballista]
milenkovicm closed issue #235: Suspicious slow test in Ballista URL: https://github.com/apache/datafusion-ballista/issues/235 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Merging Statistics is slow [datafusion]
robert3005 opened a new issue, #15809: URL: https://github.com/apache/datafusion/issues/15809 ### Is your feature request related to a problem or challenge? If you have a datasource that reports a sum statistic then merging statistics for a file group is considerably slower when compared to cases without sum statistic. Profiling reveals it's mostly due to `Precision::add` doing a lot of extra work. ### Describe the solution you'd like Ideally we'd transpose the stats aggregation and aggregate them column wise instead of column group wise ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Support non UTF-8 in CSV files [datafusion]
houseme commented on issue #15756: URL: https://github.com/apache/datafusion/issues/15756#issuecomment-2820310701 > Or fully support utf8? 期待支持全部的utf-8[偷笑] -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] refactor filter pushdown apis [datafusion]
adriangb opened a new pull request, #15801: URL: https://github.com/apache/datafusion/pull/15801 (no comment) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]
adriangb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2053476865 ## datafusion/physical-optimizer/src/push_down_filter.rs: ## @@ -382,7 +383,7 @@ impl PhysicalOptimizerRule for PushdownFilter { context .transform_up(|node| { -if node.plan.as_any().downcast_ref::().is_some() { +if node.plan.as_any().downcast_ref::().is_some() || node.plan.as_any().downcast_ref::().is_some() { Review Comment: Needs fixing of some failing tests, cleanup of the plethora of helper methods I added and a lot of docs but here's the idea: https://github.com/apache/datafusion/pull/15801 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]
adriangb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2053476865 ## datafusion/physical-optimizer/src/push_down_filter.rs: ## @@ -382,7 +383,7 @@ impl PhysicalOptimizerRule for PushdownFilter { context .transform_up(|node| { -if node.plan.as_any().downcast_ref::().is_some() { +if node.plan.as_any().downcast_ref::().is_some() || node.plan.as_any().downcast_ref::().is_some() { Review Comment: Needs fixing of some failing tests, cleanup of the plethora of helper methods I added and a lot of docs but here's the idea: https://github.com/apache/datafusion/pull/15801. The points are: - No downcast matching / hardcoding of implementations - Only recurses once / no retrying - Does no cloning / copying for branches that have no changes - Doesn't insert new operators -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] [Test] Upgrade to edition 2024 [datafusion]
Dandandan closed pull request #15805: [Test] Upgrade to edition 2024 URL: https://github.com/apache/datafusion/pull/15805 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Eliminate the function call in `xxx_or (e.g. unwrap_or("".to_string())` [datafusion]
Rachelint commented on issue #15802: URL: https://github.com/apache/datafusion/issues/15802#issuecomment-2820528686 Recommend to do add it crate by crate(one pr for one crate), like what are done in https://github.com/apache/datafusion/issues/11143 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] chore(deps): bump half from 2.5.0 to 2.6.0 [datafusion]
dependabot[bot] opened a new pull request, #15806: URL: https://github.com/apache/datafusion/pull/15806 Bumps [half](https://github.com/VoidStarKat/half-rs) from 2.5.0 to 2.6.0. Release notes Sourced from https://github.com/VoidStarKat/half-rs/releases";>half's releases. 2.6.0 Changed Fixed some incorrect minimum supported versions of dependencies that weren't caught due to improper Cargo.lock: num-traits 0.2.14 -> 0.2.16 zerocopy 0.8.0 -> 0.8.23 arbitrary 1.3.2 -> 1.4.1 Added f16 and bf16 now implement Immutable and KnownLayout for zerocopy crate. By [https://github.com/usamoi";>@usamoi]. Changelog Sourced from https://github.com/VoidStarKat/half-rs/blob/main/CHANGELOG.md";>half's changelog. [2.6.0] - 2024-04-08 Changed Fixed some incorrect minimum supported versions of dependencies that weren't caught due to improper Cargo.lock: num-traits 0.2.14 -> 0.2.16 zerocopy 0.8.0 -> 0.8.23 arbitrary 1.3.2 -> 1.4.1 Added f16 and bf16 now implement Immutable and KnownLayout for zerocopy crate. By [https://github.com/usamoi";>@usamoi]. Commits https://github.com/VoidStarKat/half-rs/commit/b1894e955031e0b1be2b06e5bbab4c3cdc65e883";>b1894e9 Bump version to 2.6.0 https://github.com/VoidStarKat/half-rs/commit/ec91af59c8db3de1e7d64be5bd55c9d38ac04a8b";>ec91af5 Fix min dependency versions https://github.com/VoidStarKat/half-rs/commit/bee721b27a8f38d506fc3c6f66df99111a2f0bc1";>bee721b update changelog https://github.com/VoidStarKat/half-rs/commit/2ba131f218a5dde62f8f04d3e473481db79e918e";>2ba131f Merge pull request https://redirect.github.com/VoidStarKat/half-rs/issues/125";>#125 from tensorchord/2.5.0-zerocopy https://github.com/VoidStarKat/half-rs/commit/55c36c1b2c37e9db917383773f29c107e1f2d849";>55c36c1 Update unmaintained/outdated CI actions https://github.com/VoidStarKat/half-rs/commit/7580fd6ed13261dbde0cfcc982469a2b6047e7e4";>7580fd6 Update repository link https://github.com/VoidStarKat/half-rs/commit/9dc2559e95f295aaa8ffef5e0529167731dffd47";>9dc2559 derive Immutable and KnownLayout for f16 and bf16 See full diff in https://github.com/VoidStarKat/half-rs/compare/v2.5.0...v2.6.0";>compare view [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Feature/benchmark config from env [datafusion]
ctsk commented on PR #15782: URL: https://github.com/apache/datafusion/pull/15782#issuecomment-2820537298 I can't see an easy way to check what environment variables were actually picked up. I've opted to add some logging to `ConfigOptions::from_env` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Add `or_fun_call` and `unnecessary_lazy_evaluations` lints on `core` [datafusion]
Rachelint opened a new pull request, #15807: URL: https://github.com/apache/datafusion/pull/15807 ## Which issue does this PR close? Part of #15802 - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
jayzhan211 commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2053975043 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -5006,7 +5006,7 @@ SELECT column5, avg(column1) FROM d GROUP BY column5; query I?? SELECT column5, column1, avg(column1) OVER (PARTITION BY column5 ORDER BY column1 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) as window_avg -FROM d WHERE column1 IS NOT NULL; +FROM d WHERE column1 IS NOT NULL ORDER BY column5 DESC; Review Comment: You probably need to rebase, I find the result in main branch is the same as your ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -5006,7 +5006,7 @@ SELECT column5, avg(column1) FROM d GROUP BY column5; query I?? SELECT column5, column1, avg(column1) OVER (PARTITION BY column5 ORDER BY column1 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) as window_avg -FROM d WHERE column1 IS NOT NULL; +FROM d WHERE column1 IS NOT NULL ORDER BY column5 DESC; Review Comment: You probably need to rebase, I find the result in main branch is the same as yours -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
jayzhan211 commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2053975043 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -5006,7 +5006,7 @@ SELECT column5, avg(column1) FROM d GROUP BY column5; query I?? SELECT column5, column1, avg(column1) OVER (PARTITION BY column5 ORDER BY column1 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) as window_avg -FROM d WHERE column1 IS NOT NULL; +FROM d WHERE column1 IS NOT NULL ORDER BY column5 DESC; Review Comment: You probably need to rebase. partition by is similar to group by, so you won't get the deterministic result. You need `rowsort` for this kind of query https://github.com/apache/datafusion/blob/449c744c8369ecc3436d39010ceb70b271389c5f/datafusion/sqllogictest/test_files/aggregate.slt#L5008-L5015 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
jayzhan211 commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2053975043 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -5006,7 +5006,7 @@ SELECT column5, avg(column1) FROM d GROUP BY column5; query I?? SELECT column5, column1, avg(column1) OVER (PARTITION BY column5 ORDER BY column1 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) as window_avg -FROM d WHERE column1 IS NOT NULL; +FROM d WHERE column1 IS NOT NULL ORDER BY column5 DESC; Review Comment: You probably need to rebase, I find the result in main branch is the same as yours https://github.com/apache/datafusion/blob/449c744c8369ecc3436d39010ceb70b271389c5f/datafusion/sqllogictest/test_files/aggregate.slt#L5008-L5015 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
jayzhan211 commented on code in PR #15769: URL: https://github.com/apache/datafusion/pull/15769#discussion_r2053975043 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -5006,7 +5006,7 @@ SELECT column5, avg(column1) FROM d GROUP BY column5; query I?? SELECT column5, column1, avg(column1) OVER (PARTITION BY column5 ORDER BY column1 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) as window_avg -FROM d WHERE column1 IS NOT NULL; +FROM d WHERE column1 IS NOT NULL ORDER BY column5 DESC; Review Comment: You probably need to rebase https://github.com/apache/datafusion/blob/449c744c8369ecc3436d39010ceb70b271389c5f/datafusion/sqllogictest/test_files/aggregate.slt#L5008-L5015 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Introduce Async User Defined Functions [datafusion]
berkaysynnada commented on PR #14837: URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2821209325 What's the status of this PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix Infer prepare statement type tests [datafusion]
qstommyshu commented on code in PR #15743: URL: https://github.com/apache/datafusion/pull/15743#discussion_r2054082743 ## datafusion/sql/tests/sql_integration.rs: ## @@ -4673,16 +4675,17 @@ fn test_infer_types_from_predicate() { } #[test] -fn test_infer_types_from_between_predicate() { -let sql = "SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; +fn test_prepare_statement_infer_types_from_between_predicate() { +let sql = "PREPARE my_plan AS SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; let plan = logical_plan(sql).unwrap(); assert_snapshot!( plan, @r#" -Projection: person.id, person.age - Filter: person.age BETWEEN $1 AND $2 -TableScan: person +Prepare: "my_plan" [] Review Comment: Hi @brayanjuls , the reason why this happen is there is no parameters being passed to the sql, but the sql itself contains positional arguments like `$1` and `$2`, please refer to https://datafusion.apache.org/user-guide/sql/prepared_statements.html#inferred-types. The simple way you can fix it is by adding the position arguments type to the `sql` like how it was done in [`test_prepare_statement_to_plan_one_param`](https://github.com/apache/datafusion/blob/2f7cc9beb15c467320e5baf75def0d79f9558ee2/datafusion/sql/tests/sql_integration.rs#L4828-L4855). If you want to dig into the details about it, you can check [how planner parse `PREPARE` statement](https://github.com/apache/datafusion/blob/2f7cc9beb15c467320e5baf75def0d79f9558ee2/datafusion/sql/src/planner.rs#L194-L208) and transfer it into a logical plan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
andygrove commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054367997 ## spark/src/main/scala/org/apache/comet/DataTypeSupport.scala: ## @@ -37,8 +39,7 @@ trait DataTypeSupport { private def isGloballySupported(dt: DataType): Boolean = dt match { case ByteType | ShortType -if CometSparkSessionExtensions.usingDataFusionParquetExec(SQLConf.get) && - !CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get() => +if usingParquetExecWithIncompatTypes(CometConf.COMET_NATIVE_SCAN_IMPL.get(SQLConf.get)) => Review Comment: This config could change during the Spark session lifetime (we often change it during tests, for example) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add Extension Type / Metadata support for Scalar UDFs [datafusion]
timsaucer commented on PR #15646: URL: https://github.com/apache/datafusion/pull/15646#issuecomment-2821605182 > Hi @timsaucer. I'd like to review this before getting merged. If it's not blocking anything, would it be okay to wait until weekend? I would greatly appreciate the review, and I think it is worth waiting to get another look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
comphead commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054307233 ## spark/src/main/scala/org/apache/comet/DataTypeSupport.scala: ## @@ -37,8 +39,7 @@ trait DataTypeSupport { private def isGloballySupported(dt: DataType): Boolean = dt match { case ByteType | ShortType -if CometSparkSessionExtensions.usingDataFusionParquetExec(SQLConf.get) && - !CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get() => +if usingParquetExecWithIncompatTypes(CometConf.COMET_NATIVE_SCAN_IMPL.get(SQLConf.get)) => Review Comment: can we create a global lazy val to reduce number of `CometConf.COMET_NATIVE_SCAN_IMPL.get(SQLConf.get)` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Pruning of floating point Parquet columns is incorrect when `NaN` is present [datafusion]
etseidl opened a new issue, #15812: URL: https://github.com/apache/datafusion/issues/15812 ### Describe the bug This was mentioned in https://github.com/apache/datafusion/issues/15742#issuecomment-2815595171 and discussed in detail in https://github.com/apache/parquet-format/pull/221, but datafusion is over-aggressive in pruning floating point columns. The issue appears with predicates of the form `x [gt|lt] literal`. Consider a column consisting of `[1.0, 0.0, -1.0, NaN, -2.0]`, the max will be 1 and the min -2. A query like `select * from ... where x > 2` will return no rows because no chunk exists where `max > 2`. ### To Reproduce ```sql > select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > arrow_cast(2.0, 'Float16'); +---+ | x | +---+ +---+ 0 row(s) fetched. ``` ### Expected behavior The above query should return a single row containing `NaN`. ### Additional context The Parquet community is considering changes to allow for `NaN` in statistics, with the currently favored approach being adding a new `ColumnOrder` to the specification. This will correct the issue above, but datafusion will need to check the `ColumnOrder` to know whether or not floating point statistics can be trusted. Also note that if/when https://github.com/apache/parquet-format/pull/221 is merged, other predicates such as `isnan(x)` might be candidates for pruning, but that is an optimization. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Introduce Async User Defined Functions [datafusion]
goldmedal commented on PR #14837: URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2821640679 > What's the status of this PR? It's ready to review. I'm still waiting for someone to help review it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: match Maven plugin versions with Spark 3.5 [datafusion-comet]
andygrove commented on PR #1668: URL: https://github.com/apache/datafusion-comet/pull/1668#issuecomment-2821464372 CI builds are failing with errors such as: ``` [error] /__w/datafusion-comet/datafusion-comet/apache-spark/core/target/scala-2.13/src_managed/main/org/apache/spark/status/protobuf/StoreTypes.java:12908:20: object protobuf is not a member of package com.google [error] com.google.protobuf.ExtensionRegistryLite extensionRegistry) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
comphead commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054555242 ## spark/src/main/scala/org/apache/comet/DataTypeSupport.scala: ## @@ -37,8 +39,7 @@ trait DataTypeSupport { private def isGloballySupported(dt: DataType): Boolean = dt match { case ByteType | ShortType -if CometSparkSessionExtensions.usingDataFusionParquetExec(SQLConf.get) && - !CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get() => +if usingParquetExecWithIncompatTypes(CometConf.COMET_NATIVE_SCAN_IMPL.get(SQLConf.get)) => Review Comment: fair enough, it can be a function. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Pruning of floating point Parquet columns is incorrect when `NaN` is present [datafusion]
etseidl commented on issue #15812: URL: https://github.com/apache/datafusion/issues/15812#issuecomment-2822027266 > Where is `ColumnOrder` available? In theory we have access to the full metadata of the parquet file when building the pruning predicate batches / predicate. It's an array in the `FileMetaData`, so defined once for the entire file. I've been trying to trace where the plans get built from, and it seems like there are two paths...one from `ParquetSource::with_predicate`, which seems to happen before the file is opened, and then the path from `ParquetOpener::open` where we do have access to the file metadata. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Pruning of floating point Parquet columns is incorrect when `NaN` is present [datafusion]
adriangb commented on issue #15812: URL: https://github.com/apache/datafusion/issues/15812#issuecomment-2822044138 Long term I think it will only happen in `ParquetOpener::open` https://github.com/apache/datafusion/pull/15769 precisely for reasons like this 😄 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
hsiang-c commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054576057 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -161,10 +161,28 @@ class CometSparkSessionExtensions } if (!schemaSupported) { -fallbackReasons += s"Unsupported schema ${scanExec.requiredSchema} for $scanImpl" +val field: Option[StructField] = + CometNativeScanExec +.findUnsupportedField(scanExec.requiredSchema) + .orElse(CometScanExec.findUnsupportedField(scanExec.requiredSchema)) +assert(field.nonEmpty) +fallbackReasons += + s"Unsupported field ${field.get} in schema ${scanExec.requiredSchema} for $scanImpl" Review Comment: @comphead Thanks for the feedback. Perhaps the field itself is sufficient? It has field name and filed type. Here is an example: `StructField(weight_kg,ArrayType(ShortType,true),true)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Inline table scan drops projection [datafusion]
vadimpiven opened a new issue, #15810: URL: https://github.com/apache/datafusion/issues/15810 ### Describe the bug PR https://github.com/apache/datafusion/pull/15201 introduced a bug here https://github.com/apache/datafusion/blob/9730404028a91a7fe875ea3f88bafdbcb305ae6c/datafusion/expr/src/logical_plan/builder.rs#L501 - there is a check for filters, but there is no check for projection. As a result - projection is dropped. ### To Reproduce - Create `TableView` over `EmptyRelation` with 2 columns - Call `LogicalPlanBuilder::scan` with projection `[0]` - Call `println!("{plan}")` and check that there is no projection ### Expected behavior Projection must be preserved ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] chore: Make Aggregate transformation more compact [datafusion-comet]
EmilyMatt opened a new pull request, #1670: URL: https://github.com/apache/datafusion-comet/pull/1670 ## Which issue does this PR close? Does not close anything, but is probably a step towards #1669. ## Rationale for this change There is needless nesting and variable assignment, since this is a huge function already we probably want to keep things compact. Checking the two boolean conditions with an or operator feels a bit cleaner ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add `or_fun_call` and `unnecessary_lazy_evaluations` lints on `core` [datafusion]
Rachelint merged PR #15807: URL: https://github.com/apache/datafusion/pull/15807 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add `or_fun_call` and `unnecessary_lazy_evaluations` lints on `core` [datafusion]
Rachelint commented on PR #15807: URL: https://github.com/apache/datafusion/pull/15807#issuecomment-2821213910 @jayzhan211 @2010YOUY01 Thanks for reviewing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Allow UDFs to return custom `Diagnostic` [datafusion]
eliaperantoni commented on issue #15276: URL: https://github.com/apache/datafusion/issues/15276#issuecomment-2821421644 > [@eliaperantoni](https://github.com/eliaperantoni) I had a follow up question to this regarding the `diagnose` trait function. We want to call the diagnose function during logical planning and pass in `ReturnTypeArgs` and `FnCallSpans` so that the implementor can create a `Diagnostic`. In [#14430](https://github.com/apache/datafusion/issues/14430), the `Diagnostic` is attached to an existing error (via the `with_diagnostic` method) and is then bubbled up. Should the `diagnose` trait function act similarly here? For example, would the `diagnose` function be called [here](https://github.com/apache/datafusion/blob/19a1e58071cd60e74faabab805d97fd269418925/datafusion/expr/src/expr_schema.rs#L435) for a Scalar UDF? And inside the `diagnose` function is where a Scalar UDF implementation would add notes to enrich the error message? Since we're passing in `ReturnTypeArgs` and `FnCallSpans` we can use `ReturnTypeArgs` to determine the erroneous argument and `FnCallSpans` to highlight wher e it is located. Is that right? @jsai28 I'm extremely sorry for the delay 🙏 I've been very busy with other things. But everything you said is absolutely correct. The only think I'd point out is that I think it would be the `diagnose` function itself that creates an error, because otherwise there's none. And then `diagnose` would attach one or more `Diagnostic` to that error. This means that the programmer of a custom UDF has to build and return an error of some kind, even though they might just be interested in creating `Diagnostic`s since that's what ends up being seen by the user. Perhaps we could change the return type of `diagnose` to be `Vec`, and if a non empty vec is returned, then we create a standard error (eg. `DataFusionError::Plan("udf is invalid")`) and attach the diagnostics to it? What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Inline table scan drops projection [datafusion]
vadimpiven commented on issue #15810: URL: https://github.com/apache/datafusion/issues/15810#issuecomment-2821420869 @qstommyshu I have pushed the fix https://github.com/apache/datafusion/pull/15811 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: transfer Apache Spark runtime conf to native engine [datafusion-comet]
andygrove commented on code in PR #1649: URL: https://github.com/apache/datafusion-comet/pull/1649#discussion_r2054186718 ## native/core/src/execution/jni_api.rs: ## @@ -198,6 +199,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_createPlan( task_attempt_id: jlong, debug_native: jboolean, explain_native: jboolean, +spark_conf: JObject, Review Comment: An alternate approach could be to encode the Spark config in protobuf format and pass those bytes into native code rather than have native code make calls to the JVM to iterate over the map. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Add option to adjust writer buffer size for query output [datafusion]
m09526 commented on code in PR #15747: URL: https://github.com/apache/datafusion/pull/15747#discussion_r2054291448 ## datafusion/datasource/src/write/mod.rs: ## @@ -88,6 +91,21 @@ pub async fn create_writer( file_compression_type.convert_async_writer(buf_writer) } +/// Returns an [`AsyncWrite`] which writes to the given object store location Review Comment: Thank you for the review @alamb, I'll get on to those improvements. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add Extension Type / Metadata support for Scalar UDFs [datafusion]
paleolimbot commented on code in PR #15646: URL: https://github.com/apache/datafusion/pull/15646#discussion_r2054292174 ## datafusion/core/tests/user_defined/user_defined_scalar_functions.rs: ## @@ -1367,3 +1372,342 @@ async fn register_alltypes_parquet(ctx: &SessionContext) -> Result<()> { async fn plan_and_collect(ctx: &SessionContext, sql: &str) -> Result> { ctx.sql(sql).await?.collect().await } + +#[derive(Debug)] +struct MetadataBasedUdf { +name: String, +signature: Signature, +metadata: HashMap, +} + +impl MetadataBasedUdf { +fn new(metadata: HashMap) -> Self { +// The name we return must be unique. Otherwise we will not call distinct +// instances of this UDF. This is a small hack for the unit tests to get unique +// names, but you could do something more elegant with the metadata. +let name = format!("metadata_based_udf_{}", metadata.len()); +Self { +name, +signature: Signature::exact(vec![DataType::UInt64], Volatility::Immutable), +metadata, +} +} +} + +impl ScalarUDFImpl for MetadataBasedUdf { +fn as_any(&self) -> &dyn Any { +self +} + +fn name(&self) -> &str { +&self.name +} + +fn signature(&self) -> &Signature { +&self.signature +} + +fn return_type(&self, _args: &[DataType]) -> Result { +unimplemented!( +"this should never be called since return_field_from_args is implemented" +); +} + +fn return_field_from_args(&self, _args: ReturnFieldArgs) -> Result { +Ok(Field::new(self.name(), DataType::UInt64, true) +.with_metadata(self.metadata.clone())) +} Review Comment: Thanks - I'll do testing as well on our suite when this merges and see if we can remove our internal workaround 🙂 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
comphead commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054294073 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -161,10 +161,28 @@ class CometSparkSessionExtensions } if (!schemaSupported) { -fallbackReasons += s"Unsupported schema ${scanExec.requiredSchema} for $scanImpl" +val field: Option[StructField] = + CometNativeScanExec +.findUnsupportedField(scanExec.requiredSchema) + .orElse(CometScanExec.findUnsupportedField(scanExec.requiredSchema)) +assert(field.nonEmpty) Review Comment: do we need this asserts? it can lead to runtime exceptions -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
comphead commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054295647 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -161,10 +161,28 @@ class CometSparkSessionExtensions } if (!schemaSupported) { -fallbackReasons += s"Unsupported schema ${scanExec.requiredSchema} for $scanImpl" +val field: Option[StructField] = + CometNativeScanExec +.findUnsupportedField(scanExec.requiredSchema) + .orElse(CometScanExec.findUnsupportedField(scanExec.requiredSchema)) +assert(field.nonEmpty) +fallbackReasons += + s"Unsupported field ${field.get} in schema ${scanExec.requiredSchema} for $scanImpl" Review Comment: perhaps it would be easier to specify data type as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Pruning of floating point Parquet columns is incorrect when `NaN` is present [datafusion]
adriangb commented on issue #15812: URL: https://github.com/apache/datafusion/issues/15812#issuecomment-2821897604 I'm not immediately sure. Is the point that the result of `max(2.0, NaN)` depends on how you define the ordering of floating point numbers wrt NaN, which has two variations? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Pruning of floating point Parquet columns is incorrect when `NaN` is present [datafusion]
adriangb commented on issue #15812: URL: https://github.com/apache/datafusion/issues/15812#issuecomment-2821901778 If so the simplest short term solution would be to not write stats for containers that have NaN. At least results would then be correct. How do we handle this with nulls 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix Infer prepare statement type tests [datafusion]
qstommyshu commented on code in PR #15743: URL: https://github.com/apache/datafusion/pull/15743#discussion_r2054496588 ## datafusion/sql/tests/sql_integration.rs: ## @@ -4673,16 +4675,17 @@ fn test_infer_types_from_predicate() { } #[test] -fn test_infer_types_from_between_predicate() { -let sql = "SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; +fn test_prepare_statement_infer_types_from_between_predicate() { +let sql = "PREPARE my_plan AS SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; let plan = logical_plan(sql).unwrap(); assert_snapshot!( plan, @r#" -Projection: person.id, person.age - Filter: person.age BETWEEN $1 AND $2 -TableScan: person +Prepare: "my_plan" [] Review Comment: I didn't check the details for parsing "PREPARE" statement yet, but I think the type inference behaviour depends on how the planner parser is implemented, it could be either way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix Infer prepare statement type tests [datafusion]
qstommyshu commented on code in PR #15743: URL: https://github.com/apache/datafusion/pull/15743#discussion_r2054496588 ## datafusion/sql/tests/sql_integration.rs: ## @@ -4673,16 +4675,17 @@ fn test_infer_types_from_predicate() { } #[test] -fn test_infer_types_from_between_predicate() { -let sql = "SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; +fn test_prepare_statement_infer_types_from_between_predicate() { +let sql = "PREPARE my_plan AS SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; let plan = logical_plan(sql).unwrap(); assert_snapshot!( plan, @r#" -Projection: person.id, person.age - Filter: person.age BETWEEN $1 AND $2 -TableScan: person +Prepare: "my_plan" [] Review Comment: I didn't check the details for "PREPARE" statement yet, but I think the type inference behaviour depends on how the planner parser is implemented, it could be either way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix Infer prepare statement type tests [datafusion]
qstommyshu commented on code in PR #15743: URL: https://github.com/apache/datafusion/pull/15743#discussion_r2054082743 ## datafusion/sql/tests/sql_integration.rs: ## @@ -4673,16 +4675,17 @@ fn test_infer_types_from_predicate() { } #[test] -fn test_infer_types_from_between_predicate() { -let sql = "SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; +fn test_prepare_statement_infer_types_from_between_predicate() { +let sql = "PREPARE my_plan AS SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; let plan = logical_plan(sql).unwrap(); assert_snapshot!( plan, @r#" -Projection: person.id, person.age - Filter: person.age BETWEEN $1 AND $2 -TableScan: person +Prepare: "my_plan" [] Review Comment: Hi @brayanjuls , I've taken a look at your tests. The reason why this happen is there is no parameters being passed to the sql, but the sql itself contains positional arguments like `$1` and `$2`, please refer to https://datafusion.apache.org/user-guide/sql/prepared_statements.html#inferred-types. The simple way you can fix it is by adding the position arguments type to the `sql` like how it was done in [`test_prepare_statement_to_plan_one_param`](https://github.com/apache/datafusion/blob/main/datafusion/sql/tests/sql_integration.rs). If you want to dig into the details about it, you can check [how planner parse `PREPARE` statement](https://github.com/apache/datafusion/blob/main/datafusion/sql/src/planner.rs) and transfer it into a logical plan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix Infer prepare statement type tests [datafusion]
qstommyshu commented on code in PR #15743: URL: https://github.com/apache/datafusion/pull/15743#discussion_r2054082743 ## datafusion/sql/tests/sql_integration.rs: ## @@ -4673,16 +4675,17 @@ fn test_infer_types_from_predicate() { } #[test] -fn test_infer_types_from_between_predicate() { -let sql = "SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; +fn test_prepare_statement_infer_types_from_between_predicate() { +let sql = "PREPARE my_plan AS SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; let plan = logical_plan(sql).unwrap(); assert_snapshot!( plan, @r#" -Projection: person.id, person.age - Filter: person.age BETWEEN $1 AND $2 -TableScan: person +Prepare: "my_plan" [] Review Comment: Hi @brayanjuls , I've taken a look at your tests. The reason why this happen is there is no parameters being passed to the sql, but the sql itself contains positional arguments like `$1` and `$2`, please refer to https://datafusion.apache.org/user-guide/sql/prepared_statements.html#inferred-types. The simple way you can fix it is by adding the position arguments type to the `sql` like how it was done in [`test_prepare_statement_to_plan_one_param`](https://github.com/apache/datafusion/blob/2f7cc9beb15c467320e5baf75def0d79f9558ee2/datafusion/sql/tests/sql_integration.rs#L4828-L4855). If you want to dig into the details about it, you can check [how planner parse `PREPARE` statement](https://github.com/apache/datafusion/blob/2f7cc9beb15c467320e5baf75def0d79f9558ee2/datafusion/sql/src/planner.rs#L194-L208) and transfer it into a logical plan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Inline table scan drops projection [datafusion]
qstommyshu commented on issue #15810: URL: https://github.com/apache/datafusion/issues/15810#issuecomment-2821345483 I can take a look -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Preserve projection for inline scan [datafusion]
xudong963 commented on code in PR #15811: URL: https://github.com/apache/datafusion/pull/15811#discussion_r2054259450 ## datafusion/expr/src/logical_plan/builder.rs: ## @@ -498,7 +498,7 @@ impl LogicalPlanBuilder { TableScan::try_new(table_name, table_source, projection, filters, fetch)?; // Inline TableScan -if table_scan.filters.is_empty() { +if table_scan.projection.is_none() && table_scan.filters.is_empty() { Review Comment: Thank you, nice find! I'm wondering if we can still inline `TableScan` and wrap `Projection` on top of the view's logical plan. In this way, maybe the plan will be easier to optimize in the optimizer. -- This can be a further experiment, I think the PR is good to go. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add Extension Type / Metadata support for Scalar UDFs [datafusion]
berkaysynnada commented on PR #15646: URL: https://github.com/apache/datafusion/pull/15646#issuecomment-2821563176 Hi @timsaucer. I'd like to review this before getting merged. If it's not blocking anything, would it be okay to wait until weekend? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Investigate unstable benchmark results on macOS [datafusion-comet]
mbutrovich commented on issue #1648: URL: https://github.com/apache/datafusion-comet/issues/1648#issuecomment-2821858616 I've also only ever used jemalloc in the past. I'm not sure what the discussion was at the time to pursue mimalloc for Comet. @mdcallag has some recent writing on the topic, althrough RocksDB is a different workload than what we're doing: https://smalldatum.blogspot.com/2025/04/battle-of-mallocators.html https://smalldatum.blogspot.com/2025/04/battle-of-mallocators-part-2.html The Germans :de: did a bake-off a few years ago and they chose jemalloc for Umbra (and likely CedarDB): https://www.adms-conf.org/2019-camera-ready/durner_adms19.pdf -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Investigate unstable benchmark results on macOS [datafusion-comet]
mbutrovich commented on issue #1648: URL: https://github.com/apache/datafusion-comet/issues/1648#issuecomment-2821860775 > However, I'm uncertain whether jemalloc can still be used with LD_PRELOAD when Comet is compiled with mimalloc. The last time I attempted to dynamically override malloc it didn't work. Yeah I don't think you want to mix and match like that. Does LD_PRELOAD also change the allocator for the JVM? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Investigate unstable benchmark results on macOS [datafusion-comet]
Kontinuation commented on issue #1648: URL: https://github.com/apache/datafusion-comet/issues/1648#issuecomment-2821878876 > Does LD_PRELOAD also change the allocator for the JVM? Memory allocated by `Unsafe_AllocateMemory0` (Arrow native memory, Spark off-heap memory) uses the allocator. The large JVM heap regions seems to be directly allocated using syscalls such as mmap, which is not affected by allocator. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Refactor CometSparkSessionExtensions.scala [datafusion-comet]
EmilyMatt opened a new issue, #1669: URL: https://github.com/apache/datafusion-comet/issues/1669 ### What is the problem the feature request solves? This is a huge file, containing multiple rules, I think it can benefit from splitting it into smaller, more manageable files. It can probably also bear a general cleanup. ### Describe the potential solution There is already a rules package containing the "RewriteMergeSort" rule, I think we can add a ScanRule file, and ExecRule file, and move the relevant code to each file respectively. A general refactor in both rules will probably be welcome as well. ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: match Maven plugin versions with Spark 3.5 [datafusion-comet]
codecov-commenter commented on PR #1668: URL: https://github.com/apache/datafusion-comet/pull/1668#issuecomment-2821568973 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1668?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report All modified and coverable lines are covered by tests :white_check_mark: > Project coverage is 57.43%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`d92c255`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/d92c2559d12be4e776df6c07301489b1019debcb?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 156 commits behind head on main. Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1668 +/- ## + Coverage 56.12% 57.43% +1.30% - Complexity 976 1021 +45 Files 119 125 +6 Lines 1174312508 +765 Branches 2251 2354 +103 + Hits 6591 7184 +593 - Misses 4012 4136 +124 - Partials 1140 1188 +48 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1668?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :rocket: New features to boost your workflow: - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add union_tag scalar function [datafusion]
Omega359 commented on PR #14687: URL: https://github.com/apache/datafusion/pull/14687#issuecomment-2821576525 @alamb, thoughts on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Emit warning with Diagnostic when doing = Null [datafusion]
eliaperantoni commented on PR #15696: URL: https://github.com/apache/datafusion/pull/15696#issuecomment-2821404985 > The warning detection is integrated during `BinaryExpr` processing, which should naturally limit it to predicate contexts. Statements like `UPDATE users SET password = NULL` won't trigger false warnings by default. Could you confirm this approach is acceptable? Absolutely, that sounds good. And very thoughtful of you to check that `UPDATE` is not affected :) > In usual cases, the left operand is an `Identifier`. The current implementation combines the identifier's left span with NULL's right span for precise highlighting. For rare non-identifier cases (e.g., some complex expressions that I can't immediately come up with one right now), we fall back to using just the NULL span. This balances precision with robustness. Nice! 💯 > While I've added tests for basic and multiple `= NULL` cases, could you suggest any edge cases or additional scenarios that should be validated? Perhaps accessing fields of structs? eg: ```sql SELECT get_field({'x': null}, 'x') = null; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Preserve projection for inline scan [datafusion]
vadimpiven opened a new pull request, #15811: URL: https://github.com/apache/datafusion/pull/15811 ## Which issue does this PR close? - Closes #15810 ## Rationale for this change PR https://github.com/apache/datafusion/pull/15201 introduced a bug here https://github.com/apache/datafusion/blob/9730404028a91a7fe875ea3f88bafdbcb305ae6c/datafusion/expr/src/logical_plan/builder.rs#L501 - there is a check for filters, but there is no check for projection. As a result - projection is dropped. ## What changes are included in this PR? This PR fixes incorrect if condition. ## Are these changes tested? Introduced unit test covering the change. ## Are there any user-facing changes? No user-facing changes. This PR reverts fixes API brake introduced in 47.0.0 release. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Investigate unstable benchmark results on macOS [datafusion-comet]
Dandandan commented on issue #1648: URL: https://github.com/apache/datafusion-comet/issues/1648#issuecomment-2821590325 FYI: We found mimalloc to be unstable for running long term (keeps memory allocated but seems doesn't release it over time => running into OOMs much more easily). Switching to jemalloc fixed this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
andygrove commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054660077 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -161,10 +161,28 @@ class CometSparkSessionExtensions } if (!schemaSupported) { -fallbackReasons += s"Unsupported schema ${scanExec.requiredSchema} for $scanImpl" +val field: Option[StructField] = + CometNativeScanExec +.findUnsupportedField(scanExec.requiredSchema) + .orElse(CometScanExec.findUnsupportedField(scanExec.requiredSchema)) +assert(field.nonEmpty) +fallbackReasons += + s"Unsupported field ${field.get} in schema ${scanExec.requiredSchema} for $scanImpl" Review Comment: @comphead @parthchandra @mbutrovich any thoughts on the approach we should take here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
codecov-commenter commented on PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#issuecomment-2822167010 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1667?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `82.60870%` with `4 lines` in your changes missing coverage. Please review. > Project coverage is 57.48%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`fbc1cf1`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/fbc1cf1c3e8e6fbe511ad36bbdbc5c3f2a225db1?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 156 commits behind head on main. | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1667?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [...org/apache/comet/CometSparkSessionExtensions.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1667?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2FCometSparkSessionExtensions.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9Db21ldFNwYXJrU2Vzc2lvbkV4dGVuc2lvbnMuc2NhbGE=) | 82.35% | [1 Missing and 2 partials :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1667?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | | [.../main/scala/org/apache/comet/DataTypeSupport.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1667?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2FDataTypeSupport.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9EYXRhVHlwZVN1cHBvcnQuc2NhbGE=) | 80.00% | [0 Missing and 1 partial :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1667?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1667 +/- ## + Coverage 56.12% 57.48% +1.35% - Complexity 976 1021 +45 Files 119 125 +6 Lines 1174312525 +782 Branches 2251 2357 +106 + Hits 6591 7200 +609 - Misses 4012 4136 +124 - Partials 1140 1189 +49 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1667?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :rocket: New features to boost your workflow: - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Refactor Memory Pools [datafusion-comet]
andygrove merged PR #1662: URL: https://github.com/apache/datafusion-comet/pull/1662 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Fix ILIKE expression support in SQL unparser [datafusion]
ewgenius opened a new pull request, #15820: URL: https://github.com/apache/datafusion/pull/15820 ## Which issue does this PR close? SQL Unparser incorrectly handles `ILIKE` expressions, handling all as `LIKE`, ignoring `case_insensitive` flag. ## Rationale for this change Case sensitivity shouldn't be ignored in `ILIKE` expressions, as it's a part of the SQL standard. ## What changes are included in this PR? Added correct handling for `case_insensitive` flag in `ILIKE` expressions in SQL unparser ## Are these changes tested? Yes, covered by additional SQL roundtrip test in plan_to_sql, fixed and extended expressions tests for LIKE/ILIKE. ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] pipe column orderings into pruning predicate creation [datafusion]
adriangb commented on code in PR #15821: URL: https://github.com/apache/datafusion/pull/15821#discussion_r2055221571 ## datafusion-examples/examples/advanced_parquet_index.rs: ## @@ -300,8 +300,11 @@ impl IndexTableProvider { // In this example, we use the PruningPredicate's literal guarantees to // analyze the predicate. In a real system, using // `PruningPredicate::prune` would likely be easier to do. -let pruning_predicate = -PruningPredicate::try_new(Arc::clone(predicate), self.schema())?; +let pruning_predicate = PruningPredicate::try_new( +Arc::clone(predicate), +self.schema(), +vec![ColumnOrdering::Unknown; self.schema().fields().len()], +)?; Review Comment: We could add a new signature to avoid API churn, but I wanted to make it explicit for now to see all of the callsites -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Pruning of floating point Parquet columns is incorrect when `NaN` is present [datafusion]
adriangb commented on issue #15812: URL: https://github.com/apache/datafusion/issues/15812#issuecomment-2823016671 @etseidl I opened https://github.com/apache/datafusion/pull/15821 to prove we can pipe the info in. Would you like to continue that PR since you seem to know what the right thing to do with that information is? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve documentation for `FileSource`, `DataSource` and `DataSourceExec` [datafusion]
adriangb commented on PR #15766: URL: https://github.com/apache/datafusion/pull/15766#issuecomment-2823040352 I didn't have time to do the initialization diagram but here is execution: ``` ┌─┐ │ │ │ DataSourceExec│ │ │ └─┘ │ │ ┌──▼──┐ │ │ │ DataSource │ │ │ └─┘ │ ┌──┴┐ ┌──▼──┐ ┌──▼──┐ │ │ │ │ │ FileScanConfig│ │ MemorySourceConfig │ │ │ │ │ └─┘ └─┘ │ │ │ ┌──▼──┐ │ │ │ FileSource │ │ │ └─┘ │ │ │ ┌──▼──┐ │ ArrowSource │ │ ... │ │ParquetSource│ └─┘ │ │ │ ┌──▼──┐ │ │ │ParquetOpener│ │ │ └─┘ │ │ │ ┌──▼──┐ │ │ │ RecordBatch Stream │ │ │ └─┘ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve documentation for `FileSource`, `DataSource` and `DataSourceExec` [datafusion]
adriangb commented on code in PR #15766: URL: https://github.com/apache/datafusion/pull/15766#discussion_r2055236859 ## datafusion/datasource/src/file.rs: ## @@ -37,9 +37,14 @@ use datafusion_physical_plan::DisplayFormatType; use object_store::ObjectStore; -/// Common file format behaviors needs to implement. +/// file format specific behaviors for elements in [`DataSource`] /// -/// See implementation examples such as `ParquetSource`, `CsvSource` +/// See more details on specific implementations: +/// * [`ArrowSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ArrowSource.html) +/// * [`AvroSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.AvroSource.html) +/// * [`CsvSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.CsvSource.html) +/// * [`JsonSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.JsonSource.html) +/// * [`ParquetSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ParquetSource.html) Review Comment: ```suggestion /// * [`ParquetSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ParquetSource.html) /// [`DataSource`]: crate::source::DataSource ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Improve documentation for `FileSource`, `DataSource` and `DataSourceExec` [datafusion]
adriangb commented on code in PR #15766: URL: https://github.com/apache/datafusion/pull/15766#discussion_r2055236859 ## datafusion/datasource/src/file.rs: ## @@ -37,9 +37,14 @@ use datafusion_physical_plan::DisplayFormatType; use object_store::ObjectStore; -/// Common file format behaviors needs to implement. +/// file format specific behaviors for elements in [`DataSource`] /// -/// See implementation examples such as `ParquetSource`, `CsvSource` +/// See more details on specific implementations: +/// * [`ArrowSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ArrowSource.html) +/// * [`AvroSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.AvroSource.html) +/// * [`CsvSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.CsvSource.html) +/// * [`JsonSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.JsonSource.html) +/// * [`ParquetSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ParquetSource.html) Review Comment: To fix cargo doc: ```suggestion /// * [`ParquetSource`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.ParquetSource.html) /// [`DataSource`]: crate::source::DataSource ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Pruning of floating point Parquet columns is incorrect when `NaN` is present [datafusion]
etseidl commented on issue #15812: URL: https://github.com/apache/datafusion/issues/15812#issuecomment-2823047999 Wow, thanks @adriangb! I'll start on it tomorrow! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Refactor CometSparkSessionExtensions.scala [datafusion-comet]
andygrove commented on issue #1669: URL: https://github.com/apache/datafusion-comet/issues/1669#issuecomment-2821372388 This sounds good to me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Memory limited nest loop join [datafusion]
UBarney commented on issue #15760: URL: https://github.com/apache/datafusion/issues/15760#issuecomment-2821374117 > If the right side is a table backed by a local file, the operator can scan it again without buffering or spilling. I previously thought that 'backed by a local file' and 'spilling' were the same thing. I find that the current implementation loads all results of `join(all_left_table, current_right_batch)` into memory at once. This could potentially contain up to `all_left_table.row_count() * current_right_batch.row_count()` rows in the worst case. https://github.com/apache/datafusion/blob/3fe300c1525d373e9d78c3830e0990204c854a6a/datafusion/physical-plan/src/joins/nested_loop_join.rs#L885-L895 I added some log to confirm that: ``` +let batch1 = result?; +dbg!(&batch1.num_rows()); + +self.batch_transformer.set_batch(batch1); ``` output ``` > select count(*) from (select * from generate_series(1,2) as t1 join generate_series(8194) as t2 on t1.value != t2.value); [datafusion/physical-plan/src/joins/nested_loop_join.rs:900:17] &batch1.num_rows() = 30003 [datafusion/physical-plan/src/joins/nested_loop_join.rs:900:17] &batch1.num_rows() = 81928192 ``` To optimize memory usage, we could potentially compute the join results streamly. Maybe in a separate coroutine and stream them through a channel. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Fix Infer prepare statement type tests [datafusion]
brayanjuls commented on code in PR #15743: URL: https://github.com/apache/datafusion/pull/15743#discussion_r2054252874 ## datafusion/sql/tests/sql_integration.rs: ## @@ -4673,16 +4675,17 @@ fn test_infer_types_from_predicate() { } #[test] -fn test_infer_types_from_between_predicate() { -let sql = "SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; +fn test_prepare_statement_infer_types_from_between_predicate() { +let sql = "PREPARE my_plan AS SELECT id, age FROM person WHERE age BETWEEN $1 AND $2"; let plan = logical_plan(sql).unwrap(); assert_snapshot!( plan, @r#" -Projection: person.id, person.age - Filter: person.age BETWEEN $1 AND $2 -TableScan: person +Prepare: "my_plan" [] Review Comment: but wouldn't that mean the type is not being inferred, but instead we are explicitly declaring it? Thanks, I would check the planner parser. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
comphead commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054587262 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -161,10 +161,28 @@ class CometSparkSessionExtensions } if (!schemaSupported) { -fallbackReasons += s"Unsupported schema ${scanExec.requiredSchema} for $scanImpl" +val field: Option[StructField] = + CometNativeScanExec +.findUnsupportedField(scanExec.requiredSchema) + .orElse(CometScanExec.findUnsupportedField(scanExec.requiredSchema)) +assert(field.nonEmpty) +fallbackReasons += + s"Unsupported field ${field.get} in schema ${scanExec.requiredSchema} for $scanImpl" Review Comment: Field sounds good to me, but now I'm wondering how it may look like if unsupported datatype like BYTE is deep nested field? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Parquet filter pushdown via new filter pushdown APIs [datafusion]
adriangb commented on PR #15769: URL: https://github.com/apache/datafusion/pull/15769#issuecomment-2822076779 https://github.com/apache/datafusion/issues/15812 surfaced another reason why building the predicates from the files schemas is necessary -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] refactor filter pushdown apis [datafusion]
adriangb commented on code in PR #15801: URL: https://github.com/apache/datafusion/pull/15801#discussion_r2054603061 ## datafusion/physical-plan/src/filter.rs: ## @@ -438,55 +440,112 @@ impl ExecutionPlan for FilterExec { try_embed_projection(projection, self) } -fn try_pushdown_filters( +fn gather_filters_for_pushdown( &self, -mut fd: FilterDescription, +parent_filters: &[Arc], _config: &ConfigOptions, -) -> Result>> { -// Extend the filter descriptions -fd.filters.push(Arc::clone(&self.predicate)); - -// Extract the information -let child_descriptions = vec![fd]; -let remaining_description = FilterDescription { filters: vec![] }; -let filter_input = Arc::clone(self.input()); - +) -> Result { if let Some(projection_indices) = self.projection.as_ref() { -// Push the filters down, but leave a ProjectionExec behind, instead of the FilterExec -let filter_child_schema = filter_input.schema(); -let proj_exprs = projection_indices +// We need to invert the projection on any referenced columns in the filter +// Create a mapping from the output columns to the input columns (the inverse of the projection) +let inverse_projection = projection_indices .iter() -.map(|p| { -let field = filter_child_schema.field(*p).clone(); -( -Arc::new(Column::new(field.name(), *p)) as Arc, -field.name().to_string(), -) +.enumerate() +.map(|(i, &p)| (i, p)) +.collect::>(); +let predicate = Arc::clone(&self.predicate) +.transform_up(|expr| { +if let Some(col) = expr.as_any().downcast_ref::() { +let index = col.index(); +let index_in_input_schema = +inverse_projection.get(&index).ok_or_else(|| { +DataFusionError::Internal(format!( +"Column {} not found in projection", +index +)) +})?; +Ok(Transformed::yes(Arc::new(Column::new( +col.name(), +*index_in_input_schema, +)) as _)) +} else { +Ok(Transformed::no(expr)) +} Review Comment: There's a bug with this inverting the projection logic that I haven't been able to spot... will continue later -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
andygrove commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054613906 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -161,10 +161,28 @@ class CometSparkSessionExtensions } if (!schemaSupported) { -fallbackReasons += s"Unsupported schema ${scanExec.requiredSchema} for $scanImpl" +val field: Option[StructField] = + CometNativeScanExec +.findUnsupportedField(scanExec.requiredSchema) + .orElse(CometScanExec.findUnsupportedField(scanExec.requiredSchema)) +assert(field.nonEmpty) +fallbackReasons += + s"Unsupported field ${field.get} in schema ${scanExec.requiredSchema} for $scanImpl" Review Comment: This PR introduces a two-step process: 1) call `isTypeSupported` 2) call `findUnsupportedField` if `isTypeSupported` returns `false` I wonder if we should consider a different approach so that do not have to duplicate the logic. One idea would be to pass an accumulator into `isTypeSupported` that can collect fallback reasons. For example: ```scala def isTypeSupported(dt: DataType, fallbackReasons: mutable.List[String]) ``` Another approach would be to rename `isTypeSupported` to `validateTypeSupported` and have it throw an exception when it encounters the first unsupported type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
andygrove commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054613906 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -161,10 +161,28 @@ class CometSparkSessionExtensions } if (!schemaSupported) { -fallbackReasons += s"Unsupported schema ${scanExec.requiredSchema} for $scanImpl" +val field: Option[StructField] = + CometNativeScanExec +.findUnsupportedField(scanExec.requiredSchema) + .orElse(CometScanExec.findUnsupportedField(scanExec.requiredSchema)) +assert(field.nonEmpty) +fallbackReasons += + s"Unsupported field ${field.get} in schema ${scanExec.requiredSchema} for $scanImpl" Review Comment: This PR introduces a two-step process: 1) call `isTypeSupported` 2) call `findUnsupportedField` if `isTypeSupported` returns `false` I wonder if we should consider a different approach so that we do not have to duplicate the logic. One idea would be to pass an accumulator into `isTypeSupported` that can collect fallback reasons. For example: ```scala def isTypeSupported(dt: DataType, fallbackReasons: mutable.List[String]) ``` Another approach would be to rename `isTypeSupported` to `validateTypeSupported` and have it throw an exception when it encounters the first unsupported type. ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -161,10 +161,28 @@ class CometSparkSessionExtensions } if (!schemaSupported) { -fallbackReasons += s"Unsupported schema ${scanExec.requiredSchema} for $scanImpl" +val field: Option[StructField] = + CometNativeScanExec +.findUnsupportedField(scanExec.requiredSchema) + .orElse(CometScanExec.findUnsupportedField(scanExec.requiredSchema)) +assert(field.nonEmpty) +fallbackReasons += + s"Unsupported field ${field.get} in schema ${scanExec.requiredSchema} for $scanImpl" Review Comment: This PR introduces a two-step process: 1) call `isTypeSupported` 2) call `findUnsupportedField` if `isTypeSupported` returns `false` I wonder if we should consider a different approach so that we do not have to duplicate the logic. One idea would be to pass an accumulator into `isTypeSupported` that can collect fallback reasons. For example: ```scala def isTypeSupported(dt: DataType, fallbackReasons: mutable.List[String]): Boolean ``` Another approach would be to rename `isTypeSupported` to `validateTypeSupported` and have it throw an exception when it encounters the first unsupported type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: More warning info for users [datafusion-comet]
andygrove commented on code in PR #1667: URL: https://github.com/apache/datafusion-comet/pull/1667#discussion_r2054613906 ## spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: ## @@ -161,10 +161,28 @@ class CometSparkSessionExtensions } if (!schemaSupported) { -fallbackReasons += s"Unsupported schema ${scanExec.requiredSchema} for $scanImpl" +val field: Option[StructField] = + CometNativeScanExec +.findUnsupportedField(scanExec.requiredSchema) + .orElse(CometScanExec.findUnsupportedField(scanExec.requiredSchema)) +assert(field.nonEmpty) +fallbackReasons += + s"Unsupported field ${field.get} in schema ${scanExec.requiredSchema} for $scanImpl" Review Comment: This PR introduces a two-step process: 1) call `isTypeSupported` 2) call `findUnsupportedField` if `isTypeSupported` returns `false` I wonder if we should consider a different approach so that we do not have to duplicate the logic. One idea would be to pass an accumulator into `isTypeSupported` that can collect fallback reasons. For example: ```scala def isTypeSupported(dt: DataType, fallbackReasons: mutable.ListBuffer[String]): Boolean ``` Another approach would be to rename `isTypeSupported` to `validateTypeSupported` and have it throw an exception when it encounters the first unsupported type. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Type coercion does not handle `Float16` correctly [datafusion]
etseidl opened a new issue, #15815: URL: https://github.com/apache/datafusion/issues/15815 ### Describe the bug Performing a query where a `Float16` column is compared to an integer literal results in the `Float16` being cast to `Int64`. ```sql > explain format indent select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > 2; +---+-+ | plan_type | plan | +---+-+ | logical_plan | Filter: CAST(parquet-testing/data/float16_nonzeros_and_nans.parquet.x AS Int64) > Int64(2) | | | TableScan: parquet-testing/data/float16_nonzeros_and_nans.parquet projection=[x], partial_filters=[CAST(parquet-testing/data/float16_nonzeros_and_nans.parquet.x AS Int64) > Int64(2)]| | physical_plan | CoalesceBatchesExec: target_batch_size=8192 | | | FilterExec: CAST(x@0 AS Int64) > 2 | | | RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1 | | | DataSourceExec: file_groups={1 group: [[Users/seidl/src/datafusion/parquet-testing/data/float16_nonzeros_and_nans.parquet]]}, projection=[x], file_type=parquet, predicate=CAST(x@0 AS Int64) > 2 | | | | +---+-+ ``` ### To Reproduce ``` ./target/debug/datafusion-cli -c "explain format indent select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > 2;" ``` ### Expected behavior This should instead cast `2` to `Float16`. ``` +---+---+ | plan_type | plan | +---+---+ | logical_plan | Filter: parquet-testing/data/float16_nonzeros_and_nans.parquet.x > Float16(2) | | | TableScan: parquet-testing/data/float16_nonzeros_and_nans.parquet projection=[x], partial_filters=[parquet-testing/data/float16_nonzeros_and_nans.parquet.x > Float16(2)] | | physical_plan | CoalesceBatchesExec: target_batch_size=8192 | | | FilterExec: x@0 > 2
Re: [PR] refactor filter pushdown apis [datafusion]
adriangb commented on PR #15801: URL: https://github.com/apache/datafusion/pull/15801#issuecomment-2822461601 cc @berkaysynnada -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: validate range_fn when align not found [datafusion-sqlparser-rs]
killme2008 closed pull request #1820: fix: validate range_fn when align not found URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1820 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: validate range_fn when align not found [datafusion-sqlparser-rs]
killme2008 opened a new pull request, #1820: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1820 Try to fix https://github.com/GreptimeTeam/greptimedb/issues/5957 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] fix: Add coercion rules for Float16 types [datafusion]
etseidl opened a new pull request, #15816: URL: https://github.com/apache/datafusion/pull/15816 ## Which issue does this PR close? - Closes #15815. ## Rationale for this change Queries of `Float16` columns using integer literals can lead to a loss of precision because `numerical_coercion` does not test for `Float16`. ## What changes are included in this PR? Add check for `Float16` to `expr-common::type_coercion::numerical_coercion`. Also adds decimal conversion from `Float16`. ## Are these changes tested? Yes ## Are there any user-facing changes? No -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Add coercion rules for Float16 types [datafusion]
etseidl commented on code in PR #15816: URL: https://github.com/apache/datafusion/pull/15816#discussion_r2054873962 ## datafusion/expr-common/src/type_coercion/binary.rs: ## @@ -931,6 +931,7 @@ fn coerce_numeric_type_to_decimal(numeric_type: &DataType) -> Option { Int32 | UInt32 => Some(Decimal128(10, 0)), Int64 | UInt64 => Some(Decimal128(20, 0)), // TODO if we convert the floating-point data to the decimal type, it maybe overflow. +Float16 => Some(Decimal128(6, 3)), Review Comment: I noticed this and the `Decimal256` case while fixing `numerical_coercion`. I haven't found any bugs triggered by this so I'm happy to remove for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] perf: Experimental fix to avoid join strategy regression [datafusion-comet]
andygrove opened a new pull request, #1674: URL: https://github.com/apache/datafusion-comet/pull/1674 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] doc: Document local HDFS setup [datafusion-comet]
codecov-commenter commented on PR #1673: URL: https://github.com/apache/datafusion-comet/pull/1673#issuecomment-2822552934 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1673?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report All modified and coverable lines are covered by tests :white_check_mark: > Project coverage is 58.35%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`1481a25`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/1481a255290d9de3e23ea2ee23663177fd35a2fd?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 157 commits behind head on main. Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1673 +/- ## + Coverage 56.12% 58.35% +2.23% - Complexity 976 1050 +74 Files 119 125 +6 Lines 1174312593 +850 Branches 2251 2362 +111 + Hits 6591 7349 +758 - Misses 4012 4078 +66 - Partials 1140 1166 +26 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1673?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :rocket: New features to boost your workflow: - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Use HDFS file system based on some parameter (schema or `spark.defaultFS`) [datafusion-comet]
comphead closed issue #1360: Use HDFS file system based on some parameter (schema or `spark.defaultFS`) URL: https://github.com/apache/datafusion-comet/issues/1360 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [Experimental] Integrate Comet native reader with remote HDFS [datafusion-comet]
comphead commented on issue #1336: URL: https://github.com/apache/datafusion-comet/issues/1336#issuecomment-2822709486 Closing this, as everything completed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org