Re: [PR] fix: bugs when having and group by are all false [datafusion]
jonahgao commented on PR #11897: URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2283248937 > So this PR actually fixed two bugs, (1. global group_by + having false. 2. global group_by + having true) The following case seems to be caused by `EliminateGroupByConstant`. I think we can create a separate fix for it later. ```sh DataFusion CLI v41.0.0 > SELECT AVG(v1) FROM t1 GROUP BY false; ++ | avg(t1.v1) | ++ || ++ 1 row(s) fetched. ``` In DuckDB: ```sh D SELECT AVG(v1) FROM t1 GROUP BY false; ┌─┐ │ avg(v1) │ │ double │ ├─┤ │ 0 rows │ └─┘ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Sketch for aggregation intermediate results blocked management [datafusion]
Rachelint opened a new pull request, #11943: URL: https://github.com/apache/datafusion/pull/11943 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: bugs when having and group by are all false [datafusion]
jayzhan211 commented on PR #11897: URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2283302352 Ideally group by constant should be eliminated, but the result is different when there is no row and we can't differentiate it after `EliminateGroupByConstant`. I think this is why you bring the `is_global_group_by` information down to physical layer. I think another approach is we avoid `EliminateGroupByConstant` at all and we eliminate it when creating physical group by expression 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] TEST CI only [datafusion-comet]
viirya opened a new pull request, #812: URL: https://github.com/apache/datafusion-comet/pull/812 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]
JasonLi-cn commented on PR #11758: URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2283323779 > > > I agree, finally it should be a big change which switches the group values and related states managed by block like duckdb , and I am working on this(#11931). > > > > > > I think personally suggest sketching out what this would look like in a first PR without worrying about getting all the tests passing / compiling etc. > > If we try to port all code at once to being managed in blocks it is going to be a very large change > > I am thinking maybe we can have a incremental approach (like for example separately adding the ability to do blocked emission for https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html and > > https://github.com/apache/datafusion/blob/fd237f8705b18fa089fdfb8dd5b04655ccb4d691/datafusion/physical-plan/src/aggregates/group_values/mod.rs#L37-L55 > > > > If we can set this up so that we get the pattern and a few common implementations setup in the first PR then we can make subsequent PRs to port over the other parts of the aggregation 🤔 > > Yes, I want to do the similar things for > > > incremental approach > > Yes... I try to do something like it but still not thorough enough, and I found it is actually hard to support the exact `EmitTo::First` used by streaming aggr after I found the reason why some tests failed... And we actually should disable the blocked mode in streaming aggr. > > I am planning to switch to the special blocked emission impl now... It seems we can do two things to introduce the new emit modes? > > * Define a `support_blocked_emission` for `GroupValues` and `GroupAccumulator`. > * Add two emit enums > > ``` > pub enum EmitTo { > /// Emit all groups > All, > /// Emit only the first `n` groups and shift all existing group > /// indexes down by `n`. > /// > /// For example, if `n=10`, group_index `0, 1, ... 9` are emitted > /// and group indexes '`10, 11, 12, ...` become `0, 1, 2, ...`. > First(usize), > > AllBlocks, > > FirstBlocks(usize), > } > ``` > > What do you think this way to supprot the special blocked emission? Do we need to put `batch_size` in `AllBlocks` and `FirstBlocks` 🤔 ? And that's exactly what I was planning on doing. But I found that the changes to the code were very large. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: bugs when having and group by are all false [datafusion]
jonahgao commented on PR #11897: URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2283416590 I agree to disable `EliminateGroupByConstant` because it does not work correctly with empty input. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Parse Sqllogictest column types from physical schema [datafusion]
jonahgao commented on PR #11929: URL: https://github.com/apache/datafusion/pull/11929#issuecomment-2283419507 Thanks for the review @alamb @jayzhan211 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Parse Sqllogictest column types from physical schema [datafusion]
jonahgao merged PR #11929: URL: https://github.com/apache/datafusion/pull/11929 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] chore(deps): update substrait requirement from 0.36.0 to 0.40.0 [datafusion]
dependabot[bot] opened a new pull request, #11944: URL: https://github.com/apache/datafusion/pull/11944 Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. Release notes Sourced from https://github.com/substrait-io/substrait-rs/releases";>substrait's releases. v0.40.0 Chore (BREAKING) bump substrait from 0.53.0 to 0.54.0 Bumps https://github.com/substrait-io/substrait";>substrait from 5fe2a16 to 93c8339. Commit Statistics 1 commit contributed to the release. 6 days passed between releases. 1 commit was understood as https://www.conventionalcommits.org";>conventional. 1 unique issue was worked on: https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205 Commit Details https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205 Bump substrait from 0.53.0 to 0.54.0 (01d706c) Changelog Sourced from https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md";>substrait's changelog. 0.40.0 (2024-08-12) Chore (BREAKING) bump substrait from 0.53.0 to 0.54.0 Bumps https://github.com/substrait-io/substrait";>substrait from 5fe2a16 to 93c8339. Commit Statistics 1 commit contributed to the release. 6 days passed between releases. 1 commit was understood as https://www.conventionalcommits.org";>conventional. 1 unique issue was worked on: https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205 Commit Details https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205 Bump substrait from 0.53.0 to 0.54.0 (https://github.com/substrait-io/substrait-rs/commit/01d706cbaf9c189b299a8dd719e607dd0679fbd7";>01d706c) 0.39.0 (2024-08-05) Other use previous rust release to build smart release with lock file The smart-release lock file is pointing to an outdated version of time that fails to build on the latest stable rustc. Reverting to the previous stable release until the next smart-release release. Chore (BREAKING) bump substrait from 0.52.0 to 0.53.0 Bumps https://github.com/substrait-io/substrait";>substrait from a68c1ac to 5fe2a16. Commit Statistics 3 commits contributed to the release. ... (truncated) Commits https://github.com/substrait-io/substrait-rs/commit/5631ef59adb82e263cd27873d6a0dfdd78737b62";>5631ef5 Release substrait v0.40.0 https://github.com/substrait-io/substrait-rs/commit/01d706cbaf9c189b299a8dd719e607dd0679fbd7";>01d706c chore(deps,substrait)!: bump substrait from 0.53.0 to 0.54.0 (https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205) https://github.com/substrait-io/substrait-rs/commit/e6e87fb03e9efe4db66c24ba22bf8bfa0544726d";>e6e87fb Release substrait v0.39.0 https://github.com/substrait-io/substrait-rs/commit/0ac026f0d5cb1363c1a64612291bf5a976b91e96";>0ac026f ci: use previous rust release to build smart release with lock file (https://redirect.github.com/substrait-io/substrait-rs/issues/204";>#204) https://github.com/substrait-io/substrait-rs/commit/83c1fc416f41313c78407c46051be62211453b5f";>83c1fc4 chore(deps,substrait)!: bump substrait from 0.52.0 to 0.53.0 (https://redirect.github.com/substrait-io/substrait-rs/issues/203";>#203) https://github.com/substrait-io/substrait-rs/commit/87096e2ce2d41681bcbda1eda3abfd48e358d852";>87096e2 Release substrait v0.38.0 https://github.com/substrait-io/substrait-rs/commit/2ee4f6576f216b36989b572f555da8d9b20ba311";>2ee4f65 chore(deps,substrait)!: bump substrait from 0.51.0 to 0.52.0 (https://redirect.github.com/substrait-io/substrait-rs/issues/201";>#201) https://github.com/substrait-io/substrait-rs/commit/9d3b09f4856f1289c7477242353a62aee60ded37";>9d3b09f Release substrait v0.37.3 https://github.com/substrait-io/substrait-rs/commit/0752984cdc8ff51fbe019e5aaae1aba6fd1e1a5d";>0752984 chore: fix build script warning, fix doctest with default features (https://redirect.github.com/substrait-io/substrait-rs/issues/200";>#200) https://github.com/substrait-io/substrait-rs/commit/c67f7e090f11222304c199440eb95e78f55c1788";>c67f7e0 Release substrait v0.37.2 Additional commits viewable in https://github.com/substrait-io/substrait-rs/compare/v0.36.0...v0.40.0";>compare view Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- Dependabot commands and options You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting
Re: [PR] chore(deps): update substrait requirement from 0.36.0 to 0.39.0 [datafusion]
dependabot[bot] closed pull request #11842: chore(deps): update substrait requirement from 0.36.0 to 0.39.0 URL: https://github.com/apache/datafusion/pull/11842 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore(deps): update substrait requirement from 0.36.0 to 0.39.0 [datafusion]
dependabot[bot] commented on PR #11842: URL: https://github.com/apache/datafusion/pull/11842#issuecomment-2283421829 Superseded by #11944. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Return TableProviderFilterPushDown::Exact when Parquet Pushdown Enabled [datafusion]
crepererum commented on issue #4028: URL: https://github.com/apache/datafusion/issues/4028#issuecomment-2283437288 No need to rush. I think @alamb's comment was mostly meant as an assignment signal, so that nobody else starts to work on it and we end up wasting resources 🙂 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]
Rachelint commented on PR #11758: URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2283482260 @JasonLi-cn As I think, maybe we should impl the special block based `GroupValues` impls: - We pass the `block size` when initializing it - It manage the inner values block by block - It return all blocks with internal `block size` We can always make the `block size == batch size`, so we can totally avoid any split operators. I think the `GroupValues` impls maybe should not care about the `batch size`? And we just do the `split and merge` work in the `GroupedHashAggregateStream::poll` if the unfortunately, the `batch size != block size` (usually they will equal)? I am making a try about it in #11943 , and have done some related code changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Move `LimitPushdown` to physical-optimizer crate [datafusion]
lewiszlw opened a new pull request, #11945: URL: https://github.com/apache/datafusion/pull/11945 ## Which issue does this PR close? part of https://github.com/apache/datafusion/issues/11502. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]
JasonLi-cn commented on PR #11758: URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2283566894 > @JasonLi-cn As I think, `GroupValues` impls maybe should not care about the `batch size`? And we just do the `split and merge` work in the `GroupedHashAggregateStream::poll` , if unfortunately, the `batch size != block size` (usually they will equal)? > > Maybe we should impl the special block based `GroupValues` impls like following? > > * We pass the `block size` when initializing it > * It manage the inner values block by block > * It return all blocks with internal `block size` > We can always make the `block size == batch size`, so we can totally avoid any split operators. > > I am making a try about it in #11943 , and have done some related code changes. OK. How do we determine the value of block size? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Write a blog post about implementing StringView in DataFusion [datafusion]
alamb commented on issue #11603: URL: https://github.com/apache/datafusion/issues/11603#issuecomment-2283594648 We are done with the draft. We expect it to be published in the next few weeks (it turns out to be a two part series) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Return TableProviderFilterPushDown::Exact when Parquet Pushdown Enabled [datafusion]
alamb commented on issue #4028: URL: https://github.com/apache/datafusion/issues/4028#issuecomment-2283600292 > No need to rush. I think @alamb's comment was mostly meant as an assignment signal, so that nobody else starts to work on it and we end up wasting resources 🙂 Yes this is what I meant. Sorry about not being clear. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Update INITCAP scalar function to support Utf8View [datafusion]
alamb merged PR #11888: URL: https://github.com/apache/datafusion/pull/11888 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `INITCAP` scalar function to support `Utf8View` [datafusion]
alamb closed issue #11853: Update `INITCAP` scalar function to support `Utf8View` URL: https://github.com/apache/datafusion/issues/11853 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Update INITCAP scalar function to support Utf8View [datafusion]
alamb commented on PR #11888: URL: https://github.com/apache/datafusion/pull/11888#issuecomment-2283612658 Thanks again @xinlifoobar and @XiangpengHao -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement native support StringView for Octet Length [datafusion]
alamb merged PR #11906: URL: https://github.com/apache/datafusion/pull/11906 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement native support StringView for Octet Length [datafusion]
alamb commented on PR #11906: URL: https://github.com/apache/datafusion/pull/11906#issuecomment-2283613083 Thanks again @PsiACE -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `OCTET_LENGTH` scalar function to support `Utf8View` [datafusion]
alamb closed issue #11858: Update `OCTET_LENGTH` scalar function to support `Utf8View` URL: https://github.com/apache/datafusion/issues/11858 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Implement `ordering` serialization for `AggregateUdf` in `datafusion-proto` [datafusion]
alamb closed issue #11804: Implement `ordering` serialization for `AggregateUdf` in `datafusion-proto` URL: https://github.com/apache/datafusion/issues/11804 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Add SessionState to MockContextProvider just like SessionContextProvider [datafusion]
alamb commented on code in PR #11940: URL: https://github.com/apache/datafusion/pull/11940#discussion_r1713505076 ## datafusion/sql/tests/sql_integration.rs: ## @@ -2739,39 +2742,44 @@ fn logical_plan_with_dialect_and_options( dialect: &dyn Dialect, options: ParserOptions, ) -> Result { -let context = MockContextProvider::default() -.with_udf(unicode::character_length().as_ref().clone()) -.with_udf(string::concat().as_ref().clone()) -.with_udf(make_udf( +let state = MockSessionState::default() + .with_scalar_function(Arc::new(unicode::character_length().as_ref().clone())) Review Comment: ❤️ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: impl ordering for serialization/deserialization for AggregateUdf [datafusion]
alamb merged PR #11926: URL: https://github.com/apache/datafusion/pull/11926 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: impl ordering for serialization/deserialization for AggregateUdf [datafusion]
alamb commented on PR #11926: URL: https://github.com/apache/datafusion/pull/11926#issuecomment-2283616969 Thanks @jayzhan211 and @haohuaijin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Support tuples as types [datafusion]
alamb commented on PR #11896: URL: https://github.com/apache/datafusion/pull/11896#issuecomment-2283621173 I took the liberty of merging up from main to resolve a conflict -- I think we could merge this PR and address the comments as a follow on PR, or we can do it prior to merging this one. Please let me know what you prefer @samuelcolvin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Move `LimitPushdown` to physical-optimizer crate [datafusion]
alamb commented on code in PR #11945: URL: https://github.com/apache/datafusion/pull/11945#discussion_r1713520236 ## datafusion/common/src/utils/mod.rs: ## @@ -683,6 +683,69 @@ pub fn transpose(original: Vec>) -> Vec> { } } +/// Computes the `skip` and `fetch` parameters of a single limit that would be Review Comment: It is nice to put this into `utils` 👍 ## datafusion/physical-optimizer/src/output_requirements.rs: ## @@ -286,3 +286,5 @@ fn require_top_ordering_helper( Ok((plan, false)) } } + +// See tests in datafusion/core/tests/physical_optimizer Review Comment: ❤️ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]
alamb commented on code in PR #11942: URL: https://github.com/apache/datafusion/pull/11942#discussion_r1713532418 ## datafusion/functions/src/unicode/rpad.rs: ## @@ -77,96 +82,122 @@ impl ScalarUDFImpl for RPadFunc { fn invoke(&self, args: &[ColumnarValue]) -> Result { match args[0].data_type() { -DataType::Utf8 => make_scalar_function(rpad::, vec![])(args), +DataType::Utf8 | DataType::Utf8View => { +make_scalar_function(rpad::, vec![])(args) +} DataType::LargeUtf8 => make_scalar_function(rpad::, vec![])(args), other => exec_err!("Unsupported data type {other:?} for function rpad"), } } } -/// Extends the string to length 'length' by appending the characters fill (a space by default). If the string is already longer than length then it is truncated. -/// rpad('hi', 5, 'xy') = 'hixyx' -pub fn rpad(args: &[ArrayRef]) -> Result { -match args.len() { -2 => { -let string_array = as_generic_string_array::(&args[0])?; -let length_array = as_int64_array(&args[1])?; - -let result = string_array -.iter() -.zip(length_array.iter()) -.map(|(string, length)| match (string, length) { -(Some(string), Some(length)) => { -if length > i32::MAX as i64 { -return exec_err!( -"rpad requested length {length} too large" -); -} +macro_rules! process_rpad { +// For the two-argument case +($string_array:expr, $length_array:expr, $is_view:expr) => {{ +$string_array +.iter() +.zip($length_array.iter()) +.map(|(string, length)| match (string, length) { +(Some(string), Some(length)) => { +if length > i32::MAX as i64 { +return exec_err!("rpad requested length {} too large", length); +} -let length = if length < 0 { 0 } else { length as usize }; -if length == 0 { -Ok(Some("".to_string())) +let length = if length < 0 { 0 } else { length as usize }; +if length == 0 { +Ok(Some("".to_string())) +} else { +let graphemes = string.graphemes(true).collect::>(); +if length < graphemes.len() { +Ok(Some(graphemes[..length].concat())) } else { -let graphemes = string.graphemes(true).collect::>(); -if length < graphemes.len() { -Ok(Some(graphemes[..length].concat())) +let mut s = string.to_string(); +if !$is_view { +s.push_str(&" ".repeat(length - graphemes.len())); } else { -let mut s = string.to_string(); s.push_str(" ".repeat(length - graphemes.len()).as_str()); -Ok(Some(s)) } +Ok(Some(s)) } } -_ => Ok(None), -}) -.collect::>>()?; +} +_ => Ok(None), +}) +.collect::>>() +}}; + +// For the three-argument case +($string_array:expr, $length_array:expr, $fill_array:expr, $is_view:expr) => {{ +$string_array +.iter() +.zip($length_array.iter()) +.zip($fill_array.iter()) +.map(|((string, length), fill)| match (string, length, fill) { +(Some(string), Some(length), Some(fill)) => { +if length > i32::MAX as i64 { +return exec_err!("rpad requested length {} too large", length); +} + +let length = if length < 0 { 0 } else { length as usize }; +let graphemes = string.graphemes(true).collect::>(); +let fill_chars = fill.chars().collect::>(); + +if length < graphemes.len() { +Ok(Some(graphemes[..length].concat())) +} else if fill_chars.is_empty() { +Ok(Some(string.to_string())) +} else { +let mut s = string.to_string(); +let char_vector: Vec = (0..length - graphemes.len()) +.map(|l| fill_chars[l % fill_chars.len()]) +.collect(); +s.push_str(&char_ve
Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]
Rachelint commented on PR #11758: URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2283653639 > > @JasonLi-cn As I think, `GroupValues` impls maybe should not care about the `batch size`? And we just do the `split and merge` work in the `GroupedHashAggregateStream::poll` , if unfortunately, the `batch size != block size` (usually they will equal)? > > Maybe we should impl the special block based `GroupValues` impls like following? > > > > * We pass the `block size` when initializing it > > * It manage the inner values block by block > > * It return all blocks with internal `block size` > > We can always make the `block size == batch size`, so we can totally avoid any split operators. > > > > I am making a try about it in #11943 , and have done some related code changes. > > OK. How do we determine the value of block size? > > @JasonLi-cn As I think, `GroupValues` impls maybe should not care about the `batch size`? And we just do the `split and merge` work in the `GroupedHashAggregateStream::poll` , if unfortunately, the `batch size != block size` (usually they will equal)? > > Maybe we should impl the special block based `GroupValues` impls like following? > > > > * We pass the `block size` when initializing it > > * It manage the inner values block by block > > * It return all blocks with internal `block size` > > We can always make the `block size == batch size`, so we can totally avoid any split operators. > > > > I am making a try about it in #11943 , and have done some related code changes. > > OK. How do we determine the value of block size? I think maybe we make it equal to `batch_size` in most cases, and so that we can avoid any split operations during producing output? And for the cornercase, for example, the `batch_size` is too small, we can let it fallback to single block mode? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] DataFusion weekly project plan (Andrew Lamb) - Aug 5, 2024 [datafusion]
alamb commented on issue #11826: URL: https://github.com/apache/datafusion/issues/11826#issuecomment-2283655669 PRs in need of review: DataFusion - [ ] https://github.com/apache/datafusion/pull/11456 - [ ] https://github.com/apache/datafusion/pull/11938 - [ ] https://github.com/apache/datafusion/pull/11933 Arrow - [ ] https://github.com/apache/arrow-rs/pull/6201 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Configurable null_equals_null flag [datafusion]
berkaysynnada commented on issue #11883: URL: https://github.com/apache/datafusion/issues/11883#issuecomment-2283657039 That makes sense. If such usages exist, there's no need to expose a new flag. It would be great if we could handle these two versions and link them to our internal join flags. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] test: re-enable window function over parquet with forced collisions [datafusion]
alamb commented on PR #11939: URL: https://github.com/apache/datafusion/pull/11939#issuecomment-2283661518 I re-ran the relevant test: ```shell cargo test --test sqllogictests --features=force_hash_collisions -- parquet ``` And indeed it passes locally for me too ``` andrewlamb@Andrews-MBP-2:~/Software/datafusion$ cargo test --test sqllogictests --features=force_hash_collisions -- parquet warning: function `hash_map_array` is never used --> datafusion/common/src/hash_utils.rs:249:4 | 249 | fn hash_map_array( |^^ | = note: `#[warn(dead_code)]` on by default warning: `datafusion-common` (lib) generated 1 warning Finished `test` profile [unoptimized + debuginfo] target(s) in 0.40s Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-a3320b7543efea67) Running "parquet.slt" Running "parquet_sorted_statistics.slt" andrewlamb@Andrews-MBP-2:~/Software/datafusion$ ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `ENDS_WITH` scalar function to support `Utf8View` [datafusion]
alamb closed issue #11852: Update `ENDS_WITH` scalar function to support `Utf8View` URL: https://github.com/apache/datafusion/issues/11852 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement native support StringView for Ends With [datafusion]
alamb merged PR #11924: URL: https://github.com/apache/datafusion/pull/11924 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement native support StringView for Ends With [datafusion]
alamb commented on code in PR #11924: URL: https://github.com/apache/datafusion/pull/11924#discussion_r1713543104 ## datafusion/functions/src/string/ends_with.rs: ## @@ -43,14 +41,15 @@ impl Default for EndsWithFunc { impl EndsWithFunc { pub fn new() -> Self { -use DataType::*; Self { signature: Signature::one_of( vec![ -Exact(vec![Utf8, Utf8]), -Exact(vec![Utf8, LargeUtf8]), -Exact(vec![LargeUtf8, Utf8]), -Exact(vec![LargeUtf8, LargeUtf8]), +// Planner attempts coercion to the target type starting with the most preferred candidate. Review Comment: ❤️ for the comments -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement native support StringView for Levenshtein [datafusion]
alamb merged PR #11925: URL: https://github.com/apache/datafusion/pull/11925 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `levenshtein` scalar function to support `Utf8View` [datafusion]
alamb closed issue #11854: Update `levenshtein` scalar function to support `Utf8View` URL: https://github.com/apache/datafusion/issues/11854 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `OVERLAY` scalar function to support Utf8View [datafusion]
PsiACE commented on issue #11909: URL: https://github.com/apache/datafusion/issues/11909#issuecomment-2283746445 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `REGEXP_LIKE` scalar function to support Utf8View [datafusion]
PsiACE commented on issue #11910: URL: https://github.com/apache/datafusion/issues/11910#issuecomment-2283749092 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Utf8View for lpad scalar function [datafusion]
alamb commented on code in PR #11941: URL: https://github.com/apache/datafusion/pull/11941#discussion_r1713604384 ## datafusion/functions/src/unicode/lpad.rs: ## @@ -76,300 +87,450 @@ impl ScalarUDFImpl for LPadFunc { } fn invoke(&self, args: &[ColumnarValue]) -> Result { -match args[0].data_type() { -DataType::Utf8 => make_scalar_function(lpad::, vec![])(args), -DataType::LargeUtf8 => make_scalar_function(lpad::, vec![])(args), -other => exec_err!("Unsupported data type {other:?} for function lpad"), -} +make_scalar_function(lpad, vec![])(args) } } -/// Extends the string to length 'length' by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right). +/// Extends the string to length 'length' by prepending the characters fill (a space by default). +/// If the string is already longer than length then it is truncated (on the right). /// lpad('hi', 5, 'xy') = 'xyxhi' -pub fn lpad(args: &[ArrayRef]) -> Result { -match args.len() { -2 => { -let string_array = as_generic_string_array::(&args[0])?; -let length_array = as_int64_array(&args[1])?; - -let result = string_array -.iter() -.zip(length_array.iter()) -.map(|(string, length)| match (string, length) { -(Some(string), Some(length)) => { -if length > i32::MAX as i64 { -return exec_err!( -"lpad requested length {length} too large" -); -} +pub fn lpad(args: &[ArrayRef]) -> Result { +if args.len() <= 1 || args.len() > 3 { +return exec_err!( +"lpad was called with {} arguments. It requires at least 2 and at most 3.", +args.len() +); +} + +let length_array = as_int64_array(&args[1])?; + +match args[0].data_type() { +Utf8 => match args.len() { +2 => lpad_impl::<&GenericStringArray, &GenericStringArray, i32>( +args[0].as_string::(), +length_array, +None, +), +3 => lpad_with_replace::<&GenericStringArray, i32>( +args[0].as_string::(), +length_array, +&args[2], +), +_ => unreachable!(), +}, +LargeUtf8 => match args.len() { +2 => lpad_impl::<&GenericStringArray, &GenericStringArray, i64>( +args[0].as_string::(), +length_array, +None, +), +3 => lpad_with_replace::<&GenericStringArray, i64>( +args[0].as_string::(), +length_array, +&args[2], +), +_ => unreachable!(), +}, +Utf8View => match args.len() { +2 => lpad_impl::<&StringViewArray, &GenericStringArray, i32>( +args[0].as_string_view(), +length_array, +None, +), +3 => lpad_with_replace::<&StringViewArray, i32>( +args[0].as_string_view(), +length_array, +&args[2], +), +_ => unreachable!(), +}, +other => { +exec_err!("Unsupported data type {other:?} for function lpad") +} +} +} -let length = if length < 0 { 0 } else { length as usize }; -if length == 0 { -Ok(Some("".to_string())) +fn lpad_with_replace<'a, V, T: OffsetSizeTrait>( +string_array: V, +length_array: &Int64Array, +fill_array: &'a ArrayRef, +) -> Result +where +V: StringArrayType<'a>, +{ +match fill_array.data_type() { +Utf8 => lpad_impl::, T>( +string_array, +length_array, +Some(fill_array.as_string::()), +), +LargeUtf8 => lpad_impl::, T>( +string_array, +length_array, +Some(fill_array.as_string::()), +), +Utf8View => lpad_impl::( +string_array, +length_array, +Some(fill_array.as_string_view()), +), +other => { +exec_err!("Unsupported data type {other:?} for function lpad") +} +} +} + +fn lpad_impl<'a, V, V2, T>( +string_array: V, +length_array: &Int64Array, +fill_array: Option, +) -> Result +where +V: StringArrayType<'a>, +V2: StringArrayType<'a>, +T: OffsetSizeTrait, +{ +if fill_array.is_none() { +let result = string_array +.iter() +.zip(length_array.iter()) +.map(|(string, length)| match (string, length) { +(Some(string), Some(length)) => { +if
Re: [PR] Implement native stringview support for BTRIM [datafusion]
alamb commented on code in PR #11920: URL: https://github.com/apache/datafusion/pull/11920#discussion_r1713618503 ## datafusion/functions/src/string/common.rs: ## @@ -68,6 +69,74 @@ pub(crate) fn general_trim( }, }; +if use_string_view { +string_view_trim::(trim_type, func, args) +} else { +string_trim::(trim_type, func, args) +} +} + +// removing 'a will cause compiler complaining lifetime of `func` +fn string_view_trim<'a, T: OffsetSizeTrait>( Review Comment: This implementation is an improvement -- however, it will now copy the string values to a new `StringArray` I think if the function generated StringView output directly it would be possible to avoid all string copies (and only adjust the views). I can file this as a follow on optimization idea. However, we can -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [Epic] Native `StringView` support for string functions [datafusion]
alamb commented on issue #11790: URL: https://github.com/apache/datafusion/issues/11790#issuecomment-2283777166 One thing I have noticed during implementations is that some functions such as `ltrim`/`rtrim`/`btrim` could be more efficient if they produced Utf8View as *output* in addition to accepting them as *input* For example, in https://github.com/apache/datafusion/pull/11920#discussion_r1713618503 from @Kev1n8 it is actually probably a good idea to *always* generate StringView as output (rather than StringArray) as it could avoid a copy. I am thinking once we get the string functions so they can support StringView as input then we can do a second pass and optimize some functions so they produce StringView as output -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `LTRIM` scalar function to support `Utf8View` [datafusion]
alamb commented on issue #11856: URL: https://github.com/apache/datafusion/issues/11856#issuecomment-2283779062 After https://github.com/apache/datafusion/pull/11920 from @Kev1n8 I think this one will be quick to implement -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `RTRIM` scalar function to support Utf8View [datafusion]
alamb commented on issue #11916: URL: https://github.com/apache/datafusion/issues/11916#issuecomment-2283779205 After https://github.com/apache/datafusion/pull/11920 from @Kev1n8 I think this one will be quick to implement -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement native stringview support for BTRIM [datafusion]
alamb commented on code in PR #11920: URL: https://github.com/apache/datafusion/pull/11920#discussion_r1713618503 ## datafusion/functions/src/string/common.rs: ## @@ -68,6 +69,74 @@ pub(crate) fn general_trim( }, }; +if use_string_view { +string_view_trim::(trim_type, func, args) +} else { +string_trim::(trim_type, func, args) +} +} + +// removing 'a will cause compiler complaining lifetime of `func` +fn string_view_trim<'a, T: OffsetSizeTrait>( Review Comment: This implementation is an improvement -- however, it will now copy the string values to a new `StringArray` I think if the function generated StringView output directly it would be possible to avoid all string copies (and only adjust the views). I can file this as a follow on optimization idea. I noticed this on https://github.com/apache/datafusion/issues/11790 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement native stringview support for BTRIM [datafusion]
alamb merged PR #11920: URL: https://github.com/apache/datafusion/pull/11920 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update the `BTRIM` scalar function to support `Utf8View` [datafusion]
alamb closed issue #11835: Update the `BTRIM` scalar function to support `Utf8View` URL: https://github.com/apache/datafusion/issues/11835 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Add additional regexp functions [datafusion]
timsaucer opened a new issue, #11946: URL: https://github.com/apache/datafusion/issues/11946 ### Is your feature request related to a problem or challenge? I would like to see the following regexp functions implemented. These exist in some, but not all, versions of PostgreSQL. - [ ] [regexp_count()](https://pgpedia.info/r/regexp_count.html) - [ ] [regexp_instr()](https://pgpedia.info/r/regexp_instr.html) - [ ] [regexp_matches()](https://pgpedia.info/r/regexp_matches.html) - [ ] [regexp_split_to_array()](https://pgpedia.info/r/regexp_split_to_array.html) - [ ] [regexp_split_to_table()](https://pgpedia.info/r/regexp_split_to_table.html) - [ ] [regexp_substr()](https://pgpedia.info/r/regexp_substr.html) ### Describe the solution you'd like Implement these functions. ### Describe alternatives you've considered These operations *can* be performed using the existing functions, so I am currently unblocked for my immediate use case but having these functions built in would be convenient. ### Additional context We currently have the following regexp functions implemented. The source is in `datafusion/functions/src/regex/mod.rs` [regexp_like()](https://pgpedia.info/r/regexp_like.html) [regexp_match()](https://pgpedia.info/r/regexp_match.html) [regexp_replace()](https://pgpedia.info/r/regexp_replace.html) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Make `Precision` copy to make it clear clones are not expensive [datafusion]
crepererum commented on code in PR #11828: URL: https://github.com/apache/datafusion/pull/11828#discussion_r1713654802 ## datafusion/common/src/stats.rs: ## @@ -25,7 +25,7 @@ use arrow_schema::Schema; /// Represents a value with a degree of certainty. `Precision` is used to /// propagate information the precision of statistical values. -#[derive(Clone, PartialEq, Eq, Default)] +#[derive(Clone, PartialEq, Eq, Default, Copy)] Review Comment: BTW: You can always manually implement the `Copy` trait if you want to have fine grained control about the bounds: ```rust impl Copy for Precision where: TDebug + Clone + PartialEq + Eq + PartialOrd + Clone {} ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Make `Precision` copy to make it clear clones are not expensive [datafusion]
crepererum commented on code in PR #11828: URL: https://github.com/apache/datafusion/pull/11828#discussion_r1713654802 ## datafusion/common/src/stats.rs: ## @@ -25,7 +25,7 @@ use arrow_schema::Schema; /// Represents a value with a degree of certainty. `Precision` is used to /// propagate information the precision of statistical values. -#[derive(Clone, PartialEq, Eq, Default)] +#[derive(Clone, PartialEq, Eq, Default, Copy)] Review Comment: BTW: You can always manually implement the `Copy` trait if you want to have fine grained control about the bounds: ```rust impl Copy for Precision where T: Debug + Clone + PartialEq + Eq + PartialOrd + Clone {} ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: Sort on single struct should fallback to Spark [datafusion-comet]
andygrove commented on code in PR #811: URL: https://github.com/apache/datafusion-comet/pull/811#discussion_r1713770493 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -2501,6 +2501,13 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde with CometExprShim case SortExec(sortOrder, _, child, _) if isCometOperatorEnabled(op.conf, CometConf.OPERATOR_SORT) => +// TODO: Remove this constraint when we upgrade to new arrow-rs including +// https://github.com/apache/arrow-rs/pull/6225 +if (child.output.length == 1 && child.output.head.dataType.isInstanceOf[StructType]) { Review Comment: As we add support for other types, do we need to update this to make it recursive so that we check for Map or Array containing struct? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: Implement to_json for subset of types [datafusion-comet]
andygrove commented on PR #805: URL: https://github.com/apache/datafusion-comet/pull/805#issuecomment-2283984392 @parthchandra could you review? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]
Omega359 commented on code in PR #11942: URL: https://github.com/apache/datafusion/pull/11942#discussion_r1713779456 ## datafusion/functions/src/unicode/rpad.rs: ## @@ -77,96 +82,122 @@ impl ScalarUDFImpl for RPadFunc { fn invoke(&self, args: &[ColumnarValue]) -> Result { match args[0].data_type() { -DataType::Utf8 => make_scalar_function(rpad::, vec![])(args), +DataType::Utf8 | DataType::Utf8View => { +make_scalar_function(rpad::, vec![])(args) +} DataType::LargeUtf8 => make_scalar_function(rpad::, vec![])(args), other => exec_err!("Unsupported data type {other:?} for function rpad"), } } } -/// Extends the string to length 'length' by appending the characters fill (a space by default). If the string is already longer than length then it is truncated. -/// rpad('hi', 5, 'xy') = 'hixyx' -pub fn rpad(args: &[ArrayRef]) -> Result { -match args.len() { -2 => { -let string_array = as_generic_string_array::(&args[0])?; -let length_array = as_int64_array(&args[1])?; - -let result = string_array -.iter() -.zip(length_array.iter()) -.map(|(string, length)| match (string, length) { -(Some(string), Some(length)) => { -if length > i32::MAX as i64 { -return exec_err!( -"rpad requested length {length} too large" -); -} +macro_rules! process_rpad { +// For the two-argument case +($string_array:expr, $length_array:expr, $is_view:expr) => {{ +$string_array +.iter() +.zip($length_array.iter()) +.map(|(string, length)| match (string, length) { +(Some(string), Some(length)) => { +if length > i32::MAX as i64 { +return exec_err!("rpad requested length {} too large", length); +} -let length = if length < 0 { 0 } else { length as usize }; -if length == 0 { -Ok(Some("".to_string())) +let length = if length < 0 { 0 } else { length as usize }; +if length == 0 { +Ok(Some("".to_string())) +} else { +let graphemes = string.graphemes(true).collect::>(); +if length < graphemes.len() { +Ok(Some(graphemes[..length].concat())) } else { -let graphemes = string.graphemes(true).collect::>(); -if length < graphemes.len() { -Ok(Some(graphemes[..length].concat())) +let mut s = string.to_string(); +if !$is_view { +s.push_str(&" ".repeat(length - graphemes.len())); } else { -let mut s = string.to_string(); s.push_str(" ".repeat(length - graphemes.len()).as_str()); -Ok(Some(s)) } +Ok(Some(s)) } } -_ => Ok(None), -}) -.collect::>>()?; +} +_ => Ok(None), +}) +.collect::>>() +}}; + +// For the three-argument case +($string_array:expr, $length_array:expr, $fill_array:expr, $is_view:expr) => {{ +$string_array +.iter() +.zip($length_array.iter()) +.zip($fill_array.iter()) +.map(|((string, length), fill)| match (string, length, fill) { +(Some(string), Some(length), Some(fill)) => { +if length > i32::MAX as i64 { +return exec_err!("rpad requested length {} too large", length); +} + +let length = if length < 0 { 0 } else { length as usize }; +let graphemes = string.graphemes(true).collect::>(); +let fill_chars = fill.chars().collect::>(); + +if length < graphemes.len() { +Ok(Some(graphemes[..length].concat())) +} else if fill_chars.is_empty() { +Ok(Some(string.to_string())) +} else { +let mut s = string.to_string(); +let char_vector: Vec = (0..length - graphemes.len()) +.map(|l| fill_chars[l % fill_chars.len()]) +.collect(); +s.push_str(&char
[PR] minor: Update release documentation based on 41.0.0 release [datafusion]
andygrove opened a new pull request, #11947: URL: https://github.com/apache/datafusion/pull/11947 ## Which issue does this PR close? Part of https://github.com/apache/datafusion/issues/11476 ## Rationale for this change Update documentation to add new crates and remove the out-of-date dependency diagram ## What changes are included in this PR? ## Are these changes tested? ## Are there any user-facing changes? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Add native stringview support for LTRIM & RTRIM [datafusion]
Kev1n8 opened a new pull request, #11948: URL: https://github.com/apache/datafusion/pull/11948 ## Which issue does this PR close? Closes #11856 and #11916 ## Rationale for this change ## What changes are included in this PR? Added `ltrim` and `rtrim` function signatures to support stringview. ## Are these changes tested? yes ## Are there any user-facing changes? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] minor: Update release documentation based on 41.0.0 release [datafusion]
andygrove commented on code in PR #11947: URL: https://github.com/apache/datafusion/pull/11947#discussion_r1713783391 ## datafusion/catalog/Cargo.toml: ## @@ -17,6 +17,7 @@ [package] name = "datafusion-catalog" +description = "datafusion-catalog" Review Comment: I had to add a description so that I could publish the crate. It would obviously be better to have a more meaningful description here, but I am not familiar with this new crate yet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]
Omega359 commented on code in PR #11942: URL: https://github.com/apache/datafusion/pull/11942#discussion_r1713789099 ## datafusion/functions/src/unicode/rpad.rs: ## @@ -46,8 +48,11 @@ impl RPadFunc { signature: Signature::one_of( vec![ Exact(vec![Utf8, Int64]), Review Comment: As noted in https://github.com/apache/datafusion/pull/11924/files:: // Planner attempts coercion to the target type starting with the most preferred candidate. // For example, given input `(Utf8View, Utf8)`, it first tries coercing to `(Utf8View, Utf8View)`. // If that fails, it proceeds to `(Utf8, Utf8)`. Thus, the ordering of the signature should likely be changed to reflect that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Utf8View for lpad scalar function [datafusion]
Omega359 commented on PR #11941: URL: https://github.com/apache/datafusion/pull/11941#issuecomment-2284011527 I was looking at @Lordworms implementation in https://github.com/apache/datafusion/pull/11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Move `LimitPushdown` to physical-optimizer crate [datafusion]
jayzhan211 merged PR #11945: URL: https://github.com/apache/datafusion/pull/11945 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Move `LimitPushdown` to physical-optimizer crate [datafusion]
jayzhan211 commented on PR #11945: URL: https://github.com/apache/datafusion/pull/11945#issuecomment-2284039319 Thanks @lewiszlw @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Enable shuffle in micro benchmarks [datafusion-comet]
andygrove merged PR #806: URL: https://github.com/apache/datafusion-comet/pull/806 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `RIGHT` scalar function to support Utf8View [datafusion]
Kev1n8 commented on issue #11917: URL: https://github.com/apache/datafusion/issues/11917#issuecomment-2284063257 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Improve comments in row_hash.rs for skipping aggregation [datafusion]
jayzhan211 commented on PR #11820: URL: https://github.com/apache/datafusion/pull/11820#issuecomment-2284066506 Thanks all -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Minor: Improve comments in row_hash.rs for skipping aggregation [datafusion]
jayzhan211 merged PR #11820: URL: https://github.com/apache/datafusion/pull/11820 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `REGEXP_MATCH` scalar function to support Utf8View [datafusion]
PsiACE commented on issue #11911: URL: https://github.com/apache/datafusion/issues/11911#issuecomment-2284066645 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `regexp_replace` scalar function to support Utf8View [datafusion]
PsiACE commented on issue #11912: URL: https://github.com/apache/datafusion/issues/11912#issuecomment-2284067440 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Manage group values and states by blocks in aggregation [datafusion]
jayzhan211 commented on issue #11931: URL: https://github.com/apache/datafusion/issues/11931#issuecomment-2284071358 > /// For example, `n= 10`, `block size=4`, `n` will be aligned to 12, /// and finally 3 blocks will be returned. FirstBlocks(usize) I think emitting with "n" blocks is much more straightforward. n = 3, block size = 4. emit 3 * 4 = 12 elements -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]
Omega359 commented on PR #11942: URL: https://github.com/apache/datafusion/pull/11942#issuecomment-2284082623 You may want to look at the tests I added in https://github.com/apache/datafusion/pull/11941 as an example to verify that all the signature variants are tested. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] [Epic] Native `StringView` support for string functions [datafusion]
2010YOUY01 commented on issue #11790: URL: https://github.com/apache/datafusion/issues/11790#issuecomment-2284100621 Inspired by @Omega359 's great PR https://github.com/apache/datafusion/pull/11941, I have some suggestion on testing `Utf8View` support for functions: Although most implementation is adapted from existing implementation, but the execution takes another path, so I think comprehensive end-to-end tests are still needed. The good news is there already exists `sqllogictest`s for original string functions, the only thing to do is to duplicate existing testings with `Utf8View` Here are the examples on how to adapt existing test cases for `Utf8View` input 1. For functions takes scalar value, use `arrow_cast()` like https://github.com/apache/datafusion/pull/11941/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196 2. For functions read from a table, string column can be converted to `Utf8View` column like https://github.com/apache/datafusion/blob/2cf09566af7d7d5f83a8bdff5f0adda97d40deee/datafusion/sqllogictest/test_files/string_view.slt#L30-L42 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Utf8View for lpad scalar function [datafusion]
alamb commented on PR #11941: URL: https://github.com/apache/datafusion/pull/11941#issuecomment-2284118946 > I was looking at @Lordworms implementation in #11942 and I think it would make sense to align the two implementations. In some aspects I like his approach much more than the one I took in this PR. Do you think we should merge this PR and work on improvements in a follow on PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Make `Precision` copy to make it clear clones are not expensive [datafusion]
alamb commented on code in PR #11828: URL: https://github.com/apache/datafusion/pull/11828#discussion_r1713874112 ## datafusion/common/src/stats.rs: ## @@ -25,7 +25,7 @@ use arrow_schema::Schema; /// Represents a value with a degree of certainty. `Precision` is used to /// propagate information the precision of statistical values. -#[derive(Clone, PartialEq, Eq, Default)] +#[derive(Clone, PartialEq, Eq, Default, Copy)] Review Comment: 👍 -- I think that is basically what `#[derive(Debug)]` does in this case, which is kind of cook ## datafusion/common/src/stats.rs: ## @@ -25,7 +25,7 @@ use arrow_schema::Schema; /// Represents a value with a degree of certainty. `Precision` is used to /// propagate information the precision of statistical values. -#[derive(Clone, PartialEq, Eq, Default)] +#[derive(Clone, PartialEq, Eq, Default, Copy)] Review Comment: 👍 -- I think that is basically what `#[derive(Debug)]` does in this case, which is kind of cool -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Add SessionState to MockContextProvider just like SessionContextProvider [datafusion]
jayzhan211 merged PR #11940: URL: https://github.com/apache/datafusion/pull/11940 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Add SessionState to MockContextProvider just like SessionContextProvider [datafusion]
jayzhan211 commented on PR #11940: URL: https://github.com/apache/datafusion/pull/11940#issuecomment-2284135678 Thanks @dharanad @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] chore: Add `SessionState` to `MockContextProvider` just like `SessionContextProvider` [datafusion]
jayzhan211 closed issue #11531: chore: Add `SessionState` to `MockContextProvider` just like `SessionContextProvider` URL: https://github.com/apache/datafusion/issues/11531 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Improve performance of inner hash join [datafusion-comet]
andygrove commented on issue #808: URL: https://github.com/apache/datafusion-comet/issues/808#issuecomment-2284140087 The `FilterExec` in the above example is even more expensive than the `HashJoinExec`. Evaluating the predicate is cheap but copying data to the filtered batch takes 99% of the time. We could potentially avoid this copy by using a selection vector approach instead. ``` Time to compute filter mask on batch of 32768 rows is: 581ns Time to filter batch is: 252.194µs ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Implement Utf8View for lpad scalar function [datafusion]
Omega359 commented on PR #11941: URL: https://github.com/apache/datafusion/pull/11941#issuecomment-2284141051 > Do you think we should merge this PR and work on improvements in a follow on PR? I have no objections to that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Support creating arrays with non-nullable elements in `make_array` [datafusion]
Kimahriman commented on issue #11923: URL: https://github.com/apache/datafusion/issues/11923#issuecomment-2284199250 After digging a little more it seems like there's some other issues that might make this significantly more difficult: - When `invoke` is called, there's no way to know the logical nullability of the ColumnarValue's/ArrayRef's being passed in. The only thing available is `Array::is_nullable`, but it looks like all that does it check if the current array has any nulls or not, so that could vary from batch to batch. So inside `invoke` there's no way to know if field inside ```rust Ok(Arc::new(GenericListArraytry_new( Arc::new(Field::new("item", data_type, true)), OffsetBuffer::new(offsets.into()), arrow_array::make_array(data), None, )?)) ``` should be logically nullable or not. And arrow will verify this when creating a record batch, which is the error I ran into initially with the nullability not matching: https://github.com/apache/arrow-rs/blob/5c5a94a11f01a286dd03b18af0f11c327a9accc6/arrow-array/src/record_batch.rs#L205 - ScalarUDF's in general don't have a concept of nullability, they are always considered nullable. Only tangentially related but just a larger issue of nullability and ScalarUDFs: https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/scalar_function.rs#L120 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] feat: `CreateArray` support [datafusion-comet]
Kimahriman commented on code in PR #793: URL: https://github.com/apache/datafusion-comet/pull/793#discussion_r1713950119 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -2348,6 +2348,27 @@ object QueryPlanSerde extends Logging with ShimQueryPlanSerde with CometExprShim .build() } +// datafusion's make_array only supports nullable element types +case array @ CreateArray(children, _) if array.dataType.containsNull => + val childExprs = children.map(exprToProto(_, inputs, binding)) + val dataType = serializeDataType(array.dataType) Review Comment: More digging made me realize this is a somewhat larger issue with ScalarUDFs, in that they don't support setting nullability at all (every ScalarUDF is assumed to be nullable). So how has this been handled elsewhere if at all? Is the best approach just to "pretend" the column is nullable for DataFusion, knowing that logically it should not contain any nulls and keep it as non-nullable on the Spark side? Otherwise any expression backed by a ScalarUDF can't support non-nullable expressions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Support creating arrays with non-nullable elements in `make_array` [datafusion]
jayzhan211 commented on issue #11923: URL: https://github.com/apache/datafusion/issues/11923#issuecomment-2284236770 > ```rust > Ok(Arc::new(GenericListArraytry_new( > ``` We could have another `invoke_with_schema` function to get the `nullable` from schema. Of course, another incompatible change 😢 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Parse Sqllogictest column types from physical schema [datafusion]
alamb commented on code in PR #11929: URL: https://github.com/apache/datafusion/pull/11929#discussion_r1713973317 ## datafusion/sqllogictest/src/engines/datafusion_engine/runner.rs: ## @@ -69,9 +72,12 @@ impl sqllogictest::AsyncDB for DataFusion { async fn run_query(ctx: &SessionContext, sql: impl Into) -> Result { let df = ctx.sql(sql.into().as_str()).await?; +let task_ctx = Arc::new(df.task_ctx()); +let plan = df.create_physical_plan().await?; -let types = normalize::convert_schema_to_types(df.schema().fields()); -let results: Vec = df.collect().await?; +let stream = execute_stream(plan, task_ctx)?; +let types = normalize::convert_schema_to_types(stream.schema().fields()); Review Comment: 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Support creating arrays with non-nullable elements in `make_array` [datafusion]
jayzhan211 commented on issue #11923: URL: https://github.com/apache/datafusion/issues/11923#issuecomment-2284262048 I remember there was a discussion about replace ColumnarValue::Array(ArrayRef) with Recordbatch, which contains `Schema` ```rust #[derive(Clone, Debug, PartialEq)] pub struct RecordBatch { schema: SchemaRef, columns: Vec>, /// The number of rows in this RecordBatch /// /// This is stored separately from the columns to handle the case of no columns row_count: usize, } ``` Maybe Recordbatch is what we need for `invoke` 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]
Omega359 commented on code in PR #11942: URL: https://github.com/apache/datafusion/pull/11942#discussion_r1713832682 ## datafusion/functions/src/unicode/rpad.rs: ## @@ -77,96 +82,122 @@ impl ScalarUDFImpl for RPadFunc { fn invoke(&self, args: &[ColumnarValue]) -> Result { match args[0].data_type() { -DataType::Utf8 => make_scalar_function(rpad::, vec![])(args), +DataType::Utf8 | DataType::Utf8View => { +make_scalar_function(rpad::, vec![])(args) +} DataType::LargeUtf8 => make_scalar_function(rpad::, vec![])(args), other => exec_err!("Unsupported data type {other:?} for function rpad"), } } } -/// Extends the string to length 'length' by appending the characters fill (a space by default). If the string is already longer than length then it is truncated. -/// rpad('hi', 5, 'xy') = 'hixyx' -pub fn rpad(args: &[ArrayRef]) -> Result { -match args.len() { -2 => { -let string_array = as_generic_string_array::(&args[0])?; -let length_array = as_int64_array(&args[1])?; - -let result = string_array -.iter() -.zip(length_array.iter()) -.map(|(string, length)| match (string, length) { -(Some(string), Some(length)) => { -if length > i32::MAX as i64 { -return exec_err!( -"rpad requested length {length} too large" -); -} +macro_rules! process_rpad { +// For the two-argument case +($string_array:expr, $length_array:expr, $is_view:expr) => {{ +$string_array +.iter() +.zip($length_array.iter()) +.map(|(string, length)| match (string, length) { +(Some(string), Some(length)) => { +if length > i32::MAX as i64 { +return exec_err!("rpad requested length {} too large", length); +} -let length = if length < 0 { 0 } else { length as usize }; -if length == 0 { -Ok(Some("".to_string())) +let length = if length < 0 { 0 } else { length as usize }; +if length == 0 { +Ok(Some("".to_string())) +} else { +let graphemes = string.graphemes(true).collect::>(); +if length < graphemes.len() { +Ok(Some(graphemes[..length].concat())) } else { -let graphemes = string.graphemes(true).collect::>(); -if length < graphemes.len() { -Ok(Some(graphemes[..length].concat())) +let mut s = string.to_string(); +if !$is_view { +s.push_str(&" ".repeat(length - graphemes.len())); } else { -let mut s = string.to_string(); s.push_str(" ".repeat(length - graphemes.len()).as_str()); -Ok(Some(s)) } +Ok(Some(s)) } } -_ => Ok(None), -}) -.collect::>>()?; +} +_ => Ok(None), +}) +.collect::>>() +}}; + +// For the three-argument case +($string_array:expr, $length_array:expr, $fill_array:expr, $is_view:expr) => {{ +$string_array +.iter() +.zip($length_array.iter()) +.zip($fill_array.iter()) +.map(|((string, length), fill)| match (string, length, fill) { +(Some(string), Some(length), Some(fill)) => { +if length > i32::MAX as i64 { +return exec_err!("rpad requested length {} too large", length); +} + +let length = if length < 0 { 0 } else { length as usize }; +let graphemes = string.graphemes(true).collect::>(); +let fill_chars = fill.chars().collect::>(); + +if length < graphemes.len() { +Ok(Some(graphemes[..length].concat())) +} else if fill_chars.is_empty() { +Ok(Some(string.to_string())) +} else { +let mut s = string.to_string(); +let char_vector: Vec = (0..length - graphemes.len()) +.map(|l| fill_chars[l % fill_chars.len()]) +.collect(); +s.push_str(&char
[PR] Use the tracked-consumers memory pool be the default. [datafusion]
wiedld opened a new pull request, #11949: URL: https://github.com/apache/datafusion/pull/11949 ## Which issue does this PR close? Closes #11523 ## Rationale for this change We would like the improved OOM error messages, which lists the top overall memory reservation consumers, to be the default. ## What changes are included in this PR? Make `TrackConsumersPool` be the default. Include benchmarks to assess potential performance overhead from added mutex. ## Are these changes tested? Yes, for the existing tests using the default pool. ## Are there any user-facing changes? No API changes. Change in OOM messages returned. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Update `SPLIT_PART` scalar function to support Utf8View [datafusion]
tshauck opened a new issue, #11950: URL: https://github.com/apache/datafusion/issues/11950 ### Is your feature request related to a problem or challenge? _No response_ ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Update `STRPOS` scalar function to support Utf8View [datafusion]
tshauck opened a new issue, #11951: URL: https://github.com/apache/datafusion/issues/11951 ### Is your feature request related to a problem or challenge? _No response_ ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Update `SUBSTR` scalar function to support Utf8View [datafusion]
tshauck opened a new issue, #11952: URL: https://github.com/apache/datafusion/issues/11952 ### Is your feature request related to a problem or challenge? _No response_ ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Update `TRANSLATE` scalar function to support Utf8View [datafusion]
tshauck opened a new issue, #11953: URL: https://github.com/apache/datafusion/issues/11953 ### Is your feature request related to a problem or challenge? _No response_ ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[I] Update `FIND_IN_SET` scalar function to support Utf8View [datafusion]
tshauck opened a new issue, #11954: URL: https://github.com/apache/datafusion/issues/11954 ### Is your feature request related to a problem or challenge? _No response_ ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Update `REVERSE` scalar function to support Utf8View [datafusion]
Omega359 commented on issue #11915: URL: https://github.com/apache/datafusion/issues/11915#issuecomment-2284343484 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add tests for StringView / character functions, fix `regexp_like` and `regexp_match` to work with StringView [datafusion]
tshauck commented on code in PR #11753: URL: https://github.com/apache/datafusion/pull/11753#discussion_r1714025767 ## datafusion/sqllogictest/test_files/string_view.slt: ## @@ -594,8 +518,417 @@ SELECT 228 0 NULL +## Ensure no casts for BTRIM +query TT +EXPLAIN SELECT + BTRIM(column1_utf8view, 'foo') AS l +FROM test; + +logical_plan +01)Projection: btrim(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS l +02)--TableScan: test projection=[column1_utf8view] + +## Ensure no casts for CHARACTER_LENGTH +query TT +EXPLAIN SELECT + CHARACTER_LENGTH(column1_utf8view) AS l +FROM test; + +logical_plan +01)Projection: character_length(test.column1_utf8view) AS l +02)--TableScan: test projection=[column1_utf8view] + +## Ensure no casts for CONCAT +## TODO https://github.com/apache/datafusion/issues/11836 +query TT +EXPLAIN SELECT + concat(column1_utf8view, column2_utf8view) as c +FROM test; + +logical_plan +01)Projection: concat(CAST(test.column1_utf8view AS Utf8), CAST(test.column2_utf8view AS Utf8)) AS c +02)--TableScan: test projection=[column1_utf8view, column2_utf8view] + +## Ensure no casts for CONCAT_WS +## TODO https://github.com/apache/datafusion/issues/11837 +query TT +EXPLAIN SELECT + concat_ws(', ', column1_utf8view, column2_utf8view) as c +FROM test; + +logical_plan +01)Projection: concat_ws(Utf8(", "), CAST(test.column1_utf8view AS Utf8), CAST(test.column2_utf8view AS Utf8)) AS c +02)--TableScan: test projection=[column1_utf8view, column2_utf8view] + +## Ensure no casts for CONTAINS +## TODO https://github.com/apache/datafusion/issues/11838 +query TT +EXPLAIN SELECT + CONTAINS(column1_utf8view, 'foo') as c1, + CONTAINS(column2_utf8view, column2_utf8view) as c2 +FROM test; + +logical_plan +01)Projection: contains(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS c1, contains(__common_expr_1, __common_expr_1) AS c2 +02)--Projection: CAST(test.column2_utf8view AS Utf8) AS __common_expr_1, test.column1_utf8view +03)TableScan: test projection=[column1_utf8view, column2_utf8view] + +## Ensure no casts for ENDS_WITH +## TODO https://github.com/apache/datafusion/issues/11852 +query TT +EXPLAIN SELECT + ENDS_WITH(column1_utf8view, 'foo') as c1, + ENDS_WITH(column2_utf8view, column2_utf8view) as c2 +FROM test; + +logical_plan +01)Projection: ends_with(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS c1, ends_with(__common_expr_1, __common_expr_1) AS c2 +02)--Projection: CAST(test.column2_utf8view AS Utf8) AS __common_expr_1, test.column1_utf8view +03)TableScan: test projection=[column1_utf8view, column2_utf8view] + + +## Ensure no casts for INITCAP +## TODO https://github.com/apache/datafusion/issues/11853 +query TT +EXPLAIN SELECT + INITCAP(column1_utf8view) as c +FROM test; + +logical_plan +01)Projection: initcap(CAST(test.column1_utf8view AS Utf8)) AS c +02)--TableScan: test projection=[column1_utf8view] + +## Ensure no casts for LEVENSHTEIN +## TODO https://github.com/apache/datafusion/issues/11854 +query TT +EXPLAIN SELECT + levenshtein(column1_utf8view, 'foo') as c1, + levenshtein(column1_utf8view, column2_utf8view) as c2 +FROM test; + +logical_plan +01)Projection: levenshtein(__common_expr_1, Utf8("foo")) AS c1, levenshtein(__common_expr_1, CAST(test.column2_utf8view AS Utf8)) AS c2 +02)--Projection: CAST(test.column1_utf8view AS Utf8) AS __common_expr_1, test.column2_utf8view +03)TableScan: test projection=[column1_utf8view, column2_utf8view] + +## Ensure no casts for LOWER +## TODO https://github.com/apache/datafusion/issues/11855 +query TT +EXPLAIN SELECT + LOWER(column1_utf8view) as c1 +FROM test; + +logical_plan +01)Projection: lower(CAST(test.column1_utf8view AS Utf8)) AS c1 +02)--TableScan: test projection=[column1_utf8view] + +## Ensure no casts for LTRIM +## TODO https://github.com/apache/datafusion/issues/11856 +query TT +EXPLAIN SELECT + LTRIM(column1_utf8view) as c1 +FROM test; + +logical_plan +01)Projection: ltrim(CAST(test.column1_utf8view AS Utf8)) AS c1 +02)--TableScan: test projection=[column1_utf8view] + +## Ensure no casts for LPAD +## TODO https://github.com/apache/datafusion/issues/11857 +query TT +EXPLAIN SELECT + LPAD(column1_utf8view, 12, ' ') as c1 +FROM test; + +logical_plan +01)Projection: lpad(CAST(test.column1_utf8view AS Utf8), Int64(12), Utf8(" ")) AS c1 +02)--TableScan: test projection=[column1_utf8view] + + +## Ensure no casts for OCTET_LENGTH +## TODO https://github.com/apache/datafusion/issues/11858 +query TT +EXPLAIN SELECT + OCTET_LENGTH(column1_utf8view) as c1 +FROM test; + +logical_plan +01)Projection: octet_length(CAST(test.column1_utf8view AS Utf8)) AS c1 +02)--TableScan: test projection=[column1_utf8view] + +## Ensure no casts for OVERLAY +## TODO file ticket +query TT +EXPLAIN SELECT + OVERLAY(column1_utf8view PLACING 'foo' FROM 2 ) as c1 +FROM test; + +logical_plan +01)Projection: overlay(CAST(test.column1_utf8view AS Utf8), Utf8("foo")
Re: [PR] Use the tracked-consumers memory pool be the default. [datafusion]
wiedld commented on PR #11949: URL: https://github.com/apache/datafusion/pull/11949#issuecomment-2284349835 Benchmark clickbench_1 | Query | main_base | 11523_tracked-consumers-default | Change | | -- | | - | -- | | QQuery 0 | 0.52ms |0.53ms | no change | | QQuery 1 | 91.61ms | 92.41ms | no change | | QQuery 2 | 235.65ms | 233.69ms | no change | | QQuery 3 | 195.27ms | 189.60ms | no change | | QQuery 4 |2253.70ms | 2310.43ms | no change | | QQuery 5 |2113.38ms | 2299.27ms | 1.09x slower | | QQuery 6 | 84.20ms | 82.10ms | no change | | QQuery 7 | 97.18ms | 97.45ms | no change | | QQuery 8 |3212.14ms | 3615.31ms | 1.13x slower | | QQuery 9 |2539.55ms | 2537.32ms | no change | | QQuery 10 |1997.85ms | 1998.86ms | no change | | QQuery 11 |2068.84ms | 2088.95ms | no change | | QQuery 12 |3082.54ms | 3277.47ms | 1.06x slower | | QQuery 13 |4732.25ms | 5160.32ms | 1.09x slower | | QQuery 14 |3866.84ms | 4161.76ms | 1.08x slower | | QQuery 15 |2649.19ms | 2758.73ms | no change | | QQuery 16 |6131.58ms | 6935.31ms | 1.13x slower | | QQuery 17 |6047.45ms | 6782.28ms | 1.12x slower | | QQuery 18 | 10795.80ms |12602.46ms | 1.17x slower | | QQuery 19 | 169.50ms | 163.48ms | no change | | QQuery 20 |3270.49ms | 3224.10ms | no change | | QQuery 21 |5061.89ms | 5050.48ms | no change | | QQuery 22 | 12660.40ms |12715.04ms | no change | | QQuery 23 | 27705.44ms |28134.60ms | no change | | QQuery 24 |2485.82ms | 2479.34ms | no change | | QQuery 25 |2273.66ms | 2284.05ms | no change | | QQuery 26 |2562.82ms | 2560.38ms | no change | | QQuery 27 |4308.90ms | 4316.42ms | no change | | QQuery 28 | 36444.61ms |36264.41ms | no change | | QQuery 29 |1386.33ms | 1380.30ms | no change | | QQuery 30 |3864.30ms | 3874.16ms | no change | | QQuery 31 |4622.73ms | 4637.74ms | no change | | QQuery 32 | 21909.22ms |22018.07ms | no change | | QQuery 33 | 10431.35ms |10353.16ms | no change | | QQuery 34 | 10348.38ms |10348.83ms | no change | | QQuery 35 |4157.44ms | 4150.37ms | no change | | QQuery 36 | 226.46ms | 230.03ms | no change | | QQuery 37 | 166.41ms | 165.30ms | no change | | QQuery 38 | 132.29ms | 141.74ms | 1.07x slower | | QQuery 39 | 790.22ms | 772.59ms | no change | | QQuery 40 | 60.17ms | 67.44ms | 1.12x slower | | QQuery 41 | 57.77ms | 59.21ms | no change | | QQuery 42 | 71.12ms | 69.47ms | no change | | Benchmark Summary | | | | - | | Total Time (main_base)
Re: [I] Improve performance of broadcast hash join [datafusion-comet]
andygrove commented on issue #808: URL: https://github.com/apache/datafusion-comet/issues/808#issuecomment-2284350076 The filter on the probe input is very simple (`col_0@0 IS NOT NULL`) and it should be possible to push down to the parquet scan? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] chore: Enable Comet shuffle with AQE coalesce partitions [datafusion-comet]
viirya closed pull request #651: chore: Enable Comet shuffle with AQE coalesce partitions URL: https://github.com/apache/datafusion-comet/pull/651 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] fix: bugs when having and group by are all false [datafusion]
Lordworms commented on PR #11897: URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2284447751 > EliminateGroupByConstant > Ideally group by constant should be eliminated, but the result is different when there is no row and we can't differentiate it after `EliminateGroupByConstant`. > > I think this is why you bring the `is_global_group_by` information down to physical layer. exactly > > I think another approach is we avoid `EliminateGroupByConstant` at all and we eliminate it when creating physical group by expression 🤔 I don't think we should completely avoid this rule since it has its own usage, for example here, if we disable it, the plan would be like https://github.com/user-attachments/assets/bcc4988c-9164-4a42-a6be-130bdd02111a";> and with enable it, the plan is https://github.com/user-attachments/assets/c7501fdd-a8b7-48fd-9cb8-d15c3c32d28b";> we introduced an unnecessary Projection which I don't feel suitable. Also I think there should be other test cases which may require AggregateExec to emit empty set, since right now it always returns an null with no input. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [I] Change default value of `datafusion.catalog.has_header` to `true` [datafusion]
alamb commented on issue #11936: URL: https://github.com/apache/datafusion/issues/11936#issuecomment-2284459200 I think default to "no headers" is unlikely to have been a conscious choice (I suspect it was a historical accident) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
[PR] Add native stringview support for RIGHT [datafusion]
Kev1n8 opened a new pull request, #11955: URL: https://github.com/apache/datafusion/pull/11955 ## Which issue does this PR close? Closes #11917 ## Rationale for this change ## What changes are included in this PR? Update signature of RIGHT to accept Utf8View and add some tests for that. ## Are these changes tested? yes ## Are there any user-facing changes? no -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Add native stringview support for RIGHT [datafusion]
Kev1n8 commented on PR #11955: URL: https://github.com/apache/datafusion/pull/11955#issuecomment-2284463629 Pasting a possible implementation of `StringViewArray` output here (current implementation returns `StringArray`): ``` fn string_view_right(args: &[ArrayRef]) -> Result { let string_array = as_string_view_array(&args[0])?; let n_array = as_int64_array(&args[1])?; let mut builder = StringViewBuilder::new(); let combined = string_array.iter().zip(n_array); for iter in combined { if let (Some(str), Some(n)) = iter { match n.cmp(&0) { Ordering::Less => { let start = n.unsigned_abs() as usize; let sub_string = &str[start..]; builder.append_value(sub_string); } Ordering::Equal => { builder.append_null(); } Ordering::Greater => { let start = max(str.chars().count() as i64 - n, 0) as usize; let sub_string = &str[start..]; builder.append_value(sub_string); } } } else { builder.append_null(); } } let result = builder.finish(); Ok(Arc::new(result) as ArrayRef) } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org
Re: [PR] Update labeler.yml to match crates [datafusion]
alamb commented on PR #11937: URL: https://github.com/apache/datafusion/pull/11937#issuecomment-2284501265 Thanks for the review @comphead -- let's see how this works! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org