Re: [PR] fix: bugs when having and group by are all false [datafusion]

2024-08-12 Thread via GitHub


jonahgao commented on PR #11897:
URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2283248937

   > So this PR actually fixed two bugs, (1. global group_by + having false. 2. 
global group_by + having true)
   
   The following case seems to be caused by `EliminateGroupByConstant`. I think 
we can create a separate fix for it later.
   
   ```sh
   DataFusion CLI v41.0.0
   > SELECT AVG(v1) FROM t1 GROUP BY false;
   ++
   | avg(t1.v1) |
   ++
   ||
   ++
   1 row(s) fetched.
   ```
   In DuckDB:
   ```sh
   D SELECT AVG(v1) FROM t1 GROUP BY false;
   ┌─┐
   │ avg(v1) │
   │ double  │
   ├─┤
   │ 0 rows  │
   └─┘
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Sketch for aggregation intermediate results blocked management [datafusion]

2024-08-12 Thread via GitHub


Rachelint opened a new pull request, #11943:
URL: https://github.com/apache/datafusion/pull/11943

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: bugs when having and group by are all false [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 commented on PR #11897:
URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2283302352

   Ideally group by constant should be eliminated, but the result is different 
when there is no row and we can't differentiate it after 
`EliminateGroupByConstant`.
   
   I think this is why you bring the `is_global_group_by` information down to 
physical layer.
   
   I think another approach is we avoid `EliminateGroupByConstant` at all and 
we eliminate it when creating physical group by expression 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] TEST CI only [datafusion-comet]

2024-08-12 Thread via GitHub


viirya opened a new pull request, #812:
URL: https://github.com/apache/datafusion-comet/pull/812

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]

2024-08-12 Thread via GitHub


JasonLi-cn commented on PR #11758:
URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2283323779

   > > > I agree, finally it should be a big change which switches the group 
values and related states managed by block like duckdb , and I am working on 
this(#11931).
   > > 
   > > 
   > > I think personally suggest sketching out what this would look like in a 
first PR without worrying about getting all the tests passing / compiling etc.
   > > If we try to port all code at once to being managed in blocks it is 
going to be a very large change
   > > I am thinking maybe we can have a incremental approach (like for example 
separately adding the ability to do blocked emission for 
https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html
 and
   > > 
https://github.com/apache/datafusion/blob/fd237f8705b18fa089fdfb8dd5b04655ccb4d691/datafusion/physical-plan/src/aggregates/group_values/mod.rs#L37-L55
   > > 
   > > If we can set this up so that we get the pattern and a few common 
implementations setup in the first PR then we can make subsequent PRs to port 
over the other parts of the aggregation 🤔
   > 
   > Yes, I want to do the similar things for
   > 
   > > incremental approach
   > 
   > Yes... I try to do something like it but still not thorough enough, and I 
found it is actually hard to support the exact `EmitTo::First` used by 
streaming aggr after I found the reason why some tests failed... And we 
actually should disable the blocked mode in streaming aggr.
   > 
   > I am planning to switch to the special blocked emission impl now... It 
seems we can do two things to introduce the new emit modes?
   > 
   > * Define a `support_blocked_emission` for `GroupValues` and 
`GroupAccumulator`.
   > * Add two emit enums
   > 
   > ```
   > pub enum EmitTo {
   > /// Emit all groups
   > All,
   > /// Emit only the first `n` groups and shift all existing group
   > /// indexes down by `n`.
   > ///
   > /// For example, if `n=10`, group_index `0, 1, ... 9` are emitted
   > /// and group indexes '`10, 11, 12, ...` become `0, 1, 2, ...`.
   > First(usize),
   > 
   > AllBlocks,
   > 
   > FirstBlocks(usize),
   > }
   > ```
   > 
   > What do you think this way to supprot the special blocked emission?
   
   Do we need to put `batch_size` in `AllBlocks` and `FirstBlocks` 🤔 ?
   And that's exactly what I was planning on doing. But I found that the 
changes to the code were very large.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: bugs when having and group by are all false [datafusion]

2024-08-12 Thread via GitHub


jonahgao commented on PR #11897:
URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2283416590

   I agree to disable `EliminateGroupByConstant` because it does not work 
correctly with empty input.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Parse Sqllogictest column types from physical schema [datafusion]

2024-08-12 Thread via GitHub


jonahgao commented on PR #11929:
URL: https://github.com/apache/datafusion/pull/11929#issuecomment-2283419507

   Thanks for the review @alamb @jayzhan211 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Parse Sqllogictest column types from physical schema [datafusion]

2024-08-12 Thread via GitHub


jonahgao merged PR #11929:
URL: https://github.com/apache/datafusion/pull/11929


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore(deps): update substrait requirement from 0.36.0 to 0.40.0 [datafusion]

2024-08-12 Thread via GitHub


dependabot[bot] opened a new pull request, #11944:
URL: https://github.com/apache/datafusion/pull/11944

   Updates the requirements on 
[substrait](https://github.com/substrait-io/substrait-rs) to permit the latest 
version.
   
   Release notes
   Sourced from https://github.com/substrait-io/substrait-rs/releases";>substrait's 
releases.
   
   v0.40.0
   Chore (BREAKING)
   
   bump substrait from 0.53.0 to 0.54.0
   Bumps https://github.com/substrait-io/substrait";>substrait from
   5fe2a16 to 93c8339.
   
   Commit Statistics
   
   1 commit contributed to the release.
   6 days passed between releases.
   1 commit was understood as https://www.conventionalcommits.org";>conventional.
   1 unique issue was worked on: https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205
   
   Commit Details
   
   
   
   https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205
   
   Bump substrait from 0.53.0 to 0.54.0 
(01d706c)
   
   
   
   
   
   
   
   Changelog
   Sourced from https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md";>substrait's
 changelog.
   
   0.40.0 (2024-08-12)
   Chore (BREAKING)
   
bump substrait from 0.53.0 to 
0.54.0
   Bumps https://github.com/substrait-io/substrait";>substrait from
   5fe2a16 to 93c8339.
   
   Commit Statistics
   
   
   1 commit contributed to the release.
   6 days passed between releases.
   1 commit was understood as https://www.conventionalcommits.org";>conventional.
   1 unique issue was worked on: https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205
   
   Commit Details
   
   
   
   https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205
   
   Bump substrait from 0.53.0 to 0.54.0 (https://github.com/substrait-io/substrait-rs/commit/01d706cbaf9c189b299a8dd719e607dd0679fbd7";>01d706c)
   
   
   
   
   0.39.0 (2024-08-05)
   
   Other
   
use previous rust release to build smart 
release with lock file
   The smart-release lock file is pointing to an outdated version of 
time
   that fails to build on the latest stable rustc. Reverting to the
   previous stable release until the next smart-release release.
   
   Chore (BREAKING)
   
bump substrait from 0.52.0 to 
0.53.0
   Bumps https://github.com/substrait-io/substrait";>substrait from
   a68c1ac to 5fe2a16.
   
   Commit Statistics
   
   
   3 commits contributed to the release.
   
   
   
   ... (truncated)
   
   
   Commits
   
   https://github.com/substrait-io/substrait-rs/commit/5631ef59adb82e263cd27873d6a0dfdd78737b62";>5631ef5
 Release substrait v0.40.0
   https://github.com/substrait-io/substrait-rs/commit/01d706cbaf9c189b299a8dd719e607dd0679fbd7";>01d706c
 chore(deps,substrait)!: bump substrait from 0.53.0 to 
0.54.0 (https://redirect.github.com/substrait-io/substrait-rs/issues/205";>#205)
   https://github.com/substrait-io/substrait-rs/commit/e6e87fb03e9efe4db66c24ba22bf8bfa0544726d";>e6e87fb
 Release substrait v0.39.0
   https://github.com/substrait-io/substrait-rs/commit/0ac026f0d5cb1363c1a64612291bf5a976b91e96";>0ac026f
 ci: use previous rust release to build smart release with lock file (https://redirect.github.com/substrait-io/substrait-rs/issues/204";>#204)
   https://github.com/substrait-io/substrait-rs/commit/83c1fc416f41313c78407c46051be62211453b5f";>83c1fc4
 chore(deps,substrait)!: bump substrait from 0.52.0 to 
0.53.0 (https://redirect.github.com/substrait-io/substrait-rs/issues/203";>#203)
   https://github.com/substrait-io/substrait-rs/commit/87096e2ce2d41681bcbda1eda3abfd48e358d852";>87096e2
 Release substrait v0.38.0
   https://github.com/substrait-io/substrait-rs/commit/2ee4f6576f216b36989b572f555da8d9b20ba311";>2ee4f65
 chore(deps,substrait)!: bump substrait from 0.51.0 to 
0.52.0 (https://redirect.github.com/substrait-io/substrait-rs/issues/201";>#201)
   https://github.com/substrait-io/substrait-rs/commit/9d3b09f4856f1289c7477242353a62aee60ded37";>9d3b09f
 Release substrait v0.37.3
   https://github.com/substrait-io/substrait-rs/commit/0752984cdc8ff51fbe019e5aaae1aba6fd1e1a5d";>0752984
 chore: fix build script warning, fix doctest with default features (https://redirect.github.com/substrait-io/substrait-rs/issues/200";>#200)
   https://github.com/substrait-io/substrait-rs/commit/c67f7e090f11222304c199440eb95e78f55c1788";>c67f7e0
 Release substrait v0.37.2
   Additional commits viewable in https://github.com/substrait-io/substrait-rs/compare/v0.36.0...v0.40.0";>compare
 view
   
   
   
   
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   [//]: # (dependabot-automerge-start)
   [//]: # (dependabot-automerge-end)
   
   ---
   
   
   Dependabot commands and options
   
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting 

Re: [PR] chore(deps): update substrait requirement from 0.36.0 to 0.39.0 [datafusion]

2024-08-12 Thread via GitHub


dependabot[bot] closed pull request #11842: chore(deps): update substrait 
requirement from 0.36.0 to 0.39.0
URL: https://github.com/apache/datafusion/pull/11842


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore(deps): update substrait requirement from 0.36.0 to 0.39.0 [datafusion]

2024-08-12 Thread via GitHub


dependabot[bot] commented on PR #11842:
URL: https://github.com/apache/datafusion/pull/11842#issuecomment-2283421829

   Superseded by #11944.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Return TableProviderFilterPushDown::Exact when Parquet Pushdown Enabled [datafusion]

2024-08-12 Thread via GitHub


crepererum commented on issue #4028:
URL: https://github.com/apache/datafusion/issues/4028#issuecomment-2283437288

   No need to rush. I think @alamb's comment was mostly meant as an assignment 
signal, so that nobody else starts to work on it and we end up wasting 
resources 🙂


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]

2024-08-12 Thread via GitHub


Rachelint commented on PR #11758:
URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2283482260

   @JasonLi-cn As I think, maybe we should impl the special block based 
`GroupValues` impls:
   - We pass the `block size` when initializing it
   - It manage the inner values block by block
   - It return all blocks with internal `block size`
   We can always make the `block size == batch size`, so we can totally avoid 
any split operators.
   
   I think the `GroupValues` impls maybe should not care about the `batch 
size`? And we just do the `split and merge` work in the 
`GroupedHashAggregateStream::poll` if the unfortunately, the `batch size != 
block size` (usually they will equal)?
   
   I am making a try about it in #11943 , and have done some related code 
changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Move `LimitPushdown` to physical-optimizer crate [datafusion]

2024-08-12 Thread via GitHub


lewiszlw opened a new pull request, #11945:
URL: https://github.com/apache/datafusion/pull/11945

   ## Which issue does this PR close?
   
   
   
   part of https://github.com/apache/datafusion/issues/11502.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]

2024-08-12 Thread via GitHub


JasonLi-cn commented on PR #11758:
URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2283566894

   > @JasonLi-cn As I think, `GroupValues` impls maybe should not care about 
the `batch size`? And we just do the `split and merge` work in the 
`GroupedHashAggregateStream::poll` , if unfortunately, the `batch size != block 
size` (usually they will equal)?
   > 
   > Maybe we should impl the special block based `GroupValues` impls like 
following?
   > 
   > * We pass the `block size` when initializing it
   > * It manage the inner values block by block
   > * It return all blocks with internal `block size`
   >   We can always make the `block size == batch size`, so we can totally 
avoid any split operators.
   > 
   > I am making a try about it in #11943 , and have done some related code 
changes.
   
   OK. How do we determine the value of block size? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Write a blog post about implementing StringView in DataFusion [datafusion]

2024-08-12 Thread via GitHub


alamb commented on issue #11603:
URL: https://github.com/apache/datafusion/issues/11603#issuecomment-2283594648

   We are done with the draft. We expect it to be published in the next few 
weeks (it turns out to be a two part series)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Return TableProviderFilterPushDown::Exact when Parquet Pushdown Enabled [datafusion]

2024-08-12 Thread via GitHub


alamb commented on issue #4028:
URL: https://github.com/apache/datafusion/issues/4028#issuecomment-2283600292

   > No need to rush. I think @alamb's comment was mostly meant as an 
assignment signal, so that nobody else starts to work on it and we end up 
wasting resources 🙂
   
   Yes this is what I meant. Sorry about not being clear. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update INITCAP scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


alamb merged PR #11888:
URL: https://github.com/apache/datafusion/pull/11888


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `INITCAP` scalar function to support `Utf8View` [datafusion]

2024-08-12 Thread via GitHub


alamb closed issue #11853: Update `INITCAP` scalar function to support 
`Utf8View`
URL: https://github.com/apache/datafusion/issues/11853


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update INITCAP scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


alamb commented on PR #11888:
URL: https://github.com/apache/datafusion/pull/11888#issuecomment-2283612658

   Thanks again @xinlifoobar and @XiangpengHao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement native support StringView for Octet Length [datafusion]

2024-08-12 Thread via GitHub


alamb merged PR #11906:
URL: https://github.com/apache/datafusion/pull/11906


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement native support StringView for Octet Length [datafusion]

2024-08-12 Thread via GitHub


alamb commented on PR #11906:
URL: https://github.com/apache/datafusion/pull/11906#issuecomment-2283613083

   Thanks again @PsiACE 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `OCTET_LENGTH` scalar function to support `Utf8View` [datafusion]

2024-08-12 Thread via GitHub


alamb closed issue #11858: Update `OCTET_LENGTH` scalar function to support 
`Utf8View`
URL: https://github.com/apache/datafusion/issues/11858


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Implement `ordering` serialization for `AggregateUdf` in `datafusion-proto` [datafusion]

2024-08-12 Thread via GitHub


alamb closed issue #11804: Implement `ordering` serialization for 
`AggregateUdf` in `datafusion-proto`
URL: https://github.com/apache/datafusion/issues/11804


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add SessionState to MockContextProvider just like SessionContextProvider [datafusion]

2024-08-12 Thread via GitHub


alamb commented on code in PR #11940:
URL: https://github.com/apache/datafusion/pull/11940#discussion_r1713505076


##
datafusion/sql/tests/sql_integration.rs:
##
@@ -2739,39 +2742,44 @@ fn logical_plan_with_dialect_and_options(
 dialect: &dyn Dialect,
 options: ParserOptions,
 ) -> Result {
-let context = MockContextProvider::default()
-.with_udf(unicode::character_length().as_ref().clone())
-.with_udf(string::concat().as_ref().clone())
-.with_udf(make_udf(
+let state = MockSessionState::default()
+
.with_scalar_function(Arc::new(unicode::character_length().as_ref().clone()))

Review Comment:
   ❤️ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: impl ordering for serialization/deserialization for AggregateUdf [datafusion]

2024-08-12 Thread via GitHub


alamb merged PR #11926:
URL: https://github.com/apache/datafusion/pull/11926


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: impl ordering for serialization/deserialization for AggregateUdf [datafusion]

2024-08-12 Thread via GitHub


alamb commented on PR #11926:
URL: https://github.com/apache/datafusion/pull/11926#issuecomment-2283616969

   Thanks @jayzhan211  and @haohuaijin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Support tuples as types [datafusion]

2024-08-12 Thread via GitHub


alamb commented on PR #11896:
URL: https://github.com/apache/datafusion/pull/11896#issuecomment-2283621173

   I took the liberty of merging up from main to resolve a conflict -- I think 
we could merge this PR and address the comments as a follow on PR, or we can do 
it prior to merging this one.
   
   Please let me know what you prefer @samuelcolvin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Move `LimitPushdown` to physical-optimizer crate [datafusion]

2024-08-12 Thread via GitHub


alamb commented on code in PR #11945:
URL: https://github.com/apache/datafusion/pull/11945#discussion_r1713520236


##
datafusion/common/src/utils/mod.rs:
##
@@ -683,6 +683,69 @@ pub fn transpose(original: Vec>) -> Vec> {
 }
 }
 
+/// Computes the `skip` and `fetch` parameters of a single limit that would be

Review Comment:
   It is nice to put this into `utils` 👍 



##
datafusion/physical-optimizer/src/output_requirements.rs:
##
@@ -286,3 +286,5 @@ fn require_top_ordering_helper(
 Ok((plan, false))
 }
 }
+
+// See tests in datafusion/core/tests/physical_optimizer

Review Comment:
   ❤️ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


alamb commented on code in PR #11942:
URL: https://github.com/apache/datafusion/pull/11942#discussion_r1713532418


##
datafusion/functions/src/unicode/rpad.rs:
##
@@ -77,96 +82,122 @@ impl ScalarUDFImpl for RPadFunc {
 
 fn invoke(&self, args: &[ColumnarValue]) -> Result {
 match args[0].data_type() {
-DataType::Utf8 => make_scalar_function(rpad::, vec![])(args),
+DataType::Utf8 | DataType::Utf8View => {
+make_scalar_function(rpad::, vec![])(args)
+}
 DataType::LargeUtf8 => make_scalar_function(rpad::, 
vec![])(args),
 other => exec_err!("Unsupported data type {other:?} for function 
rpad"),
 }
 }
 }
 
-/// Extends the string to length 'length' by appending the characters fill (a 
space by default). If the string is already longer than length then it is 
truncated.
-/// rpad('hi', 5, 'xy') = 'hixyx'
-pub fn rpad(args: &[ArrayRef]) -> Result {
-match args.len() {
-2 => {
-let string_array = as_generic_string_array::(&args[0])?;
-let length_array = as_int64_array(&args[1])?;
-
-let result = string_array
-.iter()
-.zip(length_array.iter())
-.map(|(string, length)| match (string, length) {
-(Some(string), Some(length)) => {
-if length > i32::MAX as i64 {
-return exec_err!(
-"rpad requested length {length} too large"
-);
-}
+macro_rules! process_rpad {
+// For the two-argument case
+($string_array:expr, $length_array:expr, $is_view:expr) => {{
+$string_array
+.iter()
+.zip($length_array.iter())
+.map(|(string, length)| match (string, length) {
+(Some(string), Some(length)) => {
+if length > i32::MAX as i64 {
+return exec_err!("rpad requested length {} too large", 
length);
+}
 
-let length = if length < 0 { 0 } else { length as 
usize };
-if length == 0 {
-Ok(Some("".to_string()))
+let length = if length < 0 { 0 } else { length as usize };
+if length == 0 {
+Ok(Some("".to_string()))
+} else {
+let graphemes = 
string.graphemes(true).collect::>();
+if length < graphemes.len() {
+Ok(Some(graphemes[..length].concat()))
 } else {
-let graphemes = 
string.graphemes(true).collect::>();
-if length < graphemes.len() {
-Ok(Some(graphemes[..length].concat()))
+let mut s = string.to_string();
+if !$is_view {
+s.push_str(&" ".repeat(length - 
graphemes.len()));
 } else {
-let mut s = string.to_string();
 s.push_str(" ".repeat(length - 
graphemes.len()).as_str());
-Ok(Some(s))
 }
+Ok(Some(s))
 }
 }
-_ => Ok(None),
-})
-.collect::>>()?;
+}
+_ => Ok(None),
+})
+.collect::>>()
+}};
+
+// For the three-argument case
+($string_array:expr, $length_array:expr, $fill_array:expr, $is_view:expr) 
=> {{
+$string_array
+.iter()
+.zip($length_array.iter())
+.zip($fill_array.iter())
+.map(|((string, length), fill)| match (string, length, fill) {
+(Some(string), Some(length), Some(fill)) => {
+if length > i32::MAX as i64 {
+return exec_err!("rpad requested length {} too large", 
length);
+}
+
+let length = if length < 0 { 0 } else { length as usize };
+let graphemes = 
string.graphemes(true).collect::>();
+let fill_chars = fill.chars().collect::>();
+
+if length < graphemes.len() {
+Ok(Some(graphemes[..length].concat()))
+} else if fill_chars.is_empty() {
+Ok(Some(string.to_string()))
+} else {
+let mut s = string.to_string();
+let char_vector: Vec = (0..length - 
graphemes.len())
+.map(|l| fill_chars[l % fill_chars.len()])
+.collect();
+s.push_str(&char_ve

Re: [PR] Generate GroupByHash output in multiple RecordBatches [datafusion]

2024-08-12 Thread via GitHub


Rachelint commented on PR #11758:
URL: https://github.com/apache/datafusion/pull/11758#issuecomment-2283653639

   > > @JasonLi-cn As I think, `GroupValues` impls maybe should not care about 
the `batch size`? And we just do the `split and merge` work in the 
`GroupedHashAggregateStream::poll` , if unfortunately, the `batch size != block 
size` (usually they will equal)?
   > > Maybe we should impl the special block based `GroupValues` impls like 
following?
   > > 
   > > * We pass the `block size` when initializing it
   > > * It manage the inner values block by block
   > > * It return all blocks with internal `block size`
   > >   We can always make the `block size == batch size`, so we can totally 
avoid any split operators.
   > > 
   > > I am making a try about it in #11943 , and have done some related code 
changes.
   > 
   > OK. How do we determine the value of block size?
   
   
   
   > > @JasonLi-cn As I think, `GroupValues` impls maybe should not care about 
the `batch size`? And we just do the `split and merge` work in the 
`GroupedHashAggregateStream::poll` , if unfortunately, the `batch size != block 
size` (usually they will equal)?
   > > Maybe we should impl the special block based `GroupValues` impls like 
following?
   > > 
   > > * We pass the `block size` when initializing it
   > > * It manage the inner values block by block
   > > * It return all blocks with internal `block size`
   > >   We can always make the `block size == batch size`, so we can totally 
avoid any split operators.
   > > 
   > > I am making a try about it in #11943 , and have done some related code 
changes.
   > 
   > OK. How do we determine the value of block size?
   
   I think maybe we make it equal to `batch_size` in most cases, and so that we 
can avoid any split operations during producing output? And for the cornercase, 
for example, the `batch_size` is too small, we can let it fallback to single 
block mode?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] DataFusion weekly project plan (Andrew Lamb) - Aug 5, 2024 [datafusion]

2024-08-12 Thread via GitHub


alamb commented on issue #11826:
URL: https://github.com/apache/datafusion/issues/11826#issuecomment-2283655669

   PRs in need of review:
   
   DataFusion
   - [ ] https://github.com/apache/datafusion/pull/11456
   - [ ] https://github.com/apache/datafusion/pull/11938
   - [ ] https://github.com/apache/datafusion/pull/11933
   
   Arrow
   - [ ] https://github.com/apache/arrow-rs/pull/6201
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Configurable null_equals_null flag [datafusion]

2024-08-12 Thread via GitHub


berkaysynnada commented on issue #11883:
URL: https://github.com/apache/datafusion/issues/11883#issuecomment-2283657039

   That makes sense. If such usages exist, there's no need to expose a new 
flag. It would be great if we could handle these two versions and link them to 
our internal join flags. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] test: re-enable window function over parquet with forced collisions [datafusion]

2024-08-12 Thread via GitHub


alamb commented on PR #11939:
URL: https://github.com/apache/datafusion/pull/11939#issuecomment-2283661518

   I re-ran the relevant test: 
   
   ```shell
   cargo test --test sqllogictests --features=force_hash_collisions -- parquet
   ```
   
   And indeed it passes locally for me too
   
   ```
   andrewlamb@Andrews-MBP-2:~/Software/datafusion$ cargo test --test 
sqllogictests --features=force_hash_collisions -- parquet
   warning: function `hash_map_array` is never used
  --> datafusion/common/src/hash_utils.rs:249:4
   |
   249 | fn hash_map_array(
   |^^
   |
   = note: `#[warn(dead_code)]` on by default
   
   warning: `datafusion-common` (lib) generated 1 warning
   Finished `test` profile [unoptimized + debuginfo] target(s) in 0.40s
Running bin/sqllogictests.rs 
(target/debug/deps/sqllogictests-a3320b7543efea67)
   Running "parquet.slt"
   Running "parquet_sorted_statistics.slt"
   andrewlamb@Andrews-MBP-2:~/Software/datafusion$
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `ENDS_WITH` scalar function to support `Utf8View` [datafusion]

2024-08-12 Thread via GitHub


alamb closed issue #11852: Update `ENDS_WITH` scalar function to support 
`Utf8View`
URL: https://github.com/apache/datafusion/issues/11852


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement native support StringView for Ends With [datafusion]

2024-08-12 Thread via GitHub


alamb merged PR #11924:
URL: https://github.com/apache/datafusion/pull/11924


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement native support StringView for Ends With [datafusion]

2024-08-12 Thread via GitHub


alamb commented on code in PR #11924:
URL: https://github.com/apache/datafusion/pull/11924#discussion_r1713543104


##
datafusion/functions/src/string/ends_with.rs:
##
@@ -43,14 +41,15 @@ impl Default for EndsWithFunc {
 
 impl EndsWithFunc {
 pub fn new() -> Self {
-use DataType::*;
 Self {
 signature: Signature::one_of(
 vec![
-Exact(vec![Utf8, Utf8]),
-Exact(vec![Utf8, LargeUtf8]),
-Exact(vec![LargeUtf8, Utf8]),
-Exact(vec![LargeUtf8, LargeUtf8]),
+// Planner attempts coercion to the target type starting 
with the most preferred candidate.

Review Comment:
   ❤️  for the comments



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement native support StringView for Levenshtein [datafusion]

2024-08-12 Thread via GitHub


alamb merged PR #11925:
URL: https://github.com/apache/datafusion/pull/11925


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `levenshtein` scalar function to support `Utf8View` [datafusion]

2024-08-12 Thread via GitHub


alamb closed issue #11854: Update `levenshtein` scalar function to support 
`Utf8View`
URL: https://github.com/apache/datafusion/issues/11854


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `OVERLAY` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


PsiACE commented on issue #11909:
URL: https://github.com/apache/datafusion/issues/11909#issuecomment-2283746445

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `REGEXP_LIKE` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


PsiACE commented on issue #11910:
URL: https://github.com/apache/datafusion/issues/11910#issuecomment-2283749092

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement Utf8View for lpad scalar function [datafusion]

2024-08-12 Thread via GitHub


alamb commented on code in PR #11941:
URL: https://github.com/apache/datafusion/pull/11941#discussion_r1713604384


##
datafusion/functions/src/unicode/lpad.rs:
##
@@ -76,300 +87,450 @@ impl ScalarUDFImpl for LPadFunc {
 }
 
 fn invoke(&self, args: &[ColumnarValue]) -> Result {
-match args[0].data_type() {
-DataType::Utf8 => make_scalar_function(lpad::, vec![])(args),
-DataType::LargeUtf8 => make_scalar_function(lpad::, 
vec![])(args),
-other => exec_err!("Unsupported data type {other:?} for function 
lpad"),
-}
+make_scalar_function(lpad, vec![])(args)
 }
 }
 
-/// Extends the string to length 'length' by prepending the characters fill (a 
space by default). If the string is already longer than length then it is 
truncated (on the right).
+/// Extends the string to length 'length' by prepending the characters fill (a 
space by default).
+/// If the string is already longer than length then it is truncated (on the 
right).
 /// lpad('hi', 5, 'xy') = 'xyxhi'
-pub fn lpad(args: &[ArrayRef]) -> Result {
-match args.len() {
-2 => {
-let string_array = as_generic_string_array::(&args[0])?;
-let length_array = as_int64_array(&args[1])?;
-
-let result = string_array
-.iter()
-.zip(length_array.iter())
-.map(|(string, length)| match (string, length) {
-(Some(string), Some(length)) => {
-if length > i32::MAX as i64 {
-return exec_err!(
-"lpad requested length {length} too large"
-);
-}
+pub fn lpad(args: &[ArrayRef]) -> Result {
+if args.len() <= 1 || args.len() > 3 {
+return exec_err!(
+"lpad was called with {} arguments. It requires at least 2 and at 
most 3.",
+args.len()
+);
+}
+
+let length_array = as_int64_array(&args[1])?;
+
+match args[0].data_type() {
+Utf8 => match args.len() {
+2 => lpad_impl::<&GenericStringArray, 
&GenericStringArray, i32>(
+args[0].as_string::(),
+length_array,
+None,
+),
+3 => lpad_with_replace::<&GenericStringArray, i32>(
+args[0].as_string::(),
+length_array,
+&args[2],
+),
+_ => unreachable!(),
+},
+LargeUtf8 => match args.len() {
+2 => lpad_impl::<&GenericStringArray, 
&GenericStringArray, i64>(
+args[0].as_string::(),
+length_array,
+None,
+),
+3 => lpad_with_replace::<&GenericStringArray, i64>(
+args[0].as_string::(),
+length_array,
+&args[2],
+),
+_ => unreachable!(),
+},
+Utf8View => match args.len() {
+2 => lpad_impl::<&StringViewArray, &GenericStringArray, i32>(
+args[0].as_string_view(),
+length_array,
+None,
+),
+3 => lpad_with_replace::<&StringViewArray, i32>(
+args[0].as_string_view(),
+length_array,
+&args[2],
+),
+_ => unreachable!(),
+},
+other => {
+exec_err!("Unsupported data type {other:?} for function lpad")
+}
+}
+}
 
-let length = if length < 0 { 0 } else { length as 
usize };
-if length == 0 {
-Ok(Some("".to_string()))
+fn lpad_with_replace<'a, V, T: OffsetSizeTrait>(
+string_array: V,
+length_array: &Int64Array,
+fill_array: &'a ArrayRef,
+) -> Result
+where
+V: StringArrayType<'a>,
+{
+match fill_array.data_type() {
+Utf8 => lpad_impl::, T>(
+string_array,
+length_array,
+Some(fill_array.as_string::()),
+),
+LargeUtf8 => lpad_impl::, T>(
+string_array,
+length_array,
+Some(fill_array.as_string::()),
+),
+Utf8View => lpad_impl::(
+string_array,
+length_array,
+Some(fill_array.as_string_view()),
+),
+other => {
+exec_err!("Unsupported data type {other:?} for function lpad")
+}
+}
+}
+
+fn lpad_impl<'a, V, V2, T>(
+string_array: V,
+length_array: &Int64Array,
+fill_array: Option,
+) -> Result
+where
+V: StringArrayType<'a>,
+V2: StringArrayType<'a>,
+T: OffsetSizeTrait,
+{
+if fill_array.is_none() {
+let result = string_array
+.iter()
+.zip(length_array.iter())
+.map(|(string, length)| match (string, length) {
+(Some(string), Some(length)) => {
+if 

Re: [PR] Implement native stringview support for BTRIM [datafusion]

2024-08-12 Thread via GitHub


alamb commented on code in PR #11920:
URL: https://github.com/apache/datafusion/pull/11920#discussion_r1713618503


##
datafusion/functions/src/string/common.rs:
##
@@ -68,6 +69,74 @@ pub(crate) fn general_trim(
 },
 };
 
+if use_string_view {
+string_view_trim::(trim_type, func, args)
+} else {
+string_trim::(trim_type, func, args)
+}
+}
+
+// removing 'a will cause compiler complaining lifetime of `func`
+fn string_view_trim<'a, T: OffsetSizeTrait>(

Review Comment:
   This implementation is an improvement -- however, it will now copy the 
string values to a new `StringArray`
   
   I think if the function generated StringView output directly it would be 
possible to avoid all string copies (and only adjust the views).
   
   I can file this as a follow on optimization idea.
   
   However, we can 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [Epic] Native `StringView` support for string functions [datafusion]

2024-08-12 Thread via GitHub


alamb commented on issue #11790:
URL: https://github.com/apache/datafusion/issues/11790#issuecomment-2283777166

   One thing I have noticed during implementations is that some functions such 
as `ltrim`/`rtrim`/`btrim` could be more efficient if they produced Utf8View as 
*output* in addition to accepting them as *input* 
   
   For example, in 
https://github.com/apache/datafusion/pull/11920#discussion_r1713618503 from 
@Kev1n8  it is actually probably a good idea to *always* generate StringView as 
output (rather than StringArray) as it could avoid a copy.
   
   I am thinking once we get the string functions so they can support 
StringView as input then we can do a second pass and optimize some functions so 
they produce StringView as output


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `LTRIM` scalar function to support `Utf8View` [datafusion]

2024-08-12 Thread via GitHub


alamb commented on issue #11856:
URL: https://github.com/apache/datafusion/issues/11856#issuecomment-2283779062

   After https://github.com/apache/datafusion/pull/11920 from @Kev1n8  I think 
this one will be quick to implement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `RTRIM` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


alamb commented on issue #11916:
URL: https://github.com/apache/datafusion/issues/11916#issuecomment-2283779205

   After https://github.com/apache/datafusion/pull/11920 from @Kev1n8  I think 
this one will be quick to implement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement native stringview support for BTRIM [datafusion]

2024-08-12 Thread via GitHub


alamb commented on code in PR #11920:
URL: https://github.com/apache/datafusion/pull/11920#discussion_r1713618503


##
datafusion/functions/src/string/common.rs:
##
@@ -68,6 +69,74 @@ pub(crate) fn general_trim(
 },
 };
 
+if use_string_view {
+string_view_trim::(trim_type, func, args)
+} else {
+string_trim::(trim_type, func, args)
+}
+}
+
+// removing 'a will cause compiler complaining lifetime of `func`
+fn string_view_trim<'a, T: OffsetSizeTrait>(

Review Comment:
   This implementation is an improvement -- however, it will now copy the 
string values to a new `StringArray`
   
   I think if the function generated StringView output directly it would be 
possible to avoid all string copies (and only adjust the views).
   
   I can file this as a follow on optimization idea.
   
   I noticed this on https://github.com/apache/datafusion/issues/11790



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement native stringview support for BTRIM [datafusion]

2024-08-12 Thread via GitHub


alamb merged PR #11920:
URL: https://github.com/apache/datafusion/pull/11920


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update the `BTRIM` scalar function to support `Utf8View` [datafusion]

2024-08-12 Thread via GitHub


alamb closed issue #11835: Update the `BTRIM` scalar function to support 
`Utf8View`
URL: https://github.com/apache/datafusion/issues/11835


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Add additional regexp functions [datafusion]

2024-08-12 Thread via GitHub


timsaucer opened a new issue, #11946:
URL: https://github.com/apache/datafusion/issues/11946

   ### Is your feature request related to a problem or challenge?
   
   I would like to see the following regexp functions implemented. These exist 
in some, but not all, versions of PostgreSQL.
   
   - [ ] [regexp_count()](https://pgpedia.info/r/regexp_count.html)
   - [ ] [regexp_instr()](https://pgpedia.info/r/regexp_instr.html)
   - [ ] [regexp_matches()](https://pgpedia.info/r/regexp_matches.html)
   - [ ] 
[regexp_split_to_array()](https://pgpedia.info/r/regexp_split_to_array.html)
   - [ ] 
[regexp_split_to_table()](https://pgpedia.info/r/regexp_split_to_table.html)
   - [ ] [regexp_substr()](https://pgpedia.info/r/regexp_substr.html)
   
   
   ### Describe the solution you'd like
   
   Implement these functions.
   
   ### Describe alternatives you've considered
   
   These operations *can* be performed using the existing functions, so I am 
currently unblocked for my immediate use case but having these functions built 
in would be convenient.
   
   ### Additional context
   
   We currently have the following regexp functions implemented. The source is 
in `datafusion/functions/src/regex/mod.rs`
   
   [regexp_like()](https://pgpedia.info/r/regexp_like.html)
   [regexp_match()](https://pgpedia.info/r/regexp_match.html)
   [regexp_replace()](https://pgpedia.info/r/regexp_replace.html)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Make `Precision` copy to make it clear clones are not expensive [datafusion]

2024-08-12 Thread via GitHub


crepererum commented on code in PR #11828:
URL: https://github.com/apache/datafusion/pull/11828#discussion_r1713654802


##
datafusion/common/src/stats.rs:
##
@@ -25,7 +25,7 @@ use arrow_schema::Schema;
 
 /// Represents a value with a degree of certainty. `Precision` is used to
 /// propagate information the precision of statistical values.
-#[derive(Clone, PartialEq, Eq, Default)]
+#[derive(Clone, PartialEq, Eq, Default, Copy)]

Review Comment:
   BTW: You can always manually implement the `Copy` trait if you want to have 
fine grained control about the bounds:
   
   ```rust
   impl Copy for Precision where: TDebug + Clone + PartialEq + Eq + 
PartialOrd + Clone {}
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Make `Precision` copy to make it clear clones are not expensive [datafusion]

2024-08-12 Thread via GitHub


crepererum commented on code in PR #11828:
URL: https://github.com/apache/datafusion/pull/11828#discussion_r1713654802


##
datafusion/common/src/stats.rs:
##
@@ -25,7 +25,7 @@ use arrow_schema::Schema;
 
 /// Represents a value with a degree of certainty. `Precision` is used to
 /// propagate information the precision of statistical values.
-#[derive(Clone, PartialEq, Eq, Default)]
+#[derive(Clone, PartialEq, Eq, Default, Copy)]

Review Comment:
   BTW: You can always manually implement the `Copy` trait if you want to have 
fine grained control about the bounds:
   
   ```rust
   impl Copy for Precision where T: Debug + Clone + PartialEq + Eq + 
PartialOrd + Clone {}
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Sort on single struct should fallback to Spark [datafusion-comet]

2024-08-12 Thread via GitHub


andygrove commented on code in PR #811:
URL: https://github.com/apache/datafusion-comet/pull/811#discussion_r1713770493


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -2501,6 +2501,13 @@ object QueryPlanSerde extends Logging with 
ShimQueryPlanSerde with CometExprShim
 
   case SortExec(sortOrder, _, child, _)
   if isCometOperatorEnabled(op.conf, CometConf.OPERATOR_SORT) =>
+// TODO: Remove this constraint when we upgrade to new arrow-rs 
including
+// https://github.com/apache/arrow-rs/pull/6225
+if (child.output.length == 1 && 
child.output.head.dataType.isInstanceOf[StructType]) {

Review Comment:
   As we add support for other types, do we need to update this to make it 
recursive so that we check for Map or Array containing struct?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Implement to_json for subset of types [datafusion-comet]

2024-08-12 Thread via GitHub


andygrove commented on PR #805:
URL: https://github.com/apache/datafusion-comet/pull/805#issuecomment-2283984392

   @parthchandra could you review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


Omega359 commented on code in PR #11942:
URL: https://github.com/apache/datafusion/pull/11942#discussion_r1713779456


##
datafusion/functions/src/unicode/rpad.rs:
##
@@ -77,96 +82,122 @@ impl ScalarUDFImpl for RPadFunc {
 
 fn invoke(&self, args: &[ColumnarValue]) -> Result {
 match args[0].data_type() {
-DataType::Utf8 => make_scalar_function(rpad::, vec![])(args),
+DataType::Utf8 | DataType::Utf8View => {
+make_scalar_function(rpad::, vec![])(args)
+}
 DataType::LargeUtf8 => make_scalar_function(rpad::, 
vec![])(args),
 other => exec_err!("Unsupported data type {other:?} for function 
rpad"),
 }
 }
 }
 
-/// Extends the string to length 'length' by appending the characters fill (a 
space by default). If the string is already longer than length then it is 
truncated.
-/// rpad('hi', 5, 'xy') = 'hixyx'
-pub fn rpad(args: &[ArrayRef]) -> Result {
-match args.len() {
-2 => {
-let string_array = as_generic_string_array::(&args[0])?;
-let length_array = as_int64_array(&args[1])?;
-
-let result = string_array
-.iter()
-.zip(length_array.iter())
-.map(|(string, length)| match (string, length) {
-(Some(string), Some(length)) => {
-if length > i32::MAX as i64 {
-return exec_err!(
-"rpad requested length {length} too large"
-);
-}
+macro_rules! process_rpad {
+// For the two-argument case
+($string_array:expr, $length_array:expr, $is_view:expr) => {{
+$string_array
+.iter()
+.zip($length_array.iter())
+.map(|(string, length)| match (string, length) {
+(Some(string), Some(length)) => {
+if length > i32::MAX as i64 {
+return exec_err!("rpad requested length {} too large", 
length);
+}
 
-let length = if length < 0 { 0 } else { length as 
usize };
-if length == 0 {
-Ok(Some("".to_string()))
+let length = if length < 0 { 0 } else { length as usize };
+if length == 0 {
+Ok(Some("".to_string()))
+} else {
+let graphemes = 
string.graphemes(true).collect::>();
+if length < graphemes.len() {
+Ok(Some(graphemes[..length].concat()))
 } else {
-let graphemes = 
string.graphemes(true).collect::>();
-if length < graphemes.len() {
-Ok(Some(graphemes[..length].concat()))
+let mut s = string.to_string();
+if !$is_view {
+s.push_str(&" ".repeat(length - 
graphemes.len()));
 } else {
-let mut s = string.to_string();
 s.push_str(" ".repeat(length - 
graphemes.len()).as_str());
-Ok(Some(s))
 }
+Ok(Some(s))
 }
 }
-_ => Ok(None),
-})
-.collect::>>()?;
+}
+_ => Ok(None),
+})
+.collect::>>()
+}};
+
+// For the three-argument case
+($string_array:expr, $length_array:expr, $fill_array:expr, $is_view:expr) 
=> {{
+$string_array
+.iter()
+.zip($length_array.iter())
+.zip($fill_array.iter())
+.map(|((string, length), fill)| match (string, length, fill) {
+(Some(string), Some(length), Some(fill)) => {
+if length > i32::MAX as i64 {
+return exec_err!("rpad requested length {} too large", 
length);
+}
+
+let length = if length < 0 { 0 } else { length as usize };
+let graphemes = 
string.graphemes(true).collect::>();
+let fill_chars = fill.chars().collect::>();
+
+if length < graphemes.len() {
+Ok(Some(graphemes[..length].concat()))
+} else if fill_chars.is_empty() {
+Ok(Some(string.to_string()))
+} else {
+let mut s = string.to_string();
+let char_vector: Vec = (0..length - 
graphemes.len())
+.map(|l| fill_chars[l % fill_chars.len()])
+.collect();
+s.push_str(&char

[PR] minor: Update release documentation based on 41.0.0 release [datafusion]

2024-08-12 Thread via GitHub


andygrove opened a new pull request, #11947:
URL: https://github.com/apache/datafusion/pull/11947

   ## Which issue does this PR close?
   
   
   
   Part of https://github.com/apache/datafusion/issues/11476
   
   ## Rationale for this change
   
   
   
   Update documentation to add new crates and remove the out-of-date dependency 
diagram
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Add native stringview support for LTRIM & RTRIM [datafusion]

2024-08-12 Thread via GitHub


Kev1n8 opened a new pull request, #11948:
URL: https://github.com/apache/datafusion/pull/11948

   ## Which issue does this PR close?
   
   
   
   Closes #11856 and #11916
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   Added `ltrim` and `rtrim` function signatures to support stringview.
   
   
   ## Are these changes tested?
   yes
   
   
   ## Are there any user-facing changes?
   no
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] minor: Update release documentation based on 41.0.0 release [datafusion]

2024-08-12 Thread via GitHub


andygrove commented on code in PR #11947:
URL: https://github.com/apache/datafusion/pull/11947#discussion_r1713783391


##
datafusion/catalog/Cargo.toml:
##
@@ -17,6 +17,7 @@
 
 [package]
 name = "datafusion-catalog"
+description = "datafusion-catalog"

Review Comment:
   I had to add a description so that I could publish the crate. It would 
obviously be better to have a more meaningful description here, but I am not 
familiar with this new crate yet.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


Omega359 commented on code in PR #11942:
URL: https://github.com/apache/datafusion/pull/11942#discussion_r1713789099


##
datafusion/functions/src/unicode/rpad.rs:
##
@@ -46,8 +48,11 @@ impl RPadFunc {
 signature: Signature::one_of(
 vec![
 Exact(vec![Utf8, Int64]),

Review Comment:
   As noted in https://github.com/apache/datafusion/pull/11924/files::
   
   // Planner attempts coercion to the target type starting with the most 
preferred candidate.
   // For example, given input `(Utf8View, Utf8)`, it first tries coercing to 
`(Utf8View, Utf8View)`.
   // If that fails, it proceeds to `(Utf8, Utf8)`.
   
   Thus, the ordering of the signature should likely be changed to reflect that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement Utf8View for lpad scalar function [datafusion]

2024-08-12 Thread via GitHub


Omega359 commented on PR #11941:
URL: https://github.com/apache/datafusion/pull/11941#issuecomment-2284011527

   I was looking at @Lordworms implementation in 
https://github.com/apache/datafusion/pull/11942 and I think it would make sense 
to align the two implementations. In some aspects I like his approach much more 
than the one I took in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Move `LimitPushdown` to physical-optimizer crate [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 merged PR #11945:
URL: https://github.com/apache/datafusion/pull/11945


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Move `LimitPushdown` to physical-optimizer crate [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 commented on PR #11945:
URL: https://github.com/apache/datafusion/pull/11945#issuecomment-2284039319

   Thanks @lewiszlw @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Enable shuffle in micro benchmarks [datafusion-comet]

2024-08-12 Thread via GitHub


andygrove merged PR #806:
URL: https://github.com/apache/datafusion-comet/pull/806


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `RIGHT` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


Kev1n8 commented on issue #11917:
URL: https://github.com/apache/datafusion/issues/11917#issuecomment-2284063257

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Improve comments in row_hash.rs for skipping aggregation [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 commented on PR #11820:
URL: https://github.com/apache/datafusion/pull/11820#issuecomment-2284066506

   Thanks all


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Minor: Improve comments in row_hash.rs for skipping aggregation [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 merged PR #11820:
URL: https://github.com/apache/datafusion/pull/11820


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `REGEXP_MATCH` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


PsiACE commented on issue #11911:
URL: https://github.com/apache/datafusion/issues/11911#issuecomment-2284066645

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `regexp_replace` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


PsiACE commented on issue #11912:
URL: https://github.com/apache/datafusion/issues/11912#issuecomment-2284067440

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Manage group values and states by blocks in aggregation [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 commented on issue #11931:
URL: https://github.com/apache/datafusion/issues/11931#issuecomment-2284071358

   > /// For example, `n= 10`, `block size=4`, `n` will be aligned to 12, 
   /// and finally 3 blocks will be returned.
   FirstBlocks(usize)
   
   I think emitting with "n" blocks is much more straightforward. n = 3, block 
size = 4. emit 3 * 4 = 12 elements


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


Omega359 commented on PR #11942:
URL: https://github.com/apache/datafusion/pull/11942#issuecomment-2284082623

   You may want to look at the tests I added in 
https://github.com/apache/datafusion/pull/11941 as an example to verify that 
all the signature variants are tested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [Epic] Native `StringView` support for string functions [datafusion]

2024-08-12 Thread via GitHub


2010YOUY01 commented on issue #11790:
URL: https://github.com/apache/datafusion/issues/11790#issuecomment-2284100621

   Inspired by @Omega359 's great PR 
https://github.com/apache/datafusion/pull/11941, I have some suggestion on 
testing `Utf8View` support for functions:
   
   Although most implementation is adapted from existing implementation, but 
the execution takes another path, so I think comprehensive end-to-end tests are 
still needed.
   The good news is there already exists `sqllogictest`s for original string 
functions, the only thing to do is to duplicate existing testings with 
`Utf8View`
   
   Here are the examples on how to adapt existing test cases for `Utf8View` 
input
   1. For functions takes scalar value, use `arrow_cast()` like 
https://github.com/apache/datafusion/pull/11941/files#diff-51757b2b1d0a07b88551d88eabeba7f74e11b5217e44203ac7c6f613c0221196
   2. For functions read from a table, string column can be converted to 
`Utf8View` column like 
https://github.com/apache/datafusion/blob/2cf09566af7d7d5f83a8bdff5f0adda97d40deee/datafusion/sqllogictest/test_files/string_view.slt#L30-L42


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement Utf8View for lpad scalar function [datafusion]

2024-08-12 Thread via GitHub


alamb commented on PR #11941:
URL: https://github.com/apache/datafusion/pull/11941#issuecomment-2284118946

   > I was looking at @Lordworms implementation in #11942 and I think it would 
make sense to align the two implementations. In some aspects I like his 
approach much more than the one I took in this PR.
   
   Do you think we should merge this PR and work on improvements in a follow on 
PR? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Make `Precision` copy to make it clear clones are not expensive [datafusion]

2024-08-12 Thread via GitHub


alamb commented on code in PR #11828:
URL: https://github.com/apache/datafusion/pull/11828#discussion_r1713874112


##
datafusion/common/src/stats.rs:
##
@@ -25,7 +25,7 @@ use arrow_schema::Schema;
 
 /// Represents a value with a degree of certainty. `Precision` is used to
 /// propagate information the precision of statistical values.
-#[derive(Clone, PartialEq, Eq, Default)]
+#[derive(Clone, PartialEq, Eq, Default, Copy)]

Review Comment:
   👍  -- I think that is basically what `#[derive(Debug)]` does in this case, 
which is kind of cook



##
datafusion/common/src/stats.rs:
##
@@ -25,7 +25,7 @@ use arrow_schema::Schema;
 
 /// Represents a value with a degree of certainty. `Precision` is used to
 /// propagate information the precision of statistical values.
-#[derive(Clone, PartialEq, Eq, Default)]
+#[derive(Clone, PartialEq, Eq, Default, Copy)]

Review Comment:
   👍  -- I think that is basically what `#[derive(Debug)]` does in this case, 
which is kind of cool



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add SessionState to MockContextProvider just like SessionContextProvider [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 merged PR #11940:
URL: https://github.com/apache/datafusion/pull/11940


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add SessionState to MockContextProvider just like SessionContextProvider [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 commented on PR #11940:
URL: https://github.com/apache/datafusion/pull/11940#issuecomment-2284135678

   Thanks @dharanad @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] chore: Add `SessionState` to `MockContextProvider` just like `SessionContextProvider` [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 closed issue #11531: chore: Add `SessionState` to 
`MockContextProvider` just like `SessionContextProvider`
URL: https://github.com/apache/datafusion/issues/11531


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Improve performance of inner hash join [datafusion-comet]

2024-08-12 Thread via GitHub


andygrove commented on issue #808:
URL: 
https://github.com/apache/datafusion-comet/issues/808#issuecomment-2284140087

   The `FilterExec` in the above example is even more expensive than the 
`HashJoinExec`. Evaluating the predicate is cheap but copying data to the 
filtered batch takes 99% of the time. We could potentially avoid this copy by 
using a selection vector approach instead.
   
   ```
   Time to compute filter mask on batch of 32768 rows is: 581ns
   Time to filter batch is: 252.194µs
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Implement Utf8View for lpad scalar function [datafusion]

2024-08-12 Thread via GitHub


Omega359 commented on PR #11941:
URL: https://github.com/apache/datafusion/pull/11941#issuecomment-2284141051

   > Do you think we should merge this PR and work on improvements in a follow 
on PR?
   
   I have no objections to that.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support creating arrays with non-nullable elements in `make_array` [datafusion]

2024-08-12 Thread via GitHub


Kimahriman commented on issue #11923:
URL: https://github.com/apache/datafusion/issues/11923#issuecomment-2284199250

   After digging a little more it seems like there's some other issues that 
might make this significantly more difficult:
   
   - When `invoke` is called, there's no way to know the logical nullability of 
the ColumnarValue's/ArrayRef's being passed in. The only thing available is 
`Array::is_nullable`, but it looks like all that does it check if the current 
array has any nulls or not, so that could vary from batch to batch. So inside 
`invoke` there's  no way to know if field inside
   ```rust
   Ok(Arc::new(GenericListArraytry_new(
   Arc::new(Field::new("item", data_type, true)),
   OffsetBuffer::new(offsets.into()),
   arrow_array::make_array(data),
   None,
   )?))
   ```
   should be logically nullable or not. And arrow will verify this when 
creating a record batch, which is the error I ran into initially with the 
nullability not matching: 
https://github.com/apache/arrow-rs/blob/5c5a94a11f01a286dd03b18af0f11c327a9accc6/arrow-array/src/record_batch.rs#L205
   - ScalarUDF's in general don't have a concept of nullability, they are 
always considered nullable. Only tangentially related but just a larger issue 
of nullability and ScalarUDFs: 
https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/scalar_function.rs#L120


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: `CreateArray` support [datafusion-comet]

2024-08-12 Thread via GitHub


Kimahriman commented on code in PR #793:
URL: https://github.com/apache/datafusion-comet/pull/793#discussion_r1713950119


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -2348,6 +2348,27 @@ object QueryPlanSerde extends Logging with 
ShimQueryPlanSerde with CometExprShim
   .build()
   }
 
+// datafusion's make_array only supports nullable element types
+case array @ CreateArray(children, _) if array.dataType.containsNull =>
+  val childExprs = children.map(exprToProto(_, inputs, binding))
+  val dataType = serializeDataType(array.dataType)

Review Comment:
   More digging made me realize this is a somewhat larger issue with 
ScalarUDFs, in that they don't support setting nullability at all (every 
ScalarUDF is assumed to be nullable). So how has this been handled elsewhere if 
at all? Is the best approach just to "pretend" the column is nullable for 
DataFusion, knowing that logically it should not contain any nulls and keep it 
as non-nullable on the Spark side? Otherwise any expression backed by a 
ScalarUDF can't support non-nullable expressions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support creating arrays with non-nullable elements in `make_array` [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 commented on issue #11923:
URL: https://github.com/apache/datafusion/issues/11923#issuecomment-2284236770

   > ```rust
   > Ok(Arc::new(GenericListArraytry_new(
   > ```
   
   We could have another `invoke_with_schema` function to get the `nullable` 
from schema. Of course, another incompatible change 😢 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Parse Sqllogictest column types from physical schema [datafusion]

2024-08-12 Thread via GitHub


alamb commented on code in PR #11929:
URL: https://github.com/apache/datafusion/pull/11929#discussion_r1713973317


##
datafusion/sqllogictest/src/engines/datafusion_engine/runner.rs:
##
@@ -69,9 +72,12 @@ impl sqllogictest::AsyncDB for DataFusion {
 
 async fn run_query(ctx: &SessionContext, sql: impl Into) -> 
Result {
 let df = ctx.sql(sql.into().as_str()).await?;
+let task_ctx = Arc::new(df.task_ctx());
+let plan = df.create_physical_plan().await?;
 
-let types = normalize::convert_schema_to_types(df.schema().fields());
-let results: Vec = df.collect().await?;
+let stream = execute_stream(plan, task_ctx)?;
+let types = normalize::convert_schema_to_types(stream.schema().fields());

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support creating arrays with non-nullable elements in `make_array` [datafusion]

2024-08-12 Thread via GitHub


jayzhan211 commented on issue #11923:
URL: https://github.com/apache/datafusion/issues/11923#issuecomment-2284262048

   I remember there was a discussion about replace 
ColumnarValue::Array(ArrayRef) with Recordbatch,  which contains `Schema`
   
   ```rust
   #[derive(Clone, Debug, PartialEq)]
   pub struct RecordBatch {
   schema: SchemaRef,
   columns: Vec>,
   
   /// The number of rows in this RecordBatch
   ///
   /// This is stored separately from the columns to handle the case of no 
columns
   row_count: usize,
   }
   ```
   
   Maybe Recordbatch is what we need for `invoke` 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update RPAD scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


Omega359 commented on code in PR #11942:
URL: https://github.com/apache/datafusion/pull/11942#discussion_r1713832682


##
datafusion/functions/src/unicode/rpad.rs:
##
@@ -77,96 +82,122 @@ impl ScalarUDFImpl for RPadFunc {
 
 fn invoke(&self, args: &[ColumnarValue]) -> Result {
 match args[0].data_type() {
-DataType::Utf8 => make_scalar_function(rpad::, vec![])(args),
+DataType::Utf8 | DataType::Utf8View => {
+make_scalar_function(rpad::, vec![])(args)
+}
 DataType::LargeUtf8 => make_scalar_function(rpad::, 
vec![])(args),
 other => exec_err!("Unsupported data type {other:?} for function 
rpad"),
 }
 }
 }
 
-/// Extends the string to length 'length' by appending the characters fill (a 
space by default). If the string is already longer than length then it is 
truncated.
-/// rpad('hi', 5, 'xy') = 'hixyx'
-pub fn rpad(args: &[ArrayRef]) -> Result {
-match args.len() {
-2 => {
-let string_array = as_generic_string_array::(&args[0])?;
-let length_array = as_int64_array(&args[1])?;
-
-let result = string_array
-.iter()
-.zip(length_array.iter())
-.map(|(string, length)| match (string, length) {
-(Some(string), Some(length)) => {
-if length > i32::MAX as i64 {
-return exec_err!(
-"rpad requested length {length} too large"
-);
-}
+macro_rules! process_rpad {
+// For the two-argument case
+($string_array:expr, $length_array:expr, $is_view:expr) => {{
+$string_array
+.iter()
+.zip($length_array.iter())
+.map(|(string, length)| match (string, length) {
+(Some(string), Some(length)) => {
+if length > i32::MAX as i64 {
+return exec_err!("rpad requested length {} too large", 
length);
+}
 
-let length = if length < 0 { 0 } else { length as 
usize };
-if length == 0 {
-Ok(Some("".to_string()))
+let length = if length < 0 { 0 } else { length as usize };
+if length == 0 {
+Ok(Some("".to_string()))
+} else {
+let graphemes = 
string.graphemes(true).collect::>();
+if length < graphemes.len() {
+Ok(Some(graphemes[..length].concat()))
 } else {
-let graphemes = 
string.graphemes(true).collect::>();
-if length < graphemes.len() {
-Ok(Some(graphemes[..length].concat()))
+let mut s = string.to_string();
+if !$is_view {
+s.push_str(&" ".repeat(length - 
graphemes.len()));
 } else {
-let mut s = string.to_string();
 s.push_str(" ".repeat(length - 
graphemes.len()).as_str());
-Ok(Some(s))
 }
+Ok(Some(s))
 }
 }
-_ => Ok(None),
-})
-.collect::>>()?;
+}
+_ => Ok(None),
+})
+.collect::>>()
+}};
+
+// For the three-argument case
+($string_array:expr, $length_array:expr, $fill_array:expr, $is_view:expr) 
=> {{
+$string_array
+.iter()
+.zip($length_array.iter())
+.zip($fill_array.iter())
+.map(|((string, length), fill)| match (string, length, fill) {
+(Some(string), Some(length), Some(fill)) => {
+if length > i32::MAX as i64 {
+return exec_err!("rpad requested length {} too large", 
length);
+}
+
+let length = if length < 0 { 0 } else { length as usize };
+let graphemes = 
string.graphemes(true).collect::>();
+let fill_chars = fill.chars().collect::>();
+
+if length < graphemes.len() {
+Ok(Some(graphemes[..length].concat()))
+} else if fill_chars.is_empty() {
+Ok(Some(string.to_string()))
+} else {
+let mut s = string.to_string();
+let char_vector: Vec = (0..length - 
graphemes.len())
+.map(|l| fill_chars[l % fill_chars.len()])
+.collect();
+s.push_str(&char

[PR] Use the tracked-consumers memory pool be the default. [datafusion]

2024-08-12 Thread via GitHub


wiedld opened a new pull request, #11949:
URL: https://github.com/apache/datafusion/pull/11949

   ## Which issue does this PR close?
   
   Closes #11523 
   
   ## Rationale for this change
   
   We would like the improved OOM error messages, which lists the top overall 
memory reservation consumers, to be the default.
   
   ## What changes are included in this PR?
   
   Make `TrackConsumersPool` be the default.
   Include benchmarks to assess potential performance overhead from added mutex.
   
   
   ## Are these changes tested?
   
   Yes, for the existing tests using the default pool.
   
   
   ## Are there any user-facing changes?
   
   No API changes.
   Change in OOM messages returned.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Update `SPLIT_PART` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


tshauck opened a new issue, #11950:
URL: https://github.com/apache/datafusion/issues/11950

   ### Is your feature request related to a problem or challenge?
   
   _No response_
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Update `STRPOS` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


tshauck opened a new issue, #11951:
URL: https://github.com/apache/datafusion/issues/11951

   ### Is your feature request related to a problem or challenge?
   
   _No response_
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Update `SUBSTR` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


tshauck opened a new issue, #11952:
URL: https://github.com/apache/datafusion/issues/11952

   ### Is your feature request related to a problem or challenge?
   
   _No response_
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Update `TRANSLATE` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


tshauck opened a new issue, #11953:
URL: https://github.com/apache/datafusion/issues/11953

   ### Is your feature request related to a problem or challenge?
   
   _No response_
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Update `FIND_IN_SET` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


tshauck opened a new issue, #11954:
URL: https://github.com/apache/datafusion/issues/11954

   ### Is your feature request related to a problem or challenge?
   
   _No response_
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Update `REVERSE` scalar function to support Utf8View [datafusion]

2024-08-12 Thread via GitHub


Omega359 commented on issue #11915:
URL: https://github.com/apache/datafusion/issues/11915#issuecomment-2284343484

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add tests for StringView / character functions, fix `regexp_like` and `regexp_match` to work with StringView [datafusion]

2024-08-12 Thread via GitHub


tshauck commented on code in PR #11753:
URL: https://github.com/apache/datafusion/pull/11753#discussion_r1714025767


##
datafusion/sqllogictest/test_files/string_view.slt:
##
@@ -594,8 +518,417 @@ SELECT
 
 228 0 NULL
 
+## Ensure no casts for BTRIM
+query TT
+EXPLAIN SELECT
+  BTRIM(column1_utf8view, 'foo') AS l
+FROM test;
+
+logical_plan
+01)Projection: btrim(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS l
+02)--TableScan: test projection=[column1_utf8view]
+
+## Ensure no casts for CHARACTER_LENGTH
+query TT
+EXPLAIN SELECT
+  CHARACTER_LENGTH(column1_utf8view) AS l
+FROM test;
+
+logical_plan
+01)Projection: character_length(test.column1_utf8view) AS l
+02)--TableScan: test projection=[column1_utf8view]
+
+## Ensure no casts for CONCAT
+## TODO https://github.com/apache/datafusion/issues/11836
+query TT
+EXPLAIN SELECT
+  concat(column1_utf8view, column2_utf8view) as c
+FROM test;
+
+logical_plan
+01)Projection: concat(CAST(test.column1_utf8view AS Utf8), 
CAST(test.column2_utf8view AS Utf8)) AS c
+02)--TableScan: test projection=[column1_utf8view, column2_utf8view]
+
+## Ensure no casts for CONCAT_WS
+## TODO https://github.com/apache/datafusion/issues/11837
+query TT
+EXPLAIN SELECT
+  concat_ws(', ', column1_utf8view, column2_utf8view) as c
+FROM test;
+
+logical_plan
+01)Projection: concat_ws(Utf8(", "), CAST(test.column1_utf8view AS Utf8), 
CAST(test.column2_utf8view AS Utf8)) AS c
+02)--TableScan: test projection=[column1_utf8view, column2_utf8view]
+
+## Ensure no casts for CONTAINS
+## TODO https://github.com/apache/datafusion/issues/11838
+query TT
+EXPLAIN SELECT
+  CONTAINS(column1_utf8view, 'foo') as c1,
+  CONTAINS(column2_utf8view, column2_utf8view) as c2
+FROM test;
+
+logical_plan
+01)Projection: contains(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS 
c1, contains(__common_expr_1, __common_expr_1) AS c2
+02)--Projection: CAST(test.column2_utf8view AS Utf8) AS __common_expr_1, 
test.column1_utf8view
+03)TableScan: test projection=[column1_utf8view, column2_utf8view]
+
+## Ensure no casts for ENDS_WITH
+## TODO https://github.com/apache/datafusion/issues/11852
+query TT
+EXPLAIN SELECT
+  ENDS_WITH(column1_utf8view, 'foo') as c1,
+  ENDS_WITH(column2_utf8view, column2_utf8view) as c2
+FROM test;
+
+logical_plan
+01)Projection: ends_with(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS 
c1, ends_with(__common_expr_1, __common_expr_1) AS c2
+02)--Projection: CAST(test.column2_utf8view AS Utf8) AS __common_expr_1, 
test.column1_utf8view
+03)TableScan: test projection=[column1_utf8view, column2_utf8view]
+
+
+## Ensure no casts for INITCAP
+## TODO https://github.com/apache/datafusion/issues/11853
+query TT
+EXPLAIN SELECT
+  INITCAP(column1_utf8view) as c
+FROM test;
+
+logical_plan
+01)Projection: initcap(CAST(test.column1_utf8view AS Utf8)) AS c
+02)--TableScan: test projection=[column1_utf8view]
+
+## Ensure no casts for LEVENSHTEIN
+## TODO https://github.com/apache/datafusion/issues/11854
+query TT
+EXPLAIN SELECT
+  levenshtein(column1_utf8view, 'foo') as c1,
+  levenshtein(column1_utf8view, column2_utf8view) as c2
+FROM test;
+
+logical_plan
+01)Projection: levenshtein(__common_expr_1, Utf8("foo")) AS c1, 
levenshtein(__common_expr_1, CAST(test.column2_utf8view AS Utf8)) AS c2
+02)--Projection: CAST(test.column1_utf8view AS Utf8) AS __common_expr_1, 
test.column2_utf8view
+03)TableScan: test projection=[column1_utf8view, column2_utf8view]
+
+## Ensure no casts for LOWER
+## TODO https://github.com/apache/datafusion/issues/11855
+query TT
+EXPLAIN SELECT
+  LOWER(column1_utf8view) as c1
+FROM test;
+
+logical_plan
+01)Projection: lower(CAST(test.column1_utf8view AS Utf8)) AS c1
+02)--TableScan: test projection=[column1_utf8view]
+
+## Ensure no casts for LTRIM
+## TODO https://github.com/apache/datafusion/issues/11856
+query TT
+EXPLAIN SELECT
+  LTRIM(column1_utf8view) as c1
+FROM test;
+
+logical_plan
+01)Projection: ltrim(CAST(test.column1_utf8view AS Utf8)) AS c1
+02)--TableScan: test projection=[column1_utf8view]
+
+## Ensure no casts for LPAD
+## TODO https://github.com/apache/datafusion/issues/11857
+query TT
+EXPLAIN SELECT
+  LPAD(column1_utf8view, 12, ' ') as c1
+FROM test;
+
+logical_plan
+01)Projection: lpad(CAST(test.column1_utf8view AS Utf8), Int64(12), Utf8(" ")) 
AS c1
+02)--TableScan: test projection=[column1_utf8view]
+
+
+## Ensure no casts for OCTET_LENGTH
+## TODO https://github.com/apache/datafusion/issues/11858
+query TT
+EXPLAIN SELECT
+  OCTET_LENGTH(column1_utf8view) as c1
+FROM test;
+
+logical_plan
+01)Projection: octet_length(CAST(test.column1_utf8view AS Utf8)) AS c1
+02)--TableScan: test projection=[column1_utf8view]
+
+## Ensure no casts for OVERLAY
+## TODO file ticket
+query TT
+EXPLAIN SELECT
+  OVERLAY(column1_utf8view PLACING 'foo' FROM 2 ) as c1
+FROM test;
+
+logical_plan
+01)Projection: overlay(CAST(test.column1_utf8view AS Utf8), Utf8("foo")

Re: [PR] Use the tracked-consumers memory pool be the default. [datafusion]

2024-08-12 Thread via GitHub


wiedld commented on PR #11949:
URL: https://github.com/apache/datafusion/pull/11949#issuecomment-2284349835

   
   Benchmark clickbench_1
   
   
|  Query |   main_base  |  11523_tracked-consumers-default  |   
 Change  | 
| -- |  | - | 
-- | 
 |   QQuery 0   |   0.52ms   |0.53ms   
|  no change   |  
 |   QQuery 1   |  91.61ms   |   92.41ms   
|  no change   |  
 |   QQuery 2   | 235.65ms   |  233.69ms   
|  no change   |  
 |   QQuery 3   | 195.27ms   |  189.60ms   
|  no change   |  
 |   QQuery 4   |2253.70ms   | 2310.43ms   
|  no change   |  
 |   QQuery 5   |2113.38ms   | 2299.27ms   
|   1.09x slower   |  
 |   QQuery 6   |  84.20ms   |   82.10ms   
|  no change   |  
 |   QQuery 7   |  97.18ms   |   97.45ms   
|  no change   |  
 |   QQuery 8   |3212.14ms   | 3615.31ms   
|   1.13x slower   |  
 |   QQuery 9   |2539.55ms   | 2537.32ms   
|  no change   |  
 |   QQuery 10  |1997.85ms   | 1998.86ms   
|  no change   |  
 |   QQuery 11  |2068.84ms   | 2088.95ms   
|  no change   |  
 |   QQuery 12  |3082.54ms   | 3277.47ms   
|   1.06x slower   |  
 |   QQuery 13  |4732.25ms   | 5160.32ms   
|   1.09x slower   |  
 |   QQuery 14  |3866.84ms   | 4161.76ms   
|   1.08x slower   |  
 |   QQuery 15  |2649.19ms   | 2758.73ms   
|  no change   |  
 |   QQuery 16  |6131.58ms   | 6935.31ms   
|   1.13x slower   |  
 |   QQuery 17  |6047.45ms   | 6782.28ms   
|   1.12x slower   |  
 |   QQuery 18  |   10795.80ms   |12602.46ms   
|   1.17x slower   |  
 |   QQuery 19  | 169.50ms   |  163.48ms   
|  no change   |  
 |   QQuery 20  |3270.49ms   | 3224.10ms   
|  no change   |  
 |   QQuery 21  |5061.89ms   | 5050.48ms   
|  no change   |  
 |   QQuery 22  |   12660.40ms   |12715.04ms   
|  no change   |  
 |   QQuery 23  |   27705.44ms   |28134.60ms   
|  no change   |  
 |   QQuery 24  |2485.82ms   | 2479.34ms   
|  no change   |  
 |   QQuery 25  |2273.66ms   | 2284.05ms   
|  no change   |  
 |   QQuery 26  |2562.82ms   | 2560.38ms   
|  no change   |  
 |   QQuery 27  |4308.90ms   | 4316.42ms   
|  no change   |  
 |   QQuery 28  |   36444.61ms   |36264.41ms   
|  no change   |  
 |   QQuery 29  |1386.33ms   | 1380.30ms   
|  no change   |  
 |   QQuery 30  |3864.30ms   | 3874.16ms   
|  no change   |  
 |   QQuery 31  |4622.73ms   | 4637.74ms   
|  no change   |  
 |   QQuery 32  |   21909.22ms   |22018.07ms   
|  no change   |  
 |   QQuery 33  |   10431.35ms   |10353.16ms   
|  no change   |  
 |   QQuery 34  |   10348.38ms   |10348.83ms   
|  no change   |  
 |   QQuery 35  |4157.44ms   | 4150.37ms   
|  no change   |  
 |   QQuery 36  | 226.46ms   |  230.03ms   
|  no change   |  
 |   QQuery 37  | 166.41ms   |  165.30ms   
|  no change   |  
 |   QQuery 38  | 132.29ms   |  141.74ms   
|   1.07x slower   |  
 |   QQuery 39  | 790.22ms   |  772.59ms   
|  no change   |  
 |   QQuery 40  |  60.17ms   |   67.44ms   
|   1.12x slower   |  
 |   QQuery 41  |  57.77ms   |   59.21ms   
|  no change   |  
 |   QQuery 42  |  71.12ms   |   69.47ms   
|  no change   |  
   
   
|  Benchmark Summary   |   | 
|  | - | 
 |   Total Time (main_base)

Re: [I] Improve performance of broadcast hash join [datafusion-comet]

2024-08-12 Thread via GitHub


andygrove commented on issue #808:
URL: 
https://github.com/apache/datafusion-comet/issues/808#issuecomment-2284350076

   The filter on the probe input is very simple (`col_0@0 IS NOT NULL`) and it 
should be possible to push down to the parquet scan?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Enable Comet shuffle with AQE coalesce partitions [datafusion-comet]

2024-08-12 Thread via GitHub


viirya closed pull request #651: chore: Enable Comet shuffle with AQE coalesce 
partitions
URL: https://github.com/apache/datafusion-comet/pull/651


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: bugs when having and group by are all false [datafusion]

2024-08-12 Thread via GitHub


Lordworms commented on PR #11897:
URL: https://github.com/apache/datafusion/pull/11897#issuecomment-2284447751

   > EliminateGroupByConstant
   
   
   
   > Ideally group by constant should be eliminated, but the result is 
different when there is no row and we can't differentiate it after 
`EliminateGroupByConstant`.
   > 
   > I think this is why you bring the `is_global_group_by` information down to 
physical layer.
   exactly
   > 
   > I think another approach is we avoid `EliminateGroupByConstant` at all and 
we eliminate it when creating physical group by expression 🤔
   I don't think we should completely avoid this rule since it has its own 
usage, for example here, if we disable it, the plan would be like 
   https://github.com/user-attachments/assets/bcc4988c-9164-4a42-a6be-130bdd02111a";>
   
   and with enable it, the plan is 
   https://github.com/user-attachments/assets/c7501fdd-a8b7-48fd-9cb8-d15c3c32d28b";>
   
   we introduced an unnecessary Projection which I don't feel suitable. Also I 
think there should be other test cases which may require AggregateExec to emit 
empty set, since right now it always returns an null with no input.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Change default value of `datafusion.catalog.has_header` to `true` [datafusion]

2024-08-12 Thread via GitHub


alamb commented on issue #11936:
URL: https://github.com/apache/datafusion/issues/11936#issuecomment-2284459200

   I think default to "no headers" is unlikely to have been a conscious choice 
(I suspect it was a historical accident)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Add native stringview support for RIGHT [datafusion]

2024-08-12 Thread via GitHub


Kev1n8 opened a new pull request, #11955:
URL: https://github.com/apache/datafusion/pull/11955

   ## Which issue does this PR close?
   
   
   
   Closes #11917
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   Update signature of RIGHT to accept Utf8View and add some tests for that.
   
   
   ## Are these changes tested?
   yes
   
   
   ## Are there any user-facing changes?
   no
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add native stringview support for RIGHT [datafusion]

2024-08-12 Thread via GitHub


Kev1n8 commented on PR #11955:
URL: https://github.com/apache/datafusion/pull/11955#issuecomment-2284463629

   Pasting a possible implementation of `StringViewArray` output here (current 
implementation returns `StringArray`):
   ```
   fn string_view_right(args: &[ArrayRef]) -> Result {
   let string_array = as_string_view_array(&args[0])?;
   let n_array = as_int64_array(&args[1])?;
   
   let mut builder = StringViewBuilder::new();
   let combined = string_array.iter().zip(n_array);
   
   for iter in combined {
   if let (Some(str), Some(n)) = iter {
   match n.cmp(&0) {
   Ordering::Less => {
   let start = n.unsigned_abs() as usize;
   let sub_string = &str[start..];
   builder.append_value(sub_string);
   }
   Ordering::Equal => {
   builder.append_null();
   }
   Ordering::Greater => {
   let start = max(str.chars().count() as i64 - n, 0) as 
usize;
   let sub_string = &str[start..];
   builder.append_value(sub_string);
   }
   }
   } else {
   builder.append_null();
   }
   }
   let result = builder.finish();
   Ok(Arc::new(result) as ArrayRef)
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Update labeler.yml to match crates [datafusion]

2024-08-12 Thread via GitHub


alamb commented on PR #11937:
URL: https://github.com/apache/datafusion/pull/11937#issuecomment-2284501265

   Thanks for the review @comphead  -- let's see how this works!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



  1   2   >