[PR] chore(deps): bump libc from 0.2.173 to 0.2.174 [datafusion]

2025-06-18 Thread via GitHub
dependabot[bot] opened a new pull request, #16440: URL: https://github.com/apache/datafusion/pull/16440 Bumps [libc](https://github.com/rust-lang/libc) from 0.2.173 to 0.2.174. Release notes Sourced from [libc's releases](https://github.com/rust-lang/libc/releases). 0.2.174

Re: [PR] Implementation for regex_instr [datafusion]

2025-06-18 Thread via GitHub
nirnayroy commented on code in PR #15928: URL: https://github.com/apache/datafusion/pull/15928#discussion_r2154895463 ## datafusion/functions/src/regex/regexpcount.rs: ## @@ -550,7 +550,7 @@ where } } -fn compile_and_cache_regex<'strings, 'cache>( +pub fn compile_and_cac

[PR] chore(deps): bump bzip2 from 0.5.2 to 0.6.0 [datafusion]

2025-06-18 Thread via GitHub
dependabot[bot] opened a new pull request, #16441: URL: https://github.com/apache/datafusion/pull/16441 Bumps [bzip2](https://github.com/trifectatechfoundation/bzip2-rs) from 0.5.2 to 0.6.0. Release notes Sourced from https://github.com/trifectatechfoundation/bzip2-rs/releases bz

Re: [PR] chore(deps): bump bzip2 from 0.5.2 to 0.6.0 [datafusion]

2025-06-18 Thread via GitHub
xudong963 merged PR #16441: URL: https://github.com/apache/datafusion/pull/16441 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@data

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-18 Thread via GitHub
epgif commented on code in PR #16401: URL: https://github.com/apache/datafusion/pull/16401#discussion_r2154521172 ## datafusion/catalog/src/schema.rs: ## @@ -54,6 +55,14 @@ pub trait SchemaProvider: Debug + Sync + Send { name: &str, ) -> Result>, DataFusionError>;

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2984064963 2. Easier to serialize across the wire Yeah that part is of course true (especially larger tables you probably want to avoid sending over the network). the `1. More

Re: [PR] Prune files during streams and avoid additional pruning if there are no dynamic filters [datafusion]

2025-06-18 Thread via GitHub
xudong963 commented on code in PR #16424: URL: https://github.com/apache/datafusion/pull/16424#discussion_r2154535222 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -524,6 +512,91 @@ fn should_enable_page_index( .unwrap_or(false) } +/// Prune based on part

Re: [PR] Feat: Support Spark 4.0.0 part1 [datafusion-comet]

2025-06-18 Thread via GitHub
YanivKunda commented on code in PR #1830: URL: https://github.com/apache/datafusion-comet/pull/1830#discussion_r2154538985 ## dev/.DS_Store: ## Review Comment: macOS local file ## dev/diffs/4.0.0-diff.patch: ## Review Comment: looks like a temporary

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
xudong963 commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2984129180 From my past experience, bloom filter mostly generates a negative impact. And for most cases, min-max works fine.

Re: [PR] feat: support fixed size list for array reverse [datafusion]

2025-06-18 Thread via GitHub
comphead merged PR #16423: URL: https://github.com/apache/datafusion/pull/16423

Re: [PR] chore(deps): bump libc from 0.2.173 to 0.2.174 [datafusion]

2025-06-18 Thread via GitHub
comphead merged PR #16440: URL: https://github.com/apache/datafusion/pull/16440

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2155006506 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection, s

Re: [PR] feat: Add support to lookup map by key [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1898: URL: https://github.com/apache/datafusion-comet/pull/1898#discussion_r2155022740 ## native/core/src/execution/planner.rs: ## @@ -555,7 +555,21 @@ impl PhysicalPlanner { fail_on_error, )))

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2155025406 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection, s

[PR] Support multiple column options in `CREATE VIEW` [datafusion-sqlparser-rs]

2025-06-18 Thread via GitHub
eliaperantoni opened a new pull request, #1891: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1891 A Snowflake query like: ```sql CREATE VIEW X (COL WITH TAG (pii='email') COMMENT 'foobar') AS SELECT * FROM Y ``` Would've previously failed because it contain

[PR] Change tag and policy names to `ObjectName` instead of `Ident` [datafusion-sqlparser-rs]

2025-06-18 Thread via GitHub
eliaperantoni opened a new pull request, #1892: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1892 In e.g. Snowflake, tags can have qualifying elements in their name: ```sql CREATE VIEW foo AS (SELECT 1 AS COL WITH TAG foo.bar.baz) ``` But the parser currentl

[PR] Fix `impl Ord for Ident` [datafusion-sqlparser-rs]

2025-06-18 Thread via GitHub
eliaperantoni opened a new pull request, #1893: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1893 At the moment, the `Span` of an `Ident` influences its ordering, but this doesn't seem consistent with the implementations of `PartialEq`, `Hash`, and the general guideli

Re: [PR] Change tag and policy names to `ObjectName` instead of `Ident` [datafusion-sqlparser-rs]

2025-06-18 Thread via GitHub
eliaperantoni commented on PR #1892: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1892#issuecomment-2983066319 It would be nice if #1891 was merged first so that I can write a Snowflake test like the example in this PR's description, since `WITH TAG ...` is not currently sup

[I] 404 for indexed docs page [datafusion]

2025-06-18 Thread via GitHub
samuelcolvin opened a new issue, #16438: URL: https://github.com/apache/datafusion/issues/16438 When searching "datafusion create_udf", the first result is `https://datafusion.apache.org/library-user-guide/adding-udfs.html` which is showing a 404 page.

Re: [I] 404 for indexed docs page [datafusion]

2025-06-18 Thread via GitHub
samuelcolvin commented on issue #16438: URL: https://github.com/apache/datafusion/issues/16438#issuecomment-2983122187 If docs pages are moved, a redirect should be used to both help users and maintain SEO.

Re: [I] [DISCUSSION] JOIN "task force" / project team [datafusion]

2025-06-18 Thread via GitHub
xudong963 commented on issue #15885: URL: https://github.com/apache/datafusion/issues/15885#issuecomment-2983943967 > DuckDB's implementation: transform dependent join, flatten dependent join, eliminate delimjoin I believe we're almost the same, except the implementation of databend doesn

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983928323 > We already have the full table in memory, so we can not really save anything by compressing it into a bloom filter. Agreed: if we're not concerned with larger-than-me

Re: [PR] Introduce Async User Defined Functions [datafusion]

2025-06-18 Thread via GitHub
samuelcolvin commented on PR #14837: URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2983962469 https://github.com/apache/datafusion/blob/630aa7b0c7b44ea8e77f9e0d685bf79f2a3cd3bd/datafusion/core/src/execution/context/mod.rs#L1766 Needs an option for async UDFs I gues

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983990746 > 2. Easier to serialize across the wire This is actually something I've started looking at in the last day and got stuck pretty quickly trying to serialize the HashBro

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2154953337 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection,

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-18 Thread via GitHub
comphead commented on code in PR #16401: URL: https://github.com/apache/datafusion/pull/16401#discussion_r2154936751 ## datafusion/catalog/src/schema.rs: ## @@ -54,6 +55,14 @@ pub trait SchemaProvider: Debug + Sync + Send { name: &str, ) -> Result>, DataFusionError

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan closed pull request #16445: Add dynamic filter (bounds) pushdown to HashJoinExec URL: https://github.com/apache/datafusion/pull/16445

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2984776797 I think we should also consider a heuristic for not evaluating the filter if it's not useful. Also I think doing only the lookup is preferable above also computing / checking

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2984782413 Sorry, misclicked a button.

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2984788592 > I think doing only the lookup is preferable above also computing / checking the bounds, I think the latter might create more overhead My thought was that for some cases the

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2154970369 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection, s

Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-18 Thread via GitHub
2010YOUY01 commented on code in PR #16268: URL: https://github.com/apache/datafusion/pull/16268#discussion_r2154095345 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -1324,6 +1326,7 @@ impl Stream for SortMergeJoinStream { impl SortMergeJoinStream { #[allo

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on code in PR #16433: URL: https://github.com/apache/datafusion/pull/16433#discussion_r2154078467 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -319,13 +341,88 @@ impl TopK { /// (a > 2 OR (a = 2 AND b < 3)) /// ``` fn update_filter(&mut s

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
alamb commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983585902 FYI @mbutrovich -- I believe you were working on something like this related to Comet -- maybe it is worth a look / review here to make sure the design works with comet too if po

Re: [PR] [datafusion-spark] Implement ceil&floor function for spark [datafusion]

2025-06-18 Thread via GitHub
alamb commented on PR #15958: URL: https://github.com/apache/datafusion/pull/15958#issuecomment-2983591412 > > @irenjj - I wonder if you would be willing to help pick this PR back up now that we have merged a PR with a bunch of tests from @shehabgamin here: > > > > * [chore: generate

[PR] chore: move udf registration to better place [datafusion-comet]

2025-06-18 Thread via GitHub
rluvaton opened a new pull request, #1899: URL: https://github.com/apache/datafusion-comet/pull/1899 ## Which issue does this PR close? N/A ## Rationale for this change Some registration exists in the `prepare_datafusion_session_context` and some in the `PhysicalPlanner:

Re: [PR] chore: move udf registration to better place [datafusion-comet]

2025-06-18 Thread via GitHub
codecov-commenter commented on PR #1899: URL: https://github.com/apache/datafusion-comet/pull/1899#issuecomment-2984407095 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1899?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

[PR] chore: Introduce `exprHandlers` map in QueryPlanSerde [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove opened a new pull request, #1903: URL: https://github.com/apache/datafusion-comet/pull/1903 ## Which issue does this PR close? N/A ## Rationale for this change We recently started to make QueryPlanSerde more maintainable by moving expression ser

Re: [PR] chore: Introduce `exprHandlers` map in QueryPlanSerde [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1903: URL: https://github.com/apache/datafusion-comet/pull/1903#discussion_r2154774569 ## docs/source/user-guide/expressions.md: ## @@ -127,7 +127,7 @@ The following Spark expressions are currently available. Any known compatibility | Log10

Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-18 Thread via GitHub
2010YOUY01 commented on PR #16268: URL: https://github.com/apache/datafusion/pull/16268#issuecomment-2983191181 close and reopen to trigger CI again

Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-18 Thread via GitHub
2010YOUY01 closed pull request #16268: Add compression option to SpillManager URL: https://github.com/apache/datafusion/pull/16268

[I] SortMergeJoin with timestamp fix [datafusion-comet]

2025-06-18 Thread via GitHub
SKY-ALIN opened a new issue, #1900: URL: https://github.com/apache/datafusion-comet/issues/1900 ### Describe the bug This is what I get when I use `CAST(time AS TIMESTAMP)` as key ```shell 25/06/18 12:51:12 WARN CometExecRule: Comet cannot execute some parts of this plan nat

[PR] Fix SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub
SKY-ALIN opened a new pull request, #1901: URL: https://github.com/apache/datafusion-comet/pull/1901 ## Which issue does this PR close? Closes #1900. ## Rationale for this change This type is supported, but missed on the proto stage + message formatting is incorr

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983932272 @Dandandan the two ways I thought a bloom filter would be advantageous: 1. More performant if applied to each row than the full hash table, although I admit I haven't poked a

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983934401 Either way I think we can decouple the two things: there seems to be some interest in adding a bloom filter expression, that can be developed in parallel with the hash join pus

[I] Datafusion 48 Clickbench Q6 and Q0 regression [datafusion]

2025-06-18 Thread via GitHub
robert3005 opened a new issue, #16444: URL: https://github.com/apache/datafusion/issues/16444 ### Describe the bug When upgrading to Datafusion 48 our continuous benchmarking infra detected a 25x regression in Q6 and 10x in Q0 of Clickbench. This query was previously answered all from

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
alamb commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2984503937 > > We already have the full table in memory, so we can not really save anything by compressing it into a bloom filter. > > Agreed: if we're not concerned with larger-than-m

[PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb opened a new pull request, #16445: URL: https://github.com/apache/datafusion/pull/16445 Part of #7955. My goal here is to lay the groundwork for pushing down joins. I am only implementing bounds pushdown because I am sure that is cheap and it will probably be quite effecti
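
The bounds idea mentioned in this PR description can be illustrated with a minimal sketch. It assumes an inner join on a single `i64` key; the function names `key_bounds` and `prune_probe` are hypothetical and are not taken from the PR or from DataFusion. The build side yields a min/max pair, and probe-side rows outside that range are dropped before the hash lookup.

```rust
/// Min/max of the build-side join keys, or `None` when the build side is empty.
/// (Hypothetical helper for illustration; not DataFusion's API.)
pub fn key_bounds(build_keys: &[i64]) -> Option<(i64, i64)> {
    let min = *build_keys.iter().min()?;
    let max = *build_keys.iter().max()?;
    Some((min, max))
}

/// Keep only probe-side keys that fall inside the build-side bounds.
/// Assumes inner-join semantics: with an empty build side nothing can match.
/// Keys inside the bounds may still miss in the hash table, so the join
/// itself must still run on whatever survives.
pub fn prune_probe(probe_keys: &[i64], bounds: Option<(i64, i64)>) -> Vec<i64> {
    match bounds {
        Some((lo, hi)) => probe_keys
            .iter()
            .copied()
            .filter(|k| (lo..=hi).contains(k))
            .collect(),
        None => Vec::new(),
    }
}
```

With build keys `[10, 42, 17]` the bounds are `(10, 42)`, so a probe key such as `30` passes the range check even though it is absent from the build side; the bounds only discard rows cheaply and do not replace the join.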

Re: [I] Push Dynamic Join Predicates into Scan ("Sideways Information Passing", etc) [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on issue #7955: URL: https://github.com/apache/datafusion/issues/7955#issuecomment-2984574454 Took an initial stab at this in https://github.com/apache/datafusion/pull/16445

Re: [PR] Fix CI Failure: replace false with NullEqualsNothing [datafusion]

2025-06-18 Thread via GitHub
2010YOUY01 merged PR #16437: URL: https://github.com/apache/datafusion/pull/16437

Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-18 Thread via GitHub
2010YOUY01 closed pull request #16268: Add compression option to SpillManager URL: https://github.com/apache/datafusion/pull/16268

Re: [PR] Add support of parsing struct field's options in BigQuery [datafusion-sqlparser-rs]

2025-06-18 Thread via GitHub
alamb commented on PR #1890: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1890#issuecomment-2983657757 🎉

Re: [PR] Update Roadmap documentation [datafusion]

2025-06-18 Thread via GitHub
xudong963 merged PR #16399: URL: https://github.com/apache/datafusion/pull/16399

Re: [PR] chore: Introduce `exprHandlers` map in QueryPlanSerde [datafusion-comet]

2025-06-18 Thread via GitHub
codecov-commenter commented on PR #1903: URL: https://github.com/apache/datafusion-comet/pull/1903#issuecomment-2984688644 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1903?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Implementation for regex_instr [datafusion]

2025-06-18 Thread via GitHub
nirnayroy commented on code in PR #15928: URL: https://github.com/apache/datafusion/pull/15928#discussion_r2154383754 ## datafusion/functions/src/regex/regexpinstr.rs: ## @@ -0,0 +1,804 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lic

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983856838 I had also played with building one with the `fastbloom` crate in the hash join operator, but lacked the ability to push it anywhere useful in the plan, which we now have.

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983871673 > I can point to the relevant code for interest, but we may want a different solution for core DF. Maybe this code would at least make it easy for us to have a performa

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983844232 So the high level for Spark is that there's a BloomFilterAgg aggregate function that returns a byte sequence representing the bloom filter. The BloomFilterMightContain scalar
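
The split described in this comment (one step that builds a filter and emits bytes, another that tests membership) can be sketched as follows. This is an illustration under stated assumptions, not Spark's, Comet's, or DataFusion's implementation: the `BloomFilter` type, its byte layout, and the seeded use of the standard library's `DefaultHasher` are all choices made for this example only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::convert::TryInto;
use std::hash::{Hash, Hasher};

/// Toy bloom filter (hypothetical type; layout and hashing are assumptions).
pub struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

/// Number of 64-bit words needed to hold `num_bits` bits.
fn words_for(num_bits: u64) -> u64 {
    num_bits / 64 + u64::from(num_bits % 64 != 0)
}

impl BloomFilter {
    pub fn new(num_bits: u64, num_hashes: u32) -> BloomFilter {
        assert!(num_bits > 0 && num_hashes > 0);
        BloomFilter { bits: vec![0; words_for(num_bits) as usize], num_bits, num_hashes }
    }

    /// Bit position for `item` under the `seed`-th hash function.
    fn bit_index<T: Hash>(&self, item: &T, seed: u32) -> u64 {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        item.hash(&mut hasher);
        hasher.finish() % self.num_bits
    }

    /// "Aggregate" side: fold one value into the filter.
    pub fn insert<T: Hash>(&mut self, item: &T) {
        for seed in 0..self.num_hashes {
            let idx = self.bit_index(item, seed);
            self.bits[(idx / 64) as usize] |= 1u64 << (idx % 64);
        }
    }

    /// "Scalar" side: false means definitely absent, true means possibly present.
    pub fn might_contain<T: Hash>(&self, item: &T) -> bool {
        (0..self.num_hashes).all(|seed| {
            let idx = self.bit_index(item, seed);
            self.bits[(idx / 64) as usize] & (1u64 << (idx % 64)) != 0
        })
    }

    /// Serialize as little-endian: num_bits (8 bytes), num_hashes (4 bytes), words.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(12 + self.bits.len() * 8);
        out.extend_from_slice(&self.num_bits.to_le_bytes());
        out.extend_from_slice(&self.num_hashes.to_le_bytes());
        for word in &self.bits {
            out.extend_from_slice(&word.to_le_bytes());
        }
        out
    }

    /// Inverse of `to_bytes`; returns `None` on malformed input.
    pub fn from_bytes(bytes: &[u8]) -> Option<BloomFilter> {
        if bytes.len() < 12 || (bytes.len() - 12) % 8 != 0 {
            return None;
        }
        let num_bits = u64::from_le_bytes(bytes[0..8].try_into().ok()?);
        let num_hashes = u32::from_le_bytes(bytes[8..12].try_into().ok()?);
        let bits: Vec<u64> = bytes[12..]
            .chunks_exact(8)
            .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
            .collect();
        if num_bits == 0 || num_hashes == 0 || bits.len() as u64 != words_for(num_bits) {
            return None;
        }
        Some(BloomFilter { bits, num_bits, num_hashes })
    }
}
```

Because `DefaultHasher`'s algorithm is not guaranteed to stay the same across Rust releases, bytes produced this way are only meaningful to a reader built with the same hashing; a real cross-process format would pin the hash function explicitly.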

Re: [I] Add BloomFilter PhysicalExpr [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on issue #16435: URL: https://github.com/apache/datafusion/issues/16435#issuecomment-2983912966 I believe it should also be possible to share the `Arc` within the created `PhysicalExpr`. This avoids building a bloom filter. We already have the full table in memory

Re: [PR] Introduce Async User Defined Functions [datafusion]

2025-06-18 Thread via GitHub
samuelcolvin commented on PR #14837: URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2983627011 This would be extremely useful for us. @alamb please would this get merged 🙏.

Re: [PR] Fix SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub
SKY-ALIN commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2983795662 It fixes formatting also, now it looks like this: ```shell 25/06/18 12:53:47 WARN CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet

[PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-18 Thread via GitHub
UBarney opened a new pull request, #16443: URL: https://github.com/apache/datafusion/pull/16443 ## Which issue does this PR close? part of #16364 ## Rationale for this change see issue ## What changes are included in this PR? 1. Limit intermediate_batch Siz

Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2154673214 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -2168,7 +2168,8 @@ object QueryPlanSerde extends Logging with CometExprShim {

Re: [PR] feat: support array_max [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1892: URL: https://github.com/apache/datafusion-comet/pull/1892#discussion_r2154704183 ## spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala: ## @@ -232,6 +232,21 @@ class CometArrayExpressionSuite extends CometTestBase with

Re: [PR] feat: support array_max [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1892: URL: https://github.com/apache/datafusion-comet/pull/1892#discussion_r2154723041 ## spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala: ## @@ -232,6 +232,21 @@ class CometArrayExpressionSuite extends CometTestBase with

[I] Review experimental status of array functions [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove opened a new issue, #1902: URL: https://github.com/apache/datafusion-comet/issues/1902 ### What is the problem the feature request solves? Most of the array functions are marked as `IncompatExpr` and are disabled by default. We should review whether this is necessary and imp

[PR] chore(deps): bump prost-build from 0.13.5 to 0.14.1 in the proto group [datafusion]

2025-06-18 Thread via GitHub
dependabot[bot] opened a new pull request, #16439: URL: https://github.com/apache/datafusion/pull/16439 Bumps the proto group with 1 update: [prost-build](https://github.com/tokio-rs/prost). Updates `prost-build` from 0.13.5 to 0.14.1 Changelog Sourced from https://github.co

Re: [PR] Add an example of embedding indexes inside a parquet file [datafusion]

2025-06-18 Thread via GitHub
zhuqi-lucas commented on code in PR #16395: URL: https://github.com/apache/datafusion/pull/16395#discussion_r2154205143 ## datafusion-examples/examples/embedding_parquet_indexes.rs: ## @@ -0,0 +1,363 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

[I] Support GROUP BY and DISTINCT with FixedSizeList values [datafusion]

2025-06-18 Thread via GitHub
findepi opened a new issue, #16442: URL: https://github.com/apache/datafusion/issues/16442 ### Is your feature request related to a problem or challenge? This should work ```diff $ git diff datafusion/sqllogictest/test_files/array.slt diff --git datafusion/sqllogictest

Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub
mbutrovich commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2984157981 Could we add a test case with timestamps as the join key?

[PR] feat: python based catalog and schema provider [datafusion-python]

2025-06-18 Thread via GitHub
timsaucer opened a new pull request, #1156: URL: https://github.com/apache/datafusion-python/pull/1156 # Which issue does this PR close? Closes #1091 # Rationale for this change This PR builds on top of https://github.com/apache/datafusion-python/pull/1137 and adds pyt

Re: [PR] feat: python based catalog and schema provider [datafusion-python]

2025-06-18 Thread via GitHub
timsaucer commented on PR #1156: URL: https://github.com/apache/datafusion-python/pull/1156#issuecomment-2984224547 FYI @renato2099 getting the python based providers ended up being a blocking issue for some of my work so I took a stab at implementing it. Please tell me what you think if y

Re: [PR] chore: Introduce `exprHandlers` map in QueryPlanSerde [datafusion-comet]

2025-06-18 Thread via GitHub
comphead commented on code in PR #1903: URL: https://github.com/apache/datafusion-comet/pull/1903#discussion_r2155251353 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -61,6 +61,39 @@ import org.apache.comet.shims.CometExprShim * An utility object f

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-18 Thread via GitHub
epgif commented on code in PR #16401: URL: https://github.com/apache/datafusion/pull/16401#discussion_r2155260272 ## datafusion/catalog/src/schema.rs: ## @@ -54,6 +55,14 @@ pub trait SchemaProvider: Debug + Sync + Send { name: &str, ) -> Result>, DataFusionError>;

Re: [PR] feat: Support Equijoin Expressions in `NestedLoopJoin` [datafusion]

2025-06-18 Thread via GitHub
jonathanc-n commented on PR #16450: URL: https://github.com/apache/datafusion/pull/16450#issuecomment-2985477515 I will try to run a benchmark on a table with smaller rows and return the result when finished.

[PR] feat: Support Equijoin Expressions in `NestedLoopJoin` [datafusion]

2025-06-18 Thread via GitHub
jonathanc-n opened a new pull request, #16450: URL: https://github.com/apache/datafusion/pull/16450 ## Which issue does this PR close? - Closes #. ## Rationale for this change We want to support equijoins in `NestedLoopJoin` in the case where one of the tables in the

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-18 Thread via GitHub
codecov-commenter commented on PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911#issuecomment-2986043661 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1911?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra commented on code in PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911#discussion_r2155736023 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1946,6 +1946,52 @@ class ParquetReadV1Suite extends ParquetReadSuite with

Re: [PR] test: Trigger Spark 3.4.3 SQL tests for iceberg-compat [datafusion-comet]

2025-06-18 Thread via GitHub
codecov-commenter commented on PR #1912: URL: https://github.com/apache/datafusion-comet/pull/1912#issuecomment-2986069982 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1912?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] chore: Introduce `exprHandlers` map in QueryPlanSerde [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1903: URL: https://github.com/apache/datafusion-comet/pull/1903#discussion_r2155747372 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -61,6 +61,39 @@ import org.apache.comet.shims.CometExprShim * An utility object

Re: [PR] Set the default value of `datafusion.execution.collect_statistics` to `true` [datafusion]

2025-06-18 Thread via GitHub
AdamGS commented on PR #16447: URL: https://github.com/apache/datafusion/pull/16447#issuecomment-2986090162 Our benchmarks show this change fixes the performance regression we saw - https://github.com/vortex-data/vortex/pull/3567

Re: [PR] chore: Enable `native_iceberg_compat` Spark SQL tests (for real, this time) [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#discussion_r2155751838 ## dev/diffs/3.5.6.diff: ## @@ -1938,7 +1938,17 @@ index 8e88049f51e..d3c0737d52e 100644 import testImplicits._ // keep() should take effect on S

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-18 Thread via GitHub
kazuyukitanimura commented on code in PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911#discussion_r2155755872 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1946,6 +1946,52 @@ class ParquetReadV1Suite extends ParquetReadSuite w

Re: [PR] chore: Enable `native_iceberg_compat` Spark SQL tests (for real, this time) [datafusion-comet]

2025-06-18 Thread via GitHub
kazuyukitanimura commented on code in PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#discussion_r2155756956 ## dev/diffs/3.5.6.diff: ## @@ -1938,7 +1938,17 @@ index 8e88049f51e..d3c0737d52e 100644 import testImplicits._ // keep() should take effe

Re: [PR] fix: set RangePartitioning for native shuffle default to false [datafusion-comet]

2025-06-18 Thread via GitHub
mbutrovich commented on PR #1907: URL: https://github.com/apache/datafusion-comet/pull/1907#issuecomment-2985229525 I think I'll need to generate new golden plans for this.

[PR] Current working branch [datafusion-python]

2025-06-18 Thread via GitHub
timsaucer opened a new pull request, #1157: URL: https://github.com/apache/datafusion-python/pull/1157 # **DO NOT MERGE** This PR exists to make it easy to see the differences between `main` and our current working branch, `rerun`.

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16433: URL: https://github.com/apache/datafusion/pull/16433#issuecomment-2985955904 It seems in some cases it's faster: ``` ┏━━┳━┳━━┳━━━┓ ┃ Query┃ topk-dynamic-filter ┃ topk-filters ┃

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on PR #16433: URL: https://github.com/apache/datafusion/pull/16433#issuecomment-2985967718 Seems like a bug in my implementation, right? I'd be surprised if the update checks I added are that heavy compared to other work...

[I] Was GCS support removed? [datafusion-ballista]

2025-06-18 Thread via GitHub
dfinninger opened a new issue, #1274: URL: https://github.com/apache/datafusion-ballista/issues/1274 Hi, we're trying to make Ballista read parquet files in Google Cloud Storage. It looks like support for GCS was added in 2023: https://github.com/apache/datafusion-ballista/pull/805. However

Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2985979040 > Thanks for the contribution, @SKY-ALIN! Could we add a test case with timestamps as the join key? The test should have the left side and the right side timestamps b

Re: [I] Was GCS support removed? [datafusion-ballista]

2025-06-18 Thread via GitHub
milenkovicm commented on issue #1274: URL: https://github.com/apache/datafusion-ballista/issues/1274#issuecomment-2985986246 In short, users should extend Ballista to support the object store they need. S3 is a bit of a special case. You can find more details on how to do that in the examples.

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on PR #16371: URL: https://github.com/apache/datafusion/pull/16371#issuecomment-2985997261 I'll try to review tomorrow. I took a look the other day, and my thought was that while it's complex code that is a bit hard for me to fully wrap my head around, it's well teste

Re: [PR] feat: support array_max [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra commented on code in PR #1892: URL: https://github.com/apache/datafusion-comet/pull/1892#discussion_r2155694477 ## spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala: ## @@ -232,6 +232,21 @@ class CometArrayExpressionSuite extends CometTestBase wi

[PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra opened a new pull request, #1911: URL: https://github.com/apache/datafusion-comet/pull/1911 ## Which issue does this PR close? Adds a new unit test. Also adds a method to generate a complex type parquet file that can be used to test various complex type cases.

Re: [PR] Implementation for regex_instr [datafusion]

2025-06-18 Thread via GitHub
blaginin commented on code in PR #15928: URL: https://github.com/apache/datafusion/pull/15928#discussion_r2155673833 ## datafusion/functions/src/regex/regexpcount.rs: ## @@ -29,10 +30,10 @@ use datafusion_expr::{ use datafusion_macros::user_doc; use itertools::izip; use regex

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2985988638 > I think it makes sense to only filter on the shared hashmap and not bothering with the min/max values - creating hashes and doing a single table lookup is quite fast, so I think w

Re: [PR] Implementation for regex_instr [datafusion]

2025-06-18 Thread via GitHub
blaginin commented on code in PR #15928: URL: https://github.com/apache/datafusion/pull/15928#discussion_r2155683469 ## datafusion/functions/src/regex/regexpinstr.rs: ## @@ -0,0 +1,804 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lice

Re: [PR] chore: Fix typo in workflow [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#issuecomment-2986017103 One test failure, as expected: ``` 2025-06-18T22:31:07.6082754Z [info] - SPARK-17091: Convert IN predicate to Parquet filter push-down *** FAILED *** (297 millisecond

Re: [PR] Support types other than String and Int for partition columns [datafusion-python]

2025-06-18 Thread via GitHub
miclegr commented on code in PR #1154: URL: https://github.com/apache/datafusion-python/pull/1154#discussion_r2155423158 ## python/datafusion/context.py: ## @@ -535,7 +535,7 @@ def register_listing_table( self, name: str, path: str | pathlib.Path, -
