Re: [PR] feat: reduce duplicate fields on join [datafusion-python]

2025-07-08 Thread via GitHub
kosiew commented on code in PR #1184: URL: https://github.com/apache/datafusion-python/pull/1184#discussion_r2194093566 ## python/tests/test_dataframe.py: ## @@ -400,7 +400,6 @@ def test_unnest_without_nulls(nested_df): assert result.column(1) == pa.array([7, 8, 8, 9, 9, 9

Re: [PR] Add support for automatic join column deduplication in DataFrame joins [datafusion-python]

2025-07-08 Thread via GitHub
kosiew commented on PR #1185: URL: https://github.com/apache/datafusion-python/pull/1185#issuecomment-3051209201 Closed because of #1184 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spec

Re: [PR] Add support for automatic join column deduplication in DataFrame joins [datafusion-python]

2025-07-08 Thread via GitHub
kosiew closed pull request #1185: Add support for automatic join column deduplication in DataFrame joins URL: https://github.com/apache/datafusion-python/pull/1185

Re: [I] Simplify Joins on Shared Column Name [datafusion-python]

2025-07-08 Thread via GitHub
kosiew commented on issue #1173: URL: https://github.com/apache/datafusion-python/issues/1173#issuecomment-3051181319 @timsaucer Sorry for crossing lanes. Feel free to close my PR.

Re: [I] [EPIC] Tracking issue of support substrait logical plan [datafusion]

2025-07-08 Thread via GitHub
ViggoC commented on issue #8149: URL: https://github.com/apache/datafusion/issues/8149#issuecomment-3051149803 @waynexia Why do you think that we don't need to support OuterReferenceColumn? IMHO, it is the key path to implementing subqueries.

Re: [PR] fix: Add LogicalTypeAnnotation in ParquetColumnSpec [datafusion-comet]

2025-07-08 Thread via GitHub
hsiang-c commented on PR #2000: URL: https://github.com/apache/datafusion-comet/pull/2000#issuecomment-3051010409 @kazuyukitanimura yes. We can consider merging https://github.com/apache/datafusion-comet/pull/1987 to speed up CI, but need to fix some tests due to configuration.

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
adriangb commented on PR #16718: URL: https://github.com/apache/datafusion/pull/16718#issuecomment-3051014271 Thanks for merging @xudong963

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
xudong963 merged PR #16718: URL: https://github.com/apache/datafusion/pull/16718

Re: [PR] feat: Avoid duplicate `PhyscialExpr` evaluation on hash table [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n closed pull request #16719: feat: Avoid duplicate `PhyscialExpr` evaluation on hash table URL: https://github.com/apache/datafusion/pull/16719

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-08 Thread via GitHub
Kontinuation commented on PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3050889742 I took a look at the `datafusion-comet-objectstore-hdfs` module in Comet and found that it largely overlaps with the Hadoop FileSystem bridge we are building here. A better ap

Re: [PR] POC: Eliminate unnecessary group by keys (q35 in clickbench 1.35x faster) [datafusion]

2025-07-08 Thread via GitHub
github-actions[bot] closed pull request #13617: POC: Eliminate unnecessary group by keys (q35 in clickbench 1.35x faster) URL: https://github.com/apache/datafusion/pull/13617

Re: [PR] Add xxhash algorithms in SQL and expression api [datafusion]

2025-07-08 Thread via GitHub
github-actions[bot] commented on PR #14367: URL: https://github.com/apache/datafusion/pull/14367#issuecomment-3050880910 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] [wip] update list & struct coercion to support incrementality [datafusion]

2025-07-08 Thread via GitHub
github-actions[bot] commented on PR #15259: URL: https://github.com/apache/datafusion/pull/15259#issuecomment-3050880818 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] Attach diagnostic for wrong arg number error [datafusion]

2025-07-08 Thread via GitHub
github-actions[bot] closed pull request #15451: Attach diagnostic for wrong arg number error URL: https://github.com/apache/datafusion/pull/15451

Re: [PR] fix: Add LogicalTypeAnnotation in ParquetColumnSpec [datafusion-comet]

2025-07-08 Thread via GitHub
kazuyukitanimura commented on PR #2000: URL: https://github.com/apache/datafusion-comet/pull/2000#issuecomment-3050875597 cc @hsiang-c Does it make sense to include `[iceberg]` in the title to run the Iceberg CI?

[PR] feat: Avoid duplicate `PhyscialExpr` evaluation on hash table [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n opened a new pull request, #16719: URL: https://github.com/apache/datafusion/pull/16719 ## Which issue does this PR close? - Closes #. ## Rationale for this change While looking into hash join spilling I noticed that the physical expression is being e

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub
alamb commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3050686286 > I made some changes based latest comments from folks. > > FYI @alamb , please correct me if i made some wrong changes, thanks a lot! Thank you -- it is looking great. I sp

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub
alamb commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193678166 ## content/blog/2025-07-14-user-defined-parquet-indexes.md: ## @@ -0,0 +1,545 @@ +--- +layout: post +title: Embedding User-Defined Indexes in Apache Parquet Files +da

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub
alamb commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193677635 ## content/blog/2025-07-14-user-defined-parquet-indexes.md: ## @@ -0,0 +1,545 @@ +--- +layout: post +title: Embedding User-Defined Indexes in Apache Parquet Files +da

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
adriangb commented on PR #16718: URL: https://github.com/apache/datafusion/pull/16718#issuecomment-3050655613 Will merge once CI passes. Incidentally I was just working on HashJoinExec pushdown and think some of these APIs will need another tweak. Headed in the right direction but not quite

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n commented on PR #16718: URL: https://github.com/apache/datafusion/pull/16718#issuecomment-3050654080 @adriangb I refactored to use a function for it instead, wdyt?

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n commented on PR #16718: URL: https://github.com/apache/datafusion/pull/16718#issuecomment-3050635513 Actually, instead of adding back the code, I think I'll bring this function back into `PredicateSupport` since it now makes sense to.

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n commented on PR #16718: URL: https://github.com/apache/datafusion/pull/16718#issuecomment-3050634024 @adriangb Just need a quick merge here 😆

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-08 Thread via GitHub
parthchandra commented on PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3050630944 @comphead @Kontinuation you might be interested in looking at this PR.

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-08 Thread via GitHub
parthchandra commented on code in PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#discussion_r2193617615 ## native/core/src/parquet/objectstore/jni_hdfs.rs: ## @@ -0,0 +1,332 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contri

[PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n opened a new pull request, #16718: URL: https://github.com/apache/datafusion/pull/16718 ## Which issue does this PR close? - Closes #. ## Rationale for this change Fix ci ## What changes are included in this PR? #16686 seems to have been merg

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #86: URL: https://github.com/apache/datafusion-site/pull/86#issuecomment-3050539883 OK, updated the PR to include the workaround. Let's pin to the working commit for `apache/infrastructure-actions`, which is `8aee7a080268198548d8d1b4f1315a4fb94bffea`. Added

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #86: URL: https://github.com/apache/datafusion-site/pull/86#issuecomment-3050505258 Opened https://github.com/apache/infrastructure-actions/issues/218

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #86: URL: https://github.com/apache/datafusion-site/pull/86#issuecomment-3050501650 > seems to be something wrong with the infrastructure-actions repo yea super weird. this is the offending commit https://github.com/apache/infrastructure-actions/commit/0b75b

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-08 Thread via GitHub
XiangpengHao commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3050458408 I believe the bugs are fixed in 2cf1a8f82f722e1c7e4857d7b07ba726f67d9f2f Can you point to that commit and try the benchmark again? I believe some tests will still fa

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub
alamb commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193539118 ## content/blog/2025-07-14-user-defined-parquet-indexes.md: ## @@ -0,0 +1,578 @@ +--- +layout: post +title: Embedding User-Defined Indexes in Apache Parquet Files +da

Re: [PR] chore(devcontainer): use debian's `protobuf-compiler` package [datafusion]

2025-07-08 Thread via GitHub
fvj commented on PR #16687: URL: https://github.com/apache/datafusion/pull/16687#issuecomment-3050427897 You're very welcome. I hope I can contribute more valuable stuff in the future :)

[I] Some group by query is 6~7x slower than DuckDB [datafusion-python]

2025-07-08 Thread via GitHub
wegamekinglc opened a new issue, #1186: URL: https://github.com/apache/datafusion-python/issues/1186 ### Describe the bug Hi team, I have encountered a performance issue when I run the same query on a big table with DataFusion compared with DuckDB. I will try to simplify my case a

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3050425801 Moving to `datafusion-python` repo as I think we have identified the underlying issue and it is tracked in https://github.com/apache/datafusion/issues/16717 We can close thi

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3050422281 I filed https://github.com/apache/datafusion/issues/16717 to track

[I] Better parallelize large input batches (speed up dataframe access) [datafusion]

2025-07-08 Thread via GitHub
alamb opened a new issue, #16717: URL: https://github.com/apache/datafusion/issues/16717 ### Is your feature request related to a problem or challenge? Most of DataFusion is carefully designed to operate on ~ target-batch size partitions (e.g. 8k rows) at a time. However some dat

Re: [PR] chore(devcontainer): use debian's `protobuf-compiler` package [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16687: URL: https://github.com/apache/datafusion/pull/16687#issuecomment-3050383546 Thanks again @fvj

Re: [PR] chore(devcontainer): use debian's `protobuf-compiler` package [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16687: URL: https://github.com/apache/datafusion/pull/16687

Re: [PR] chore(devcontainer): use debian's `protobuf-compiler` package [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16687: URL: https://github.com/apache/datafusion/pull/16687#issuecomment-3050383391 🚀

Re: [PR] Fix: Make `CopyTo` logical plan output schema consistent with physical schema [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16705: URL: https://github.com/apache/datafusion/pull/16705#issuecomment-3050382520 Thanks again @bert-beyondloops

Re: [I] Output schema of the CopyTo logical plan is not correct. [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #16704: Output schema of the CopyTo logical plan is not correct. URL: https://github.com/apache/datafusion/issues/16704

Re: [PR] Fix: Make `CopyTo` logical plan output schema consistent with physical schema [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16705: URL: https://github.com/apache/datafusion/pull/16705

Re: [PR] Add the missing equivalence info for filter pushdown [datafusion]

2025-07-08 Thread via GitHub
alamb commented on code in PR #16686: URL: https://github.com/apache/datafusion/pull/16686#discussion_r2193493034 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -289,7 +289,7 @@ fn test_no_pushdown_through_aggregates() { Ok: - Filte

Re: [PR] Fix sqllogictests test running compatibility (ignore `--test-threads`) [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16694: URL: https://github.com/apache/datafusion/pull/16694

Re: [I] Bug: the new filter pushdown optimizer rule in physical layer will miss the equivalence info in filter [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #16563: Bug: the new filter pushdown optimizer rule in physical layer will miss the equivalence info in filter URL: https://github.com/apache/datafusion/issues/16563

Re: [PR] Add the missing equivalence info for filter pushdown [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16686: URL: https://github.com/apache/datafusion/pull/16686

Re: [PR] Fix sqllogictests test running compatibility (ignore `--test-threads`) [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16694: URL: https://github.com/apache/datafusion/pull/16694#issuecomment-3050382157 Thanks again @mjgarton

Re: [I] Running tests with `--test-threads` option fails. [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #16693: Running tests with `--test-threads` option fails. URL: https://github.com/apache/datafusion/issues/16693

Re: [PR] Add support for Snowflake identifier function [datafusion-sqlparser-rs]

2025-07-08 Thread via GitHub
yoavcloud commented on code in PR #1929: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1929#discussion_r2193440626 ## tests/sqlparser_snowflake.rs: ## @@ -4232,3 +4232,122 @@ fn test_snowflake_create_view_with_composite_policy_name() { r#"CREATE VIEW X (

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3050296207 I double checked the API used in the Python script and it looks like, depending on the exact type of the Python object, it always gets the data in memory as a MemTabl

[PR] feat: add CopyExec and move CopyExec handling to Spark [datafusion-comet]

2025-07-08 Thread via GitHub
dharanad opened a new pull request, #2001: URL: https://github.com/apache/datafusion-comet/pull/2001 ## Which issue does this PR close? Closes #1995 ## Rationale for this change ## What changes are included in this PR? ## How are these chang

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3050260270 I would expect the largest difference to be in sorting benchmarks (`sort_tpch` etc.)

[PR] Sf create table as [datafusion-sqlparser-rs]

2025-07-08 Thread via GitHub
yoavcloud opened a new pull request, #1931: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1931 The code that parses `CREATE TABLE` in the Snowflake dialect assumed that if the `AS`, `LIKE` or `CLONE` options are used, then no other options can be specified. That is not true, s

Re: [D] DISCUSSION: DataFusion Meetup in Boston, USA [datafusion]

2025-07-08 Thread via GitHub
GitHub user NGA-TRAN added a comment to the discussion: DISCUSSION: DataFusion Meetup in Boston, USA I have updated the description. Let us go with Wednesday, November 12th. GitHub link: https://github.com/apache/datafusion/discussions/16703#discussioncomment-13700744

Re: [PR] DuckDB, Postgres, SQLite: NOT NULL and NOTNULL expressions [datafusion-sqlparser-rs]

2025-07-08 Thread via GitHub
ryanschneider commented on code in PR #1927: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1927#discussion_r2193394552 ## tests/sqlparser_common.rs: ## @@ -15974,3 +15974,133 @@ fn parse_create_procedure_with_parameter_modes() { _ => unreachable!(),

Re: [PR] DuckDB, Postgres, SQLite: NOT NULL and NOTNULL expressions [datafusion-sqlparser-rs]

2025-07-08 Thread via GitHub
ryanschneider commented on code in PR #1927: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1927#discussion_r2193393495 ## src/dialect/mod.rs: ## @@ -650,8 +651,19 @@ pub trait Dialect: Debug + Any { Token::Word(w) if w.keyword == Keyword::MATCH =>

Re: [I] Enable comments on datafusion-site via giscus [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on issue #80: URL: https://github.com/apache/datafusion-site/issues/80#issuecomment-3050221253 I created an Apache Infra ticket to enable the giscus GitHub app for this repo: https://issues.apache.org/jira/browse/INFRA-26989

Re: [D] DISCUSSION: DataFusion Meetup in Boston, USA [datafusion]

2025-07-08 Thread via GitHub
GitHub user NGA-TRAN edited a discussion: DISCUSSION: DataFusion Meetup in Boston, USA With the upcoming New York meetup on the horizon, the DataDog Boston team is excited to plan a local DataFusion-themed gathering this fall! **Date:** Wednesday, November 12 📍 Location: DataDog, 225 Frankl

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3050196330 > Isn't the easiest solution to update `DataSourceExec` to adhere to target batch size? Maybe -- I will file a ticket to explain the problem with a DataFusion only repro

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16632: URL: https://github.com/apache/datafusion/pull/16632#issuecomment-3050201621 > branch has been rebased. Should I squash commit or is this done during merge ? commits are squashed on merge so no need to do it on the branch Pushing commits rather tha

Re: [PR] fix: create file for empty stream [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16342: URL: https://github.com/apache/datafusion/pull/16342#issuecomment-3050189370 - Reverted in https://github.com/apache/datafusion/pull/16682

Re: [I] How to write csv file to disk from a empty dataframe? [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16240: URL: https://github.com/apache/datafusion/issues/16240#issuecomment-3050190451 Unfortunately, we had to revert the original fix, see - https://github.com/apache/datafusion/pull/16682 So I am reopening the ticket to track

Re: [PR] Revert "fix: create file for empty stream" [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16682: URL: https://github.com/apache/datafusion/pull/16682#issuecomment-3050188171 > @alamb @brunal sorry for inconvenience, revert it is ok. I'm trying to find a new way to implement this feature. Thank you @brunal and @chenkovsky

Re: [PR] Revert "fix: create file for empty stream" [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16682: URL: https://github.com/apache/datafusion/pull/16682

[I] How to write csv file to disk from a empty dataframe? [datafusion]

2025-07-08 Thread via GitHub
mmooyyii opened a new issue, #16240: URL: https://github.com/apache/datafusion/issues/16240 I want to write a CSV file that includes only headers: ``` use datafusion::config::CsvOptions; use datafusion::dataframe::DataFrameWriteOptions; use datafusion::error::Result; use datafusion:

Re: [PR] Improve display format of BoundedWindowAggExec [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16645: URL: https://github.com/apache/datafusion/pull/16645#issuecomment-3050178802 > > I suspect updating the results was faster than it might have previously been thanks to the work @blaginin @Chen-Yuan-Lai have done to migrate most of our plan tests to `insta` >

Re: [PR] github: turn on discussion [datafusion-site]

2025-07-08 Thread via GitHub
alamb merged PR #85: URL: https://github.com/apache/datafusion-site/pull/85

Re: [PR] fix: try to lower plain reserved functions to columns as well [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16669: URL: https://github.com/apache/datafusion/pull/16669#issuecomment-3050174594 Thank you for the review @jonahgao ❤️

Re: [I] Error when use `user` field in where clause [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #14141: Error when use `user` field in where clause URL: https://github.com/apache/datafusion/issues/14141

Re: [PR] fix: try to lower plain reserved functions to columns as well [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16669: URL: https://github.com/apache/datafusion/pull/16669

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
ozankabak commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3050173385 Isn't the easiest solution to update `DataSourceExec` to adhere to target batch size?

Re: [PR] chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-08 Thread via GitHub
mbutrovich commented on code in PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#discussion_r2193350391 ## native/spark-expr/src/bloom_filter/bloom_filter_might_contain.rs: ## @@ -0,0 +1,209 @@ +// Licensed to the Apache Software Foundation (ASF) under one +//

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-08 Thread via GitHub
bert-beyondloops commented on PR #16632: URL: https://github.com/apache/datafusion/pull/16632#issuecomment-3050159701 branch has been rebased

Re: [I] Optimize the join operators [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3050155137 Ohh, I missed the part in the documentation which says the h2o_small_join can be downloaded; I thought the h2o_small had it all 😆

Re: [PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16434: URL: https://github.com/apache/datafusion/pull/16434#issuecomment-3050097130 TLDR this branch looks good from my performance perspective. Thank you @jonathanc-n and @Dandandan

Re: [PR] chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-08 Thread via GitHub
mbutrovich commented on code in PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#discussion_r2193310051 ## native/spark-expr/src/bloom_filter/bloom_filter_might_contain.rs: ## @@ -0,0 +1,196 @@ +// Licensed to the Apache Software Foundation (ASF) under one +//

Re: [I] Support u32 indices in HashJoinExec [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #16179: Support u32 indices in HashJoinExec URL: https://github.com/apache/datafusion/issues/16179

Re: [PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16434: URL: https://github.com/apache/datafusion/pull/16434

Re: [PR] Add support for automatic join column deduplication in DataFrame joins [datafusion-python]

2025-07-08 Thread via GitHub
timsaucer commented on PR #1185: URL: https://github.com/apache/datafusion-python/pull/1185#issuecomment-3050083502 Would you mind taking a look at https://github.com/apache/datafusion-python/pull/1184 ? It's an alternate approach which basically reuses the logic of `drop_columns` on the r

Re: [PR] cast_operands_to_double_type_to_fix_arithmetic_overflow [datafusion-comet]

2025-07-08 Thread via GitHub
codecov-commenter commented on PR #1996: URL: https://github.com/apache/datafusion-comet/pull/1996#issuecomment-3050078977 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1996?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] cast_operands_to_double_type_to_fix_arithmetic_overflow [datafusion-comet]

2025-07-08 Thread via GitHub
andygrove commented on code in PR #1996: URL: https://github.com/apache/datafusion-comet/pull/1996#discussion_r2193256587 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -677,7 +677,13 @@ object QueryPlanSerde extends Logging with CometExprShim {

Re: [PR] Fix for Postgres regex and like binary operators [datafusion-sqlparser-rs]

2025-07-08 Thread via GitHub
solontsev commented on code in PR #1928: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1928#discussion_r2193231211 ## tests/sqlparser_postgres.rs: ## @@ -2207,19 +2223,31 @@ fn parse_pg_like_match_ops() { ]; for (str_op, op) in pg_like_match_ops { -

[PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-08 Thread via GitHub
nuno-faria opened a new pull request, #16716: URL: https://github.com/apache/datafusion/pull/16716 ## Which issue does this PR close? N/A. ## Rationale for this change When executing hash joins, the build side is first built from the left relation and the

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193162144 ## content/blog/2025-07-14-user-defined-parquet-indexes.md: ## @@ -0,0 +1,578 @@ +--- +layout: post +title: Embedding User-Defined Indexes in Apache Parquet File

Re: [PR] chore: Introduce ANSI support for remainder operation [datafusion-comet]

2025-07-08 Thread via GitHub
rishvin commented on PR #1971: URL: https://github.com/apache/datafusion-comet/pull/1971#issuecomment-3049894948 @andygrove I have made the changes, could you please recheck.

Re: [PR] fix: Add LogicalTypeAnnotation in ParquetColumnSpec [datafusion-comet]

2025-07-08 Thread via GitHub
codecov-commenter commented on PR #2000: URL: https://github.com/apache/datafusion-comet/pull/2000#issuecomment-3049875434 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2000?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3049874530 I can render it locally. also #86 should make local dev easier

Re: [PR] DuckDB, Postgres, SQLite: NOT NULL and NOTNULL expressions [datafusion-sqlparser-rs]

2025-07-08 Thread via GitHub
ryanschneider commented on code in PR #1927: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1927#discussion_r2193132588 ## src/dialect/mod.rs: ## @@ -1076,6 +1088,15 @@ pub trait Dialect: Debug + Any { fn supports_comma_separated_drop_column_list(&self) -> boo

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-08 Thread via GitHub
comphead commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3049815105 I'd appreciate it if anyone can tell me whether it's possible to read the blog draft compiled with formatting.

Re: [PR] refactor: standardize div_ceil [datafusion-comet]

2025-07-08 Thread via GitHub
codecov-commenter commented on PR #1999: URL: https://github.com/apache/datafusion-comet/pull/1999#issuecomment-3049811609 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1999?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

[PR] fix: Add LogicalTypeAnnotation in ParquetColumnSpec [datafusion-comet]

2025-07-08 Thread via GitHub
huaxingao opened a new pull request, #2000: URL: https://github.com/apache/datafusion-comet/pull/2000 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes

Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-07-08 Thread via GitHub
andygrove commented on PR #1987: URL: https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3049800502 I see that some tests are failing due to https://github.com/apache/datafusion-comet/issues/1982 ``` org.apache.spark.SparkException: Job aborted due to stage failure

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #86: URL: https://github.com/apache/datafusion-site/pull/86#issuecomment-3049794080 changing to `urllib3==2.2.3` works https://github.com/apache/infrastructure-actions/blob/1115490227e7aaf7ccee5b06bb3b5955e7cf8493/pelican/requirements.txt#L11

[PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu opened a new pull request, #86: URL: https://github.com/apache/datafusion-site/pull/86 (no comment)

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #86: URL: https://github.com/apache/datafusion-site/pull/86#issuecomment-3049790410 Not sure why the pelican Dockerfile build fails... ``` 6.961 ERROR: Could not find a version that satisfies the requirement urllib3==2.5.0 (from versions: 0.3, 1.0, 1.0.1, 1.

Re: [PR] Chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-08 Thread via GitHub
mbutrovich commented on PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#issuecomment-3049764358 Just pending CI, will likely merge later today.

Re: [PR] Chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-08 Thread via GitHub
mbutrovich commented on code in PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#discussion_r219304 ## native/spark-expr/src/bloom_filter/bloom_filter_might_contain.rs: ## @@ -0,0 +1,196 @@ +// Licensed to the Apache Software Foundation (ASF) under one +//

Re: [PR] Chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-08 Thread via GitHub
tglanz commented on code in PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#discussion_r2193074384 ## native/spark-expr/src/bloom_filter/bloom_filter_might_contain.rs: ## @@ -0,0 +1,196 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or
