Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
viirya commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3082735431 So the basic idea, as I understand it, is that instead of adapting the data batch using `SchemaAdapter` against the schema, the new approach involves rewriting or transforming th

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082712617 Updated parquet results from my local run using the 1e8 dataset; it's even faster: ```rust ./bench.sh run h2o_medium_join_parquet *** DataFusi

Re: [PR] improve rust workflows without cache [datafusion-ballista]

2025-07-16 Thread via GitHub
milenkovicm commented on PR #1275: URL: https://github.com/apache/datafusion-ballista/pull/1275#issuecomment-3082666517 I have no access to a computer at the moment. Can you please push a change? It should trigger a new job. Two questions: - What's the reason to remove the cancel job? It w

Re: [PR] benchmark: Add parquet h2o support [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on PR #16804: URL: https://github.com/apache/datafusion/pull/16804#issuecomment-3082646775 > @zhuqi-lucas Just in case falsa generates too much noise in stdout (and runner's output), I can add a command-line argument to suppress it. Something like `--silent`.

Re: [PR] benchmark: Add parquet h2o support [datafusion]

2025-07-16 Thread via GitHub
SemyonSinchenko commented on PR #16804: URL: https://github.com/apache/datafusion/pull/16804#issuecomment-3082635479 @zhuqi-lucas Just in case falsa generates too much noise in stdout (and runner's output), I can add a command-line argument to suppress it. Something like `--silent`.

Re: [PR] benchmark: Add parquet h2o support [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on PR #16804: URL: https://github.com/apache/datafusion/pull/16804#issuecomment-3082603389 Updated, it works now; falsa has merged the fix and released it: https://github.com/mrpowers-io/falsa/pull/28 ```rust ./bench.sh data h2o_small_join_parquet

[PR] DROP USER [datafusion-sqlparser-rs]

2025-07-16 Thread via GitHub
yoavcloud opened a new pull request, #1951: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1951 Extended the already existing generic support for DROP statements to USER. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to G

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082400648 > > > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: > > > > > > > > > [@MrPowers](https://github.com/

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
UBarney commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082362965 > > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: > > > > > > [@MrPowers](https://github.com/MrPowers) I

Re: [PR] improve rust workflows without cache [datafusion-ballista]

2025-07-16 Thread via GitHub
Huy1Ng commented on PR #1275: URL: https://github.com/apache/datafusion-ballista/pull/1275#issuecomment-3082356061 @milenkovicm can you rerun the CI? I pushed 2 commits too close together so some jobs failed because they didn't get an available machine. The CI run on my fork succeeded: ht

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082347855 @mrpowers-wb I submitted the PR for the h2o benchmark to support the parquet format in datafusion, but it is blocked by falsa join dataset generation; details: https://github.com/a

Re: [PR] benchmark: Add parquet h2o support [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on PR #16804: URL: https://github.com/apache/datafusion/pull/16804#issuecomment-3082341745 Created an issue on the falsa side: generating parquet data for the join set fails, but it works well for group by. https://github.com/mrpowers-io/falsa/issues/27

Re: [PR] benchmark: Add parquet h2o support [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on PR #16804: URL: https://github.com/apache/datafusion/pull/16804#issuecomment-3082332726 Updated: generating parquet join data fails with an error, but it works for group by: ```rust ./bench.sh data h2o_medium_join_parquet *** DataFusion Benchma

[PR] benchmark: Add parquet h2o support [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas opened a new pull request, #16804: URL: https://github.com/apache/datafusion/pull/16804 ## Which issue does this PR close? Currently, we only support the CSV format for the h2o benchmark, but the results we compare against from other databases use parquet, so this ticket tries to

Re: [PR] Automatically split large single RecordBatches in `MemorySource` into smaller batches [datafusion]

2025-07-16 Thread via GitHub
kosiew merged PR #16734: URL: https://github.com/apache/datafusion/pull/16734

Re: [I] Better parallelize large input batches (speed up dataframe access) [datafusion]

2025-07-16 Thread via GitHub
kosiew closed issue #16717: Better parallelize large input batches (speed up dataframe access) URL: https://github.com/apache/datafusion/issues/16717

Re: [PR] Add example of custom file schema casting rules [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on code in PR #16803: URL: https://github.com/apache/datafusion/pull/16803#discussion_r2212054428 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -433,7 +433,7 @@ impl ListingTableConfig { /// `SchemaAdapterFactory` is set, in which case only th

Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3082294797 @parthchandra @mbutrovich please take a look at https://github.com/apache/datafusion/pull/16803. As per the comments in the example it looks like Comet already has a custom

[PR] Add example of custom file schema casting rules [datafusion]

2025-07-16 Thread via GitHub
adriangb opened a new pull request, #16803: URL: https://github.com/apache/datafusion/pull/16803 https://github.com/apache/datafusion/issues/16800#issuecomment-3080175396

Re: [PR] Allow comparison between boolean and int values [datafusion]

2025-07-16 Thread via GitHub
2010YOUY01 commented on PR #16798: URL: https://github.com/apache/datafusion/pull/16798#issuecomment-3082274998 what about using explicit casting in applications? For example: ```sh > select not(arrow_cast(1, 'Boolean')); +--+ | NOT arrow_ca

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082269096 > > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: > > [@MrPowers](https://github.com/MrPowers) I am using t

Re: [I] [EPIC] TPC-H performance improvements [datafusion-comet]

2025-07-16 Thread via GitHub
comphead commented on issue #2004: URL: https://github.com/apache/datafusion-comet/issues/2004#issuecomment-3082214873 DF has similar work for q1-q4 H2O benchmarks https://github.com/apache/datafusion/issues/16710

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-16 Thread via GitHub
comphead commented on code in PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#discussion_r2212009586 ## native/hdfs/src/object_store/hdfs.rs: ## @@ -88,19 +88,33 @@ impl HadoopFileSystem { fn read_range(range: &Range, file: &HdfsFile) -> Result {

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-07-16 Thread via GitHub
github-actions[bot] closed pull request #15423: Introduce selection vector repartitioning URL: https://github.com/apache/datafusion/pull/15423

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-16 Thread via GitHub
adriangb merged PR #16791: URL: https://github.com/apache/datafusion/pull/16791

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
UBarney commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3082082344 > [@UBarney](https://github.com/UBarney) - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: @MrPowers I am using the **1e8** dataset. ``` target

Re: [PR] fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel [datafusion-comet]

2025-07-16 Thread via GitHub
parthchandra commented on PR #1987: URL: https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3082076029 @hsiang-c created https://github.com/apache/datafusion-comet/issues/2033 to track this issue

[I] Comet cannot execute some iceberg metadata table queries [datafusion-comet]

2025-07-16 Thread via GitHub
parthchandra opened a new issue, #2033: URL: https://github.com/apache/datafusion-comet/issues/2033 ### Describe the bug Reproducing issue and steps to reproduce from this https://github.com/apache/datafusion-comet/pull/1987#issuecomment-3075575929 Many Iceberg Spark SQL Tes

Re: [PR] fix : cast_operands_to_double_type_to_fix_arithmetic_overflow [datafusion-comet]

2025-07-16 Thread via GitHub
parthchandra commented on code in PR #1996: URL: https://github.com/apache/datafusion-comet/pull/1996#discussion_r2211898302 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -677,7 +677,14 @@ object QueryPlanSerde extends Logging with CometExprShim {

Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
parthchandra commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3081929167 > Having looked at your implementation I think it may not be that bad! It seems like most of what your SchemaAdapter is doing is customizing casting rules, right? Al

Re: [PR] fix: clean up iceberg integration APIs [datafusion-comet]

2025-07-16 Thread via GitHub
parthchandra commented on code in PR #2032: URL: https://github.com/apache/datafusion-comet/pull/2032#discussion_r2211880244 ## common/src/main/java/org/apache/comet/parquet/BatchReader.java: ## @@ -183,9 +183,7 @@ public BatchReader( this.taskContext = TaskContext$.MODULE$

Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3081902444 Having looked at your implementation I think it may not be that bad! It seems like most of what your SchemaAdapter is doing is customizing casting rules, right?

Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
parthchandra commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3081890593 I feel it may be a fair amount of work in Comet to move from `SchemaAdapter` to `PhysicalExprAdapter` but from the pseudocode example it appears tractable. I think we'll be

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-16 Thread via GitHub
parthchandra commented on PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#issuecomment-3081817670 @Kontinuation @andygrove @comphead, updated based on review comments

Re: [PR] fix: Refactor arithmetic serde and fix correctness issues with EvalMode::TRY [datafusion-comet]

2025-07-16 Thread via GitHub
parthchandra commented on code in PR #2018: URL: https://github.com/apache/datafusion-comet/pull/2018#discussion_r2211835763 ## spark/src/main/scala/org/apache/comet/serde/arithmetic.scala: ## @@ -0,0 +1,293 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one +

Re: [PR] fix: skip predicates on struct unnest in PushDownFilter [datafusion]

2025-07-16 Thread via GitHub
akoshchiy commented on code in PR #16790: URL: https://github.com/apache/datafusion/pull/16790#discussion_r2210860376 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -128,12 +128,31 @@ physical_plan 06)--ProjectionExec: expr=[column1@0 as column1, colu

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on code in PR #16791: URL: https://github.com/apache/datafusion/pull/16791#discussion_r2211616561 ## datafusion/core/tests/parquet/schema_adapter.rs: ## @@ -0,0 +1,92 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lic

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-16 Thread via GitHub
alamb commented on code in PR #16791: URL: https://github.com/apache/datafusion/pull/16791#discussion_r2211267704 ## docs/source/library-user-guide/upgrading.md: ## @@ -120,6 +120,17 @@ SET datafusion.execution.spill_compression = 'zstd'; For more details about this configura

Re: [PR] fix: clean up iceberg integration APIs [datafusion-comet]

2025-07-16 Thread via GitHub
codecov-commenter commented on PR #2032: URL: https://github.com/apache/datafusion-comet/pull/2032#issuecomment-3080720676 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/2032?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

[PR] Snowflake: CREATE USER [datafusion-sqlparser-rs]

2025-07-16 Thread via GitHub
yoavcloud opened a new pull request, #1950: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1950 Added support for the `CREATE USER` statement in Snowflake. Enhanced the KeyValueOptions struct with: 1. A custom delimiter 2. Optional parentheses 3. Optional keywords that

Re: [PR] feat: randn expression support [datafusion-comet]

2025-07-16 Thread via GitHub
akupchinskiy commented on code in PR #2010: URL: https://github.com/apache/datafusion-comet/pull/2010#discussion_r2211570646 ## native/spark-expr/src/nondetermenistic_funcs/randn.rs: ## @@ -0,0 +1,265 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more

Re: [I] Expected: decimal(7,2), Found: DOUBLE when running TPC-DS benchmarks on Spark 3.5 [datafusion-comet]

2025-07-16 Thread via GitHub
parthchandra closed issue #2029: Expected: decimal(7,2), Found: DOUBLE when running TPC-DS benchmarks on Spark 3.5 URL: https://github.com/apache/datafusion-comet/issues/2029

Re: [I] Expected: decimal(7,2), Found: DOUBLE when running TPC-DS benchmarks on Spark 3.5 [datafusion-comet]

2025-07-16 Thread via GitHub
parthchandra commented on issue #2029: URL: https://github.com/apache/datafusion-comet/issues/2029#issuecomment-3080492782 > One surprising thing to note is that Gluten and Blaze were working fine with the data containing the flag That is, in fact, quite surprising. Double is not a good

Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3080175396 Here's some more examples: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/json_shredding.rs, https://github.com/apache/datafusion/blob/main/datafu

Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3080077103 Here's an example from Comet: https://github.com/vaibhawvipul/datafusion-comet/blob/main/native/core/src/parquet/schema_adapter.rs. As you can see it's _a lot_ of code with [a

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
MrPowers commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3080025409 @UBarney - here are the 1e7 join results on my M3 Macbook with 16GB of RAM: ![Image](https://github.com/user-attachments/assets/f96b5301-1986-4824-a715-3e2a53895ca8)

Re: [PR] Support multiple ordered array_agg aggregations [datafusion]

2025-07-16 Thread via GitHub
findepi commented on PR #16625: URL: https://github.com/apache/datafusion/pull/16625#issuecomment-3080003484 @ozankabak @alamb can you please help me understand where you would want to go with this? or maybe DF doesn't need to support ordered array_aggs (more than one in a query)?

Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
viirya commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3079995268 Reached here from @alamb's post on dev. Although I saw what is proposed, it is unclear to me how it will work or what steps there will be. Could you at least describe the

Re: [PR] Expand parse without semicolons [datafusion-sqlparser-rs]

2025-07-16 Thread via GitHub
aharpervc commented on PR #1949: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1949#issuecomment-3079970637 @alamb fyi, [as previously discussed](https://github.com/apache/datafusion-sqlparser-rs/pull/1937#issuecomment-3070806780)

Re: [I] Expected: decimal(7,2), Found: DOUBLE when running TPC-DS benchmarks on Spark 3.5 [datafusion-comet]

2025-07-16 Thread via GitHub
ShreyeshArangath commented on issue #2029: URL: https://github.com/apache/datafusion-comet/issues/2029#issuecomment-3079962752 Yeah, I was able to fix this issue by fixing the data-generation. We are using https://github.com/maropu/spark-tpcds-datagen for our datagen, removing the `--use-d

[PR] Expand parse without semicolons [datafusion-sqlparser-rs]

2025-07-16 Thread via GitHub
aharpervc opened a new pull request, #1949: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1949 This PR is a followup ([ref](https://github.com/apache/datafusion-sqlparser-rs/pull/1937#issuecomment-3070806780)) to recent work on parsing without requiring semicolon statement del

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on code in PR #16791: URL: https://github.com/apache/datafusion/pull/16791#discussion_r2211280296 ## datafusion/datasource-parquet/src/source.rs: ## @@ -468,10 +468,50 @@ impl FileSource for ParquetSource { let projection = base_config .f

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on code in PR #16791: URL: https://github.com/apache/datafusion/pull/16791#discussion_r2211279282 ## datafusion/core/tests/parquet/schema_adapter.rs: ## @@ -0,0 +1,92 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lic

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-16 Thread via GitHub
alamb commented on code in PR #16791: URL: https://github.com/apache/datafusion/pull/16791#discussion_r2211266501 ## datafusion/datasource-parquet/src/source.rs: ## @@ -468,10 +468,50 @@ impl FileSource for ParquetSource { let projection = base_config .file

Re: [I] Release DataFusion `49.0.0` (July 2025) [datafusion]

2025-07-16 Thread via GitHub
alamb commented on issue #16235: URL: https://github.com/apache/datafusion/issues/16235#issuecomment-3079804143 I am hoping to start with some of the various release tasks tomorrow (like ensuring the upgrade guide is in a good place) but I have been very busy with other things recently. Hop

Re: [I] count_all() aggregations cannot be aliased [datafusion]

2025-07-16 Thread via GitHub
Loaki07 commented on issue #16795: URL: https://github.com/apache/datafusion/issues/16795#issuecomment-3079789813 take

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on code in PR #16791: URL: https://github.com/apache/datafusion/pull/16791#discussion_r2211108359 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -1095,4 +1124,167 @@ mod test { assert_eq!(num_batches, 0); assert_eq!(num_rows, 0);

[PR] fix: clean up iceberg integration APIs [datafusion-comet]

2025-07-16 Thread via GitHub
huaxingao opened a new pull request, #2032: URL: https://github.com/apache/datafusion-comet/pull/2032 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes

Re: [D] Best practices for memory-efficient deduplication of pre-sorted Parquet files [datafusion]

2025-07-16 Thread via GitHub
GitHub user zheniasigayev added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files The above results were performed with the following setup: * `datafusion-cli -m 8G -d 50G --top-memory-consumers 25` * The default `datafusion.execution.par

Re: [D] Best practices for memory-efficient deduplication of pre-sorted Parquet files [datafusion]

2025-07-16 Thread via GitHub
GitHub user zheniasigayev added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files Addressing Question 2) It's not possible to remove the `first_value()` aggregate from the above query since `col_7` and `col_8` won't appear in the `GROUP

[PR] chore: add tests for out of bounds for NullArray [datafusion]

2025-07-16 Thread via GitHub
comphead opened a new pull request, #16802: URL: https://github.com/apache/datafusion/pull/16802 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/16187. ## Rationale for this change Add tests proving the issue is fixed after

Re: [D] Best practices for memory-efficient deduplication of pre-sorted Parquet files [datafusion]

2025-07-16 Thread via GitHub
GitHub user zheniasigayev added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files Addressing Question 1. The query plan for the original query: ```sql CREATE EXTERNAL TABLE example ( col_1 VARCHAR(50) NOT NULL, col_2 BIGINT NOT

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-16 Thread via GitHub
parthchandra commented on code in PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#discussion_r2210968794 ## native/hdfs/src/object_store/hdfs.rs: ## @@ -88,19 +88,18 @@ impl HadoopFileSystem { fn read_range(range: &Range, file: &HdfsFile) -> Result {

Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3079371233 Thank you for chiming in! > We'll be able to start making the migration with the DF 49.0 release? Yes that's the plan. I'm trying to figure out the way to make the

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3079295059 > > My guess is that some of the new slowdown / less predictability is due to many more `Box`es (and thus allocations) -- I suggest we reconsider Boxing frequently used structure

Re: [I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
mbutrovich commented on issue #16800: URL: https://github.com/apache/datafusion/issues/16800#issuecomment-3079257969 Comet makes increasing use of `SchemaAdapter`, but nothing you describe here sounds like a dealbreaker for Comet at first glance. I think we'd be able to make the necessary c

Re: [PR] feat: randn expression support [datafusion-comet]

2025-07-16 Thread via GitHub
mbutrovich commented on code in PR #2010: URL: https://github.com/apache/datafusion-comet/pull/2010#discussion_r2210833524 ## native/spark-expr/src/nondetermenistic_funcs/randn.rs: ## @@ -0,0 +1,265 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more co

[I] Integration tests are not being run [datafusion]

2025-07-16 Thread via GitHub
adriangb opened a new issue, #16801: URL: https://github.com/apache/datafusion/issues/16801 ### Describe the bug There's something wrong with [datafusion/core/tests/integration_tests/schema_adapter_integration_tests.rs](https://github.com/apache/datafusion/blob/main/datafusion/core/te

Re: [I] Integration tests are not being run [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on issue #16801: URL: https://github.com/apache/datafusion/issues/16801#issuecomment-3079240904 @kosiew could you take a look at this?

Re: [PR] fix: skip predicates on struct unnest in PushDownFilter [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on code in PR #16790: URL: https://github.com/apache/datafusion/pull/16790#discussion_r2210812065 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -128,12 +128,31 @@ physical_plan 06)--ProjectionExec: expr=[column1@0 as column1, colum

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-16 Thread via GitHub
Kontinuation commented on PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#issuecomment-3079205816 > Sorry @Kontinuation if I check your references https://github.com/datafusion-contrib/fs-hdfs/blob/8c03c5ef0942b75abc79ed673931355fa9552131/c_src/libhdfs/hdfs.c#L1564C15-L1

Re: [PR] feat: randn expression support [datafusion-comet]

2025-07-16 Thread via GitHub
mbutrovich commented on code in PR #2010: URL: https://github.com/apache/datafusion-comet/pull/2010#discussion_r2210768385 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -2765,6 +2765,26 @@ class CometExpressionSuite extends CometTestBase with Adapti

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3079128237 > My guess is that some of the new slowdown / less predictability is due to many more `Box`es (and thus allocations) -- I suggest we reconsider Boxing frequently used structures

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-16 Thread via GitHub
Kontinuation commented on code in PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#discussion_r2210729874 ## native/hdfs/src/object_store/hdfs.rs: ## @@ -141,13 +140,15 @@ impl ObjectStore for HadoopFileSystem { let file_status = file.get_file_sta

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-16 Thread via GitHub
Kontinuation commented on code in PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#discussion_r2210716251 ## native/hdfs/src/object_store/hdfs.rs: ## @@ -88,19 +88,18 @@ impl HadoopFileSystem { fn read_range(range: &Range, file: &HdfsFile) -> Result {

Re: [PR] feat: randn expression support [datafusion-comet]

2025-07-16 Thread via GitHub
mbutrovich commented on PR #2010: URL: https://github.com/apache/datafusion-comet/pull/2010#issuecomment-3079038284 > In those scenarios we do have reproducibility and I believe a native implementation should also have this property. Thank you for the great explanation! This makes se

Re: [PR] 48.0.1 [datafusion]

2025-07-16 Thread via GitHub
alamb commented on PR #16755: URL: https://github.com/apache/datafusion/pull/16755#issuecomment-3079019727 What is the purpose of this PR?

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-16 Thread via GitHub
alamb commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3079000603 My guess is that some of the new slowdown / less predictability is due to many more `Box`es (and thus allocations) -- I suggest we reconsider Boxing frequently used structures (like Co

Re: [I] Release DataFusion `50.0.0` (Aug/Sep 2025) [datafusion]

2025-07-16 Thread via GitHub
alamb commented on issue #16799: URL: https://github.com/apache/datafusion/issues/16799#issuecomment-3078987206 > Marked the reduce Expr size task here: > > [#16771](https://github.com/apache/datafusion/pull/16771) Added

Re: [PR] fix: hdfs read into buffer fully [datafusion-comet]

2025-07-16 Thread via GitHub
comphead commented on code in PR #2031: URL: https://github.com/apache/datafusion-comet/pull/2031#discussion_r2210658711 ## native/hdfs/src/object_store/hdfs.rs: ## @@ -88,19 +88,18 @@ impl HadoopFileSystem { fn read_range(range: &Range, file: &HdfsFile) -> Result {

Re: [I] q9 [datafusion-comet]

2025-07-16 Thread via GitHub
comphead closed issue #2005: q9 URL: https://github.com/apache/datafusion-comet/issues/2005

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on PR #16791: URL: https://github.com/apache/datafusion/pull/16791#issuecomment-3078933391 I opened https://github.com/apache/datafusion/issues/16800 to track the big picture -- This is an automated message from the Apache Git Service. To respond to the message, please

[I] Plan to replace `SchemaAdapter` with `PhysicalExprAdapter` [datafusion]

2025-07-16 Thread via GitHub
adriangb opened a new issue, #16800: URL: https://github.com/apache/datafusion/issues/16800 As discussed in https://github.com/apache/datafusion/pull/16791 the long term plan in my mind (and that I would like to discuss with the community) is to replace `SchemaAdapter` with `PhysicalExprAda

Re: [I] Optimize the join operators [datafusion]

2025-07-16 Thread via GitHub
UBarney commented on issue #16710: URL: https://github.com/apache/datafusion/issues/16710#issuecomment-3078931663 > Thanks [@nuno-faria](https://github.com/nuno-faria) that's a great insight (for TPC-H / very nested joins we probably should implement a smarter join order algorithm). >

Re: [I] count_all() aggregations cannot be aliased [datafusion]

2025-07-16 Thread via GitHub
Loaki07 commented on issue #16795: URL: https://github.com/apache/datafusion/issues/16795#issuecomment-3078924010 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To u

Re: [PR] DuckDB, Postgres, SQLite: NOT NULL and NOTNULL expressions [datafusion-sqlparser-rs]

2025-07-16 Thread via GitHub
iffyio commented on code in PR #1927: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1927#discussion_r2210549842 ## src/parser/mod.rs: ## @@ -7724,6 +7737,27 @@ impl<'a> Parser<'a> { return option; } +self.with_state( +Co

Re: [PR] Restore custom SchemaAdapter functionality for Parquet [datafusion]

2025-07-16 Thread via GitHub
adriangb commented on code in PR #16791: URL: https://github.com/apache/datafusion/pull/16791#discussion_r2210616673 ## datafusion/datasource-parquet/src/row_filter.rs: ## @@ -140,6 +143,8 @@ impl ArrowPredicate for DatafusionArrowPredicate { } fn evaluate(&mut self,

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3078771877 > After opening the DF50.0.0 release issue, you can add it to the list Thank you @xudong963 , added it in https://github.com/apache/datafusion/issues/16799#issuecomment-307

Re: [I] Release DataFusion `50.0.0` (Aug/Sep 2025) [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on issue #16799: URL: https://github.com/apache/datafusion/issues/16799#issuecomment-3078770965 Marked the reduce Expr size task here: https://github.com/apache/datafusion/pull/16771 -- This is an automated message from the Apache Git Service. To respond to the

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-16 Thread via GitHub
zhuqi-lucas commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3078761333 > 🤖: Benchmark completed > > Details > > ``` > group main reduce_expr_size > -

Re: [PR] feat: change Expr Alias ,OuterReferenceColumn, Column to Box type for reducing expr struct size [datafusion]

2025-07-16 Thread via GitHub
alamb commented on PR #16771: URL: https://github.com/apache/datafusion/pull/16771#issuecomment-3078726061 🤖: Benchmark completed Details ``` group main reduce_expr_size -

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-16 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3078620431 I looked into this failure running clickbench: ``` │ QQuery 27│ 2328.28 ms │ FAIL │ incomparable │ ``` I ran the [`q27.sql`](https://github.com/apache

Re: [D] Best practices for memory-efficient deduplication of pre-sorted Parquet files [datafusion]

2025-07-16 Thread via GitHub
GitHub user alamb added a comment to the discussion: Best practices for memory-efficient deduplication of pre-sorted Parquet files 👋 Given your description, I am surprised that this query is using a HashAggregateStream -- the hash aggregate needs to buffer the entire dataset in RAM / spill i

Re: [D] DISCUSSION: DataFusion Meetup in New York, NY, USA - Sep 15, 2025 [datafusion]

2025-07-16 Thread via GitHub
GitHub user leoDYL edited a discussion: DISCUSSION: DataFusion Meetup in New York, NY, USA - Sep 15, 2025 We are organizing an NYC meetup to celebrate the upcoming release 50. Currently planning on Sept 15th, 2025. We will organize it in the same location as #11213 Registration link: https://

Re: [D] DISCUSSION: DataFusion Meetup in Boston, USA - Nov 12, 2025 [datafusion]

2025-07-16 Thread via GitHub
GitHub user NGA-TRAN edited a discussion: DISCUSSION: DataFusion Meetup in Boston, USA - Nov 12, 2025 With the upcoming New York meetup on the horizon, the DataDog Boston team is excited to plan a local DataFusion-themed gathering this fall! **Date:** Wednesday, November 12 📍 Location: Data

[I] Release DataFusion `50.0.0` (Aug/Sep 2025) [datafusion]

2025-07-16 Thread via GitHub
alamb opened a new issue, #16799: URL: https://github.com/apache/datafusion/issues/16799 ### Is your feature request related to a problem or challenge? Tracking ticket for next release, also a place to track desired inclusions Previous release will be https://crates.io/crates/d

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-16 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3078462264 > > select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value < t1.value * t2.value; > > I'm happy to include this benchmark in the bench suite this week,
