Re: [I] Add tests for Scalar and Inverval values for UnaryMinus [datafusion-comet]

2024-06-05 Thread via GitHub


vaibhawvipul commented on issue #508:
URL: 
https://github.com/apache/datafusion-comet/issues/508#issuecomment-2151535000

   I will work on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Remove 3.4.2.diff [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura commented on PR #528:
URL: https://github.com/apache/datafusion-comet/pull/528#issuecomment-2151533724

   Thank you merged @viirya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Remove 3.4.2.diff [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura merged PR #528:
URL: https://github.com/apache/datafusion-comet/pull/528


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Fix `ScalarUDFImpl::propagate_constraints` doc [datafusion]

2024-06-05 Thread via GitHub


lewiszlw opened a new pull request, #10810:
URL: https://github.com/apache/datafusion/pull/10810

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   I'm reading somd udf code, and found the example in 
`ScalarUDFImpl::propagate_constraints` doc is confused. I checked 
`PhysicalExpr::propagate_constraints` doc example
   ```
   /// If the expression is `a + b`, the current `interval` is `[4, 5]` and 
the
   /// inputs `a` and `b` are respectively given as `[0, 2]` and `[-∞, 4]`, 
then
   /// propagation would would return `[0, 2]` and `[2, 4]` as `b` must be 
at
   /// least `2` to make the output at least `4`.
   ```
   If I understand correctly, the updated bounds of child expressions should be 
subinterval of their original intervals.
   
   ## What changes are included in this PR?
   
   
   
   ## Are these changes tested?
   
   
   
   ## Are there any user-facing changes?
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Input batch to ShuffleRepartitioner.insert_batch should not be larger than configured batch size [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on PR #523:
URL: https://github.com/apache/datafusion-comet/pull/523#issuecomment-2151485524

   cc @huaxingao @kazuyukitanimura 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore: Remove 3.4.2.diff [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura opened a new pull request, #528:
URL: https://github.com/apache/datafusion-comet/pull/528

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add UnboundColumn to carry datatype for unbound reference [datafusion-comet]

2024-06-05 Thread via GitHub


viirya merged PR #518:
URL: https://github.com/apache/datafusion-comet/pull/518


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add UnboundColumn to carry datatype for unbound reference [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on PR #518:
URL: https://github.com/apache/datafusion-comet/pull/518#issuecomment-2151361526

   Merged. Thanks @huaxingao @advancedxy 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Incorrect return type on aggregate functions implemented by AggregateUDF when upgrading to latest DataFusion [datafusion-comet]

2024-06-05 Thread via GitHub


viirya closed issue #511: Incorrect return type on aggregate functions 
implemented by AggregateUDF when upgrading to latest DataFusion
URL: https://github.com/apache/datafusion-comet/issues/511


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `ParquetAccessPlan`, unify RowGroup selection and PagePruning selection [datafusion]

2024-06-05 Thread via GitHub


Ted-Jiang commented on code in PR #10738:
URL: https://github.com/apache/datafusion/pull/10738#discussion_r1628732681


##
datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs:
##
@@ -236,6 +225,24 @@ impl PagePruningPredicate {
 }
 }
 
+/// returns the number of rows skipped in the selection
+/// TODO should this be upstreamed to RowSelection?
+fn rows_skipped(selection: &RowSelection) -> usize {
+selection
+.iter()
+.fold(0, |acc, x| if x.skip { acc + x.row_count } else { acc })
+}
+
+fn update_selection(
+current_selection: Option,
+row_selection: RowSelection,
+) -> Option {
+match current_selection {
+None => Some(row_selection),
+Some(current_selection) => 
Some(current_selection.intersection(&row_selection)),

Review Comment:
   @alamb  maybe we can check after the `intersection` the result  
row_selection is select 0 row ? 🤔  then we can cast it to skip. IMO select-0 
seems need do more effort in next round check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add `ParquetAccessPlan`, unify RowGroup selection and PagePruning selection [datafusion]

2024-06-05 Thread via GitHub


Ted-Jiang commented on PR #10738:
URL: https://github.com/apache/datafusion/pull/10738#issuecomment-2151312356

   Thanks for ping me,  i will review this carefully later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Input batch to ShuffleRepartitioner.insert_batch should not be larger than configured batch size [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on code in PR #523:
URL: https://github.com/apache/datafusion-comet/pull/523#discussion_r1628700174


##
core/src/execution/datafusion/shuffle_writer.rs:
##
@@ -947,12 +948,18 @@ async fn external_shuffle(
 partitioning,
 metrics,
 context.runtime_env(),
-context.session_config().batch_size(),
+batch_size,
 );
 
 while let Some(batch) = input.next().await {
 let batch = batch?;
-repartitioner.insert_batch(batch).await?;
+let mut start = 0;
+while start < batch.num_rows() {
+let end = (start + batch_size).min(batch.num_rows());
+let batch = batch.slice(start, end - start);
+repartitioner.insert_batch(batch).await?;
+start = end;
+}

Review Comment:
   Okay



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] build: Update plan stability for Spark 4.0 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya closed pull request #527: build: Update plan stability for Spark 4.0
URL: https://github.com/apache/datafusion-comet/pull/527


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] update deps [datafusion-python]

2024-06-05 Thread via GitHub


Michael-J-Ward opened a new pull request, #723:
URL: https://github.com/apache/datafusion-python/pull/723

   I'd like to clear out the `dependabot` pull requests.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] build: Update plan stability for Spark 4.0 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on PR #527:
URL: https://github.com/apache/datafusion-comet/pull/527#issuecomment-2151280110

   I guess that with #526, we don't need to update the plan stability.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Upgrade spark to 4.0.0-preview1 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya merged PR #526:
URL: https://github.com/apache/datafusion-comet/pull/526


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Upgrade spark4.0 to 4.0.0-preview1 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya closed issue #525: Upgrade spark4.0 to 4.0.0-preview1
URL: https://github.com/apache/datafusion-comet/issues/525


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] build: Update plan stability for Spark 4.0 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on code in PR #527:
URL: https://github.com/apache/datafusion-comet/pull/527#discussion_r1628673092


##
spark/src/test/resources/tpcds-plan-stability/approved-plans-v1_4-spark4_0/q1/explain.txt:
##
@@ -31,9 +31,9 @@ TakeOrderedAndProject (40)
   : :   : +- CometScan parquet 
spark_catalog.default.store_returns (11)
   : :   +- ReusedExchange (14)
   : +- BroadcastExchange (31)
-  :+- * ColumnarToRow (30)
-  :   +- CometProject (29)
-  :  +- CometFilter (28)
+  :+- * Project (30)
+  :   +- * Filter (29)
+  :  +- * ColumnarToRow (28)

Review Comment:
   No. But just want to restore CI as soon as possible. We can investigate it 
later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Use file cache to list partitions if available [datafusion]

2024-06-05 Thread via GitHub


github-actions[bot] commented on PR #9655:
URL: https://github.com/apache/datafusion/pull/9655#issuecomment-2151257622

   Thank you for your contribution. Unfortunately, this pull request is stale 
because it has been open 60 days with no activity. Please remove the stale 
label or comment or this will be closed in 7 days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Add GroupValuesFullyOrdered mode to GroupValues trait for aggregate grouping. [datafusion]

2024-06-05 Thread via GitHub


github-actions[bot] closed pull request #9662: Add GroupValuesFullyOrdered mode 
to GroupValues trait for aggregate grouping.
URL: https://github.com/apache/datafusion/pull/9662


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: move array_except in SetOp and support Null columnar in `array_except`, `array_union` and `array_intersect` [datafusion]

2024-06-05 Thread via GitHub


github-actions[bot] closed pull request #9710: fix: move array_except in SetOp 
and support Null columnar in `array_except`, `array_union` and `array_intersect`
URL: https://github.com/apache/datafusion/pull/9710


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] `select array_concat([])` panicked [datafusion]

2024-06-05 Thread via GitHub


jonahgao commented on issue #10200:
URL: https://github.com/apache/datafusion/issues/10200#issuecomment-2151251654

   Fixed by #10790 10790


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Int64 as default type for make_array function empty or null case [datafusion]

2024-06-05 Thread via GitHub


jonahgao commented on code in PR #10790:
URL: https://github.com/apache/datafusion/pull/10790#discussion_r1628633358


##
datafusion/sqllogictest/test_files/array.slt:
##
@@ -2622,6 +2629,12 @@ drop table large_array_repeat_table;
 
 ## array_concat (aliases: `array_cat`, `list_concat`, `list_cat`)
 
+# test with empty array
+query ?
+select array_concat([]);
+
+[]

Review Comment:
   Fixed #10200 👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] `select array_concat([])` panicked [datafusion]

2024-06-05 Thread via GitHub


jonahgao closed issue #10200: `select array_concat([])` panicked
URL: https://github.com/apache/datafusion/issues/10200


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] build: Update plan stability for Spark 4.0 [datafusion-comet]

2024-06-05 Thread via GitHub


codecov-commenter commented on PR #527:
URL: https://github.com/apache/datafusion-comet/pull/527#issuecomment-2151220549

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/527?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 34.22%. Comparing base 
[(`7ab37eb`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/7ab37ebdac1e077e157e292fbddf7ca23e78429c?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`19ba0c5`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/19ba0c5c8ace74a8726e80e5f0eca5e4882d47e4?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 1 commits behind head on main.
   
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main #527  +/-   ##
   
   - Coverage 34.23%   34.22%   -0.01% 
 Complexity  803  803  
   
 Files   105  105  
 Lines 3848338483  
 Branches   8560 8560  
   
   - Hits  1317313171   -2 
   - Misses2255522556   +1 
   - Partials   2755 2756   +1 
   ```
   
   
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/527?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Upgrade spark to 4.0.0-preview1 [datafusion-comet]

2024-06-05 Thread via GitHub


codecov-commenter commented on PR #526:
URL: https://github.com/apache/datafusion-comet/pull/526#issuecomment-2151219521

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/526?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 34.32%. Comparing base 
[(`7ab37eb`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/7ab37ebdac1e077e157e292fbddf7ca23e78429c?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`7dde3de`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/7dde3de52d5516bb7d58cc7058ee3273c266709e?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 1 commits behind head on main.
   
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main #526  +/-   ##
   
   + Coverage 34.23%   34.32%   +0.09% 
   - Complexity  803  807   +4 
   
 Files   105  105  
 Lines 3848338483  
 Branches   8560 8562   +2 
   
   + Hits  1317313209  +36 
   + Misses2255522522  -33 
   + Partials   2755 2752   -3 
   ```
   
   
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/526?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] build: Update plan stability for Spark 4.0 [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura commented on code in PR #527:
URL: https://github.com/apache/datafusion-comet/pull/527#discussion_r1628603328


##
spark/src/test/resources/tpcds-plan-stability/approved-plans-v1_4-spark4_0/q1/explain.txt:
##
@@ -31,9 +31,9 @@ TakeOrderedAndProject (40)
   : :   : +- CometScan parquet 
spark_catalog.default.store_returns (11)
   : :   +- ReusedExchange (14)
   : +- BroadcastExchange (31)
-  :+- * ColumnarToRow (30)
-  :   +- CometProject (29)
-  :  +- CometFilter (28)
+  :+- * Project (30)
+  :   +- * Filter (29)
+  :  +- * ColumnarToRow (28)

Review Comment:
   Hmmm do you know why CometFilter is no longer activated by any chance?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] build: Update plan stability for Spark 4.0 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on PR #527:
URL: https://github.com/apache/datafusion-comet/pull/527#issuecomment-2151168828

   cc @kazuyukitanimura @andygrove @huaxingao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] build: Update plan stability for Spark 4.0 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on PR #527:
URL: https://github.com/apache/datafusion-comet/pull/527#issuecomment-2151168689

   The plan stability results of Spark 4.0 is changed. It causes current CI 
pipelines failures.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] build: Update plan stability for Spark 4.0 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya opened a new pull request, #527:
URL: https://github.com/apache/datafusion-comet/pull/527

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   Fixing broken CI pipelines.
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Upgrade spark to 4.0.0-preview1 [datafusion-comet]

2024-06-05 Thread via GitHub


advancedxy commented on PR #526:
URL: https://github.com/apache/datafusion-comet/pull/526#issuecomment-2151168099

   @kazuyukitanimura would you mind take a look at this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore: Upgrade spark to 4.0.0-preview1 [datafusion-comet]

2024-06-05 Thread via GitHub


advancedxy opened a new pull request, #526:
URL: https://github.com/apache/datafusion-comet/pull/526

   ## Which issue does this PR close?
   Closes #525 
   
   ## Rationale for this change
   Reduce CI time and for local testing with spark 4.0.
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Upgrade spark4.0 to 4.0.0-preview1 [datafusion-comet]

2024-06-05 Thread via GitHub


advancedxy opened a new issue, #525:
URL: https://github.com/apache/datafusion-comet/issues/525

   ### What is the problem the feature request solves?
   
   Since spark 4.0.0-preview1 has already been 
[released](https://lists.apache.org/thread/y0fhglwjdrt90qjd0ntgvy0qodzdtzmn),  
we should use that instead of building a snapshot from scratch. 
   
   ### Describe the potential solution
   
   I will submit a pr to use spark 4.0.0-preview1 instead.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Int64 as default type for make_array function empty or null case [datafusion]

2024-06-05 Thread via GitHub


jayzhan211 merged PR #10790:
URL: https://github.com/apache/datafusion/pull/10790


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] `Int64` as default type for `make_array` function empty or null case [datafusion]

2024-06-05 Thread via GitHub


jayzhan211 closed issue #10789: `Int64` as  default type for `make_array` 
function empty or null case
URL: https://github.com/apache/datafusion/issues/10789


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Int64 as default type for make_array function empty or null case [datafusion]

2024-06-05 Thread via GitHub


jayzhan211 commented on PR #10790:
URL: https://github.com/apache/datafusion/pull/10790#issuecomment-2151149267

   Thanks @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Int64 as default type for make_array function empty or null case [datafusion]

2024-06-05 Thread via GitHub


jayzhan211 commented on PR #10790:
URL: https://github.com/apache/datafusion/pull/10790#issuecomment-2151148927

   > Looks like a reasonable change to me. Thanks @jayzhan211
   > 
   > BTW I tested in duckdb and it seems like the default type is actually 
`int32` but I think `int64` is close enough
   > 
   > ```
   > D select [];
   > ┌───┐
   > │ main.list_value() │
   > │  int32[]  │
   > ├───┤
   > │ []│
   > └───┘
   > ```
   
   Yes, they use i32. The reason I choose i64 is that the default value in 
datafusion is mostly i64 so we can avoid the cast for most of the case. We can 
easily convert it to i32 anytime if there is any need


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Input batch to ShuffleRepartitioner.insert_batch should not be larger than configured batch size [datafusion-comet]

2024-06-05 Thread via GitHub


advancedxy commented on code in PR #523:
URL: https://github.com/apache/datafusion-comet/pull/523#discussion_r1628574405


##
core/src/execution/datafusion/shuffle_writer.rs:
##
@@ -947,12 +948,18 @@ async fn external_shuffle(
 partitioning,
 metrics,
 context.runtime_env(),
-context.session_config().batch_size(),
+batch_size,
 );
 
 while let Some(batch) = input.next().await {
 let batch = batch?;
-repartitioner.insert_batch(batch).await?;
+let mut start = 0;
+while start < batch.num_rows() {
+let end = (start + batch_size).min(batch.num_rows());
+let batch = batch.slice(start, end - start);
+repartitioner.insert_batch(batch).await?;
+start = end;
+}

Review Comment:
   It seems like this code should be handled in the ShuffleRepartitioner's 
insert_batch method? We can rename the previous `insert_batch` function to 
`insert_batch_internal` or something similar and call it multiple time in the 
new `insert_batch` function.
   
   The batch size is an internal parameter of ShuffleRepartitioner's 
constructor, which is not exposed. We may change the implementation in the 
future. By handling it internally, we can ensure the buffer size is respected 
even the implementation changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support join filter in NestedLoopJoin in fizz join test cases [datafusion]

2024-06-05 Thread via GitHub


comphead commented on issue #10787:
URL: https://github.com/apache/datafusion/issues/10787#issuecomment-2151145063

   Thats interesting, I'll try to add a test case for it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] Bench: Add `PREFER_HASH_JOIN` env variable [datafusion]

2024-06-05 Thread via GitHub


comphead opened a new pull request, #10809:
URL: https://github.com/apache/datafusion/pull/10809

   ## Which issue does this PR close?
   
   
   
   Closes #.
   
   ## Rationale for this change
   
   
   
   By default benches run with hash join algorithm, this PR introduces new env 
variable `PREFER_HASH_JOIN` to give the user a switch for running benchmarks 
with SMJ as well.
   
   ## What changes are included in this PR?
   Removed old SMJ benches code, added env variable and fixed the doc
   
   
   ## Are these changes tested?
   Yes manually
   
   
   ## Are there any user-facing changes?
   No
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add fuzz testing for arithmetic expressions [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove commented on PR #519:
URL: https://github.com/apache/datafusion-comet/pull/519#issuecomment-2151137653

   Thanks for the review @huaxingao @kazuyukitanimura @viirya I have added the 
bitwise expressions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add UnboundColumn to carry datatype for unbound reference [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on PR #518:
URL: https://github.com/apache/datafusion-comet/pull/518#issuecomment-2151133054

   > LGTM.
   > 
   > Do we need to contribute the UnboundColumn back to DataFusion?
   
   DataFusion doesn't need it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Experiment: Coalesce batches after scan [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove closed pull request #496: Experiment: Coalesce batches after scan
URL: https://github.com/apache/datafusion-comet/pull/496


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Experiment: Coalesce batches after scan [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove commented on PR #496:
URL: https://github.com/apache/datafusion-comet/pull/496#issuecomment-2151128821

   I ran benchmarks with this change and see no improvement, so closing this as 
a failed experiement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] TPC-H q8 hangs with xxhash64 enabled [datafusion-comet]

2024-06-05 Thread via GitHub


parthchandra commented on issue #517:
URL: 
https://github.com/apache/datafusion-comet/issues/517#issuecomment-2151122785

   Was this running on a K8s cluster? It should not result in a OutOfMemory if 
cpu limits are reached.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Create initial release process scripts for official ASF source release [datafusion-comet]

2024-06-05 Thread via GitHub


advancedxy commented on code in PR #429:
URL: https://github.com/apache/datafusion-comet/pull/429#discussion_r1628556183


##
dev/release/run-rat.sh:
##
@@ -0,0 +1,43 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+RAT_VERSION=0.13
+
+# download apache rat
+if [ ! -f apache-rat-${RAT_VERSION}.jar ]; then
+  curl -s 
https://repo1.maven.org/maven2/org/apache/rat/apache-rat/${RAT_VERSION}/apache-rat-${RAT_VERSION}.jar
 > apache-rat-${RAT_VERSION}.jar
+fi
+
+RAT="java -jar apache-rat-${RAT_VERSION}.jar -x "
+
+RELEASE_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd)
+
+# generate the rat report
+$RAT $1 > rat.txt
+python3 $RELEASE_DIR/check-rat-report.py $RELEASE_DIR/rat_exclude_files.txt 
rat.txt > filtered_rat.txt
+cat filtered_rat.txt

Review Comment:
   I see. Nit:
   How about moving this line to the else block in L41 and L42? So that it 
doesn't pollute the stdout.



##
dev/release/verify-release-candidate.sh:
##
@@ -0,0 +1,132 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+case $# in
+  2) VERSION="$1"
+ RC_NUMBER="$2"
+ ;;
+  *) echo "Usage: $0 X.Y.Z RC_NUMBER"
+ exit 1
+ ;;
+esac
+
+set -e
+set -x
+set -o pipefail
+
+SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
+COMET_DIR="$(dirname $(dirname ${SOURCE_DIR}))"
+COMET_DIST_URL='https://dist.apache.org/repos/dist/dev/datafusion'
+
+download_dist_file() {
+  curl \
+--silent \
+--show-error \
+--fail \
+--location \
+--remote-name $COMET_DIST_URL/$1
+}
+
+download_rc_file() {
+  download_dist_file apache-datafusion-comet-${VERSION}-rc${RC_NUMBER}/$1
+}
+
+import_gpg_keys() {
+  download_dist_file KEYS
+  gpg --import KEYS
+}
+
+if type shasum >/dev/null 2>&1; then
+  sha256_verify="shasum -a 256 -c"
+  sha512_verify="shasum -a 512 -c"
+else
+  sha256_verify="sha256sum -c"
+  sha512_verify="sha512sum -c"
+fi
+
+fetch_archive() {
+  local dist_name=$1
+  download_rc_file ${dist_name}.tar.gz
+  download_rc_file ${dist_name}.tar.gz.asc
+  download_rc_file ${dist_name}.tar.gz.sha256
+  download_rc_file ${dist_name}.tar.gz.sha512
+  verify_dir_artifact_signatures
+}
+
+verify_dir_artifact_signatures() {
+  # verify the signature and the checksums of each artifact
+  find . -name '*.asc' | while read sigfile; do
+artifact=${sigfile/.asc/}
+gpg --verify $sigfile $artifact || exit 1
+
+# go into the directory because the checksum files contain only the
+# basename of the artifact
+pushd $(dirname $artifact)
+base_artifact=$(basename $artifact)
+${sha256_verify} $base_artifact.sha256 || exit 1
+${sha512_verify} $base_artifact.sha512 || exit 1
+popd
+  done
+}
+
+setup_tempdir() {
+  cleanup() {
+if [ "${TEST_SUCCESS}" = "yes" ]; then
+  rm -fr "${COMET_TMPDIR}"
+else
+  echo "Failed to verify release candidate. See ${COMET_TMPDIR} for 
details."
+fi
+  }
+
+  if [ -z "${COMET_TMPDIR}" ]; then
+# clean up automatically if COMET_TMPDIR is not defined
+COMET_TMPDIR=$(mktemp -d -t "$1.X")
+trap cleanup EXIT
+  else
+# don't clean up automatically
+mkdir -p "${COMET_TMPDIR}"
+  fi
+}
+
+test_source_distribution() {
+  set -e
+  pushd core
+RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
+  popd
+  ./mvnw install -Prelease -DskipTests -P"spark-3.4" 
-Dmaven.gitcommitid.skip=true

Review Comment:
   > Spark-3.4 is hardcoded now. What about making it a parameter.
   
   +1 fo

Re: [PR] chore: Create initial release process scripts for official ASF source release [datafusion-comet]

2024-06-05 Thread via GitHub


advancedxy commented on code in PR #429:
URL: https://github.com/apache/datafusion-comet/pull/429#discussion_r1627813262


##
dev/release/run-rat.sh:
##
@@ -0,0 +1,43 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+RAT_VERSION=0.13
+
+# download apache rat
+if [ ! -f apache-rat-${RAT_VERSION}.jar ]; then
+  curl -s 
https://repo1.maven.org/maven2/org/apache/rat/apache-rat/${RAT_VERSION}/apache-rat-${RAT_VERSION}.jar
 > apache-rat-${RAT_VERSION}.jar
+fi
+
+RAT="java -jar apache-rat-${RAT_VERSION}.jar -x "
+
+RELEASE_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd)
+
+# generate the rat report
+$RAT $1 > rat.txt
+python3 $RELEASE_DIR/check-rat-report.py $RELEASE_DIR/rat_exclude_files.txt 
rat.txt > filtered_rat.txt
+cat filtered_rat.txt

Review Comment:
   I don't think this line is needed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Create initial release process scripts for official ASF source release [datafusion-comet]

2024-06-05 Thread via GitHub


advancedxy commented on code in PR #429:
URL: https://github.com/apache/datafusion-comet/pull/429#discussion_r1628554607


##
dev/release/create-tarball.sh:
##
@@ -0,0 +1,135 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+# Adapted from 
https://github.com/apache/arrow-rs/tree/master/dev/release/create-tarball.sh
+
+# This script creates a signed tarball in
+# dev/dist/apache-datafusion-comet--.tar.gz and uploads it to
+# the "dev" area of the dist.apache.datafusion repository and prepares an
+# email for sending to the d...@datafusion.apache.org list for a formal
+# vote.
+#
+# See release/README.md for full release instructions
+#
+# Requirements:
+#
+# 1. gpg setup for signing and have uploaded your public
+# signature to https://pgp.mit.edu/
+#
+# 2. Logged into the apache svn server with the appropriate
+# credentials
+#
+# 3. Install the requests python package
+#
+#
+# Based in part on 02-source.sh from apache/arrow
+#
+
+set -e
+
+DEV_RELEASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+DEV_RELEASE_TOP_DIR="$(cd "${DEV_RELEASE_DIR}/../../" && pwd)"
+
+if [ "$#" -ne 2 ]; then
+echo "Usage: $0  "
+echo "ex. $0 4.1.0 2"
+exit
+fi
+
+if [[ -z "${GH_TOKEN}" ]]; then
+echo "Please set personal github token through GH_TOKEN environment 
variable"
+exit
+fi
+
+version=$1
+rc=$2
+tag="${version}-rc${rc}"
+
+echo "Attempting to create ${tarball} from tag ${tag}"
+release_hash=$(cd "${DEV_RELEASE_TOP_DIR}" && git rev-list --max-count=1 
${tag})
+
+release=apache-datafusion-comet-${version}
+distdir=${DEV_RELEASE_TOP_DIR}/dev/dist/${release}-rc${rc}
+tarname=${release}.tar.gz
+tarball=${distdir}/${tarname}
+url="https://dist.apache.org/repos/dist/dev/datafusion/${release}-rc${rc}";
+
+if [ -z "$release_hash" ]; then
+echo "Cannot continue: unknown git tag: ${tag}"
+fi
+
+echo "Draft email for d...@datafusion.apache.org mailing list"
+echo ""
+echo "-"
+cat  ${tarball})
+
+echo "Running rat license checker on ${tarball}"
+${DEV_RELEASE_DIR}/run-rat.sh ${tarball}
+
+echo "Signing tarball and creating checksums"
+gpg --armor --output ${tarball}.asc --detach-sig ${tarball}
+# create signing with relative path of tarball
+# so that they can be verified with a command such as
+#  shasum --check apache-datafusion-comet-0.1.0-rc1.tar.gz.sha512
+(cd ${distdir} && shasum -a 256 ${tarname}) > ${tarball}.sha256
+(cd ${distdir} && shasum -a 512 ${tarname}) > ${tarball}.sha512
+
+
+echo "Uploading to datafusion dist/dev to ${url}"

Review Comment:
   Thanks for the information, good to know.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Plan Comet 0.1.0 Release [datafusion-comet]

2024-06-05 Thread via GitHub


advancedxy commented on issue #369:
URL: 
https://github.com/apache/datafusion-comet/issues/369#issuecomment-2151114500

   > We plan to do binary release, although it might not be able to catch up 
the 0.1.0 source release. 
   
   I think we are on the same page. 
   >  Comet involves native code, so it becomes more complicated than pure 
Java/Scala projects. We need to include pre-built binaries for different 
platforms in the published jar.
   
   yes, it might be a bit complicated. But I think the rust toolchain has done 
an excellent job of cross compiling. If I'm not wrong, the Makefile in this 
repo already has `release-linux` target, which builds both linux/mac(both intel 
and arm cpus) libs. It should be a good starting point. 
   
   
   > Even though a maven artifact would not be published, it does allow 
projects to build their own, or even add comet as a git submodule, based on a 
relatively 'stable' version.
   
   Of course, that could be an option.  However, a maven artifact should be the 
convenient/easy way for people in the JVM echosystem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] [EPIC] Document all known incompatibilities [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove opened a new issue, #524:
URL: https://github.com/apache/datafusion-comet/issues/524

   ### What is the problem the feature request solves?
   
   As we approach the [0.1.0 
release](https://github.com/apache/datafusion-comet/issues/369), we want to 
make sure that we address all (currently) known expression compatibility issues 
in one of the following ways:
   
   1. Fix the compatibility issue, or
   2. Disable the expression by default but allow the user to enable it if they 
are ok with the incompatibility, or
   3. Accept that we can't fix it in time and add documentation about the 
incompatibility
   
   By taking these steps, we empower our users to make an informed decision 
about enabling Comet (or not) for their production workflows.
   
   At the time of writing, we have around 25 open 
[bugs](https://github.com/apache/datafusion-comet/issues?q=is%3Aopen+is%3Aissue+label%3Abug)
 that we should review and decide on the action to take for each one.
   
   
   
   
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add UnboundColumn to carry datatype for unbound reference [datafusion-comet]

2024-06-05 Thread via GitHub


codecov-commenter commented on PR #518:
URL: https://github.com/apache/datafusion-comet/pull/518#issuecomment-215554

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/518?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   All modified and coverable lines are covered by tests :white_check_mark:
   > Project coverage is 34.07%. Comparing base 
[(`7ba5693`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/7ba569357cadade64b49b6bdf5f0946f81f95301?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`3c5240c`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/3c5240c9cf26436496819be249967207eb6d70c0?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 20 commits behind head on main.
   
   
   Additional details and impacted files
   
   
   ```diff
   @@ Coverage Diff  @@
   ##   main #518  +/-   ##
   
   + Coverage 34.05%   34.07%   +0.02% 
   + Complexity  859  809  -50 
   
 Files   116  105  -11 
 Lines 3867938509 -170 
 Branches   8567 8552  -15 
   
   - Hits  1317313123  -50 
   + Misses2274522643 -102 
   + Partials   2761 2743  -18 
   ```
   
   
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/518?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Possible regression in memory usage [datafusion-comet]

2024-06-05 Thread via GitHub


advancedxy commented on issue #517:
URL: 
https://github.com/apache/datafusion-comet/issues/517#issuecomment-2151102014

   > all/most cores are at 100% utilization
   
   So it might be more about cpu usage rather than memory consumption?  If it’s 
indeed related to xxhash64, one possible reason might be that every hash call 
it creates a XXHash64 struct on the stack.
   
   Thanks for filing this, it would be great that we  can have a minimum 
reproduction example with smaller size. I can help with this issue.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support OneRowRelation to Support Scalar Inputs? [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura commented on issue #516:
URL: 
https://github.com/apache/datafusion-comet/issues/516#issuecomment-2151095074

   Thank you @tshauck assigned this to you.
   Now `OneRowRelation` issue and `scalar` test issues are decoupled. If you 
would like to separate the ticket, please do so.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Plan Comet 0.1.0 Release [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove commented on issue #369:
URL: 
https://github.com/apache/datafusion-comet/issues/369#issuecomment-2151086300

   I created a milestone where we can track the priority issues for the 0.1.0 
release
   
   https://github.com/apache/datafusion-comet/milestone/1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add UnboundColumn to carry datatype for unbound reference [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on code in PR #518:
URL: https://github.com/apache/datafusion-comet/pull/518#discussion_r1628506549


##
core/src/execution/datafusion/expressions/unbound.rs:
##
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::execution::datafusion::expressions::utils::down_cast_any_ref;
+use arrow_array::RecordBatch;
+use arrow_schema::{DataType, Schema};
+use datafusion::physical_plan::ColumnarValue;
+use datafusion_common::{internal_err, Result};
+use datafusion_physical_expr::PhysicalExpr;
+use std::{
+any::Any,
+hash::{Hash, Hasher},
+sync::Arc,
+};
+
+/// This is similar to `UnKnownColumn` in DataFusion, but it has data type.
+/// This is only used when the column is not bound to a schema, for example, 
the
+/// inputs to aggregation functions in final aggregation. In the case, we 
cannot
+/// bind the aggregation functions to the input schema which is grouping 
columns
+/// and aggregate buffer attributes in Spark (DataFusion has different design).
+/// But when creating certain aggregation functions, we need to know its input
+/// data types. As `UnKnownColumn` doesn't have data type, we implement this
+/// `UnboundColumn` to carry the data type.
+#[derive(Debug, Hash, PartialEq, Eq, Clone)]
+pub struct UnboundColumn {
+name: String,
+datatype: DataType,
+}
+
+impl UnboundColumn {
+/// Create a new unbound column expression
+pub fn new(name: &str, datatype: DataType) -> Self {
+Self {
+name: name.to_owned(),
+datatype,
+}
+}
+
+/// Get the column name
+pub fn name(&self) -> &str {
+&self.name
+}
+}
+
+impl std::fmt::Display for UnboundColumn {
+fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+write!(f, "{}, datatype: {}", self.name, self.datatype)
+}
+}
+
+impl PhysicalExpr for UnboundColumn {
+/// Return a reference to Any that can be used for downcasting
+fn as_any(&self) -> &dyn std::any::Any {
+self
+}
+
+/// Get the data type of this expression, given the schema of the input
+fn data_type(&self, _input_schema: &Schema) -> Result {
+Ok(self.datatype.clone())
+}
+
+/// Decide whehter this expression is nullable, given the schema of the 
input

Review Comment:
   I copied it from DataFusion. 😂 



##
core/src/execution/datafusion/expressions/unbound.rs:
##
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::execution::datafusion::expressions::utils::down_cast_any_ref;
+use arrow_array::RecordBatch;
+use arrow_schema::{DataType, Schema};
+use datafusion::physical_plan::ColumnarValue;
+use datafusion_common::{internal_err, Result};
+use datafusion_physical_expr::PhysicalExpr;
+use std::{
+any::Any,
+hash::{Hash, Hasher},
+sync::Arc,
+};
+
+/// This is similar to `UnKnownColumn` in DataFusion, but it has data type.
+/// This is only used when the column is not bound to a schema, for example, 
the
+/// inputs to aggregation functions in final aggregation. In the case, we 
cannot
+/// bind the aggregation functions to the input schema which is grouping 
columns
+/// and aggregate buffer attributes in Spark (DataFusion has different design).
+/// But when creating certain aggregation functions, we need to know its input
+/// data types. As `UnKnownColumn` doesn't have data type, we implement this
+/// `UnboundColumn` to carr

Re: [I] Possible regression in memory usage [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove commented on issue #517:
URL: 
https://github.com/apache/datafusion-comet/issues/517#issuecomment-2151057179

   The issue is happening with query 8. The behavior I see is that all/most 
cores are at 100% utilization, and memory is not actually high, but the cluster 
becomes unresponsive before an OutOfMemory error is shown in the driver.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add UnboundColumn to carry datatype for unbound reference [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on code in PR #518:
URL: https://github.com/apache/datafusion-comet/pull/518#discussion_r1628505993


##
core/src/execution/datafusion/expressions/unbound.rs:
##
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::execution::datafusion::expressions::utils::down_cast_any_ref;
+use arrow_array::RecordBatch;
+use arrow_schema::{DataType, Schema};
+use datafusion::physical_plan::ColumnarValue;
+use datafusion_common::{internal_err, Result};
+use datafusion_physical_expr::PhysicalExpr;
+use std::{
+any::Any,
+hash::{Hash, Hasher},
+sync::Arc,
+};
+
+/// This is similar to `UnKnownColumn` in DataFusion, but it has data type.
+/// This is only used when the column is not bound to a schema, for example, 
the
+/// inputs to aggregation functions in final aggregation. In the case, we 
cannot
+/// bind the aggregation functions to the input schema which is grouping 
columns
+/// and aggregate buffer attributes in Spark (DataFusion has different design).
+/// But when creating certain aggregation functions, we need to know its input
+/// data types. As `UnKnownColumn` doesn't have data type, we implement this
+/// `UnboundColumn` to carry the data type.
+#[derive(Debug, Hash, PartialEq, Eq, Clone)]
+pub struct UnboundColumn {
+name: String,
+datatype: DataType,
+}
+
+impl UnboundColumn {
+/// Create a new unbound column expression
+pub fn new(name: &str, datatype: DataType) -> Self {
+Self {
+name: name.to_owned(),
+datatype,
+}
+}
+
+/// Get the column name
+pub fn name(&self) -> &str {
+&self.name
+}
+}
+
+impl std::fmt::Display for UnboundColumn {
+fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+write!(f, "{}, datatype: {}", self.name, self.datatype)
+}
+}
+
+impl PhysicalExpr for UnboundColumn {
+/// Return a reference to Any that can be used for downcasting
+fn as_any(&self) -> &dyn std::any::Any {
+self
+}
+
+/// Get the data type of this expression, given the schema of the input
+fn data_type(&self, _input_schema: &Schema) -> Result {
+Ok(self.datatype.clone())
+}
+
+/// Decide whehter this expression is nullable, given the schema of the 
input

Review Comment:
   Oops



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Add UnboundColumn to carry datatype for unbound reference [datafusion-comet]

2024-06-05 Thread via GitHub


huaxingao commented on code in PR #518:
URL: https://github.com/apache/datafusion-comet/pull/518#discussion_r1628501503


##
core/src/execution/datafusion/expressions/unbound.rs:
##
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::execution::datafusion::expressions::utils::down_cast_any_ref;
+use arrow_array::RecordBatch;
+use arrow_schema::{DataType, Schema};
+use datafusion::physical_plan::ColumnarValue;
+use datafusion_common::{internal_err, Result};
+use datafusion_physical_expr::PhysicalExpr;
+use std::{
+any::Any,
+hash::{Hash, Hasher},
+sync::Arc,
+};
+
+/// This is similar to `UnKnownColumn` in DataFusion, but it has data type.
+/// This is only used when the column is not bound to a schema, for example, 
the
+/// inputs to aggregation functions in final aggregation. In the case, we 
cannot
+/// bind the aggregation functions to the input schema which is grouping 
columns
+/// and aggregate buffer attributes in Spark (DataFusion has different design).
+/// But when creating certain aggregation functions, we need to know its input
+/// data types. As `UnKnownColumn` doesn't have data type, we implement this
+/// `UnboundColumn` to carry the data type.
+#[derive(Debug, Hash, PartialEq, Eq, Clone)]
+pub struct UnboundColumn {
+name: String,
+datatype: DataType,
+}
+
+impl UnboundColumn {
+/// Create a new unbound column expression
+pub fn new(name: &str, datatype: DataType) -> Self {
+Self {
+name: name.to_owned(),
+datatype,
+}
+}
+
+/// Get the column name
+pub fn name(&self) -> &str {
+&self.name
+}
+}
+
+impl std::fmt::Display for UnboundColumn {
+fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+write!(f, "{}, datatype: {}", self.name, self.datatype)
+}
+}
+
+impl PhysicalExpr for UnboundColumn {
+/// Return a reference to Any that can be used for downcasting
+fn as_any(&self) -> &dyn std::any::Any {
+self
+}
+
+/// Get the data type of this expression, given the schema of the input
+fn data_type(&self, _input_schema: &Schema) -> Result {
+Ok(self.datatype.clone())
+}
+
+/// Decide whehter this expression is nullable, given the schema of the 
input

Review Comment:
   nit: `whehter` typo



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] docs: changes in documentation [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura commented on PR #512:
URL: https://github.com/apache/datafusion-comet/pull/512#issuecomment-2151046096

   Thank you @SemyonSinchenko @andygrove @viirya merged


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Remove git-commit-id-maven-plugin [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura closed issue #191: Remove git-commit-id-maven-plugin
URL: https://github.com/apache/datafusion-comet/issues/191


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] [NOT A BUG] Why comet does not convert the HashAggregate expression to native in my query? [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura closed issue #503: [NOT A BUG] Why comet does not convert the 
HashAggregate expression to native in my query?
URL: https://github.com/apache/datafusion-comet/issues/503


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Remove git-commit-id-maven-plugin [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura closed issue #191: Remove git-commit-id-maven-plugin
URL: https://github.com/apache/datafusion-comet/issues/191


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] docs: changes in documentation [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura merged PR #512:
URL: https://github.com/apache/datafusion-comet/pull/512


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Plan Comet 0.1.0 Release [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove commented on issue #369:
URL: 
https://github.com/apache/datafusion-comet/issues/369#issuecomment-2151002022

   > I assume the source release will tag the repo with a `release-0.1.0` tag. 
Even though a maven artifact would not be published, it does allow projects to 
build their own, or even add comet as a git submodule, based on a relatively 
'stable' version.
   
   Yes, absolutely. We (or anyone) can choose to release binary artifacts from 
the source release or the tag in the repo. ASF does not have any special 
involvement in that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Simplify code in CometExecIterator and avoid some small overhead [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove merged PR #522:
URL: https://github.com/apache/datafusion-comet/pull/522


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Possible regression in memory usage [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on issue #517:
URL: 
https://github.com/apache/datafusion-comet/issues/517#issuecomment-2150948824

   Is the failed one with OOM a query with xxhash64 so it is not natively run 
before?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: Input batch to ShuffleRepartitioner.insert_batch should not be larger than configured batch size [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on PR #523:
URL: https://github.com/apache/datafusion-comet/pull/523#issuecomment-2150944200

   cc @andygrove @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Possible regression in memory usage [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove commented on issue #517:
URL: 
https://github.com/apache/datafusion-comet/issues/517#issuecomment-2150901696

   The issue starts with the PR that added xxhash64 
(https://github.com/apache/datafusion-comet/pull/424).
   
   I wonder if this is an issue with hashing itself or just that more queries 
are running natively now and I am hitting some other issue :thinking: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] docs: changes in documentation [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on code in PR #512:
URL: https://github.com/apache/datafusion-comet/pull/512#discussion_r1628376610


##
docs/source/user-guide/tuning.md:
##
@@ -39,6 +39,8 @@ It must be set before the Spark context is created. You can 
enable or disable Co
 at runtime by setting `spark.comet.exec.shuffle.enabled` to `true` or `false`.
 Once it is disabled, Comet will fallback to the default Spark shuffle manager.
 
+> **_NOTE:_** At the moment Comet Shuffle is not compatible with Spark AQE 
partition coalesce. To disable set 
`spark.sql.adaptive.coalescePartitions.enabled` to `false`.

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] fix: Input batch to ShuffleRepartitioner.insert_batch should not be larger than configured batch size [datafusion-comet]

2024-06-05 Thread via GitHub


viirya opened a new pull request, #523:
URL: https://github.com/apache/datafusion-comet/pull/523

   ## Which issue does this PR close?
   
   
   
   Closes #498.
   
   ## Rationale for this change
   
   
   
   ## What changes are included in this PR?
   
   
   
   ## How are these changes tested?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


marvinlanhenke commented on PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#issuecomment-2150851089

   > > 1. Interval statistics not supported; possibly due to those 
[lines](https://github.com/apache/datafusion/issues/10752#issuecomment-2150024521)
   > > 2. Type IntervalUnit::MonthDayNano not supported at all by the 
[arrow_writer](https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer/mod.rs#L747)
   > > 
   > > I think those are two distinct issues?
   > 
   > 
   > Yes I think you are right -- any chance you can file a ticket in arrow-rs 
to track writing IntervalMonthDayNano?
   > 
   > 
   
   Sure, I can do that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Make ANSI fallback more granular [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove commented on PR #509:
URL: https://github.com/apache/datafusion-comet/pull/509#issuecomment-2150842134

This is a pretty interesting test failure:

```
   - postgreSQL/groupingsets.sql *** FAILED *** (4 seconds, 331 milliseconds)
 postgreSQL/groupingsets.sql
 Expected "... 1   128
 NULL  1   1   0   1   [1
 NULL  1   1   0   1   2
 NULL  2   1   0   1   4
 NULL  2   1   0   1   8
 NULL  2   1   0   1   16
 NULL  2   1   0   1   32
 NULL  1   1   0   1   64
 NULL  1   1   0   1   128]", but got "... 1   128
 NULL  1   1   0   1   [-1673396112
NULL  2   1   0   1   -1673198320
 NULL  2   1   0   1   -596264016
 NULL  1   1   0   1   -596231120
 NULL  1   1   0   2   65102
 NULL  2   1   0   2   65102]"
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


alamb commented on PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#issuecomment-2150838943

   > 1. Interval statistics not supported; possibly due to those 
[lines](https://github.com/apache/datafusion/issues/10752#issuecomment-2150024521)
   > 2. Type IntervalUnit::MonthDayNano not supported at all by the 
[arrow_writer](https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer/mod.rs#L747)
   > 
   > I think those are two distinct issues?
   
   
   Yes I think you are right -- any chance you can file a ticket in arrow-rs to 
track writing IntervalMonthDayNano?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] bug: org.apache.comet.CometNativeException: range end index 8221 out of range for slice of length 8192 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on issue #498:
URL: 
https://github.com/apache/datafusion-comet/issues/498#issuecomment-2150837074

   Ah, I probably found where it goes wrong.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Prune Parquet RowGroup in a single call to `PruningPredicate::prune`, update StatisticsExtractor API [datafusion]

2024-06-05 Thread via GitHub


NGA-TRAN commented on code in PR #10802:
URL: https://github.com/apache/datafusion/pull/10802#discussion_r1628162150


##
datafusion-examples/examples/parquet_index.rs:
##
@@ -518,21 +518,17 @@ impl ParquetMetadataIndexBuilder {
 
 // extract the parquet statistics from the file's footer
 let metadata = reader.metadata();
+let row_groups = metadata.row_groups();
 
 // Extract the min/max values for each row group from the statistics
-let row_counts = StatisticsConverter::row_counts(reader.metadata())?;
-let value_column_mins = StatisticsConverter::try_new(
+let converter = StatisticsConverter::try_new(
 "value",
-RequestedStatistics::Min,
 reader.schema(),
-)?
-.extract(reader.metadata())?;
-let value_column_maxes = StatisticsConverter::try_new(
-"value",
-RequestedStatistics::Max,
-reader.schema(),
-)?
-.extract(reader.metadata())?;
+reader.parquet_schema(),
+)?;
+let row_counts = 
StatisticsConverter::row_group_row_counts(row_groups.iter())?;
+let value_column_mins = converter.row_group_mins(row_groups.iter())?;
+let value_column_maxes = converter.row_group_maxes(row_groups.iter())?;

Review Comment:
   ❤️ 



##
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##
@@ -323,17 +324,6 @@ fn collect_scalars>>(
 }
 }
 
-/// What type of statistics should be extracted?
-#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
-pub enum RequestedStatistics {
-/// Minimum Value
-Min,
-/// Maximum Value
-Max,
-/// Null Count, returned as a [`UInt64Array`])
-NullCount,
-}
-

Review Comment:
   I agree we do not need this. We store each min/max/row_count as an array 
instead :thus



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] bug: attempt to multiply with overflow in cast string to date [datafusion-comet]

2024-06-05 Thread via GitHub


eejbyfeldt commented on issue #481:
URL: 
https://github.com/apache/datafusion-comet/issues/481#issuecomment-2150830474

   @andygrove Then there is more than one issue. I also recreated a similar 
crash using the date `29-01-01` inside a timestamp.  Spark will correctly 
return the year given this date, but comet crashes.
   
   Stack trace:
   ```
   at comet::parquet::read::valuesdecode(/home/eejbyfeldt/dev/apache/datafusion-comet/core/src/parquet/read/values.rs:800)
   at  as 
comet::parquet::read::values::Decoder>::read_batch(/home/eejbyfeldt/dev/apache/datafusion-comet/core/src/parquet/read/values.rs:853)
   at  as 
comet::parquet::read::values::Decoder>::read(/home/eejbyfeldt/dev/apache/datafusion-comet/core/src/parquet/read/values.rs:844)
   at 
comet::parquet::read::levels::LevelDecoder::read_batch(/home/eejbyfeldt/dev/apache/datafusion-comet/core/src/parquet/read/levels.rs:135)
   at 
comet::parquet::read::column::TypedColumnReader::read_batch(/home/eejbyfeldt/dev/apache/datafusion-comet/core/src/parquet/read/column.rs:546)
   at 
comet::parquet::read::column::ColumnReader::read_batch(/home/eejbyfeldt/dev/apache/datafusion-comet/core/src/parquet/read/column.rs:444)
   at 
comet::parquet::Java_org_apache_comet_parquet_Native_readBatch::{{closure}}(/home/eejbyfeldt/dev/apache/datafusion-comet/core/src/parquet/mod.rs:508)
   at 
comet::errors::curry::{{closure}}(/home/eejbyfeldt/dev/apache/datafusion-comet/core/src/errors.rs:442)
   at 
std::panicking::try::do_call(/rustc/ec08a0337f3556212525dbf1d3b41e19bdf27621/library/std/src/panicking.rs:526)
   
   ```
   Even if the overflow is fixed at that location full support might not be 
easy as that date is out of range for what is supported by chrono::DateTime: 
https://docs.rs/chrono/latest/chrono/naive/struct.NaiveDate.html so the 
existing code using that will be unable to return the date. If one allows the 
overflow at the first location (or runs in release mode) it instead crashes 
here: 
https://github.com/apache/datafusion-comet/blob/24781fb7b3966f787cf72c97f42e2613f24fb2ac/core/src/execution/datafusion/expressions/utils.rs#L178
   since that code assumes that all timestamps will be possible to represent in 
the `chrono` type. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] bug: org.apache.comet.CometNativeException: range end index 8221 out of range for slice of length 8192 [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on issue #498:
URL: 
https://github.com/apache/datafusion-comet/issues/498#issuecomment-2150814297

   I don't see this before. If there is reproducible example for this, it will 
be easier to debug.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Support Ansi mode in abs function [datafusion-comet]

2024-06-05 Thread via GitHub


planga82 commented on PR #500:
URL: https://github.com/apache/datafusion-comet/pull/500#issuecomment-2150802402

   I have done a refactor with all the comments, thanks for the revision!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Make ANSI fallback more granular [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove commented on code in PR #509:
URL: https://github.com/apache/datafusion-comet/pull/509#discussion_r1628307927


##
spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala:
##
@@ -712,17 +712,6 @@ class CometSparkSessionExtensions
 }
 
 override def apply(plan: SparkPlan): SparkPlan = {
-  // DataFusion doesn't have ANSI mode. For now we just disable CometExec 
if ANSI mode is
-  // enabled.
-  if (isANSIEnabled(conf)) {
-if (COMET_ANSI_MODE_ENABLED.get()) {
-  logWarning("Using Comet's experimental support for ANSI mode.")
-} else {
-  logInfo("Comet extension disabled for ANSI mode")
-  return plan
-}
-  }

Review Comment:
   This current check will make all plans fall back to Spark if ANSI is 
enabled, so no native plans will run, unless `COMET_ANSI_MODE_ENABLED` is 
enabled.
   
   The changes in this PR mean that we still have the same check but only if 
the plan actually contains any expressions that would be affected by enabling 
ANSI support and we still fall back to Spark by default unless 
`COMET_ANSI_MODE_ENABLED` is enabled.
   
   The main risk with this PR is if we don't have the complete list of 
expressions that support ANSI mode.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support OneRowRelation to Support Scalar Inputs? [datafusion-comet]

2024-06-05 Thread via GitHub


tshauck commented on issue #516:
URL: 
https://github.com/apache/datafusion-comet/issues/516#issuecomment-2150771572

   Thanks, that's my understanding as well.
   
   Here's the extended explain for reference...
   
   ```
   spark-sql (default)> EXPLAIN EXTENDED SELECT trim('123 ');
   24/06/05 12:03:58 WARN CometSparkSessionExtensions$CometExecRule: Comet 
cannot execute some parts of this plan natively because Execute ExplainCommand 
is not supported
   Unsupported op node name: Scan OneRowRelation, Full class path: 
org.apache.spark.sql.execution.RDDScanExec
   24/06/05 12:03:58 WARN CometSparkSessionExtensions$CometExecRule: Comet 
cannot execute some parts of this plan natively because:
   - Execute ExplainCommand is not supported
   - CommandResult is not supported
   == Parsed Logical Plan ==
   'Project [unresolvedalias('trim(123 ), None)]
   +- OneRowRelation
   
   == Analyzed Logical Plan ==
   trim(123 ): string
   Project [trim(123 , None) AS trim(123 )#11]
   +- OneRowRelation
   
   == Optimized Logical Plan ==
   Project [123 AS trim(123 )#11]
   +- OneRowRelation
   
   == Physical Plan ==
   *(1) Project [123 AS trim(123 )#11]
   +- *(1) Scan OneRowRelation[]
   
   Time taken: 0.071 seconds, Fetched 1 row(s)
   ```
   
   This is from a debug statement I put in that shows the node that can't be 
transformed...
   
   ```
   Unsupported op node name: Scan OneRowRelation, Full class path: 
org.apache.spark.sql.execution.RDDScanExec
   ```
   
   When a constant is used with a parquet table things seem to work fine.
   
   ```
   spark-sql (default)> EXPLAIN EXTENDED SELECT trim(col), trim('123 ') FROM qq;
   == Parsed Logical Plan ==
   'Project [unresolvedalias('trim('col), None), unresolvedalias('trim(123 ), 
None)]
   +- 'UnresolvedRelation [qq], [], false
   
   == Analyzed Logical Plan ==
   trim(col): string, trim(123 ): string
   Project [trim(col#15, None) AS trim(col)#26, trim(123 , None) AS trim(123 
)#27]
   +- SubqueryAlias spark_catalog.default.qq
  +- Relation spark_catalog.default.qq[col#15] parquet
   
   == Optimized Logical Plan ==
   Project [trim(col#15, None) AS trim(col)#26, 123 AS trim(123 )#27]
   +- Relation spark_catalog.default.qq[col#15] parquet
   
   == Physical Plan ==
   *(1) ColumnarToRow
   +- CometProject [trim(col)#26, trim(123 )#27], [trim(col#15, None) AS 
trim(col)#26, 123 AS trim(123 )#27]
  +- CometScan parquet spark_catalog.default.qq[col#15] Batched: true, 
DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 
paths)[file:/Users/thauck/personal/code/github.com/tshauck/arrow-datafusion-c...,
 PartitionFilters: [], PushedFilters: [], ReadSchema: struct
   
   Time taken: 0.065 seconds, Fetched 1 row(s)
   ```
   
   Sounds like it makes sense to add the native node, so I'll give it a shot 
over the next week or so. Though please let me know if you immediate thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[PR] chore: Simplify code in CometExecIterator and avoid some small overhead [datafusion-comet]

2024-06-05 Thread via GitHub


andygrove opened a new pull request, #522:
URL: https://github.com/apache/datafusion-comet/pull/522

   ## Which issue does this PR close?
   
   
   
   N/A
   
   ## Rationale for this change
   
   
   
   In the original code, `getNextBatch` called `executeNative` which would 
create a `Batch` object which is then discarded in `getNextBatch`. This seems 
like unnecessary overhead even though it is small.
   
   ## What changes are included in this PR?
   
   
   
   - Remove the interim `Batch` structure. Combine the two methods.
   - Add some comments
   
   ## How are these changes tested?
   
   
   
   Existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: use total ordering in the min & max accumulator for floats [datafusion]

2024-06-05 Thread via GitHub


westonpace commented on PR #10627:
URL: https://github.com/apache/datafusion/pull/10627#issuecomment-2150740158

   I suspect this means that the min/max function for intervals is also 
incorrectly propagating nulls but I tried to make a unit test for intervals and 
get an error that there is no accumulator for intervals and so I guess we can 
cross that bridge when we come to it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] fix: use total ordering in the min & max accumulator for floats [datafusion]

2024-06-05 Thread via GitHub


westonpace commented on PR #10627:
URL: https://github.com/apache/datafusion/pull/10627#issuecomment-2150738379

   > @westonpace what is the status / plan with this PR? It has failing CI 
tests but is not marked as a draft. Are you still planning on working with it? 
Do you need help to push it along?
   
   Thanks for the push.  I have now discovered sqllogictests :)
   
   Turns out my approach didn't work because partial_cmp propagates nulls (and 
the previous impl ignored nulls).  I think I've got it fixed now.  The earlier 
describe test failure was also for this same reason (it was propagating a null, 
not a nan) and so I was able to revert that change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


marvinlanhenke commented on PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#issuecomment-2150714644

   @alamb thanks for the review. I have adressed your comments PTAL.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


marvinlanhenke commented on code in PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#discussion_r1628177847


##
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##
@@ -256,6 +259,13 @@ macro_rules! get_statistic {
 Some(DataType::Float16) => {
 
Some(ScalarValue::Float16(from_bytes_to_f16(s.$bytes_func(
 }
+Some(DataType::Interval(unit)) => {
+match unit {
+IntervalUnit::YearMonth => 
unimplemented!("Interval statistics not yet supported by parquet"),

Review Comment:
   I merged #10711 and [this 
](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L499-L515)
 should do the trick already. 
   
   So for now, I'm expecting a `null_array` in the test-cases which should be 
changed once apache/arrow-rs#5847 is solved.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



[I] Add ability to receive an iterator over the inputs of a LogicalPlan instead of a Vec. [datafusion]

2024-06-05 Thread via GitHub


LorrensP-2158466 opened a new issue, #10808:
URL: https://github.com/apache/datafusion/issues/10808

   ### Is your feature request related to a problem or challenge?
   
   Currently, the only way to get the inputs of a LogicalPlan is to call 
`inputs()`, which returns a `Vec<&LogicalPlan>`. But I have noticed that there 
can be unnecessary calls to `into_iter()` or `iter()` on this vector.
   Furthermore, the function returns a lot of Vectors of size 1, which can 
create unnecessary allocations.
   This also applies to UserDefinedLogicalNode(Core), since I don't think the 
compiler can see through the use of `inputs()` and convert the Vec to an 
iterator.
   
   ### Describe the solution you'd like
   
   To change the API to return an Iterator instead of a vector requires a lot 
of rewriting, so I think it's maybe nicer to create a new function that returns 
an iterator over the inputs like this:
   ```rust 
   fn inputs_iter(&self) -> impl Iterator {}
   ```.
   We also have to extend the API of UserDefinedLogicalNode(Core) to have the 
same function so extension node's have this ability as well.
   
   So instead of all the `vec![ input ]` calls, we can use `std::iter::once`, 
empty vec's can be empty iterators and in the case of an extension node we can 
just call `node.inputs_iter()`. 
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   This issue is purely for changing the API, so I'm willing to do it if this 
is accepted. To use this function in the source code is a bit more work, so I 
think it will be better to open another issue for changing the calls to 
`inputs()` into `inputs_iter()`.
   
   I may be entirely wrong since I'm fairly new to DataFusion, so any feedback 
is greatly appreciated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


marvinlanhenke commented on code in PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#discussion_r1628177847


##
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##
@@ -256,6 +259,13 @@ macro_rules! get_statistic {
 Some(DataType::Float16) => {
 
Some(ScalarValue::Float16(from_bytes_to_f16(s.$bytes_func(
 }
+Some(DataType::Interval(unit)) => {
+match unit {
+IntervalUnit::YearMonth => 
unimplemented!("Interval statistics not yet supported by parquet"),

Review Comment:
   > in general, in rust `unimplemented!()` results in a panic which is not a 
great user experience.
   > 
   > I think this code purposely ignores errors (in order to gracefully handle 
parquet files that might not have the expected statistics or that were created 
from some other writer)
   > 
   > Thus, I suggest changing these cases from `panic` to `None` (and then 
adjusting the test appropriately)
   > 
   > If we return `None`, once statistics are properly stored by the parquet-rs 
writer, the test will fail on next upgrade and we can update the test with the 
correct values
   
   Yes, I see with the merge of #10711 the match arms changed; so I'll take a 
look at that and adjust accordingly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


marvinlanhenke commented on code in PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#discussion_r1628209250


##
datafusion/core/tests/parquet/mod.rs:
##
@@ -925,6 +932,71 @@ fn make_dict_batch() -> RecordBatch {
 .unwrap()
 }
 
+fn make_interval_batch(offset: i32) -> RecordBatch {
+let schema = Schema::new(vec![
+Field::new(
+"year_month",
+DataType::Interval(IntervalUnit::YearMonth),
+true,
+),
+Field::new("day_time", DataType::Interval(IntervalUnit::DayTime), 
true),
+Field::new(
+"month_day_nano",
+DataType::Interval(IntervalUnit::MonthDayNano),
+true,
+),
+]);
+let schema = Arc::new(schema);
+
+let ym_arr = IntervalYearMonthArray::from(vec![
+Some(IntervalYearMonthType::make_value(1 + offset, 1 + offset)),

Review Comment:
   thank you - fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Make ANSI fallback more granular [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura commented on code in PR #509:
URL: https://github.com/apache/datafusion-comet/pull/509#discussion_r162818


##
spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala:
##
@@ -712,17 +712,6 @@ class CometSparkSessionExtensions
 }
 
 override def apply(plan: SparkPlan): SparkPlan = {
-  // DataFusion doesn't have ANSI mode. For now we just disable CometExec 
if ANSI mode is
-  // enabled.
-  if (isANSIEnabled(conf)) {
-if (COMET_ANSI_MODE_ENABLED.get()) {
-  logWarning("Using Comet's experimental support for ANSI mode.")
-} else {
-  logInfo("Comet extension disabled for ANSI mode")
-  return plan
-}
-  }

Review Comment:
   For Spark 3.x, I feel we still need this protection in case users enable 
ANSI mode as well as native execution.
   If I understand correctly we have not set up CI yet with ANSI enabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support OneRowRelation to Support Scalar Inputs? [datafusion-comet]

2024-06-05 Thread via GitHub


viirya commented on issue #516:
URL: 
https://github.com/apache/datafusion-comet/issues/516#issuecomment-2150642140

   `OneRowRelation ` is logical node. It will be planned as a physical node 
`RDDScanExec` with one single row (empty columns). It should be easy to 
implement in Comet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


marvinlanhenke commented on PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#issuecomment-2150625815

   > > @alamb PTAL.
   > > As described in the PR I tried to prepare as much as possible, although 
statistics are not yet supported.
   > > Yet another issue: The `IntervalUnit::MonthDayNano` is also not 
[supported](https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer/mod.rs#L747).
   > > So I'll guess we have to open 2 tickets in arrow-rs?
   > 
   > I filed 
[apache/arrow-rs#5847](https://github.com/apache/arrow-rs/issues/5847). I 
didn't quite understand your suggestion about two tickets
   
   I was thinking about two tickets:
   1. Interval statistics not supported; possibly due to those 
[lines](https://github.com/apache/datafusion/issues/10752#issuecomment-2150024521)
   2. Type IntervalUnit::MonthDayNano not supported at all by the 
[arrow_writer](https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer/mod.rs#L747)
   
   I think those are two distinct issues?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


marvinlanhenke commented on code in PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#discussion_r1628177847


##
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##
@@ -256,6 +259,13 @@ macro_rules! get_statistic {
 Some(DataType::Float16) => {
 
Some(ScalarValue::Float16(from_bytes_to_f16(s.$bytes_func(
 }
+Some(DataType::Interval(unit)) => {
+match unit {
+IntervalUnit::YearMonth => 
unimplemented!("Interval statistics not yet supported by parquet"),

Review Comment:
   in this case we never reach the match arm, since 
   
   > in general, in rust `unimplemented!()` results in a panic which is not a 
great user experience.
   > 
   > I think this code purposely ignores errors (in order to gracefully handle 
parquet files that might not have the expected statistics or that were created 
from some other writer)
   > 
   > Thus, I suggest changing these cases from `panic` to `None` (and then 
adjusting the test appropriately)
   > 
   > If we return `None`, once statistics are properly stored by the parquet-rs 
writer, the test will fail on next upgrade and we can update the test with the 
correct values
   
   Yes, I see with the merge of #10711 the match arms changed; so I'll take a 
look at that and adjust accordingly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Support OneRowRelation to Support Scalar Inputs? [datafusion-comet]

2024-06-05 Thread via GitHub


kazuyukitanimura commented on issue #516:
URL: 
https://github.com/apache/datafusion-comet/issues/516#issuecomment-2150607171

   @tshauck Thank you for working on this.
   I need to look into the details but the name `OneRowRelation` sounds like it 
is reading from a table. What about reading parquet directly?
   Also getting `.explain(true)` may help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


alamb commented on PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#issuecomment-2150605789

   > @alamb PTAL.
   > 
   > As described in the PR I tried to prepare as much as possible, although 
statistics are not yet supported.
   > 
   > Yet another issue: The `IntervalUnit::MonthDayNano` is also not 
[supported](https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer/mod.rs#L747).
   > 
   > So I'll guess we have to open 2 tickets in arrow-rs?
   
   I filed https://github.com/apache/arrow-rs/issues/5847. I didn't quite 
understand your suggestion about two tickets


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Support Ansi mode in abs function [datafusion-comet]

2024-06-05 Thread via GitHub


parthchandra commented on code in PR #500:
URL: https://github.com/apache/datafusion-comet/pull/500#discussion_r1628163758


##
core/src/execution/datafusion/expressions/abs.rs:
##
@@ -0,0 +1,87 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::datatypes::DataType;
+use arrow_schema::ArrowError;
+use datafusion::logical_expr::{ColumnarValue, ScalarUDFImpl, Signature};
+use datafusion_common::DataFusionError;
+use datafusion_functions::math;
+use std::{any::Any, sync::Arc};
+
+use crate::errors::CometError;
+
+use super::EvalMode;
+
+fn arithmetic_overflow_error(from_type: &str) -> CometError {
+CometError::ArithmeticOverflow {
+from_type: from_type.to_string(),
+}
+}
+
+#[derive(Debug)]
+pub struct CometAbsFunc {
+inner_abs_func: Arc,
+eval_mode: EvalMode,
+data_type_name: String,
+}
+
+impl CometAbsFunc {
+pub fn new(eval_mode: EvalMode, data_type_name: String) -> Self {
+Self {
+inner_abs_func: math::abs().inner(),
+eval_mode,
+data_type_name,
+}
+}
+}
+
+impl ScalarUDFImpl for CometAbsFunc {
+fn as_any(&self) -> &dyn Any {
+self
+}
+fn name(&self) -> &str {
+"abs"
+}
+
+fn signature(&self) -> &Signature {
+self.inner_abs_func.signature()
+}
+
+fn return_type(&self, arg_types: &[DataType]) -> Result {
+self.inner_abs_func.return_type(arg_types)
+}
+
+fn invoke(&self, args: &[ColumnarValue]) -> Result {
+match self.inner_abs_func.invoke(args) {
+Ok(result) => Ok(result),
+Err(DataFusionError::ArrowError(ArrowError::ComputeError(msg), 
trace))
+if msg.contains("overflow") =>

Review Comment:
   It would help to log an issue to keep track



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [I] Feedback request for providing configurable UDF functions [datafusion]

2024-06-05 Thread via GitHub


Omega359 commented on issue #10744:
URL: https://github.com/apache/datafusion/issues/10744#issuecomment-2150594778

   > I have read the context now and understand that this is about `safe` mode 
or what Spark calls `ANSI` mode.
   > 
   > Isn't this just a case of adding a new flag to the session context that 
UDFs can choose to use when deciding whether to return null or throw an error?
   
   That would be nice ... except UDF's don't have a way to access the session 
context currently :( Option #2 and #3 provide that via different mechanisms.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] chore: Create initial release process scripts for official ASF source release [datafusion-comet]

2024-06-05 Thread via GitHub


parthchandra commented on code in PR #429:
URL: https://github.com/apache/datafusion-comet/pull/429#discussion_r1628161655


##
dev/release/create-tarball.sh:
##
@@ -0,0 +1,135 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+# Adapted from 
https://github.com/apache/arrow-rs/tree/master/dev/release/create-tarball.sh
+
+# This script creates a signed tarball in
+# dev/dist/apache-datafusion-comet--.tar.gz and uploads it to
+# the "dev" area of the dist.apache.datafusion repository and prepares an
+# email for sending to the d...@datafusion.apache.org list for a formal
+# vote.
+#
+# See release/README.md for full release instructions
+#
+# Requirements:
+#
+# 1. gpg setup for signing and have uploaded your public
+# signature to https://pgp.mit.edu/
+#
+# 2. Logged into the apache svn server with the appropriate
+# credentials
+#
+# 3. Install the requests python package
+#
+#
+# Based in part on 02-source.sh from apache/arrow
+#
+
+set -e
+
+DEV_RELEASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+DEV_RELEASE_TOP_DIR="$(cd "${DEV_RELEASE_DIR}/../../" && pwd)"
+
+if [ "$#" -ne 2 ]; then
+echo "Usage: $0  "
+echo "ex. $0 4.1.0 2"
+exit
+fi
+
+if [[ -z "${GH_TOKEN}" ]]; then
+echo "Please set personal github token through GH_TOKEN environment 
variable"
+exit
+fi
+
+version=$1
+rc=$2
+tag="${version}-rc${rc}"
+
+echo "Attempting to create ${tarball} from tag ${tag}"
+release_hash=$(cd "${DEV_RELEASE_TOP_DIR}" && git rev-list --max-count=1 
${tag})
+
+release=apache-datafusion-comet-${version}
+distdir=${DEV_RELEASE_TOP_DIR}/dev/dist/${release}-rc${rc}
+tarname=${release}.tar.gz
+tarball=${distdir}/${tarname}
+url="https://dist.apache.org/repos/dist/dev/datafusion/${release}-rc${rc}";
+
+if [ -z "$release_hash" ]; then
+echo "Cannot continue: unknown git tag: ${tag}"
+fi
+
+echo "Draft email for d...@datafusion.apache.org mailing list"
+echo ""
+echo "-"
+cat  ${tarball})
+
+echo "Running rat license checker on ${tarball}"
+${DEV_RELEASE_DIR}/run-rat.sh ${tarball}
+
+echo "Signing tarball and creating checksums"
+gpg --armor --output ${tarball}.asc --detach-sig ${tarball}
+# create signing with relative path of tarball
+# so that they can be verified with a command such as
+#  shasum --check apache-datafusion-comet-0.1.0-rc1.tar.gz.sha512
+(cd ${distdir} && shasum -a 256 ${tarname}) > ${tarball}.sha256
+(cd ${distdir} && shasum -a 512 ${tarname}) > ${tarball}.sha512
+
+
+echo "Uploading to datafusion dist/dev to ${url}"

Review Comment:
   Also, IIRC, when we do binary releases, for maven artifacts,  promoting 
release candidates to releases also requires PMC. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] feat: Add fuzz testing for arithmetic expressions [datafusion-comet]

2024-06-05 Thread via GitHub


huaxingao commented on PR #519:
URL: https://github.com/apache/datafusion-comet/pull/519#issuecomment-2150588332

   Thanks @andygrove for the PR! For binary arithmetic expressions, shall we 
also include bitwise operation such as `BitwiseAnd`, `BitwiseOr`, `BitwiseXor`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Handle EmptyRelation during SQL unparsing [datafusion]

2024-06-05 Thread via GitHub


goldmedal commented on PR #10803:
URL: https://github.com/apache/datafusion/pull/10803#issuecomment-2150586926

   Thanks @alamb @devinjdangelo 
   I think I will address @alamb's comments in the follow-up PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Extract Parquet statistics from `Interval` column [datafusion]

2024-06-05 Thread via GitHub


alamb commented on code in PR #10801:
URL: https://github.com/apache/datafusion/pull/10801#discussion_r1628143605


##
datafusion/core/tests/parquet/mod.rs:
##
@@ -925,6 +932,71 @@ fn make_dict_batch() -> RecordBatch {
 .unwrap()
 }
 
+fn make_interval_batch(offset: i32) -> RecordBatch {
+let schema = Schema::new(vec![
+Field::new(
+"year_month",
+DataType::Interval(IntervalUnit::YearMonth),
+true,
+),
+Field::new("day_time", DataType::Interval(IntervalUnit::DayTime), 
true),
+Field::new(
+"month_day_nano",
+DataType::Interval(IntervalUnit::MonthDayNano),
+true,
+),
+]);
+let schema = Arc::new(schema);
+
+let ym_arr = IntervalYearMonthArray::from(vec![
+Some(IntervalYearMonthType::make_value(1 + offset, 1 + offset)),

Review Comment:
   in general I suggest changing this so the values of the two fields are 
different (so that it would catch bugs where the fields weren't properly 
interpreted)
   
   For example, instead of
   
   ```rust
   Some(IntervalYearMonthType::make_value(1 + offset, 1 + offset)),
   ```
   
   Something like (use `10 + offset` in the second field so the values are 
different)
   
   ```rust
   Some(IntervalYearMonthType::make_value(1 + offset, 10 + offset)),
   ```
   
   The same applies to the rest of the values in this code



##
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##
@@ -256,6 +259,13 @@ macro_rules! get_statistic {
 Some(DataType::Float16) => {
 
Some(ScalarValue::Float16(from_bytes_to_f16(s.$bytes_func(
 }
+Some(DataType::Interval(unit)) => {
+match unit {
+IntervalUnit::YearMonth => 
unimplemented!("Interval statistics not yet supported by parquet"),

Review Comment:
   in general, in rust `unimplemented!()` results in a panic which is not a 
great user experience.
   
   I think this code purposely ignores errors (in order to gracefully handle 
parquet files that might not have the expected statistics or that were created 
from some other writer) 
   
   Thus, I suggest changing these cases from `panic` to `None` (and then 
adjusting the test appropriately)
   
   If we return `None`, once statistics are properly stored by the parquet-rs 
writer, the test will fail on next upgrade and we can update the test with the 
correct values



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



  1   2   3   >