Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
mbutrovich closed issue #1820: Use sha2 implementation from datafusion-spark crate URL: https://github.com/apache/datafusion-comet/issues/1820 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
rishvin commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-3146120825 Opened PR https://github.com/apache/datafusion-comet/pull/2063 for using DF SHA2. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
rishvin commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-3063537883 Should be able to open Comet's PR after https://github.com/apache/datafusion-comet/issues/1993 is closed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
andygrove commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-3006624446 > [@andygrove](https://github.com/andygrove) can I backport [SHA2-fix](https://github.com/apache/datafusion/pull/16350) to branch-48 of datafusion ? I tried updating with datafusion-main branch until my commit, however, looks like there are many changes in between and they are causing different test failures in Comet. We plan on releasing Comet 0.9.0 in the next few days, hopefully, and then we can have Comet main branch start using a pinned dependency on DataFusion and start making updated for the changes that have happened in DF since 48 was released. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
rishvin commented on issue #1820:
URL:
https://github.com/apache/datafusion-comet/issues/1820#issuecomment-3002947426
> > [@andygrove](https://github.com/andygrove) can I backport
[SHA2-fix](https://github.com/apache/datafusion/pull/16350) to branch-48 of
datafusion ? I tried updating with datafusion-main branch until my commit,
however, looks like there are many changes in between and they are causing
different test failures in Comet.
>
> Looks like the test-cases are failing in Comet after updating the
Datafusion dependency with main-branch due to this
[commit](https://github.com/apache/datafusion/commit/0c3037404929fc3a3c4fbf6b9b7325d422ce10bd).
This commit was revert in the branch-48
[here](https://github.com/apache/datafusion/commit/c76c1f076cca6f1922de8ba86b98c05b6a27e7ac),
however it wasn't in the main branch. And the failures that I encountered were
also related to Window functions tests,
>
> For eg, I ran this: `./mvnw test -Dtest=None
-Dsuites="org.apache.comet.exec.CometExecSuite Windows support"` and it failed
with subraction underflow,
>
> ```
> Windows support *** FAILED *** (7 seconds, 210 milliseconds)
> org.apache.spark.SparkException: Job aborted due to stage failure: Task
0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0
(TID 17) (pop-os executor driver): org.apache.comet.CometNativeException:
attempt to subtract with overflow
> at
comet::errors::init::{{closure}}(/home/rishabjoshi/Projects/datafusion-comet/native/core/src/errors.rs:151)
> at as
core::ops::function::Fn>::call(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/alloc/src/boxed.rs:2007)
> at
std::panicking::rust_panic_with_hook(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:836)
> at
std::panicking::begin_panic_handler::{{closure}}(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:694)
> at
std::sys::backtrace::__rust_end_short_backtrace(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/sys/backtrace.rs:168)
> at
rust_begin_unwind(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:692)
> at
core::panicking::panic_fmt(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:75)
> at
core::panicking::panic_const::panic_const_sub_overflow(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:178)
> at
datafusion_expr::window_state::WindowAggState::update(/home/rishabjoshi/Projects/datafusion/datafusion/expr/src/window_state.rs:95)
> at
datafusion_physical_expr::window::window_expr::AggregateWindowExpr::aggregate_evaluate_stateful(/home/rishabjoshi/Projects/datafusion/datafusion/physical-expr/src/window/window_expr.rs:260)
> at
::evaluate_stateful(/home/rishabjoshi/Projects/datafusion/datafusion/physical-expr/src/window/aggregate.rs:148)
> at
datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::compute_aggregates(/home/rishabjoshi/Projects/datafusion/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:983)
> at
datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::poll_next_inner(/home/rishabjoshi/Projects/datafusion/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:1033)
> at
::poll_next(/home/rishabjoshi/Projects/datafusion/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:949)
> ```
>
> After applying the revert I see the test case passing in main also.
>
> [@andygrove](https://github.com/andygrove) would you prefer if I apply
this revert in the main also? Looks like applying the revert in main is no
longer clean because there were many conflicting patches in between now.
However, if you prefer, rather than revert I might disable that pieces of code
within some property/feature-flag that will be remain turned off. I can do this
piece of work as part of this ticket itself.
Ok, this issue is fixed here:
https://github.com/apache/datafusion/pull/16430
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
rishvin commented on issue #1820:
URL:
https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2990142584
> [@andygrove](https://github.com/andygrove) can I backport
[SHA2-fix](https://github.com/apache/datafusion/pull/16350) to branch-48 of
datafusion ? I tried updating with datafusion-main branch until my commit,
however, looks like there are many changes in between and they are causing
different test failures in Comet.
Looks like the test-cases are failing in Comet after updating the Datafusion
dependency with main-branch due to this
[commit](https://github.com/apache/datafusion/commit/0c3037404929fc3a3c4fbf6b9b7325d422ce10bd).
This commit was revert in the branch-48
[here](https://github.com/apache/datafusion/commit/c76c1f076cca6f1922de8ba86b98c05b6a27e7ac),
however it wasn't in the main branch. And the failures that I encountered were
also related to Window functions tests,
For eg, I ran this: `./mvnw test -Dtest=None
-Dsuites="org.apache.comet.exec.CometExecSuite Windows support"`
and it failed with subraction underflow,
```
Windows support *** FAILED *** (7 seconds, 210 milliseconds)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0
(TID 17) (pop-os executor driver): org.apache.comet.CometNativeException:
attempt to subtract with overflow
at
comet::errors::init::{{closure}}(/home/rishabjoshi/Projects/datafusion-comet/native/core/src/errors.rs:151)
at as
core::ops::function::Fn>::call(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/alloc/src/boxed.rs:2007)
at
std::panicking::rust_panic_with_hook(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:836)
at
std::panicking::begin_panic_handler::{{closure}}(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:694)
at
std::sys::backtrace::__rust_end_short_backtrace(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/sys/backtrace.rs:168)
at
rust_begin_unwind(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/std/src/panicking.rs:692)
at
core::panicking::panic_fmt(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:75)
at
core::panicking::panic_const::panic_const_sub_overflow(/rustc/4d91de4e48198da2e33413efdcd9cd2cc0c46688/library/core/src/panicking.rs:178)
at
datafusion_expr::window_state::WindowAggState::update(/home/rishabjoshi/Projects/datafusion/datafusion/expr/src/window_state.rs:95)
at
datafusion_physical_expr::window::window_expr::AggregateWindowExpr::aggregate_evaluate_stateful(/home/rishabjoshi/Projects/datafusion/datafusion/physical-expr/src/window/window_expr.rs:260)
at
::evaluate_stateful(/home/rishabjoshi/Projects/datafusion/datafusion/physical-expr/src/window/aggregate.rs:148)
at
datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::compute_aggregates(/home/rishabjoshi/Projects/datafusion/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:983)
at
datafusion_physical_plan::windows::bounded_window_agg_exec::BoundedWindowAggStream::poll_next_inner(/home/rishabjoshi/Projects/datafusion/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:1033)
at
::poll_next(/home/rishabjoshi/Projects/datafusion/datafusion/physical-plan/src/windows/bounded_window_agg_exec.rs:949)
```
After applying the revert I see the test case passing in main also.
@andygrove would you prefer if I apply this revert in the main also? Looks
like applying the revert in main is no longer clean because there were many
conflicting patches in between now. However, if you prefer, rather than revert
I might disable that pieces of code within some property/feature-flag that will
be remain turned off.
I can do this piece of work as part of this ticket itself.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
rishvin commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2988838360 @andygrove can I backport [SHA2-fix](https://github.com/apache/datafusion/pull/16350) to branch-48 of datafusion ? I tried updating with datafusion-main branch until my commit, however, looks like there are many changes in between and they are causing different test failures in Comet. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
rishvin commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2982012440 I should be able to resume working on this later this week. I might have to file another ticket to first upgrade the datafusion dependency before making Comet changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
andygrove commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2954102662 Thanks for working on this @rishvin. You are correct that Spark doesn't support UInt32 because it is JVM-based. JVM only has signed ints. The upper-case output seems to be a bug. I suggest filing a bug report in DataFusion, linking to this issue. Perhaps you would be interested in creating a PR in DataFusion to resolve these issues? cc @shehabgamin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
rishvin commented on issue #1820:
URL:
https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2953616585
Hi @andygrove, I tried the following approach and looks like there is some
discrepancy in the Datafusion's `SparkSha2` output with Spark.
**This is what I attempted**
Removed the existing ssh(224|256|384|512) and other related codes from
[comet_scalar_funcs.rs](https://github.com/rishvin/datafusion-comet/blob/1b1d6185ede9175887de1e9ec7f48422c2a64a10/native/spark-expr/src/comet_scalar_funcs.rs#L120)
Registered the `SparkSha2` UDF
[here](https://github.com/rishvin/datafusion-comet/blob/1b1d6185ede9175887de1e9ec7f48422c2a64a10/native/core/src/execution/planner.rs#L160).
Modified the serialization logic at Spark to roughly like so,
```
case Sha2(left, numBits) =>
if (!numBits.foldable) {
withInfo(expr, "non literal numBits is not supported")
return None
}
val childExpr = exprToProtoInternal(left, inputs, binding)
val bitsExpr = exprToProtoInternal(numBits, inputs, binding)
scalarFunctionExprToProtoWithReturnType("sha2", StringType, childExpr,
bitsExpr)
```
With these changes we hit the following exception - `Unsupported argument
types for sha2 function`.
This is because the `SparkSha2` pattern matching supports `numBits` type to
be UInt32 but we are sending type Int32 type from spark (Ref:
[here](https://github.com/apache/datafusion/blob/1daa5ed5cc51546904d45e23cc148601d973942a/datafusion/spark/src/function/hash/sha2.rs#L141)).
I checked the supported types for Spark which we define
[here](https://github.com/rishvin/datafusion-comet/blob/1b1d6185ede9175887de1e9ec7f48422c2a64a10/spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala#L90),
but looks like Spark doesn't support UInt32.
To test things further, I duplicated the `SparkSha2` into Comet and added
Int32 into the pattern matching. This time the output response was successfully
returned but the results did not match with the expected spark result. The hash
returned by the `SparkSha2` is in uppercase, however, Spark returns SHA2 in
lowercase. Here is an example of the two,
Comet returned -
`2C83E9E8A39D60F7FCD3CFEC29C154260AA069F91CD40C972756F9354C64594E`
Spark returned -
`2c83e9e8a39d60f7fcd3cfec29c154260aa069f91cd40c972756f9354c64594e`.
This happened because the
[hex_encode](https://github.com/apache/datafusion/blob/1daa5ed5cc51546904d45e23cc148601d973942a/datafusion/spark/src/function/math/hex.rs#L163)
method used by the `SparkSha2` sets the lowercase=false (second parameter).
However in the existing Comet implementation lowercase is set to true,
[here](https://github.com/rishvin/datafusion-comet/blob/1b1d6185ede9175887de1e9ec7f48422c2a64a10/native/spark-expr/src/math_funcs/hex.rs#L56).
Hence, we are getting the matching results from comet.
To make it work we might have to do following changes to the Datafusion's
`SparkSha2`,
- Add support for pattern matching Int32 type.
- Add support to return hashes in lowercase.
Alternatively, if are not suppose to make these change in Datafusion, we
might have to create some wrappers over `SparkSha2` in Comet.
Let me know your thoughts on this or I'm missing anything.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
andygrove commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2945503093 Thanks @rishvin. I assigned the issue to you. Let me know if you have any questions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]
rishvin commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2945492404 @andygrove I can work on this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
