Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
SKY-ALIN closed pull request #1901: fix: SortMergeJoin for timestamp keys URL: https://github.com/apache/datafusion-comet/pull/1901 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
parthchandra commented on PR #1901:
URL:
https://github.com/apache/datafusion-comet/pull/1901#issuecomment-3006474262
I don't think this PR is making the correct change.
With this PR the removed test fails to execute the query (let alone pass the
assertion)
```
test("SortMergeJoin with unsupported key type should fall back to Spark") {
withSQLConf(
SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Asia/Kathmandu",
SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
withTable("t1", "t2") {
sql("CREATE TABLE t1(name STRING, time TIMESTAMP) USING PARQUET")
sql("INSERT OVERWRITE t1 VALUES('a', timestamp'2019-01-01
11:11:11')")
sql("CREATE TABLE t2(name STRING, time TIMESTAMP) USING PARQUET")
sql("INSERT OVERWRITE t2 VALUES('a', timestamp'2019-01-01
11:11:11')")
val df = sql("SELECT * FROM t1 JOIN t2 ON t1.time = t2.time")
val (sparkPlan, cometPlan) = checkSparkAnswer(df)//
should NOT fail here, but does
assert(sparkPlan.canonicalized === cometPlan.canonicalized) //
should fail here
}
}
}
```
The reason for the failure is -
```
org.apache.comet.CometNativeException: Unsupported data type in sort merge
join comparator: Timestamp(Microsecond, Some("UTC"))
```
This means that `supportedSortMergeJoinEqualType` should not return true for
`Timestamp`
The newly added test does not exercise the change. The plan (see below) does
not include a SMJ and the test passes with and without the changes in PR.
```
AdaptiveSparkPlan isFinalPlan=false
+- BroadcastHashJoin [ts#4], [ts#10], Inner, BuildRight, false
:- LocalTableScan [ts#4]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, timestamp,
true]),false), [plan_id=106]
+- LocalTableScan [ts#10]
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
parthchandra commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2167482793
##
spark/src/test/scala/org/apache/comet/exec/CometJoinSuite.scala:
##
@@ -54,25 +54,6 @@ class CometJoinSuite extends CometTestBase {
.toSeq)
}
- test("SortMergeJoin with unsupported key type should fall back to Spark") {
Review Comment:
A DF join on timestamp columns does not take into account time zones? Then
it wouldn't be correct, would it?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
SKY-ALIN commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2164969731
##
spark/src/test/scala/org/apache/comet/exec/CometJoinSuite.scala:
##
@@ -54,25 +54,6 @@ class CometJoinSuite extends CometTestBase {
.toSeq)
}
- test("SortMergeJoin with unsupported key type should fall back to Spark") {
Review Comment:
Such a test will not work as the current implementation of datafusion
doesn't include this case.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
parthchandra commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2164519556
##
spark/src/test/scala/org/apache/comet/exec/CometJoinSuite.scala:
##
@@ -54,25 +54,6 @@ class CometJoinSuite extends CometTestBase {
.toSeq)
}
- test("SortMergeJoin with unsupported key type should fall back to Spark") {
Review Comment:
The test you have added is great. Thank you!
However this removed test exercises a few things that we should also check
for - Timestamps read from Parquet, and the actual plan created. For the latter
you can simply change the test to use `checkSparkAnswerAndOperator` and remove
the check that the canonicalized plans are the same.
Also, to really make sure that we are testing timestamps, we really should
have the left side and the right side of the join use timestamps with
different timezones.
To create timestamps with different timezones, we can modify this test to
create the test files separately -
```
withSQLConf(
SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Asia/Kathmandu",
SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
withTable("t1", "t2") {
withSQLconf( SQLConf.SESSION_LOCAL_TIMEZONE.key ->
"Australia/Darwin") {
sql("CREATE TABLE t1(name STRING, time TIMESTAMP) USING PARQUET")
sql("INSERT OVERWRITE t1 VALUES('a', timestamp'2019-01-01
11:11:11')")
}
withSQLconf( SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Canada/Pacific")
{
sql("CREATE TABLE t2(name STRING, time TIMESTAMP) USING PARQUET")
sql("INSERT OVERWRITE t2 VALUES('a', timestamp'2019-01-01
11:11:11')")
}
...
```
The join above with different timezones will return zero rows since the
timezones are not the same.
Also, we could rename the test.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
SKY-ALIN commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2160644398
##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -2168,7 +2168,8 @@ object QueryPlanSerde extends Logging with CometExprShim {
*/
private def supportedSortMergeJoinEqualType(dataType: DataType): Boolean =
dataType match {
case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _:
FloatType |
-_: DoubleType | _: StringType | _: DateType | _: DecimalType | _:
BooleanType =>
+_: DoubleType | _: StringType | _: DateType | _: DecimalType | _:
BooleanType |
+_: TimestampType =>
Review Comment:
TimestampNTZType is supported; there is another case 1 line below.
```scala
case TimestampNTZType => true
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
SKY-ALIN commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2994779058 @mbutrovich done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
codecov-commenter commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2986286369 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) Report Attention: Patch coverage is `0%` with `4 lines` in your changes missing coverage. Please review. > Project coverage is 32.91%. Comparing base [(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) to head [(`75d681d`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/75d681d4eede4b392f0d0d2f77a8ca74e534e5d6?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). > Report is 270 commits behind head on main. | [Files with missing lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Patch % | Lines | |---|---|---| | [.../scala/org/apache/comet/serde/QueryPlanSerde.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2FQueryPlanSerde.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9RdWVyeVBsYW5TZXJkZS5zY2FsYQ==) | 0.00% | [4 Missing :warning: ](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Additional details and impacted files ```diff @@ Coverage Diff @@ ## main#1901 +/- ## = - Coverage 56.12% 32.91% -23.22% + Complexity 976 800 -176 = Files 119 130 +11 Lines 1174312743 +1000 Branches 2251 2405 +154 = - Hits 6591 4194 -2397 - Misses 4012 7577 +3565 + Partials 1140 972 -168 ``` [:umbrella: View full report in Codecov by Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :loudspeaker: Have feedback on the report? [Share it here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache). :rocket: New features to boost your workflow: - :snowflake: [Test Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, report on failures, and find test suite problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
parthchandra commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2985979040 > Thanks for the contribution, @SKY-ALIN! Could we add a test case with timestamps as the join key? The test should have the left side and the right side timestamps be in different timezones. > Should TimestampNTZType also be supported? If the above test case passes we probably can. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
andygrove commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2154673214
##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -2168,7 +2168,8 @@ object QueryPlanSerde extends Logging with CometExprShim {
*/
private def supportedSortMergeJoinEqualType(dataType: DataType): Boolean =
dataType match {
case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _:
FloatType |
-_: DoubleType | _: StringType | _: DateType | _: DecimalType | _:
BooleanType =>
+_: DoubleType | _: StringType | _: DateType | _: DecimalType | _:
BooleanType |
+_: TimestampType =>
Review Comment:
Should `TimestampNTZType` also be supported?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]
mbutrovich commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2984157981 Could we add a test case with timestamps as the join key? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Fix SortMergeJoin for timestamp keys [datafusion-comet]
SKY-ALIN commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2983795662 It fixes formatting also, now it looks like this: ```shell 25/06/18 12:53:47 WARN CometExecRule: Comet cannot execute some parts of this plan natively (set spark.comet.explainFallback.enabled=false to disable this logging): Project +- BroadcastHashJoin :- Project : +- Window : +- Sort :+- Exchange [COMET: ] : +- Project : +- SortMergeJoin [COMET: Unsupported join key type TimestampType on key: CAST(time AS TIMESTAMP)] ... ``` but after adding TimestampType into match statement there is no this message anyway:), just for clearance -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
