Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-08-03 Thread via GitHub


SKY-ALIN closed pull request #1901: fix: SortMergeJoin for timestamp keys
URL: https://github.com/apache/datafusion-comet/pull/1901


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-25 Thread via GitHub


parthchandra commented on PR #1901:
URL: 
https://github.com/apache/datafusion-comet/pull/1901#issuecomment-3006474262

   I don't think this PR is making the correct change. 
   With this PR the removed test fails to execute the query (let alone pass the 
assertion)
   ```
 test("SortMergeJoin with unsupported key type should fall back to Spark") {
   withSQLConf(
 SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Asia/Kathmandu",
 SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
 SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
 withTable("t1", "t2") {
   sql("CREATE TABLE t1(name STRING, time TIMESTAMP) USING PARQUET")
   sql("INSERT OVERWRITE t1 VALUES('a', timestamp'2019-01-01 
11:11:11')")
   
   sql("CREATE TABLE t2(name STRING, time TIMESTAMP) USING PARQUET")
   sql("INSERT OVERWRITE t2 VALUES('a', timestamp'2019-01-01 
11:11:11')")
   
   val df = sql("SELECT * FROM t1 JOIN t2 ON t1.time = t2.time")
   val (sparkPlan, cometPlan) = checkSparkAnswer(df)// 
should NOT fail here, but does
   assert(sparkPlan.canonicalized === cometPlan.canonicalized)   // 
should fail here
 }
   }
 }
   
   ```
   The reason for the failure is - 
   ```
   org.apache.comet.CometNativeException: Unsupported data type in sort merge 
join comparator: Timestamp(Microsecond, Some("UTC"))
   ```
   This means that `supportedSortMergeJoinEqualType` should not return true for 
`Timestamp`
   
   The newly added test does not exercise the change. The plan (see below) does 
not include a SMJ and the test passes with and without the changes in PR. 
   ```
   AdaptiveSparkPlan isFinalPlan=false
   +- BroadcastHashJoin [ts#4], [ts#10], Inner, BuildRight, false
  :- LocalTableScan [ts#4]
  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, timestamp, 
true]),false), [plan_id=106]
 +- LocalTableScan [ts#10]
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-25 Thread via GitHub


parthchandra commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2167482793


##
spark/src/test/scala/org/apache/comet/exec/CometJoinSuite.scala:
##
@@ -54,25 +54,6 @@ class CometJoinSuite extends CometTestBase {
 .toSeq)
   }
 
-  test("SortMergeJoin with unsupported key type should fall back to Spark") {

Review Comment:
   A DF join on timestamp columns does not take into account time zones? Then 
it wouldn't be correct, would it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-24 Thread via GitHub


SKY-ALIN commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2164969731


##
spark/src/test/scala/org/apache/comet/exec/CometJoinSuite.scala:
##
@@ -54,25 +54,6 @@ class CometJoinSuite extends CometTestBase {
 .toSeq)
   }
 
-  test("SortMergeJoin with unsupported key type should fall back to Spark") {

Review Comment:
   Such a test will not work as the current implementation of datafusion 
doesn't include this case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-24 Thread via GitHub


parthchandra commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2164519556


##
spark/src/test/scala/org/apache/comet/exec/CometJoinSuite.scala:
##
@@ -54,25 +54,6 @@ class CometJoinSuite extends CometTestBase {
 .toSeq)
   }
 
-  test("SortMergeJoin with unsupported key type should fall back to Spark") {

Review Comment:
   The test you have added is great. Thank you!
   
   However this removed test exercises a few things that we should also check 
for - Timestamps read from Parquet, and the actual plan created. For the latter 
you can simply change the test to use `checkSparkAnswerAndOperator` and remove 
the check that the canonicalized plans are the same. 
   
   Also, to really make sure that we are testing timestamps, we really should 
have the left side and the right side of the  join use timestamps with 
different timezones. 
   
   To create timestamps with different timezones, we can modify this test to 
create the test files separately - 
   ```
 withSQLConf(
 SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Asia/Kathmandu",
 SQLConf.ADAPTIVE_AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
 SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
 withTable("t1", "t2") {
   withSQLconf( SQLConf.SESSION_LOCAL_TIMEZONE.key -> 
"Australia/Darwin") {
   sql("CREATE TABLE t1(name STRING, time TIMESTAMP) USING PARQUET")
   sql("INSERT OVERWRITE t1 VALUES('a', timestamp'2019-01-01 
11:11:11')")
   }
   withSQLconf( SQLConf.SESSION_LOCAL_TIMEZONE.key -> "Canada/Pacific") 
{
   sql("CREATE TABLE t2(name STRING, time TIMESTAMP) USING PARQUET")
   sql("INSERT OVERWRITE t2 VALUES('a', timestamp'2019-01-01 
11:11:11')")
   }
   ...
   ```
   The join above with different timezones will return zero rows since the 
timezones are not the same.
   
   Also, we could rename the test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-23 Thread via GitHub


SKY-ALIN commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2160644398


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -2168,7 +2168,8 @@ object QueryPlanSerde extends Logging with CometExprShim {
*/
   private def supportedSortMergeJoinEqualType(dataType: DataType): Boolean = 
dataType match {
 case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _: 
FloatType |
-_: DoubleType | _: StringType | _: DateType | _: DecimalType | _: 
BooleanType =>
+_: DoubleType | _: StringType | _: DateType | _: DecimalType | _: 
BooleanType |
+_: TimestampType =>

Review Comment:
   TimestampNTZType is supported; there is another case 1 line below.
   
   ```scala
   case TimestampNTZType => true
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-22 Thread via GitHub


SKY-ALIN commented on PR #1901:
URL: 
https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2994779058

   @mbutrovich done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub


codecov-commenter commented on PR #1901:
URL: 
https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2986286369

   ## 
[Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 Report
   Attention: Patch coverage is `0%` with `4 lines` in your changes missing 
coverage. Please review.
   > Project coverage is 32.91%. Comparing base 
[(`f09f8af`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/f09f8af64c6599255e116a376f4f008f2fd63b43?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 to head 
[(`75d681d`)](https://app.codecov.io/gh/apache/datafusion-comet/commit/75d681d4eede4b392f0d0d2f77a8ca74e534e5d6?dropdown=coverage&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   > Report is 270 commits behind head on main.
   
   | [Files with missing 
lines](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 | Patch % | Lines |
   |---|---|---|
   | 
[.../scala/org/apache/comet/serde/QueryPlanSerde.scala](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?src=pr&el=tree&filepath=spark%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fcomet%2Fserde%2FQueryPlanSerde.scala&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#diff-c3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9jb21ldC9zZXJkZS9RdWVyeVBsYW5TZXJkZS5zY2FsYQ==)
 | 0.00% | [4 Missing :warning: 
](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache)
 |
   
   Additional details and impacted files
   
   
   ```diff
   @@  Coverage Diff  @@
   ##   main#1901   +/-   ##
   =
   - Coverage 56.12%   32.91%   -23.22% 
   + Complexity  976  800  -176 
   =
 Files   119  130   +11 
 Lines 1174312743 +1000 
 Branches   2251 2405  +154 
   =
   - Hits   6591 4194 -2397 
   - Misses 4012 7577 +3565 
   + Partials   1140  972  -168 
   ```
   
   
   
   [:umbrella: View full report in Codecov by 
Sentry](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
   :loudspeaker: Have feedback on the report? [Share it 
here](https://about.codecov.io/codecov-pr-comment-feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache).
   
:rocket: New features to boost your workflow: 
   
   - :snowflake: [Test 
Analytics](https://docs.codecov.com/docs/test-analytics): Detect flaky tests, 
report on failures, and find test suite problems.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub


parthchandra commented on PR #1901:
URL: 
https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2985979040

   > Thanks for the contribution, @SKY-ALIN! Could we add a test case with 
timestamps as the join key?
   
   The test should have the left side and the right side timestamps be in 
different timezones. 
   
   > Should TimestampNTZType also be supported?
   
   If the above test case passes we probably can.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub


andygrove commented on code in PR #1901:
URL: https://github.com/apache/datafusion-comet/pull/1901#discussion_r2154673214


##
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:
##
@@ -2168,7 +2168,8 @@ object QueryPlanSerde extends Logging with CometExprShim {
*/
   private def supportedSortMergeJoinEqualType(dataType: DataType): Boolean = 
dataType match {
 case _: ByteType | _: ShortType | _: IntegerType | _: LongType | _: 
FloatType |
-_: DoubleType | _: StringType | _: DateType | _: DecimalType | _: 
BooleanType =>
+_: DoubleType | _: StringType | _: DateType | _: DecimalType | _: 
BooleanType |
+_: TimestampType =>

Review Comment:
   Should `TimestampNTZType` also be supported?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub


mbutrovich commented on PR #1901:
URL: 
https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2984157981

   Could we add a test case with timestamps as the join key?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Fix SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub


SKY-ALIN commented on PR #1901:
URL: 
https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2983795662

   It fixes formatting also, now it looks like this:
   
   ```shell
   25/06/18 12:53:47 WARN CometExecRule: Comet cannot execute some parts of 
this plan natively (set spark.comet.explainFallback.enabled=false to disable 
this logging):
   Project
   +- BroadcastHashJoin
  :- Project
  :  +- Window
  : +- Sort
  :+-  Exchange [COMET: ]
  :   +- Project
  :  +-  SortMergeJoin [COMET: Unsupported join key type 
TimestampType on key: CAST(time AS TIMESTAMP)]
   ...
   ```
   
   but after adding TimestampType into match statement there is no this message 
anyway:), just for clearance


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]