LoseYSelf commented on PR #5468:
URL: https://github.com/apache/kyuubi/pull/5468#issuecomment-1831568808
i can provide some detail information,
as the test suit code
```java
val df0 = spark.table("default.src") // only col2 is masked which is not used
.select($"col0", $"col1")
doAs("bob",
df0.as("a")
.join(
right = df0.as("b"),
joinExprs = $"a.col0" === $"b.col0" && $"a.col1" === $"b.col1",
joinType = "left_outer").explain()) // crashes
```
the error happens in RuleApplyDataMaskingStage1,
```java
finally in applyDataMasking method,
newPlan is
Join LeftOuter, ((col0#131 = col0#131) AND (col1#132 = col1#132))
:- SubqueryAlias a
: +- Project [col0#98, col1#99]
: +- SubqueryAlias testcat.default.src
: +- Filter (key#93 < 20)
: +- RowFilterMarker
: +- DataMaskingStage0Marker RelationV2[key#93, value1#94,
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100]
testcat.default.src
: +- Project [key#93, md5(cast(cast(value1#94 as string) as
binary)) AS value1#116, regexp_replace(regexp_replace(regexp_replace(value2#95,
[A-Z], X, 1), [a-z], x, 1), [0-9], n, 1) AS value2#117,
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x,
5), [0-9], n, 5) AS value3#118, date_trunc(YEAR, value4#97,
Some(America/Los_Angeles)) AS value4#119, col0#98, col1#99,
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100,
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1),
right(value5#100, 4)) AS value5#120]
: +- RelationV2[key#93, value1#94, value2#95, value3#96,
value4#97, col0#98, col1#99, value5#100] testcat.default.src
+- SubqueryAlias b
+- Project [col0#131, col1#132]
+- SubqueryAlias testcat.default.src
+- Filter (key#126 < 20)
+- RowFilterMarker
+- DataMaskingStage0Marker RelationV2[key#93, value1#94,
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100]
testcat.default.src
+- Project [key#126, md5(cast(cast(value1#127 as string)
as binary)) AS value1#121,
regexp_replace(regexp_replace(regexp_replace(value2#128, [A-Z], X, 1), [a-z],
x, 1), [0-9], n, 1) AS value2#122,
regexp_replace(regexp_replace(regexp_replace(value3#129, [A-Z], X, 5), [a-z],
x, 5), [0-9], n, 5) AS value3#123, date_trunc(YEAR, value4#130,
Some(America/Los_Angeles)) AS value4#124, col0#131, col1#132,
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#133,
(length(value5#133) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1),
right(value5#133, 4)) AS value5#125]
+- RelationV2[key#126, value1#127, value2#128,
value3#129, value4#130, col0#131, col1#132, value5#133] testcat.default.src
the plan is
Join LeftOuter, ((col0#98 = col0#131) AND (col1#99 = col1#132))
:- SubqueryAlias a
: +- Project [col0#98, col1#99]
: +- SubqueryAlias testcat.default.src
: +- Filter (key#93 < 20)
: +- RowFilterMarker
: +- DataMaskingStage0Marker RelationV2[key#93, value1#94,
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100]
testcat.default.src
: +- Project [key#93, md5(cast(cast(value1#94 as string) as
binary)) AS value1#116, regexp_replace(regexp_replace(regexp_replace(value2#95,
[A-Z], X, 1), [a-z], x, 1), [0-9], n, 1) AS value2#117,
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x,
5), [0-9], n, 5) AS value3#118, date_trunc(YEAR, value4#97,
Some(America/Los_Angeles)) AS value4#119, col0#98, col1#99,
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100,
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1),
right(value5#100, 4)) AS value5#120]
: +- RelationV2[key#93, value1#94, value2#95, value3#96,
value4#97, col0#98, col1#99, value5#100] testcat.default.src
+- SubqueryAlias b
+- Project [col0#131, col1#132]
+- SubqueryAlias testcat.default.src
+- Filter (key#126 < 20)
+- RowFilterMarker
+- DataMaskingStage0Marker RelationV2[key#93, value1#94,
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100]
testcat.default.src
+- Project [key#126, md5(cast(cast(value1#127 as string)
as binary)) AS value1#121,
regexp_replace(regexp_replace(regexp_replace(value2#128, [A-Z], X, 1), [a-z],
x, 1), [0-9], n, 1) AS value2#122,
regexp_replace(regexp_replace(regexp_replace(value3#129, [A-Z], X, 5), [a-z],
x, 5), [0-9], n, 5) AS value3#123, date_trunc(YEAR, value4#130,
Some(America/Los_Angeles)) AS value4#124, col0#131, col1#132,
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#133,
(length(value5#133) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1),
right(value5#133, 4)) AS value5#125]
+- RelationV2[key#126, value1#127, value2#128,
value3#129, value4#130, col0#131, col1#132, value5#133] testcat.default.src
```
the mainly difference is in SubqueryAlias b,DataMaskingStage0Marker
RelationV2 is not same as Project RelationV2
it is changed in spark rule DeduplicateRelations
```java
spark log
=== Applying Rule
org.apache.spark.sql.catalyst.analysis.DeduplicateRelations ===
'Join LeftOuter, (('a.col0 = 'b.col0) AND ('a.col1 = 'b.col1))
'Join LeftOuter, (('a.col0 = 'b.col0) AND
('a.col1 = 'b.col1))
:- SubqueryAlias a
:- SubqueryAlias a
: +- Project [col0#98, col1#99]
: +- Project [col0#98, col1#99]
: +- SubqueryAlias testcat.default.src
: +- SubqueryAlias testcat.default.src
: +- Filter (key#93 < 20)
: +- Filter (key#93 < 20)
: +- RowFilterMarker
: +- RowFilterMarker
: +- DataMaskingStage0Marker RelationV2[key#93, value1#94,
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100]
testcat.default.src
: +-
DataMaskingStage0Marker RelationV2[key#93, value1#94, value2#95, value3#96,
value4#97, col0#98, col1#99, value5#100] testcat.default.src
: +- Project [key#93, md5(cast(cast(value1#94 as string) as
binary)) AS value1#116, regexp_replace(regexp_replace(regexp_replace(value2#95,
[A-Z], X, 1), [a-z], x, 1), [0-9], n, 1) AS value2#117,
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x,
5), [0-9], n, 5) AS value3#118, date_trunc(YEAR, value4#97,
Some(America/Los_Angeles)) AS value4#119, col0#98, col1#99,
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100,
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1),
right(value5#100, 4)) AS value5#120] : +- Project [key#93,
md5(cast(cast(value1#94 as string) as binary)) AS value1#116,
regexp_replace(regexp_replace(regexp_replace(value2#95, [A-Z], X, 1), [a-z], x,
1), [0-9], n, 1) AS value2#117,
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x,
5), [0-9], n, 5) AS value3#118, date_trunc(YEAR, value4#97,
Some(America/Los_Angeles)) AS value4#119, col0#98, co
l1#99, concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100,
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1),
right(value5#100, 4)) AS value5#120]
: +- RelationV2[key#93, value1#94, value2#95, value3#96,
value4#97, col0#98, col1#99, value5#100] testcat.default.src
: +- RelationV2[key#93,
value1#94, value2#95, value3#96, value4#97, col0#98, col1#99, value5#100]
testcat.default.src
+- SubqueryAlias b
+- SubqueryAlias b
! +- Project [col0#98, col1#99]
+- Project [col0#131, col1#132]
+- SubqueryAlias testcat.default.src
+- SubqueryAlias testcat.default.src
! +- Filter (key#93 < 20)
+- Filter (key#126 < 20)
+- RowFilterMarker
+- RowFilterMarker
+- DataMaskingStage0Marker RelationV2[key#93, value1#94,
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100]
testcat.default.src
+-
DataMaskingStage0Marker RelationV2[key#93, value1#94, value2#95, value3#96,
value4#97, col0#98, col1#99, value5#100] testcat.default.src
! +- Project [key#93, md5(cast(cast(value1#94 as string) as
binary)) AS value1#121, regexp_replace(regexp_replace(regexp_replace(value2#95,
[A-Z], X, 1), [a-z], x, 1), [0-9], n, 1) AS value2#122,
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x,
5), [0-9], n, 5) AS value3#123, date_trunc(YEAR, value4#97,
Some(America/Los_Angeles)) AS value4#124, col0#98, col1#99,
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100,
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1),
right(value5#100, 4)) AS value5#125] +- Project [key#126,
md5(cast(cast(value1#127 as string) as binary)) AS value1#121,
regexp_replace(regexp_replace(regexp_replace(value2#128, [A-Z], X, 1), [a-z],
x, 1), [0-9], n, 1) AS value2#122,
regexp_replace(regexp_replace(regexp_replace(value3#129, [A-Z], X, 5), [a-z],
x, 5), [0-9], n, 5) AS value3#123, date_trunc(YEAR, value4#130,
Some(America/Los_Angeles)) AS value4#124, col0#1
31, col1#132,
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#133,
(length(value5#133) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1),
right(value5#133, 4)) AS value5#125]
! +- RelationV2[key#93, value1#94, value2#95, value3#96,
value4#97, col0#98, col1#99, value5#100] testcat.default.src
+- RelationV2[key#126,
value1#127, value2#128, value3#129, value4#130, col0#131, col1#132, value5#133]
testcat.default.src
```
so i think we can inject a rule before RuleApplyDataMaskingStage1 to keep
DataMaskingStage0Marker RelationV2 is same as
Project RelationV2
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]