LoseYSelf commented on PR #5468:
URL: https://github.com/apache/kyuubi/pull/5468#issuecomment-1831568808

   i can provide some detail information,
   as the test suit code 
   ```java
   val df0 = spark.table("default.src") // only col2 is masked which is not used
         .select($"col0", $"col1")
   
           doAs("bob",
             df0.as("a")
               .join(
                 right = df0.as("b"),
                 joinExprs = $"a.col0" === $"b.col0" && $"a.col1" === $"b.col1",
                 joinType = "left_outer").explain()) // crashes
   ```
   the error happens in RuleApplyDataMaskingStage1,
   ```java
   finally in applyDataMasking method,
   newPlan is 
   Join LeftOuter, ((col0#131 = col0#131) AND (col1#132 = col1#132))
   :- SubqueryAlias a
   :  +- Project [col0#98, col1#99]
   :     +- SubqueryAlias testcat.default.src
   :        +- Filter (key#93 < 20)
   :           +- RowFilterMarker
   :              +- DataMaskingStage0Marker RelationV2[key#93, value1#94, 
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100] 
testcat.default.src
   :                 +- Project [key#93, md5(cast(cast(value1#94 as string) as 
binary)) AS value1#116, regexp_replace(regexp_replace(regexp_replace(value2#95, 
[A-Z], X, 1), [a-z], x, 1), [0-9], n, 1) AS value2#117, 
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x, 
5), [0-9], n, 5) AS value3#118, date_trunc(YEAR, value4#97, 
Some(America/Los_Angeles)) AS value4#119, col0#98, col1#99, 
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100, 
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1), 
right(value5#100, 4)) AS value5#120]
   :                    +- RelationV2[key#93, value1#94, value2#95, value3#96, 
value4#97, col0#98, col1#99, value5#100] testcat.default.src
   +- SubqueryAlias b
      +- Project [col0#131, col1#132]
         +- SubqueryAlias testcat.default.src
            +- Filter (key#126 < 20)
               +- RowFilterMarker
                  +- DataMaskingStage0Marker RelationV2[key#93, value1#94, 
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100] 
testcat.default.src
                     +- Project [key#126, md5(cast(cast(value1#127 as string) 
as binary)) AS value1#121, 
regexp_replace(regexp_replace(regexp_replace(value2#128, [A-Z], X, 1), [a-z], 
x, 1), [0-9], n, 1) AS value2#122, 
regexp_replace(regexp_replace(regexp_replace(value3#129, [A-Z], X, 5), [a-z], 
x, 5), [0-9], n, 5) AS value3#123, date_trunc(YEAR, value4#130, 
Some(America/Los_Angeles)) AS value4#124, col0#131, col1#132, 
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#133, 
(length(value5#133) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1), 
right(value5#133, 4)) AS value5#125]
                        +- RelationV2[key#126, value1#127, value2#128, 
value3#129, value4#130, col0#131, col1#132, value5#133] testcat.default.src
   
   the plan is 
   
   Join LeftOuter, ((col0#98 = col0#131) AND (col1#99 = col1#132))
   :- SubqueryAlias a
   :  +- Project [col0#98, col1#99]
   :     +- SubqueryAlias testcat.default.src
   :        +- Filter (key#93 < 20)
   :           +- RowFilterMarker
   :              +- DataMaskingStage0Marker RelationV2[key#93, value1#94, 
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100] 
testcat.default.src
   :                 +- Project [key#93, md5(cast(cast(value1#94 as string) as 
binary)) AS value1#116, regexp_replace(regexp_replace(regexp_replace(value2#95, 
[A-Z], X, 1), [a-z], x, 1), [0-9], n, 1) AS value2#117, 
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x, 
5), [0-9], n, 5) AS value3#118, date_trunc(YEAR, value4#97, 
Some(America/Los_Angeles)) AS value4#119, col0#98, col1#99, 
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100, 
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1), 
right(value5#100, 4)) AS value5#120]
   :                    +- RelationV2[key#93, value1#94, value2#95, value3#96, 
value4#97, col0#98, col1#99, value5#100] testcat.default.src
   +- SubqueryAlias b
      +- Project [col0#131, col1#132]
         +- SubqueryAlias testcat.default.src
            +- Filter (key#126 < 20)
               +- RowFilterMarker
                  +- DataMaskingStage0Marker RelationV2[key#93, value1#94, 
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100] 
testcat.default.src
                     +- Project [key#126, md5(cast(cast(value1#127 as string) 
as binary)) AS value1#121, 
regexp_replace(regexp_replace(regexp_replace(value2#128, [A-Z], X, 1), [a-z], 
x, 1), [0-9], n, 1) AS value2#122, 
regexp_replace(regexp_replace(regexp_replace(value3#129, [A-Z], X, 5), [a-z], 
x, 5), [0-9], n, 5) AS value3#123, date_trunc(YEAR, value4#130, 
Some(America/Los_Angeles)) AS value4#124, col0#131, col1#132, 
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#133, 
(length(value5#133) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1), 
right(value5#133, 4)) AS value5#125]
                        +- RelationV2[key#126, value1#127, value2#128, 
value3#129, value4#130, col0#131, col1#132, value5#133] testcat.default.src
   
   ```
   the mainly difference is in SubqueryAlias b,DataMaskingStage0Marker 
RelationV2 is not same as Project RelationV2 
   
   it is changed in spark rule DeduplicateRelations
   
   ```java
   spark log
   === Applying Rule 
org.apache.spark.sql.catalyst.analysis.DeduplicateRelations ===
    'Join LeftOuter, (('a.col0 = 'b.col0) AND ('a.col1 = 'b.col1))              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                 'Join LeftOuter, (('a.col0 = 'b.col0) AND 
('a.col1 = 'b.col1))
    :- SubqueryAlias a                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                 :- SubqueryAlias a
    :  +- Project [col0#98, col1#99]                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                 :  +- Project [col0#98, col1#99]
    :     +- SubqueryAlias testcat.default.src                                  
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                 :     +- SubqueryAlias testcat.default.src
    :        +- Filter (key#93 < 20)                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                 :        +- Filter (key#93 < 20)
    :           +- RowFilterMarker                                              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                 :           +- RowFilterMarker
    :              +- DataMaskingStage0Marker RelationV2[key#93, value1#94, 
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100] 
testcat.default.src                                                             
                                                                                
                                                                                
                                                                                
                                                                                
                                                      :              +- 
DataMaskingStage0Marker RelationV2[key#93, value1#94, value2#95, value3#96, 
value4#97, col0#98, col1#99, value5#100] testcat.default.src
    :                 +- Project [key#93, md5(cast(cast(value1#94 as string) as 
binary)) AS value1#116, regexp_replace(regexp_replace(regexp_replace(value2#95, 
[A-Z], X, 1), [a-z], x, 1), [0-9], n, 1) AS value2#117, 
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x, 
5), [0-9], n, 5) AS value3#118, date_trunc(YEAR, value4#97, 
Some(America/Los_Angeles)) AS value4#119, col0#98, col1#99, 
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100, 
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1), 
right(value5#100, 4)) AS value5#120]   :                 +- Project [key#93, 
md5(cast(cast(value1#94 as string) as binary)) AS value1#116, 
regexp_replace(regexp_replace(regexp_replace(value2#95, [A-Z], X, 1), [a-z], x, 
1), [0-9], n, 1) AS value2#117, 
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x, 
5), [0-9], n, 5) AS value3#118, date_trunc(YEAR, value4#97, 
Some(America/Los_Angeles)) AS value4#119, col0#98, co
 l1#99, concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100, 
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1), 
right(value5#100, 4)) AS value5#120]
    :                    +- RelationV2[key#93, value1#94, value2#95, value3#96, 
value4#97, col0#98, col1#99, value5#100] testcat.default.src                    
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                 :                    +- RelationV2[key#93, 
value1#94, value2#95, value3#96, value4#97, col0#98, col1#99, value5#100] 
testcat.default.src
    +- SubqueryAlias b                                                          
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                 +- SubqueryAlias b
   !   +- Project [col0#98, col1#99]                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                    +- Project [col0#131, col1#132]
          +- SubqueryAlias testcat.default.src                                  
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                       +- SubqueryAlias testcat.default.src
   !         +- Filter (key#93 < 20)                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                          +- Filter (key#126 < 20)
                +- RowFilterMarker                                              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                             +- RowFilterMarker
                   +- DataMaskingStage0Marker RelationV2[key#93, value1#94, 
value2#95, value3#96, value4#97, col0#98, col1#99, value5#100] 
testcat.default.src                                                             
                                                                                
                                                                                
                                                                                
                                                                                
                                                                     +- 
DataMaskingStage0Marker RelationV2[key#93, value1#94, value2#95, value3#96, 
value4#97, col0#98, col1#99, value5#100] testcat.default.src
   !                  +- Project [key#93, md5(cast(cast(value1#94 as string) as 
binary)) AS value1#121, regexp_replace(regexp_replace(regexp_replace(value2#95, 
[A-Z], X, 1), [a-z], x, 1), [0-9], n, 1) AS value2#122, 
regexp_replace(regexp_replace(regexp_replace(value3#96, [A-Z], X, 5), [a-z], x, 
5), [0-9], n, 5) AS value3#123, date_trunc(YEAR, value4#97, 
Some(America/Los_Angeles)) AS value4#124, col0#98, col1#99, 
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#100, 
(length(value5#100) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1), 
right(value5#100, 4)) AS value5#125]                     +- Project [key#126, 
md5(cast(cast(value1#127 as string) as binary)) AS value1#121, 
regexp_replace(regexp_replace(regexp_replace(value2#128, [A-Z], X, 1), [a-z], 
x, 1), [0-9], n, 1) AS value2#122, 
regexp_replace(regexp_replace(regexp_replace(value3#129, [A-Z], X, 5), [a-z], 
x, 5), [0-9], n, 5) AS value3#123, date_trunc(YEAR, value4#130, 
Some(America/Los_Angeles)) AS value4#124, col0#1
 31, col1#132, 
concat(regexp_replace(regexp_replace(regexp_replace(left(value5#133, 
(length(value5#133) - 4)), [A-Z], X, 1), [a-z], x, 1), [0-9], n, 1), 
right(value5#133, 4)) AS value5#125]
   !                     +- RelationV2[key#93, value1#94, value2#95, value3#96, 
value4#97, col0#98, col1#99, value5#100] testcat.default.src                    
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                      +- RelationV2[key#126, 
value1#127, value2#128, value3#129, value4#130, col0#131, col1#132, value5#133] 
testcat.default.src
             
   ```
   
   so i think we can inject a rule before RuleApplyDataMaskingStage1 to keep 
DataMaskingStage0Marker RelationV2 is same as 
   Project RelationV2
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to