sauliusvl commented on PR #5468:
URL: https://github.com/apache/kyuubi/pull/5468#issuecomment-1831774255

   Coincidentally, I also tried debugging this a bit. Here's my summary; 
hopefully it helps:
   
   Before the join happens the table is captured 
[here](https://github.com/apache/kyuubi/blob/master/extensions/spark/kyuubi-spark-authz/src/main/scala/org/apache/kyuubi/plugin/spark/authz/rule/datamasking/RuleApplyDataMaskingStage0.scala#L72)
 into the following:
   
   ```
   DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
   +- Project [portal#0, user_id#1L, id#2, null AS login#261, real_name#4]
      +- Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
   ```
   
   All good so far. However, when the join is evaluated, the deduplication 
rule transforms it into the following:
   ```
   Join LeftOuter, ((portal#0 = portal#529) AND (id#2 = id#531))
   :- SubqueryAlias a
   :  +- Project [portal#0, id#2]
   :     +- SubqueryAlias spark_catalog.tbl0
   :        +- DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
   :           +- Project [portal#0, user_id#1L, id#2, null AS login#261, real_name#4]
   :              +- Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
   +- SubqueryAlias b
      +- Project [portal#529, id#531]
         +- SubqueryAlias spark_catalog.tbl0
            +- DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
               +- Project [portal#529, user_id#530L, id#531, null AS login#261, real_name#533]
                  +- Relation spark_catalog.tbl0[portal#529,user_id#530L,id#531,login#532,real_name#533] parquet
   ```
   
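   Note that the IDs in the second `DataMaskingStage0Marker` are not updated: 
it still lists `portal#0` through `real_name#4`, while the `Project` and 
`Relation` beneath it now use `#529` through `#533`. I'd guess this is because 
the original relation was captured by `DataMaskingStage0Marker`, so Spark has 
no knowledge of it during the deduplication stage.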
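
To make that guess concrete, here is a toy Python model. It is not Kyuubi or Spark code: `Node`, `remap_ids`, and the `scan` field are hypothetical names, and the sketch assumes the marker holds the captured relation in a field that a tree rewrite does not descend into. Under that assumption, remapping IDs updates the child subtree but leaves the captured relation stale, which matches the plan above.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    output: Tuple[str, ...]          # attributes such as "portal#0"
    child: Optional["Node"] = None   # the only field a tree rewrite descends into
    scan: Optional["Node"] = None    # captured relation, held outside the tree (assumption)


def remap_ids(node: Node, mapping: dict) -> Node:
    """Rewrite attribute ids in `output` and recurse into `child` only."""
    new_child = remap_ids(node.child, mapping) if node.child else None
    new_output = tuple(mapping.get(attr, attr) for attr in node.output)
    return replace(node, output=new_output, child=new_child)  # `scan` is untouched


relation = Node(("portal#0", "user_id#1L", "id#2", "login#3", "real_name#4"))
project = Node(("portal#0", "user_id#1L", "id#2", "login#261", "real_name#4"), child=relation)
marker = Node(project.output, child=project, scan=relation)

# Fresh ids handed out for the right-hand side of the self-join.
mapping = {
    "portal#0": "portal#529",
    "user_id#1L": "user_id#530L",
    "id#2": "id#531",
    "login#3": "login#532",
    "real_name#4": "real_name#533",
}
right = remap_ids(marker, mapping)

print(right.child.output)  # child Project now uses the #529.. ids
print(right.scan.output)   # captured relation still uses the #0.. ids
```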
   
   Because of this `RuleApplyDataMaskingStage1` gets confused: at [this 
point](https://github.com/apache/kyuubi/blob/master/extensions/spark/kyuubi-spark-authz/src/main/scala/org/apache/kyuubi/plugin/spark/authz/rule/datamasking/RuleApplyDataMaskingStage1.scala#L64)
 all columns are considered masked in the second relation, because [this 
condition](https://github.com/apache/kyuubi/blob/master/extensions/spark/kyuubi-spark-authz/src/main/scala/org/apache/kyuubi/plugin/spark/authz/rule/datamasking/DataMaskingStage0Marker.scala#L29)
 is always false due to rewritten ids. The end result of the stage is this:
   ```
   DataMaskingStage1Marker
   +- Join LeftOuter, ((portal#529 = portal#529) AND (id#531 = id#531))
      :- SubqueryAlias a
      :  +- Project [portal#0, id#2]
      :     +- SubqueryAlias spark_catalog.tbl0
      :        +- DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
      :           +- Project [portal#0, user_id#1L, id#2, null AS login#261, real_name#4]
      :              +- Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
      +- SubqueryAlias b
         +- Project [portal#529, id#531]
            +- SubqueryAlias spark_catalog.tbl0
               +- DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
                  +- Project [portal#529, user_id#530L, id#531, null AS login#261, real_name#533]
                     +- Relation spark_catalog.tbl0[portal#529,user_id#530L,id#531,login#532,real_name#533] parquet
   ```
   i.e. it got wrapped into `DataMaskingStage1Marker` 
[here](https://github.com/apache/kyuubi/blob/master/extensions/spark/kyuubi-spark-authz/src/main/scala/org/apache/kyuubi/plugin/spark/authz/rule/datamasking/RuleApplyDataMaskingStage1.scala#L81)
 because the join condition changed (it now compares each attribute with 
itself, which looks wrong). We thus end up with the class cast exception, as 
Spark expected this node to be a `Join`.
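
The effect on the join condition can be reproduced with a small Python sketch. This is a toy model, not the Kyuubi implementation: `expr_to_maskers` and `rewrite` are hypothetical helpers, and the sketch assumes that the marker pairs the captured relation's output with its child's output by position and treats a column as masked whenever the two IDs differ.

```python
def expr_to_maskers(scan_output, child_output):
    """Map a captured-relation attribute to the child's attribute when the ids differ."""
    return {s: c for s, c in zip(scan_output, child_output) if s != c}


def rewrite(condition, maskers):
    """Replace every attribute in the join condition that has a masker."""
    return [(maskers.get(l, l), maskers.get(r, r)) for l, r in condition]


# Both markers still hold the original ids, as in the plan above.
scan = ["portal#0", "user_id#1L", "id#2", "login#3", "real_name#4"]
left_child = ["portal#0", "user_id#1L", "id#2", "login#261", "real_name#4"]
right_child = ["portal#529", "user_id#530L", "id#531", "login#261", "real_name#533"]

left_maskers = expr_to_maskers(scan, left_child)
right_maskers = expr_to_maskers(scan, right_child)
print(left_maskers)        # {'login#3': 'login#261'}: only the masked column
print(len(right_maskers))  # 5: every column looks masked

join_condition = [("portal#0", "portal#529"), ("id#2", "id#531")]
rewritten = rewrite(join_condition, {**left_maskers, **right_maskers})
print(rewritten)  # [('portal#529', 'portal#529'), ('id#531', 'id#531')]
```

In this model the left-hand attributes `portal#0` and `id#2` are rewritten because the stale right-hand marker maps them to `portal#529` and `id#531`, which gives the self-comparing condition seen in the plan.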
   
   It wasn't obvious to me how best to solve this, so I'm just leaving my 
findings here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

