sauliusvl commented on PR #5468: URL: https://github.com/apache/kyuubi/pull/5468#issuecomment-1831774255
Coincidentally, I also tried debugging this a bit; here's my summary, hopefully it helps. Before the join happens, the table is captured [here](https://github.com/apache/kyuubi/blob/master/extensions/spark/kyuubi-spark-authz/src/main/scala/org/apache/kyuubi/plugin/spark/authz/rule/datamasking/RuleApplyDataMaskingStage0.scala#L72) into the following:

```
DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
+- Project [portal#0, user_id#1L, id#2, null AS login#261, real_name#4]
   +- Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
```

All good so far. However, when the join is resolved, the deduplication rule transforms it into the following:

```
Join LeftOuter, ((portal#0 = portal#529) AND (id#2 = id#531))
:- SubqueryAlias a
:  +- Project [portal#0, id#2]
:     +- SubqueryAlias spark_catalog.tbl0
:        +- DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
:           +- Project [portal#0, user_id#1L, id#2, null AS login#261, real_name#4]
:              +- Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
+- SubqueryAlias b
   +- Project [portal#529, id#531]
      +- SubqueryAlias spark_catalog.tbl0
         +- DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
            +- Project [portal#529, user_id#530L, id#531, null AS login#261, real_name#533]
               +- Relation spark_catalog.tbl0[portal#529,user_id#530L,id#531,login#532,real_name#533] parquet
```

Note that the IDs in the second `DataMaskingStage0Marker` are not updated. I'd guess that's because the original relation is held by the marker itself, so Spark knows nothing about it during the deduplication stage. Because of this, `RuleApplyDataMaskingStage1` gets confused: at [this point](https://github.com/apache/kyuubi/blob/master/extensions/spark/kyuubi-spark-authz/src/main/scala/org/apache/kyuubi/plugin/spark/authz/rule/datamasking/RuleApplyDataMaskingStage1.scala#L64) all columns of the second relation are considered masked, because [this condition](https://github.com/apache/kyuubi/blob/master/extensions/spark/kyuubi-spark-authz/src/main/scala/org/apache/kyuubi/plugin/spark/authz/rule/datamasking/DataMaskingStage0Marker.scala#L29) is always false due to the rewritten IDs. The end result of the stage is this:

```
DataMaskingStage1Marker
+- Join LeftOuter, ((portal#529 = portal#529) AND (id#531 = id#531))
   :- SubqueryAlias a
   :  +- Project [portal#0, id#2]
   :     +- SubqueryAlias spark_catalog.tbl0
   :        +- DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
   :           +- Project [portal#0, user_id#1L, id#2, null AS login#261, real_name#4]
   :              +- Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
   +- SubqueryAlias b
      +- Project [portal#529, id#531]
         +- SubqueryAlias spark_catalog.tbl0
            +- DataMaskingStage0Marker Relation spark_catalog.tbl0[portal#0,user_id#1L,id#2,login#3,real_name#4] parquet
               +- Project [portal#529, user_id#530L, id#531, null AS login#261, real_name#533]
                  +- Relation spark_catalog.tbl0[portal#529,user_id#530L,id#531,login#532,real_name#533] parquet
```

i.e. the join got wrapped in `DataMaskingStage1Marker` [here](https://github.com/apache/kyuubi/blob/master/extensions/spark/kyuubi-spark-authz/src/main/scala/org/apache/kyuubi/plugin/spark/authz/rule/datamasking/RuleApplyDataMaskingStage1.scala#L81) because the join condition changed (and now compares columns with themselves, e.g. `portal#529 = portal#529`).
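For what it's worth, here is a minimal, self-contained sketch of the mechanism I suspect is at play (toy classes, not the actual Spark/Kyuubi types): a rewrite that only recurses into `children` never touches a plan that a node holds in an ordinary field, so the captured relation keeps its old expression IDs while everything around it gets new ones.

```scala
// Toy plan nodes, purely for illustration.
sealed trait Plan { def children: Seq[Plan] }
case class Relation(ids: Seq[Int]) extends Plan { val children: Seq[Plan] = Nil }
case class Project(ids: Seq[Int], child: Plan) extends Plan { val children: Seq[Plan] = Seq(child) }
// Like DataMaskingStage0Marker, this keeps the original relation in a plain
// field (`captured`) that is not one of its children.
case class Marker(captured: Relation, child: Plan) extends Plan { val children: Seq[Plan] = Seq(child) }

// Toy stand-in for deduplication: assign new ids, recursing only into children.
def dedup(plan: Plan, offset: Int): Plan = plan match {
  case Relation(ids)           => Relation(ids.map(_ + offset))
  case Project(ids, child)     => Project(ids.map(_ + offset), dedup(child, offset))
  case Marker(captured, child) => Marker(captured, dedup(child, offset)) // `captured` is never visited
}

val stage0  = Marker(Relation(Seq(0, 1, 2)), Project(Seq(0, 1, 2), Relation(Seq(0, 1, 2))))
val deduped = dedup(stage0, 529)
// `deduped.captured` still has ids 0, 1, 2 while its child now has 529, 530, 531,
// so any comparison keyed on expression ids between the two can never match again.
```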
And thus we end up with the class cast exception, as Spark expected this to be a `Join`. It wasn't obvious to me how to best solve this, just leaving my findings here.
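Just to illustrate the shape of the failure (this is not the actual call site, which I didn't track down): once the join is wrapped in `DataMaskingStage1Marker`, any downstream code that still assumes the node is a `Join` blows up.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}

// Hypothetical downstream step, for illustration only.
def expectJoin(plan: LogicalPlan): Join =
  plan.asInstanceOf[Join] // ClassCastException when plan is DataMaskingStage1Marker(join)
```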
