Hi all, We are adding the support for Ranger column masking and need to reach a consensus on the behavior design.
A column masking policy is something like "only show last 4 chars of phone column to user X". When user X reads the phone column, the value woule be something like "xxxxx6789" instead of the real value "123456789". The behavior is clear when the query is simple. However, there're two different behaviors when the query contains subqueries. The key part is where we should perform the masking, whether in the outer most select list, or in the select list of the inner most subquery. To be specifit, consider these two queries: (1) subquery contains predicates on unmasked value SELECT concat(name, phone) FROM ( SELECT name, phone FROM customer WHERE phone = '123456789' ) t; (2) subquery contains predicates on masked value SELECT concat(name, phone) FROM ( SELECT name, phone FROM customer WHERE phone = 'xxxxx6789' ) t; Let's say there's actually one row in table 'customer' satisfying phone = '123456789'. When user X runs the queries, the two different behaviors are: (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789". (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing. Hive is in behavior (a) since it does a table masking that replaces the TableRef with a subquery containing masked columns. See more in codes: https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155 and some experiments I did: https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing Kurt mentions that traditional dbs like DB2 are in behavior (b). I think we need to decide which behavior we'd like to support. The pros for behavior (a) is no security leak. Because user X can't guess whether there are some customers with phone number '123456789'. The pros for behavior (b) is users don't need to rewrite their existing queries after admin applies column masking policies. What do you think? Thanks, Quanlong