Hey Quanlong, For me it seems more important not to leak confidential information so I'd vote for (a). I wonder what others think.
Gabor On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang <huangquanl...@gmail.com> wrote: > Hi all, > > We are adding the support for Ranger column masking and need to reach a > consensus on the behavior design. > > A column masking policy is something like "only show last 4 chars of phone > column to user X". When user X reads the phone column, the value woule be > something like "xxxxx6789" instead of the real value "123456789". > > The behavior is clear when the query is simple. However, there're two > different behaviors when the query contains subqueries. The key part is > where we should perform the masking, whether in the outer most select list, > or in the select list of the inner most subquery. > > To be specifit, consider these two queries: > (1) subquery contains predicates on unmasked value > SELECT concat(name, phone) FROM ( > SELECT name, phone FROM customer WHERE phone = '123456789' > ) t; > (2) subquery contains predicates on masked value > SELECT concat(name, phone) FROM ( > SELECT name, phone FROM customer WHERE phone = 'xxxxx6789' > ) t; > > Let's say there's actually one row in table 'customer' satisfying phone = > '123456789'. When user X runs the queries, the two different behaviors are: > (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789". > (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing. > > Hive is in behavior (a) since it does a table masking that replaces the > TableRef with a subquery containing masked columns. See more in codes: > > https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155 > and some experiments I did: > > https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing > > Kurt mentions that traditional dbs like DB2 are in behavior (b). I think we > need to decide which behavior we'd like to support. The pros for behavior > (a) is no security leak. Because user X can't guess whether there are some > customers with phone number '123456789'. The pros for behavior (b) is users > don't need to rewrite their existing queries after admin applies column > masking policies. > > What do you think? > > Thanks, > Quanlong >