I think compatibility with Hive is pretty important - the default expectation will be that Ranger policies behave consistently across SQL engines. I think it would be hard to argue for differing default behaviour if it's in some sense less secure.
On Tue, Nov 12, 2019 at 12:03 AM Gabor Kaszab <gaborkas...@apache.org> wrote: > Hey Quanlong, > > For me it seems more important not to leak confidential information so I'd > vote for (a). I wonder what others think. > > Gabor > > On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang <huangquanl...@gmail.com> > wrote: > > > Hi all, > > > > We are adding the support for Ranger column masking and need to reach a > > consensus on the behavior design. > > > > A column masking policy is something like "only show last 4 chars of > phone > > column to user X". When user X reads the phone column, the value woule be > > something like "xxxxx6789" instead of the real value "123456789". > > > > The behavior is clear when the query is simple. However, there're two > > different behaviors when the query contains subqueries. The key part is > > where we should perform the masking, whether in the outer most select > list, > > or in the select list of the inner most subquery. > > > > To be specifit, consider these two queries: > > (1) subquery contains predicates on unmasked value > > SELECT concat(name, phone) FROM ( > > SELECT name, phone FROM customer WHERE phone = '123456789' > > ) t; > > (2) subquery contains predicates on masked value > > SELECT concat(name, phone) FROM ( > > SELECT name, phone FROM customer WHERE phone = 'xxxxx6789' > > ) t; > > > > Let's say there's actually one row in table 'customer' satisfying phone = > > '123456789'. When user X runs the queries, the two different behaviors > are: > > (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789". > > (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing. > > > > Hive is in behavior (a) since it does a table masking that replaces the > > TableRef with a subquery containing masked columns. See more in codes: > > > > > https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155 > > and some experiments I did: > > > > > https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing > > > > Kurt mentions that traditional dbs like DB2 are in behavior (b). I think > we > > need to decide which behavior we'd like to support. The pros for behavior > > (a) is no security leak. Because user X can't guess whether there are > some > > customers with phone number '123456789'. The pros for behavior (b) is > users > > don't need to rewrite their existing queries after admin applies column > > masking policies. > > > > What do you think? > > > > Thanks, > > Quanlong > > >