Any sense what the consumers and end users have asked for regarding behavior?
On Tue, Nov 12, 2019, 1:57 PM Todd Lipcon <t...@cloudera.com> wrote:

> I'd agree that applying it at the innermost column ref makes the most sense
> from a security perspective. Otherwise it's trivial to "binary search" your
> way to the value of a masked column, even if the masking is
> completely "xed" out.
>
> I'm surprised to hear that DB2 implements it otherwise, though quick
> googling agrees with that. Perhaps the assumption there is that anyone who
> is binary-searching to expose data will be caught by audit or other
> security features.
>
> -Todd
>
> On Tue, Nov 12, 2019 at 10:15 AM Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
> > I think compatibility with Hive is pretty important - the default
> > expectation will be that Ranger policies behave consistently across SQL
> > engines. I think it would be hard to argue for differing default
> > behaviour if it's in some sense less secure.
> >
> > On Tue, Nov 12, 2019 at 12:03 AM Gabor Kaszab <gaborkas...@apache.org>
> > wrote:
> >
> > > Hey Quanlong,
> > >
> > > For me it seems more important not to leak confidential information,
> > > so I'd vote for (a). I wonder what others think.
> > >
> > > Gabor
> > >
> > > On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang <huangquanl...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > We are adding support for Ranger column masking and need to reach a
> > > > consensus on the behavior design.
> > > >
> > > > A column masking policy is something like "only show last 4 chars of
> > > > phone column to user X". When user X reads the phone column, the
> > > > value would be something like "xxxxx6789" instead of the real value
> > > > "123456789".
> > > >
> > > > The behavior is clear when the query is simple. However, there are
> > > > two different behaviors when the query contains subqueries. The key
> > > > part is where we should perform the masking: in the outermost select
> > > > list, or in the select list of the innermost subquery.
> > > >
> > > > To be specific, consider these two queries:
> > > > (1) subquery contains predicates on unmasked value
> > > >   SELECT concat(name, phone) FROM (
> > > >     SELECT name, phone FROM customer WHERE phone = '123456789'
> > > >   ) t;
> > > > (2) subquery contains predicates on masked value
> > > >   SELECT concat(name, phone) FROM (
> > > >     SELECT name, phone FROM customer WHERE phone = 'xxxxx6789'
> > > >   ) t;
> > > >
> > > > Let's say there's actually one row in table 'customer' satisfying
> > > > phone = '123456789'. When user X runs the queries, the two different
> > > > behaviors are:
> > > > (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789".
> > > > (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing.
> > > >
> > > > Hive follows behavior (a), since it does table masking that replaces
> > > > the TableRef with a subquery containing masked columns. See the code:
> > > >
> > > > https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155
> > > >
> > > > and some experiments I did:
> > > >
> > > > https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing
> > > >
> > > > Kurt mentions that traditional databases like DB2 follow behavior
> > > > (b). I think we need to decide which behavior we'd like to support.
> > > > The advantage of behavior (a) is that there is no security leak,
> > > > because user X can't guess whether there are any customers with phone
> > > > number '123456789'. The advantage of behavior (b) is that users don't
> > > > need to rewrite their existing queries after an admin applies column
> > > > masking policies.
> > > >
> > > > What do you think?
> > > >
> > > > Thanks,
> > > > Quanlong
> > > >
> > >
> >
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
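
For anyone who wants to try the two behaviors locally, here is a minimal sketch. It makes several assumptions: it uses SQLite through Python's sqlite3 as a stand-in engine (not Impala, Hive, or Ranger), the mask expression is a hypothetical rendering of "only show last 4 chars of phone", and `behavior_a` / `behavior_b` are illustrative names. Behavior (a) is modeled by replacing the table reference with a subquery over masked columns, as the thread describes for Hive; behavior (b) is modeled by masking only in the outermost select list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, phone TEXT)")
conn.execute("INSERT INTO customer VALUES ('Bob', '123456789')")

# Hypothetical mask for "only show last 4 chars of phone" (fixed 'xxxxx' prefix).
MASK = "'xxxxx' || substr(phone, -4)"


def behavior_a(literal):
    """Mask at the innermost table reference: predicates see masked values."""
    sql = f"""
        SELECT name || phone FROM (
            SELECT name, phone FROM (
                SELECT name, {MASK} AS phone FROM customer
            ) AS customer
            WHERE phone = ?
        ) t"""
    return [row[0] for row in conn.execute(sql, (literal,))]


def behavior_b(literal):
    """Mask only in the outermost select list: predicates see real values."""
    sql = f"""
        SELECT name || {MASK} FROM (
            SELECT name, phone FROM customer WHERE phone = ?
        ) t"""
    return [row[0] for row in conn.execute(sql, (literal,))]


if __name__ == "__main__":
    for literal in ("123456789", "xxxxx6789"):
        print(literal, "(a):", behavior_a(literal), "(b):", behavior_b(literal))
```

Under (b), a non-empty result for a guessed literal confirms that the real value exists, which is the leak discussed in the thread. Under (a), the predicate can only ever match the masked form.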