Any sense of what the consumers and end users have asked for regarding
behavior?

On Tue, Nov 12, 2019, 1:57 PM Todd Lipcon <t...@cloudera.com> wrote:

> I'd agree that applying it at the innermost column ref makes the most sense
> from a security perspective. Otherwise it's trivial to "binary search" your
> way to the value of a masked column, even if the masking is
> completely "xed" out.
>
> I'm surprised to hear that DB2 implements it otherwise, though quick
> googling agrees with that. Perhaps the assumption there is that anyone who
> is binary-searching to expose data will be caught by audit or other
> security features.
>
> -Todd
>
> On Tue, Nov 12, 2019 at 10:15 AM Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
> > I think compatibility with Hive is pretty important - the default
> > expectation will be that Ranger policies behave consistently across SQL
> > engines. I think it would be hard to argue for differing default
> > behaviour if it's in some sense less secure.
> >
> > On Tue, Nov 12, 2019 at 12:03 AM Gabor Kaszab <gaborkas...@apache.org>
> > wrote:
> >
> > > Hey Quanlong,
> > >
> > > For me it seems more important not to leak confidential information so
> > > I'd vote for (a). I wonder what others think.
> > >
> > > Gabor
> > >
> > > On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang <huangquanl...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > We are adding support for Ranger column masking and need to reach a
> > > > consensus on the behavior design.
> > > >
> > > > A column masking policy is something like "only show last 4 chars of
> > > > phone column to user X". When user X reads the phone column, the value
> > > > would be something like "xxxxx6789" instead of the real value
> > > > "123456789".
> > > >
> > > > The behavior is clear when the query is simple. However, there are two
> > > > different behaviors when the query contains subqueries. The key question
> > > > is where we should perform the masking: in the outermost select list,
> > > > or in the select list of the innermost subquery.
> > > >
> > > > To be specific, consider these two queries:
> > > > (1) subquery contains predicates on unmasked value
> > > >   SELECT concat(name, phone) FROM (
> > > >     SELECT name, phone FROM customer WHERE phone = '123456789'
> > > >   ) t;
> > > > (2) subquery contains predicates on masked value
> > > >   SELECT concat(name, phone) FROM (
> > > >     SELECT name, phone FROM customer WHERE phone = 'xxxxx6789'
> > > >   ) t;
> > > >
> > > > Let's say there's actually one row in table 'customer' satisfying
> > > > phone = '123456789'. When user X runs the queries, the two different
> > > > behaviors are:
> > > > (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789".
> > > > (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing.
> > > >
> > > > Hive follows behavior (a) since it does a table masking that replaces
> > > > the TableRef with a subquery containing masked columns. See more in the
> > > > code:
> > > >
> > > > https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155
> > > > and some experiments I did:
> > > >
> > > > https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing
> > > >
> > > > Kurt mentions that traditional DBs like DB2 follow behavior (b). I think
> > > > we need to decide which behavior we'd like to support. The advantage of
> > > > behavior (a) is that there is no security leak, because user X can't
> > > > guess whether there are customers with phone number '123456789'. The
> > > > advantage of behavior (b) is that users don't need to rewrite their
> > > > existing queries after an admin applies column masking policies.
> > > >
> > > > What do you think?
> > > >
> > > > Thanks,
> > > > Quanlong
> > > >
> > >
> >
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
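
For readers who want to try the two behaviors from the original mail, here is
a minimal sketch using Python's built-in sqlite3 module. It is not Impala,
Hive, or Ranger code. The MASK expression, the helper names (make_db,
run_query, probe_phone) and the use of || in place of concat() are all
assumptions made for illustration. Behavior (a) is modelled the way the thread
describes Hive: the table reference is swapped for a subquery of masked
columns. Behavior (b) masks only in the outermost select list. probe_phone
shows the "binary search" concern using nothing but whether a row came back.

```python
import sqlite3

# Hypothetical mask expression: keep the last 4 characters, 'x' out the rest.
MASK = "substr('xxxxxxxxxxxxxxxx', 1, length(phone) - 4) || substr(phone, -4)"


def make_db():
    """One-row 'customer' table matching the example in the thread."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (name TEXT, phone TEXT)")
    conn.execute("INSERT INTO customer VALUES ('Bob', '123456789')")
    return conn


def run_query(conn, behavior, literal, op="="):
    """Run the thread's example query as user X under behavior 'a' or 'b'."""
    if op not in ("=", ">="):
        raise ValueError("unsupported operator")
    if behavior == "a":
        # (a): the table reference becomes a masked subquery, so the
        # predicate is evaluated against masked values.
        source = f"(SELECT name, {MASK} AS phone FROM customer) AS customer"
        outer = "name || phone"
    else:
        # (b): predicates see real values; masking happens only in the
        # outermost select list.
        source = "customer"
        outer = f"name || {MASK}"
    sql = (
        f"SELECT {outer} FROM ("
        f"  SELECT name, phone FROM {source} WHERE phone {op} ?"
        f") t"
    )
    return [row[0] for row in conn.execute(sql, (literal,))]


def probe_phone(conn, behavior):
    """Binary-search a 9-digit phone using only 'did any row come back?'."""
    lo, hi = 0, 10**9 - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if run_query(conn, behavior, f"{mid:09d}", op=">="):
            lo = mid  # some row has phone >= mid
        else:
            hi = mid - 1
    return f"{lo:09d}"
```

With the single row from the example, run_query reproduces the (a)/(b)
outcomes listed in the original mail. Under (b), probe_phone converges on the
stored value in about 30 probes; under (a) the predicate only ever compares
against the masked string, so the same probing does not recover it.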
