I think compatibility with Hive is pretty important - the default
expectation will be that Ranger policies behave consistently across SQL
engines. I think it would be hard to argue for differing default behaviour
if it's in some sense less secure.

On Tue, Nov 12, 2019 at 12:03 AM Gabor Kaszab <gaborkas...@apache.org>
wrote:

> Hey Quanlong,
>
> For me it seems more important not to leak confidential information so I'd
> vote for (a). I wonder what others think.
>
> Gabor
>
> On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang <huangquanl...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > We are adding support for Ranger column masking and need to reach
> > consensus on the behavior design.
> >
> > A column masking policy is something like "only show last 4 chars of the
> > phone column to user X". When user X reads the phone column, the value
> > would be something like "xxxxx6789" instead of the real value
> > "123456789".
> >
> > The behavior is clear when the query is simple. However, there are two
> > possible behaviors when the query contains subqueries. The key question
> > is where the masking should be performed: in the outermost select list,
> > or in the select list of the innermost subquery.
> >
> > To be specific, consider these two queries:
> > (1) subquery contains predicates on unmasked value
> >   SELECT concat(name, phone) FROM (
> >     SELECT name, phone FROM customer WHERE phone = '123456789'
> >   ) t;
> > (2) subquery contains predicates on masked value
> >   SELECT concat(name, phone) FROM (
> >     SELECT name, phone FROM customer WHERE phone = 'xxxxx6789'
> >   ) t;
> >
> > Let's say there's actually one row in table 'customer' satisfying
> > phone = '123456789', and its name is 'Bob'. When user X runs the queries,
> > the two different behaviors are:
> > (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789".
> > (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing.
> >
> > Hive implements behavior (a): its table masking replaces the TableRef
> > with a subquery containing the masked columns. See the code:
> >
> > https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155
> >
> > and some experiments I did:
> >
> > https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing
> >
> > Kurt mentions that traditional databases like DB2 follow behavior (b). I
> > think we need to decide which behavior we'd like to support. The
> > advantage of behavior (a) is that there is no security leak: user X
> > can't find out whether there are customers with phone number
> > '123456789'. The advantage of behavior (b) is that users don't need to
> > rewrite their existing queries after an admin applies column masking
> > policies.
> >
> > What do you think?
> >
> > Thanks,
> > Quanlong
> >
>
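
For anyone who wants to poke at the two behaviours locally, here is a rough
sketch using Python's built-in sqlite3 module rather than Impala or Hive. It
is only an illustration under assumptions: the mask_show_last_4 function, the
run_a/run_b helpers and the MASKED_TABLE string are hypothetical names made up
for this example, the rewrites are written by hand rather than produced by any
planner, and || stands in for concat() because older SQLite builds lack it.

```python
import sqlite3


def mask_show_last_4(value):
    """Hypothetical mask: replace all but the last 4 characters with 'x'."""
    if value is None:
        return None
    return "x" * max(len(value) - 4, 0) + value[-4:]


conn = sqlite3.connect(":memory:")
# Register the mask as a one-argument SQL function (assumed name).
conn.create_function("mask_show_last_4", 1, mask_show_last_4)
conn.execute("CREATE TABLE customer (name TEXT, phone TEXT)")
conn.execute("INSERT INTO customer VALUES ('Bob', '123456789')")

# Behaviour (a): the table reference is replaced by a subquery that exposes
# masked columns, so predicates in the user's query see the masked value.
MASKED_TABLE = "(SELECT name, mask_show_last_4(phone) AS phone FROM customer)"


def run_a(literal):
    sql = f"""
        SELECT name || phone FROM (
            SELECT name, phone FROM {MASKED_TABLE} AS customer
            WHERE phone = ?
        ) t"""
    return [row[0] for row in conn.execute(sql, (literal,))]


# Behaviour (b): predicates see the real value; the mask is applied only
# in the outermost select list.
def run_b(literal):
    sql = """
        SELECT name || mask_show_last_4(phone) FROM (
            SELECT name, phone FROM customer WHERE phone = ?
        ) t"""
    return [row[0] for row in conn.execute(sql, (literal,))]


for label, run in (("a", run_a), ("b", run_b)):
    print(label, "Query1:", run("123456789"), "Query2:", run("xxxxx6789"))
```

Under (a) only the masked literal matches, and under (b) only the real one
does, which is the leak being discussed: with (b), a non-empty result for
Query1 tells user X that the unmasked number exists.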
