I got a little info from Guther on this. Apparently the masking behavior
was driven by specific customer(s) at the time, and masking was applied to
all column references because of concerns about leaking data. Regardless of
the reasoning, we have to follow Hive's semantics at this point. We could
always provide the other [top-level select list only] mode later if
that is requested.
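
For reference, the difference between the two modes can be sketched with a
small self-contained script. This is only an illustration under stated
assumptions: it uses an in-memory SQLite table, a hypothetical mask_last4
function standing in for a "show last 4" policy, and hand-written query
rewrites. It is not how Hive, Ranger, or Impala implement masking.

```python
import sqlite3

def mask_last4(value):
    # Hypothetical mask (not a Ranger/Hive UDF): keep the last 4
    # characters and replace everything before them with 'x'.
    if value is None:
        return None
    return "x" * max(len(value) - 4, 0) + value[-4:]

conn = sqlite3.connect(":memory:")
conn.create_function("mask_last4", 1, mask_last4)
conn.execute("CREATE TABLE customer (name TEXT, phone TEXT)")
conn.execute("INSERT INTO customer VALUES ('Bob', '123456789')")

def mask_all_column_refs(literal):
    # Behavior (a): the table reference is replaced by a subquery that
    # exposes only masked columns, so the predicate sees masked values.
    # '||' is used instead of concat() for portability across SQLite versions.
    sql = """
        SELECT name || phone FROM (
            SELECT name, phone FROM (
                SELECT name, mask_last4(phone) AS phone FROM customer
            ) AS customer
            WHERE phone = ?
        ) AS t
    """
    return [row[0] for row in conn.execute(sql, (literal,))]

def mask_top_level_only(literal):
    # Behavior (b): the predicate sees the real value and the mask is
    # applied only in the outermost select list.
    sql = """
        SELECT name || mask_last4(phone) FROM (
            SELECT name, phone FROM customer WHERE phone = ?
        ) AS t
    """
    return [row[0] for row in conn.execute(sql, (literal,))]

print(mask_all_column_refs("123456789"))  # []
print(mask_all_column_refs("xxxxx6789"))  # ['Bobxxxxx6789']
print(mask_top_level_only("123456789"))   # ['Bobxxxxx6789']
print(mask_top_level_only("xxxxx6789"))   # []
```

The first function mirrors behavior (a) from the original mail and the
second mirrors behavior (b).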

On Thu, Nov 14, 2019 at 8:17 PM Shant Hovsepian <sh...@arcadiadata.com>
wrote:

> Any sense what the consumers and end users have asked for regarding
> behavior?
>
> On Tue, Nov 12, 2019, 1:57 PM Todd Lipcon <t...@cloudera.com> wrote:
>
> > I'd agree that applying it at the innermost column ref makes the most
> > sense from a security perspective. Otherwise it's trivial to "binary
> > search" your way to the value of a masked column, even if the masking
> > is completely "xed" out.
> >
> > I'm surprised to hear that DB2 implements it otherwise, though quick
> > googling agrees with that. Perhaps the assumption there is that anyone
> > who is binary-searching to expose data will be caught by audit or other
> > security features.
> >
> > -Todd
> >
> > On Tue, Nov 12, 2019 at 10:15 AM Tim Armstrong <tarmstr...@cloudera.com>
> > wrote:
> >
> > > I think compatibility with Hive is pretty important - the default
> > > expectation will be that Ranger policies behave consistently across
> > > SQL engines. I think it would be hard to argue for differing default
> > > behaviour if it's in some sense less secure.
> > >
> > > On Tue, Nov 12, 2019 at 12:03 AM Gabor Kaszab <gaborkas...@apache.org>
> > > wrote:
> > >
> > > > Hey Quanlong,
> > > >
> > > > For me it seems more important not to leak confidential
> > > > information, so I'd vote for (a). I wonder what others think.
> > > >
> > > > Gabor
> > > >
> > > > On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang <huangquanl...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We are adding support for Ranger column masking and need to reach
> > > > > a consensus on the behavior design.
> > > > >
> > > > > A column masking policy is something like "only show last 4 chars
> > > > > of phone column to user X". When user X reads the phone column, the
> > > > > value would be something like "xxxxx6789" instead of the real value
> > > > > "123456789".
> > > > >
> > > > > The behavior is clear when the query is simple. However, there are
> > > > > two different behaviors when the query contains subqueries. The key
> > > > > question is where we should perform the masking: in the outermost
> > > > > select list, or in the select list of the innermost subquery.
> > > > >
> > > > > To be specific, consider these two queries:
> > > > > (1) subquery contains predicates on unmasked value
> > > > >   SELECT concat(name, phone) FROM (
> > > > >     SELECT name, phone FROM customer WHERE phone = '123456789'
> > > > >   ) t;
> > > > > (2) subquery contains predicates on masked value
> > > > >   SELECT concat(name, phone) FROM (
> > > > >     SELECT name, phone FROM customer WHERE phone = 'xxxxx6789'
> > > > >   ) t;
> > > > >
> > > > > Let's say there's actually one row in table 'customer' satisfying
> > > > > phone = '123456789'. When user X runs the queries, the two different
> > > > > behaviors are:
> > > > > (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789".
> > > > > (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing.
> > > > > Hive is in behavior (a) since it does a table masking that replaces
> > > > > the TableRef with a subquery containing masked columns. See more in
> > > > > the code:
> > > > >
> > > > > https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155
> > > > >
> > > > > and some experiments I did:
> > > > >
> > > > > https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing
> > > > >
> > > > > Kurt mentions that traditional DBs like DB2 are in behavior (b). I
> > > > > think we need to decide which behavior we'd like to support. The
> > > > > pro of behavior (a) is that there is no security leak, because user
> > > > > X can't guess whether there are customers with phone number
> > > > > '123456789'. The pro of behavior (b) is that users don't need to
> > > > > rewrite their existing queries after an admin applies column
> > > > > masking policies.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Thanks,
> > > > > Quanlong
> > > > >
> > > >
> > >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>
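
The "binary search" concern raised in the thread can be sketched as
follows. This is a minimal illustration, assuming an in-memory SQLite
table, a fully "xed out" mask, and a phone column holding a 9-digit
number without leading zeros. The helper names are hypothetical. Under
behavior (b) the predicate is evaluated against the real value, so the
presence or absence of a result row leaks one comparison per query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, phone TEXT)")
conn.execute("INSERT INTO customer VALUES ('Bob', '123456789')")

def rows_visible(threshold):
    # Behavior (b): the output is fully masked, but the inner predicate
    # still compares against the unmasked phone value.
    sql = """
        SELECT name, 'xxxxxxxxx' AS phone FROM (
            SELECT name, phone FROM customer
            WHERE name = 'Bob' AND CAST(phone AS INTEGER) >= ?
        ) AS t
    """
    return len(conn.execute(sql, (threshold,)).fetchall()) > 0

def recover_phone():
    # Find the largest threshold that still returns a row; that
    # threshold equals the hidden value.
    lo, hi = 0, 999_999_999  # assumed range of a 9-digit number
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if rows_visible(mid):
            lo = mid       # hidden value is >= mid
        else:
            hi = mid - 1   # hidden value is < mid
    return lo

print(recover_phone())  # 123456789
```

With masking applied at the innermost column reference, as in behavior
(a), the same probe would compare against the masked value and would not
narrow down the real one.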
