Hey Quanlong,

For me it seems more important not to leak confidential information so I'd
vote for (a). I wonder what others think.

Gabor

On Mon, Nov 11, 2019 at 1:04 PM Quanlong Huang <huangquanl...@gmail.com>
wrote:

> Hi all,
>
> We are adding the support for Ranger column masking and need to reach a
> consensus on the behavior design.
>
> A column masking policy is something like "only show last 4 chars of phone
> column to user X". When user X reads the phone column, the value woule be
> something like "xxxxx6789" instead of the real value "123456789".
>
> The behavior is clear when the query is simple. However, there're two
> different behaviors when the query contains subqueries. The key part is
> where we should perform the masking, whether in the outer most select list,
> or in the select list of the inner most subquery.
>
> To be specifit, consider these two queries:
> (1) subquery contains predicates on unmasked value
>   SELECT concat(name, phone) FROM (
>     SELECT name, phone FROM customer WHERE phone = '123456789'
>   ) t;
> (2) subquery contains predicates on masked value
>   SELECT concat(name, phone) FROM (
>     SELECT name, phone FROM customer WHERE phone = 'xxxxx6789'
>   ) t;
>
> Let's say there's actually one row in table 'customer' satisfying phone =
> '123456789'. When user X runs the queries, the two different behaviors are:
> (a) Query1 returns nothing. Query2 returns one result: "Bobxxxxx6789".
> (b) Query1 returns one result: "Bobxxxxx6789". Query2 returns nothing.
>
> Hive is in behavior (a) since it does a table masking that replaces the
> TableRef with a subquery containing masked columns. See more in codes:
>
> https://github.com/apache/hive/blob/rel/release-3.1.2/ql/src/java/org/apache/hadoop/hive/ql/parse/TableMask.java#L86-L155
> and some experiments I did:
>
> https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing
>
> Kurt mentions that traditional dbs like DB2 are in behavior (b). I think we
> need to decide which behavior we'd like to support. The pros for behavior
> (a) is no security leak. Because user X can't guess whether there are some
> customers with phone number '123456789'. The pros for behavior (b) is users
> don't need to rewrite their existing queries after admin applies column
> masking policies.
>
> What do you think?
>
> Thanks,
> Quanlong
>

Reply via email to