Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14894 )

Change subject: IMPALA-9009: Core support for Ranger column masking
......................................................................


Patch Set 12:

(7 comments)

Thanks for your detailed reviews, guys! Will fix the bug found. About our 
column masking behaviors, there're more details in

* experiment doc: 
https://docs.google.com/document/d/1LYk2wxT3GMw4ur5y9JBBykolfAs31P3gWRStk21PomM/edit?usp=sharing
* Mailing-list thread: 
https://lists.apache.org/thread.html/93c4a6a7f20c88e9472067074c9746b99f0a47880aaed1417b9771ac%40%3Cdev.impala.apache.org%3E
* design doc: 
https://docs.google.com/document/d/1GC7au6F5Snp8zQisRopOhKSjKsI1XPPg8S2foQxfJrA/edit#

http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/analysis/TableRef.java
File fe/src/main/java/org/apache/impala/analysis/TableRef.java:

http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/analysis/TableRef.java@27
PS12, Line 27: import java.util.UUID;
> Is this needed? Don't see it getting used anywhere in the diffs.
Done


http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java
File 
fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java:

http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@283
PS12, Line 283: maskType
> Do we need to add Preconditions here to check if maskType and maskTypeDef a
Sure. Will add the Preconditions check for maskType. Actually if maskType is 
null, accessResult.isMaskEnabled() will return false. But we do need to verify 
it.

For maskTypeDef, it can be null and we have dealt with it at line 296.


http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/main/java/org/apache/impala/authorization/ranger/RangerAuthorizationChecker.java@290
PS12, Line 290: mask
> Curious to understand: Do these masks work as UDFs? How are they loaded?
Yes, these builtin functions are being added in another patch: 
https://gerrit.cloudera.org/c/14963/


http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/test/java/org/apache/impala/authorization/AuthorizationStmtTest.java
File 
fe/src/test/java/org/apache/impala/authorization/AuthorizationStmtTest.java:

http://gerrit.cloudera.org:8080/#/c/14894/12/fe/src/test/java/org/apache/impala/authorization/AuthorizationStmtTest.java@a2923
PS12, Line 2923:
> As I recall, the second argument to the function substr() starts from 1 ins
Sure. It's added long time before in other patches and this doesn't trigger 
analysis exceptions. Will change 0 to 1 if needed.


http://gerrit.cloudera.org:8080/#/c/14894/12/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test
File 
testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test:

http://gerrit.cloudera.org:8080/#/c/14894/12/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@227
PS12, Line 227: 100
> If my understanding is correct, both the table 'alltypestiny' and the table
Changing 100 to 1234 then the results is empty. Actually if Impala incorrectly 
performs column masking on the local view 'alltypes', 100 will be masked into 
10000 (id => id * 100) and the results is empty. So I think 100 is better than 
1234.


http://gerrit.cloudera.org:8080/#/c/14894/12/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@240
PS12, Line 240: 0aaaaaa
> > I don't understand this value - the mask on alltypestiny.string_col should 
> > only add 'aaa', not 'aaaaaa'. Or I misunderstood something?

> It looks like the 3rd column masking policy (the one concatenating the 
> content of {col} with 'aaa') is applied twice. One in "select id, bool_col, 
> string_col from alltypestiny" and the other in "select iv1.*, iv2.*, 
> v.string_col from iv1, iv2, alltypes_view v". I am not very sure if my 
> understanding is correct.

Sorry that this is a bug... We should not apply the masking policy on 
alltypestiny.string_col twice. Didn't check the copy-pasted result carefully...

> Intuitively, when column masking is enabled, the number of rows returned 
> should still be the same.

No. As discussed in the mailing-list: 
https://lists.apache.org/thread.html/93c4a6a7f20c88e9472067074c9746b99f0a47880aaed1417b9771ac%40%3Cdev.impala.apache.org%3E,
 we choose to be consistent with Hive's behavior. So all expressions using a 
column with a masking policy will perform on its *masked* value. The number of 
rows could change due to this.

> After some thoughts, I have another related question. According to the tests 
> provided in this file, it seems that the predicates involved in a join are 
> evaluated on the masked/transformed column values if there exist some 
> corresponding column masking policies.

Yes, this is Hive's behavior.

> If this is the case, I was wondering whether or not the order of tables in 
> the from clause would affect the results of the join. To be more specific, is 
> it guaranteed that the following two SQL statements produce the same set of 
> result given "any" set of  column masking policies and "any" set of source 
> tables? Thanks!
>
> Query 1:
select a.id, s.bool_col, t.bool_col, s.string_col, t.string_col
from functional.alltypes a,
  functional.alltypestiny t,
  functional.alltypessmall s
where a.id = t.id and t.id = s.id;
>
> Query 2:
select a.id, s.bool_col, t.bool_col, s.string_col, t.string_col
from functional.alltypes a,
  functional.alltypessmall s,
  functional.alltypestiny t
where a.id = t.id and t.id = s.id;

I think for INNER join, it's guaranteed since the join order doesn't affect the 
result. You can image 'alltypestiny' as a new table with masked values. And the 
same for 'alltypes' that also has column masking policies.


http://gerrit.cloudera.org:8080/#/c/14894/12/testdata/workloads/functional-query/queries/QueryTest/ranger_column_masking.test@261
PS12, Line 261: Should not mask the columns when used in sql generations
> I am a bit confused by this comment. Does that mean after creating the view
Sorry to make you misunderstand the comment. "sql generation" here means 
generating the query string for CreateView or AlterView statements. For such 
kinds of statements, Impala will analyze their AST (e.g. for syntax check or 
some rewrites). Then regenerate the QueryStmt part by using toSql() of the AST 
node (
https://github.com/apache/impala/blob/497a17d/fe/src/main/java/org/apache/impala/analysis/CreateOrAlterViewStmtBase.java#L150)
 and use the generated sql to create/alter view in Hive.

The comment means although we have column masking policies on table 
'alltypestiny', we don't rewrite any queries on it used in CreateView or 
AlterView. So the result of ShowCreateView won't be something like

 CREATE VIEW $UNIQUE_DB.masked_view AS
 SELECT * FROM (
   select id * 100 as id, bool_col, tinyint_col, ... from 
functional.alltypestiny) t



--
To view, visit http://gerrit.cloudera.org:8080/14894
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4cad60e0e69ea573b7ecfc011b142c46ef52ed61
Gerrit-Change-Number: 14894
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Fang-Yu Rao <fangyu....@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vih...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jan 2020 03:36:31 +0000
Gerrit-HasComments: Yes

Reply via email to