[ 
https://issues.apache.org/jira/browse/HIVE-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531443#comment-14531443
 ] 

Sergio Peña commented on HIVE-8065:
-----------------------------------

Hey [~thejas]

Here are some answers about the issues:

1. If the encrypted zone where the results will be written is read-only, then 
Hive will try to use the directory set by {{hive.exec.scratchdir}}, but only if 
that scratch directory is encrypted as well (see HIVE-8945). This might create a 
performance issue if the encrypted scratch directory is in a different 
encryption zone, because the results have to be copied rather than renamed. The 
user may change {{hive.exec.scratchdir}} to a writable directory inside the same 
encryption zone to make the move faster. This might be a little tedious for 
users, but it is the only way to protect their data.

2. This is a little tricky. Currently, Hive selects the encryption zone that 
has the strongest cipher (AES-256 over AES-128), and uses that location to 
store all final and intermediate results. This avoids writing intermediate data 
from an AES-256 zone into an AES-128 zone, and then writing the final result 
back to AES-256. Here we have another performance issue: final result files 
would be copied (and not renamed) to the destination table, as the encryption 
zones might be different.
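
The selection rule in point 2 can be illustrated with a minimal sketch. This is 
not Hive's actual code: {{Zone}}, {{pick_staging_zone}} and {{move_strategy}} 
are hypothetical names, key strength is reduced to a bit count (0 meaning 
unencrypted), and two locations are assumed to be in the same zone only when 
their paths are equal, which is a simplification of how HDFS resolves zones.

```python
from typing import NamedTuple, Sequence


class Zone(NamedTuple):
    """Hypothetical description of a table location; not a Hive or HDFS type."""
    path: str
    key_bits: int  # 0 means the location is not encrypted


def pick_staging_zone(zones: Sequence[Zone]) -> Zone:
    """Return the location with the strongest key; ties keep the first one seen."""
    if not zones:
        raise ValueError("at least one location is required")
    return max(zones, key=lambda z: z.key_bits)


def move_strategy(staging: Zone, destination: Zone) -> str:
    """Rename inside one zone; fall back to a copy across zones.

    Assumption: path equality stands in for "same encryption zone".
    """
    return "rename" if staging.path == destination.path else "copy"


aes128 = Zone("/warehouse/table_aes128", 128)
aes256 = Zone("/warehouse/table_aes256", 256)

staging = pick_staging_zone([aes128, aes256])
print(staging.path)                    # /warehouse/table_aes256
print(move_strategy(staging, aes128))  # copy
print(move_strategy(staging, aes256))  # rename
```

In this sketch, a query that reads from both tables and writes into the AES-128 
table stages its data in the AES-256 location, so the final step is a copy 
rather than a rename, which is the performance cost described above.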

We did not do any work to deny access to stored results in another encryption 
zone. The solution only prevents encrypted data from touching non-encrypted 
zones or more weakly encrypted zones. Maybe other solutions, like Sentry, could 
handle this access control. But without an access control mechanism, this issue 
exists on the scratch directory as well, doesn't it?



> Support HDFS encryption functionality on Hive
> ---------------------------------------------
>
>                 Key: HIVE-8065
>                 URL: https://issues.apache.org/jira/browse/HIVE-8065
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 0.13.1
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>              Labels: Hive-Scrum
>
> The new encryption support on HDFS makes Hive incompatible and unusable when 
> this feature is used.
> HDFS encryption is designed so that a user can configure different 
> encryption zones (or directories) for multi-tenant environments. An 
> encryption zone has an exclusive encryption key, such as AES-128 or AES-256. 
> For security compliance reasons, HDFS does not allow files to be moved or 
> renamed between encryption zones. Renames are allowed only inside the same 
> encryption zone. A copy is allowed between encryption zones.
> See HDFS-6134 for more details about HDFS encryption design.
> Hive currently uses a scratch directory (like /tmp/$user/$random). This 
> scratch directory is used for the output of intermediate data (between MR 
> jobs) and for the final output of the hive query which is later moved to the 
> table directory location.
> If Hive tables are in different encryption zones than the scratch directory, 
> then Hive won't be able to rename those files/directories, which makes Hive 
> unusable.
> To handle this problem, we can change the scratch directory of the 
> query/statement to be inside the same encryption zone as the table directory 
> location. This way, the renaming process will be successful.
> Also, for statements that move files between encryption zones (e.g. LOAD 
> DATA), a copy may be executed instead of a rename. This will cause an 
> overhead when copying large data files, but it won't break the encryption on 
> Hive.
> Another security concern arises with SELECT statements that use joins. If 
> Hive joins tables with different encryption key strengths, then the results 
> of the select might break the security compliance of the tables. Let's say 
> two tables with 128-bit and 256-bit encryption are joined; the temporary 
> results might then be stored in the 128-bit encryption zone. This would 
> temporarily conflict with the compliance of the table encrypted with 256 bits.
> To fix this, Hive should be able to select the scratch directory that is more 
> strongly encrypted in order to store the intermediate data temporarily with 
> no compliance issues.
> For instance:
> {noformat}
> SELECT * FROM table-aes128 t1 JOIN table-aes256 t2 WHERE t1.id == t2.id;
> {noformat}
> - This should use a scratch directory (or staging directory) inside the 
> table-aes256 table location.
> {noformat}
> INSERT OVERWRITE TABLE table-unencrypted SELECT * FROM table-aes1;
> {noformat}
> - This should use a scratch directory inside the table-aes1 location.
> {noformat}
> FROM table-unencrypted
> INSERT OVERWRITE TABLE table-aes128 SELECT id, name
> INSERT OVERWRITE TABLE table-aes256 SELECT id, name
> {noformat}
> - This should use a scratch directory in each of the tables' locations.
> - The first SELECT will have its scratch directory in the table-aes128 directory.
> - The second SELECT will have its scratch directory in the table-aes256 directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
