Janos Kovacs created HIVE-27322:
-----------------------------------

             Summary: Iceberg: metadata location overrides can cause data breach
                 Key: HIVE-27322
                 URL: https://issues.apache.org/jira/browse/HIVE-27322
             Project: Hive
          Issue Type: Bug
          Components: Iceberg integration
    Affects Versions: 4.0.0-alpha-2
            Reporter: Janos Kovacs


Set to Bug/Blocker instead of Enhancement due to its security-related nature; 
Hive 4 should not be released without a fix for this. Please reset if needed.

 

Context: 
 * There are some core tables with sensitive data that users can only query 
with data masking enforced (e.g. via Ranger). Let's assume this is the 
`default.icebergsecured` table.
 * An end-user can only access the masked form of the sensitive data, as 
expected...
 * The users also have the privilege to create new tables in their own sandbox 
databases - let's assume this is the `default.trojanhorse` table for now.
 * The user can create a malicious table that exposes the sensitive data 
unmasked, leading to a possible data breach.

Repro:
 * First make sure the data is secured by the masking policy:
{noformat}
<kinit as privileged user>
beeline -e "
DROP TABLE IF EXISTS default.icebergsecured PURGE;
CREATE EXTERNAL TABLE default.icebergsecured (txt string, secret string) STORED BY ICEBERG;
INSERT INTO default.icebergsecured VALUES ('You might be allowed to see this.','You are NOT allowed to see this!');
"

<kinit as end user>
beeline -e "
SELECT * FROM default.icebergsecured;
"

+------------------------------------+--------------------------------+
|         icebergsecured.txt         |     icebergsecured.secret      |
+------------------------------------+--------------------------------+
| You might be allowed to see this.  | MASKED BY RANGER FOR SECURITY  |
+------------------------------------+--------------------------------+
{noformat}

 * Now let the user create the malicious table exposing the sensitive data:
{noformat}
<kinit as end user>
SECURED_META_LOCATION=$(HADOOP_CLIENT_OPTS="-Djline.terminal=jline.UnsupportedTerminal" \
  beeline -e "DESCRIBE FORMATTED default.icebergsecured;" 2>/dev/null \
  | grep metadata_location | grep -v previous_metadata_location | awk '{print $5}')
beeline -e "
DROP TABLE IF EXISTS default.trojanhorse;
CREATE EXTERNAL TABLE default.trojanhorse (txt string, secret string) STORED BY ICEBERG
TBLPROPERTIES (
  'metadata_location'='${SECURED_META_LOCATION}');
SELECT * FROM default.trojanhorse;
"

+------------------------------------+-----------------------------------+
|          trojanhorse.txt           |        trojanhorse.secret         |
+------------------------------------+-----------------------------------+
| You might be allowed to see this.  | You are NOT allowed to see this!  |
+------------------------------------+-----------------------------------+
{noformat}

 

Currently - after HIVE-26707 - the rwstorage authorization only has either the 
dummy path or the explicit path set in the URI:
{noformat}
Permission denied: user [oozie] does not have [RWSTORAGE] privilege on 
[iceberg://default/trojanhorse?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Ftrojanhorse%2Fmetadata%2Fdummy.metadata.json]

Permission denied: user [oozie] does not have [RWSTORAGE] privilege on 
[iceberg://default/trojanhorse?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2Ficebergsecured%2Fmetadata%2F00001-f4c2a428-30ce-4afd-82ff-d46ecbf02244.metadata.json]
 
{noformat}
This can be used only to decide whether a user is allowed to create Iceberg 
tables in certain databases with certain names, but controlling the metadata 
location is hard in that form:
 * it does not provide a variable for the "default table location", so a rule 
needs to know the per-database table location or per-catalog warehouse 
location to be able to construct it (see the hypothetical pattern below)
 * it does not provide a rich enough regex to filter out `/../` style 
directory references
 * basically there should also be a flag indicating whether an explicit 
metadata location was provided instead of the dummy reference, which again 
needs explicit matching in the policy to handle
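
For illustration only - the pattern below is hypothetical, with the warehouse 
path taken from the repro above - a rule that tries to allow only default 
(dummy) locations today would have to hard-code the URL-encoded warehouse path 
and the dummy file name, and repeat that for every database and warehouse 
layout in use:
{noformat}
iceberg://mydatabase/*?snapshot=%2Fwarehouse%2Ftablespace%2Fexternal%2Fhive%2F*%2Fmetadata%2Fdummy.metadata.json{noformat}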

 

Proposed enhancement:
 * The URL for the Iceberg table's rwstorage authorization should be changed 
in the following way:
 ** the <database>/<table>?<location> form is good, but
 *** the location should not be URL-encoded, or at least the authorizer should 
check the policy against the decoded URL
 *** the separator between the table and the location should be "/" instead of 
"?", as "?" might be mixed up with its regex meaning!
 *** "/" as a separator can also be confusing, as absolute paths would start 
with it. Another separator character that does not conflict with regex, paths 
and valid table-name characters might be even better.
 ** the "snapshot=" part seems not relevant in this context; it should not be 
part of the <location>
 ** there is a need to differentiate the case where the location is only 
generated from the case where the location is explicitly provided by the end 
user. For this, the "default" location might just not be generated as a path 
but replaced with the fixed value "default_location" - note that it has no 
leading "/". That way a single policy definition could be used to cover all 
tables in their default locations, like:
{noformat}
iceberg://mydatabase/*/default_location           or
iceberg://mydatabase/*/snapshot=default_location{noformat}

 *** "default" here means the table's default location, which can depend on 
the warehouse location, the database location and the table's explicit 
location
 *** I know many developers don't like this type of hardcoded static value, 
but with such a value there is no need to modify the rwstorage authorization 
(and we are already using a similar method for the "METASTORE" type of 
storage-handler authorization)
 * The authorization request should include the rwstorage authorization only 
if the CREATE/ALTER/DROP targets the Iceberg table, and not if the Iceberg 
table is only a source in such statements - that might already be fixed by 
HIVE-27304
 * When a custom Iceberg metadata.json location is provided, the <location> 
must contain the provided path to be able to properly authorize it.
 * We should either disallow the "/../" backstep in any location or give an 
option to filter these out in the final authorization step.
For example, a policy on
{noformat}
iceberg://mydatabase/mytable//data/use-case-1/*             or
iceberg://mydatabase/mytable/snapshot=/data/use-case-1/*{noformat}
should not match access for
{noformat}
iceberg://mydatabase/mytable//data/use-case-1/../use-case-2/*          or
iceberg://mydatabase/mytable/snapshot=/data/use-case-1/../use-case-2/*{noformat}

 ** Not allowing "/../" might be easier to handle on the Hive/Impala side
 ** But "/../" might still be valid if it is used within an allowed location 
--> users should just use a proper location w/o "/../" (see the sketch below)
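
For the backstep handling, a minimal sketch (not actual Hive or Ranger code; 
the class and method names are made up for illustration) of what the final 
authorization step could do: decode the user-provided location and collapse 
any "/../" segments before matching, or simply reject locations that still 
contain a backstep:
{noformat}
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

public final class MetadataLocationCheck {

  // Decode the URL-encoded location (as it appears in the rwstorage URI today)
  // and collapse "." / ".." segments, so the policy is evaluated against the
  // real target path instead of an obfuscated one.
  static String normalize(String encodedLocation) {
    String decoded = URLDecoder.decode(encodedLocation, StandardCharsets.UTF_8);
    return Paths.get(decoded).normalize().toString();
  }

  // Stricter alternative: refuse any location that still contains a backstep.
  static boolean containsBackstep(String location) {
    return location.contains("/../") || location.endsWith("/..");
  }

  public static void main(String[] args) {
    String loc = "%2Fdata%2Fuse-case-1%2F..%2Fuse-case-2%2Ffile.metadata.json";
    System.out.println(normalize(loc));
    // prints: /data/use-case-2/file.metadata.json
    System.out.println(containsBackstep(URLDecoder.decode(loc, StandardCharsets.UTF_8)));
    // prints: true
  }
}{noformat}
With the normalization variant, the second example above would be matched 
against /data/use-case-2/... rather than the /data/use-case-1/../... form, so 
it could no longer slip through a policy written for /data/use-case-1/*.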

 

With the above changes, a single new default Ranger policy could be used to 
globally enable the creation of Iceberg tables in any database and any table 
location, but NOT using any custom metadata location (in case the user does 
not want to globally turn storage-handler authorization off - via 
hive.security.authorization.tables.on.storagehandlers=false - because of the 
other handlers such as HBase or Kafka):
{noformat}
iceberg://*/*/default_location              or
iceberg://*/*/snapshot=default_location  {noformat}
Note that there is no extra "/" in front of "default_location"; the "/" in the 
first example is only the separator character.
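
For illustration, a table created without any metadata_location override would 
then be authorized against a URI roughly like (hypothetical rendering of the 
proposal above):
{noformat}
iceberg://default/trojanhorse/default_location{noformat}
which the policy above matches for any database and any table name.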

 

Also, with the above changes, we could allow users to configure authorization 
for custom/shared data locations via:
{noformat}
iceberg://mydatabase/*//some/shared/table/location/*              or
iceberg://mydatabase/*/snapshot=/some/shared/table/location/* {noformat}
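
For illustration, with the proposed format the malicious CREATE from the repro 
above would instead be authorized against a URI roughly like (metadata file 
name taken from the repro; the exact rendering depends on the separator 
chosen):
{noformat}
iceberg://default/trojanhorse//warehouse/tablespace/external/hive/icebergsecured/metadata/00001-f4c2a428-30ce-4afd-82ff-d46ecbf02244.metadata.json{noformat}
which neither the default_location-only policy nor a policy scoped to the 
user's own shared location would match, so the CREATE of default.trojanhorse 
would be denied.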
  


