[ https://issues.apache.org/jira/browse/HIVE-27056?focusedWorklogId=845144&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-845144 ]

ASF GitHub Bot logged work on HIVE-27056:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Feb/23 14:11
            Start Date: 13/Feb/23 14:11
    Worklog Time Spent: 10m 
      Work Description: ayushtkn commented on code in PR #4037:
URL: https://github.com/apache/hive/pull/4037#discussion_r1104501992


##########
shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java:
##########
@@ -1146,7 +1146,8 @@ public boolean runDistCp(List<Path> srcPaths, Path dst, Configuration conf) thro
 
       // HIVE-13704 states that we should use run() instead of execute() due to a hadoop known issue
       // added by HADOOP-10459
-      if (distcp.run(params.toArray(new String[0])) == 0) {
+      int rc = runDistCpInternal(distcp, params);
+      if (rc == 0) {
         return true;
       } else {
         return false;

Review Comment:
   I don't understand why you are introducing a variable here;
   why not just change:
   ```
   if (distcp.run(params.toArray(new String[0])) == 0) { 
   ```
   to
   ```
         if (runDistCpInternal(distcp, params) == 0) {
   ```
   or, if you want to refactor, then maybe:
   ```
         return runDistCpInternal(distcp, params) == 0;
   ```



##########
shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java:
##########
@@ -1211,6 +1212,35 @@ public boolean runDistCpWithSnapshots(String oldSnapshot, String newSnapshot, Li
     return false;
   }
 
+  protected int runDistCpInternal(DistCp distcp, List<String> params) {
+    ensureMapReduceQueue(distcp.getConf());
+    return distcp.run(params.toArray(new String[0]));
+  }
+
+  /**
+   * This method ensures if there is an explicit tez.queue.name set, the hadoop shim will submit jobs
+   * to the same yarn queue. This solves a security issue where e.g settings have the following values:
+   * tez.queue.name=sample
+   * hive.server2.tez.queue.access.check=true
+   * In this case, when a query submits Tez DAGs, the tez client layer checks whether the end user has access to
+   * the yarn queue 'sample' via YarnQueueHelper, but this is not respected in case of MR jobs that run
+   * even if the query execution engine is Tez. E.g. an EXPORT TABLE can submit DistCp MR jobs at some stages when
+   * certain criteria are met. We tend to restrict the setting of mapreduce.job.queuename in order to bypass this
+   * security flaw, and even the default queue is unexpected if we explicitly set tez.queue.name.
+   * Under the hood the desired behavior is to have DistCp jobs in the same yarn queue as other parts
+   * of the query. Most of the time, the user isn't aware that a query involves DistCp jobs, hence isn't aware
+   * of these details.
+   */
+  protected void ensureMapReduceQueue(Configuration conf) {
+    String queueName = conf.get(TezConfiguration.TEZ_QUEUE_NAME);
+    LOG.debug("Checking tez.queue.name {}", queueName);
+    if (queueName != null && queueName.length() > 0) {
+      LOG.info("Setting mapreduce.job.queuename (current: '{}') to become tez.queue.name: '{}'",
+          conf.get(MRJobConfig.QUEUE_NAME), queueName);
+      conf.set(MRJobConfig.QUEUE_NAME, queueName);
+    }

Review Comment:
   Say during replication, when External Tables are copied, they are simple DistCp jobs; should they use that queue only?
   Second, is this an incompatible change as well?



##########
shims/0.23/src/main/java/org/apache/hadoop/hive/shims/Hadoop23Shims.java:
##########
@@ -1211,6 +1212,35 @@ public boolean runDistCpWithSnapshots(String oldSnapshot, String newSnapshot, Li
     return false;
   }
 
+  protected int runDistCpInternal(DistCp distcp, List<String> params) {
+    ensureMapReduceQueue(distcp.getConf());
+    return distcp.run(params.toArray(new String[0]));
+  }
+
+  /**
+   * This method ensures if there is an explicit tez.queue.name set, the hadoop shim will submit jobs
+   * to the same yarn queue. This solves a security issue where e.g settings have the following values:
+   * tez.queue.name=sample
+   * hive.server2.tez.queue.access.check=true
+   * In this case, when a query submits Tez DAGs, the tez client layer checks whether the end user has access to
+   * the yarn queue 'sample' via YarnQueueHelper, but this is not respected in case of MR jobs that run
+   * even if the query execution engine is Tez. E.g. an EXPORT TABLE can submit DistCp MR jobs at some stages when
+   * certain criteria are met. We tend to restrict the setting of mapreduce.job.queuename in order to bypass this
+   * security flaw, and even the default queue is unexpected if we explicitly set tez.queue.name.
+   * Under the hood the desired behavior is to have DistCp jobs in the same yarn queue as other parts
+   * of the query. Most of the time, the user isn't aware that a query involves DistCp jobs, hence isn't aware
+   * of these details.
+   */
+  protected void ensureMapReduceQueue(Configuration conf) {
+    String queueName = conf.get(TezConfiguration.TEZ_QUEUE_NAME);
+    LOG.debug("Checking tez.queue.name {}", queueName);
+    if (queueName != null && queueName.length() > 0) {
+      LOG.info("Setting mapreduce.job.queuename (current: '{}') to become tez.queue.name: '{}'",
+          conf.get(MRJobConfig.QUEUE_NAME), queueName);
+      conf.set(MRJobConfig.QUEUE_NAME, queueName);
+    }

Review Comment:
   I have doubts here. DistCp itself is a MapReduce job; it will launch an MR job only, so why should it not use the MR queue?
   Second, Hive on MR might not be widely used, but it still exists, and we haven't mentioned anywhere that it won't work upstream.
   So, if someone isn't using Tez, will this create problems?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 845144)
    Time Spent: 50m  (was: 40m)

> Ensure that MR distcp goes to the same Yarn queue as other parts of the query
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-27056
>                 URL: https://issues.apache.org/jira/browse/HIVE-27056
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0-alpha-2
>
>         Attachments: image (1).png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Given the following plan for an EXPORT TABLE query:
> {code}
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | STAGE DEPENDENCIES:                                |
> |   Stage-2 is a root stage                          |
> |   Stage-3 depends on stages: Stage-2               |
> |   Stage-1 depends on stages: Stage-3               |
> |   Stage-6 depends on stages: Stage-1               |
> |   Stage-5 depends on stages: Stage-6               |
> |   Stage-7 depends on stages: Stage-5               |
> |                                                    |
> | STAGE PLANS:                                       |
> |   Stage: Stage-2                                   |
> |     Tez                                            |
> |       DagId: hive_20230124103803_b865e63c-4565-4b95-8664-8c983938697a:7 |
> |       DagName: hive_20230124103803_b865e63c-4565-4b95-8664-8c983938697a:7 |
> |       Vertices:                                    |
> |         Map 1                                      |
> |             Map Operator Tree:                     |
> |                 TableScan                          |
> |                   alias: distcptest                |
> |                   Statistics: Num rows: 2 Data size: 182658 Basic stats: COMPLETE Column stats: COMPLETE |
> |                   Select Operator                  |
> |                     expressions: id (type: int), txt (type: string) |
> |                     outputColumnNames: _col0, _col1 |
> |                     Statistics: Num rows: 2 Data size: 182658 Basic stats: COMPLETE Column stats: COMPLETE |
> |                     File Output Operator           |
> |                       compressed: false            |
> |                       Statistics: Num rows: 2 Data size: 182658 Basic stats: COMPLETE Column stats: COMPLETE |
> |                       table:                       |
> |                           input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat |
> |                           output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat |
> |                           serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde |
> |                           name: bigd35368.distcptest_ced9015c_852b_49a4_8c28_4d284d598523 |
> |             Execution mode: vectorized             |
> |                                                    |
> |   Stage: Stage-3                                   |
> |     Dependency Collection                          |
> |                                                    |
> |   Stage: Stage-1                                   |
> |     Move Operator                                  |
> |       tables:                                      |
> |           replace: false                           |
> |           table:                                   |
> |               input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat |
> |               output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat |
> |               serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde |
> |               name: bigd35368.distcptest_ced9015c_852b_49a4_8c28_4d284d598523 |
> |                                                    |
> |   Stage: Stage-6                                   |
> |     Set Properties                                 |
> |       table name: bigd35368.distcptest_ced9015c_852b_49a4_8c28_4d284d598523 |
> |       properties:                                  |
> |         transactional true                         |
> |                                                    |
> |   Stage: Stage-5                                   |
> |     Export Work                                    |
> |                                                    |
> |   Stage: Stage-7                                   |
> |     Drop Table                                     |
> |       table: bigd35368.distcptest_ced9015c_852b_49a4_8c28_4d284d598523 |
> |                                                    |
> +----------------------------------------------------+
> {code}
> Here, Stage-5 (Export Work) starts MR DistCp jobs.
> The problem and the solution are described in a code comment, pasted below:
> {code}
>   /**
>    * This method ensures if there is an explicit tez.queue.name set, the hadoop shim will submit jobs
>    * to the same yarn queue. This solves a security issue where e.g settings have the following values:
>    * tez.queue.name=sample
>    * hive.server2.tez.queue.access.check=true
>    * In this case, when a query submits Tez DAGs, the tez client layer checks whether the end user has access to
>    * the yarn queue 'sample' via YarnQueueHelper, but this is not respected in case of MR jobs that run
>    * even if the query execution engine is Tez. E.g. an EXPORT TABLE can submit DistCp MR jobs at some stages when
>    * certain criteria are met. We tend to restrict the setting of mapreduce.job.queuename in order to bypass this
>    * security flaw, and even the default queue is unexpected if we explicitly set tez.queue.name.
>    * Under the hood the desired behavior is to have DistCp jobs in the same yarn queue as other parts
>    * of the query. Most of the time, the user isn't aware that a query involves DistCp jobs, hence isn't aware
>    * of these details.
>    */
> {code}
> After the fix, the MR job went into the same yarn queue: 
> https://issues.apache.org/jira/secure/attachment/13055215/image%20%281%29.png
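The queue propagation described in the code comment above can be sketched in isolation. This is a hedged, self-contained illustration, not the actual Hadoop23Shims code: a plain java.util.Map stands in for Hadoop's Configuration object, and the key names mirror TezConfiguration.TEZ_QUEUE_NAME and MRJobConfig.QUEUE_NAME.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the ensureMapReduceQueue logic: if an explicit tez.queue.name
// is set, MR jobs (e.g. DistCp launched during EXPORT TABLE) are forced
// into the same YARN queue; otherwise the configuration is left untouched.
public class QueuePropagationSketch {

  static final String TEZ_QUEUE_NAME = "tez.queue.name";
  static final String MR_QUEUE_NAME = "mapreduce.job.queuename";

  static void ensureMapReduceQueue(Map<String, String> conf) {
    String queueName = conf.get(TEZ_QUEUE_NAME);
    if (queueName != null && !queueName.isEmpty()) {
      // Copy the Tez queue so the MR submission hits the queue that the
      // YarnQueueHelper access check already validated.
      conf.put(MR_QUEUE_NAME, queueName);
    }
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    conf.put(TEZ_QUEUE_NAME, "sample");
    ensureMapReduceQueue(conf);
    // The DistCp MR job now targets the 'sample' queue as well.
    System.out.println(conf.get(MR_QUEUE_NAME)); // prints: sample
  }
}
```

Note that when tez.queue.name is unset (e.g. a pure Hive-on-MR setup), the sketch leaves mapreduce.job.queuename alone, which is the behavior the second review comment asks about.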



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
