[ 
https://issues.apache.org/jira/browse/HIVE-29451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-29451:
--------------------------------
    Description: 
https://github.com/apache/hive/blob/98da62c93f198126c78d3352bf3ac6aeacefa53c/ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java#L662

As a result, the method below is executed repeatedly for every single partition, even though its logic has no way to distinguish between the partitions:
https://github.com/apache/hive/blob/98da62c93f198126c78d3352bf3ac6aeacefa53c/ql/src/java/org/apache/hadoop/hive/ql/plan/PlanUtils.java#L912-L930
{code}
  public static void configureJobConf(TableDesc tableDesc, JobConf jobConf) {
    try {
      HiveStorageHandler storageHandler = HiveUtils.getStorageHandler(jobConf,
          tableDesc.getProperties().getProperty(hive_metastoreConstants.META_TABLE_STORAGE));
      if (storageHandler != null) {
        storageHandler.configureJobConf(tableDesc, jobConf);
      }
      if (tableDesc.getJobSecrets() != null) {
        for (Map.Entry<String, String> entry : tableDesc.getJobSecrets().entrySet()) {
          String key = TableDesc.SECRET_PREFIX + TableDesc.SECRET_DELIMIT +
              tableDesc.getTableName() + TableDesc.SECRET_DELIMIT + entry.getKey();
          jobConf.getCredentials().addSecretKey(new Text(key), entry.getValue().getBytes());
        }
        tableDesc.getJobSecrets().clear();
      }
    } catch (HiveException e) {
      throw new RuntimeException(e);
    }
  }
{code}
Consider a job reading hundreds of partitions (possibly thousands, even though that is suboptimal for Hive). We might want to collect the distinct tables affected by the MapWork beforehand and run this logic once per TableDesc.

The worst case is single-partition reads, where we hopefully won't lose much with the new logic proposed above.
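One possible shape for the fix, as a minimal sketch: collect the distinct tables up front, then configure each one exactly once, regardless of how many partitions point at it. The class and method names below are illustrative stand-ins, not the real Hive TableDesc/JobConf wiring.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch, NOT the actual Hive classes: table names stand in for
// TableDesc instances, and configureJobConf only counts its invocations
// instead of touching a storage handler. The point is the dedup step.
public class ConfigureOncePerTable {
  static final AtomicInteger configureCalls = new AtomicInteger();

  static void configureJobConf(String table) {
    // In Hive this would be PlanUtils.configureJobConf(tableDesc, jobConf).
    configureCalls.incrementAndGet();
  }

  public static void main(String[] args) {
    // Five partitions, but only two distinct tables behind them.
    List<String> partitionTables = List.of("t1", "t1", "t2", "t1", "t2");

    // Collect the distinct tables first, preserving encounter order,
    // then run the table-level logic once per table.
    Set<String> distinctTables = new LinkedHashSet<>(partitionTables);
    for (String table : distinctTables) {
      configureJobConf(table);
    }
    System.out.println(configureCalls.get());
  }
}
```

With this shape the table-level work scales with the number of distinct TableDescs rather than the number of partitions; for a single-partition read the dedup set costs only one extra hash insert.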

> PlanUtils.configureJobConf is called with a table-level logic for every 
> single partition
> ----------------------------------------------------------------------------------------
>
>                 Key: HIVE-29451
>                 URL: https://issues.apache.org/jira/browse/HIVE-29451
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
