[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372234#comment-16372234
 ] 

Alexander Kolbasov commented on HIVE-18743:
-------------------------------------------

The code that checks for {{DO_NOT_UPDATE_STATS}}, as well as the property itself, 
was added as part of HIVE-10228. The property carries the following comment:

{code}
  // This string constant is used by AlterHandler to figure out that it should not attempt to
  // update stats. It is set by any client-side task which wishes to signal that no stats
  // update should take place, such as with replication.
  public static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";
{code}

The actual check is rather strange:

{code}
    if ((params != null) && params.containsKey(StatsSetupConst.DO_NOT_UPDATE_STATS)) {
      boolean doNotUpdateStats = Boolean.valueOf(params.get(StatsSetupConst.DO_NOT_UPDATE_STATS));
      params.remove(StatsSetupConst.DO_NOT_UPDATE_STATS);
      tbl.setParameters(params); // to make sure we remove this marker property
      if (doNotUpdateStats){
        return false;
      }
    }
{code}

So once the check has run, {{DO_NOT_UPDATE_STATS}} is removed from the table 
parameters regardless of its value, for reasons that are not clear.
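
To make the behaviour concrete, here is a minimal stand-alone sketch of the same check-and-remove pattern. It is not Hive code: the class {{MarkerCheckDemo}} and the method {{consumeDoNotUpdateStats}} are hypothetical names, and a plain {{Map}} stands in for the table parameters. It only illustrates that the marker is consumed by the first check, whatever its value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; a plain Map stands in for the table parameters.
class MarkerCheckDemo {
    static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

    /**
     * Returns true when the stats update should be skipped.
     * The marker is removed whenever it is present, even if its value is "false".
     */
    static boolean consumeDoNotUpdateStats(Map<String, String> params) {
        if (params == null || !params.containsKey(DO_NOT_UPDATE_STATS)) {
            return false;
        }
        // remove() returns the previous value; parseBoolean is true only for "true" (case-insensitive).
        return Boolean.parseBoolean(params.remove(DO_NOT_UPDATE_STATS));
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(DO_NOT_UPDATE_STATS, "true");
        // First call sees the marker and strips it.
        System.out.println(consumeDoNotUpdateStats(params));
        // Second call no longer finds the marker.
        System.out.println(consumeDoNotUpdateStats(params));
    }
}
```

In this sketch a second call on the same map no longer sees the marker, so the flag works as a one-shot signal rather than a persistent table property.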

[~ashutoshc] [~thejas] Can you comment on why the parameter is removed after the 
check, and why the check is performed only after the file system operations are 
complete? To what extent does remote replication depend on the existing behavior?

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-18743
>                 URL: https://issues.apache.org/jira/browse/HIVE-18743
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>    Affects Versions: 1.2.0, 1.1.0
>            Reporter: Alexander Behm
>            Assignee: Alexander Kolbasov
>            Priority: Major
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>                                              boolean madeDir, boolean forceRecompute,
>                                              EnvironmentContext environmentContext) throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // DO_NOT_UPDATE_STATS is checked inside the call below, after
>       // wh.getFileStatusesForUnpartitionedTable() has already been called
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute, environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}
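
For illustration only, the ordering problem described above can be sketched without any Hive classes. In the hypothetical {{StatsOrderingSketch}} below, a {{Supplier}} stands in for the expensive file listing and a plain {{Map}} for the table parameters; the parameter keys written at the end are purely illustrative. The only point is that the marker is consulted before the listing is triggered. Whether the marker should also be removed at that point is left open here, since that depends on the answers to the questions above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names are Hive APIs.
class StatsOrderingSketch {
    static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

    /**
     * Checks the marker first and only then invokes the (potentially slow) listing.
     * Assumes params is non-null. Returns true if stats were written.
     */
    static boolean updateStatsIfAllowed(Map<String, String> params,
                                        Supplier<long[]> listFileSizes) {
        // parseBoolean(null) is false, so a missing marker means "go ahead".
        if (Boolean.parseBoolean(params.get(DO_NOT_UPDATE_STATS))) {
            return false; // the listing is never invoked on this path
        }
        long[] sizes = listFileSizes.get(); // expensive step, e.g. listing files on S3
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        // Illustrative stat keys for this sketch.
        params.put("numFiles", String.valueOf(sizes.length));
        params.put("totalSize", String.valueOf(total));
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(DO_NOT_UPDATE_STATS, "true");
        boolean updated = updateStatsIfAllowed(params, () -> {
            throw new IllegalStateException("listing should not run when the marker is set");
        });
        System.out.println("stats updated: " + updated);
    }
}
```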



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
