Alexander Behm created HIVE-18743:
-------------------------------------

             Summary: CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.
                 Key: HIVE-18743
                 URL: https://issues.apache.org/jira/browse/HIVE-18743
             Project: Hive
          Issue Type: Improvement
          Components: Metastore
    Affects Versions: 1.1.0, 1.2.0
            Reporter: Alexander Behm


When hive.stats.autogather=true, the Metastore lists all files under the 
table directory to populate basic stats such as file counts and sizes. This 
file listing can be very expensive, particularly on filesystems like S3.

One way to avoid the listing is to set hive.stats.autogather=false.

*Here's the bug*
It is my understanding that the DO_NOT_UPDATE_STATS table property is intended 
to selectively prevent this stats collection. Unfortunately, this table 
property is checked *after* the expensive file listing operation, so 
DO_NOT_UPDATE_STATS does not seem to work as intended: the listing cost is 
still paid even when stats collection is skipped. See:

https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633

Relevant code snippet:
{code}
  public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
                                             boolean madeDir, boolean forceRecompute,
                                             EnvironmentContext environmentContext)
      throws MetaException {
    if (tbl.getPartitionKeysSize() == 0) {
      // Update stats only when unpartitioned
      FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
      // DO_NOT_UPDATE_STATS is checked inside the call below, after
      // wh.getFileStatusesForUnpartitionedTable() has already been called.
      return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute,
          environmentContext);
    } else {
      return false;
    }
  }
{code}
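
One possible direction, sketched here outside of Hive so it can be run on its own: check the property first and only list files if stats are actually going to be updated. This is not a patch against MetaStoreUtils. The class StatsGuardSketch, the Map standing in for the table parameters, and the Supplier standing in for wh.getFileStatusesForUnpartitionedTable() are all hypothetical stand-ins, and the stat keys written at the end are only illustrative.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class StatsGuardSketch {
    // Name of the table property discussed in this issue.
    static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

    /**
     * Hypothetical stand-in for updateTableStatsFast().
     * tableParams    stands in for the table's parameter map.
     * fileSizeLister stands in for the expensive file listing call.
     * Returns true only if stats were recomputed.
     */
    static boolean updateTableStatsFast(Map<String, String> tableParams,
                                        boolean partitioned,
                                        Supplier<List<Long>> fileSizeLister) {
        if (partitioned) {
            return false; // same as the original: only unpartitioned tables
        }
        // Check the property BEFORE touching the filesystem.
        // Boolean.parseBoolean(null) is false, so a missing property means "update".
        if (Boolean.parseBoolean(tableParams.get(DO_NOT_UPDATE_STATS))) {
            return false;
        }
        // Only now pay for the listing.
        List<Long> sizes = fileSizeLister.get();
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        tableParams.put("numFiles", String.valueOf(sizes.size()));
        tableParams.put("totalSize", String.valueOf(total));
        return true;
    }

    public static void main(String[] args) {
        AtomicInteger listings = new AtomicInteger();
        Supplier<List<Long>> lister = () -> {
            listings.incrementAndGet(); // count how often the "listing" runs
            return Arrays.asList(10L, 20L);
        };

        Map<String, String> skip = new HashMap<>();
        skip.put(DO_NOT_UPDATE_STATS, "true");
        boolean updated = updateTableStatsFast(skip, false, lister);
        System.out.println(updated + ", listings=" + listings.get()); // false, listings=0

        Map<String, String> plain = new HashMap<>();
        updated = updateTableStatsFast(plain, false, lister);
        System.out.println(updated + ", listings=" + listings.get()); // true, listings=1
    }
}
```

With the check moved ahead of the listing, a table created with DO_NOT_UPDATE_STATS=true never triggers the listing at all. Whether the real fix should also clear the property from the table parameters, as the existing inner method may do, is left open here.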




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
