[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440502#comment-17440502
 ] 

Jean-Yves STEPHAN commented on HIVE-18743:
------------------------------------------

Hello. We use Hive for a Spark project, and our Spark job hangs in a branch of 
the code controlled by the DO_NOT_POPULATE_QUICK_STATS property. I'd like to 
try switching off this flag. It is currently passed via an "EnvironmentContext". 
Can I control it through an environment variable, or through a HiveConf 
property (set in hive-site.xml)? 

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-18743
>                 URL: https://issues.apache.org/jira/browse/HIVE-18743
>             Project: Hive
>          Issue Type: Improvement
>          Components: Metastore
>    Affects Versions: 1.1.0, 1.2.0, 2.0.2, 3.0.0
>            Reporter: Alexander Behm
>            Assignee: Alex Kolbasov
>            Priority: Major
>             Fix For: 3.1.0, 2.4.0, 3.0.0
>
>         Attachments: HIVE-18743.01-branch-2.patch, HIVE-18743.01.patch
>
>
> When hive.stats.autogather=true, the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive, particularly on filesystems like S3.
> One way to address this issue is to set hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>       boolean madeDir, boolean forceRecompute,
>       EnvironmentContext environmentContext) throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // DO_NOT_UPDATE_STATS is checked inside this call, after
>       // wh.getFileStatusesForUnpartitionedTable() has already been called.
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute,
>           environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}
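
For illustration only, the ordering problem described in the issue can be sketched 
without any Hive dependencies. This is a hypothetical stand-in, not the actual 
patch: the class StatsGuardSketch, the plain Map of table parameters, and the 
Supplier used in place of wh.getFileStatusesForUnpartitionedTable() are all 
assumptions made for the example. The only point it shows is that the opt-out 
property is consulted before the expensive listing runs.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: plain Java stand-ins replace the Hive Table and Warehouse types.
public class StatsGuardSketch {
    // Property name as given in the issue description.
    public static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

    /**
     * Returns true if basic stats were written into tableParams.
     * fileSizeLister stands in for the expensive file listing call.
     */
    public static boolean updateTableStatsFast(Map<String, String> tableParams,
                                               int partitionKeyCount,
                                               Supplier<List<Long>> fileSizeLister) {
        if (partitionKeyCount != 0) {
            // Only unpartitioned tables are handled, as in the quoted snippet.
            return false;
        }
        // Check the opt-out BEFORE touching the filesystem.
        // Boolean.parseBoolean(null) is false, so a missing property means "update".
        if (Boolean.parseBoolean(tableParams.get(DO_NOT_UPDATE_STATS))) {
            return false;
        }
        // The expensive listing happens only when stats are actually wanted.
        List<Long> sizes = fileSizeLister.get();
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        // Stat key names are used here for illustration.
        tableParams.put("numFiles", String.valueOf(sizes.size()));
        tableParams.put("totalSize", String.valueOf(total));
        return true;
    }
}
```

With this ordering, a table whose parameters contain DO_NOT_UPDATE_STATS=true 
never triggers the listing; all other unpartitioned tables behave as before.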



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
