[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2021-11-08 Thread Jean-Yves STEPHAN (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440502#comment-17440502
 ] 

Jean-Yves STEPHAN commented on HIVE-18743:
--

Hello. We use Hive for a Spark project, and our Spark job hangs in a branch of
the code controlled by the DO_NOT_POPULATE_QUICK_STATS property. I'd like to
try switching this flag off; it is currently passed via an "EnvironmentContext".
Is this something I can control via an environment variable, or via a HiveConf
setting (to set in hive-site.xml)?

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
> Issue Type: Improvement
> Components: Metastore
> Affects Versions: 1.1.0, 1.2.0, 2.0.2, 3.0.0
> Reporter: Alexander Behm
> Assignee: Alex Kolbasov
> Priority: Major
> Fix For: 3.1.0, 2.4.0, 3.0.0
>
> Attachments: HIVE-18743.01-branch-2.patch, HIVE-18743.01.patch
>
>
> When hive.stats.autogather=true, the Metastore lists all files under the
> table directory to populate basic stats like file counts and sizes. This file
> listing operation can be very expensive, particularly on filesystems like S3.
> One way to address this issue is to set hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is
> intended to selectively prevent this stats collection. Unfortunately, this
> table property is checked *after* the expensive file listing operation, so
> DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>       boolean madeDir, boolean forceRecompute, EnvironmentContext environmentContext)
>       throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // DO_NOT_UPDATE_STATS is checked in here, after
>       // wh.getFileStatusesForUnpartitionedTable() has already been called
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute,
>           environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}
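
A minimal sketch of the ordering the description asks for: consult the flag first, and list files only when stats will actually be written. Everything here is a stand-in made up for illustration (the class name, the `listFileSizes()` helper, the plain parameter map, and the call counter); it is not the attached patch and it does not use Hive's classes. The `numFiles` and `totalSize` keys are used only as example stat names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in types only; none of this is Hive code. */
final class StatsCheckOrderSketch {
    // Property name taken from the issue text; how Hive stores it is not modelled here.
    static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

    /** Counts how often the (pretend) expensive listing ran. */
    static int listingCalls = 0;

    /** Stand-in for wh.getFileStatusesForUnpartitionedTable(db, tbl). */
    static long[] listFileSizes() {
        listingCalls++;
        return new long[] {128L, 256L};
    }

    /** Returns true if stats were written; skips the listing when the flag is set. */
    static boolean updateTableStatsFast(Map<String, String> tableParams, int partitionKeyCount) {
        if (partitionKeyCount != 0) {
            return false; // only unpartitioned tables are handled, as in the snippet above
        }
        // Cheap check first: Boolean.parseBoolean(null) is false, so a missing flag falls through.
        if (Boolean.parseBoolean(tableParams.get(DO_NOT_UPDATE_STATS))) {
            return false; // no file listing happens on this path
        }
        long[] sizes = listFileSizes();
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        tableParams.put("numFiles", Integer.toString(sizes.length));
        tableParams.put("totalSize", Long.toString(total));
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> flagged = new HashMap<>();
        flagged.put(DO_NOT_UPDATE_STATS, "true");
        System.out.println("flagged table updated: " + updateTableStatsFast(flagged, 0));
        System.out.println("listing calls so far: " + listingCalls);

        Map<String, String> plain = new HashMap<>();
        System.out.println("plain table updated: " + updateTableStatsFast(plain, 0));
        System.out.println("listing calls so far: " + listingCalls);
    }
}
```

With the check placed before the listing, a table carrying the flag never pays for the file listing; in the snippet quoted above the listing has already run by the time the flag is read.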



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-23 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448509#comment-16448509
 ] 

Vihang Karajgaonkar commented on HIVE-18743:


Merged to branch-3 as well, since marking this fixed in 2.4.0 and 3.1.0 but
not in 3.0.0 doesn't make sense.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-23 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448495#comment-16448495
 ] 

Vihang Karajgaonkar commented on HIVE-18743:


Patch merged to branch-2 as well. Resolving this.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-22 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447291#comment-16447291
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~vihangk1] I just got the results from the branch-2 patch testing; the test
failures seem to be unrelated.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-22 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447146#comment-16447146
 ] 

Hive QA commented on HIVE-18743:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12920122/HIVE-18743.01-branch-2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 10673 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestAccumuloCliDriver.testCliDriver[accumulo_queries] (batchId=227)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[avro_tableproperty_optimize] (batchId=22)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[explaindenpendencydiffengs] (batchId=38)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=142)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=139)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[table_nonprintable] (batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[join_acid_non_acid] (batchId=158)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[union_fast_stats] (batchId=153)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=144)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vectorized_parquet_types] (batchId=155)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[merge_negative_5] (batchId=88)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[explaindenpendencydiffengs] (batchId=115)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorization_input_format_excludes] (batchId=117)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vectorized_ptf] (batchId=125)
org.apache.hive.hcatalog.api.TestHCatClient.testTransportFailure (batchId=176)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/10402/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/10402/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-10402/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12920122 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-21 Thread Vihang Karajgaonkar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446957#comment-16446957
 ] 

Vihang Karajgaonkar commented on HIVE-18743:


Patch merged to the master branch. Thanks for your contribution, [~akolb]. Is
the branch-2 patch ready to be merged as well?



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-20 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446552#comment-16446552
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

Attached branch-2 patch as well.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-20 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446536#comment-16446536
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

While porting the fix to branch-2, I noticed that {{alterTempTable()}} there
updates stats, while in branch-3 it doesn't. Does anyone know why this is the
case?



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-20 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446488#comment-16446488
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~kgyrtkirk] Would you be able to commit the fix for me?



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-20 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446487#comment-16446487
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~kgyrtkirk] I keep the history of reviews in ReviewBoard, but I will keep
patches in Jira as well in the future if this is useful for others.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-20 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445471#comment-16445471
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

[~akolb] There was an ACID-related ticket which landed just before I had seen
the end of that ticket; since it added a lot of ifs everywhere, I have to
re-interpret a lot of things...
So we are better off having at least this fix.
+1; I'm checking whether there are any related test failures.
Note: why are you removing the previous versions of your patch? Please don't do
that. I know it might look tidier, but the comments will lose their context,
and by re-using patch#01 you may confuse a reviewer who has already seen your
ticket and remembers that it had one patch.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-19 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444691#comment-16444691
 ] 

Hive QA commented on HIVE-18743:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12919833/HIVE-18743.01.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 25 failed/errored test(s), 14280 tests executed
*Failed tests:*
{noformat}
TestMinimrCliDriver - did not produce a TEST-*.xml file (likely timed out) (batchId=93)
[infer_bucket_sort_num_buckets.q,infer_bucket_sort_reducers_power_two.q,parallel_orderby.q,bucket_num_reducers_acid.q,infer_bucket_sort_map_operators.q,infer_bucket_sort_merge.q,root_dir_external_table.q,infer_bucket_sort_dyn_part.q,udf_using.q,bucket_num_reducers_acid2.q]
TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out) (batchId=217)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[llap_smb] (batchId=92)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[parquet_vectorization_0] (batchId=17)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[results_cache_invalidation2] (batchId=39)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[tez_join_hash] (batchId=54)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[mm_all] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[default_constraint] (batchId=163)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[results_cache_invalidation2] (batchId=163)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=163)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[tez_smb_1] (batchId=171)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_5] (batchId=105)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[cluster_tasklog_retrieval] (batchId=98)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[mapreduce_stack_trace] (batchId=98)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[mapreduce_stack_trace_turnoff] (batchId=98)
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testCliDriver[minimr_broken_pipe] (batchId=98)
org.apache.hadoop.hive.cli.TestTezPerfCliDriver.testCliDriver[query64] (batchId=253)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=225)
org.apache.hadoop.hive.ql.TestAcidOnTez.testAcidInsertWithRemoveUnion (batchId=228)
org.apache.hadoop.hive.ql.TestAcidOnTez.testCtasTezUnion (batchId=228)
org.apache.hadoop.hive.ql.TestAcidOnTez.testNonStandardConversion01 (batchId=228)
org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1 (batchId=232)
org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgress (batchId=235)
org.apache.hive.beeline.TestBeeLineWithArgs.testQueryProgressParallel (batchId=235)
org.apache.hive.minikdc.TestJdbcWithMiniKdcCookie.testCookieNegative (batchId=254)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/10349/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/10349/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-10349/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 25 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12919833 - PreCommit-HIVE-Build


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-19 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16444587#comment-16444587
 ] 

Hive QA commented on HIVE-18743:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m  0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 47s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 23s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  2s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 46s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 22s{color} | {color:red} standalone-metastore: The patch generated 5 new + 522 unchanged - 12 fixed = 527 total (was 534) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  5s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 14s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 14m 58s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-10349/dev-support/hive-personality.sh |
| git revision | master / 046bc64 |
| Default Java | 1.8.0_111 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-10349/yetus/diff-checkstyle-standalone-metastore.txt |
| modules | C: standalone-metastore U: standalone-metastore |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-10349/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-04-12 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436583#comment-16436583
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~kgyrtkirk] What are your current plans for this? Do you plan to commit your 
changes in any of the releases?

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 1.2.0, 1.1.0, 2.0.2, 3.0.0
>Reporter: Alexander Behm
>Assignee: Alexander Kolbasov
>Priority: Major
> Attachments: HIVE-18743.06.patch, HIVE-18743.07.patch
>
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>       boolean madeDir, boolean forceRecompute,
>       EnvironmentContext environmentContext) throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // DO_NOT_UPDATE_STATS is checked in here, after
>       // wh.getFileStatusesForUnpartitionedTable() has already been called
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute,
>           environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}
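
The reordering that the description asks for can be sketched roughly as follows. This is only a minimal, stand-alone illustration and not the attached patch: the class name, the Map-based table parameters, the Supplier standing in for wh.getFileStatusesForUnpartitionedTable(), and the stats keys written at the end are all assumptions made so that the example compiles without Hive on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Stand-alone sketch; none of these types are Hive's own classes. */
final class StatsGuardSketch {

  // Assumed key, named after the table property discussed in this issue.
  static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

  /**
   * Returns true if stats were recomputed. The supplier is a hypothetical
   * stand-in for the expensive file listing and is only invoked once the
   * cheap checks have passed.
   */
  static boolean updateTableStatsFast(Map<String, String> tableParams,
                                      int partitionKeyCount,
                                      Supplier<long[]> listFileSizes) {
    if (partitionKeyCount != 0) {
      return false; // stats are only updated for unpartitioned tables
    }
    if (Boolean.parseBoolean(tableParams.get(DO_NOT_UPDATE_STATS))) {
      return false; // bail out before touching the file system
    }
    long[] sizes = listFileSizes.get(); // the expensive call now comes last
    long total = 0;
    for (long size : sizes) {
      total += size;
    }
    // Illustrative stats keys; the real code decides what is stored.
    tableParams.put("numFiles", Integer.toString(sizes.length));
    tableParams.put("totalSize", Long.toString(total));
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put(DO_NOT_UPDATE_STATS, "true");
    // The supplier throws if it is ever called, so reaching the println
    // shows that the listing was skipped.
    boolean updated = updateTableStatsFast(params, 0, () -> {
      throw new IllegalStateException("file listing should have been skipped");
    });
    System.out.println("stats updated: " + updated);
  }
}
```

The only point of the sketch is the ordering: both cheap checks run before the listing is requested, so a table that opts out never pays for the S3 listing.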



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-06 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387694#comment-16387694
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

I have also started to suspect that this is not easy to check at all. I think 
this fix would be good for hive-2; could you submit your patch for branch-2? 
For hive-3, I think this issue should be re-checked on master after HIVE-17478 
goes in. Fixing this is not the goal of that change, so if it is still broken 
afterwards, I think it will be more straightforward to fix it there.


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-05 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386406#comment-16386406
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~kgyrtkirk] This means that this fix doesn't make sense in 3.0, since you are 
removing the code altogether. Do you plan to port your change to hive-2 as well?

As for ptest: the point of the patch is to ensure that we do not access the 
file system when we don't need to. It doesn't change any externally visible 
behavior, so we can't really test it with ptest.


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385693#comment-16385693
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

[~akolb] If the stats collection is removed from the metastore, that also means 
that the code you are testing will be gone, because it will no longer 
happen there...
I think the following command sequence could probably make this testable:
create table; insert; desc the table; remove files from the table datadir with 
dfs commands; alter table; desc table. The stats should be the same.


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385415#comment-16385415
 ] 

Hive QA commented on HIVE-18743:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12912949/HIVE-18743.07.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 13062 tests 
executed
*Failed tests:*
{noformat}
TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=93)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385397#comment-16385397
 ] 

Hive QA commented on HIVE-18743:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 18s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 45s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 19s{color} | {color:red} standalone-metastore: The patch generated 5 new + 505 unchanged - 10 fixed = 510 total (was 515) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 46s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 14s{color} | {color:red} The patch generated 49 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 11m 54s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-9478/dev-support/hive-personality.sh |
| git revision | master / 05d4719 |
| Default Java | 1.8.0_111 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-9478/yetus/diff-checkstyle-standalone-metastore.txt |
| asflicense | http://104.198.109.242/logs//PreCommit-HIVE-Build-9478/yetus/patch-asflicense-problems.txt |
| modules | C: standalone-metastore U: standalone-metastore |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-9478/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385389#comment-16385389
 ] 

Hive QA commented on HIVE-18743:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12912949/HIVE-18743.07.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 21 failed/errored test(s), 13061 tests 
executed
*Failed tests:*
{noformat}
TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=93)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385374#comment-16385374
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~kgyrtkirk] What is the value of a high-value qtest? The unit test allows me 
to control the execution environment of the function exactly, and it gives me 
an opportunity to verify whether warehouse ops are called or not. What extra 
value would we get from a qtest that we don't get from the unit test?
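
To illustrate the kind of check a unit test makes possible here, the sketch below uses a hand-written test double that counts how often the file listing is requested. It is a hypothetical, self-contained example rather than the test from the patch: FileLister, CountingLister and totalSizeOrMinusOne are made-up names, and the path is a placeholder.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-alone sketch of a call-counting test double; not Hive code. */
final class WarehouseCallCountSketch {

  /** Hypothetical stand-in for the file-listing part of the warehouse. */
  interface FileLister {
    long[] listFileSizes(String tablePath);
  }

  /** Test double that records how many times a listing was requested. */
  static final class CountingLister implements FileLister {
    final AtomicInteger calls = new AtomicInteger();

    @Override
    public long[] listFileSizes(String tablePath) {
      calls.incrementAndGet();
      return new long[] {128L, 256L};
    }
  }

  /** Minimal function under test: lists files only when stats are wanted. */
  static long totalSizeOrMinusOne(boolean doNotUpdateStats, String tablePath,
                                  FileLister lister) {
    if (doNotUpdateStats) {
      return -1L; // no file system access at all
    }
    long total = 0;
    for (long size : lister.listFileSizes(tablePath)) {
      total += size;
    }
    return total;
  }

  public static void main(String[] args) {
    CountingLister lister = new CountingLister();
    totalSizeOrMinusOne(true, "s3a://bucket/tbl", lister);
    System.out.println("listings with flag set: " + lister.calls.get());
    totalSizeOrMinusOne(false, "s3a://bucket/tbl", lister);
    System.out.println("listings with flag unset: " + lister.calls.get());
  }
}
```

A query-level test can only compare the stats that end up on the table, whereas a double like this one can assert directly that no listing call was made.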


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385370#comment-16385370
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~kgyrtkirk] So the assumption here is that the value can be not just a 
"true"/"false" string but an actual JSON object, in which case it is parsed and 
{{stats.basicStats = true}} just overwrites one property?


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385340#comment-16385340
 ] 

Hive QA commented on HIVE-18743:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 36s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 19s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 46s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 20s{color} | {color:red} standalone-metastore: The patch generated 5 new + 505 unchanged - 10 fixed = 510 total (was 515) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 13s{color} | {color:red} The patch generated 49 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 11m 59s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-9477/dev-support/hive-personality.sh |
| git revision | master / 05d4719 |
| Default Java | 1.8.0_111 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-9477/yetus/diff-checkstyle-standalone-metastore.txt |
| asflicense | http://104.198.109.242/logs//PreCommit-HIVE-Build-9477/yetus/patch-asflicense-problems.txt |
| modules | C: standalone-metastore U: standalone-metastore |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-9477/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385296#comment-16385296
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

I don't think so... you left out the other parts of that code: 
https://github.com/apache/hive/blob/05d4719eefc56676a3e0e8f706e1c5e5e1f6b345/standalone-metastore/src/main/java/org/apache/hadoop/hive/common/StatsSetupConst.java#L232

[~akolb] Could you please add a high-level qtest? The test case from 
testmetastore will also be removed...


> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
> Issue Type: Improvement
> Components: Metastore
> Affects Versions: 1.2.0, 1.1.0, 2.0.2, 3.0.0
> Reporter: Alexander Behm
> Assignee: Alexander Kolbasov
> Priority: Major
> Attachments: HIVE-18743.06.patch, HIVE-18743.07.patch
>
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>       boolean madeDir, boolean forceRecompute,
>       EnvironmentContext environmentContext) throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // <--- DO_NOT_UPDATE_STATS is checked in here, after
>       // wh.getFileStatusesForUnpartitionedTable() has already been called
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute,
>           environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}
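
To illustrate the ordering problem described in the quoted report, here is a minimal, self-contained sketch in which the DO_NOT_UPDATE_STATS check runs before any file listing. It is not the actual patch: the class `StatsOrderingSketch`, the method `updateStatsFast`, and the `Supplier`-based listing are hypothetical stand-ins for the real metastore types, and the table properties are modelled as a plain map. Only the property name comes from the report.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names are Hive's real API.
class StatsOrderingSketch {
  static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

  /**
   * Returns true if stats were written into tableParams. The listing is passed
   * as a Supplier so the potentially expensive call happens only after the
   * opt-out flag has been checked.
   */
  static boolean updateStatsFast(Map<String, String> tableParams,
                                 Supplier<List<Long>> fileSizeLister) {
    // Check the opt-out flag first, before touching the filesystem.
    if (Boolean.parseBoolean(tableParams.get(DO_NOT_UPDATE_STATS))) {
      return false;
    }
    // Only now pay for the listing (slow on S3-like stores).
    List<Long> sizes = fileSizeLister.get();
    long total = 0;
    for (long size : sizes) {
      total += size;
    }
    // Illustrative stat keys; assumed here for the sketch.
    tableParams.put("numFiles", Integer.toString(sizes.size()));
    tableParams.put("totalSize", Long.toString(total));
    return true;
  }
}
```

Passing the listing lazily is one way to make the ordering explicit; an early return placed before the listing call in the existing method would have the same effect.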



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385293#comment-16385293
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

I noticed a bit of odd code:
{code:java}
public static void setBasicStatsState(Map<String, String> params, String setting) {
  ...
  ColumnStatsAccurate stats = parseStatsAcc(params.get(COLUMN_STATS_ACCURATE));
  stats.basicStats = true;
}
{code}
So it parses the value of {{COLUMN_STATS_ACCURATE}} but then always ignores it 
and sets {{stats.basicStats}} to true anyway. Is this intentional? Can this be 
removed?
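
For context, a parse-then-set pattern like this is not necessarily redundant. If the parsed object carries other fields and is serialized back afterwards, parsing first preserves them. The sketch below shows that general idea only. Whether it matches the elided part of the snippet is an assumption, and the `basic=...;cols=...` encoding, the class `StatsStateSketch`, and its methods are hypothetical, not Hive's real format or code.

```java
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch; the encoding and names are NOT Hive's real ones.
class StatsStateSketch {
  static final String COLUMN_STATS_ACCURATE = "COLUMN_STATS_ACCURATE";

  boolean basicStats;
  final TreeSet<String> columns = new TreeSet<>();

  // Parses "basic=<bool>;cols=<a,b,...>"; null or empty input yields defaults.
  static StatsStateSketch parse(String value) {
    StatsStateSketch state = new StatsStateSketch();
    if (value == null || value.isEmpty()) {
      return state;
    }
    for (String part : value.split(";")) {
      String[] kv = part.split("=", 2);
      if (kv.length < 2) {
        continue;
      }
      if (kv[0].equals("basic")) {
        state.basicStats = Boolean.parseBoolean(kv[1]);
      } else if (kv[0].equals("cols") && !kv[1].isEmpty()) {
        for (String column : kv[1].split(",")) {
          state.columns.add(column);
        }
      }
    }
    return state;
  }

  String serialize() {
    return "basic=" + basicStats + ";cols=" + String.join(",", columns);
  }

  // Parse, flip only the basic flag, and write the whole state back.
  static void markBasicStatsAccurate(Map<String, String> params) {
    StatsStateSketch state = parse(params.get(COLUMN_STATS_ACCURATE));
    state.basicStats = true;
    params.put(COLUMN_STATS_ACCURATE, state.serialize());
  }
}
```

In this sketch, dropping the parse step and writing a fresh object would silently discard the per-column entries.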



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385090#comment-16385090
 ] 

Hive QA commented on HIVE-18743:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12912918/HIVE-18743.05.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 20 failed/errored test(s), 13062 tests 
executed
*Failed tests:*
{noformat}
TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=93)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385060#comment-16385060
 ] 

Hive QA commented on HIVE-18743:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m  7s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 37s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 46s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 38s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 19s{color} | {color:red} standalone-metastore: The patch generated 4 new + 512 unchanged - 3 fixed = 516 total (was 515) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 42s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 13s{color} | {color:red} The patch generated 49 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 11m 45s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-9475/dev-support/hive-personality.sh |
| git revision | master / 05d4719 |
| Default Java | 1.8.0_111 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-9475/yetus/diff-checkstyle-standalone-metastore.txt |
| asflicense | http://104.198.109.242/logs//PreCommit-HIVE-Build-9475/yetus/patch-asflicense-problems.txt |
| modules | C: standalone-metastore U: standalone-metastore |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-9475/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-04 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385027#comment-16385027
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~kgyrtkirk] Added unit tests.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-02 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384540#comment-16384540
 ] 

Hive QA commented on HIVE-18743:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12912729/HIVE-18743.04.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 29 failed/errored test(s), 13430 tests 
executed
*Failed tests:*
{noformat}
TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=94)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-02 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384521#comment-16384521
 ] 

Hive QA commented on HIVE-18743:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 26s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 38s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 47s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 21s{color} | {color:green} standalone-metastore: The patch generated 0 new + 484 unchanged - 3 fixed = 484 total (was 487) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 13s{color} | {color:red} The patch generated 49 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 12m  8s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-9452/dev-support/hive-personality.sh |
| git revision | master / 1a3090f |
| Default Java | 1.8.0_111 |
| asflicense | http://104.198.109.242/logs//PreCommit-HIVE-Build-9452/yetus/patch-asflicense-problems.txt |
| modules | C: standalone-metastore U: standalone-metastore |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-9452/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-02 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384416#comment-16384416
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~kgyrtkirk] If you are fixing HIVE-17478, is there any value in fixing this, 
or should we just wait for HIVE-17478? Do you plan to do the same for hive-2 
as well? If not, should this be a hive-2-only fix? What do you think?



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-03-02 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383467#comment-16383467
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

[~akolb] Could you please write a test for this? I've experimented with 
HIVE-17478, and I think this whole file scanner logic will be gone from the 
metastore soon...



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-26 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376789#comment-16376789
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

[~akolb]: I will try to get HIVE-17478 in before 3.0, because the problem of 
this logic living on the metastore side just keeps coming back from multiple 
directions (S3, ACID, stats collection).
I think the most important thing here would be to add a good test case for 
this, so that we don't reintroduce this problem...
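
One possible shape for such a regression check is to hand the stats-update code a file lister that counts its own invocations, then assert that the count stays at zero when the flag is set. The helper below is a rough sketch under that assumption. `ListingRegressionSketch`, `countListings`, and the `BiFunction` signature standing in for the code under test are all hypothetical and not part of Hive's test suite.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiFunction;
import java.util.function.Supplier;

// Hypothetical regression-check helper; not part of Hive's test suite.
class ListingRegressionSketch {
  /**
   * Runs statsUpdater with a lister that counts its own invocations and
   * returns how many times a file listing was requested.
   */
  static int countListings(
      Map<String, String> tableParams,
      BiFunction<Map<String, String>, Supplier<List<String>>, Boolean> statsUpdater) {
    AtomicInteger calls = new AtomicInteger();
    Supplier<List<String>> countingLister = () -> {
      calls.incrementAndGet();          // record every listing request
      return Collections.emptyList();   // no real filesystem access
    };
    statsUpdater.apply(tableParams, countingLister);
    return calls.get();
  }
}
```

A test built on this would set DO_NOT_UPDATE_STATS in the parameters and fail if `countListings` returns anything other than 0.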




[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-22 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372642#comment-16372642
 ] 

Hive QA commented on HIVE-18743:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12911489/HIVE-18743.03.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 36 failed/errored test(s), 13011 tests 
executed
*Failed tests:*
{noformat}
TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=93)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-22 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372604#comment-16372604
 ] 

Hive QA commented on HIVE-18743:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 46s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 20s{color} | {color:green} standalone-metastore: The patch generated 0 new + 484 unchanged - 3 fixed = 484 total (was 487) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 13s{color} | {color:red} The patch generated 49 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 11m 53s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-9307/dev-support/hive-personality.sh |
| git revision | master / ec2378f |
| Default Java | 1.8.0_111 |
| asflicense | http://104.198.109.242/logs//PreCommit-HIVE-Build-9307/yetus/patch-asflicense-problems.txt |
| modules | C: standalone-metastore U: standalone-metastore |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-9307/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 1.2.0, 1.1.0, 2.0.2, 3.0.0
>Reporter: Alexander Behm
>Assignee: Alexander Kolbasov
>Priority: Major
> Attachments: HIVE-18743.01.patch, HIVE-18743.02.patch, 
> HIVE-18743.03.patch
>
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, 
> Warehouse wh,
>  boolean madeDir, boolean 
> forceRecompute, EnvironmentContext environmentContext) throws 

[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372504#comment-16372504
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

I think it makes sense to separate these - removal looks to be a more involved 
problem. I don't have enough understanding of potential consequences.

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 1.2.0, 1.1.0, 2.0.2, 3.0.0
>Reporter: Alexander Behm
>Assignee: Alexander Kolbasov
>Priority: Major
> Attachments: HIVE-18743.01.patch, HIVE-18743.02.patch, 
> HIVE-18743.03.patch
>
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>       boolean madeDir, boolean forceRecompute,
>       EnvironmentContext environmentContext) throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // DO_NOT_UPDATE_STATS is checked in here, after
>       // wh.getFileStatusesForUnpartitionedTable() has already been called
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute,
>           environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}
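
The ordering problem described in the quoted report can be sketched in isolation. The following is only an illustration under stated assumptions, not the attached patch: the class name, the Map standing in for table parameters, the Supplier standing in for the Warehouse file listing, and the "numFiles"/"totalSize" keys are all stand-ins chosen for this example. The point is just that the cheap flag check runs before the expensive listing.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Stand-in sketch; none of these types are Hive's actual classes.
class StatsGuardSketch {
  static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

  /**
   * Returns true if basic stats were written into tableParams.
   * The listing (fileSizes.get()) is only invoked after the flag check passes.
   */
  static boolean updateTableStatsFast(Map<String, String> tableParams,
                                      boolean partitioned,
                                      Supplier<long[]> fileSizes) {
    if (partitioned) {
      return false; // mirrors the snippet: only unpartitioned tables are handled
    }
    if (Boolean.parseBoolean(tableParams.get(DO_NOT_UPDATE_STATS))) {
      return false; // cheap check first, so no file listing is triggered
    }
    long[] sizes = fileSizes.get(); // the expensive call, e.g. a listing on S3
    long total = 0;
    for (long size : sizes) {
      total += size;
    }
    // Illustrative keys for the file count and total size.
    tableParams.put("numFiles", Long.toString(sizes.length));
    tableParams.put("totalSize", Long.toString(total));
    return true;
  }

  public static void main(String[] args) {
    int[] listings = {0}; // counts how often the "expensive" listing runs
    Supplier<long[]> listing = () -> {
      listings[0]++;
      return new long[] {10L, 32L};
    };

    Map<String, String> flagged = new HashMap<>();
    flagged.put(DO_NOT_UPDATE_STATS, "true");
    System.out.println(updateTableStatsFast(flagged, false, listing));
    System.out.println("listings so far: " + listings[0]);

    Map<String, String> plain = new HashMap<>();
    System.out.println(updateTableStatsFast(plain, false, listing));
    System.out.println("listings so far: " + listings[0] + ", params: " + plain);
  }
}
```

With the flag set, the supplier is never called, which is the behaviour the report expected from DO_NOT_UPDATE_STATS.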



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372490#comment-16372490
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

If you think that removal could also be an option, you may use HIVE-17478 to 
experiment with that path as well :)



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372480#comment-16372480
 ] 

Hive QA commented on HIVE-18743:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12911489/HIVE-18743.03.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 38 failed/errored test(s), 12981 tests 
executed
*Failed tests:*
{noformat}
TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=170)

[vector_windowing_expressions.q,tez_union_group_by.q,vector_like_2.q,llap_acid.q,sqlmerge.q,tez_dynpart_hashjoin_1.q,schema_evol_orc_acid_part_update_llap_io.q,vector_windowing_gby.q,vectorized_timestamp.q,cbo_subq_exists.q,lateral_view.q,schema_evol_orc_vec_table_llap_io.q,optimize_nullscan.q,vectorization_decimal_date.q,schema_evol_orc_nonvec_table_llap_io.q,udaf_all_keyword.q,tez_self_join.q,vector_partitioned_date_time.q,acid_vectorization_original.q,tez_fsstat.q,stats11.q,vector_mapjoin_reduce.q,join_acid_non_acid.q,empty_join.q,vector_groupby_grouping_window.q,auto_join21.q,tez_input_counters.q,schema_evol_orc_nonvec_part_all_complex_llap_io.q,orc_ppd_timestamp.q,vector_decimal_1.q]
TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=93)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372482#comment-16372482
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

There was probably a time when this was more relevant... but it seems like 
this thing causes more problems than it fixes, and it just keeps coming back :)
So it might be easier to address the original problem differently... if there 
is any. I think the original intention was that something wanted to skip the 
stats update... but afaik Hive currently also sets this flag explicitly to 
prevent the metastore from interfering... so I guess that leaves mostly 
unintended code paths ending up triggering this feature :D



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372477#comment-16372477
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

[~kgyrtkirk] I can't find anyone using {{NUM_FILES}}, but there are some 
consumers of {{TOTAL_SIZE}}. I don't know whether these can be removed without 
breaking something.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372472#comment-16372472
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

I don't know who is using these and what can break if this is removed. There 
was some purpose in putting this thing in, I guess.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372465#comment-16372465
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

Is there any reason you decided not to remove this thing?



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372446#comment-16372446
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

The partition versions, {{updatePartitionStatsFast()}}, do not have this bug. 
They only overuse overloading but otherwise seem OK, so I will not add changes 
to {{updatePartitionStatsFast()}} as part of this fix.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372434#comment-16372434
 ] 

Hive QA commented on HIVE-18743:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 38s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 19s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 46s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 19s{color} | {color:green} standalone-metastore: The patch generated 0 new + 484 unchanged - 3 fixed = 484 total (was 487) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 43s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 12s{color} | {color:red} The patch generated 49 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 11m 22s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-9304/dev-support/hive-personality.sh |
| git revision | master / ec2378f |
| Default Java | 1.8.0_111 |
| asflicense | http://104.198.109.242/logs//PreCommit-HIVE-Build-9304/yetus/patch-asflicense-problems.txt |
| modules | C: standalone-metastore U: standalone-metastore |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-9304/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372416#comment-16372416
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

I think we need to add a property in the environment context which disables 
stats update. We can keep existing {{DO_NOT_UPDATE_STATS}} for compatibility 
with existing apps for a while, but switch to the new property for new uses.
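
A rough sketch of how such a switch could be resolved, assuming the environment context properties and the table parameters are both plain string maps. The property name EXAMPLE_SKIP_STATS_UPDATE and the class below are hypothetical and exist only for this illustration; nothing here is taken from the patch. The new environment context property is consulted first, and the legacy {{DO_NOT_UPDATE_STATS}} table parameter is still honoured for existing apps.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the names are not part of Hive's API.
class StatsSwitchSketch {
  // Legacy table parameter kept for compatibility with existing apps.
  static final String LEGACY_TABLE_PARAM = "DO_NOT_UPDATE_STATS";
  // Made-up name for the proposed environment context property.
  static final String ENV_CONTEXT_PROP = "EXAMPLE_SKIP_STATS_UPDATE";

  /** True if either the new property or the legacy table parameter asks to skip stats. */
  static boolean skipStatsUpdate(Map<String, String> envContextProps,
                                 Map<String, String> tableParams) {
    if (envContextProps != null
        && Boolean.parseBoolean(envContextProps.get(ENV_CONTEXT_PROP))) {
      return true; // new-style switch, passed per call
    }
    return tableParams != null
        && Boolean.parseBoolean(tableParams.get(LEGACY_TABLE_PARAM)); // old-style switch
  }

  public static void main(String[] args) {
    Map<String, String> env = new HashMap<>();
    env.put(ENV_CONTEXT_PROP, "true");
    Map<String, String> legacy = new HashMap<>();
    legacy.put(LEGACY_TABLE_PARAM, "true");
    Map<String, String> none = Collections.emptyMap();

    System.out.println(skipStatsUpdate(env, none));    // new property only
    System.out.println(skipStatsUpdate(none, legacy)); // legacy parameter only
    System.out.println(skipStatsUpdate(null, none));   // neither set
  }
}
```

Keeping the decision in one small predicate like this would let callers evaluate it before any file listing is attempted, whichever of the two switches is used.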



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372412#comment-16372412
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

Similar problems exist in {{updatePartitionStatsFast()}}. 

But there, for some reason, stats collection cannot be disabled with 
{{DO_NOT_UPDATE_STATS}}.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372391#comment-16372391
 ] 

Hive QA commented on HIVE-18743:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12911484/HIVE-18743.02.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 39 failed/errored test(s), 13011 tests 
executed
*Failed tests:*
{noformat}
TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out) 
(batchId=93)


[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372368#comment-16372368
 ] 

Hive QA commented on HIVE-18743:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 54s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 34s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 19s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 44s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 19s{color} | {color:red} standalone-metastore: The patch generated 1 new + 484 unchanged - 3 fixed = 485 total (was 487) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 43s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 12s{color} | {color:red} The patch generated 49 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 11m 16s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-9302/dev-support/hive-personality.sh |
| git revision | master / ec2378f |
| Default Java | 1.8.0_111 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-9302/yetus/diff-checkstyle-standalone-metastore.txt |
| asflicense | http://104.198.109.242/logs//PreCommit-HIVE-Build-9302/yetus/patch-asflicense-problems.txt |
| modules | C: standalone-metastore U: standalone-metastore |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-9302/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 1.2.0, 1.1.0, 2.0.2, 3.0.0
>Reporter: Alexander Behm
>Assignee: Alexander Kolbasov
>Priority: Major
> Attachments: HIVE-18743.01.patch, HIVE-18743.02.patch
>
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, 
> Warehouse wh,
>  boolean 

[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372329#comment-16372329
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

While looking at this code I discovered a few more interesting things.

1) Many more conditions must hold before the result of 
{{wh.getFileStatusesForUnpartitionedTable()}} is actually used, so all of them 
should be checked *before* we bother traversing the filesystem.
2) If someone sets {{DO_NOT_UPDATE_STATS}} as a persistent table property, it 
will be removed, which seems wrong; it should be passed via the environment 
context instead.
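
To make the two points above concrete, here is a minimal sketch of the ordering I have in mind. It is not the actual patch: {{StatsGuardSketch}}, {{listFileSizes()}} and the plain maps standing in for {{EnvironmentContext}} properties and table parameters are all hypothetical, and the {{numFiles}}/{{totalSize}} keys are used only for illustration. The idea is simply that every cheap condition, including the marker read from the per-call context, is evaluated before the listing is paid for.

```java
import java.util.Map;

// Hypothetical stand-in for illustration; not the real MetaStoreUtils class.
final class StatsGuardSketch {
  static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

  // Counts simulated filesystem listings so the ordering can be observed.
  static int listingCalls = 0;

  // Stand-in for wh.getFileStatusesForUnpartitionedTable(db, tbl); returns file sizes.
  static long[] listFileSizes() {
    listingCalls++;
    return new long[] {128L, 256L};
  }

  static boolean updateTableStatsFast(int partitionKeyCount,
      Map<String, String> envContextProps, Map<String, String> tableParams) {
    // Cheap check first: only unpartitioned tables are handled here.
    if (partitionKeyCount != 0) {
      return false;
    }
    // The marker is read from the per-call context, not from table parameters,
    // so nothing persistent has to be stripped afterwards.
    if (envContextProps != null
        && Boolean.parseBoolean(envContextProps.get(DO_NOT_UPDATE_STATS))) {
      return false;
    }
    // Only now pay for the (potentially very slow) listing.
    long[] sizes = listFileSizes();
    long total = 0;
    for (long size : sizes) {
      total += size;
    }
    tableParams.put("numFiles", String.valueOf(sizes.length));
    tableParams.put("totalSize", String.valueOf(total));
    return true;
  }
}
```

With the marker set to "true" in the context map, the method returns false and {{listingCalls}} stays at zero; without it, the listing runs once and the two stats keys are filled in.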

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 1.2.0, 1.1.0, 2.0.2, 3.0.0
>Reporter: Alexander Behm
>Assignee: Alexander Kolbasov
>Priority: Major
> Attachments: HIVE-18743.01.patch
>
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>       boolean madeDir, boolean forceRecompute, EnvironmentContext environmentContext)
>       throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // DO_NOT_UPDATE_STATS is checked inside the call below, after
>       // wh.getFileStatusesForUnpartitionedTable() has already been called
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute,
>           environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}





[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-21 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372234#comment-16372234
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

The code that checks for {{DO_NOT_UPDATE_STATS}}, as well as the property 
itself, was added as part of HIVE-10228 with the following comment:

{code}
  // This string constant is used by AlterHandler to figure out that it should not attempt to
  // update stats. It is set by any client-side task which wishes to signal that no stats
  // update should take place, such as with replication.
  public static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";
{code}

The actual check is rather strange:

{code}
if ((params != null) && params.containsKey(StatsSetupConst.DO_NOT_UPDATE_STATS)) {
  boolean doNotUpdateStats =
      Boolean.valueOf(params.get(StatsSetupConst.DO_NOT_UPDATE_STATS));
  params.remove(StatsSetupConst.DO_NOT_UPDATE_STATS);
  tbl.setParameters(params); // to make sure we remove this marker property
  if (doNotUpdateStats) {
    return false;
  }
}
{code}

So after the check the {{DO_NOT_UPDATE_STATS}} is removed from parameters for 
some reason.
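
A small self-contained sketch of what this check-and-remove behavior means for a caller may help. It mirrors the snippet above but uses a plain map in place of the table parameters; {{MarkerRemovalSketch}} and {{consumeMarker}} are hypothetical names for illustration, not Hive code. The point is that the marker works only once: whoever sets it as a table property will find it gone after the first pass through the check.

```java
import java.util.Map;

// Hypothetical stand-in that mirrors the quoted check on a plain map.
final class MarkerRemovalSketch {
  static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

  /**
   * Returns true when the stats update should be skipped.
   * Side effect: the marker is always removed from params when present,
   * regardless of whether its value was "true" or "false".
   */
  static boolean consumeMarker(Map<String, String> params) {
    if (params != null && params.containsKey(DO_NOT_UPDATE_STATS)) {
      boolean doNotUpdateStats = Boolean.parseBoolean(params.get(DO_NOT_UPDATE_STATS));
      params.remove(DO_NOT_UPDATE_STATS); // marker is stripped here
      return doNotUpdateStats;
    }
    return false;
  }
}
```

Calling {{consumeMarker}} twice on the same map returns true and then false, which is the one-shot behavior I find surprising for something that looks like a table property.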

[~ashutoshc] [~thejas] Can you comment on why the parameter is removed after the 
check and why the check is performed after the file system operations are complete? 
To what extent does remote replication depend on the existing behavior?

> CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround 
> is buggy.
> ---
>
> Key: HIVE-18743
> URL: https://issues.apache.org/jira/browse/HIVE-18743
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 1.2.0, 1.1.0
>Reporter: Alexander Behm
>Assignee: Alexander Kolbasov
>Priority: Major
>
> When hive.stats.autogather=true then the Metastore lists all files under the 
> table directory to populate basic stats like file counts and sizes. This file 
> listing operation can be very expensive particularly on filesystems like S3.
> One way to address this issue is to reconfigure hive.stats.autogather=false.
> *Here's the bug*
> It is my understanding that the DO_NOT_UPDATE_STATS table property is 
> intended to selectively prevent this stats collection. Unfortunately, this 
> table property is checked *after* the expensive file listing operation, so 
> the DO_NOT_UPDATE_STATS does not seem to work as intended. See:
> https://github.com/apache/hive/blob/master/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/utils/MetaStoreUtils.java#L633
> Relevant code snippet:
> {code}
>   public static boolean updateTableStatsFast(Database db, Table tbl, Warehouse wh,
>       boolean madeDir, boolean forceRecompute, EnvironmentContext environmentContext)
>       throws MetaException {
>     if (tbl.getPartitionKeysSize() == 0) {
>       // Update stats only when unpartitioned
>       FileStatus[] fileStatuses = wh.getFileStatusesForUnpartitionedTable(db, tbl);
>       // DO_NOT_UPDATE_STATS is checked inside the call below, after
>       // wh.getFileStatusesForUnpartitionedTable() has already been called
>       return updateTableStatsFast(tbl, fileStatuses, madeDir, forceRecompute,
>           environmentContext);
>     } else {
>       return false;
>     }
>   }
> {code}





[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-20 Thread Alexander Behm (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370276#comment-16370276
 ] 

Alexander Behm commented on HIVE-18743:
---

Thank you, [~kgyrtkirk]. I agree completely. I'm very much in favor of getting 
rid of all non-obvious side effects of Metastore API calls. Stats collection is 
one of those side effects. As it is today, it is very hard to reason about what 
exactly the Metastore will do and how expensive API calls are.



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-19 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368883#comment-16368883
 ] 

Zoltan Haindrich commented on HIVE-18743:
-

I feel that we should probably consider abandoning stats collection in the 
metastore entirely; it should be done only from the Hive task, which already 
sets DO_NOT_UPDATE_STATS. I think there is a ticket for this somewhere...



[jira] [Commented] (HIVE-18743) CREATE TABLE on S3 data can be extremely slow. DO_NOT_UPDATE_STATS workaround is buggy.

2018-02-16 Thread Alexander Kolbasov (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368119#comment-16368119
 ] 

Alexander Kolbasov commented on HIVE-18743:
---

Will take a look at this.
