Copilot commented on code in PR #6438:
URL: https://github.com/apache/hive/pull/6438#discussion_r3151321483
##########
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java:
##########
@@ -1525,7 +1540,8 @@ public enum ConfVars {
        ACID_METRICS_TASK_CLASS + "," + ACID_METRICS_LOGGER_CLASS + "," +
            "org.apache.hadoop.hive.metastore.HiveProtoEventsCleanerTask" + "," +
            "org.apache.hadoop.hive.metastore.ScheduledQueryExecutionsMaintTask" + ","
-            + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask",
+            + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask" + ","
+            + "org.apache.hadoop.hive.metastore.StatisticsManagementTask",
Review Comment:
Adding `org.apache.hadoop.hive.metastore.StatisticsManagementTask` to
`METASTORE_TASK_THREADS_ALWAYS` means it will start in all deployments. Given
the new configs currently default to enabling deletion, this can change runtime
behavior unexpectedly after upgrade; ensure this is gated behind an explicit
opt-in (or keep it out of the always-start list until explicitly configured).
```suggestion
+ "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask",
```
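
For illustration, the opt-in pattern this comment asks for can be sketched in plain Java. All names here are illustrative stand-ins, not Hive's actual startup code or config keys: the task is added to the always-start list only when the user has explicitly enabled it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TaskListSketch {
  // Hypothetical helper: build the always-start task list, adding the new
  // statistics task only when the user explicitly opted in via config.
  static List<String> alwaysStartTasks(Map<String, String> conf) {
    List<String> tasks = new ArrayList<>();
    tasks.add("org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask");
    // Explicit opt-in: an absent or "false" value means the task is never scheduled.
    if (Boolean.parseBoolean(conf.getOrDefault("metastore.statistics.auto.deletion", "false"))) {
      tasks.add("org.apache.hadoop.hive.metastore.StatisticsManagementTask");
    }
    return tasks;
  }

  public static void main(String[] args) {
    System.out.println(alwaysStartTasks(Map.of()).size());                                             // 1
    System.out.println(alwaysStartTasks(Map.of("metastore.statistics.auto.deletion", "true")).size()); // 2
  }
}
```

This keeps upgrade behavior unchanged for existing deployments: without an explicit config change, the new task never runs.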
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java:
##########
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.metastore;
+
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.api.DeleteColumnStatisticsRequest;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.model.MTableColumnStatistics;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.jdo.PersistenceManager;
+import javax.jdo.Query;
+
+/**
+ * Statistics management task is primarily responsible for auto deletion of table column stats based on a certain frequency
+ *
+ * If some table or partition column statistics are older than the configured retention interval
+ * (MetastoreConf.ConfVars.STATISTICS_RETENTION_PERIOD), they are deleted when this metastore task runs periodically.
+ */
+public class StatisticsManagementTask implements MetastoreTaskThread {
+  private static final Logger LOG = LoggerFactory.getLogger(StatisticsManagementTask.class);
+
+  // The 2 configs for users to set in the conf
+  // this is an optional table property, if this property does not exist for a table, then it is not excluded
+  public static final String STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY = "statistics.auto.deletion.exclude";
+
+  private static final Lock lock = new ReentrantLock();
+
+  private Configuration conf;
+
+  @Override
+  public long runFrequency(TimeUnit unit) {
+    return MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.STATISTICS_MANAGEMENT_TASK_FREQUENCY, unit);
+  }
+
+  @Override
+  public void setConf(Configuration configuration) {
+    // we modify conf in setupConf(), so we make a copy
+    this.conf = configuration;
Review Comment:
`setConf` says it makes a copy because the task may mutate the
configuration, but it currently assigns the passed instance directly. This
diverges from `PartitionManagementTask` (which does `new
Configuration(configuration)`) and can cause unexpected
cross-thread/shared-conf mutations; make a defensive copy here as well (or
update the comment if no mutation happens).
```suggestion
this.conf = new Configuration(configuration);
```
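
The hazard being flagged is ordinary shared-mutable-state: assigning the caller's instance means any later mutation by the task leaks back into every other holder of that `Configuration`. A minimal stand-in using plain maps (illustrative only, not Hive or Hadoop code) shows what the defensive copy buys:

```java
import java.util.HashMap;
import java.util.Map;

public class DefensiveCopySketch {
  // Mutate through an alias: the caller's map changes too
  // (analogous to `this.conf = configuration` with no copy).
  static String mutateAliased(Map<String, String> conf) {
    Map<String, String> taskConf = conf;
    taskConf.put("key", "mutated-by-task");
    return conf.get("key");
  }

  // Mutate a defensive copy: the caller's map is untouched
  // (analogous to what `new Configuration(configuration)` achieves).
  static String mutateCopy(Map<String, String> conf) {
    Map<String, String> taskConf = new HashMap<>(conf);
    taskConf.put("key", "mutated-by-task");
    return conf.get("key");
  }

  public static void main(String[] args) {
    Map<String, String> shared = new HashMap<>(Map.of("key", "original"));
    System.out.println(mutateAliased(shared)); // mutated-by-task
    shared.put("key", "original");
    System.out.println(mutateCopy(shared));    // original
  }
}
```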
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java:
##########
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.metastore;
+
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.api.DeleteColumnStatisticsRequest;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.model.MTableColumnStatistics;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.jdo.PersistenceManager;
+import javax.jdo.Query;
+
+/**
+ * Statistics management task is primarily responsible for auto deletion of table column stats based on a certain frequency
+ *
+ * If some table or partition column statistics are older than the configured retention interval
+ * (MetastoreConf.ConfVars.STATISTICS_RETENTION_PERIOD), they are deleted when this metastore task runs periodically.
+ */
+public class StatisticsManagementTask implements MetastoreTaskThread {
+  private static final Logger LOG = LoggerFactory.getLogger(StatisticsManagementTask.class);
+
+  // The 2 configs for users to set in the conf
+  // this is an optional table property, if this property does not exist for a table, then it is not excluded
+  public static final String STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY = "statistics.auto.deletion.exclude";
+
+  private static final Lock lock = new ReentrantLock();
+
+  private Configuration conf;
+
+  @Override
+  public long runFrequency(TimeUnit unit) {
+    return MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.STATISTICS_MANAGEMENT_TASK_FREQUENCY, unit);
+  }
+
+  @Override
+  public void setConf(Configuration configuration) {
+    // we modify conf in setupConf(), so we make a copy
+    this.conf = configuration;
+  }
+
+  @Override
+  public Configuration getConf() {
+    return conf;
+  }
+
+  // what needs to be included in this run() method:
+  // get the "lastAnalyzed" information from TAB_COL_STATS and find all the tables that need to be deleted
+  // delete all column stats
+  @Override
+  public void run() {
+    LOG.debug("Auto statistics deletion started. Cleaning up table/partition column statistics over the retention period.");
+    long retentionMillis = MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.STATISTICS_RETENTION_PERIOD, TimeUnit.MILLISECONDS);
+    if (retentionMillis <= 0 || !MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.STATISTICS_AUTO_DELETION)) {
+      LOG.info("Statistics auto deletion is set to off currently.");
+      return;
+    }
+    if (!lock.tryLock()) {
+      return;
+    }
+    try {
+      long now = System.currentTimeMillis();
+      long lastAnalyzedThreshold = (now - retentionMillis) / 1000;
+
+      String filter = "lastAnalyzed < threshold";
+      String paramStr = "long threshold";
+      try (IMetaStoreClient msc = new HiveMetaStoreClient(conf)) {
+        RawStore ms = HMSHandler.getMSForConf(conf);
+        PersistenceManager pm = ((ObjectStore) ms).getPersistenceManager();
+
+        Query q = null;
+        try {
+          q = pm.newQuery(MTableColumnStatistics.class);
+          q.setFilter(filter);
+          q.declareParameters(paramStr);
+          // only fetch required fields, avoid loading heavy MTable objects
+          q.setResult(
+              "table.database.name, " +
+              "table.tableName, " +
+              "partitionName, " +
+              "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")"
+          );
+          @SuppressWarnings("unchecked")
+          List<Object[]> rows = (List<Object[]>) q.execute(lastAnalyzedThreshold);
+
+          for (Object[] row : rows) {
+            String dbName = (String) row[0];
+            String tblName = (String) row[1];
+            String partName = (String) row[2]; // can be null for table-level stats
+            String excludeVal = (String) row[3]; // can be null
+
+            // exclude check uses projected param value
+            if (excludeVal != null) {
Review Comment:
The exclusion check treats any non-null value of
`statistics.auto.deletion.exclude` as excluded (including "false"). This makes
it impossible to explicitly set the property to false and still have deletion
run; compare the value to "true" (case-insensitive) instead of only checking
for presence.
```suggestion
if ("true".equalsIgnoreCase(excludeVal)) {
```
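
The behavioral difference is easy to see with plain string checks, independent of Hive: a presence-only test treats `"false"` as excluded, while the suggested value test does not.

```java
public class ExcludeCheckSketch {
  // Presence-only check: any non-null value excludes the table, even "false".
  static boolean excludedByPresence(String excludeVal) {
    return excludeVal != null;
  }

  // Value check, as the review suggests: only an explicit "true" excludes,
  // case-insensitively, and a null (absent property) never excludes.
  static boolean excludedByValue(String excludeVal) {
    return "true".equalsIgnoreCase(excludeVal);
  }

  public static void main(String[] args) {
    System.out.println(excludedByPresence("false")); // true  -- table wrongly skipped
    System.out.println(excludedByValue("false"));    // false -- deletion still runs
    System.out.println(excludedByValue("TRUE"));     // true  -- case-insensitive opt-out
    System.out.println(excludedByValue(null));       // false -- property absent
  }
}
```

Note that `"true".equalsIgnoreCase(null)` simply returns `false`, so the null check the projection requires comes for free.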
##########
standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java:
##########
@@ -695,4 +697,64 @@ static DescriptiveStatistics benchmarkPartitionManagement(@NotNull MicroBenchmar
}
}
+  static DescriptiveStatistics benchmarkStatisticsManagement(@NotNull MicroBenchmark bench,
+                                                             @NotNull BenchData data,
+                                                             int tableCount) {
+
+    String dbName = data.dbName + "_" + tableCount;
+    String tableNamePrefix = data.tableName;
+    final HMSClient client = data.getClient();
+    final StatisticsManagementTask statsTask = new StatisticsManagementTask();
+    try {
+      client.getHadoopConf().set("hive.metastore.uris", client.getServerURI().toString());
+      client.getHadoopConf().set("metastore.statistics.management.database.pattern", dbName);
+      statsTask.setConf(client.getHadoopConf());
+
+      client.createDatabase(dbName);
+      for (int i = 0; i < tableCount; i++) {
+        String tableName = tableNamePrefix + "_" + i;
+        Util.TableBuilder tableBuilder = new Util.TableBuilder(dbName, tableName)
+            .withType(TableType.MANAGED_TABLE)
+            .withColumns(createSchema(Arrays.asList("col1:string", "col2:int")))
+            .withPartitionKeys(createSchema(Collections.singletonList("part_col")))
+            .withParameter("columnStatsAccurate", "true");
+
+        client.createTable(tableBuilder.build());
+        addManyPartitionsNoException(client, dbName, tableName, null, Collections.singletonList("part_col"), 100);
+
+        // simulate the partitions of each table whose stats have an old "lastAnalyzed"
+        List<Partition> partitions = client.listPartitions(dbName, tableName);
+        for (Partition partition : partitions) {
+          Map<String, String> params = partition.getParameters();
+          // to manually change the "lastAnalyzed" to an old time, ex. 400 days
+          params.put("lastAnalyzed", String.valueOf(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)));
+        }
+        client.alterPartitions(dbName, tableName, partitions);
+      }
Review Comment:
The benchmark simulates expired stats by updating the partition parameter
`lastAnalyzed`, but `StatisticsManagementTask` currently queries
`MTableColumnStatistics.lastAnalyzed` (TAB_COL_STATS) instead of
partition/table parameters. As a result, this setup (and the post-run assertion
that `lastAnalyzed` is removed from partition parameters) is not exercising the
code path being benchmarked; update the benchmark to expire the same metadata
the task uses (or update the task to use the same source of truth).
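
The mismatch described here is the classic "wrong source of truth" bug: the benchmark expires a marker in one store while the task's query reads another. A self-contained analog, with plain Java maps standing in for partition parameters and TAB_COL_STATS (all names illustrative, not Hive code):

```java
import java.util.HashMap;
import java.util.Map;

public class SourceOfTruthSketch {
  static final long DAY_SECONDS = 24 * 3600L;

  // The task's check: compare the stats row's lastAnalyzed (seconds) to the threshold.
  static boolean taskWouldDelete(Map<String, Long> tabColStats, long nowSeconds, long retentionDays) {
    long threshold = nowSeconds - retentionDays * DAY_SECONDS;
    return tabColStats.get("db.tbl.c1") < threshold;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis() / 1000;
    Map<String, Long> partitionParams = new HashMap<>(); // what the benchmark mutates
    Map<String, Long> tabColStats = new HashMap<>();     // what the task actually reads

    tabColStats.put("db.tbl.c1", now);                            // stats row stays fresh...
    partitionParams.put("lastAnalyzed", now - 400 * DAY_SECONDS); // ...benchmark "expires" the wrong store

    System.out.println(taskWouldDelete(tabColStats, now, 365)); // false: the benchmarked path never fires
  }
}
```

Expiring the same metadata the task queries (or pointing the task at the benchmark's store) is what makes the measurement meaningful.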
##########
standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java:
##########
@@ -695,4 +697,64 @@ static DescriptiveStatistics benchmarkPartitionManagement(@NotNull MicroBenchmar
}
}
+  static DescriptiveStatistics benchmarkStatisticsManagement(@NotNull MicroBenchmark bench,
+                                                             @NotNull BenchData data,
+                                                             int tableCount) {
+
+    String dbName = data.dbName + "_" + tableCount;
+    String tableNamePrefix = data.tableName;
+    final HMSClient client = data.getClient();
+    final StatisticsManagementTask statsTask = new StatisticsManagementTask();
+    try {
+      client.getHadoopConf().set("hive.metastore.uris", client.getServerURI().toString());
+      client.getHadoopConf().set("metastore.statistics.management.database.pattern", dbName);
+      statsTask.setConf(client.getHadoopConf());
+
+      client.createDatabase(dbName);
+      for (int i = 0; i < tableCount; i++) {
+        String tableName = tableNamePrefix + "_" + i;
+        Util.TableBuilder tableBuilder = new Util.TableBuilder(dbName, tableName)
+            .withType(TableType.MANAGED_TABLE)
+            .withColumns(createSchema(Arrays.asList("col1:string", "col2:int")))
+            .withPartitionKeys(createSchema(Collections.singletonList("part_col")))
+            .withParameter("columnStatsAccurate", "true");
+
Review Comment:
Table/partition stats accuracy is tracked under the `COLUMN_STATS_ACCURATE`
parameter (see `StatsSetupConst.COLUMN_STATS_ACCURATE`), but the benchmark sets
`columnStatsAccurate`. If this parameter is intended to make the metastore
think column stats are present/accurate, the current key likely won't be
recognized; use the correct constant/key.
##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java:
##########
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.metastore;
+
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
Review Comment:
`java.util.Map` is imported but not used in this class; please remove the
unused import to keep the file clean (and to satisfy builds that fail on unused
imports/checkstyle).
##########
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java:
##########
@@ -0,0 +1,239 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.metastore;
+
+import static org.apache.hadoop.hive.metastore.Warehouse.DEFAULT_CATALOG_NAME;
+import static org.apache.hadoop.hive.metastore.Warehouse.DEFAULT_DATABASE_NAME;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.annotation.MetastoreUnitTest;
+import org.apache.hadoop.hive.metastore.api.ColumnStatistics;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsData;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsDesc;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj;
+import org.apache.hadoop.hive.metastore.api.Database;
+import org.apache.hadoop.hive.metastore.api.DoubleColumnStatsData;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.hadoop.hive.metastore.client.builder.DatabaseBuilder;
+import org.apache.hadoop.hive.metastore.client.builder.TableBuilder;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf.ConfVars;
+import org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge;
+import org.apache.hadoop.hive.metastore.utils.TestTxnDbUtil;
+import org.apache.thrift.TException;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+
+import javax.jdo.PersistenceManager;
+import javax.jdo.Query;
+
+@Category(MetastoreUnitTest.class)
+public class TestStatisticsManagement {
+
+ private IMetaStoreClient client;
+ private Configuration conf;
+
+ @Before
+ public void setUp() throws Exception {
+ conf = MetastoreConf.newMetastoreConf();
+    MetastoreConf.setVar(conf, ConfVars.METASTORE_METADATA_TRANSFORMER_CLASS, " ");
+    MetaStoreTestUtils.setConfForStandloneMode(conf);
+    conf.setBoolean(ConfVars.MULTITHREADED.getVarname(), false);
+    conf.setBoolean(ConfVars.HIVE_IN_TEST.getVarname(), true);
+
+    // enable stats auto deletion, set up a short retention so threshold check triggers easily
+    MetastoreConf.setBoolVar(conf, ConfVars.STATISTICS_AUTO_DELETION, true);
+    MetastoreConf.setTimeVar(conf, ConfVars.STATISTICS_RETENTION_PERIOD, 1, TimeUnit.DAYS);
+
+    MetaStoreTestUtils.startMetaStoreWithRetry(HadoopThriftAuthBridge.getBridge(), conf);
+ TestTxnDbUtil.setConfValues(conf);
+ TestTxnDbUtil.prepDb(conf);
+
+ client = new HiveMetaStoreClient(conf);
+ }
+
+ @After
+ public void tearDown() throws Exception {
+ if (client != null) {
+      // Drop any leftover databases, similar to TestPartitionManagement.java
+      List<String> dbs = client.getAllDatabases(DEFAULT_CATALOG_NAME);
+      for (String db : dbs) {
+        if (!db.equalsIgnoreCase(DEFAULT_DATABASE_NAME)) {
+          client.dropDatabase(DEFAULT_CATALOG_NAME, db, true, false, true);
+ }
+ }
+ }
+ try {
+ if (client != null) {
+ client.close();
+ }
+ } finally {
+ client = null;
+ }
+ }
+
+ @Test
+ public void testExpiredTableColStatsAreDeleted() throws Exception {
+ String dbName = "stats_db1";
+ String tableName = "tbl1";
+ createDbAndTable(dbName, tableName, false);
+ // create a column stats entry (table-level)
+ writeTableLevelColStats(dbName, tableName, "c1");
+ // ensure stats exists
+ assertHasTableColStats(dbName, tableName, "c1");
+ // make lastAnalyzed older than the threshold
+ makeAllTableColStatsOlderThanRetention(dbName, tableName);
+
+ runStatisticsManagementTask(conf);
+ assertNoTableColStats(dbName, tableName, "c1");
+ }
+
+ @Test
+ public void testExcludedTableStatsAreNotDeleted() throws Exception {
+ String dbName = "stats_db2";
+ String tableName = "tbl2";
+ // Create a database and table that ARE excluded from auto deletion.
+ createDbAndTable(dbName, tableName, true);
+ writeTableLevelColStats(dbName, tableName, "c1");
+ assertHasTableColStats(dbName, tableName, "c1");
+
+    // Manually set lastAnalyzed to a very old timestamp so it would normally be expired.
+ makeAllTableColStatsOlderThanRetention(dbName, tableName);
+
+ runStatisticsManagementTask(conf);
+
+ // Verify that stats are still present because the table is excluded.
+ assertHasTableColStats(dbName, tableName, "c1");
+ }
+
+  private void createDbAndTable(String dbName, String tableName, boolean exclude) throws Exception {
+ Database db;
+ if (!DEFAULT_DATABASE_NAME.equals(dbName)) {
+ db = new DatabaseBuilder()
+ .setName(dbName)
+ .setCatalogName(DEFAULT_CATALOG_NAME)
+ .create(client, conf);
+ } else {
+      db = client.getDatabase(DEFAULT_CATALOG_NAME, DEFAULT_DATABASE_NAME);
+ }
+
+ // Build a simple test table with two columns.
+    TableBuilder tb = new TableBuilder()
+        .inDb(db)
+        .setTableName(tableName)
+        .addCol("c1", "double")
+        .addCol("c2", "string")
+        .setInputFormat("org.apache.hadoop.hive.ql.io.orc.OrcInputFormat")
+        .setOutputFormat("org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat");
+
+ Table t = tb.build(conf);
+
+    // If requested, mark this table as excluded from automatic stats deletion.
+    if (exclude) {
+      t.getParameters().put(StatisticsManagementTask.STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY, "true");
+ }
+ client.createTable(t);
+ client.flushCache();
+ }
+
+  private void writeTableLevelColStats(String db, String tbl, String col) throws TException {
+ // Create a stats object for one column.
+ ColumnStatisticsObj obj = new ColumnStatisticsObj();
+ obj.setColName(col);
+ obj.setColType("double");
+
+ // Fill in minimal double-column statistics data.
+ DoubleColumnStatsData doubleData = new DoubleColumnStatsData();
+ doubleData.setNumNulls(0);
+ doubleData.setNumDVs(1);
+ doubleData.setLowValue(1.0);
+ doubleData.setHighValue(1.0);
+
+ ColumnStatisticsData data = new ColumnStatisticsData();
+ data.setDoubleStats(doubleData);
+ obj.setStatsData(data);
+
+ ColumnStatistics cs = new ColumnStatistics();
+ ColumnStatisticsDesc desc = new ColumnStatisticsDesc(true, db, tbl);
+ desc.setCatName("hive");
+ cs.addToStatsObj(obj);
+
+ client.updateTableColumnStatistics(cs);
+ }
Review Comment:
`ColumnStatisticsDesc desc` is created but never attached to the
`ColumnStatistics` object. `updateTableColumnStatistics` expects the stats
descriptor (db/table/cat/lastAnalyzed/etc.) to be set (see other tests like
`TestHiveMetaStore`); call `cs.setStatsDesc(desc)` before updating, otherwise
the update can fail or write incomplete stats.
##########
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java:
##########
@@ -0,0 +1,239 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.metastore;
+
+import static org.apache.hadoop.hive.metastore.Warehouse.DEFAULT_CATALOG_NAME;
+import static org.apache.hadoop.hive.metastore.Warehouse.DEFAULT_DATABASE_NAME;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.annotation.MetastoreUnitTest;
+import org.apache.hadoop.hive.metastore.api.ColumnStatistics;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsData;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsDesc;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj;
+import org.apache.hadoop.hive.metastore.api.Database;
+import org.apache.hadoop.hive.metastore.api.DoubleColumnStatsData;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.hadoop.hive.metastore.client.builder.DatabaseBuilder;
+import org.apache.hadoop.hive.metastore.client.builder.TableBuilder;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf.ConfVars;
+import org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge;
+import org.apache.hadoop.hive.metastore.utils.TestTxnDbUtil;
+import org.apache.thrift.TException;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+
+import javax.jdo.PersistenceManager;
+import javax.jdo.Query;
+
+@Category(MetastoreUnitTest.class)
+public class TestStatisticsManagement {
+
+ private IMetaStoreClient client;
+ private Configuration conf;
+
+ @Before
+ public void setUp() throws Exception {
+ conf = MetastoreConf.newMetastoreConf();
+    MetastoreConf.setVar(conf, ConfVars.METASTORE_METADATA_TRANSFORMER_CLASS, " ");
+    MetaStoreTestUtils.setConfForStandloneMode(conf);
+    conf.setBoolean(ConfVars.MULTITHREADED.getVarname(), false);
+    conf.setBoolean(ConfVars.HIVE_IN_TEST.getVarname(), true);
+
+    // enable stats auto deletion, set up a short retention so threshold check triggers easily
+    MetastoreConf.setBoolVar(conf, ConfVars.STATISTICS_AUTO_DELETION, true);
+    MetastoreConf.setTimeVar(conf, ConfVars.STATISTICS_RETENTION_PERIOD, 1, TimeUnit.DAYS);
+
+    MetaStoreTestUtils.startMetaStoreWithRetry(HadoopThriftAuthBridge.getBridge(), conf);
+ TestTxnDbUtil.setConfValues(conf);
+ TestTxnDbUtil.prepDb(conf);
+
+ client = new HiveMetaStoreClient(conf);
+ }
+
+ @After
+ public void tearDown() throws Exception {
+ if (client != null) {
+      // Drop any leftover databases, similar to TestPartitionManagement.java
+      List<String> dbs = client.getAllDatabases(DEFAULT_CATALOG_NAME);
+      for (String db : dbs) {
+        if (!db.equalsIgnoreCase(DEFAULT_DATABASE_NAME)) {
+          client.dropDatabase(DEFAULT_CATALOG_NAME, db, true, false, true);
+ }
+ }
+ }
+ try {
+ if (client != null) {
+ client.close();
+ }
+ } finally {
+ client = null;
+ }
+ }
+
+ @Test
+ public void testExpiredTableColStatsAreDeleted() throws Exception {
+ String dbName = "stats_db1";
+ String tableName = "tbl1";
+ createDbAndTable(dbName, tableName, false);
+ // create a column stats entry (table-level)
+ writeTableLevelColStats(dbName, tableName, "c1");
+ // ensure stats exists
+ assertHasTableColStats(dbName, tableName, "c1");
+ // make lastAnalyzed older than the threshold
+ makeAllTableColStatsOlderThanRetention(dbName, tableName);
+
+ runStatisticsManagementTask(conf);
+ assertNoTableColStats(dbName, tableName, "c1");
+ }
+
+ @Test
+ public void testExcludedTableStatsAreNotDeleted() throws Exception {
+ String dbName = "stats_db2";
+ String tableName = "tbl2";
+ // Create a database and table that ARE excluded from auto deletion.
+ createDbAndTable(dbName, tableName, true);
+ writeTableLevelColStats(dbName, tableName, "c1");
+ assertHasTableColStats(dbName, tableName, "c1");
+
+    // Manually set lastAnalyzed to a very old timestamp so it would normally be expired.
+ makeAllTableColStatsOlderThanRetention(dbName, tableName);
+
+ runStatisticsManagementTask(conf);
+
+ // Verify that stats are still present because the table is excluded.
+ assertHasTableColStats(dbName, tableName, "c1");
+ }
+
+  private void createDbAndTable(String dbName, String tableName, boolean exclude) throws Exception {
+ Database db;
+ if (!DEFAULT_DATABASE_NAME.equals(dbName)) {
+ db = new DatabaseBuilder()
+ .setName(dbName)
+ .setCatalogName(DEFAULT_CATALOG_NAME)
+ .create(client, conf);
+ } else {
+      db = client.getDatabase(DEFAULT_CATALOG_NAME, DEFAULT_DATABASE_NAME);
+ }
+
+ // Build a simple test table with two columns.
+    TableBuilder tb = new TableBuilder()
+        .inDb(db)
+        .setTableName(tableName)
+        .addCol("c1", "double")
+        .addCol("c2", "string")
+        .setInputFormat("org.apache.hadoop.hive.ql.io.orc.OrcInputFormat")
+        .setOutputFormat("org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat");
+
+ Table t = tb.build(conf);
+
+    // If requested, mark this table as excluded from automatic stats deletion.
+    if (exclude) {
+      t.getParameters().put(StatisticsManagementTask.STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY, "true");
+ }
+ client.createTable(t);
+ client.flushCache();
+ }
+
+  private void writeTableLevelColStats(String db, String tbl, String col) throws TException {
+ // Create a stats object for one column.
+ ColumnStatisticsObj obj = new ColumnStatisticsObj();
+ obj.setColName(col);
+ obj.setColType("double");
+
+ // Fill in minimal double-column statistics data.
+ DoubleColumnStatsData doubleData = new DoubleColumnStatsData();
+ doubleData.setNumNulls(0);
+ doubleData.setNumDVs(1);
+ doubleData.setLowValue(1.0);
+ doubleData.setHighValue(1.0);
+
+ ColumnStatisticsData data = new ColumnStatisticsData();
+ data.setDoubleStats(doubleData);
+ obj.setStatsData(data);
+
+ ColumnStatistics cs = new ColumnStatistics();
+ ColumnStatisticsDesc desc = new ColumnStatisticsDesc(true, db, tbl);
+ desc.setCatName("hive");
+ cs.addToStatsObj(obj);
+
+ client.updateTableColumnStatistics(cs);
+ }
+
+  private void assertHasTableColStats(String db, String tbl, String col) throws TException {
+    List<ColumnStatisticsObj> objs = client.getTableColumnStatistics(db, tbl, List.of(col), "hive");
+    assertTrue("Expected stats for " + db + "." + tbl + "." + col, objs != null && !objs.isEmpty());
+ }
+
+  private void assertNoTableColStats(String db, String tbl, String col) throws TException {
+    try {
+      List<ColumnStatisticsObj> objs = client.getTableColumnStatistics(db, tbl, List.of(col), "hive");
+      assertTrue("Expected no stats for " + db + "." + tbl + "." + col, objs == null || objs.isEmpty());
+ } catch (NoSuchObjectException e) {
+ // acceptable: server may throw if stats absent depending on impl
+ }
+ }
+
+  private void makeAllTableColStatsOlderThanRetention(String db, String tbl) throws Exception {
+    // We update via ObjectStore/PM directly to avoid relying on params "lastAnalyzed".
+    RawStore ms = HMSHandler.getMSForConf(conf);
+    ObjectStore os = (ObjectStore) ms;
+    os.setConf(conf);
+
+    // Compute an old timestamp in seconds, here we use 400 days ago.
+    long oldSeconds = (System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)) / 1000;
+
+    // NOTE: exact JDO classes/field paths sometimes vary; adjust filter if needed based on MTableColumnStatistics mapping
Review Comment:
This test includes a TODO-style note about JDO field paths varying ("adjust
filter if needed"). Since this is committed test code, it should be
deterministic against the actual model mapping in this repo; please either
remove the note or replace it with a concrete explanation of why this
query/filter is correct here.
```suggestion
    // In this repository's JDO model, MTableColumnStatistics points to MTable via the
    // "table" field, and MTable exposes the table and database names as "tableName" and
    // "database.name". This filter therefore selects exactly the statistics rows for the
    // target table in the target database.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]