Copilot commented on code in PR #6438:
URL: https://github.com/apache/hive/pull/6438#discussion_r3151321483


##########
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java:
##########
@@ -1525,7 +1540,8 @@ public enum ConfVars {
             ACID_METRICS_TASK_CLASS + "," + ACID_METRICS_LOGGER_CLASS + "," +
             "org.apache.hadoop.hive.metastore.HiveProtoEventsCleanerTask" + ","
             + "org.apache.hadoop.hive.metastore.ScheduledQueryExecutionsMaintTask" + ","
-            + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask",
+            + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask" + ","
+            + "org.apache.hadoop.hive.metastore.StatisticsManagementTask",

Review Comment:
   Adding `org.apache.hadoop.hive.metastore.StatisticsManagementTask` to `METASTORE_TASK_THREADS_ALWAYS` means it will start in all deployments. Given the new configs currently default to enabling deletion, this can change runtime behavior unexpectedly after upgrade; ensure this is gated behind an explicit opt-in (or keep it out of the always-start list until explicitly configured).
   ```suggestion
               + "org.apache.hadoop.hive.metastore.ReplicationMetricsMaintTask",
   ```
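The opt-in pattern the comment asks for can be sketched without any Hive classes. In this standalone example, the `TaskGate` class and the `statistics.auto.deletion` key are illustrative stand-ins (not Hive's actual API); the point is only that the default resolves to "do not start", so an upgrade alone never enables the task.

```java
import java.util.Properties;

// Illustrative stand-in (not Hive's API): a background task that only starts
// when the operator explicitly opts in.
public class TaskGate {
    // Hypothetical config key, named after the PR's feature for readability.
    static final String OPT_IN_KEY = "statistics.auto.deletion";

    // Absent, empty, or anything other than "true" (case-insensitive) means "off".
    static boolean shouldStart(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(OPT_IN_KEY, "false"));
    }

    public static void main(String[] args) {
        Properties fresh = new Properties();          // post-upgrade default: nothing set
        Properties optedIn = new Properties();
        optedIn.setProperty(OPT_IN_KEY, "true");

        System.out.println(shouldStart(fresh));   // false
        System.out.println(shouldStart(optedIn)); // true
    }
}
```

With this shape, listing the class in an always-start set is safe, because the task itself refuses to do anything until the flag is flipped.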



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java:
##########
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.metastore;
+
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.api.DeleteColumnStatisticsRequest;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.model.MTableColumnStatistics;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.jdo.PersistenceManager;
+import javax.jdo.Query;
+
+/**
+ * Statistics management task is primarily responsible for auto deletion of table column stats based on a certain frequency
+ *
+ * If some table or partition column statistics are older than the configured retention interval
+ * (MetastoreConf.ConfVars.STATISTICS_RETENTION_PERIOD), they are deleted when this metastore task runs periodically.
+ */
+public class StatisticsManagementTask implements MetastoreTaskThread {
+    private static final Logger LOG = LoggerFactory.getLogger(StatisticsManagementTask.class);
+
+    // The 2 configs for users to set in the conf
+    // this is an optional table property, if this property does not exist for a table, then it is not excluded
+    public static final String STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY = "statistics.auto.deletion.exclude";
+
+    private static final Lock lock = new ReentrantLock();
+
+    private Configuration conf;
+
+    @Override
+    public long runFrequency(TimeUnit unit) {
+        return MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.STATISTICS_MANAGEMENT_TASK_FREQUENCY, unit);
+    }
+
+    @Override
+    public void setConf(Configuration configuration) {
+        // we modify conf in setupConf(), so we make a copy
+        this.conf = configuration;

Review Comment:
   `setConf` says it makes a copy because the task may mutate the configuration, but it currently assigns the passed instance directly. This diverges from `PartitionManagementTask` (which does `new Configuration(configuration)`) and can cause unexpected cross-thread/shared-conf mutations; make a defensive copy here as well (or update the comment if no mutation happens).
   ```suggestion
           this.conf = new Configuration(configuration);
   ```
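The aliasing hazard the comment describes can be shown with any mutable container; here a plain `HashMap` stands in for Hadoop's `Configuration` (which has an analogous copy constructor, `new Configuration(other)`):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of why a defensive copy matters: two components sharing one mutable
// config see each other's mutations; a copy isolates them.
public class DefensiveCopyDemo {
    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<>();
        shared.put("retention", "30d");

        Map<String, String> aliased = shared;               // what the reviewed setConf does
        Map<String, String> copied = new HashMap<>(shared); // what its comment promises

        // Some other component mutates the shared config later on.
        shared.put("retention", "1d");

        System.out.println(aliased.get("retention")); // 1d  - the mutation leaks through
        System.out.println(copied.get("retention"));  // 30d - the copy is isolated
    }
}
```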



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java:
##########
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.metastore;
+
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.api.DeleteColumnStatisticsRequest;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.model.MTableColumnStatistics;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.jdo.PersistenceManager;
+import javax.jdo.Query;
+
+/**
+ * Statistics management task is primarily responsible for auto deletion of table column stats based on a certain frequency
+ *
+ * If some table or partition column statistics are older than the configured retention interval
+ * (MetastoreConf.ConfVars.STATISTICS_RETENTION_PERIOD), they are deleted when this metastore task runs periodically.
+ */
+public class StatisticsManagementTask implements MetastoreTaskThread {
+    private static final Logger LOG = LoggerFactory.getLogger(StatisticsManagementTask.class);
+
+    // The 2 configs for users to set in the conf
+    // this is an optional table property, if this property does not exist for a table, then it is not excluded
+    public static final String STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY = "statistics.auto.deletion.exclude";
+
+    private static final Lock lock = new ReentrantLock();
+
+    private Configuration conf;
+
+    @Override
+    public long runFrequency(TimeUnit unit) {
+        return MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.STATISTICS_MANAGEMENT_TASK_FREQUENCY, unit);
+    }
+
+    @Override
+    public void setConf(Configuration configuration) {
+        // we modify conf in setupConf(), so we make a copy
+        this.conf = configuration;
+    }
+
+    @Override
+    public Configuration getConf() {
+        return conf;
+    }
+
+    // what needs to be included in this run() method:
+    // get the "lastAnalyzed" information from TAB_COL_STATS and find all the tables need to be deleted
+    // delete all column stats
+    @Override
+    public void run() {
+        LOG.debug("Auto statistics deletion started. Cleaning up table/partition column statistics over the retention period.");
+        long retentionMillis = MetastoreConf.getTimeVar(conf, MetastoreConf.ConfVars.STATISTICS_RETENTION_PERIOD, TimeUnit.MILLISECONDS);
+        if (retentionMillis <= 0 || !MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.STATISTICS_AUTO_DELETION)) {
+            LOG.info("Statistics auto deletion is set to off currently.");
+            return;
+        }
+        if (!lock.tryLock()) {
+            return;
+        }
+        try {
+            long now = System.currentTimeMillis();
+            long lastAnalyzedThreshold = (now - retentionMillis) / 1000;
+
+            String filter = "lastAnalyzed < threshold";
+            String paramStr = "long threshold";
+            try (IMetaStoreClient msc = new HiveMetaStoreClient(conf)) {
+                RawStore ms = HMSHandler.getMSForConf(conf);
+                PersistenceManager pm = ((ObjectStore) ms).getPersistenceManager();
+
+                Query q = null;
+                try {
+                    q = pm.newQuery(MTableColumnStatistics.class);
+                    q.setFilter(filter);
+                    q.declareParameters(paramStr);
+                    // only fetch required fields, avoid loading heavy MTable objects
+                    q.setResult(
+                            "table.database.name, " +
+                                    "table.tableName, " +
+                                    "partitionName, " +
+                                    "table.parameters.get(\"" + STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY + "\")"
+                    );
+                    @SuppressWarnings("unchecked")
+                    List<Object[]> rows = (List<Object[]>) q.execute(lastAnalyzedThreshold);
+
+                    for (Object[] row : rows) {
+                        String dbName = (String) row[0];
+                        String tblName = (String) row[1];
+                        String partName = (String) row[2];     // can be null for table-level stats
+                        String excludeVal = (String) row[3];   // can be null
+
+                        // exclude check uses projected param value
+                        if (excludeVal != null) {

Review Comment:
   The exclusion check treats any non-null value of `statistics.auto.deletion.exclude` as excluded (including "false"). This makes it impossible to explicitly set the property to false and still have deletion run; compare the value to "true" (case-insensitive) instead of only checking for presence.
   ```suggestion
                           if ("true".equalsIgnoreCase(excludeVal)) {
   ```
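The behavioral difference is easy to see in isolation; this sketch (independent of Hive) contrasts the presence check with the value check for the three states a table property can be in:

```java
// The two candidate semantics for the "exclude" table property, side by side.
public class ExcludeCheckDemo {
    // Current behavior in the PR: any value at all excludes the table.
    static boolean presenceCheck(String val) {
        return val != null;
    }

    // Suggested behavior: only an explicit true-ish value excludes it.
    static boolean valueCheck(String val) {
        return "true".equalsIgnoreCase(val);
    }

    public static void main(String[] args) {
        for (String v : new String[] {null, "false", "TRUE"}) {
            System.out.println(v + " -> presence=" + presenceCheck(v)
                    + ", value=" + valueCheck(v));
        }
        // Only the value check lets "false" mean "not excluded".
    }
}
```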



##########
standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java:
##########
@@ -695,4 +697,64 @@ static DescriptiveStatistics benchmarkPartitionManagement(@NotNull MicroBenchmar
     }
   }
 
+  static DescriptiveStatistics benchmarkStatisticsManagement(@NotNull MicroBenchmark bench,
+                                                             @NotNull BenchData data,
+                                                             int tableCount) {
+
+    String dbName = data.dbName + "_" + tableCount;
+    String tableNamePrefix = data.tableName;
+    final HMSClient client = data.getClient();
+    final StatisticsManagementTask statsTask = new StatisticsManagementTask();
+    try {
+      client.getHadoopConf().set("hive.metastore.uris", client.getServerURI().toString());
+      client.getHadoopConf().set("metastore.statistics.management.database.pattern", dbName);
+      statsTask.setConf(client.getHadoopConf());
+
+      client.createDatabase(dbName);
+      for (int i = 0; i < tableCount; i++) {
+        String tableName = tableNamePrefix + "_" + i;
+        Util.TableBuilder tableBuilder = new Util.TableBuilder(dbName, tableName)
+                .withType(TableType.MANAGED_TABLE)
+                .withColumns(createSchema(Arrays.asList("col1:string", "col2:int")))
+                .withPartitionKeys(createSchema(Collections.singletonList("part_col")))
+                .withParameter("columnStatsAccurate", "true");
+
+        client.createTable(tableBuilder.build());
+        addManyPartitionsNoException(client, dbName, tableName, null, Collections.singletonList("part_col"), 100);
+
+        // simulate the partitions of each table which its stats has an old "lastAnalyzed"
+        List<Partition> partitions = client.listPartitions(dbName, tableName);
+        for (Partition partition : partitions) {
+          Map<String, String> params = partition.getParameters();
+          // to manually change the "lastAnalyzed" to an old time, ex. 400 days
+          params.put("lastAnalyzed", String.valueOf(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)));
+        }
+        client.alterPartitions(dbName, tableName, partitions);
+      }

Review Comment:
   The benchmark simulates expired stats by updating the partition parameter `lastAnalyzed`, but `StatisticsManagementTask` currently queries `MTableColumnStatistics.lastAnalyzed` (TAB_COL_STATS) instead of partition/table parameters. As a result, this setup (and the post-run assertion that `lastAnalyzed` is removed from partition parameters) is not exercising the code path being benchmarked; update the benchmark to expire the same metadata the task uses (or update the task to use the same source of truth).
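The mismatch reduces to two maps standing in for the two metadata stores: the setup writes to one, the task reads the other, so the task never observes the expiry. (`partitionParams` / `tabColStats` here are just labels for this sketch, not real table access.)

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a source-of-truth mismatch: the setup ages entries in one store
// while the cleanup task consults another, so nothing ever expires.
public class StoreMismatchDemo {
    // The task's expiry rule, applied to whatever timestamp store it reads.
    static boolean wouldDelete(long lastAnalyzedMillis, long nowMillis, long retentionMillis) {
        return lastAnalyzedMillis < nowMillis - retentionMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long retention = 365L * 24 * 60 * 60 * 1000;         // ~1 year
        long old = now - 400L * 24 * 60 * 60 * 1000;         // ~400 days ago

        Map<String, Long> partitionParams = new HashMap<>(); // where the setup writes
        Map<String, Long> tabColStats = new HashMap<>();     // where the task reads
        tabColStats.put("db.tbl", now);                      // the stats row is never aged
        partitionParams.put("db.tbl", old);                  // only the wrong store is aged

        // The task consults tabColStats, so the aged partitionParams entry is invisible to it.
        System.out.println(wouldDelete(tabColStats.get("db.tbl"), now, retention));     // false
        System.out.println(wouldDelete(partitionParams.get("db.tbl"), now, retention)); // true
    }
}
```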



##########
standalone-metastore/metastore-tools/metastore-benchmarks/src/main/java/org/apache/hadoop/hive/metastore/tools/HMSBenchmarks.java:
##########
@@ -695,4 +697,64 @@ static DescriptiveStatistics benchmarkPartitionManagement(@NotNull MicroBenchmar
     }
   }
 
+  static DescriptiveStatistics benchmarkStatisticsManagement(@NotNull MicroBenchmark bench,
+                                                             @NotNull BenchData data,
+                                                             int tableCount) {
+
+    String dbName = data.dbName + "_" + tableCount;
+    String tableNamePrefix = data.tableName;
+    final HMSClient client = data.getClient();
+    final StatisticsManagementTask statsTask = new StatisticsManagementTask();
+    try {
+      client.getHadoopConf().set("hive.metastore.uris", client.getServerURI().toString());
+      client.getHadoopConf().set("metastore.statistics.management.database.pattern", dbName);
+      statsTask.setConf(client.getHadoopConf());
+
+      client.createDatabase(dbName);
+      for (int i = 0; i < tableCount; i++) {
+        String tableName = tableNamePrefix + "_" + i;
+        Util.TableBuilder tableBuilder = new Util.TableBuilder(dbName, tableName)
+                .withType(TableType.MANAGED_TABLE)
+                .withColumns(createSchema(Arrays.asList("col1:string", "col2:int")))
+                .withPartitionKeys(createSchema(Collections.singletonList("part_col")))
+                .withParameter("columnStatsAccurate", "true");
+

Review Comment:
   Table/partition stats accuracy is tracked under the `COLUMN_STATS_ACCURATE` parameter (see `StatsSetupConst.COLUMN_STATS_ACCURATE`), but the benchmark sets `columnStatsAccurate`. If this parameter is intended to make the metastore think column stats are present/accurate, the current key likely won't be recognized; use the correct constant/key.



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/StatisticsManagementTask.java:
##########
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.metastore;
+
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;

Review Comment:
   `java.util.Map` is imported but not used in this class; please remove the unused import to keep the file clean (and to satisfy builds that fail on unused imports/checkstyle).



##########
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java:
##########
@@ -0,0 +1,239 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.metastore;
+
+import static org.apache.hadoop.hive.metastore.Warehouse.DEFAULT_CATALOG_NAME;
+import static org.apache.hadoop.hive.metastore.Warehouse.DEFAULT_DATABASE_NAME;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.annotation.MetastoreUnitTest;
+import org.apache.hadoop.hive.metastore.api.ColumnStatistics;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsData;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsDesc;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj;
+import org.apache.hadoop.hive.metastore.api.Database;
+import org.apache.hadoop.hive.metastore.api.DoubleColumnStatsData;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.hadoop.hive.metastore.client.builder.DatabaseBuilder;
+import org.apache.hadoop.hive.metastore.client.builder.TableBuilder;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf.ConfVars;
+import org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge;
+import org.apache.hadoop.hive.metastore.utils.TestTxnDbUtil;
+import org.apache.thrift.TException;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+
+import javax.jdo.PersistenceManager;
+import javax.jdo.Query;
+
+@Category(MetastoreUnitTest.class)
+public class TestStatisticsManagement {
+
+    private IMetaStoreClient client;
+    private Configuration conf;
+
+    @Before
+    public void setUp() throws Exception {
+        conf = MetastoreConf.newMetastoreConf();
+        MetastoreConf.setVar(conf, ConfVars.METASTORE_METADATA_TRANSFORMER_CLASS, " ");
+        MetaStoreTestUtils.setConfForStandloneMode(conf);
+        conf.setBoolean(ConfVars.MULTITHREADED.getVarname(), false);
+        conf.setBoolean(ConfVars.HIVE_IN_TEST.getVarname(), true);
+
+        // enable stats auto deletion, set up a short retention so threshold check triggers easily
+        MetastoreConf.setBoolVar(conf, ConfVars.STATISTICS_AUTO_DELETION, true);
+        MetastoreConf.setTimeVar(conf, ConfVars.STATISTICS_RETENTION_PERIOD, 1, TimeUnit.DAYS);
+
+        MetaStoreTestUtils.startMetaStoreWithRetry(HadoopThriftAuthBridge.getBridge(), conf);
+        TestTxnDbUtil.setConfValues(conf);
+        TestTxnDbUtil.prepDb(conf);
+
+        client = new HiveMetaStoreClient(conf);
+    }
+
+    @After
+    public void tearDown() throws Exception {
+        if (client != null) {
+            // Drop any leftover databases, similar to TestPartitionManagement.java
+            List<String> dbs = client.getAllDatabases(DEFAULT_CATALOG_NAME);
+            for (String db : dbs) {
+                if (!db.equalsIgnoreCase(DEFAULT_DATABASE_NAME)) {
+                    client.dropDatabase(DEFAULT_CATALOG_NAME, db, true, false, true);
+                }
+            }
+        }
+        try {
+            if (client != null) {
+                client.close();
+            }
+        } finally {
+            client = null;
+        }
+    }
+
+    @Test
+    public void testExpiredTableColStatsAreDeleted() throws Exception {
+        String dbName = "stats_db1";
+        String tableName = "tbl1";
+        createDbAndTable(dbName, tableName, false);
+        // create a column stats entry (table-level)
+        writeTableLevelColStats(dbName, tableName, "c1");
+        // ensure stats exists
+        assertHasTableColStats(dbName, tableName, "c1");
+        // make lastAnalyzed older than the threshold
+        makeAllTableColStatsOlderThanRetention(dbName, tableName);
+
+        runStatisticsManagementTask(conf);
+        assertNoTableColStats(dbName, tableName, "c1");
+    }
+
+    @Test
+    public void testExcludedTableStatsAreNotDeleted() throws Exception {
+        String dbName = "stats_db2";
+        String tableName = "tbl2";
+        // Create a database and table that ARE excluded from auto deletion.
+        createDbAndTable(dbName, tableName, true);
+        writeTableLevelColStats(dbName, tableName, "c1");
+        assertHasTableColStats(dbName, tableName, "c1");
+
+        // Manually set lastAnalyzed to a very old timestamp so it would normally be expired.
+        makeAllTableColStatsOlderThanRetention(dbName, tableName);
+
+        runStatisticsManagementTask(conf);
+
+        // Verify that stats are still present because the table is excluded.
+        assertHasTableColStats(dbName, tableName, "c1");
+    }
+
+    private void createDbAndTable(String dbName, String tableName, boolean exclude) throws Exception {
+        Database db;
+        if (!DEFAULT_DATABASE_NAME.equals(dbName)) {
+            db = new DatabaseBuilder()
+                    .setName(dbName)
+                    .setCatalogName(DEFAULT_CATALOG_NAME)
+                    .create(client, conf);
+        } else {
+            db = client.getDatabase(DEFAULT_CATALOG_NAME, DEFAULT_DATABASE_NAME);
+        }
+
+        // Build a simple test table with two columns.
+        TableBuilder tb = new TableBuilder()
+                .inDb(db)
+                .setTableName(tableName)
+                .addCol("c1", "double")
+                .addCol("c2", "string")
+                .setInputFormat("org.apache.hadoop.hive.ql.io.orc.OrcInputFormat")
+                .setOutputFormat("org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat");
+
+        Table t = tb.build(conf);
+
+        // If requested, mark this table as excluded from automatic stats deletion.
+        if (exclude) {
+            t.getParameters().put(StatisticsManagementTask.STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY, "true");
+        }
+        client.createTable(t);
+        client.flushCache();
+    }
+
+    private void writeTableLevelColStats(String db, String tbl, String col) throws TException {
+        // Create a stats object for one column.
+        ColumnStatisticsObj obj = new ColumnStatisticsObj();
+        obj.setColName(col);
+        obj.setColType("double");
+
+        // Fill in minimal double-column statistics data.
+        DoubleColumnStatsData doubleData = new DoubleColumnStatsData();
+        doubleData.setNumNulls(0);
+        doubleData.setNumDVs(1);
+        doubleData.setLowValue(1.0);
+        doubleData.setHighValue(1.0);
+
+        ColumnStatisticsData data = new ColumnStatisticsData();
+        data.setDoubleStats(doubleData);
+        obj.setStatsData(data);
+
+        ColumnStatistics cs = new ColumnStatistics();
+        ColumnStatisticsDesc desc = new ColumnStatisticsDesc(true, db, tbl);
+        desc.setCatName("hive");
+        cs.addToStatsObj(obj);
+
+        client.updateTableColumnStatistics(cs);
+    }

Review Comment:
   `ColumnStatisticsDesc desc` is created but never attached to the `ColumnStatistics` object. `updateTableColumnStatistics` expects the stats descriptor (db/table/cat/lastAnalyzed/etc.) to be set (see other tests like `TestHiveMetaStore`); call `cs.setStatsDesc(desc)` before updating, otherwise the update can fail or write incomplete stats.



##########
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestStatisticsManagement.java:
##########
@@ -0,0 +1,239 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * <p>
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * <p>
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.metastore;
+
+import static org.apache.hadoop.hive.metastore.Warehouse.DEFAULT_CATALOG_NAME;
+import static org.apache.hadoop.hive.metastore.Warehouse.DEFAULT_DATABASE_NAME;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import java.util.List;
+import java.util.concurrent.TimeUnit;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.annotation.MetastoreUnitTest;
+import org.apache.hadoop.hive.metastore.api.ColumnStatistics;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsData;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsDesc;
+import org.apache.hadoop.hive.metastore.api.ColumnStatisticsObj;
+import org.apache.hadoop.hive.metastore.api.Database;
+import org.apache.hadoop.hive.metastore.api.DoubleColumnStatsData;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.hadoop.hive.metastore.client.builder.DatabaseBuilder;
+import org.apache.hadoop.hive.metastore.client.builder.TableBuilder;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf;
+import org.apache.hadoop.hive.metastore.conf.MetastoreConf.ConfVars;
+import org.apache.hadoop.hive.metastore.security.HadoopThriftAuthBridge;
+import org.apache.hadoop.hive.metastore.utils.TestTxnDbUtil;
+import org.apache.thrift.TException;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+
+import javax.jdo.PersistenceManager;
+import javax.jdo.Query;
+
+@Category(MetastoreUnitTest.class)
+public class TestStatisticsManagement {
+
+    private IMetaStoreClient client;
+    private Configuration conf;
+
+    @Before
+    public void setUp() throws Exception {
+        conf = MetastoreConf.newMetastoreConf();
+        MetastoreConf.setVar(conf, ConfVars.METASTORE_METADATA_TRANSFORMER_CLASS, " ");
+        MetaStoreTestUtils.setConfForStandloneMode(conf);
+        conf.setBoolean(ConfVars.MULTITHREADED.getVarname(), false);
+        conf.setBoolean(ConfVars.HIVE_IN_TEST.getVarname(), true);
+
+        // enable stats auto deletion, set up a short retention so threshold check triggers easily
+        MetastoreConf.setBoolVar(conf, ConfVars.STATISTICS_AUTO_DELETION, true);
+        MetastoreConf.setTimeVar(conf, ConfVars.STATISTICS_RETENTION_PERIOD, 1, TimeUnit.DAYS);
+
+        MetaStoreTestUtils.startMetaStoreWithRetry(HadoopThriftAuthBridge.getBridge(), conf);
+        TestTxnDbUtil.setConfValues(conf);
+        TestTxnDbUtil.prepDb(conf);
+
+        client = new HiveMetaStoreClient(conf);
+    }
+
+    @After
+    public void tearDown() throws Exception {
+        if (client != null) {
+            // Drop any leftover databases, similar to TestPartitionManagement.java
+            List<String> dbs = client.getAllDatabases(DEFAULT_CATALOG_NAME);
+            for (String db : dbs) {
+                if (!db.equalsIgnoreCase(DEFAULT_DATABASE_NAME)) {
+                    client.dropDatabase(DEFAULT_CATALOG_NAME, db, true, false, true);
+                }
+            }
+        }
+        try {
+            if (client != null) {
+                client.close();
+            }
+        } finally {
+            client = null;
+        }
+    }
+
+    @Test
+    public void testExpiredTableColStatsAreDeleted() throws Exception {
+        String dbName = "stats_db1";
+        String tableName = "tbl1";
+        createDbAndTable(dbName, tableName, false);
+        // create a column stats entry (table-level)
+        writeTableLevelColStats(dbName, tableName, "c1");
+        // ensure stats exists
+        assertHasTableColStats(dbName, tableName, "c1");
+        // make lastAnalyzed older than the threshold
+        makeAllTableColStatsOlderThanRetention(dbName, tableName);
+
+        runStatisticsManagementTask(conf);
+        assertNoTableColStats(dbName, tableName, "c1");
+    }
+
+    @Test
+    public void testExcludedTableStatsAreNotDeleted() throws Exception {
+        String dbName = "stats_db2";
+        String tableName = "tbl2";
+        // Create a database and table that ARE excluded from auto deletion.
+        createDbAndTable(dbName, tableName, true);
+        writeTableLevelColStats(dbName, tableName, "c1");
+        assertHasTableColStats(dbName, tableName, "c1");
+
+        // Manually set lastAnalyzed to a very old timestamp so it would normally be expired.
+        makeAllTableColStatsOlderThanRetention(dbName, tableName);
+
+        runStatisticsManagementTask(conf);
+
+        // Verify that stats are still present because the table is excluded.
+        assertHasTableColStats(dbName, tableName, "c1");
+    }
+
+    private void createDbAndTable(String dbName, String tableName, boolean exclude) throws Exception {
+        Database db;
+        if (!DEFAULT_DATABASE_NAME.equals(dbName)) {
+            db = new DatabaseBuilder()
+                    .setName(dbName)
+                    .setCatalogName(DEFAULT_CATALOG_NAME)
+                    .create(client, conf);
+        } else {
+            db = client.getDatabase(DEFAULT_CATALOG_NAME, DEFAULT_DATABASE_NAME);
+        }
+
+        // Build a simple test table with two columns.
+        TableBuilder tb = new TableBuilder()
+                .inDb(db)
+                .setTableName(tableName)
+                .addCol("c1", "double")
+                .addCol("c2", "string")
+                .setInputFormat("org.apache.hadoop.hive.ql.io.orc.OrcInputFormat")
+                .setOutputFormat("org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat");
+
+        Table t = tb.build(conf);
+
+        // If requested, mark this table as excluded from automatic stats deletion.
+        if (exclude) {
+            t.getParameters().put(StatisticsManagementTask.STATISTICS_AUTO_DELETION_EXCLUDE_TBLPROPERTY, "true");
+        }
+        client.createTable(t);
+        client.flushCache();
+    }
+
+    private void writeTableLevelColStats(String db, String tbl, String col) throws TException {
+        // Create a stats object for one column.
+        ColumnStatisticsObj obj = new ColumnStatisticsObj();
+        obj.setColName(col);
+        obj.setColType("double");
+
+        // Fill in minimal double-column statistics data.
+        DoubleColumnStatsData doubleData = new DoubleColumnStatsData();
+        doubleData.setNumNulls(0);
+        doubleData.setNumDVs(1);
+        doubleData.setLowValue(1.0);
+        doubleData.setHighValue(1.0);
+
+        ColumnStatisticsData data = new ColumnStatisticsData();
+        data.setDoubleStats(doubleData);
+        obj.setStatsData(data);
+
+        ColumnStatistics cs = new ColumnStatistics();
+        ColumnStatisticsDesc desc = new ColumnStatisticsDesc(true, db, tbl);
+        desc.setCatName("hive");
+        cs.addToStatsObj(obj);
+
+        client.updateTableColumnStatistics(cs);
+    }
+
+    private void assertHasTableColStats(String db, String tbl, String col) throws TException {
+        List<ColumnStatisticsObj> objs = client.getTableColumnStatistics(db, tbl, List.of(col), "hive");
+        assertTrue("Expected stats for " + db + "." + tbl + "." + col, objs != null && !objs.isEmpty());
+    }
+
+    private void assertNoTableColStats(String db, String tbl, String col) throws TException {
+        try {
+            List<ColumnStatisticsObj> objs = client.getTableColumnStatistics(db, tbl, List.of(col), "hive");
+            assertTrue("Expected no stats for " + db + "." + tbl + "." + col, objs == null || objs.isEmpty());
+        } catch (NoSuchObjectException e) {
+            // acceptable: server may throw if stats absent depending on impl
+        }
+    }
+
+    private void makeAllTableColStatsOlderThanRetention(String db, String tbl) throws Exception {
+        // We update via ObjectStore/PM directly to avoid relying on params "lastAnalyzed".
+        RawStore ms = HMSHandler.getMSForConf(conf);
+        ObjectStore os = (ObjectStore) ms;
+        os.setConf(conf);
+
+        // Compute an old timestamp in seconds, here we use 400 days ago.
+        long oldSeconds = (System.currentTimeMillis() - TimeUnit.DAYS.toMillis(400)) / 1000;
+
+        // NOTE: exact JDO classes/field paths sometimes vary; adjust filter if needed based on MTableColumnStatistics mapping

Review Comment:
   This test includes a TODO-style note about JDO field paths varying ("adjust filter if needed"). Since this is committed test code, it should be deterministic against the actual model mapping in this repo; please either remove the note or replace it with a concrete explanation of why this query/filter is correct here.
   ```suggestion
            // In this repository's JDO model, MTableColumnStatistics points to MTable via the
            // "table" field, and MTable exposes the table and database names as "tableName" and
            // "database.name". This filter therefore selects exactly the statistics rows for the
            // target table in the target database.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

