[ 
https://issues.apache.org/jira/browse/HIVE-24870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-24870:
--------------------------------
    Description: 
HIVE-2246 introduced CD_ID to optimize the metastore db (details there). 
ObjectStore.removeUnusedColumnDescriptor is a maintenance task that is called 
in every alter-partition type of operation. During replication, 
alterPartition can be a heavy path, and there is no direct advantage to running 
removeUnusedColumnDescriptor immediately. Moreover, it issues a query of this kind:
{code}
select count(*) from "SDS" where "CD_ID"=12345;
{code}
which can take a relatively long time compared to the alter partition 
operation itself. 

https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L4982
{code}
      query = pm.newQuery("select count(1) from " +
        "org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)");
      query.declareParameters("MColumnDescriptor inCD");
      long count = ((Long)query.execute(oldCD)).longValue();

      //if no other SD references this CD, we can throw it out.
      if (count == 0) {
{code}

My proposal is to run this cleanup asynchronously in batches, at a 
configurable interval (seconds, minutes, etc.).
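
To make the idea concrete, here is a minimal sketch of what a batched cleaner could look like. It is not taken from ObjectStore: the class name BatchedDescriptorCleaner, its methods, and the isReferenced/delete callbacks are hypothetical stand-ins (isReferenced plays the role of the count query above, delete the role of dropping the column descriptor). It assumes alter partition only records a candidate CD id and a background task checks candidates later.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;
import java.util.function.LongPredicate;

/** Hypothetical sketch only; none of these names exist in the metastore. */
final class BatchedDescriptorCleaner {
    // CD ids recorded by the hot path (e.g. alter partition) for a later check.
    private final ConcurrentLinkedQueue<Long> candidates = new ConcurrentLinkedQueue<>();
    private final int batchSize;
    // Stand-in for the "is any SD still pointing at this CD?" count query.
    private final LongPredicate isReferenced;
    // Stand-in for the actual removal of the column descriptor.
    private final LongConsumer delete;

    BatchedDescriptorCleaner(int batchSize, LongPredicate isReferenced, LongConsumer delete) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.isReferenced = isReferenced;
        this.delete = delete;
    }

    /** Cheap call for the hot path: remember the candidate, do no query. */
    void enqueue(long cdId) {
        candidates.add(cdId);
    }

    /** Checks at most batchSize candidates and returns how many were deleted. */
    int runBatch() {
        int removed = 0;
        for (int i = 0; i < batchSize; i++) {
            Long cdId = candidates.poll();
            if (cdId == null) {
                break; // nothing left to check
            }
            if (!isReferenced.test(cdId)) {
                delete.accept(cdId);
                removed++;
            }
        }
        return removed;
    }

    /** Runs runBatch repeatedly, waiting the configured interval between runs. */
    ScheduledFuture<?> start(ScheduledExecutorService scheduler, long interval, TimeUnit unit) {
        return scheduler.scheduleWithFixedDelay(this::runBatch, interval, interval, unit);
    }
}
```

With something like this, the alter partition path would only call enqueue, and the interval and batch size would come from configuration. A real implementation would also have to check and delete inside one metastore transaction so that a CD that gets referenced again in the meantime is not removed; the sketch does not cover that.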

  was:
HIVE-2246 introduced CD_ID to optimize the metastore db (details there). 
ObjectStore.removeUnusedColumnDescriptor is a maintenance task that is called 
in every alter-partition type of operation. During replication, 
alterPartition can be a heavy path, and there is no direct advantage to running 
removeUnusedColumnDescriptor immediately. Moreover, it issues a query of this kind:
{code}
select count(*) from "SDS" where "CD_ID"=12345;
{code}
which can take a relatively long time compared to the alter partition 
operation itself. 

{code}
      query = pm.newQuery("select count(1) from " +
        "org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)");
      query.declareParameters("MColumnDescriptor inCD");
{code}

My proposal is to run this cleanup asynchronously in batches, at a 
configurable interval (seconds, minutes, etc.).


> Metastore: cleanup unused column descriptors asynchronously in batches
> ----------------------------------------------------------------------
>
>                 Key: HIVE-24870
>                 URL: https://issues.apache.org/jira/browse/HIVE-24870
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>
> HIVE-2246 introduced CD_ID to optimize the metastore db (details there). 
> ObjectStore.removeUnusedColumnDescriptor is a maintenance task that is called 
> in every alter-partition type of operation. During replication, 
> alterPartition can be a heavy path, and there is no direct advantage to running 
> removeUnusedColumnDescriptor immediately. Moreover, it issues a query of this kind:
> {code}
> select count(*) from "SDS" where "CD_ID"=12345;
> {code}
> which can take a relatively long time compared to the alter partition 
> operation itself. 
> https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L4982
> {code}
>       query = pm.newQuery("select count(1) from " +
>         "org.apache.hadoop.hive.metastore.model.MStorageDescriptor where (this.cd == inCD)");
>       query.declareParameters("MColumnDescriptor inCD");
>       long count = ((Long)query.execute(oldCD)).longValue();
>       //if no other SD references this CD, we can throw it out.
>       if (count == 0) {
> {code}
> My proposal is to run this cleanup asynchronously in batches, at a 
> configurable interval (seconds, minutes, etc.).


