Xiao Li created SPARK-17030:
-------------------------------

             Summary: Remove/Cleanup HiveMetastoreCatalog.scala
                 Key: SPARK-17030
                 URL: https://issues.apache.org/jira/browse/SPARK-17030
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Xiao Li


Metadata cache is a key-value cache built on Google Guava Cache to speed up 
building logical plan nodes (`LogicalRelation`) for data source tables. The 
cache key is a unique identifier of a table; here, the identifier is the fully 
qualified table name, including the database in which it resides. (In the 
future, it could be extended to multi-part names when introducing a federated 
catalog.) The value is the corresponding `LogicalRelation` that represents a 
specific data source table.
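
For illustration, here is a minimal sketch of how such a Guava-backed cache 
could look. `QualifiedTableName` and `buildLogicalRelation` are hypothetical 
stand-ins for this sketch, not the exact Spark types:

{noformat}
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical cache key: the fully qualified table name.
case class QualifiedTableName(database: String, name: String)

// Hypothetical helper that decodes the metadata fetched from the external
// catalog into a LogicalRelation.
def buildLogicalRelation(in: QualifiedTableName): LogicalPlan = ???

val cachedDataSourceTables: LoadingCache[QualifiedTableName, LogicalPlan] =
  CacheBuilder.newBuilder()
    .maximumSize(1000) // the size bound here is illustrative
    .build(new CacheLoader[QualifiedTableName, LogicalPlan] {
      // Called on a cache miss to load the value for the given key.
      override def load(in: QualifiedTableName): LogicalPlan =
        buildLogicalRelation(in)
    })
{noformat}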
The cache is session based. In each session, the cache is managed in two 
different ways at the same time:

1. **Auto loading**: when Spark queries the cache for a user-defined data 
source table, the cache either returns a cached `LogicalRelation` or 
automatically builds a new one by decoding the metadata fetched from the 
external catalog. 
2. **Manual caching**: Hive tables are represented as logical plan nodes of 
type `MetastoreRelation`. For better performance, we convert Hive serde tables 
to data source tables, if convertible. The conversion is not done at the 
metadata-loading stage; instead, it is conducted during semantic analysis. If a 
Hive serde table is convertible, we first try to get the value (by the fully 
qualified table name) from the metadata cache. If it exists, we use it 
directly; otherwise, we build a new one and also push it into the cache for 
future reuse. Both access paths are sketched below.
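
A minimal sketch of the two access paths, reusing the `cachedDataSourceTables` 
sketch above; the example key and the serde-conversion helper are hypothetical:

{noformat}
val tableName = QualifiedTableName("default", "events") // example key

// 1. Auto loading: get() invokes the CacheLoader on a miss and caches the
// result before returning it.
val plan: LogicalPlan = cachedDataSourceTables.get(tableName)

// Hypothetical helper for the serde-to-data-source conversion performed
// during semantic analysis.
def convertHiveSerdeTable(key: QualifiedTableName): LogicalPlan = ???

// 2. Manual caching: probe the cache first; on a miss, build the converted
// relation ourselves and push it into the cache for future reuse.
val converted: LogicalPlan =
  Option(cachedDataSourceTables.getIfPresent(tableName)).getOrElse {
    val newPlan = convertHiveSerdeTable(tableName)
    cachedDataSourceTables.put(tableName, newPlan)
    newPlan
  }
{noformat}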

Currently, the file `HiveMetastoreCatalog.scala` contains a number of different 
entities/functions, grouped together only because all of them interact with the 
cache, called `cachedDataSourceTables`. This PR is to clean up 
`HiveMetastoreCatalog.scala`. 

**Proposal**: To avoid mixing everything related to the cache into the same 
file, we abstract and define the following API for cache operations. After the 
code changes, `HiveMetastoreCatalog.scala` will only contain the cache API 
implementation, and the file can then be renamed to `MetadataCache.scala`.

{noformat}
// cacheTable is a wrapper of cache.put(key, value). It associates value with
// key in this cache.
// If the cache previously contained a value associated with key, the old
// value is replaced by value.
def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit
{noformat}

{noformat}
// getTableIfPresent is a wrapper of cache.getIfPresent(key) that never causes
// values to be automatically loaded.
def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan]
{noformat}

{noformat}
// getTable is a wrapper of cache.get(key). On a cache miss, a cache built
// with a CacheLoader will call CacheLoader.load(K) to load the new value
// into the cache.
def getTable(tableIdent: TableIdentifier): LogicalPlan
{noformat}

{noformat}
// refreshTable is a wrapper of cache.invalidate. It does not eagerly reload
// the cache; it just invalidates the entry. The next time the table is used,
// it will be loaded into the cache again.
def refreshTable(tableIdent: TableIdentifier): Unit
{noformat}

{noformat}
// Discards all entries in the cache. It is a wrapper of cache.invalidateAll.
def invalidateAll(): Unit
{noformat}
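
For concreteness, here is a minimal sketch of how these five wrappers could map 
onto the Guava calls, reusing the hypothetical `QualifiedTableName` key from the 
sketch above; the `currentDatabase` fallback for unqualified names is also an 
assumption, not the final implementation:

{noformat}
import org.apache.spark.sql.catalyst.TableIdentifier

class MetadataCache(
    cache: LoadingCache[QualifiedTableName, LogicalPlan],
    currentDatabase: () => String) { // hypothetical session-database accessor

  // TableIdentifier.database is an Option; fall back to the current database
  // for unqualified table names.
  private def toKey(tableIdent: TableIdentifier): QualifiedTableName =
    QualifiedTableName(
      tableIdent.database.getOrElse(currentDatabase()), tableIdent.table)

  def cacheTable(tableIdent: TableIdentifier, plan: LogicalPlan): Unit =
    cache.put(toKey(tableIdent), plan)

  def getTableIfPresent(tableIdent: TableIdentifier): Option[LogicalPlan] =
    Option(cache.getIfPresent(toKey(tableIdent)))

  def getTable(tableIdent: TableIdentifier): LogicalPlan =
    cache.get(toKey(tableIdent))

  def refreshTable(tableIdent: TableIdentifier): Unit =
    cache.invalidate(toKey(tableIdent))

  def invalidateAll(): Unit =
    cache.invalidateAll()
}
{noformat}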

This PR also moves three Hive-specific Analyzer rules `CreateTables`, 
`OrcConversions` and `ParquetConversions` from `HiveMetastoreCatalog.scala` to 
`HiveStrategies.scala`. 



