Kadir Ozdemir created PHOENIX-7018:
--------------------------------------

             Summary: Server side index maintainer caching for read, write, and replication
                 Key: PHOENIX-7018
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7018
             Project: Phoenix
          Issue Type: Improvement
            Reporter: Kadir Ozdemir
The relationship between a data table and its index is somewhat involved. Phoenix needs to transform a data table row to the corresponding index row, extract a data table row key (i.e., a primary key) from an index table row key (a secondary key), and map data table columns to index table included columns. The metadata for these operations, and the operations themselves, are encapsulated in the class called IndexMaintainer. Phoenix creates a separate IndexMaintainer object for each index table in memory on the client side. IndexMaintainer objects are then serialized using the protobuf library and sent to servers along with the mutations on the data tables and the scans on the index tables. The Phoenix server code (more accurately, Phoenix coprocessors) then uses IndexMaintainer objects to update indexes and to leverage indexes for queries. Phoenix coprocessors use the IndexMaintainer objects associated with a given batch of mutations or scan operation only once (i.e., for that batch or scan), yet the Phoenix client sends these objects along with every batch of mutations and every scan.

Secondary indexes are used to improve the performance of queries on the secondary index columns. Secondary indexes are required to be consistent with their data tables; consistency here means that the same result is returned regardless of whether a query is served from the data table or the index table. This consistency promise cannot be kept when data table rows and their index table rows are replicated independently, which is what happens with HBase level replication. HBase replicates the WALs (Write Ahead Logs) of region servers and replays these WALs at the destination cluster. A given data table row and the corresponding index table row are likely served by different region servers, and the WALs of these region servers are replicated independently. This means these rows can arrive at different times, which makes the data and index tables inconsistent at the destination cluster.

Replicating global indexes leads to inefficient use of the replication bandwidth due to the overhead of replicating data that can be derived from data that has already been replicated. When one considers that an index table is essentially a copy of its data table without the columns that are not included in the index, and that a given data table can have multiple indexes, it is easy to see that replicating indexes can easily double the replication bandwidth requirement for a given data table.

A solution for eliminating index table replication is to add just enough metadata to the WAL records for mutations on data tables with indexes, and to have a replication endpoint and a coprocessor endpoint generate index mutations from these records (please see PHOENIX-5315). This document extends that solution not only to eliminate replicating index tables but also to improve the read and write paths for index tables by maintaining consistent server side caching of index maintainers. The idea behind the proposed solution is to cache the index maintainers on the server side and thus eliminate transferring index maintainers during reads and writes as well as replication. The coprocessors that currently require the index maintainers are IndexRegionObserver for the write path and some other coprocessors, including GlobalIndexChecker for read repair, in the read path. The proposed solution leverages the existing capability of adding IndexMaintainer objects to the server side cache implemented by ServerCachingEndpointImpl.
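To make the per-operation overhead concrete, the following is a minimal sketch of what the client effectively does today for every scan (and, analogously, for every batch of mutations): the protobuf-serialized index metadata is attached as an operation attribute and deserialized again by the coprocessor. The attribute key and the helper method are illustrative placeholders, not the exact Phoenix identifiers.

{code:java}
import org.apache.hadoop.hbase.client.Scan;

public class AttachIndexMetaDataSketch {
    // Hypothetical attribute key; Phoenix uses its own constant for this.
    static final String INDEX_MD_ATTR = "IndexMaintainers";

    // serializedMaintainers: protobuf bytes produced on the client for the
    // IndexMaintainer(s) of the table being queried. The coprocessor later
    // deserializes this attribute for every scan, even though the metadata
    // rarely changes between operations.
    static Scan attachIndexMetaData(Scan scan, byte[] serializedMaintainers) {
        scan.setAttribute(INDEX_MD_ATTR, serializedMaintainers);
        return scan;
    }
}
{code}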
The design eliminates global index table replication and also eliminates updating the server side cache with IndexMaintainer objects for each batch write. IndexRegionObserver (the coprocessor that generates index mutations from data table mutations) needs access to the IndexMaintainer objects for the indexes on a table or view. The metadata transferred as a mutation attribute will be used to identify the table or view to which a mutation belongs. This metadata will include the tenant Id, table schema, and table name, and the cache key for the array of index maintainers for this table or view will be formed from it. When IndexRegionObserver intercepts a mutation on an HBase table (using the preBatchMutate coprocessor hook), it will form the cache key for the array of index maintainers and retrieve the array from the server cache (a minimal sketch of this lookup is given at the end of this description).

This design requires maintaining metadata caches at region servers. These caches need to be consistent, that is, they should not hold stale metadata. To ensure this, when MetaDataEndpointImpl updates index metadata, it will first invalidate the index maintainer caches. If the invalidation fails, then the metadata operation fails and a failure response is returned to the client. It is important to note that MetaDataEndpointImpl needs to use locking to serialize read and write operations on the metadata. After the caches are invalidated, the Phoenix coprocessors will attempt to retrieve IndexMaintainer objects from MetaDataEndpointImpl for the next mutation or scan operation; this retrieval has to wait for the ongoing metadata update transaction to complete.

For every query, including point lookup queries, the Phoenix client currently serializes an IndexMaintainer object, attaches it to the scan object as a scan attribute, and Phoenix coprocessors then deserialize it from the scan object. To eliminate this serialization/deserialization and save the network bandwidth spent on IndexMaintainer, the Phoenix client can instead pass the cache key for the array of IndexMaintainer objects and the name of the index, and the coprocessor can retrieve the array of IndexMaintainer objects from its server cache and identify the one for the given index (also sketched at the end of this description). If the array of IndexMaintainer objects is not in the cache, the coprocessor can construct it using the Phoenix client library and populate the server cache with it.

A mutation or batch of mutations on a data table requires updating all the indexes on that data table. For the cache to be efficient, we need to have the IndexMaintainer objects for all of these indexes. This is the reason this design chooses to cache the array of IndexMaintainer objects for a given table or view instead of caching individual IndexMaintainer objects.

This design achieves cache coherency, which impacts the availability of metadata operations. For a metadata operation to succeed, MetaDataEndpointImpl must first be able to invalidate the index maintainer caches on the region servers, so that Phoenix coprocessors retrieve the most recent metadata from MetaDataEndpointImpl when they need it. Depending on how table regions are distributed over region servers, a given metadata operation may require invalidating a subset or all of the server caches.
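The following is a minimal sketch of the proposed region-server-local cache for the write path, assuming the cache key is formed from the tenant Id, schema name, and table name carried as mutation attributes. The attribute keys, the class itself, and the buildFromMetaData placeholder are hypothetical; only the HBase and Phoenix classes referenced in the imports are existing identifiers.

{code:java}
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.phoenix.hbase.index.util.ImmutableBytesPtr;
import org.apache.phoenix.index.IndexMaintainer;
import org.apache.phoenix.util.SchemaUtil;

public class IndexMaintainerCacheSketch {
    // Hypothetical attribute keys set by the client on each mutation.
    static final String TENANT_ID_ATTR = "_TenantId";
    static final String SCHEMA_NAME_ATTR = "_SchemaName";
    static final String TABLE_NAME_ATTR = "_TableName";

    // One entry per table/view: the array of IndexMaintainer objects for all
    // indexes on that table or view.
    private final ConcurrentMap<ImmutableBytesPtr, List<IndexMaintainer>> cache =
            new ConcurrentHashMap<>();

    // Form the cache key from the metadata attached to the mutation.
    ImmutableBytesPtr cacheKey(Mutation m) {
        byte[] tenantId = m.getAttribute(TENANT_ID_ATTR);
        byte[] schema = m.getAttribute(SCHEMA_NAME_ATTR);
        byte[] table = m.getAttribute(TABLE_NAME_ATTR);
        byte[] fullName = SchemaUtil.getTableNameAsBytes(schema, table);
        return new ImmutableBytesPtr(
                Bytes.add(tenantId == null ? new byte[0] : tenantId, fullName));
    }

    // Called from the write path (e.g. the preBatchMutate hook): return the
    // cached maintainers, rebuilding and repopulating the entry on a miss.
    List<IndexMaintainer> getIndexMaintainers(Mutation m) {
        return cache.computeIfAbsent(cacheKey(m), this::buildFromMetaData);
    }

    // Called when MetaDataEndpointImpl updates index metadata; the next read
    // or write then repopulates the entry with the latest metadata.
    void invalidate(ImmutableBytesPtr key) {
        cache.remove(key);
    }

    // Placeholder: in the actual design this would construct the array of
    // IndexMaintainer objects from the latest metadata via the Phoenix
    // client library.
    private List<IndexMaintainer> buildFromMetaData(ImmutableBytesPtr key) {
        throw new UnsupportedOperationException("illustrative placeholder");
    }
}
{code}

Caching the whole array per table or view, as described above, lets a single cache lookup serve all index updates for a batch of mutations.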
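On the read path, the client would attach only the cache key and the index name to the scan instead of the serialized IndexMaintainer. A minimal sketch of the client-side attach, again with hypothetical attribute keys:

{code:java}
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanWithCacheKeySketch {
    // Hypothetical attribute keys; the coprocessor would use them to look the
    // IndexMaintainer up in its local cache (or rebuild the array on a miss).
    static final String CACHE_KEY_ATTR = "_IndexMaintainerCacheKey";
    static final String INDEX_NAME_ATTR = "_IndexName";

    static Scan attachCacheKey(Scan scan, byte[] cacheKey, String indexName) {
        scan.setAttribute(CACHE_KEY_ATTR, cacheKey);
        scan.setAttribute(INDEX_NAME_ATTR, Bytes.toBytes(indexName));
        return scan;
    }
}
{code}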
It is important to note that the general cluster availability is not impacted significantly, as metadata operations are rare compared to read/write operations on user data, both in number and in frequency.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)