Kadir Ozdemir created PHOENIX-6832:
--------------------------------------

             Summary: Uncovered Global Secondary Indexes
                 Key: PHOENIX-6832
                 URL: https://issues.apache.org/jira/browse/PHOENIX-6832
             Project: Phoenix
          Issue Type: Improvement
            Reporter: Kadir Ozdemir


An index can be called a covered index if the index cannot serve a query alone. 
The sole purpose of an uncovered index would be identifying the data table rows 
to be scanned for the query.  This implies that the DDL for an uncovered index 
does not have the INCLUDE clause. 

Then an index is called a covered index if the index can serve a query alone. 
Please note that a covered index does not mean that it can cover all queries. 
It just means that it can cover a query. A covered index can still cover some 
queries even if the index DDL does not have the INCLUDE clause. This is because 
if a given query may reference to only PK and/or indexed columns, and thus a 
covered index without any included columns can serve this query by itself 
(i.e.,  without joining index rows with data table rows). Another use case for 
covered indexes without included columns is the count(*) queries. Currently 
Phoenix uses indexes for count(*) queries by default.

Since uncovered indexes will be used to identify data table rows affected by a 
given query and the column values will be picked up from the data table, we can 
provide a solution that is much simpler than the solution for covered indexes 
by taking the advantage of the fact that the data table is the source of truth, 
and an index table is used to only map secondary keys to the primary keys to 
eliminate full table scans. The correctness of such a solution is ensured if 
for every data table row, there exists an index row. Then our solution to 
update the data tables and their indexes in a consistent fashion for global 
secondary indexes would be a two-phase update approach, where we first insert 
the index table rows, and only if they are successful, then we update the data 
table rows. 

This approach does not require reading the existing data table rows which is 
currently required for covered indexes. Also, it does not require two-phase 
commit writes for updating and maintaining global secondary index table rows. 
Eliminating a data table read operation and an RPC call to update the index row 
verification status on the corresponding index row would cut down index write 
latency overhead by at least 50% for global uncovered indexes when compared to 
global covered indexes. This is because global covered indexes require one data 
table read and two index write operation for every data table update whereas 
global uncovered indexes would require only one index write. For batch writes, 
the expected performance and latency improvement would be much higher than 50% 
since a batch of random row updates would not anymore require random seeks on 
the data table for reading existing data table rows.

PHOENIX-6458, PHOENIX-6501 and PHOENIX-6663 improve the performance and 
efficiency of joining index rows with their data table rows when a covered 
index cannot cover a given query. We can further leverage it to support 
uncovered indexes. 

The uncovered indexes would be a significant performance improvement for write 
intensive workloads. Also a common use case where uncovered indexes will be 
desired is the upsert select use case on the data table, where a subset of rows 
are updated in a batch. In this use case, the select query performance is 
greatly improved via a covered index but the upsert part suffers due to the 
covered index write overhead especially when the selected data table rows are 
not consecutively stored on disk which is the most common case.

As mentioned before, the DDL for index creation does not include the INCLUDE 
clause. We can add UNCOVERED key word to indicate the index to be created is an 
uncovered index, for example, CREATE UNCOVERED INDEX. 

As in the case of covered indexes, we can do read repair for uncovered indexes 
too. The difference is that instead of using the verify status for index rows, 
we would check if the corresponding data table row exists for a given index 
row. Since we would always retrieve the data table rows to join back with index 
rows for uncovered indexes, the read repair cost would occur only for deleting 
invalid index rows. Also, the existing index reverse verification and repair 
feature supported by IndexTool can be used to do bulk repair operation time to 
time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to