virajjasani opened a new pull request, #2401:
URL: https://github.com/apache/phoenix/pull/2401

   Achieving consistently low latency at any scale is a major challenge for 
many critical web applications/services. This requires such applications to 
choose the right database. Distributed NoSQL databases like Apache Phoenix 
offer the scalability and throughput required for such critical workloads. 
However, when it comes to global secondary indexes, Phoenix provides strongly 
consistent (synchronous) indexes. Here, index updates are tightly coupled with 
the data table updates, meaning as the number of indexes grows, write latency 
on the data table can increase depending on the network overhead, and/or WAL 
replica availability of each index table. As a result, applications with high 
write volumes and multiple indexes can experience some throughput and 
availability degradation.
   
   The purpose of this Jira is to provide the Eventually Consistent Global 
Secondary Indexes. Here, index updates are managed separately from the data 
table updates. This keeps write latency on the data table consistently lower 
regardless of the number of indexes created on the data table. This allows high 
write volume applications to take advantage of the global secondary indexes in 
Phoenix without slowing down their writes, while accepting eventual consistency 
of the indexes.
   
   The design document attached to the Jira describes several possible 
approaches to achieve this, while finalizing two approaches to provide 
eventually consistent indexes.
   
   **Requirements for Eventually Consistent Indexes**
   
   1. Users should be able to create eventually consistent indexes for both 
covered and uncovered indexes.
   2. The SQL statement should include the CONSISTENCY clause to determine 
whether the given covered or uncovered index is strongly consistent or 
eventually consistent. By default, consider the given index as strongly 
consistent. CREATE INDEX <index-name> ON <data-table> ( <col1>,... <colN>) 
INCLUDE (<col1>,...<colN>) CONSISTENCY = EVENTUAL
   3. Users should be able to seamlessly update the CONSISTENCY property of the 
given index from strong to eventual and vice versa using ALTER INDEX SQL 
statement. (although the change of consistency update depends on the 
UPDATE_CACHE_FREQUENCY used at the table level) ALTER INDEX <index-name> ON 
<data-table> CONSISTENCY = EVENTUAL
   4. Depending on the use cases, data tables can consist of the mix of zero or 
more strongly consistent indexes and zero or more eventually consistent indexes.
   5. Index verification MapReduce jobs should work for the eventually 
consistent global secondary indexes similar to how they work for the strongly 
consistent global secondary indexes.
   6. Concurrent mutations on the data table should work for eventually 
consistent indexes.
   7. Data table mutations need to produce and store the time ordered metadata 
(change records) for consumers to replay them and perform the index mutation 
RPCs.
   8. Updates to eventually consistent indexes should mirror the pre-index and 
post-index update semantics of strongly consistent updates. However, the 
separate RPCs for pre-index and post-index updates can be combined into a 
single RPC call. For instance, if the data table update failed, the consumer 
should update corresponding indexes with unverified rows (pre-index updates) 
only. If the data table update succeeded, the consumer should update 
corresponding indexes with verified rows (post-index update) only. The consumer 
does not need to perform both pre and post index update RPCs on the indexes.
   9. To improve the scale of index updates, mutations on indexes should be 
executed by consuming ordered change records per table region. This allows for 
parallel processing across all table regions.
   10. Once the data table region splits or merges into new daughter regions, 
any remaining ordered change records from the closed parent region should be 
processed before consuming newly generated change records for the new daughter 
regions.
   
   
   [Design 
doc](https://docs.google.com/document/d/1-bInfjnc0Yg_Aoqs8zc8-HQQzNrKgLtkodEJQxOkOCg/edit?usp=sharing)
   
   Two approaches are finalized for the eventually consistent global secondary 
indexes:
   
   `phoenix.index.cdc.mutation.serialize` controls which approach is used for 
implementing eventually consistent global secondary indexes.
   
   **Approach 1: Serialized mutations (value = true)**
   
   During preBatchMutate(), IndexRegionObserver generates index mutations for 
each data table mutation and serializes them into a Protobuf IndexMutations 
message. This serialized payload is written as a column value in the CDC index 
table row alongside the CDC event. The IndexCDCConsumer later reads these 
pre-computed mutations from the CDC index, deserializes them, and applies them 
directly to the index table(s). In this approach, the consumer does not need to 
understand index structure or re-derive mutations — it simply replays what was 
already computed on the write path. The trade-off is increased CDC index row 
size due to the serialized mutation payload, and additional write IO on the CDC 
index table.
   
   **Approach 2: Generated mutations from data row states (default, value = 
false)**
   
   During preBatchMutate(), IndexRegionObserver writes only a lightweight CDC 
index entry without serialized index mutations. Instead, the CDC event is 
created with the DATA_ROW_STATE cdc scope. When the IndexCDCConsumer processes 
these events, it reads the CDC index rows which trigger a server-side scan of 
the data table to reconstruct the before-image `currentDataRowState` and 
after-image `nextDataRowState` of the data row at the change timestamp. These 
raw row states are returned as a Protobuf DataRowStates message. The consumer 
then provides these states into  generateIndexMutationsForRow() — the same 
utility used by IndexRegionObserver#prepareIndexMutations on the write path — 
to derive index mutations at consume time. This approach keeps CDC index rows 
small, avoids additional write IO, and generates mutations based on the current 
index definition, but requires an additional data table read per CDC event and 
is sensitive to data visibility timing. Make sure max lookback age
  is long enough to retain before and after images of the row.
   
   Use Approach 2 (serialize = false, default) to minimize write IO: no 
serialized mutations are written to the CDC index, keeping CDC index rows small 
and write latency uniform. The trade-off is higher read IO at consume time — 
the consumer performs an additional data table point-lookup with a raw scan per 
CDC event to reconstruct row states. 
   
   Use Approach 1 (serialize = true) to minimize read IO: the consumer reads 
pre-computed mutations from the CDC index and applies them directly, with no 
data table scan required at consume time. The trade-off is higher write IO — 
serialized index mutations are written alongside each CDC index entry, 
increasing CDC index row size and write-path latency. Although CDC index is 
expected to have TTL same as the data table max lookback age. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to