[PR] PHOENIX-7794 Eventually Consistent Global Secondary Indexes [phoenix]

via GitHub Tue, 07 Apr 2026 14:32:52 -0700


virajjasani opened a new pull request, #2401:
URL: https://github.com/apache/phoenix/pull/2401

Achieving consistently low latency at any scale is a major challenge for
many critical web applications/services. This requires such applications to
choose the right database. Distributed NoSQL databases like Apache Phoenix
offer the scalability and throughput required for such critical workloads.
However, when it comes to global secondary indexes, Phoenix provides strongly
consistent (synchronous) indexes. Here, index updates are tightly coupled with
the data table updates, meaning as the number of indexes grows, write latency
on the data table can increase depending on the network overhead, and/or WAL
replica availability of each index table. As a result, applications with high
write volumes and multiple indexes can experience some throughput and
availability degradation.

The purpose of this Jira is to provide the Eventually Consistent Global
Secondary Indexes. Here, index updates are managed separately from the data
table updates. This keeps write latency on the data table consistently lower
regardless of the number of indexes created on the data table. This allows high
write volume applications to take advantage of the global secondary indexes in
Phoenix without slowing down their writes, while accepting eventual consistency
of the indexes.

The design document attached to the Jira describes several possible
approaches to achieve this, while finalizing two approaches to provide
eventually consistent indexes.

**Requirements for Eventually Consistent Indexes**

1. Users should be able to create eventually consistent indexes for both
covered and uncovered indexes.
2. The SQL statement should include the CONSISTENCY clause to determine
whether the given covered or uncovered index is strongly consistent or
eventually consistent. By default, consider the given index as strongly
consistent. CREATE INDEX <index-name> ON <data-table> ( <col1>,... <colN>)
INCLUDE (<col1>,...<colN>) CONSISTENCY = EVENTUAL
3. Users should be able to seamlessly update the CONSISTENCY property of the
given index from strong to eventual and vice versa using ALTER INDEX SQL
statement. (although the change of consistency update depends on the
UPDATE_CACHE_FREQUENCY used at the table level) ALTER INDEX <index-name> ON
<data-table> CONSISTENCY = EVENTUAL
4. Depending on the use cases, data tables can consist of the mix of zero or
more strongly consistent indexes and zero or more eventually consistent indexes.
5. Index verification MapReduce jobs should work for the eventually
consistent global secondary indexes similar to how they work for the strongly
consistent global secondary indexes.
6. Concurrent mutations on the data table should work for eventually
consistent indexes.
7. Data table mutations need to produce and store the time ordered metadata
(change records) for consumers to replay them and perform the index mutation
RPCs.
8. Updates to eventually consistent indexes should mirror the pre-index and
post-index update semantics of strongly consistent updates. However, the
separate RPCs for pre-index and post-index updates can be combined into a
single RPC call. For instance, if the data table update failed, the consumer
should update corresponding indexes with unverified rows (pre-index updates)
only. If the data table update succeeded, the consumer should update
corresponding indexes with verified rows (post-index update) only. The consumer
does not need to perform both pre and post index update RPCs on the indexes.
9. To improve the scale of index updates, mutations on indexes should be
executed by consuming ordered change records per table region. This allows for
parallel processing across all table regions.
10. Once the data table region splits or merges into new daughter regions,
any remaining ordered change records from the closed parent region should be
processed before consuming newly generated change records for the new daughter
regions.

[Design
doc](https://docs.google.com/document/d/1-bInfjnc0Yg_Aoqs8zc8-HQQzNrKgLtkodEJQxOkOCg/edit?usp=sharing)

Two approaches are finalized for the eventually consistent global secondary
indexes:

`phoenix.index.cdc.mutation.serialize` controls which approach is used for
implementing eventually consistent global secondary indexes.

**Approach 1: Serialized mutations (value = true)**

During preBatchMutate(), IndexRegionObserver generates index mutations for
each data table mutation and serializes them into a Protobuf IndexMutations
message. This serialized payload is written as a column value in the CDC index
table row alongside the CDC event. The IndexCDCConsumer later reads these
pre-computed mutations from the CDC index, deserializes them, and applies them
directly to the index table(s). In this approach, the consumer does not need to
understand index structure or re-derive mutations — it simply replays what was
already computed on the write path. The trade-off is increased CDC index row
size due to the serialized mutation payload, and additional write IO on the CDC
index table.

**Approach 2: Generated mutations from data row states (default, value =
false)**

During preBatchMutate(), IndexRegionObserver writes only a lightweight CDC
index entry without serialized index mutations. Instead, the CDC event is
created with the DATA_ROW_STATE cdc scope. When the IndexCDCConsumer processes
these events, it reads the CDC index rows which trigger a server-side scan of
the data table to reconstruct the before-image `currentDataRowState` and
after-image `nextDataRowState` of the data row at the change timestamp. These
raw row states are returned as a Protobuf DataRowStates message. The consumer
then provides these states into generateIndexMutationsForRow() — the same
utility used by IndexRegionObserver#prepareIndexMutations on the write path —
to derive index mutations at consume time. This approach keeps CDC index rows
small, avoids additional write IO, and generates mutations based on the current
index definition, but requires an additional data table read per CDC event and
is sensitive to data visibility timing. Make sure max lookback age
is long enough to retain before and after images of the row.

Use Approach 2 (serialize = false, default) to minimize write IO: no
serialized mutations are written to the CDC index, keeping CDC index rows small
and write latency uniform. The trade-off is higher read IO at consume time —
the consumer performs an additional data table point-lookup with a raw scan per
CDC event to reconstruct row states.

Use Approach 1 (serialize = true) to minimize read IO: the consumer reads
pre-computed mutations from the CDC index and applies them directly, with no
data table scan required at consume time. The trade-off is higher write IO —
serialized index mutations are written alongside each CDC index entry,
increasing CDC index row size and write-path latency. Although CDC index is
expected to have TTL same as the data table max lookback age.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] PHOENIX-7794 Eventually Consistent Global Secondary Indexes [phoenix]

Reply via email to