[
https://issues.apache.org/jira/browse/PHOENIX-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kadir OZDEMIR reopened PHOENIX-5527:
------------------------------------
In the original design for consistent indexes, we do a three-phase write. In the
first phase, we write full index rows with unverified status, then we write
data table rows, and finally we overwrite the index row status and set it to
verified. All these writes get the same timestamp so that index and data
table entries have the same timestamp for consistency. This timestamp is the
wall clock time of the server at the time the data table row is read to prepare
index mutations.
Now if an index row is replicated before its data row and is scanned at the
destination, this row can be deleted by read repair. The delete timestamp will
be the same as the existing row timestamp. Since deletes always trump puts when
the timestamps are the same, the index row will not be visible even if the data
row is replicated later. To reduce the occurrences of this event, we set the
delete age threshold for unverified rows to 7 days as a stopgap solution for
now. However, the side effect of this is an increase in the number of
unverified rows and unnecessary read repairs.
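
The masking problem described above can be illustrated with a small sketch. It
is a simplified model, not Phoenix or HBase code: it assumes a row delete
marker hides every put whose timestamp is at or below the marker's timestamp,
regardless of arrival order. The function name and the timestamp value are
hypothetical.

```python
def visible(put_timestamps, delete_timestamps):
    """Simplified masking model (an assumption, not the HBase API):
    a row delete marker at timestamp d hides every put whose
    timestamp is <= d, whatever order the cells arrive in."""
    return [t for t in put_timestamps
            if all(t > d for d in delete_timestamps)]

T = 1_000  # hypothetical shared write timestamp in ms

# Original design on the destination cluster: the unverified index row
# arrives at T, is scanned, and read repair deletes it at timestamp T.
deletes = [T]

# The verified index row status is replicated later, also at timestamp T.
late_puts = [T]

# Empty list: the late put stays masked by the delete marker.
old_design_visible = visible(late_puts, deletes)
```
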
There is a better solution for this replication lag problem as follows:
1. Instead of writing the full index row in the first phase, write it in the
last phase. So, in the first phase, we just write the unverified status for the
index row, and in the last phase, we write the full index row.
2. The timestamp of the unverified row is the timestamp of the index full row
(and also the data table row) minus 1. This will make sure that if the
unverified row is deleted by read repair, it will not mask the verified row.
This change does not impact correctness of the design. Now, if the index row is
replicated before the data table row and is scanned, it can be deleted safely
as this will only delete the unverified status. When the full index row is
replicated, it will be visible to scans.
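
A minimal sketch of the proposed write ordering and timestamps follows. It is
an illustration only: the function names and tuple layout are hypothetical, and
it assumes that read repair deletes at the unverified row's own timestamp and
that a delete marker hides cells at or below its timestamp.

```python
def three_phase_writes(ts):
    """Return (phase, target, timestamp) tuples for the proposed ordering."""
    return [
        ("phase 1", "index: unverified status only", ts - 1),
        ("phase 2", "data table row", ts),
        ("phase 3", "index: full row, verified", ts),
    ]

def index_row_survives_read_repair(ts):
    """True if the full index row stays visible after read repair has
    deleted the unverified row on the destination cluster."""
    writes = three_phase_writes(ts)
    unverified_ts = writes[0][2]
    full_row_ts = writes[2][2]
    # Assumption: the delete marker carries the unverified row's timestamp
    # and masks only cells at or below that timestamp.
    delete_ts = unverified_ts
    return full_row_ts > delete_ts
```

Because the full index row sits one timestamp unit above the delete marker in
this model, it is not masked even when it is replicated after the delete.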
This also improves the overall design in terms of efficiency. In the presence
of concurrent writes, we skip the last write phase. These writes leave the
index rows in unverified status. Similarly, if the first or second phase write
fails, we do not proceed with the third phase.
Since with this change, we will be writing only the empty column for index
tables in these failure cases, the storage usage will be improved as we will
write less index data.
The actual fix for the replication lag should be not to replicate index tables
in the first place, and to derive them from the data table writes
as we do on the local cluster. When we have the actual fix, we may remove the
subtraction of 1 from the unverified row timestamp (although we may also want
to keep it as it can protect the index rows against deletions by some crazy
race conditions).
The patch for this is attached. I ran the tests locally and all passed except
for one failure in a newly introduced IT (EmptyColumnIT). The patch is quite
small and straightforward. I am hoping to get a +1 quickly from one of you,
[~gjacoby], [~vincentpoon], [~abhishek.chouhan], [~larsh].
> Unverified index rows should not be deleted due to replication lag
> -------------------------------------------------------------------
>
> Key: PHOENIX-5527
> URL: https://issues.apache.org/jira/browse/PHOENIX-5527
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 5.0.0, 4.14.3
> Reporter: Kadir OZDEMIR
> Assignee: Kadir OZDEMIR
> Priority: Major
> Fix For: 4.15.0, 5.1.0
>
> Attachments: PHOENIX-5527.master.001.patch
>
>
> The current default delete time for unverified index rows is 10 minutes. If
> an index table row is replicated before its data table row and the
> replication row is unverified at the time of replication, it can be deleted
> when it is scanned on the destination cluster. To prevent these deletes due
> to replication lag issues, we should increase the default time to 7 days.
> This value is configurable using the configuration parameter,
> phoenix.global.index.row.age.threshold.to.delete.ms.