[ 
https://issues.apache.org/jira/browse/IGNITE-18595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy updated IGNITE-18595:
---------------------------------------
    Description: 
Before starting to accept tuples during a full state transfer, we should take 
the list of all the indices of the table in question that are in states between 
REGISTERED and READ_ONLY at the Catalog version passed with the full state 
transfer. Let’s put them in the *CurrentIndices* list.

Then, for each tuple version we accept (see the sketch after this list):
 # If it’s committed, only consider indices from *CurrentIndices* that are not
in the REGISTERED state now. We don’t need to index committed versions for
REGISTERED indices, as they will be indexed by the backfiller (after the index
switches to BACKFILLING). For each remaining index in {*}CurrentIndices{*}, put
the tuple version into the index if one of the following is true:
 ## The index state is not READ_ONLY at the snapshot catalog version (so it’s
one of BACKFILLING, AVAILABLE, STOPPING), because these tuples can still be
read by both RW and RO transactions via the index.
 ## The index state is READ_ONLY at the snapshot catalog version, but at
commitTs the index either did not yet exist or its state strictly preceded
STOPPING (we don’t include tuples committed in the STOPPING state because, from
the point of view of RO transactions, it’s impossible to query such tuples via
the index [it is not queryable at those timestamps], new RW transactions don’t
see the index, and old RW transactions [that saw it] have already finished).
 # If it’s a Write Intent, then:
 ## If the index is in the REGISTERED state at the snapshot catalog version, add
the tuple to the index only if its transaction was started while the index was
in the REGISTERED state; otherwise, skip it, as it will be indexed by the
backfiller.
 ## If the index is in any of the BACKFILLING, AVAILABLE, or STOPPING states at
the snapshot catalog version, add the tuple to the index.
 ## If the index is in the READ_ONLY state at the snapshot catalog version, add
the tuple to the index only if the transaction was started before the index
switched to the STOPPING state (this is needed to index a write intent from a
finished, but not yet cleaned up, transaction).
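Below is a minimal, self-contained sketch of these rules. All names here
({{IndexState}}, {{IndexMeta}}, {{TupleVersion}}, the REGISTERED/STOPPING
timestamps) are hypothetical illustrations, not actual Ignite 3 APIs; this is a
rough model of the decision, not the implementation:

{code:java}
import java.util.List;
import java.util.stream.Collectors;

/** Index lifecycle states, in order (hypothetical representation). */
enum IndexState { REGISTERED, BACKFILLING, AVAILABLE, STOPPING, READ_ONLY }

/** Index metadata as seen during the full state transfer (hypothetical). */
record IndexMeta(
        int indexId,
        IndexState stateAtSnapshotVersion, // state at the Catalog version passed with the transfer
        IndexState currentState,           // state at the moment we apply the tuple
        long registeredTs,                 // when the index appeared in the catalog (REGISTERED)
        long stoppingTs                    // when it switched to STOPPING; Long.MAX_VALUE if never
) {}

/** A single row version arriving with the full state transfer (hypothetical). */
record TupleVersion(boolean committed, long commitTs, long txStartTs) {}

class FullStateTransferIndexer {
    /** CurrentIndices: indexes of the table in states REGISTERED..READ_ONLY at the snapshot catalog version. */
    static List<IndexMeta> currentIndices(List<IndexMeta> allTableIndexes) {
        return allTableIndexes.stream()
                .filter(idx -> idx.stateAtSnapshotVersion().compareTo(IndexState.REGISTERED) >= 0
                        && idx.stateAtSnapshotVersion().compareTo(IndexState.READ_ONLY) <= 0)
                .collect(Collectors.toList());
    }

    /** Decides whether the given tuple version should be written into the given index. */
    static boolean shouldIndex(IndexMeta idx, TupleVersion tuple) {
        if (tuple.committed()) {
            if (idx.currentState() == IndexState.REGISTERED) {
                return false; // the backfiller will index it after the switch to BACKFILLING
            }
            if (idx.stateAtSnapshotVersion() != IndexState.READ_ONLY) {
                return true;  // BACKFILLING, AVAILABLE or STOPPING: still readable via the index
            }
            // READ_ONLY at the snapshot version: index only versions committed before the index
            // existed or strictly before it switched to STOPPING.
            return tuple.commitTs() < idx.registeredTs() || tuple.commitTs() < idx.stoppingTs();
        }

        // Write intent.
        switch (idx.stateAtSnapshotVersion()) {
            case REGISTERED:
                // Index only if the transaction started while the index was already REGISTERED;
                // otherwise the backfiller will pick the row up.
                return tuple.txStartTs() >= idx.registeredTs();
            case BACKFILLING:
            case AVAILABLE:
            case STOPPING:
                return true;
            case READ_ONLY:
                // A write intent of a finished but not yet cleaned-up transaction.
                return tuple.txStartTs() < idx.stoppingTs();
            default:
                return false;
        }
    }
}
{code}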

Unlike the Backfiller operation, during a full state transfer we don’t need to
use the Write Intent resolution procedure, because races with transaction
cleanup are not possible: we just index the Write Intent. If, after the
partition replica goes online, it gets a cleanup request with ABORT, it will
clean the index itself.
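A rough illustration of that idea (with hypothetical types, not actual Ignite 3
APIs): the write intent is indexed optimistically, and a later ABORT cleanup
removes it from the indexes as well.

{code:java}
import java.util.List;

class WriteIntentCleanupSketch {
    /** Hypothetical index storage operation: removes an entry for the given row. */
    interface IndexStorage {
        void remove(Object indexedRow, Object rowId);
    }

    /** Handles a cleanup request arriving after the partition replica went online. */
    static void onCleanup(boolean commit, Object rowId, Object writeIntentRow, List<IndexStorage> indexes) {
        if (!commit) {
            // The write intent was indexed during the full state transfer; since the
            // transaction aborted, remove the corresponding entries from every index.
            for (IndexStorage index : indexes) {
                index.remove(writeIntentRow, rowId);
            }
        }
        // On COMMIT the indexed entries simply stay, just like the version in the MV store.
    }
}
{code}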

If the initial state of an index during the full state transfer was BACKFILLING
and, while accepting the full state transfer, we see that the index has been
dropped (and moved to the [deleted] pseudostate), we should stop writing to
that index (and allow it to be destroyed on that partition).
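A tiny sketch of this check (hypothetical names; the [deleted] pseudostate is
modeled as an extra enum value):

{code:java}
import java.util.Iterator;
import java.util.Map;

class DroppedIndexCheckSketch {
    enum ObservedState { REGISTERED, BACKFILLING, AVAILABLE, STOPPING, READ_ONLY, DELETED }

    /** Removes dropped indexes from the set we keep writing to; they may then be destroyed locally. */
    static void dropDeletedIndexes(Map<Integer, ObservedState> indexesBeingFilled) {
        Iterator<Map.Entry<Integer, ObservedState>> it = indexesBeingFilled.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() == ObservedState.DELETED) {
                it.remove(); // stop writing to this index on this partition
            }
        }
    }
}
{code}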

If we start a full state transfer on a partition for which an index is being
built (so the index is in the BACKFILLING state), we’ll index the accepted
tuples according to the rules above. After the full state transfer finishes,
we’ll start getting ‘add this batch to the index’ commands from the RAFT log
(as the Backfiller emits them during the backfilling process); we can just
ignore or reapply them. To ignore them, we can raise a special flag in the
index storage when finishing a full state transfer that started while the index
was in the BACKFILLING state (see the sketch below).
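A rough sketch of that flag, under the assumption that it is kept in (or next
to) the index storage so it survives restarts; the names are illustrative, not
actual Ignite 3 APIs:

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

class BackfillSuppressionFlag {
    /** In a real implementation this flag would be persisted in the index storage. */
    private final AtomicBoolean builtByFullStateTransfer = new AtomicBoolean();

    /** Called when a full state transfer that started with the index in BACKFILLING state finishes. */
    void onFullStateTransferFinished() {
        builtByFullStateTransfer.set(true);
    }

    /** Called for every ‘add this batch to the index’ command replayed from the RAFT log. */
    boolean shouldApplyBackfillBatch() {
        // Tuples were already indexed while accepting the full state transfer, so the
        // Backfiller's batches can be ignored here (reapplying them would also be harmless).
        return !builtByFullStateTransfer.get();
    }
}
{code}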
h1. Old version

Here there is no source of information about the schema versions associated
with individual inserts.
all rows will be sent, while indexes will be rebuilt locally on the consumer. 
This is unfortunate. Why, you may ask.

Imagine the following situation:
 * time T1: table A with index X is created
 * time T2: user uploads the data
 * time T3: user drops index X
 * time T4: “clean” node N enters topology and downloads data via full 
rebalance procedure
 * time T5: N becomes a leader and receives (already running) RO transactions 
with timestamp T2<T<T3

Ideally, index X should be available for timestamp T. Once an index is
available, it can’t suddenly become unavailable without an explicit rebuild
request from the user (I guess).

The LATEST schema version at the moment of rebalance must be known. That’s
unavoidable and makes total sense. The first idea that comes to mind is
updating all Registered and Available indexes. A situation where an index has
more indexed rows than it requires is correct: scan queries only return indexed
rows that match the corresponding value in the partition MV store. The real
problem would be having less data than required.

The approach as described in the paragraph above is not quite correct. Consider
a BinaryRow version: it defines the set of columns the table had at the moment
of the update. Not all row versions are compatible with all indexes. For
example, you cannot put data into an index if an indexed column has been
deleted. On the other hand, you can put data into the index if an indexed
column has not yet been created (assuming it has a default value). In both
cases the column is missing from the row version, but the outcome is very
different.

This fact has some implications. The set of indexes to be updated depends on
the row version of every particular row. I propose calculating it as the set of
all indexes from a {_}maximal continuous range of DB schemas{_} that (if not
empty) starts with the earliest known schema and in which _no indexed column
has been dropped_ from the table (an indexed column that has not yet been added
does not break the range, since a default value can be used).

For example, there’s a table T:
||DB schema version||Table columns||
|1|PK, A|
|2|PK, A, B|
|3 (LATEST)|PK, B|

In such a configuration, the ranges would be:
||Index columns||Schemas range||
|A|[1 ... 2]|
|B|[1 ... 3]|
|A, B|[1 ... 2]|
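For illustration, here is a small self-contained sketch that reproduces the
ranges from the tables above, under the reading that a column which has not yet
been added does not break the range (a default value can be used), while a
dropped indexed column does. All names are made up for this example:

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SchemaRangeExample {
    public static void main(String[] args) {
        // DB schema version (1-based) -> table columns; the last one is the LATEST.
        List<Set<String>> schemas = List.of(
                Set.of("PK", "A"),        // version 1
                Set.of("PK", "A", "B"),   // version 2
                Set.of("PK", "B")         // version 3 (LATEST)
        );

        Map<String, Set<String>> indexes = Map.of(
                "idx_A", Set.of("A"),
                "idx_B", Set.of("B"),
                "idx_AB", Set.of("A", "B"));

        indexes.forEach((name, columns) ->
                System.out.printf("%s -> [1 ... %d]%n", name, lastSchemaInRange(schemas, columns)));
        // Prints: idx_A -> [1 ... 2], idx_B -> [1 ... 3], idx_AB -> [1 ... 2]
    }

    /**
     * Returns the last schema version (1-based) of the maximal continuous range that starts
     * with the earliest known schema. The range is cut off as soon as an indexed column that
     * already existed gets dropped; a column that has not yet been added does not cut the range.
     */
    static int lastSchemaInRange(List<Set<String>> schemas, Set<String> indexedColumns) {
        int last = 0;
        Set<String> seen = new HashSet<>();
        for (int v = 0; v < schemas.size(); v++) {
            Set<String> cols = schemas.get(v);
            for (String c : indexedColumns) {
                if (seen.contains(c) && !cols.contains(c)) {
                    return last; // an indexed column was dropped: the range ends at the previous version
                }
            }
            seen.addAll(cols);
            last = v + 1;
        }
        return last;
    }
}
{code}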


> Implement index build process during the full state transfer
> ------------------------------------------------------------
>
>                 Key: IGNITE-18595
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18595
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>


