[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-10-19 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16657730#comment-16657730
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 10/20/18 4:52 AM:
--

See *Activate | Deactivate Cluster* successful rerun 
https://ci.ignite.apache.org/viewLog.html?buildId=2124589=buildResultsDiv=IgniteTests24Java8_ActivateDeactivateCluster


was (Author: pavlukhin):
See *Activate | Deactivate Cluster* rerun 
https://ci.ignite.apache.org/viewLog.html?buildId=2124589=buildResultsDiv=IgniteTests24Java8_ActivateDeactivateCluster

> MVCC TX: Tx recovery protocol
> -
>
> Key: IGNITE-5935
> URL: https://issues.apache.org/jira/browse/IGNITE-5935
> Project: Ignite
> Issue Type: Task
> Components: cache, mvcc
> Reporter: Semen Boikov
> Assignee: Ivan Pavlukhin
> Priority: Major
> Fix For: 2.7
>
> Attachments: 
> mtcga.gridgain.com_build.html_serverId=apache=true=2111226=Check.png
>
>
> The transaction recovery procedure is initiated when the near node fails before 
> the transaction is finished.
> In MVCC transactions, _partition update counter_ modification starts in the 
> prepare phase. If a transaction was prepared on at least one node, we need to 
> finish the _partition update counter_ modification consistently on all 
> participating nodes.
> Also, a recovered transaction should be removed from the active transactions 
> list on the mvcc coordinator.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-10-19 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656842#comment-16656842
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 10/19/18 2:12 PM:
--

The latest run was partially cancelled and is not representative. The newly 
started build has an estimated execution time of 11 hours (due to problems with 
.NET tests). 

See the bot report about the previous TC run in the attachment. All failed tests 
seem to be either flaky or constantly failing. 


was (Author: pavlukhin):
The latest run was partially cancelled and not representative. Newly started 
build estimated execution time is 11 hours (due to problems with .NET tests). 

See bot report about previous TC run in an attachment. Comments regarding it:



[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-10-19 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656842#comment-16656842
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 10/19/18 2:06 PM:
--

The latest run was partially cancelled and is not representative. The newly 
started build has an estimated execution time of 11 hours (due to problems with 
.NET tests). 

See the bot report about the previous TC run in the attachment. Comments regarding it:


was (Author: pavlukhin):
The latest run was partially cancelled and not representative. Newly started 
build estimated execution time is 11 hours (due to problems with .NET tests). 
See bot report about previous TC run in an attachment.



[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-10-18 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641358#comment-16641358
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 10/18/18 3:12 PM:
--

If a node fails before finishing all the transactions it initiated, they must be 
removed from the active list on the mvcc coordinator strictly after local 
transaction completion on each participating node. There are 2 cases, handled 
differently depending on the node type (client or server).
 # Transactions left by a server node are removed from the active list on PME.
 # Transactions left by a client node are removed from the active list after 
cluster-wide voting, in which each node casts its vote after it has made a 
recovery decision for all of these transactions on that node.
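
The client-node case could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Ignite implementation: {{RecoveryVoteTracker}}, {{onVote}}, the string node ids and the modelling of the active list as a set of tx ids are all hypothetical names chosen for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not an Ignite class: tracks recovery votes for one failed client node. */
final class RecoveryVoteTracker {
    /** Nodes that have not voted yet. */
    private final Set<String> pendingVoters;

    /** Active transaction list as kept on the coordinator (modelled as a set of tx ids). */
    private final Set<Long> activeTxs;

    /** Transactions initiated by the failed client node. */
    private final Set<Long> failedNodeTxs;

    RecoveryVoteTracker(Set<String> voters, Set<Long> activeTxs, Set<Long> failedNodeTxs) {
        this.pendingVoters = new HashSet<>(voters);
        this.activeTxs = activeTxs;
        this.failedNodeTxs = new HashSet<>(failedNodeTxs);
    }

    /**
     * Records the vote of a node that has finished local recovery.
     * Returns true once every voter has voted and the transactions were removed.
     */
    synchronized boolean onVote(String nodeId) {
        pendingVoters.remove(nodeId);

        if (!pendingVoters.isEmpty())
            return false;

        // Every node has decided on recovery, so removal cannot precede local completion.
        activeTxs.removeAll(failedNodeTxs);

        return true;
    }
}
```

The point of the sketch is the ordering: nothing is removed from the active list until the last vote arrives, which matches the "strictly after local transaction completion on each participating node" requirement above.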

Also, _partition counters_ should be kept consistent among partition replicas 
after recovery. The current transaction commit protocol delivers _partition 
counters_ to backups in the _prepare_ phase. During recovery, a transaction may 
be recovered when the primary has failed and one backup has received the 
counters while another has not. In that case the transaction should be 
rolled back and the counters should be aligned. As the primary has failed, PME 
will occur. We must close all possible _gaps_ in counters before PME is complete. 
This is achieved with the following steps:
1. Exchange counters among sibling backups before finishing the recovering 
transactions.
2. Drain pending partition counter queues during PME.
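
A minimal sketch of the two steps, assuming a counter that applies contiguous ranges and queues out-of-order ones. {{PartitionCounterSketch}} and its methods are hypothetical and only illustrate the idea; they are not the Ignite partition update counter API.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical model of a partition update counter with a queue of out-of-order updates. */
final class PartitionCounterSketch {
    /** Highest counter value applied without gaps. */
    private long cntr;

    /** Out-of-order ranges waiting for the gap before them to close: start -> delta. */
    private final TreeMap<Long, Long> pending = new TreeMap<>();

    /** Applies the range (start, start + delta]; a range beyond the current counter is queued. */
    void update(long start, long delta) {
        if (start > cntr) {
            pending.put(start, delta);

            return;
        }

        cntr = Math.max(cntr, start + delta);

        absorb(false);
    }

    /** Step 2: drains the queue unconditionally, jumping over any remaining gaps. */
    void drainPending() {
        absorb(true);
    }

    /** Applies queued ranges: only contiguous ones, or all of them when skipGaps is true. */
    private void absorb(boolean skipGaps) {
        while (!pending.isEmpty() && (skipGaps || pending.firstKey() <= cntr)) {
            Map.Entry<Long, Long> e = pending.pollFirstEntry();

            cntr = Math.max(cntr, e.getKey() + e.getValue());
        }
    }

    long get() {
        return cntr;
    }

    boolean hasGaps() {
        return !pending.isEmpty();
    }
}
```

In this model, step 1 corresponds to a backup calling {{update}} with a range it learned from a sibling backup (the range reserved by the rolled-back transaction), and step 2 corresponds to {{drainPending}}, after which no gaps remain.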


was (Author: pavlukhin):
If a node fails before finishing all initiated by it transactions they must be 
removed from active list on mvcc coordinator strictly after local transaction 
completion on each participating node. There are 2 cases handled differently 
depending on node type (client or server).
 # Transactions left by a server node are removed from the active list on PME.
 # Transactions left by a client node are removed from the active list after 
cluster-wide voting when each node gives a vote after making decision on all 
transactions recovery on that node.

Also _partition counters_ should be kept consistent among partition replicas 
after recovery. Current protocol delivers _partition counters_ to backups on 
_prepare_ phase. During recovery there could occur a situation when transaction 
is recovering case when primary has failed and one backup received counters and 
another do not. Such case is a rollback and counters should be aligned. As 
primary has failed PME will occur. We rely on counters alignment during PME.



[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-10-18 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641358#comment-16641358
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 10/18/18 3:03 PM:
--

If a node fails before finishing all the transactions it initiated, they must be 
removed from the active list on the mvcc coordinator strictly after local 
transaction completion on each participating node. There are 2 cases, handled 
differently depending on the node type (client or server).
 # Transactions left by a server node are removed from the active list on PME.
 # Transactions left by a client node are removed from the active list after 
cluster-wide voting, in which each node casts its vote after it has made a 
recovery decision for all of these transactions on that node.

Also, _partition counters_ should be kept consistent among partition replicas 
after recovery. The current protocol delivers _partition counters_ to backups in 
the _prepare_ phase. During recovery, a transaction may be recovered when the 
primary has failed and one backup has received the counters while another has 
not. Such a case results in a rollback, and the counters should be aligned. As 
the primary has failed, PME will occur. We rely on counter alignment during PME.


was (Author: pavlukhin):
If a node fails before finishing all initiated by it transactions they must be 
removed from active list on mvcc coordinator strictly after local transaction 
completion on each participating node. There are 2 cases handled differently 
depending on node type (client or server).
 # Transactions left by a server node are removed from the active list on PME.
 # Transactions left by a client node are removed from the active list after 
cluster-wide voting when each node gives a vote after making decision on all 
transactions recovery on that node.
 
Possible problem: not all transactions can be recovered. Such transactions can 
prevent other recovered transactions removal from the active list.

Also _partition counters_ should be kept consistent among partition replicas 
after recovery. Current protocol delivers _partition counters_ to backups on 
_prepare_ phase. During recovery there could occur a situation when transaction 
is recovering case when primary has failed and one backup received counters and 
another do not. Such case is a rollback and counters should be aligned. As 
primary has failed PME will occur. We rely on counters alignment during PME.



[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-10-07 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641358#comment-16641358
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 10/8/18 5:50 AM:
-

If a node fails before finishing all the transactions it initiated, they must be 
removed from the active list on the mvcc coordinator strictly after local 
transaction completion on each participating node. There are 2 cases, handled 
differently depending on the node type (client or server).
 # Transactions left by a server node are removed from the active list on PME.
 # Transactions left by a client node are removed from the active list after 
cluster-wide voting, in which each node casts its vote after it has made a 
recovery decision for all of these transactions on that node.

Possible problem: not all transactions can be recovered. Such transactions can 
prevent the removal of other, recovered transactions from the active list.

Also, _partition counters_ should be kept consistent among partition replicas 
after recovery. The current protocol delivers _partition counters_ to backups in 
the _prepare_ phase. During recovery, a transaction may be recovered when the 
primary has failed and one backup has received the counters while another has 
not. Such a case results in a rollback, and the counters should be aligned. As 
the primary has failed, PME will occur. We rely on counter alignment during PME.


was (Author: pavlukhin):
If a node fails before finishing all initiated by it transactions the must be 
removed from active list on mvcc coordinator strictly after local transaction 
completion on each participating node. There are 2 cases handled differently 
depending on node type (client or server).
 # Transactions left by a server node are removed from the active list on PME.
 # Transactions left by a client node are removed from the active list after 
cluster-wide voting when each node gives a vote after making decision on all 
transactions recovery on that node.
 
Possible problem: not all transactions can be recovered. Such transactions can 
prevent other recovered transactions removal from the active list.



[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-10-01 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633377#comment-16633377
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 10/1/18 8:24 AM:
-

We should support recovery in slightly different scenarios depending on the 
following:
 * Is the failed node a server?
 * Is the failed node the mvcc coordinator?

Also, attention should be paid to backup transaction rollback.
Q: Who should initiate the rollback procedure on a backup?
A: The primary (DhtLocal) TX rolls back dependent backup TXs (DhtRemote).


was (Author: pavlukhin):
We should support recovery in slightly different scenarios depending on 
following:
 * Is failed node server?
 * Is failed node mvcc coordinator?
Also attention should be put on backup transaction rollback. It is not 
currently clear who should be an initiator of rollback procedure on backup.



[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-09-30 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633377#comment-16633377
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 9/30/18 2:06 PM:
-

We should support recovery in slightly different scenarios depending on the 
following:
 * Is the failed node a server?
 * Is the failed node the mvcc coordinator?

Also, attention should be paid to backup transaction rollback. It is 
currently unclear who should initiate the rollback procedure on a backup.


was (Author: pavlukhin):
We should support recovery in slightly different scenarios depending on 
following:
 * Is failed node server?
 * Is failed node mvcc coordinator?



[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-09-27 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628932#comment-16628932
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 9/27/18 1:22 PM:
-

Possible problems:
 * Proper commit with mvcc coordinator.
 * TX mappings with backups and proper empty DHT TX handling (+ cache API).
 * Consistent rollback if one backup was prepared and another was not.
 * Near node failed after requesting snapshot and before writing anything.


was (Author: pavlukhin):
Possible problems:
 * Proper commit with mvcc coordinator.
 * TX mappings with backups and proper empty DHT TX handling.
 * Consistent rollback if one backup was prepared and another was not.
 * Near node failed after requesting snapshot and before writing anything.



[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-09-27 Thread Ivan Pavlukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628932#comment-16628932
 ] 

Ivan Pavlukhin edited comment on IGNITE-5935 at 9/27/18 12:53 PM:
--

Possible problems:
 * Proper commit with mvcc coordinator.
 * TX mappings with backups and proper empty DHT TX handling.
 * Consistent rollback if one backup was prepared and another was not.
 * Near node failed after requesting snapshot and before writing anything.


was (Author: pavlukhin):
Possible problems:
 * Proper commit with mvcc coordinator.
 * TX mappings with backups and proper empty DHT TX handling.
 * Consistent rollback if one backup was prepared and another was not.



[jira] [Comment Edited] (IGNITE-5935) MVCC TX: Tx recovery protocol

2018-09-24 Thread Vladimir Ozerov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625418#comment-16625418
 ] 

Vladimir Ozerov edited comment on IGNITE-5935 at 9/24/18 6:24 AM:
--

Propagation of involved TX nodes on prepare stage: 
{{GridDistributedTxPrepareRequest.transactionNodes}}
Recovery trigger: {{IgniteTxManager.commitIfPrepared}}
Recovery request processing: {{IgniteTxHandler.processCheckPreparedTxRequest}}


was (Author: vozerov):
Propagation of involved TX nodes on prepare stage: 
{{GridDistributedTxPrepareRequest.transactionNodes}}
Recovery trigger: {{IgniteTxManager.commitIfPrepared}}

> MVCC TX: Tx recovery protocol
> -
>
> Key: IGNITE-5935
> URL: https://issues.apache.org/jira/browse/IGNITE-5935
> Project: Ignite
> Issue Type: Task
> Components: cache, mvcc
> Reporter: Semen Boikov
> Priority: Major
> Fix For: 2.7
>
>
> Tx recovery doesn't work properly for txs over MVCC-enabled caches using the 
> Cache API. It requires an MvccSnapshot, which may not have been acquired by 
> recovery time.
> We need to implement logic that checks whether a snapshot was already obtained 
> by one of the tx participants and uses the existing one; otherwise, it requests 
> a new snapshot and spreads it among the participants.
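
The snapshot check described above might look roughly like the sketch below. It is an illustration only: {{SnapshotResolver}}, the use of a plain {{long}} as a stand-in for an MvccSnapshot, and the {{requestNew}} supplier are assumptions made for the example and do not reflect the real Ignite types.

```java
import java.util.Collection;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper: picks the snapshot to use when recovering a transaction. */
final class SnapshotResolver {
    private SnapshotResolver() {
        // Static helper only.
    }

    /**
     * Returns the snapshot already held by some participant, if any.
     * Otherwise requests a new one, which the caller would then spread to all participants.
     */
    static long resolve(Collection<Optional<Long>> participantSnapshots, Supplier<Long> requestNew) {
        for (Optional<Long> snapshot : participantSnapshots) {
            if (snapshot.isPresent())
                return snapshot.get();
        }

        return requestNew.get();
    }
}
```

The sketch reuses an existing snapshot whenever one is found, so a new snapshot is requested only when no participant acquired one before the failure.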


