[jira] [Resolved] (IGNITE-12502) Document ignite-spring-data_2.2 module

2022-05-05 Thread Amelchev Nikita (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-12502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amelchev Nikita resolved IGNITE-12502.
--
Resolution: Won't Fix

The module was documented: 
https://ignite.apache.org/docs/latest/extensions-and-integrations/spring/spring-data

> Document ignite-spring-data_2.2 module
> --
>
> Key: IGNITE-12502
> URL: https://issues.apache.org/jira/browse/IGNITE-12502
> Project: Ignite
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Ilya Kasnacheev
>Priority: Major
>
> After IGNITE-12259
> I think there are no API changes, but we should mention that we have such a 
> module and what its dependencies are.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-16919) H2 Index cost function must take into account only corresponding columns.

2022-05-05 Thread Konstantin Orlov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532434#comment-17532434
 ] 

Konstantin Orlov commented on IGNITE-16919:
---

I've taken another look, and now I see it's definitely a bug. It's hard to 
comprehend without a test or reproducer...

For the record: without the fix, we sometimes double the cost of an index just 
because we mistakenly assume there are columns that aren't covered by the 
current index, so a read from the scan index is required.

BTW, as far as I know, we always read the data row from the page, and we do it 
only once regardless of whether all columns are covered by the index. Perhaps 
we should revisit this place.
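
To make the intended behaviour concrete, here is a minimal, self-contained 
sketch of the column intersection the fix implies. The helper names are 
hypothetical and this is not the actual Ignite patch; it only shows that 
columns of the other joined tables must be ignored when deciding whether an 
index covers the query:
{code:java}
import java.util.HashSet;
import java.util.Set;

final class IndexCoverageSketch {
    /**
     * Returns true if the index covers every column of ITS OWN table that the query touches.
     * Columns belonging to other tables of the join are ignored on purpose.
     */
    static boolean coversAllNeededColumns(Set<String> allQueryColumns,     // columns from ALL joined tables
                                          Set<String> indexedTableColumns, // columns of the indexed table
                                          Set<String> indexColumns) {      // columns present in the index
        Set<String> needed = new HashSet<>(allQueryColumns);
        needed.retainAll(indexedTableColumns); // drop columns of other tables

        return indexColumns.containsAll(needed);
    }
}
{code}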

> H2 Index cost function must take into account only corresponding columns.
> -
>
> Key: IGNITE-16919
> URL: https://issues.apache.org/jira/browse/IGNITE-16919
> Project: Ignite
>  Issue Type: Bug
>  Components: sql
>Affects Versions: 2.13
>Reporter: Evgeny Stanilovsky
>Assignee: Evgeny Stanilovsky
>Priority: Major
> Attachments: image-2022-04-30-19-13-59-997.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> H2IndexCostedBase#getCostRangeIndex is called with allColumnsSet, which 
> consists of columns from all tables participating in the query; check 
> org.h2.table.Plan#calculateCost :
> {code:java}
> final HashSet<Column> allColumnsSet = ExpressionVisitor
>     .allColumnsForTableFilters(allFilters);
> {code}
> Thus allColumnsSet contains columns from all the tables involved:
>  !image-2022-04-30-19-13-59-997.png! 
> and the erroneous iteration happens here, in 
> H2IndexCostedBase#getCostRangeIndex:
> ...
> {code:java}
> if (!isScanIndex && allColumnsSet != null && !skipColumnsIntersection && 
>     !allColumnsSet.isEmpty()) {
>     boolean foundAllColumnsWeNeed = true;
> 
>     for (Column c : allColumnsSet) { // <-- all columns, including other tables'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-15316) Read Repair may see inconsistent entry when it is consistent but updated right before the check

2022-05-05 Thread Anton Vinogradov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Vinogradov updated IGNITE-15316:
--
Summary: Read Repair may see inconsistent entry when it is consistent but 
updated right before the check  (was: Read Repair may see inconsistent entry at 
tx cache when it is consistent but updated right before the check)

> Read Repair may see inconsistent entry when it is consistent but updated 
> right before the check
> ---
>
> Key: IGNITE-15316
> URL: https://issues.apache.org/jira/browse/IGNITE-15316
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Anton Vinogradov
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-31
>
> Even in FULL_SYNC mode, stale reads from backups are possible after the lock 
> is obtained by the "Read Repair" tx.
> This is possible because (in the previous tx) the entry becomes unlocked 
> (committed) on the primary before the tx is committed on the backups.
> This is not a problem for Ignite (since backups keep locks until updated) but 
> it produces false-positive "inconsistency state found" events and repairs.
> As for atomic caches, there is no chance to lock the entry before the check 
> at all, so the inconsistency window is wider than in the tx case.
> This problem does not allow using Read Repair with concurrent modifications, 
> since a repair may happen because of an inconsistent read (while another 
> operation is in progress), not because of a real inconsistency.
> A possible solution is to implement fake updates, which will guarantee that 
> the previous update has fully finished -> consistent read.
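
To make the "fake update" idea concrete, below is a minimal sketch with made-up 
types. HypotheticalCache is not an Ignite API; it only names the two operations 
the description relies on (a no-op write that completes after the previous 
update is fully applied everywhere, and a per-owner read):
{code:java}
import java.util.HashSet;
import java.util.List;

interface HypotheticalCache<K, V> {
    void fakeUpdate(K key);            // no-op write; returns only after the prior update is applied on all owners
    List<V> readFromAllOwners(K key);  // read the value from the primary and every backup
}

final class ReadRepairCheckSketch {
    static <K, V> boolean isConsistent(HypotheticalCache<K, V> cache, K key) {
        cache.fakeUpdate(key); // settle any in-flight previous update before comparing replicas

        return new HashSet<>(cache.readFromAllOwners(key)).size() <= 1; // consistent if all owners agree
    }
}
{code}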



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-15316) Read Repair may see inconsistent entry at tx cache when it is consistent but updated right before the check

2022-05-05 Thread Anton Vinogradov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Vinogradov updated IGNITE-15316:
--
Description: 
Even in FULL_SYNC mode, stale reads from backups are possible after the lock is 
obtained by the "Read Repair" tx.
This is possible because (in the previous tx) the entry becomes unlocked 
(committed) on the primary before the tx is committed on the backups.
This is not a problem for Ignite (since backups keep locks until updated) but it 
produces false-positive "inconsistency state found" events and repairs.

As for atomic caches, there is no chance to lock the entry before the check at 
all, so the inconsistency window is wider than in the tx case.

This problem does not allow using Read Repair with concurrent modifications, 
since a repair may happen because of an inconsistent read (while another 
operation is in progress), not because of a real inconsistency.

A possible solution is to implement fake updates, which will guarantee that the 
previous update has fully finished -> consistent read.

  was:
Even in FULL_SYNC mode, stale reads from backups are possible after the lock is 
obtained by the "Read Repair" tx.
This is possible because (in the previous tx) the entry becomes unlocked 
(committed) on the primary before the tx is committed on the backups.
This is not a problem for Ignite (since backups keep locks until updated) but it 
produces false-positive "inconsistency state found" events and repairs.

Unlock relocation does not seem to be a proper fix, since it would cause a 
performance drop.
So, we should recheck the values several times if an inconsistency is found, 
even when the lock is already obtained by "Read Repair".


> Read Repair may see inconsistent entry at tx cache when it is consistent but 
> updated right before the check
> ---
>
> Key: IGNITE-15316
> URL: https://issues.apache.org/jira/browse/IGNITE-15316
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Anton Vinogradov
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-31
>
> Even in FULL_SYNC mode, stale reads from backups are possible after the lock 
> is obtained by the "Read Repair" tx.
> This is possible because (in the previous tx) the entry becomes unlocked 
> (committed) on the primary before the tx is committed on the backups.
> This is not a problem for Ignite (since backups keep locks until updated) but 
> it produces false-positive "inconsistency state found" events and repairs.
> As for atomic caches, there is no chance to lock the entry before the check 
> at all, so the inconsistency window is wider than in the tx case.
> This problem does not allow using Read Repair with concurrent modifications, 
> since a repair may happen because of an inconsistent read (while another 
> operation is in progress), not because of a real inconsistency.
> A possible solution is to implement fake updates, which will guarantee that 
> the previous update has fully finished -> consistent read.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] (IGNITE-15316) Read Repair may see inconsistent entry at tx cache when it is consistent but updated right before the check

2022-05-05 Thread Anton Vinogradov (Jira)


[ https://issues.apache.org/jira/browse/IGNITE-15316 ]


Anton Vinogradov deleted comment on IGNITE-15316:
---

was (Author: av):
It's a good idea to consider (as a part of this issue) replacing
{noformat}
for (KeyCacheObject key : keys) {
    List<ClusterNode> nodes = ctx.affinity().nodesByKey(key, topVer); // affinity

    primaryNodes.put(key, nodes.get(0));
    ...
{noformat}
with
{noformat}
for (KeyCacheObject key : keys) {
    List<ClusterNode> nodes = ctx.topology().nodes(key.partition(), topVer); // topology

    primaryNodes.put(key, nodes.get(0));
    ...
{noformat}
at 
{{org.apache.ignite.internal.processors.cache.distributed.near.consistency.GridNearReadRepairAbstractFuture#map}}.
This may help to reduce the number of remaps on unstable topology, but it 
requires thorough research.

It looks like using affinity mapping instead of topology mapping may cause 
unchecked copies on unstable topology.

> Read Repair may see inconsistent entry at tx cache when it is consistent but 
> updated right before the check
> ---
>
> Key: IGNITE-15316
> URL: https://issues.apache.org/jira/browse/IGNITE-15316
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Anton Vinogradov
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-31
>
> Even in FULL_SYNC mode, stale reads from backups are possible after the lock 
> is obtained by the "Read Repair" tx.
> This is possible because (in the previous tx) the entry becomes unlocked 
> (committed) on the primary before the tx is committed on the backups.
> This is not a problem for Ignite (since backups keep locks until updated) but 
> it produces false-positive "inconsistency state found" events and repairs.
> Unlock relocation does not seem to be a proper fix, since it would cause a 
> performance drop.
> So, we should recheck the values several times if an inconsistency is found, 
> even when the lock is already obtained by "Read Repair".



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16931) Read Repair should support unstable topology

2022-05-05 Thread Anton Vinogradov (Jira)
Anton Vinogradov created IGNITE-16931:
-

 Summary: Read Repair should support unstable topology
 Key: IGNITE-16931
 URL: https://issues.apache.org/jira/browse/IGNITE-16931
 Project: Ignite
  Issue Type: Improvement
Reporter: Anton Vinogradov
Assignee: Anton Vinogradov


Currently, RR does not support unstable topology (when not all owners can be 
located via affinity), and this can be fixed.

As a starting point, it's a good idea to consider replacing
{noformat}
for (KeyCacheObject key : keys) {
    List<ClusterNode> nodes = ctx.affinity().nodesByKey(key, topVer); // affinity

    primaryNodes.put(key, nodes.get(0));
    ...
{noformat}
with
{noformat}
for (KeyCacheObject key : keys) {
    List<ClusterNode> nodes = ctx.topology().nodes(key.partition(), topVer); // topology

    primaryNodes.put(key, nodes.get(0));
    ...
{noformat}
at 
{{{}org.apache.ignite.internal.processors.cache.distributed.near.consistency.GridNearReadRepairAbstractFuture#map{}}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16931) Read Repair should support unstable topology

2022-05-05 Thread Anton Vinogradov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Vinogradov updated IGNITE-16931:
--
Parent: IGNITE-15167
Issue Type: Sub-task  (was: Improvement)

> Read Repair should support unstable topology
> 
>
> Key: IGNITE-16931
> URL: https://issues.apache.org/jira/browse/IGNITE-16931
> Project: Ignite
>  Issue Type: Sub-task
>Reporter: Anton Vinogradov
>Assignee: Anton Vinogradov
>Priority: Major
>  Labels: iep-31
>
> Currently, RR does not support unstable topology (when not all owners can be 
> located via affinity), and this can be fixed.
> As a starting point, it's a good idea to consider replacing
> {noformat}
> for (KeyCacheObject key : keys) {
>     List<ClusterNode> nodes = ctx.affinity().nodesByKey(key, topVer); // affinity
>     primaryNodes.put(key, nodes.get(0));
>     ...
> {noformat}
> with
> {noformat}
> for (KeyCacheObject key : keys) {
>     List<ClusterNode> nodes = ctx.topology().nodes(key.partition(), topVer); // topology
>     primaryNodes.put(key, nodes.get(0));
>     ...
> {noformat}
> at 
> {{{}org.apache.ignite.internal.processors.cache.distributed.near.consistency.GridNearReadRepairAbstractFuture#map{}}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16239) [Extensions] Document the zookeeper-ip-finder-ext extension.

2022-05-05 Thread Amelchev Nikita (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amelchev Nikita updated IGNITE-16239:
-
Fix Version/s: 2.13

> [Extensions] Document the zookeeper-ip-finder-ext extension.
> 
>
> Key: IGNITE-16239
> URL: https://issues.apache.org/jira/browse/IGNITE-16239
> Project: Ignite
>  Issue Type: Task
>Reporter: Amelchev Nikita
>Priority: Minor
> Fix For: 2.13
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-16919) H2 Index cost function must take into account only corresponding columns.

2022-05-05 Thread Konstantin Orlov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532345#comment-17532345
 ] 

Konstantin Orlov commented on IGNITE-16919:
---

Hi, [~zstan]! The patch looks good to me. Could you please change the type of 
this ticket to "Improvement"? The old behaviour seems legitimate too.

> H2 Index cost function must take into account only corresponding columns.
> -
>
> Key: IGNITE-16919
> URL: https://issues.apache.org/jira/browse/IGNITE-16919
> Project: Ignite
>  Issue Type: Bug
>  Components: sql
>Affects Versions: 2.13
>Reporter: Evgeny Stanilovsky
>Assignee: Evgeny Stanilovsky
>Priority: Major
> Attachments: image-2022-04-30-19-13-59-997.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> H2IndexCostedBase#getCostRangeIndex is called with allColumnsSet, which 
> consists of columns from all tables participating in the query; check 
> org.h2.table.Plan#calculateCost :
> {code:java}
> final HashSet<Column> allColumnsSet = ExpressionVisitor
>     .allColumnsForTableFilters(allFilters);
> {code}
> Thus allColumnsSet contains columns from all the tables involved:
>  !image-2022-04-30-19-13-59-997.png! 
> and the erroneous iteration happens here, in 
> H2IndexCostedBase#getCostRangeIndex:
> ...
> {code:java}
> if (!isScanIndex && allColumnsSet != null && !skipColumnsIntersection && 
>     !allColumnsSet.isEmpty()) {
>     boolean foundAllColumnsWeNeed = true;
> 
>     for (Column c : allColumnsSet) { // <-- all columns, including other tables'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16668) Design in-memory raft group reconfiguration on node failure

2022-05-05 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-16668:
-
Description: 
If a node storing a partition of an in-memory table fails and leaves the 
cluster, all data it had is lost. From the point of view of the partition, it 
looks as if the node has left forever.

Although the Raft protocol tolerates losing some of the nodes composing a Raft 
group (partition), for in-memory caches we cannot restore the replication 
factor because of the in-memory nature of the table.

This means that we need to detect the failure of every node owning a partition 
and recalculate the assignments for the table without keeping the replication 
factor.
h4. Upd 1:
h4. Problem

By design, Raft has several persisted segments, e.g. the Raft meta 
(term/committedIndex) and the stable Raft log. So, by converting common Raft to 
an in-memory one, it's possible to break some of its invariants. For example, 
Node C could vote for Candidate A before a self-restart and then vote for 
Candidate B after it. As a result, two leaders would be elected, which is 
illegal.
 
!Screenshot from 2022-04-19 11-11-05.png!
 
h4. Solution

In order to solve the problem mentioned above, it's possible to remove the 
restarting node from the peers of the corresponding Raft group and then return 
it back. The peer-removal process should be finished before the corresponding 
Raft server node is restarted.
 
  !Screenshot from 2022-04-19 11-12-55.png!
 
The process of removing and then returning the restarting node is, however, 
itself tricky. To explain why it's a non-trivial action, it's necessary to 
outline the main ideas of the rebalance protocol.

Reconfiguration of the Raft group is a process driven by a change of the 
assignments. Each partition has three corresponding sets of assignments 
stored in the metastore:
 # assignments.stable - current distribution

 # assignments.pending - partition distribution for an ongoing rebalance if any

 # assignments.planned - in some cases it's not possible to cancel or merge a 
pending rebalance with a new one. In that case the newly calculated assignments 
are stored explicitly under the corresponding assignments.planned key. It's 
worth noting that it doesn't make sense to keep more than one planned 
rebalance: any newly scheduled one overwrites the already existing one.

However, this idea of overwriting the assignments.planned key won't work in the 
context of an in-memory Raft restart, because it's not valid to overwrite a 
reduction of assignments. Let's illustrate this problem with the following 
example.
 # In-memory partition p1 is hosted on nodes A, B and C, meaning that 
p1.assignments.stable=[A,B,C]

 # Let's say that the baseline was changed, resulting in a rebalance on 
assignments.pending=[A,B,C,*D*]

 # During the non-cancelable phase of [A,B,C]->[A,B,C,D], node C fails and 
returns back, meaning that we should plan both the [A,B,D] and the [A,B,C,D] 
assignments. Both would have to be recorded in the single assignments.planned 
key, meaning that [A,B,C,D] would overwrite the reduction [A,B,D], so no actual 
Raft reconfiguration would take place, which is not acceptable.

In order to overcome this issue, let's introduce two new keys: 
_assignments.switch.reduce_, which will hold the nodes that should be removed, 
and _assignments.switch.append_, which will hold the nodes that should be 
returned back, and run the following actions:
h5. On in-memory partition restart (or on partition start with cleaned-up PDS)

Within a retry loop, add the current node to the assignments.switch.reduce set:
{code:java}
do {
    retrievedAssignmentsSwitchReduce = metastorage.read(assignments.switch.reduce);
    calculatedAssignmentsSwitchReduce = union(retrievedAssignmentsSwitchReduce.value, currentNode);

    if (retrievedAssignmentsSwitchReduce.isEmpty()) {
        invokeRes = metastoreInvoke:
            if empty(assignments.switch.reduce)
                assignments.switch.reduce = calculatedAssignmentsSwitchReduce
    } else {
        invokeRes = metastoreInvoke:
            eq(revision(assignments.switch.reduce), retrievedAssignmentsSwitchReduce.revision)
                assignments.switch.reduce = calculatedAssignmentsSwitchReduce
    }
} while (!invokeRes);{code}
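
The retry above is essentially an optimistic compare-and-set against the 
metastore. The following plain-Java sketch restates it with made-up types 
(SimpleMetastore and Entry are illustrative only, not the real meta-storage 
API):
{code:java}
import java.util.HashSet;
import java.util.Set;

final class SwitchReduceUpdaterSketch {
    record Entry(Set<String> nodes, long revision) {}

    interface SimpleMetastore {
        Entry read(String key); // returns null if the key is absent

        /** Writes the value only if the key's revision still equals expectedRevision (-1 means "key absent"). */
        boolean compareAndSet(String key, long expectedRevision, Set<String> value);
    }

    static void addSelfToSwitchReduce(SimpleMetastore metastore, String key, String currentNode) {
        boolean done;

        do {
            Entry cur = metastore.read(key);                         // read current value + revision
            Set<String> updated = new HashSet<>(cur == null ? Set.of() : cur.nodes());
            updated.add(currentNode);                                // union(current value, self)

            long expected = cur == null ? -1 : cur.revision();
            done = metastore.compareAndSet(key, expected, updated);  // retry on a concurrent update
        } while (!done);
    }
}
{code}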
h5. On assignments.switch.reduce change on corresponding partition leader

Within the watch listener on the assignments.switch.reduce key on the 
corresponding partition leader, we trigger a new rebalance if there is no 
pending one.
{code:java}
calculatedAssignments = subtract(calcPartAssignments(), assignments.switch.reduce);

metastoreInvoke:
    if empty(partition.assignments.change.trigger.revision) || partition.assignments.change.trigger.revision < event.revision
        if empty(assignments.pending)
            assignments.pending = calculatedAssignments
            partition.assignments.change.trigger.revision = event.revision
{code}
h5. On rebalance done

changePeers() calls 

[jira] [Comment Edited] (IGNITE-16895) Update documentation with GitHub Actions

2022-05-05 Thread Amelchev Nikita (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532282#comment-17532282
 ] 

Amelchev Nikita edited comment on IGNITE-16895 at 5/5/22 2:48 PM:
--

Merged into the master.

[~mmuzaf], thank you for the review!


was (Author: nsamelchev):
Merged into the master.

> Update documentation with GitHub Actions
> 
>
> Key: IGNITE-16895
> URL: https://issues.apache.org/jira/browse/IGNITE-16895
> Project: Ignite
>  Issue Type: Task
>Reporter: Amelchev Nikita
>Assignee: Amelchev Nikita
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> GitHub Actions can be used to update documentation for released Ignite 
> versions.
> For now, this is complex manual work that requires understanding all the 
> intermediate steps: 
> [wiki|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=85461527#HowtoDocument-UpdatingPublishedDocs].
> I propose to automate this and provide the ability to update the documentation 
> on a push event to a released branch.
> ASF GitHub Actions Policy allows automated services to push changes related 
> to documentation: 
> [policy|https://infra.apache.org/github-actions-policy.html].
> Write access is 
> [required|https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow]
>  to run the update. So, only committers can run workflows manually.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16895) Update documentation with GitHub Actions

2022-05-05 Thread Amelchev Nikita (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amelchev Nikita updated IGNITE-16895:
-
Description: 
GitHub Actions can be used to update documentation for released Ignite versions.

For now, this is complex manual work that requires understanding all the 
intermediate steps: 
[wiki|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=85461527#HowtoDocument-UpdatingPublishedDocs].

I propose to automate this and provide the ability to update the documentation 
on a push event to a released branch.

ASF GitHub Actions Policy allows automated services to push changes related to 
documentation: [policy|https://infra.apache.org/github-actions-policy.html].

Write access is 
[required|https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow]
 to run the update. So, only committers can run workflows manually.

  was:
GitHub Actions can be used to update documentation for released Ignite versions.

For now, this is complex manual work that requires understanding all the 
intermediate steps: 
[wiki|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=85461527#HowtoDocument-UpdatingPublishedDocs].

I propose to automate this and provide the ability to update the documentation 
by a click (or on a push event to a released branch):

ASF GitHub Actions Policy allows automated services to push changes related to 
documentation: [policy|https://infra.apache.org/github-actions-policy.html].

Write access is 
[required|https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow]
 to run the update. So, only committers can run workflows manually.


> Update documentation with GitHub Actions
> 
>
> Key: IGNITE-16895
> URL: https://issues.apache.org/jira/browse/IGNITE-16895
> Project: Ignite
>  Issue Type: Task
>Reporter: Amelchev Nikita
>Assignee: Amelchev Nikita
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> GitHub Actions can be used to update documentation for released Ignite 
> versions.
> For now, this is complex manual work that requires understanding all the 
> intermediate steps: 
> [wiki|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=85461527#HowtoDocument-UpdatingPublishedDocs].
> I propose to automate this and provide the ability to update the documentation 
> on a push event to a released branch.
> ASF GitHub Actions Policy allows automated services to push changes related 
> to documentation: 
> [policy|https://infra.apache.org/github-actions-policy.html].
> Write access is 
> [required|https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow]
>  to run the update. So, only committers can run workflows manually.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16895) Update documentation with GitHub Actions

2022-05-05 Thread Amelchev Nikita (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amelchev Nikita updated IGNITE-16895:
-
Attachment: (was: image-2022-04-23-16-19-41-327.png)

> Update documentation with GitHub Actions
> 
>
> Key: IGNITE-16895
> URL: https://issues.apache.org/jira/browse/IGNITE-16895
> Project: Ignite
>  Issue Type: Task
>Reporter: Amelchev Nikita
>Assignee: Amelchev Nikita
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> GitHub Actions can be used to update documentation for released Ignite 
> versions.
> For now, this is a complex manual work that requires understanding all the 
> intermediate steps: 
> [wiki|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=85461527#HowtoDocument-UpdatingPublishedDocs].
> I propose to automatize this and give an ability to update documentation on a 
> push event to a released branch.
> ASF GitHub Actions Policy allows automated services to push changes related 
> to documentation: 
> [policy|https://infra.apache.org/github-actions-policy.html].
> Write access is 
> [required|https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow]
>  to run the update. So, only committers can run workflows manually.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16895) Update documentation with GitHub Actions

2022-05-05 Thread Amelchev Nikita (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amelchev Nikita updated IGNITE-16895:
-
Description: 
GitHub Actions can be used to update documentation for released Ignite versions.

For now, this is complex manual work that requires understanding all the 
intermediate steps: 
[wiki|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=85461527#HowtoDocument-UpdatingPublishedDocs].

I propose to automate this and provide the ability to update the documentation 
by a click (or on a push event to a released branch):

ASF GitHub Actions Policy allows automated services to push changes related to 
documentation: [policy|https://infra.apache.org/github-actions-policy.html].

Write access is 
[required|https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow]
 to run the update. So, only committers can run workflows manually.

  was:
GitHub Actions can be used to update documentation for released Ignite versions.

For now, this is complex manual work that requires understanding all the 
intermediate steps: 
[wiki|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=85461527#HowtoDocument-UpdatingPublishedDocs].

I propose to automate this and provide the ability to update the documentation 
by a click (or on a push event to a released branch):

!image-2022-04-23-16-19-41-327.png|width=260,height=216!

ASF GitHub Actions Policy allows automated services to push changes related to 
documentation: [policy|https://infra.apache.org/github-actions-policy.html].

Write access is 
[required|https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow]
 to run the update. So, only committers can run workflows manually.


> Update documentation with GitHub Actions
> 
>
> Key: IGNITE-16895
> URL: https://issues.apache.org/jira/browse/IGNITE-16895
> Project: Ignite
>  Issue Type: Task
>Reporter: Amelchev Nikita
>Assignee: Amelchev Nikita
>Priority: Major
> Attachments: image-2022-04-23-16-19-41-327.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> GitHub Actions can be used to update documentation for released Ignite 
> versions.
> For now, this is complex manual work that requires understanding all the 
> intermediate steps: 
> [wiki|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=85461527#HowtoDocument-UpdatingPublishedDocs].
> I propose to automate this and provide the ability to update the documentation 
> by a click (or on a push event to a released branch):
> ASF GitHub Actions Policy allows automated services to push changes related 
> to documentation: 
> [policy|https://infra.apache.org/github-actions-policy.html].
> Write access is 
> [required|https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow]
>  to run the update. So, only committers can run workflows manually.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-16908) Move ignite-hibernate modules to the Ignite Extension

2022-05-05 Thread Maxim Muzafarov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532266#comment-17532266
 ] 

Maxim Muzafarov commented on IGNITE-16908:
--

In addition to the pull request, the following branches were prepared:
https://github.com/apache/ignite-extensions/tree/release/ignite-hibernate-ext-5.1.0/modules/hibernate-ext
https://github.com/apache/ignite-extensions/tree/release/ignite-hibernate-ext-4.2.0/modules/hibernate-ext

> Move ignite-hibernate modules to the Ignite Extension
> -
>
> Key: IGNITE-16908
> URL: https://issues.apache.org/jira/browse/IGNITE-16908
> Project: Ignite
>  Issue Type: Task
>  Components: extensions
>Reporter: Maxim Muzafarov
>Assignee: Maxim Muzafarov
>Priority: Major
> Fix For: 2.14
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The following list of modules should be moved to the Extensions.
> - ignite-hibernate_4.2
> - ignite-hibernate_5.1
> - ignite-hibernate_5.3
> - ignite-hibernate-core (a common part for all hibernate modules)
> In detail:
> - remove all these modules from the Ignite project.
> - create ignite-hibernate extension.
> - move ignite-hibernate-core + ignite-hibernate_4.2 to
> release/ignite-hibernate-4.2.0 branch (the version of ignite-hibernate
> extension will be 4.2.0) and release it on demand;
> - move ignite-hibernate-core + ignite-hibernate_5.1 to
> release/ignite-hibernate-5.1.0 branch (the version of ignite-hibernate
> extension will be 5.1.0) and release it on demand;
> - move ignite-hibernate-core + ignite-hibernate_5.3 to the master
> branch and to the release/ignite-hibernate-5.3.0 branch (the version
> of ignite-hibernate extension will be 5.3.0) and release it
> immediately;



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16930) .NET: Thin 3.0: Implement Compute.ExecuteColocated

2022-05-05 Thread Pavel Tupitsyn (Jira)
Pavel Tupitsyn created IGNITE-16930:
---

 Summary: .NET: Thin 3.0: Implement Compute.ExecuteColocated
 Key: IGNITE-16930
 URL: https://issues.apache.org/jira/browse/IGNITE-16930
 Project: Ignite
  Issue Type: Improvement
  Components: platforms, thin client
Reporter: Pavel Tupitsyn
Assignee: Pavel Tupitsyn
 Fix For: 3.0.0-alpha5


Implement executeColocated without partition awareness (send the request using 
the default connection, let the server route it to the correct node). See 
IGNITE-16786 for a reference implementation.
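
The idea, sketched below with hypothetical types (none of these names are the 
actual Ignite 3 client API), is that the client does no partition mapping at 
all: the request always goes out on the default connection and the receiving 
server routes it to the node that owns the key:
{code:java}
import java.util.concurrent.CompletableFuture;

final class ExecuteColocatedSketch {
    interface ClientChannel {
        CompletableFuture<byte[]> send(int opCode, byte[] payload); // the client's default connection
    }

    static CompletableFuture<byte[]> executeColocated(ClientChannel defaultChannel,
                                                      int opExecuteColocated,
                                                      byte[] requestPayload) { // table id + key + job name + args
        // No partition awareness on the client side: the server that accepts the request
        // resolves the key's partition and forwards the job to the owning node.
        return defaultChannel.send(opExecuteColocated, requestPayload);
    }
}
{code}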



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16786) Thin 3.0: Implement ClientCompute#executeColocated()

2022-05-05 Thread Pavel Tupitsyn (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Tupitsyn updated IGNITE-16786:

Component/s: thin client
 (was: clients)

> Thin 3.0: Implement ClientCompute#executeColocated()
> 
>
> Key: IGNITE-16786
> URL: https://issues.apache.org/jira/browse/IGNITE-16786
> Project: Ignite
>  Issue Type: Improvement
>  Components: thin client
>Reporter: Roman Puchkovskiy
>Assignee: Pavel Tupitsyn
>Priority: Major
>  Labels: ignite-3
> Fix For: 3.0.0-alpha5
>
>
> Implement executeColocated without partition awareness (send the request 
> using the default connection, let the server route it to the correct node).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (IGNITE-16926) Interrupted compute job may fail a node

2022-05-05 Thread Ivan Bessonov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov reassigned IGNITE-16926:
--

Assignee: Ivan Bessonov

> Interrupted compute job may fail a node
> ---
>
> Key: IGNITE-16926
> URL: https://issues.apache.org/jira/browse/IGNITE-16926
> Project: Ignite
>  Issue Type: Bug
>  Components: persistence
>Reporter: Ivan Bessonov
>Assignee: Ivan Bessonov
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> Critical system error detected. Will be handled accordingly to configured 
> handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
> super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
> [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
> failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
> o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
> corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden 
> ","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
>  B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
> cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
> Row@79570772[ key: 1168930235, val: Data hidden due to 
> IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
> data hidden ]] at 
> org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492)
>  at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432)
>  at 
> org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500)
>  at 
> 

[jira] [Created] (IGNITE-16929) .NET: Thin 3.0: Implement sessions for .NET thin client

2022-05-05 Thread Igor Sapego (Jira)
Igor Sapego created IGNITE-16929:


 Summary: .NET: Thin 3.0: Implement sessions for .NET thin client
 Key: IGNITE-16929
 URL: https://issues.apache.org/jira/browse/IGNITE-16929
 Project: Ignite
  Issue Type: New Feature
  Components: platforms, thin client
Affects Versions: 3.0.0-alpha4
Reporter: Igor Sapego
 Fix For: 3.0.0-alpha5


Let's implement session support for the .NET client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16928) Thin 3.0: Implement sessions for Java client

2022-05-05 Thread Igor Sapego (Jira)
Igor Sapego created IGNITE-16928:


 Summary: Thin 3.0: Implement sessions for Java client
 Key: IGNITE-16928
 URL: https://issues.apache.org/jira/browse/IGNITE-16928
 Project: Ignite
  Issue Type: New Feature
  Components: platforms, thin client
Affects Versions: 3.0.0-alpha4
Reporter: Igor Sapego
 Fix For: 3.0.0-alpha5


Let's implement local sessions for the Java client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-5956) Ignite Continuous Query (Queries 3): IgniteCacheDistributedJoinPartitionedAndReplicatedTest fails

2022-05-05 Thread Evgeny Stanilovsky (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532180#comment-17532180
 ] 

Evgeny Stanilovsky commented on IGNITE-5956:


[~jooger] seems this is still an issue and needs to be fixed.

> Ignite Continuous Query (Queries 3): 
> IgniteCacheDistributedJoinPartitionedAndReplicatedTest fails
> -
>
> Key: IGNITE-5956
> URL: https://issues.apache.org/jira/browse/IGNITE-5956
> Project: Ignite
>  Issue Type: Bug
>  Components: sql
>Affects Versions: 2.1, 2.13
>Reporter: Sergey Chugunov
>Priority: Major
>  Labels: MakeTeamcityGreenAgain, test-failure
>
> Reproducible locally.
> May be broken by commit *70eed75422ea*.
> Fails with exception:
> {noformat}
> javax.cache.CacheException: Failed to execute query: for distributed join all 
> REPLICATED caches must be at the end of the joined tables list.
>   at 
> org.apache.ignite.internal.processors.query.h2.opt.GridH2CollocationModel.isCollocated(GridH2CollocationModel.java:704)
>   at 
> org.apache.ignite.internal.processors.query.h2.sql.GridSqlQuerySplitter.split(GridSqlQuerySplitter.java:239)
>   at 
> org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.queryDistributedSqlFields(IgniteH2Indexing.java:1309)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor$5.applyx(GridQueryProcessor.java:1804)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor$5.applyx(GridQueryProcessor.java:1802)
>   at 
> org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:2282)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor.querySqlFields(GridQueryProcessor.java:1809)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.query(IgniteCacheProxy.java:788)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.query(IgniteCacheProxy.java:758)
>   at 
> org.apache.ignite.testframework.junits.common.GridCommonAbstractTest.queryPlan(GridCommonAbstractTest.java:1650)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.checkQuery(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:389)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.checkQueries(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.join(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:283)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.testJoin2(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:197)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at junit.framework.TestCase.runTest(TestCase.java:176)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.runTestInternal(GridAbstractTest.java:1980)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.access$000(GridAbstractTest.java:131)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest$5.run(GridAbstractTest.java:1895)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-5956) Ignite Continuous Query (Queries 3): IgniteCacheDistributedJoinPartitionedAndReplicatedTest fails

2022-05-05 Thread Evgeny Stanilovsky (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evgeny Stanilovsky updated IGNITE-5956:
---
Component/s: sql

> Ignite Continuous Query (Queries 3): 
> IgniteCacheDistributedJoinPartitionedAndReplicatedTest fails
> -
>
> Key: IGNITE-5956
> URL: https://issues.apache.org/jira/browse/IGNITE-5956
> Project: Ignite
>  Issue Type: Bug
>  Components: sql
>Affects Versions: 2.1, 2.13
>Reporter: Sergey Chugunov
>Priority: Major
>  Labels: MakeTeamcityGreenAgain, test-failure
>
> Reproducible locally.
> May be broken by commit *70eed75422ea*.
> Fails with exception:
> {noformat}
> javax.cache.CacheException: Failed to execute query: for distributed join all 
> REPLICATED caches must be at the end of the joined tables list.
>   at 
> org.apache.ignite.internal.processors.query.h2.opt.GridH2CollocationModel.isCollocated(GridH2CollocationModel.java:704)
>   at 
> org.apache.ignite.internal.processors.query.h2.sql.GridSqlQuerySplitter.split(GridSqlQuerySplitter.java:239)
>   at 
> org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.queryDistributedSqlFields(IgniteH2Indexing.java:1309)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor$5.applyx(GridQueryProcessor.java:1804)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor$5.applyx(GridQueryProcessor.java:1802)
>   at 
> org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:2282)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor.querySqlFields(GridQueryProcessor.java:1809)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.query(IgniteCacheProxy.java:788)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.query(IgniteCacheProxy.java:758)
>   at 
> org.apache.ignite.testframework.junits.common.GridCommonAbstractTest.queryPlan(GridCommonAbstractTest.java:1650)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.checkQuery(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:389)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.checkQueries(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.join(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:283)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.testJoin2(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:197)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at junit.framework.TestCase.runTest(TestCase.java:176)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.runTestInternal(GridAbstractTest.java:1980)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.access$000(GridAbstractTest.java:131)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest$5.run(GridAbstractTest.java:1895)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-5956) Ignite Continuous Query (Queries 3): IgniteCacheDistributedJoinPartitionedAndReplicatedTest fails

2022-05-05 Thread Evgeny Stanilovsky (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evgeny Stanilovsky updated IGNITE-5956:
---
Affects Version/s: 2.13

> Ignite Continuous Query (Queries 3): 
> IgniteCacheDistributedJoinPartitionedAndReplicatedTest fails
> -
>
> Key: IGNITE-5956
> URL: https://issues.apache.org/jira/browse/IGNITE-5956
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.1, 2.13
>Reporter: Sergey Chugunov
>Priority: Major
>  Labels: MakeTeamcityGreenAgain, test-failure
>
> Reproducible locally.
> May be broken by commit *70eed75422ea*.
> Fails with exception:
> {noformat}
> javax.cache.CacheException: Failed to execute query: for distributed join all 
> REPLICATED caches must be at the end of the joined tables list.
>   at 
> org.apache.ignite.internal.processors.query.h2.opt.GridH2CollocationModel.isCollocated(GridH2CollocationModel.java:704)
>   at 
> org.apache.ignite.internal.processors.query.h2.sql.GridSqlQuerySplitter.split(GridSqlQuerySplitter.java:239)
>   at 
> org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.queryDistributedSqlFields(IgniteH2Indexing.java:1309)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor$5.applyx(GridQueryProcessor.java:1804)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor$5.applyx(GridQueryProcessor.java:1802)
>   at 
> org.apache.ignite.internal.util.lang.IgniteOutClosureX.apply(IgniteOutClosureX.java:36)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor.executeQuery(GridQueryProcessor.java:2282)
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor.querySqlFields(GridQueryProcessor.java:1809)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.query(IgniteCacheProxy.java:788)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheProxy.query(IgniteCacheProxy.java:758)
>   at 
> org.apache.ignite.testframework.junits.common.GridCommonAbstractTest.queryPlan(GridCommonAbstractTest.java:1650)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.checkQuery(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:389)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.checkQueries(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:364)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.join(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:283)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheDistributedJoinPartitionedAndReplicatedTest.testJoin2(IgniteCacheDistributedJoinPartitionedAndReplicatedTest.java:197)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at junit.framework.TestCase.runTest(TestCase.java:176)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.runTestInternal(GridAbstractTest.java:1980)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest.access$000(GridAbstractTest.java:131)
>   at 
> org.apache.ignite.testframework.junits.GridAbstractTest$5.run(GridAbstractTest.java:1895)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-16900) Add checkstyle LeftCurly rule

2022-05-05 Thread Nikolay Izhikov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532171#comment-17532171
 ] 

Nikolay Izhikov commented on IGNITE-16900:
--

The failures are unrelated.

Tests were broken by 
https://github.com/apache/ignite/commit/7357847369079925289114f650a506408812fe4c

> Add checkstyle LeftCurly rule
> -
>
> Key: IGNITE-16900
> URL: https://issues.apache.org/jira/browse/IGNITE-16900
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The Ignite code style specifies:
> > { starts on the same line as the opening block statement. For example:
> To enforce this, checkstyle has a LeftCurly rule.
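
For reference, a short illustrative example of the brace placement this rule 
enforces (a sketch only, not taken from the Ignite codebase):
{code:java}
public class BraceStyleExample {
    public static int sign(int x) {   // compliant: '{' on the same line as the method declaration
        if (x > 0) {                  // compliant: '{' on the same line as the 'if'
            return 1;
        }

        return x < 0 ? -1 : 0;
    }
}
{code}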



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-16900) Add checkstyle LeftCurly rule

2022-05-05 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532170#comment-17532170
 ] 

Ignite TC Bot commented on IGNITE-16900:


{panel:title=Branch: [pull//head] Base: [master] : Possible Blockers 
(2)|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}
{color:#d04437}Control Utility{color} [[tests 
2|https://ci2.ignite.apache.org/viewLog.html?buildId=6422754]]
* IgniteControlUtilityTestSuite: 
KillCommandsCommandShTest.testCancelConsistencyTask - Test has low fail rate in 
base branch 3,8% and is not flaky
* IgniteControlUtilityTestSuite: 
KillCommandsCommandShTest.testCancelComputeTask - Test has low fail rate in 
base branch 3,8% and is not flaky

{panel}
{panel:title=Branch: [pull//head] Base: [master] : No new tests 
found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}{panel}
[TeamCity *-- Run :: All* 
Results|https://ci2.ignite.apache.org/viewLog.html?buildId=6422624buildTypeId=IgniteTests24Java8_RunAll]

> Add checkstyle LeftCurly rule
> -
>
> Key: IGNITE-16900
> URL: https://issues.apache.org/jira/browse/IGNITE-16900
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Nikolay Izhikov
>Assignee: Nikolay Izhikov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The Ignite code style specifies:
> > { starts on the same line as the opening block statement. For example:
> To enforce this, checkstyle has a LeftCurly rule.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (IGNITE-16916) Make nodes more resilient in case of a job cancellation

2022-05-05 Thread Anton Vinogradov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532164#comment-17532164
 ] 

Anton Vinogradov edited comment on IGNITE-16916 at 5/5/22 10:14 AM:


Reopening because of failing tests


was (Author: av):
Reopening bacause of failing tests

> Make nodes more resilient in case of a job cancellation
> ---
>
> Key: IGNITE-16916
> URL: https://issues.apache.org/jira/browse/IGNITE-16916
> Project: Ignite
>  Issue Type: Task
>  Components: compute
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
> Fix For: 2.14
>
> Attachments: image-2022-05-05-12-46-26-543.png, screenshot-1.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In case of a job being cancelled we currently have a really questionable 
> approach.
> We are now setting the interruption flag even before we give a user a chance 
> to stop the job gracefully.
> Proposal for the implementation:
> * Adding a distributed property in the metastore that will set a timeout for 
> interrupting *GridJobWorker* that did not gracefully complete after calling 
> *GridJobWorker#cancel*;
> * On the call of the *GridJobWorker#cancel*, do not *Thread#interrupt* the 
> thread, but add *GridTimeoutObject*.
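
To make the proposal concrete, below is a rough sketch of the intended "graceful cancel first, interrupt only after a timeout" flow. It is not the actual GridJobWorker/GridTimeoutObject code: the worker class is hypothetical, and a plain ScheduledExecutorService stands in for the timeout processor that a GridTimeoutObject would normally be registered with.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical worker illustrating the proposed cancellation flow. */
class CancellableWorker implements Runnable {
    /** Stand-in for the timeout processor (a GridTimeoutObject would be registered in Ignite). */
    private final ScheduledExecutorService timeoutProcessor = Executors.newSingleThreadScheduledExecutor();

    /** Grace period; in the proposal it comes from a distributed metastore property. */
    private final long interruptTimeoutMs;

    private volatile boolean cancelled;

    private volatile Thread runner;

    CancellableWorker(long interruptTimeoutMs) {
        this.interruptTimeoutMs = interruptTimeoutMs;
    }

    @Override public void run() {
        runner = Thread.currentThread();

        // The job is expected to poll the flag and finish gracefully.
        while (!cancelled) {
            // ... do a unit of work ...
        }
    }

    /** Cancel: raise the flag first, interrupt only if the job is still alive after the timeout. */
    void cancel() {
        cancelled = true;

        timeoutProcessor.schedule(() -> {
            Thread t = runner;

            if (t != null && t.isAlive())
                t.interrupt(); // Last resort, once the grace period has expired.
        }, interruptTimeoutMs, TimeUnit.MILLISECONDS);
    }
}
{code}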



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-14341) Reduce contention in the PendingEntriesTree when cleaning up expired entries.

2022-05-05 Thread Pavel Pereslegin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-14341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Pereslegin updated IGNITE-14341:
--
Summary: Reduce contention in the PendingEntriesTree when cleaning up 
expired entries.  (was: Significant performance drop when entries expiring 
concurrently)

> Reduce contention in the PendingEntriesTree when cleaning up expired entries.
> -
>
> Key: IGNITE-14341
> URL: https://issues.apache.org/jira/browse/IGNITE-14341
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Aleksey Plekhanov
>Assignee: Pavel Pereslegin
>Priority: Major
>  Labels: ise
> Attachments: JmhCacheExpireBenchmark.java
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, there is a significant performance drop when expired entries are 
> concurrently evicted by threads that perform some actions with the cache (see 
> the attached reproducer):
> {noformat}
> Benchmark                                  Mode  Cnt     Score     Error  Units
> JmhCacheExpireBenchmark.putWithExpire     thrpt    3   100,132 ±  21,025  ops/ms
> JmhCacheExpireBenchmark.putWithoutExpire  thrpt    3  2133,122 ± 559,694  ops/ms{noformat}
> Root cause: the pending entries tree (an offheap BPlusTree) is used to track 
> expired entries. After each cache operation (and by the timeout thread) there 
> is an attempt to evict some amount of expired entries. These entries are 
> looked up from the start of the pending entries tree, so there is contention 
> on the first leaf page of that tree.
> All threads are waiting for the same page lock:
> {noformat}
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>   at 
> org.apache.ignite.internal.util.OffheapReadWriteLock.waitAcquireWriteLock(OffheapReadWriteLock.java:503)
>   at 
> org.apache.ignite.internal.util.OffheapReadWriteLock.writeLock(OffheapReadWriteLock.java:244)
>   at 
> org.apache.ignite.internal.pagemem.impl.PageMemoryNoStoreImpl.writeLock(PageMemoryNoStoreImpl.java:528)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writeLock(PageHandler.java:422)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writePage(PageHandler.java:350)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.DataStructure.write(DataStructure.java:325)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$13200(BPlusTree.java:100)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Remove.doRemoveFromLeaf(BPlusTree.java:4588)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Remove.removeFromLeaf(BPlusTree.java:4567)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Remove.tryRemoveFromLeaf(BPlusTree.java:5196)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Remove.access$6800(BPlusTree.java:4209)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:2189)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:2165)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:2165)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:2076)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removex(BPlusTree.java:1905)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.expireInternal(IgniteCacheOffheapManagerImpl.java:1426)
>   at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.expire(IgniteCacheOffheapManagerImpl.java:1375)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:246)
>   at 
> org.apache.ignite.internal.processors.cache.GridCacheUtils.unwindEvicts(GridCacheUtils.java:882){noformat}
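
For context, a rough sketch of the kind of load the benchmark applies (hedged: this is not the attached JmhCacheExpireBenchmark.java, and the class name and parameter values below are made up). Each put through a cache view with a short CreatedExpiryPolicy also triggers unwinding of expired entries from the pending tree, which is where the contention shows up:

{code:java}
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;
import org.openjdk.jmh.annotations.*;

/** Hypothetical reproducer sketch: concurrent puts with and without a short TTL. */
@State(Scope.Benchmark)
public class ExpireContentionSketch {
    private Ignite ignite;

    private IgniteCache<Integer, Integer> cache;

    /** View of the same cache where every created entry expires 100 ms after creation. */
    private IgniteCache<Integer, Integer> expiringCache;

    @Setup
    public void setup() {
        ignite = Ignition.start();

        cache = ignite.getOrCreateCache(new CacheConfiguration<Integer, Integer>("bench"));

        expiringCache = cache.withExpiryPolicy(
            new CreatedExpiryPolicy(new Duration(TimeUnit.MILLISECONDS, 100)));
    }

    @TearDown
    public void tearDown() {
        ignite.close();
    }

    @Benchmark
    @Threads(8)
    public void putWithExpire() {
        expiringCache.put(ThreadLocalRandom.current().nextInt(100_000), 42);
    }

    @Benchmark
    @Threads(8)
    public void putWithoutExpire() {
        cache.put(ThreadLocalRandom.current().nextInt(100_000), 42);
    }
}
{code}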



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (IGNITE-16916) Make nodes more resilient in case of a job cancellation

2022-05-05 Thread Anton Vinogradov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532160#comment-17532160
 ] 

Anton Vinogradov edited comment on IGNITE-16916 at 5/5/22 9:54 AM:
---

[~ktkale...@gridgain.com], [~sergeychugunov] 
Looks like {{KillCommandsCommandShTest.testCancelComputeTask}} is 
[broken|https://ci.ignite.apache.org/test/-8103382042071142009?currentProjectId=IgniteTests24Java8=%3Cdefault%3E=true]
 now

!screenshot-1.png!

 

 

As far as I can see, you never checked these changes :( 

 

!image-2022-05-05-12-46-26-543.png!


was (Author: av):
[~ktkale...@gridgain.com]
Looks like {{KillCommandsCommandShTest.testCancelComputeTask}} is 
[broken|https://ci.ignite.apache.org/test/-8103382042071142009?currentProjectId=IgniteTests24Java8=%3Cdefault%3E=true]
 now

!screenshot-1.png!

 

 

As far as I can see, you never checked these changes :( 

 

!image-2022-05-05-12-46-26-543.png!

> Make nodes more resilient in case of a job cancellation
> ---
>
> Key: IGNITE-16916
> URL: https://issues.apache.org/jira/browse/IGNITE-16916
> Project: Ignite
>  Issue Type: Task
>  Components: compute
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
> Fix For: 2.14
>
> Attachments: image-2022-05-05-12-46-26-543.png, screenshot-1.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In case of a job being cancelled we currently have a really questionable 
> approach.
> We are now setting the interruption flag even before we give a user a chance 
> to stop the job gracefully.
> Proposal for the implementation:
> * Adding a distributed property in the metastore that will set a timeout for 
> interrupting *GridJobWorker* that did not gracefully complete after calling 
> *GridJobWorker#cancel*;
> * On the call of the *GridJobWorker#cancel*, do not *Thread#interrupt* the 
> thread, but add *GridTimeoutObject*.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16927) [Extensions] Fix scope of spring-data-commons dependency.

2022-05-05 Thread Mikhail Petrov (Jira)
Mikhail Petrov created IGNITE-16927:
---

 Summary: [Extensions] Fix scope of spring-data-commons dependency.
 Key: IGNITE-16927
 URL: https://issues.apache.org/jira/browse/IGNITE-16927
 Project: Ignite
  Issue Type: Bug
Reporter: Mikhail Petrov


Currently the scope of the spring-data-commons dependency for extensions is 
`compile`, which means that the extensions depend on a hardcoded version of 
spring-data-commons. We should change it to `provided` to avoid releasing 
spring-data-ext for each spring-data-commons version.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Reopened] (IGNITE-16916) Make nodes more resilient in case of a job cancellation

2022-05-05 Thread Anton Vinogradov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Vinogradov reopened IGNITE-16916:
---

Reopening bacause of failing tests

> Make nodes more resilient in case of a job cancellation
> ---
>
> Key: IGNITE-16916
> URL: https://issues.apache.org/jira/browse/IGNITE-16916
> Project: Ignite
>  Issue Type: Task
>  Components: compute
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
> Fix For: 2.14
>
> Attachments: image-2022-05-05-12-46-26-543.png, screenshot-1.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In case of a job being cancelled we currently have a really questionable 
> approach.
> We are now setting the interruption flag even before we give a user a chance 
> to stop the job gracefully.
> Proposal for the implementation:
> * Adding a distributed property in the metastore that will set a timeout for 
> interrupting *GridJobWorker* that did not gracefully complete after calling 
> *GridJobWorker#cancel*;
> * On the call of the *GridJobWorker#cancel*, do not *Thread#interrupt* the 
> thread, but add *GridTimeoutObject*.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (IGNITE-16916) Make nodes more resilient in case of a job cancellation

2022-05-05 Thread Anton Vinogradov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532160#comment-17532160
 ] 

Anton Vinogradov edited comment on IGNITE-16916 at 5/5/22 9:46 AM:
---

[~ktkale...@gridgain.com]
Looks like {{KillCommandsCommandShTest.testCancelComputeTask}} is 
[broken|https://ci.ignite.apache.org/test/-8103382042071142009?currentProjectId=IgniteTests24Java8=%3Cdefault%3E=true]
 now

!screenshot-1.png!

 

 

As far as I can see, you never checked these changes :( 

 

!image-2022-05-05-12-46-26-543.png!


was (Author: av):
[~ktkale...@gridgain.com]
Looks like {{KillCommandsCommandShTest.testCancelComputeTask}} is 
[broken|https://ci.ignite.apache.org/test/-8103382042071142009?currentProjectId=IgniteTests24Java8=%3Cdefault%3E=true]
 now

!screenshot-1.png!

> Make nodes more resilient in case of a job cancellation
> ---
>
> Key: IGNITE-16916
> URL: https://issues.apache.org/jira/browse/IGNITE-16916
> Project: Ignite
>  Issue Type: Task
>  Components: compute
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
> Fix For: 2.14
>
> Attachments: image-2022-05-05-12-46-26-543.png, screenshot-1.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In case of a job being cancelled we currently have a really questionable 
> approach.
> We are now setting the interruption flag even before we give a user a chance 
> to stop the job gracefully.
> Proposal for the implementation:
> * Adding a distributed property in the metastore that will set a timeout for 
> interrupting *GridJobWorker* that did not gracefully complete after calling 
> *GridJobWorker#cancel*;
> * On the call of the *GridJobWorker#cancel*, do not *Thread#interrupt* the 
> thread, but add *GridTimeoutObject*.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (IGNITE-16916) Make nodes more resilient in case of a job cancellation

2022-05-05 Thread Anton Vinogradov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532160#comment-17532160
 ] 

Anton Vinogradov edited comment on IGNITE-16916 at 5/5/22 9:44 AM:
---

[~ktkale...@gridgain.com]
Looks like {{KillCommandsCommandShTest.testCancelComputeTask}} is 
[broken|https://ci.ignite.apache.org/test/-8103382042071142009?currentProjectId=IgniteTests24Java8=%3Cdefault%3E=true]
 now

!screenshot-1.png!


was (Author: av):
[~ktkale...@gridgain.com]
Looks like {{KillCommandsCommandShTest.testCancelComputeTask}} is broken now

 !screenshot-1.png! 

> Make nodes more resilient in case of a job cancellation
> ---
>
> Key: IGNITE-16916
> URL: https://issues.apache.org/jira/browse/IGNITE-16916
> Project: Ignite
>  Issue Type: Task
>  Components: compute
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
> Fix For: 2.14
>
> Attachments: screenshot-1.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In case of a job being cancelled we currently have a really questionable 
> approach.
> We are now setting the interruption flag even before we give a user a chance 
> to stop the job gracefully.
> Proposal for the implementation:
> * Adding a distributed property in the metastore that will set a timeout for 
> interrupting *GridJobWorker* that did not gracefully complete after calling 
> *GridJobWorker#cancel*;
> * On the call of the *GridJobWorker#cancel*, do not *Thread#interrupt* the 
> thread, but add *GridTimeoutObject*.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (IGNITE-16916) Make nodes more resilient in case of a job cancellation

2022-05-05 Thread Anton Vinogradov (Jira)


[ 
https://issues.apache.org/jira/browse/IGNITE-16916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532160#comment-17532160
 ] 

Anton Vinogradov commented on IGNITE-16916:
---

[~ktkale...@gridgain.com]
Looks like {{KillCommandsCommandShTest.testCancelComputeTask}} is broken now

 !screenshot-1.png! 

> Make nodes more resilient in case of a job cancellation
> ---
>
> Key: IGNITE-16916
> URL: https://issues.apache.org/jira/browse/IGNITE-16916
> Project: Ignite
>  Issue Type: Task
>  Components: compute
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
> Fix For: 2.14
>
> Attachments: screenshot-1.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In case of a job being cancelled we currently have a really questionable 
> approach.
> We are now setting the interruption flag even before we give a user a chance 
> to stop the job gracefully.
> Proposal for the implementation:
> * Adding a distributed property in the metastore that will set a timeout for 
> interrupting *GridJobWorker* that did not gracefully complete after calling 
> *GridJobWorker#cancel*;
> * On the call of the *GridJobWorker#cancel*, do not *Thread#interrupt* the 
> thread, but add *GridTimeoutObject*.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16916) Make nodes more resilient in case of a job cancellation

2022-05-05 Thread Anton Vinogradov (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Vinogradov updated IGNITE-16916:
--
Attachment: screenshot-1.png

> Make nodes more resilient in case of a job cancellation
> ---
>
> Key: IGNITE-16916
> URL: https://issues.apache.org/jira/browse/IGNITE-16916
> Project: Ignite
>  Issue Type: Task
>  Components: compute
>Reporter: Kirill Tkalenko
>Assignee: Kirill Tkalenko
>Priority: Major
> Fix For: 2.14
>
> Attachments: screenshot-1.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In case of a job being cancelled we currently have a really questionable 
> approach.
> We are now setting the interruption flag even before we give a user a chance 
> to stop the job gracefully.
> Proposal for the implementation:
> * Adding a distributed property in the metastore that will set a timeout for 
> interrupting *GridJobWorker* that did not gracefully complete after calling 
> *GridJobWorker#cancel*;
> * On the call of the *GridJobWorker#cancel*, do not *Thread#interrupt* the 
> thread, but add *GridTimeoutObject*.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (IGNITE-16926) Interrupted compute job may fail a node

2022-05-05 Thread Ivan Bessonov (Jira)
Ivan Bessonov created IGNITE-16926:
--

 Summary: Interrupted compute job may fail a node
 Key: IGNITE-16926
 URL: https://issues.apache.org/jira/browse/IGNITE-16926
 Project: Ignite
  Issue Type: Bug
  Components: persistence
Reporter: Ivan Bessonov


{code:java}
Critical system error detected. Will be handled accordingly to configured 
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, 
super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet 
[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], 
failureCtx=FailureContext [type=CRITICAL_ERROR, err=class 
o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is 
corrupted [groupId=1234619879, pageIds=[7290201467513], cacheId=645096946, 
cacheName=*, indexName=*, msg=Runtime failure on row: Row@79570772[ key: 
1168930235, val: Data hidden due to IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden, data hidden, data hidden, data 
hidden, data hidden, data hidden, data hidden 
","logger_name":"ROOT","thread_name":"pub-#1278%x%","level":"ERROR","level_value":4,"stack_trace":"org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException:
 B+Tree is corrupted [groupId=1234619879, pageIds=[7290201467513], 
cacheId=645096946, cacheName=*, indexName=*, msg=Runtime failure on row: 
Row@79570772[ key: 1168930235, val: Data hidden due to 
IGNITE_SENSITIVE_DATA_LOGGING flag. ][ data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden, data hidden, data hidden, data hidden, data hidden, data hidden, 
data hidden ]] at 
org.apache.ignite.internal.processors.query.h2.database.H2Tree.corruptedTreeException(H2Tree.java:1003)
 at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doPut(BPlusTree.java:2492)
 at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.putx(BPlusTree.java:2432)
 at 
org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.putx(H2TreeIndex.java:500)
 at 
org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.addToIndex(GridH2Table.java:880)
 at 
org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:794)
 at 
org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.store(IgniteH2Indexing.java:411)
 at 
org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:2546)
 at 

[jira] [Updated] (IGNITE-16801) Implement error handling for rebalance

2022-05-05 Thread Vyacheslav Koptilin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vyacheslav Koptilin updated IGNITE-16801:
-
   Epic Link: IGNITE-14209
Ignite Flags:   (was: Docs Required,Release Notes Required)

> Implement error handling for rebalance 
> ---
>
> Key: IGNITE-16801
> URL: https://issues.apache.org/jira/browse/IGNITE-16801
> Project: Ignite
>  Issue Type: Task
>Reporter: Kirill Gusakov
>Priority: Major
>  Labels: ignite-3
>
> We have the listener `onReconfigurationError` for handling errors during the 
> rebalance, but no implementation yet.
> At the moment, it looks like we can receive only one kind of error: 
> `RaftError.ECATCHUP`



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (IGNITE-16668) Design in-memory raft group reconfiguration on node failure

2022-05-05 Thread Alexander Lapin (Jira)


 [ 
https://issues.apache.org/jira/browse/IGNITE-16668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Lapin updated IGNITE-16668:
-
Description: 
If a node storing a partition of an in-memory table fails and leaves the 
cluster, all data it had is lost. From the point of view of the partition, it 
looks as if the node has left forever.

Although the Raft protocol tolerates losing some of the nodes composing a Raft 
group (partition), for in-memory caches we cannot restore the replica factor 
because of the in-memory nature of the table.

It means that we need to detect failures of each node owning a partition and 
recalculate assignments for the table without keeping the replica factor.
h4. Upd 1:
h4. Problem

By design, Raft has several persisted segments, e.g. the raft meta 
(term/committedIndex) and the stable raft log. So, by converting common Raft to 
an in-memory one, it is possible to break some of its invariants. For example, 
Node C could vote for Candidate A before a self-restart and then vote for 
Candidate B after it. As a result, two leaders would be elected, which is illegal.
 
!Screenshot from 2022-04-19 11-11-05.png!
 
h4. Solution

In order to solve the problem mentioned above, it is possible to remove the 
restarting node from the peers of the corresponding Raft group and then return 
it back. The peer-removal process should be finished before the restart of the 
corresponding Raft server node.
 
  !Screenshot from 2022-04-19 11-12-55.png!
 
The process of removing and then returning the restarting node is, however, 
itself tricky. To explain why this is a non-trivial action, it is necessary to 
outline the main ideas of the rebalance protocol.

Reconfiguration of the Raft group is a process driven by changes to the 
assignments. Each partition has three corresponding sets of assignments 
stored in the metastore:
 # assignments.stable - current distribution

 # assignments.pending - partition distribution for an ongoing rebalance if any

 # assignments.planned - in some cases it is not possible to cancel or merge a 
pending rebalance with a new one. In that case the newly calculated assignments 
will be stored explicitly under the corresponding assignments.planned key. It's 
worth noting that it doesn't make sense to keep more than one planned 
rebalance: any newly scheduled one overwrites the already existing one.

However, this idea of overwriting the assignments.planned key won't work in 
the context of an in-memory Raft restart, because it is not valid to overwrite 
the reduction of assignments. Let's illustrate this problem with the following 
example.
 # In-memory partition p1 is hosted on nodes A, B and C, meaning that 
p1.assignments.stable=[A,B,C]

 # Let's say that the baseline was changed, resulting in a rebalance on 
assignments.pending=[A,B,C,*D*]

 # During the non-cancelable phase of [A,B,C]->[A,B,C,D], node C fails and 
returns back, meaning that we should plan the [A,B,D] and [A,B,C,D] assignments. 
Both must be recorded in the single assignments.planned key, meaning that 
[A,B,C,D] will overwrite the reduction [A,B,D], so no actual Raft 
reconfiguration will take place, which is not acceptable.

In order to overcome this issue, let's introduce two new keys: 
_assignments.switch.reduce_, which will hold the nodes that should be removed, 
and _assignments.switch.append_, which will hold the nodes that should be 
returned back, and run the following actions:
h5. On in-memory partition restart (or on partition start with cleaned-up PDS)

Within a retry loop, add the current node to the assignments.switch.reduce set:
{code:java}
do {
    retrievedAssignmentsSwitchReduce = metastorage.read(assignments.switch.reduce);

    calculatedAssignmentsSwitchReduce = union(retrievedAssignmentsSwitchReduce.value, currentNode);

    if (retrievedAssignmentsSwitchReduce.isEmpty()) {
        invokeRes = metastoreInvoke:
            if empty(assignments.switch.reduce)
                assignments.switch.reduce = calculatedAssignmentsSwitchReduce
    } else {
        invokeRes = metastoreInvoke:
            if eq(revision(assignments.switch.reduce), retrievedAssignmentsSwitchReduce.revision)
                assignments.switch.reduce = calculatedAssignmentsSwitchReduce
    }
} while (!invokeRes);{code}
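For readability, the same retry loop expressed as plain Java against a hypothetical metastore client. The Metastore, Entry and conditionalWrite names below are made up for this sketch and do not correspond to the actual Ignite 3 metastorage API; the point is only the read / recompute / conditional-write / retry shape of the loop.

{code:java}
import java.util.HashSet;
import java.util.Set;

/** Hypothetical metastore entry: a value plus the revision it was read at. */
record Entry(Set<String> value, long revision) {}

/** Hypothetical metastore client used only for this sketch. */
interface Metastore {
    /** Returns the current entry for the key, or null if the key is absent. */
    Entry read(String key);

    /** Writes the value only if the key is still at expectedRevision (-1 means the key must be absent). */
    boolean conditionalWrite(String key, Set<String> value, long expectedRevision);
}

class SwitchReduceUpdater {
    /** Adds currentNode to the assignments.switch.reduce set with a compare-and-set retry loop. */
    static void addToSwitchReduce(Metastore metastore, String switchReduceKey, String currentNode) {
        boolean invokeRes;

        do {
            Entry retrieved = metastore.read(switchReduceKey);

            Set<String> calculated = new HashSet<>(retrieved == null ? Set.of() : retrieved.value());
            calculated.add(currentNode);

            long expectedRevision = retrieved == null ? -1 : retrieved.revision();

            // Succeeds only if nobody changed the key since it was read; otherwise re-read and retry.
            invokeRes = metastore.conditionalWrite(switchReduceKey, calculated, expectedRevision);
        } while (!invokeRes);
    }
}
{code}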
h5. On assignments.switch.reduce change on corresponding partition leader

Within the watch listener on the assignments.switch.reduce key on the 
corresponding partition leader, we trigger a new rebalance if there is no 
pending one.
{code:java}
calculatedAssignments = subtract(calcPartAssignments(), assignments.switch.reduce);

metastoreInvoke:
    if empty(partition.assignments.change.trigger.revision) || partition.assignments.change.trigger.revision < event.revision
        if empty(assignments.pending)
            assignments.pending = calculatedAssignments
            partition.assignments.change.trigger.revision = event.revision
{code}
h5. On rebalance done

changePeers() calls 
