[ 
https://issues.apache.org/jira/browse/IGNITE-28804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandr Chesnokov updated IGNITE-28804:
-----------------------------------------
    Description: 
GridCacheContinuousQueryMultiNodesFilteringTest#testWithNodeFilter from the
Continuous Queries 4 suite is flaky:
[https://ci2.ignite.apache.org/test/4649996366513987762?currentProjectId=IgniteTests24Java8&branch=%3Cdefault%3E|https://ci2.ignite.apache.org/test/-3060366423503188690?currentProjectId=IgniteTests24Java8]

On my local machine it takes about 6 runs to reproduce the failure.

The test fails with "Timeout of waiting for topology map update" on the second
awaitPartitionMapExchange call.

 

UPD: The root cause is that the test used {{ClusterNode.id()}} in the node
filter. This worked for regular test nodes, because their UUIDs end with
{{0}}, {{1}}, {{2}}. But during baseline affinity calculation Ignite can also
call the filter for a {{DetachedClusterNode}}, a special representation of a
baseline node whose UUID is random. As a result, the filtered-out {{grid2}}
could sometimes pass the regex in its detached form.
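
The effect can be illustrated with a small standalone sketch that uses only the
JDK and no Ignite classes. The regex, the class and method names, and the way
test node IDs are built here are assumptions made for illustration; they are
not taken from the test itself.

```java
import java.util.Random;
import java.util.UUID;
import java.util.regex.Pattern;

class UuidFilterSketch {
    // Hypothetical regex: accept only nodes whose ID string ends in 0 or 1.
    static final Pattern ID_FILTER = Pattern.compile(".*[01]$");

    /** Mimics a node filter that decides from the node UUID alone. */
    static boolean passesById(UUID nodeId) {
        return ID_FILTER.matcher(nodeId.toString()).matches();
    }

    /** Assumed deterministic test node ID: its string form ends with idx (for idx 0..9). */
    static UUID testNodeId(int idx) {
        return new UUID(0L, idx);
    }

    /** Counts how many of n pseudo-random UUIDs pass the ID-based filter. */
    static int countRandomPasses(int n, long seed) {
        Random rnd = new Random(seed);
        int passed = 0;

        for (int i = 0; i < n; i++) {
            if (passesById(new UUID(rnd.nextLong(), rnd.nextLong())))
                passed++;
        }

        return passed;
    }

    public static void main(String[] args) {
        // Deterministic IDs: the filter behaves as the test intended.
        for (int idx = 0; idx < 3; idx++)
            System.out.println("grid" + idx + " passes: " + passesById(testNodeId(idx)));

        // Random IDs: some of them end in 0 or 1 and slip through.
        int n = 100_000;
        System.out.println("random IDs passing: " + countRandomPasses(n, 42L) + " of " + n);
    }
}
```

Since the last hex digit of a random UUID is 0 or 1 in about 2 cases out of 16,
a node that the filter is meant to reject can occasionally be accepted when it
is presented under a random ID.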

Ignite then mapped this detached node back to the real {{grid2}} by
{{consistentId}}, so affinity expected {{grid2}} to own some partitions.
But the cache was never started on the real {{grid2}}, because the real node
did not pass the filter. So {{awaitPartitionMapExchange()}} kept seeing
{{affNodesCnt=2}} and {{ownersCnt=1}} until it timed out.

The fix is to use the stable {{ATTR_IGNITE_INSTANCE_NAME}} attribute instead
of the runtime node UUID in the filter.
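
A minimal sketch of the idea, again with plain JDK classes rather than Ignite
ones. The {{Node}} class, the attribute key and the regex are hypothetical
stand-ins, and the sketch assumes that the detached representation of a node
carries the same attributes as the real node.

```java
import java.util.Collections;
import java.util.Map;
import java.util.UUID;
import java.util.regex.Pattern;

class AttrFilterSketch {
    // Stand-in key for ATTR_IGNITE_INSTANCE_NAME (hypothetical value, not the real constant).
    static final String INSTANCE_NAME_ATTR = "instance.name";

    // Hypothetical regex: accept only nodes whose instance name ends in 0 or 1.
    static final Pattern NAME_FILTER = Pattern.compile(".*[01]$");

    /** Minimal stand-in for a cluster node: a runtime ID plus node attributes. */
    static final class Node {
        final UUID id;
        final Map<String, Object> attrs;

        Node(UUID id, String instanceName) {
            this.id = id;
            this.attrs = Collections.<String, Object>singletonMap(INSTANCE_NAME_ATTR, instanceName);
        }
    }

    /** Filter that looks only at the stable instance name and ignores the runtime ID. */
    static boolean passesByName(Node node) {
        Object name = node.attrs.get(INSTANCE_NAME_ATTR);

        return name != null && NAME_FILTER.matcher(name.toString()).matches();
    }

    public static void main(String[] args) {
        Node real = new Node(new UUID(0L, 2L), "grid2");
        Node detached = new Node(UUID.randomUUID(), "grid2"); // same name, random ID

        // Both representations get the same answer, whatever the random ID is.
        System.out.println("real grid2 passes:     " + passesByName(real));
        System.out.println("detached grid2 passes: " + passesByName(detached));
    }
}
```

Because the decision no longer depends on the UUID, the real and the detached
representation of the same node always receive the same filter result.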

  was:
Flaky GridCacheContinuousQueryMultiNodesFilteringTest#testWithNodeFilter from 
Continuous Queries 4 suite is flaky: 
[https://ci2.ignite.apache.org/test/4649996366513987762?currentProjectId=IgniteTests24Java8&branch=%3Cdefault%3E|https://ci2.ignite.apache.org/test/-3060366423503188690?currentProjectId=IgniteTests24Java8]

For my local machine it requires about 6 runs to reproduce the bug 

Fails with "Timeout of waiting for topology map update" on second 
awaitPartitionMapExchange


> Flaky GridCacheContinuousQueryMultiNodesFilteringTest#testWithNodeFilter
> ------------------------------------------------------------------------
>
>                 Key: IGNITE-28804
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28804
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksandr Chesnokov
>            Assignee: Aleksandr Chesnokov
>            Priority: Minor
>              Labels: MakeTeamcityGreenAgain, ise
>          Time Spent: 50m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
