Konstantin Orlov created IGNITE-25363:
-----------------------------------------
Summary: Sql. Delayed NODE_LEFT event processing may cause query
to hang
Key: IGNITE-25363
URL: https://issues.apache.org/jira/browse/IGNITE-25363
Project: Ignite
Issue Type: Bug
Components: sql ai3
Reporter: Konstantin Orlov
This problem is highlighted by the test
{{org.apache.ignite.internal.runner.app.ItDataSchemaSyncTest#checkSchemasCorrectlyRestore}},
which sometimes fails on TC with a timeout. The sequence of events is as follows:
# Given: cluster of 3 nodes, distribution zone spans all these nodes.
# Node 1 has been restarted.
# Notification of
{{org.apache.ignite.internal.network.TopologyEventHandler#onDisappeared}}
handlers is delayed on node 2 (due to metastorage lag or some other reason).
# A query is started from node 1.
# The root fragment is processed locally, and {{QueryBatchRequest}} arrives at node 2 before
{{QueryStartRequest}}. This step is crucial since it puts an incomplete future
into the mailbox registry
({{org.apache.ignite.internal.sql.engine.exec.MailboxRegistryImpl#locals}}).
# The {{TopologyEventHandler}}s are notified on node 2. This step causes the
{{onNodeLeft}} handler to be chained to the future from the previous step.
# {{QueryStartRequest}} arrives at node 2. The query fragment is created and
immediately closed by the {{onNodeLeft}} handler.
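The last three steps can be replayed in isolation with plain {{CompletableFuture}}s. The following is only a minimal sketch of the ordering problem, not the actual Ignite code: {{Fragment}}, the local {{locals}} map and the fragment key are hypothetical stand-ins, and the only thing assumed about the real registry is what is described above (an incomplete future is registered when a batch request outruns the start request).

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class DelayedNodeLeftSketch {
    /** Hypothetical stand-in for a query fragment running on node 2. */
    static final class Fragment {
        volatile boolean closed;

        void close() {
            closed = true;
        }
    }

    /** Replays the last three steps and reports whether the new fragment ended up closed. */
    static boolean replay() {
        // Hypothetical stand-in for the registry of fragments awaited locally.
        Map<String, CompletableFuture<Fragment>> locals = new ConcurrentHashMap<>();
        String fragmentKey = "query-1/fragment-1";

        // The batch request arrives first, so an incomplete future is registered.
        CompletableFuture<Fragment> pending =
                locals.computeIfAbsent(fragmentKey, k -> new CompletableFuture<>());

        // The delayed "node left" notification chains a close action to that future.
        pending.thenAccept(Fragment::close);

        // The start request arrives: the fragment is created and the future completes,
        // which runs the chained close action on the freshly created fragment.
        Fragment fragment = new Fragment();
        pending.complete(fragment);

        return fragment.closed;
    }

    public static void main(String[] args) {
        System.out.println("fragment closed right after start: " + replay());
    }
}
```

Because the close action is attached before the future completes, it runs as soon as the fragment is created, regardless of whether the query was started before or after the node restart.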
The problem is that the {{onNodeLeft}} handler is applied to a query that was
started on a topology which already takes the node restart into account. We have
to ignore such outdated events.
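One possible way to recognize an outdated event is to compare versions. The sketch below assumes, purely for illustration, that every node-left event carries the topology version at which the node disappeared and that every query remembers the topology version it was started on. Neither assumption is confirmed by this ticket, and {{NodeLeftEvent}}, {{QueryInfo}} and {{shouldCloseFragment}} are hypothetical names, not existing Ignite APIs.

```java
public class StaleNodeLeftFilterSketch {
    /** Hypothetical event: node name plus the topology version at which the node left. */
    static final class NodeLeftEvent {
        final String nodeName;
        final long topologyVersion;

        NodeLeftEvent(String nodeName, long topologyVersion) {
            this.nodeName = nodeName;
            this.topologyVersion = topologyVersion;
        }
    }

    /** Hypothetical query descriptor: initiator node and the topology version the query started on. */
    static final class QueryInfo {
        final String initiatorNode;
        final long topologyVersion;

        QueryInfo(String initiatorNode, long topologyVersion) {
            this.initiatorNode = initiatorNode;
            this.topologyVersion = topologyVersion;
        }
    }

    /**
     * Returns true only if the event concerns the query's initiator and is newer than
     * the topology the query was started on. Older events describe a departure that
     * the query's topology already accounts for, so they are ignored.
     */
    static boolean shouldCloseFragment(NodeLeftEvent event, QueryInfo query) {
        if (!event.nodeName.equals(query.initiatorNode)) {
            return false;
        }
        return event.topologyVersion > query.topologyVersion;
    }

    public static void main(String[] args) {
        // Node 1 left at version 3, rejoined, and then started a query on version 5.
        QueryInfo query = new QueryInfo("node1", 5L);

        System.out.println(shouldCloseFragment(new NodeLeftEvent("node1", 3L), query));
        System.out.println(shouldCloseFragment(new NodeLeftEvent("node1", 6L), query));
    }
}
```

With such a check, the delayed notification from the restart would be dropped for the new query, while a genuine later departure of the initiator would still close the fragment.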