Denis Chudov created IGNITE-12760:
-------------------------------------
Summary: Prevent AssertionError on message unmarshalling, when
classLoaderId contains id of node that already left
Key: IGNITE-12760
URL: https://issues.apache.org/jira/browse/IGNITE-12760
Project: Ignite
Issue Type: Bug
Reporter: Denis Chudov
Assignee: Denis Chudov
The following assertion error triggers the failure handler and crashes the
node. It can possibly crash the whole cluster.
{code:java}
2020-02-18
14:34:09.775\[ERROR]\[query-#146129%DPL_GRID%DplGridNodeName%]\[o.a.i.i.p.cache.GridCacheIoManager]
Failed to process message \[senderId=727757ed-4ad4-4779-bda9-081525725cce,
msg=GridCacheQueryRequest \[id=178,
cacheName=com.sbt.tokenization.data.entity.KEKEntity_DPL_union-module,
type=SCAN, fields=false, clause=null, clsName=null, keyValFilter=null,
rdc=null, trans=null, pageSize=1024, incBackups=false, cancel=false,
incMeta=false, all=false, keepBinary=true,
subjId=727757ed-4ad4-4779-bda9-081525725cce, taskHash=0, part=-1,
topVer=AffinityTopologyVersion \[topVer=97, minorTopVer=0], sendTimestamp=-1,
receiveTimestamp=-1, super=GridCacheIdMessage \[cacheId=-1129073400,
super=GridCacheMessage \[msgId=179, depInfo=GridDeploymentInfoBean
\[clsLdrId=c32670e3071-d30ee64b-0833-45d4-abbe-fb6282669caa, depMode=SHARED,
userVer=0, locDepOwner=false, participants=null],
lastAffChangedTopVer=AffinityTopologyVersion \[topVer=8, minorTopVer=6],
err=null, skipPrepare=false]]]]
java.lang.AssertionError: null
at
org.apache.ignite.internal.processors.cache.GridCacheDeploymentManager$CachedDeploymentInfo.<init>(GridCacheDeploymentManager.java:918)
at
org.apache.ignite.internal.processors.cache.GridCacheDeploymentManager$CachedDeploymentInfo.<init>(GridCacheDeploymentManager.java:889)
at
org.apache.ignite.internal.processors.cache.GridCacheDeploymentManager.p2pContext(GridCacheDeploymentManager.java:422)
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.unmarshall(GridCacheIoManager.java:1576)
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:584)
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:386)
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:312)
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:102)
at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:301)
at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1565)
at
org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1189)
at
org.apache.ignite.internal.managers.communication.GridIoManager.access$4300(GridIoManager.java:130)
at
org.apache.ignite.internal.managers.communication.GridIoManager$8.run(GridIoManager.java:1092)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){code}
There is no reliable reproducer for now, but it seems that we should prevent
this situation in general, as follows:
1) Check the correctness of the message before it is sent, inside
GridCacheDeploymentManager#prepare. If the corresponding class loader is
present on the local node, we can try to fix the message by replacing the
wrong class loader with the local one.
2) Log suspicious deployments that we receive from
GridDeploymentManager#deploy; we may have obsolete deployments in caches.
3) Possibly remove this assertion. We should have this class on the sender
node and use it as the class loader id; if we don't, we will receive an
exception on finishUnmarshall (Failed to peer load class) and can try to
handle this situation with GridCacheIoManager#processFailedMessage.
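
Proposals 1 and 2 could look roughly like the sketch below. It does not use
Ignite's actual classes: LoaderId, resolveBeforeSend, the set of alive node ids
and the map of local loaders are all hypothetical stand-ins, and the real check
would have to live inside GridCacheDeploymentManager#prepare and use whatever
discovery and deployment state is available there.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.UUID;
import java.util.logging.Logger;

/** Hypothetical sketch: validate a class loader id before a message is sent. */
final class LoaderIdCheck {
    private static final Logger LOG = Logger.getLogger("LoaderIdCheck");

    /** Stand-in for a class loader id: owning node id plus a node-local counter. */
    static final class LoaderId {
        final UUID ownerNodeId;
        final long localId;

        LoaderId(UUID ownerNodeId, long localId) {
            this.ownerNodeId = ownerNodeId;
            this.localId = localId;
        }
    }

    /**
     * Returns the loader id that is safe to put into an outgoing message.
     *
     * @param candidate    Loader id taken from the (possibly obsolete) deployment.
     * @param aliveNodes   Ids of nodes currently in the topology (assumed input).
     * @param localLoaders Loader ids owned by the local node, keyed by class name (assumed input).
     * @param clsName      Name of the class being deployed.
     * @return The candidate if its owner is alive, else the local loader id if one exists, else empty.
     */
    static Optional<LoaderId> resolveBeforeSend(
        LoaderId candidate,
        Set<UUID> aliveNodes,
        Map<String, LoaderId> localLoaders,
        String clsName
    ) {
        // Owner is still in the topology: nothing to fix.
        if (aliveNodes.contains(candidate.ownerNodeId))
            return Optional.of(candidate);

        // Proposal 2: make the obsolete deployment visible in the logs.
        LOG.warning("Suspicious deployment: class loader owner has left [cls=" + clsName
            + ", ownerNodeId=" + candidate.ownerNodeId + ']');

        // Proposal 1: fall back to the local class loader if we have one for this class.
        return Optional.ofNullable(localLoaders.get(clsName));
    }
}
```

An empty result means the sender has neither a live owner nor a local loader
for the class, so the caller would still need to decide whether to fail the
send or let the receiver handle it as described in proposal 3.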