[ 
https://issues.apache.org/jira/browse/IGNITE-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325806#comment-17325806
 ] 

Denis Chudov commented on IGNITE-14474:
---------------------------------------

[~Smolnikov] I added a few comments to the PR; please address them.

Also, it would be nice to add log message checks to the tests in 
GridCacheRebalancingUnmarshallingFailedSelfTest.
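
To illustrate the kind of check meant here, below is a minimal standalone sketch. It uses plain java.util.logging rather than Ignite's logger or test framework, so it only demonstrates the idea of capturing log output and asserting on level and text. The class name, the logger name, the stand-in method and the message wording are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Standalone sketch: capture log records and assert on their level and text. */
public class LogCheckSketch {
    /** Handler that keeps every published record in memory. */
    static final class CapturingHandler extends Handler {
        private final List<LogRecord> records = new ArrayList<>();

        @Override public synchronized void publish(LogRecord rec) {
            records.add(rec);
        }

        @Override public void flush() {
            // Nothing is buffered.
        }

        @Override public void close() {
            // No resources to release.
        }

        /** Returns true if a record with the given level contains the fragment. */
        synchronized boolean logged(Level lvl, String fragment) {
            for (LogRecord r : records) {
                String msg = r.getMessage();

                if (r.getLevel().equals(lvl) && msg != null && msg.contains(fragment))
                    return true;
            }

            return false;
        }
    }

    /** Hypothetical stand-in for the code under test; the wording is an assumption. */
    static void simulateFailedRebalance(Logger log) {
        // java.util.logging has no ERROR level; SEVERE is its closest equivalent.
        log.log(Level.SEVERE,
            "Rebalancing failed [grp=ignite-sys-cache]: supply message couldn't be unmarshalled");
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("rebalance-sketch");
        log.setUseParentHandlers(false); // Keep the sketch's output quiet.

        CapturingHandler hnd = new CapturingHandler();
        log.addHandler(hnd);

        simulateFailedRebalance(log);

        if (!hnd.logged(Level.SEVERE, "Rebalancing failed"))
            throw new AssertionError("Expected an error-level rebalance failure message");

        System.out.println("Log check passed");
    }
}
```

In the real test the capturing part would be replaced by whatever log-listening helper the Ignite test framework provides; the assertion on level and message fragment stays the same in spirit.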

> Improve error message in case rebalance fails
> ---------------------------------------------
>
>                 Key: IGNITE-14474
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14474
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.5
>            Reporter: Denis Chudov
>            Assignee: Rodion
>            Priority: Major
>             Fix For: 2.9.2
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently we can get a message like this when rebalance fails with an 
> exception (examples are from Ignite 2.5; in newer versions the log messages 
> were changed, but the problem still exists):
> {code:java}
> 2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] 
> Rebalancing from node cancelled [grp=ignite-sys-cache, 
> topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
> supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message 
> couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
> unmarshal object with optimized marshaller
> 2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] 
> Cancelled rebalancing [grp=ignite-sys-cache, 
> supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion 
> [topVer=1932, minorTopVer=1], time=88 ms]
> 2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] 
> Rebalancing from node cancelled [grp=ignite-sys-cache, 
> topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
> supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message 
> couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
> unmarshal object with optimized marshaller
> {code}
> In the case above, a marshalling exception leads to a rebalance failure that 
> will never be resolved, i.e. the cluster enters an erroneous state.
> We should report issues like this as ERROR. The message should explain that 
> the rebalance has failed, that data for the cache was not fully copied to the 
> node, that the backup factor has not been restored, and that the cluster may 
> not work correctly.
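
As a rough illustration of the message content requested in the description above, here is a small sketch that assembles such a text. The class name, the method signature and the exact wording are hypothetical and are not taken from the Ignite code base or from the PR.

```java
/** Hypothetical helper: builds the text of an ERROR-level rebalance failure message. */
public class RebalanceFailureMessage {
    /**
     * @param grp Cache group name.
     * @param supplierId Supplier node ID.
     * @param cause Exception that interrupted the rebalance.
     * @return Message naming the failure, its consequences and its cause.
     */
    static String build(String grp, String supplierId, Throwable cause) {
        return "Rebalancing failed [grp=" + grp + ", supplier=" + supplierId + "]. "
            + "Data for the cache group was not fully copied to this node, "
            + "the backup factor is not restored and the cluster may not work correctly. "
            + "Cause: " + cause;
    }

    public static void main(String[] args) {
        Exception cause = new Exception("Failed to unmarshal object with optimized marshaller");

        System.out.println(build("ignite-sys-cache", "f014f30a-77f2-4459-aa5b-6c12907a7449", cause));
    }
}
```

Keeping the consequences in the same message as the cause means an operator reading a single ERROR line can see both what went wrong and why it matters.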



