[ 
https://issues.apache.org/jira/browse/IGNITE-14474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338838#comment-17338838
 ] 

Rodion Smolnikov commented on IGNITE-14474:
-------------------------------------------

[~slava.koptilin] hello, cound you review this PR?

> Improve error message in case rebalance fails
> ---------------------------------------------
>
>                 Key: IGNITE-14474
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14474
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.5
>            Reporter: Denis Chudov
>            Assignee: Rodion Smolnikov
>            Priority: Major
>             Fix For: 2.9.2
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently we can get a message like this when rebalance fails with an 
> exception (examples from ignite 2.5, in newer versions the log messages were 
> changed but the problem is still actual):
> {code:java}
> 2019-11-27 13:41:14,504[WARN ][utility-#79%xxx%][GridDhtPartitionDemander] 
> Rebalancing from node cancelled [grp=ignite-sys-cache, 
> topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
> supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topic=0]. Supply message 
> couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
> unmarshal object with optimized marshaller
> 2019-11-27 13:41:14,504[INFO ][utility-#79%xxx%][GridDhtPartitionDemander] 
> Cancelled rebalancing [grp=ignite-sys-cache, 
> supplier=f014f30a-77f2-4459-aa5b-6c12907a7449, topVer=AffinityTopologyVersion 
> [topVer=1932, minorTopVer=1], time=88 ms]
> 2019-11-27 13:41:14,508[WARN ][utility-#76%xxx%][GridDhtPartitionDemander] 
> Rebalancing from node cancelled [grp=ignite-sys-cache, 
> topVer=AffinityTopologyVersion [topVer=1932, minorTopVer=1], 
> supplier=dfa5ee06-48c9-4458-ae55-48cc6ceda998, topic=0]. Supply message 
> couldn't be unmarshalled: class o.a.i.IgniteCheckedException: Failed to 
> unmarshal object with optimized marshaller
> {code}
> In the case above, a marshalling exception leads to rebalance failure which 
> will never be resolved - i.e. the cluster enters into a erroneous state.
> We should report issues like this as ERROR. The message should explain that 
> the rebalance has failed, data for the cache was not fully copied to the 
> node, the backup factor is not recovered and the cluster may not work 
> correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to