[
https://issues.apache.org/jira/browse/HELIX-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444513#comment-16444513
]
ASF GitHub Bot commented on HELIX-690:
--------------------------------------
Github user asfgit closed the pull request at:
https://github.com/apache/helix/pull/181
> Batch message should not share same NotificationContext object to update
> CurrentState
> -------------------------------------------------------------------------------------
>
> Key: HELIX-690
> URL: https://issues.apache.org/jira/browse/HELIX-690
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Hao Zhang
> Priority: Major
>
> Currently, batch messages have two bugs:
> 1. Batch messages trigger many duplicated state transition messages sent
> from the controller, resulting in "state does not match" errors on the
> participant side. This in turn creates many ERROR znodes in ZK, which adds
> to both the read and write workload on the participant and the controller.
> 2. We also see many concurrent update exceptions:
> {noformat}
> [2018-03-30 18:59:55,025] [ERROR] [pool-1-thread-1917] [org.apache.helix.messaging.handling.HelixTask:113] - Exception while executing a message. java.util.ConcurrentModificationException msgId: fbdc37d4-ec95-47cb-950c-f9d3d224bbb3 type: STATE_TRANSITION
> java.util.ConcurrentModificationException
>   at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115)
>   at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169)
>   at org.apache.helix.ZNRecord.merge(ZNRecord.java:497)
>   at org.apache.helix.GroupCommit.commit(GroupCommit.java:121)
>   at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:182)
>   at org.apache.helix.manager.zk.ZKHelixDataAccessor.updateProperty(ZKHelixDataAccessor.java:170)
>   at org.apache.helix.messaging.handling.BatchMessageHandler.postHandleMessage(BatchMessageHandler.java:118)
>   at org.apache.helix.messaging.handling.BatchMessageHandler.handleMessage(BatchMessageHandler.java:203)
>   at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:96)
> {noformat}
> Both errors are caused by the fact that in HelixTaskExecutor, all
> HelixTask objects from the same batch of messages share the same
> changeContext object. For a batch message, HelixTask creates a current state
> update map to record current state updates, which results in a race
> condition in current state recording. With this bug, it is common that a
> resource's current state changes on the participant side but is not updated
> in ZK; after the message is removed, the controller still thinks that the
> state transition is not finished and sends a duplicate state transition
> message.
>
> The error is only triggered when the load is high, so it is not covered by
> our unit / e2e tests.
> To fix the issue, we should create a deep copy of the NotificationContext
> object for each HelixTask in HelixTaskExecutor. I tried this fix using large
> data sets, and it worked.
>
>
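The fix described above can be illustrated with a minimal sketch. It does not use the real Helix classes: SketchContext, SketchTask and createTasks are hypothetical stand-ins for NotificationContext, HelixTask and the task-creation step in HelixTaskExecutor, and the "ONLINE" target state is only an example. The point it shows is that each task receives its own copy of the context, so no two tasks write to the same update map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BatchContextSketch {

    // Hypothetical stand-in for NotificationContext (not the Helix class).
    static final class SketchContext {
        // TreeMap is not thread-safe, so it must not be shared across tasks.
        final Map<String, String> currentStateUpdates = new TreeMap<>();

        // Keys and values are immutable Strings, so copying the map
        // is enough to make the copy fully independent.
        SketchContext deepCopy() {
            SketchContext copy = new SketchContext();
            copy.currentStateUpdates.putAll(currentStateUpdates);
            return copy;
        }
    }

    // Hypothetical stand-in for HelixTask: records one state update.
    static final class SketchTask implements Runnable {
        private final SketchContext context;
        private final String partition;
        private final String toState;

        SketchTask(SketchContext context, String partition, String toState) {
            this.context = context;
            this.partition = partition;
            this.toState = toState;
        }

        @Override
        public void run() {
            context.currentStateUpdates.put(partition, toState);
        }

        SketchContext context() {
            return context;
        }
    }

    // One private context copy per task instead of one shared object.
    static List<SketchTask> createTasks(SketchContext batchContext,
                                        List<String> partitions) {
        List<SketchTask> tasks = new ArrayList<>();
        for (String partition : partitions) {
            tasks.add(new SketchTask(batchContext.deepCopy(), partition, "ONLINE"));
        }
        return tasks;
    }

    public static void main(String[] args) {
        SketchContext batchContext = new SketchContext();
        List<SketchTask> tasks =
            createTasks(batchContext, Arrays.asList("p0", "p1", "p2"));
        for (SketchTask task : tasks) {
            task.run();
        }
        for (SketchTask task : tasks) {
            System.out.println(task.context().currentStateUpdates);
        }
    }
}
```

In this sketch each task ends up with exactly one entry in its own map and the original batch context stays empty. How the per-task updates are then merged and written to ZK is outside the scope of the sketch.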
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)