[ https://issues.apache.org/jira/browse/IGNITE-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071590#comment-17071590 ]
Sergey Antonov commented on IGNITE-12845:
-----------------------------------------

[~alex_pl] I didn't find any {{Set#contains(Object)}} usages in {{sun.nio.ch.SelectorImpl}} in JDK 8 (1.8.0_191).

> GridNioServer can infinitely lose some events
> ----------------------------------------------
>
>                 Key: IGNITE-12845
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12845
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Aleksey Plekhanov
>            Priority: Major
>
> With the selector optimization enabled ({{IGNITE_NO_SELECTOR_OPTS = false}},
> the default), {{GridNioServer}} can lose some events for a channel (depending
> on JDK version and OS). This can cause connected applications to hang.
> Reproducer:
> {code:java}
> public void testConcurrentLoad() throws Exception {
>     startGrid(0);
>
>     try (IgniteClient client = Ignition.startClient(new
>         ClientConfiguration().setAddresses("127.0.0.1:10800"))) {
>         ClientCache<Integer, Integer> cache =
>             client.getOrCreateCache(DEFAULT_CACHE_NAME);
>
>         // Five threads concurrently put 1000 entries each through one client.
>         GridTestUtils.runMultiThreaded(
>             () -> {
>                 for (int i = 0; i < 1000; i++)
>                     cache.put(i, i);
>             }, 5, "run-async");
>     }
> }
> {code}
> This reproducer eventually hangs on macOS (tested with JDK 8, 11, 12, 13,
> 14) and on some Linux environments (for example, it passed more than 100
> times on a desktop Linux system with JDK 8, but hangs on TeamCity agents
> with JDK 8 and 11). It never hung on a Windows system (passed more than 100
> times). It passes on all systems and JDK versions when the system property
> {{IGNITE_NO_SELECTOR_OPTS = true}} is set.
>
> The root cause: the optimized {{SelectedSelectionKeySet}} always returns
> {{false}} from its {{contains()}} method.
> The {{contains()}} method is used by the
> {{sun.nio.ch.SelectorImpl.processReadyEvents()}} method:
> {code:java}
> if (selectedKeys.contains(ski)) {
>     if (ski.translateAndUpdateReadyOps(rOps)) {
>         return 1;
>     }
> } else {
>     ski.translateAndSetReadyOps(rOps);
>     if ((ski.nioReadyOps() & ski.nioInterestOps()) != 0) {
>         selectedKeys.add(ski);
>         return 1;
>     }
> }
> {code}
> So, with a fair implementation, if a selection key is already contained in
> the selected keys set, its ready-operation flags are updated. With
> {{SelectedSelectionKeySet}}, however, the ready-operation flags are always
> overwritten and the selection key is added again even if it is already in
> the set.
> Some {{SelectorImpl}} implementations can pass several events for one
> selection key to the {{processReadyEvents}} method (for example, the macOS
> implementation {{KQueueSelectorImpl}} works this way). In this case,
> duplicate selection keys are added to {{selectedKeys}} and all events
> except the last are lost.
> Two bad things happen in {{GridNioServer}} as a result:
> # Some event flags are lost and the worker doesn't perform the corresponding
> action (in the attached reproducer, the "channel is ready for reading" event
> is lost and the workers never read the channel after some point in time).
> # Duplicate selection keys with the same event flags are added. In the
> attached reproducer this is the "channel is ready for writing" event. The
> duplication leads to wrong processing of the
> {{GridSelectorNioSessionImpl#procWrite}} flag, which will be {{false}} in
> some cases while the selection key's {{interestedOps}} still contains
> {{OP_WRITE}}, and this operation is never excluded.
> Possible solutions:
> * A fair implementation of the {{SelectedSelectionKeySet.contains}} method
> (this solves all the problems but can be resource consuming).
> * Always set {{GridSelectorNioSessionImpl#procWrite}} to {{true}} when
> adding {{OP_WRITE}} to {{interestedOps}} (for example, in the
> {{AbstractNioClientWorker.registerWrite()}} method). In this case, some
> "channel is ready for reading" events (but not data) can still be lost, but
> not infinitely, and eventually the data will be read.
> * Exclude {{OP_WRITE}} from {{interestedOps}} even if
> {{GridSelectorNioSessionImpl#procWrite}} is {{false}} when there are no write
> requests in the queue (see the {{GridNioServer.stopPollingForWrite()}}
> method). This solution has the same shortcomings as the previous one.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
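
The lost-event mechanism described in the issue can be illustrated without touching any JDK internals. The following is a minimal sketch, not the Ignite or JDK code: {{SelectedKeysDemo}}, {{FakeKey}}, {{AlwaysFalseSet}}, {{processReadyEvent}} and {{deliverReadThenWrite}} are hypothetical names introduced here, and {{processReadyEvent}} is a simplified paraphrase of the quoted {{processReadyEvents}} branch (it omits the interest-ops check). It delivers two events for the same key, first to a {{HashSet}} and then to a set whose {{contains()}} always returns {{false}}.

```java
import java.nio.channels.SelectionKey;
import java.util.AbstractSet;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class SelectedKeysDemo {
    /** Hypothetical stand-in for a selection key: holds only the ready-ops bit mask. */
    public static final class FakeKey {
        int readyOps;
    }

    /** Hypothetical list-backed set that never reports membership, as the issue describes. */
    public static final class AlwaysFalseSet extends AbstractSet<FakeKey> {
        private final List<FakeKey> keys = new ArrayList<>();

        @Override public boolean add(FakeKey key) { return keys.add(key); }
        @Override public boolean contains(Object o) { return false; }
        @Override public Iterator<FakeKey> iterator() { return keys.iterator(); }
        @Override public int size() { return keys.size(); }
    }

    /** Simplified paraphrase of the branch quoted in the issue (assumption: no interest-ops check). */
    public static void processReadyEvent(Set<FakeKey> selectedKeys, FakeKey key, int rOps) {
        if (selectedKeys.contains(key))
            key.readyOps |= rOps;   // Key already selected: merge the new flags.
        else {
            key.readyOps = rOps;    // Key treated as new: earlier flags are overwritten.
            selectedKeys.add(key);
        }
    }

    /** Delivers OP_READ and then OP_WRITE for one key; returns {set size, final readyOps}. */
    public static int[] deliverReadThenWrite(Set<FakeKey> selectedKeys) {
        FakeKey key = new FakeKey();

        processReadyEvent(selectedKeys, key, SelectionKey.OP_READ);
        processReadyEvent(selectedKeys, key, SelectionKey.OP_WRITE);

        return new int[] {selectedKeys.size(), key.readyOps};
    }

    public static void main(String[] args) {
        int[] fair = deliverReadThenWrite(new HashSet<>());
        int[] broken = deliverReadThenWrite(new AlwaysFalseSet());

        System.out.println("fair set:         size=" + fair[0] + ", readyOps=" + fair[1]);
        System.out.println("always-false set: size=" + broken[0] + ", readyOps=" + broken[1]);
    }
}
```

With the {{HashSet}}, the key appears once and its ready ops contain both {{OP_READ}} and {{OP_WRITE}}. With the always-false set, the key appears twice and only {{OP_WRITE}} survives, which matches the two symptoms listed in the issue (a lost read event and a duplicated key).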