[ https://issues.apache.org/jira/browse/HADOOP-12189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14615900#comment-14615900 ]
Chris Li commented on HADOOP-12189: ----------------------------------- Hi [~zxu], I see, it's a synchronized map, so its methods acquire the queue's intrinsic lock. Thanks for clarifying I have concerns with performance. Have you measured impact? Last I checked, acquiring a lock every time resulted in measurable performance penalty. It's not just time under contention, it's also the fact that synchronized will introduce a memory barrier, and having 120 threads contesting over yet another lock when processing 60k requests per second sounds questionable. I think it all comes down to performance. The unsafe "sleep 10ms and try again" scheme was born from that (since queue swapping is a relatively infrequent event compared to {{put()}}. If the performance situation has changed since we last discussed this, then we can come up with better solutions that are guarantee consistency. So options are: 1. Start locking and provide strict consistency (and then we don't need to Thread.sleep) 2. Try to increase sleep timeout or increase number of checks to make call loss practically impossible 3. Relax tests and accept the loss of calls during queue swaps [~daryn] might have some thoughts. > CallQueueManager may drop elements from the queue sometimes when calling > swapQueue > ---------------------------------------------------------------------------------- > > Key: HADOOP-12189 > URL: https://issues.apache.org/jira/browse/HADOOP-12189 > Project: Hadoop Common > Issue Type: Bug > Components: ipc, test > Affects Versions: 2.7.1 > Reporter: zhihai xu > Assignee: zhihai xu > Attachments: HADOOP-12189.000.patch > > > CallQueueManager may drop elements from the queue sometimes when calling > {{swapQueue}}. > The following test failure from TestCallQueueManager shown some elements in > the queue are dropped. > https://builds.apache.org/job/PreCommit-HADOOP-Build/7150/testReport/org.apache.hadoop.ipc/TestCallQueueManager/testSwapUnderContention/ > {code} > java.lang.AssertionError: expected:<27241> but was:<27245> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.ipc.TestCallQueueManager.testSwapUnderContention(TestCallQueueManager.java:220) > {code} > It looked like the elements in the queue are dropped due to > {{CallQueueManager#swapQueue}} > Looked at the implementation of {{CallQueueManager#swapQueue}}, there is a > possibility that the elements in the queue are dropped. If the queue is full, > the calling thread for {{CallQueueManager#put}} is blocked for long time. It > may put the element into the old queue after queue in {{takeRef}} is changed > by swapQueue, then this element in the old queue will be dropped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)