+ controller-dev

Regards,
Venkat G
From: Vinh Nguyen
Sent: Wednesday, March 15, 2017 2:43 PM
To: openflowplugin-dev@lists.opendaylight
Cc: Tai, Hideyuki <hideyuki....@necam.com>; Venkatrangan G - ERS, HCL Tech <venkatrang...@hcl.com>
Subject: CLOSE_WAIT connections in ODL in scalability testing

I have encountered the following problem and wonder if anyone has any idea.

Problem:
lsof on the ODL process shows tons of CLOSE_WAIT connections when the OVS node repeatedly fails to reconnect to the ODL controller port. Eventually ODL throws "Too many open files" as the number of CLOSE_WAIT connections piles up and exceeds the maximum allowed number of file descriptors. This problem only happens when we enable TRACE logging in ODL during scalability testing.

Reproduction steps:
1) One control node, one ODL node, running Boron.
2) On the ODL node, enable TRACE level logging for netvirt, openflowplugin, and openflowjava.
3) From the control node, use a script to define 100 networks/subnetworks in a loop.
4) At around the 20th-50th network creation, OVS starts to disconnect from the ODL openflowplugin port due to the inactivity probe. The inactivity on the ODL side is likely because ODL spends most of its time in logging activity (see step 2). This problem does not happen if we do not enable TRACE level logging.
5) Subsequently, OVS repeatedly tries to reconnect to the ODL openflowplugin port without success.
6) Steps 4) and 5) repeat for about half an hour, with no CLOSE_WAIT connections appearing in the lsof output. ODL's karaf log shows normal entries for connections being established and closed.
7) After some failed reconnection attempts as mentioned in 6), the subsequent connection attempts result in CLOSE_WAIT connections as shown by lsof:

java 10407 odluser 383u IPv6 77653 0t0 TCP odl2.c.my-odl.internal:6653->control2.c.my-odl.internal:40232 (CLOSE_WAIT)
java 10407 odluser 401u IPv6 79949 0t0 TCP odl2.c.my-odl.internal:6653->control2.c.my-odl.internal:40236 (CLOSE_WAIT)
.....

When 7) above happens, the following are observed:
1) The cluster service did not send ownership changes for the last disconnection.
2) When a new connection arrives, LifecycleServiceImpl calls ClusterSingletonServiceRegistration#registerClusterSingletonService. Here it does not call serviceGroup.initializationClusterSingletonGroup(), since the serviceGroup was not cleaned up properly because of 1).
3) LifecycleServiceImpl then calls ClusterSingletonServiceGroupImpl.registerService.
4) The call in 3) hangs at:
   LifecycleServiceImpl::instantiateServiceInstance
   DeviceContextImpl::onContextInstantiateService
   DeviceInitializationUtils.initializeNodeInformation
   DeviceInitializationUtils.createDeviceFeaturesForOF13(deviceContext, switchFeaturesMandatory, convertorExecutor).get(); <---
5) The call in 3) is invoked in a netty worker thread that handles I/O for ODL.
6) The call in 3) holds a Semaphore on ClusterSingletonServiceGroupImpl::clusterLock.
7) Subsequent incoming requests from the same OVS also result in invoking ClusterSingletonServiceGroupImpl.registerService.
8) The requests in 7) hang forever waiting for the Semaphore on ClusterSingletonServiceGroupImpl::clusterLock (acquired in 6)).
9) The requests in 7) are also handled by netty worker threads.
10) As reconnection requests keep coming, netty eventually runs out of worker threads to handle new connections.
11) Subsequent incoming and closing connections from OVS result in CLOSE_WAIT connections, since netty has no threads left to handle them.

Sorry for the long email, but does anyone have any idea about this issue, or has anyone encountered similar issues?
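To make the failure mode in 4)-11) concrete, here is a minimal, self-contained sketch of the same pattern in plain Java (hypothetical names; it does not use the actual OpenFlowPlugin or netty classes): a small fixed pool of "I/O" threads, a shared Semaphore, and a blocking get() on a future that never completes. Once one worker is parked in get() while holding the lock and the others are parked on acquire(), no thread is left to process new work, which is the analogue of the piled-up CLOSE_WAIT sockets.

import java.util.concurrent.*;

/** Minimal sketch (not ODL code): starving a small I/O thread pool behind one lock. */
public class WorkerStarvationSketch {
    // Stand-in for the netty worker group: a small, fixed pool of I/O threads.
    private static final ExecutorService ioWorkers = Executors.newFixedThreadPool(2);
    // Stand-in for ClusterSingletonServiceGroupImpl::clusterLock.
    private static final Semaphore clusterLock = new Semaphore(1);
    // A future that never completes, like the hanging createDeviceFeaturesForOF13(...) call.
    private static final CompletableFuture<Void> neverCompletes = new CompletableFuture<>();

    // Stand-in for registerService(): take the lock, then block on the future.
    static void registerService(String connection) throws Exception {
        clusterLock.acquire();                 // like 6): the lock is taken...
        try {
            System.out.println(connection + ": waiting for device features");
            neverCompletes.get();              // like 4): ...and never released, because get() hangs
        } finally {
            clusterLock.release();
        }
    }

    public static void main(String[] args) {
        // Every "reconnection" submits registerService() onto an I/O thread (cf. 5, 7, 9).
        for (int i = 0; i < 5; i++) {
            final String conn = "ovs-reconnect-" + i;
            ioWorkers.submit(() -> { registerService(conn); return null; });
        }
        // With one worker stuck in get() and the other stuck on acquire(), the remaining
        // submissions only queue up -- the analogue of 10) and 11): no thread is left to
        // accept or close new connections. This demo deliberately never terminates.
        System.out.println("All I/O workers are blocked; new work just queues.");
    }
}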
Some questions that I am trying to understand:
1) Why didn't the cluster service send ownership changes, as in 1)?
2) The Semaphore (in 6)) and the blocked call (in 4)) in netty I/O worker threads can lead to bad situations like this; can they be avoided?

Thanks,
Vinh
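Regarding question 2), one common way to avoid parking an I/O thread is to replace the blocking get() with a completion callback that runs on a separate executor, so the event-loop thread returns immediately. The sketch below uses plain CompletableFuture (Java 9+) and hypothetical method names that only mirror the stack trace above; the real OpenFlowPlugin code uses Guava futures, and an actual fix would have to respect its device lifecycle handling.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Sketch only: hand off the slow device-initialization future instead of blocking on it. */
public class NonBlockingInitSketch {
    // A dedicated executor for continuations, so I/O (event-loop) threads are never parked.
    private static final ExecutorService initExecutor = Executors.newCachedThreadPool();

    /** Hypothetical stand-in for the device-features request; returns a future, never blocks. */
    static CompletableFuture<String> createDeviceFeatures() {
        return CompletableFuture.supplyAsync(() -> "OF13 features", initExecutor);
    }

    /** Called from the I/O thread: schedule follow-up work instead of calling get(). */
    static void initializeNodeInformation() {
        createDeviceFeatures()
            .orTimeout(30, TimeUnit.SECONDS)   // bound the wait instead of hanging forever
            .whenCompleteAsync((features, failure) -> {
                if (failure != null) {
                    // Initialization failed or timed out: release resources / close the connection here.
                    System.err.println("device init failed: " + failure);
                } else {
                    // Continue initialization with the result, off the I/O thread.
                    System.out.println("device init done: " + features);
                }
            }, initExecutor);
        // Returns immediately; the I/O thread is free to serve other connections.
    }

    public static void main(String[] args) throws InterruptedException {
        initializeNodeInformation();
        Thread.sleep(1000);        // give the async continuation time to run in this demo
        initExecutor.shutdown();
    }
}

Whether this is feasible here depends on whether the ClusterSingletonServiceGroupImpl lock really must be held across device initialization; if it must, a bounded timeout on the blocked call would at least let the worker thread and the lock be reclaimed.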