+ controller-dev

Regards,
Venkat G

From: Vinh Nguyen
Sent: Wednesday, March 15, 2017 2:43 PM
To: openflowplugin-dev@lists.opendaylight.org
Cc: Tai, Hideyuki <hideyuki....@necam.com>; Venkatrangan G - ERS, HCL Tech 
<venkatrang...@hcl.com>
Subject: CLOSE_WAIT connections in ODL in scalability testing

I have encountered the following problem, and I wonder if anyone has any ideas.

Problem

lsof for the ODL process shows a large number of CLOSE_WAIT connections when an OVS node 
repeatedly fails to reconnect to the ODL controller port. Eventually ODL throws 
"Too many open files" as the number of CLOSE_WAIT connections piles up and exceeds 
the maximum allowed number of file descriptors.
This problem only happens when we enable TRACE-level ODL logging during 
scalability testing.

Reproduction steps:

1)      One control node, one ODL node, running Boron.

2)      On the ODL node, enable TRACE-level logging for netvirt, openflowplugin, and 
openflowjava.

3)      From the control node, use a script to define 100 networks/subnets in a 
loop.

4)      At around the 20th-50th network creation, OVS starts to disconnect from the ODL 
openflowplugin port due to the inactivity probe. The inactivity on the ODL side is likely 
because ODL spends most of its time in logging activity (see step 2); 
this problem doesn't happen if we don't enable TRACE-level logging.

5)      Subsequently, OVS repeatedly tries to reconnect to the ODL openflowplugin 
port without success.

6)      Steps 4) and 5) repeat for about half an hour, with no CLOSE_WAIT 
connections appearing in the lsof output. ODL's karaf log shows normal entries for 
connections being established and closed.

7)      After the failed reconnection attempts mentioned in 6), subsequent connection 
attempts result in CLOSE_WAIT connections, as shown by lsof:

java    10407 odluser 383u     IPv6              77653       0t0       TCP odl2.c.my-odl.internal:6653->control2.c.my-odl.internal:40232 (CLOSE_WAIT)

java    10407 odluser 401u     IPv6              79949       0t0       TCP odl2.c.my-odl.internal:6653->control2.c.my-odl.internal:40236 (CLOSE_WAIT)

.....

When step 7) above happens, the following is observed (a simplified model of the 
resulting thread starvation is sketched after this list):


1)      The cluster service didn't send an ownership change notification for the last 
disconnection.

2)      When a new connection arrives, LifecycleServiceImpl calls 
ClusterSingletonServiceRegistration#registerClusterSingletonService. Here it 
doesn't call serviceGroup.initializationClusterSingletonGroup(), since the 
serviceGroup was not cleaned up properly because of 1).

3)      LifecycleServiceImpl then calls 
ClusterSingletonServiceGroupImpl.registerService.

4)      The call in 3) hangs at:

LifecycleServiceImpl::instantiateServiceInstance
        DeviceContextImpl::onContextInstantiateService
                DeviceInitializationUtils.initializeNodeInformation
                        DeviceInitializationUtils.createDeviceFeaturesForOF13(deviceContext, switchFeaturesMandatory, convertorExecutor).get();   <---



5)      The call in 3) is invoked on a netty worker thread that handles I/O for 
ODL.

6)      The call in 3) holds the Semaphore 
ClusterSingletonServiceGroupImpl::clusterLock.

7)      Subsequent incoming requests from the same OVS also result in 
invocations of ClusterSingletonServiceGroupImpl.registerService.

8)      The requests in 7) hang forever waiting for the Semaphore 
ClusterSingletonServiceGroupImpl::clusterLock (held as described in 6)).

9)      The requests in 7) are also handled by netty worker threads.

10)   As reconnection requests keep coming, netty eventually runs out of worker 
threads to handle new connections.

11)   Subsequent incoming and closing connections from OVS result 
in CLOSE_WAIT connections, since netty has no more threads to handle them.
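
To make the starvation mechanism in 4)-11) concrete, here is a minimal, self-contained 
sketch. It is my own simplified model, not the actual ODL code; the class name 
WorkerStarvationDemo and the fields workers/clusterLock are made up for illustration. 
It reproduces the pattern: a worker thread takes a lock and then blocks on a future 
whose completion would itself need a worker from the same pool, so the pool starves 
and queued work never runs.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Simplified model (NOT the actual ODL code) of observations 4)-11).
public class WorkerStarvationDemo {

    // Stand-in for the netty I/O worker pool (2 workers for brevity).
    private static final ExecutorService workers = Executors.newFixedThreadPool(2);

    // Stand-in for ClusterSingletonServiceGroupImpl::clusterLock.
    private static final Semaphore clusterLock = new Semaphore(1);

    // Models registerService(): acquire the lock, then block on get(),
    // like the createDeviceFeaturesForOF13(...).get() call in observation 4).
    static void registerService(CompletableFuture<Void> deviceFeatures) throws Exception {
        clusterLock.acquire();
        try {
            deviceFeatures.get();   // blocking call on an I/O worker thread
        } finally {
            clusterLock.release();
        }
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<Void> deviceFeatures = new CompletableFuture<>();

        // First "connection": one worker takes the lock and blocks on get().
        workers.submit(() -> { registerService(deviceFeatures); return null; });
        // Second "connection": the other worker blocks on the semaphore.
        workers.submit(() -> { registerService(deviceFeatures); return null; });
        // The task that would complete the future is queued but can never run,
        // so from here on new work goes unserviced (the CLOSE_WAIT pile-up).
        workers.execute(() -> deviceFeatures.complete(null));

        TimeUnit.SECONDS.sleep(2);
        System.out.println("completed = " + deviceFeatures.isDone() + " (pool is starved)");
        workers.shutdownNow();
    }
}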

Sorry for the long email, but does anyone have any idea about this issue, or has anyone 
encountered something similar? Some questions I am trying to understand:


1)      Why didn't the cluster service send an ownership change notification, as noted 
in observation 1)?

2)      The Semaphore (in 6)) and the blocking call (in 4)) on netty I/O worker threads 
can lead to bad situations like this; can they be avoided? (A possible non-blocking 
alternative is sketched below.)
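
Regarding question 2), one possible direction - just a sketch under assumptions, not a 
claim about how openflowplugin should be fixed - is to replace the blocking .get() on the 
I/O thread with an asynchronous callback, for example Guava's Futures.addCallback, so the 
netty worker returns immediately. The SettableFuture and the class name 
NonBlockingInitSketch below are stand-ins for illustration; the real future would be the 
one returned by createDeviceFeaturesForOF13.

import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.MoreExecutors;
import com.google.common.util.concurrent.SettableFuture;

public class NonBlockingInitSketch {
    public static void main(String[] args) {
        // Stand-in for the future returned by createDeviceFeaturesForOF13(...).
        SettableFuture<Void> deviceFeatures = SettableFuture.create();

        // Register a callback instead of calling deviceFeatures.get() on the
        // netty I/O thread; the caller returns immediately and stays free for I/O.
        Futures.addCallback(deviceFeatures, new FutureCallback<Void>() {
            @Override
            public void onSuccess(Void result) {
                System.out.println("continue device initialization here");
            }

            @Override
            public void onFailure(Throwable t) {
                System.out.println("tear down the device context instead of hanging");
            }
        }, MoreExecutors.directExecutor());

        // Completed later by whichever thread finishes gathering device features.
        deviceFeatures.set(null);
    }
}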

Thanks, Vinh




