I applied the patch from the linked review, and it seems to have stabilized at around 5K TIME_WAIT connections on the master end (with the fin timeout set to 3 seconds).
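In case anyone else wants to watch that number over time, here is a quick sketch of counting TIME_WAIT sockets involving the master port. This is just illustrative, not anything from mesos itself; it assumes Linux's /proc/net/tcp and the default master port 5050:

    #!/usr/bin/env python
    # Quick sketch: count TIME_WAIT sockets involving the master port by
    # reading /proc/net/tcp. Linux-only; 5050 is the default mesos-master
    # port -- adjust if yours differs.
    import time

    MASTER_PORT = 5050   # assumption: default mesos-master port
    TIME_WAIT = '06'     # TCP state code for TIME_WAIT in /proc/net/tcp

    def count_time_wait(port):
        count = 0
        with open('/proc/net/tcp') as f:
            next(f)  # skip the header line
            for line in f:
                fields = line.split()
                local, remote, state = fields[1], fields[2], fields[3]
                if state != TIME_WAIT:
                    continue
                # local/remote are hex "IP:PORT"; compare just the port halves
                if port in (int(local.split(':')[1], 16),
                            int(remote.split(':')[1], 16)):
                    count += 1
        return count

    if __name__ == '__main__':
        while True:
            print('%d TIME_WAIT sockets involving port %d' %
                  (count_time_wait(MASTER_PORT), MASTER_PORT))
            time.sleep(5)

Run on the master host it prints a count every few seconds; netstat -tn | grep TIME_WAIT gives roughly the same picture.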
So that got me going again, but I'm still interested in figuring out why the connections aren't being reused for the HTTP status update messages. They should be, correct?

Aaron

Aaron Klingaman
R&D Manager, Sr Architect
Urban Robotics, Inc.
503-539-3693


On Wed, Dec 12, 2012 at 11:53 AM, Aaron Klingaman <[email protected]> wrote:

> Thanks for the link/info. I'm sure the bug you mentioned is happening, though
> mesos-slave seems to be crashing for a different reason. The mesos-master is
> still up (and still scheduling on other nodes that haven't crashed yet). If I
> leave it running, eventually every mesos-slave will crash and the master will
> still be up.
>
> I originally suspected something in the selection of the ip/port libprocess
> uses (in third_party/libprocess/src/process.cpp), but they seem to be
> returning the correct IP and a correct (random) port. That seems key to
> socket reuse in the SocketManager class.
>
> There also seems to be a large number of TIME_WAIT sockets on the master's
> port 5050.
>
> This is a few nodes on DHCP with a static hosts file (no DNS at the moment),
> if that makes any difference.
>
> Aaron
>
> Aaron Klingaman
> R&D Manager, Sr Architect
> Urban Robotics, Inc.
> 503-539-3693
>
>
> On Wed, Dec 12, 2012 at 11:08 AM, Benjamin Mahler <[email protected]> wrote:
>
>> Hi Aaron, here's what I know about this particular issue:
>>
>> Here's the bug: https://issues.apache.org/jira/browse/MESOS-220
>> Here's the fix (not in 0.9.0): https://reviews.apache.org/r/5995
>>
>> We're planning on releasing 0.10.0 shortly, where the fix is present.
>>
>> On Wed, Dec 12, 2012 at 10:47 AM, Aaron Klingaman <[email protected]> wrote:
>>
>> > It appears the status update messages between the master/slave aren't
>> > keeping the connections open.
>> >
>> > This is the only data transferred on each of the TIME_WAIT connections
>> > before being closed:
>> >
>> > POST /slave/mesos.internal.StatusUpdateAcknowledgementMessage HTTP/1.0
>> > User-Agent: libprocess/[email protected]:36675
>> > Connection: Keep-Alive
>> > Transfer-Encoding: chunked
>> >
>> > 8b
>> >
>> > %
>> > #2012121210221931258048-5050-14193-4(
>> > &2012121210221931258048-5050-14193-0001&
>> > $4046162e-448a-11e2-9aa3-080027c264fa"'$FH+*s-
>> > 0
>> >
>> > I'll keep digging; any tips are appreciated.
>> >
>> > Aaron Klingaman
>> > R&D Manager, Sr Architect
>> > Urban Robotics, Inc.
>> > 503-539-3693
>> >
>> >
>> > On Tue, Dec 11, 2012 at 4:31 PM, Aaron Klingaman <[email protected]> wrote:
>> >
>> > > Update:
>> > >
>> > > There is something more going on than just local port exhaustion. I set:
>> > >
>> > > /proc/sys/net/ipv4/tcp_fin_timeout to 2
>> > > /proc/sys/net/ipv4/ip_local_port_range to 32768 65535 (+5K approx)
>> > >
>> > > and I'm still seeing crashes. I'm currently looking for some artificial
>> > > limit inside mesos on the maximum number of sockets employed. Is there
>> > > one?
>> > >
>> > > Much appreciated,
>> > >
>> > > Aaron
>> > >
>> > > Aaron Klingaman
>> > > R&D Manager, Sr Architect
>> > > Urban Robotics, Inc.
>> > > 503-539-3693
>> > >
>> > >
>> > > On Tue, Dec 11, 2012 at 9:14 AM, Aaron Klingaman <[email protected]> wrote:
>> > >
>> > >> Has anyone else seen this behavior? I have a Python-implemented executor
>> > >> and framework, currently using 0.9.0 from the website. The end application
>> > >> submits approximately 45K+ tasks to the framework for scheduling. Due to a
>> > >> bug in my tasks, they fail immediately.
>> > >> It is still in the process of submitting/failing when mesos-slave
>> > >> crashes, and a netstat -tp indicates a very large number of sockets in
>> > >> TIME_WAIT (between the single node and the master) that belong to
>> > >> mesos-master. The source port is random (44700 in the last run). The
>> > >> tasks only last about 1-2 seconds.
>> > >>
>> > >> I'm assuming mesos-slave is crashing because it can't connect to the
>> > >> master any more after source port exhaustion. It seems to me that the
>> > >> framework is opening a new connection to mesos-master fairly frequently
>> > >> for task status/submission. Maybe slave->master as well.
>> > >>
>> > >> After fixing my own bug in the task, things work OK because the tasks
>> > >> finish in 1-2 seconds each, but there is still a fairly high number of
>> > >> TIME_WAIT sockets, indicating the problem is still there.
>> > >>
>> > >> The last relevant mesos-slave crash lines:
>> > >>
>> > >> F1211 08:41:41.626716 26415 process.cpp:1742] Check failed:
>> > >> sockets.count(s) > 0
>> > >> *** Check failure stack trace: ***
>> > >> @ 0x7fc9e39adebd google::LogMessage::Fail()
>> > >> @ 0x7fc9e39b064f google::LogMessage::SendToLog()
>> > >> @ 0x7fc9e39adabb google::LogMessage::Flush()
>> > >> @ 0x7fc9e39b0edd google::LogMessageFatal::~LogMessageFatal()
>> > >> @ 0x7fc9e38e5579 process::SocketManager::next()
>> > >> @ 0x7fc9e38e0063 process::send_data()
>> > >> @ 0x7fc9e39eb66f ev_invoke_pending
>> > >> @ 0x7fc9e39ef9a4 ev_loop
>> > >> @ 0x7fc9e38e0fb7 process::serve()
>> > >> @ 0x7fc9e32f1e9a start_thread
>> > >> @ 0x7fc9e2b08cbd (unknown)
>> > >> Aborted
>> > >>
>> > >> On a side note, I'm anxious to see the changelog for the next release.
>> > >>
>> > >> Aaron
>> > >>
>> > >
>> >
>> >
>
