I applied the specified patch, and it seems to have stabilized around 5K
TIME_WAIT connections on the master end (with the fin wait set to 3
seconds).

So that got me going again, but I'm still interested in figuring out why
the connections aren't being reused for HTTP status update messages. They
should be, correct?
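
To make clear what I mean by reuse, here is a small sketch in Python using only the standard library. It is not libprocess and says nothing about how SocketManager behaves; it only illustrates the general point that a kept-alive HTTP connection sends every request from one source port, while opening a new connection per message consumes a new ephemeral port each time (each closed one then sits in TIME_WAIT). The names Handler, start_server, post_many and the /status path are made up for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    # HTTP/1.1 makes the stdlib server keep connections open between requests.
    protocol_version = "HTTP/1.1"

    def do_POST(self):
        # Drain the request body so the connection is ready for the next one.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server():
    """Start a throwaway local server on a free port in a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_many(port, n, reuse):
    """POST n times and return the set of client-side source ports used."""
    source_ports = set()
    conn = None
    for _ in range(n):
        if conn is None or not reuse:
            if conn is not None:
                conn.close()
            conn = http.client.HTTPConnection("127.0.0.1", port)
        conn.request("POST", "/status", body=b"update")
        # Record the local port of the socket that carried this request.
        source_ports.add(conn.sock.getsockname()[1])
        conn.getresponse().read()
    conn.close()
    return source_ports
```

With reuse=True all requests should share a single source port; with reuse=False each request opens a fresh connection, which is the pattern that would match the pile of TIME_WAIT sockets I'm seeing.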

Aaron


Aaron Klingaman
R&D Manager, Sr Architect
Urban Robotics, Inc.
503-539-3693




On Wed, Dec 12, 2012 at 11:53 AM, Aaron Klingaman <
[email protected]> wrote:

> Thanks for the link/info. I'm sure the mentioned bug is happening
> after mesos-slave crashes for a different reason. The mesos-master is still
> up (and still scheduling on other nodes that haven't crashed yet). If I
> leave it running, eventually all mesos-slaves will crash and the master will
> still be up.
>
> I originally suspected something in the selection of the ip/port
> libprocess uses (in third_party/libprocess/src/process.cpp), but they seem
> to be returning the correct IP and a correct (random) port. That seems key
> to socket reuse in the SocketManager class.
>
> There also seems to be a large number of TIME_WAIT sockets on the master's
> port 5050.
>
> This is a few nodes on DHCP with a static hosts file (no DNS at the
> moment), if that makes any difference.
>
> Aaron
>
>
> On Wed, Dec 12, 2012 at 11:08 AM, Benjamin Mahler <[email protected]> wrote:
>
>> Hi Aaron, here's what I know about this particular issue:
>>
>> Here's the bug: https://issues.apache.org/jira/browse/MESOS-220
>> Here's the fix (not in 0.9.0): https://reviews.apache.org/r/5995
>>
>> We're planning on releasing 0.10.0 shortly, where the fix is present.
>>
>> On Wed, Dec 12, 2012 at 10:47 AM, Aaron Klingaman <
>> [email protected]> wrote:
>>
>> > It appears the status update messages between the master/slave aren't
>> > keeping the connections open.
>> >
>> > This is the only data transferred on each of the TIME_WAIT connections
>> > before being closed:
>> >
>> > POST /slave/mesos.internal.StatusUpdateAcknowledgementMessage HTTP/1.0
>> > User-Agent: libprocess/[email protected]:36675
>> > Connection: Keep-Alive
>> > Transfer-Encoding: chunked
>> >
>> > 8b
>> >
>> > %
>> > #2012121210221931258048-5050-14193-4(
>> > &2012121210221931258048-5050-14193-0001&
>> > $4046162e-448a-11e2-9aa3-080027c264fa"'$FH+*s-
>> > 0
>> >
>> > I'll keep digging; any tips are appreciated.
>> >
>> >
>> > On Tue, Dec 11, 2012 at 4:31 PM, Aaron Klingaman <
>> > [email protected]> wrote:
>> >
>> > > Update:
>> > >
>> > > There is something more going on than just local port exhaustion. I set:
>> > >
>> > > /proc/sys/net/ipv4/tcp_fin_timeout to 2
>> > > /proc/sys/net/ipv4/ip_local_port_range to 32768 65535 (+5K approx)
>> > >
>> > > and I'm still seeing crashes. I'm currently looking for some artificial
>> > > limit inside mesos on the maximum number of sockets employed. Is there one?
>> > >
>> > > Much appreciated,
>> > >
>> > > Aaron
>> > >
>> > >
>> > > On Tue, Dec 11, 2012 at 9:14 AM, Aaron Klingaman <
>> > > [email protected]> wrote:
>> > >
>> > >> Has anyone else seen this behavior? I have a Python-implemented executor
>> > >> and framework, currently using 0.9.0 from the website. The end application
>> > >> submits approximately 45K+ tasks to the framework for scheduling. Due to a
>> > >> bug in my tasks, they fail immediately. It is still in the process of
>> > >> submitting/failing when mesos-slave crashes, and a netstat -tp indicates a
>> > >> very large number of sockets in TIME_WAIT (between the single node and the
>> > >> master) that belong to mesos-master. The source port is random (44700 in
>> > >> the last run). The tasks only last about 1-2 seconds.
>> > >>
>> > >> I'm assuming mesos-slave is crashing because it can't connect to the
>> > >> master any more after source port exhaustion. It seems to me that the
>> > >> framework is opening a new connection to mesos-master fairly frequently for
>> > >> task status/submission. Maybe slave->master as well.
>> > >>
>> > >> With my own bug in the task fixed, it works OK because the tasks finish in
>> > >> 1-2 seconds each, but there is still a fairly high number of TIME_WAIT
>> > >> sockets, indicating the problem is still there.
>> > >>
>> > >> The last relevant mesos-slave crash lines:
>> > >>
>> > >> F1211 08:41:41.626716 26415 process.cpp:1742] Check failed: sockets.count(s) > 0
>> > >> *** Check failure stack trace: ***
>> > >>     @     0x7fc9e39adebd  google::LogMessage::Fail()
>> > >>     @     0x7fc9e39b064f  google::LogMessage::SendToLog()
>> > >>     @     0x7fc9e39adabb  google::LogMessage::Flush()
>> > >>     @     0x7fc9e39b0edd  google::LogMessageFatal::~LogMessageFatal()
>> > >>     @     0x7fc9e38e5579  process::SocketManager::next()
>> > >>     @     0x7fc9e38e0063  process::send_data()
>> > >>     @     0x7fc9e39eb66f  ev_invoke_pending
>> > >>     @     0x7fc9e39ef9a4  ev_loop
>> > >>     @     0x7fc9e38e0fb7  process::serve()
>> > >>     @     0x7fc9e32f1e9a  start_thread
>> > >>     @     0x7fc9e2b08cbd  (unknown)
>> > >> Aborted
>> > >>
>> > >> On a side note, I'm anxious to see the changelog for the next release.
>> > >>
>> > >> Aaron
>> > >>
>> > >>
>> > >
>> >
>>
>
>
