This is the stack trace I was able to grab from our production box.

#0  _wait (socket=0x0, fd=145, rw=114) at src/task/fd.c:217
#1  0x000000000041a087 in fdrecv (fd=145, buf=0x7fe98c0b16a0, n=4096)
    at src/task/fd.c:361
#2  0x00000000004248bc in plaintext_recv (iob=<value optimized out>,
    buffer=<value optimized out>, len=<value optimized out>) at src/io.c:83
#3  0x0000000000425cbf in IOBuf_read (buf=0x7fe98c1a31b0, need=4096,
    out_len=0x7fe98c0bc91c) at src/io.c:499
#4  0x000000000045bf6d in Connection_read_header (conn=0x7fe98c13eed0,
    req=0x7fe98c6ab820) at src/connection.c:966
#5  0x000000000045c13c in connection_parse (conn=<value optimized out>)
    at src/connection.c:611
#6  0x000000000045c6d4 in State_exec (state=0x7fe98c13eef0,
    event=<value optimized out>, conn=<value optimized out>) at
src/state.rl:53
#7  0x000000000045b688 in Connection_task (v=0x7fe98c13eed0)
    at src/connection.c:868
#8  0x000000000041b16f in taskstart (y=<value optimized out>,
    x=<value optimized out>) at src/task/task.c:37
#9  0x00007fe9960901a0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()
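One thing worth noting: frame #0 shows socket=0x0 but fd=145, while the logged error prints "(nil) or -1", so this breakpoint may also be catching healthy calls through the same line. A conditional breakpoint could narrow it to the dead case; a sketch (the condition is my assumption based on the logged message, not the actual source):

```
# Hypothetical gdb session: stop only when the socket or fd looks dead,
# then walk up to the Connection that triggered the read.
(gdb) break src/task/fd.c:217 if socket == 0 || fd == -1
(gdb) continue
(gdb) backtrace
(gdb) frame 4
(gdb) print *conn
```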

On Thu, Jun 28, 2012 at 8:49 AM, Rob LaRubbio <[email protected]> wrote:

> I'll see if I can connect to it.  I haven't been able to reproduce it in
> my development environment, but it happens often in production.  I've
> written a script to detect it and kill mongrel2; I'll see if I can modify
> that script on one machine to also take the server out of the load balancer
> so I can get a stack trace.
>
> This issue was occurring when I was running from the tip of the develop
> branch; I've recently moved over to a fork that includes my pull request
> that allows filters to handle the CLOSE event.  Running that build, I notice
> that the CLOSE event occurs more often than the HANDLER event (for the same
> connection), and my filter was occasionally getting called for the HANDLER
> event with the same connection object (before CLOSE was called).  Both are
> things I didn't expect to happen; I'm not sure if they're related.
>
> Anyway I'll try and get a stack trace to you.
>
> -Rob
>
>
> On Wed, Jun 27, 2012 at 5:43 PM, Jason Miller <[email protected]> wrote:
>
>> Hi Rob,
>>
>> I've not been able to reproduce this (though I found and fixed an
>> unrelated bug in the process of trying).  Any chance you could use
>> gdb to attach to mongrel2 the next time this happens, set a breakpoint on
>> the line where the error message is logged, and print a backtrace?
>>
>> If you're not familiar with gdb from the shell:
>>
>>  sudo gdb -p <mongrel2 pid>
>> That will attach gdb to the running mongrel2 process.
>>  b src/task/fd.c:217
>> This will set the breakpoint, which ought to be hit immediately.
>>  backtrace
>> This will print the backtrace.
>>
>>
>> -Jason
>>
>> On 20:36 Fri 22 Jun, Rob LaRubbio wrote:
>> > I'm not sure if this is relevant to the issue, but I figured I'd throw
>> > it out there in case it is.
>> >
>> > I added a new server to our production environment running a build of
>> > mongrel2 with my fix for filters getting the CLOSE transition.  My filter
>> > increments a counter on the HANDLER event and decrements it on the CLOSE
>> > event, then sends that count to statsd.  Looking at my stats, I can see
>> > that CLOSE is happening much more frequently than HANDLER, so it seems
>> > the same connection is getting closed multiple times.
>> >
>> > -Rob
>> >
>> > On 6/22/12 1:40 PM, Rob LaRubbio wrote:
>> > > Thanks for looking into this.
>> > >
>> > > We aren't using websockets or proxies, just mongrel2 and Tir.  We have
>> > > 4 mongrel2 servers behind a load balancer; each has 300 handlers.  The
>> > > handlers are not shared across servers (I have a pull request into Tir
>> > > to make it easier to run Tir on a server other than mongrel2).
>> > >
>> > > ==== mongrel2.conf =====
>> > > houston = Filter(
>> > >   name="/opt/mongrel2-1.8-dev/lib/mongrel2/filters/houston.so",
>> > >   settings = {
>> > >     <removed>
>> > >   }
>> > > )
>> > >
>> > > apollo = Handler(send_spec='tcp://127.0.0.1:9999',
>> > >                 send_ident='38f857b8-cbaa-4b58-9271-0d36c27813c4',
>> > >                 recv_spec='tcp://127.0.0.1:9998', recv_ident='',
>> > >                 protocol='tnetstring')
>> > >
>> > > static = Dir(base='static/',
>> > >              index_file='index.html',
>> > >              default_ctype='text/plain')
>> > >
>> > > main = Server(
>> > >     uuid="505417b8-1de4-454f-98b6-07eb9225cca1",
>> > >     access_log="/logs/access.log",
>> > >     error_log="/logs/error.log",
>> > >     chroot="/opt/mongrel2-1.8-dev",
>> > >     default_host="(.+)",
>> > >     name="main",
>> > >     pid_file="/run/mongrel2.pid",
>> > >     port=6767,
>> > >     hosts = [
>> > >         Host(name="(.+)",
>> > >         routes={ '/(.*/.*)': apollo,
>> > >                  '/([^/]*)$': static })
>> > >     ],
>> > >     filters = [
>> > >         houston
>> > >     ]
>> > >   )
>> > >
>> > > settings = {
>> > >     "limits.content_length": 20480000
>> > > }
>> > >
>> > > On 6/22/12 1:11 PM, Tordek wrote:
>> > >> On 22/06/12 13:12, Rob LaRubbio wrote:
>> > >>> Is the dev branch ready for a release? We're running it in production
>> > >>> and at least three times a week it starts spinning and writing this
>> > >>> to the logs in an endless loop:
>> > >>>
>> > >>> Fri, 22 Jun 2012 16:04:50 GMT [ERROR] (src/task/fd.c:217: errno:
>> > >>> None) Attempt to wait on a dead socket/fd: (nil) or -1
>> > >>>
>> > >>> The server fills up a 500G disk in about 11 hours and we need to
>> > >>> kill the server to get it handling requests again.
>> > >> Jason and I are looking into this; could you show us your
>> > >> mongrel2.conf? Are you using websockets or proxies?
>> > >>
>> > >>> -Rob
>> > >
>> > >
>> >
>> >
>> >
>>
>>
>

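For reference, the detection script I mention above is just a log watchdog; a minimal sketch, where the paths (derived from the chroot and log settings in the config), the tail window, and the kill signal are all assumptions:

```shell
# Sketch of a watchdog: check the error log for the dead-socket message
# and kill mongrel2 if the server looks wedged.
ERROR_LOG=/opt/mongrel2-1.8-dev/logs/error.log
PID_FILE=/opt/mongrel2-1.8-dev/run/mongrel2.pid
PATTERN='Attempt to wait on a dead socket/fd'

spinning() {
    # Treat the server as wedged if the message shows up in the
    # last 100 lines of the given log file.
    tail -n 100 "$1" 2>/dev/null | grep -q "$PATTERN"
}

if spinning "$ERROR_LOG"; then
    kill "$(cat "$PID_FILE")"
fi
```

Run it from cron every minute or so; taking the host out of the load balancer would go in the same branch as the kill.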