Re: Weird Unicorn Timeout Issues (Hibernation problem?)
Eric, The problem is a firewall that sits between the servers and the database. It is an idle session timeout of 30 minutes, so it is silently killing the connection. I have reached out to our Network Engineering department but they are saying they can not change that idle session timeout, nor create a special rule to allow this connection to bypass that rule. Currently, I setup a polling device that calls the applications URL every 20 minutes. This causes the connection between the server and DB to refresh it's idle timeout. This is obviously a very hacky way to handle it, so I am trying to look into AR and Oracle_Enhanced to see if they have some sort of keepalive option for the database. I thought it would work with the reaping_frequency, but apparently that does not work out as I had expected when you are not running in pools or a thread. So I'm still on the lookout for something to handle that. On Wed, Aug 6, 2014 at 5:45 AM, Eric Wong e...@80x24.org wrote: Any update? It looks like your DB driver is not using/respecting any timeout at all[1]. It is bad to not have a timeout there. There should be a way to set a timeout so you can at least tell the user the DB connection dropped or maybe get your app to disconnect+retry once. A better looking strace would be something like: write(fd, ...); = success (poll|select|ppoll) syscall ... read(fd, ...); /* only if (poll|select|ppoll) was successful[2] */ This goes for configuring all connections/services for any app. [1] or if it's relying on SO_RCVTIMEO socket option(rare), that's set way too high. Any timeout set for any external connection should be lower than the unicorn (last-resort) timeout feature. [2] any read() syscall after (poll|select|ppoll) should be non-blocking, because (poll|select|ppoll) may spuriously wakeup.
Re: Weird Unicorn Timeout Issues (Hibernation problem?)
I appreciate all your help Eric and Daniel. I have not solved this yet, but I think I have narrowed it down to a Firewall timeout issue. One app uses a database connection to Oracle, the other app uses a 3rd Party API (still on location, but across the network). The ping times to both of these devices are extremely fast, however 30 minutes of inactivity across the Firewall seems to disconnect these connections. At least that appears to be what the strace is telling me. The place in the strace that the timeout occurs is consistent, every time. For example the strace of the app that connects to Oracle shows this: pid 7825] write(14, \0\373\0\0\6\0\0\0\0\0\21iB\376\377\377\377\377\377\377\377\1\0\0\0\0\0\0\0\v\0\0\0\3^Ca\201\0\0\0\0\0\0\376\377\377\377\377\377\377\377\22\0\0\0\376\377\377\377\377\377\377\377\r\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0d\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\22select 1 from dual\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0, 251) = 251 [pid 7825] read(14, unfinished ... [pid 7827] +++ killed by SIGKILL +++ PANIC: handle_group_exit: 7827 leader 7825 [pid 7846] +++ killed by SIGKILL +++ PANIC: handle_group_exit: 7846 leader 7825 +++ killed by SIGKILL +++ Clearly that is a database query 'select 1 from dual'. It times out trying to read the response. At the same time if I watch the lsof -p pid, I see that the database connection drops after 30 minutes. I'll update this thread again once it is solved, for historical and future issues (in case someone else experiences something similar). Again thank you for your help Eric! On Mon, Aug 4, 2014 at 4:46 PM, Eric Wong e...@80x24.org wrote: Eric Wong e...@80x24.org wrote: Did you try strace-ing for 30 minutes and reproducing the error? You can also try setting the unicorn timeout to longer than 30 minutes and get a longer/stalled strace.
Weird Unicorn Timeout Issues (Hibernation problem?)
Current Setup: (the problem existed before updating nginx, unicorn and rails; but in an attempt to solve the problem I updated them). CentOS 6.3 (x64) Ruby 2.1.2p95 Rails 4.0.0 Nginx 1.6.0 Unicorn 4.8.3 We have an issue where if a site is not accessed for around (average) 30 minutes the next query will timeout, and it will timeout on all the workers opened. IE: If I have two workers, then both of those workers will timeout, even if the first one, after timeout, starts to work. As soon as the second worker is called upon it will timeout. Then everything runs perfectly good and great until the site is not accessed for 30 minutes or more. Then the timeout issue starts all over again. This happens on our dev machine on two different applications (only apps on the server) and on our production machine (also running two apps). I can't figure out exactly what is going on. I tried strace the master unicorn process, but I get no relevant information. One of the applications is also tied to a New Relic account and it gives no information either. Nginx logs show this: (sorry, obfuscated certain details due to the internal enterprise nature) 2014/08/04 11:31:21 [error] 6353#0: *339 upstream prematurely closed connection while reading response header from upstream, client: *.*.*.*, server: .org, request: GET /assets/application.css HTTP/1.1, upstream: http://unix:/var/www/sites//shared/sockets/.unicorn.sock.0:/assets/application.css;, host: .org, referrer: http://.org/outages; 2014/08/04 11:31:53 [error] 31058#0: *1 upstream prematurely closed connection while reading response header from upstream, client: *.*.*.*, server: .org, request: GET /outages HTTP/1.1, upstream: http://unix:/var/www/sites//shared/sockets/.unicorn.sock.0:/outages;, host: .org, referrer: http://.org/outages; Unicorn stderr shows this: (sorry, obfuscated certain details due to the internal enterprise nature) E, [2014-08-04T11:31:21.620379 #11991] ERROR -- : worker=1 PID:29701 timeout (21s 20s), killing E, [2014-08-04T11:31:21.630521 #11991] ERROR -- : reaped #Process::Status: pid 29701 SIGKILL (signal 9) worker=1 I, [2014-08-04T11:31:21.639881 #30521] INFO -- : worker=1 ready E, [2014-08-04T11:31:53.666300 #11991] ERROR -- : worker=0 PID:29705 timeout (21s 20s), killing E, [2014-08-04T11:31:53.676984 #11991] ERROR -- : reaped #Process::Status: pid 29705 SIGKILL (signal 9) worker=0 I, [2014-08-04T11:31:53.687157 #30528] INFO -- : worker=0 ready Its interesting that the timeout is occurring on different stages, for example if you look at the two nginx errors, one timeout at /assets/application.css and the other at the root /outages/, I've had it be small gifs as well. I'd also like to point at that I have changed the timeout from 30s to 60s to 300s to 20s and it occurs at every interval. So it is not a true timeout issue.Also the two apps on the server are completely different, with only Ruby and Rails being the same process. The gems are different in both, the database is different in both, in fact one of them has no assets, no css and no ui files at all. The only similarities are Ruby, Rails, Nginx and Unicorn. It's like there is some sort of hibernation problem with the unicorn workers. The hibernation thought comes from these debug messages (I dropped to recording debug messages trying to locate the problem). I get these every now and then: D, [2014-08-03T22:14:49.776538 #11991] DEBUG -- : waiting 11.0s after suspend/hibernation But then they cease to occur for days, example is that message was the last time that occurred (non in the logs for today). I've been trying for the past 4 days to solve this, make a large number of changes to no avail. Any help would be GREATLY appreciated.
Re: Weird Unicorn Timeout Issues (Hibernation problem?)
Eric, Thank you for responding. We use a database on only one of the apps, it is a Oracle 11G RAC Server. I'll get the DBA to double check the idle timeout for that DB. Though the other app does not use a database. On Mon, Aug 4, 2014 at 2:39 PM, Eric Wong e...@80x24.org wrote: Tony Devlin tonydev...@gmail.com wrote: We have an issue where if a site is not accessed for around (average) 30 minutes the next query will timeout, and it will timeout on all the workers opened. IE: If I have two workers, then both of those workers will timeout, even if the first one, after timeout, starts to work. As soon as the second worker is called upon it will timeout. Then everything runs perfectly good and great until the site is not accessed for 30 minutes or more. Then the timeout issue starts all over again. This sounds like the idle timeout for MySQL (or similar) kicking in. What database(s) or other backends are you using? That said, we've had problems with hibernate/suspend in the past, so I'll double check. -- *Tony Devlin* Founder / CTO PrintKEG.com (800) 676-0856 http://www.printkeg.com Facebook: http://www.facebook.com/printkeg Twitter: https://twitter.com/printkeg 250 business cards are only $10 http://www.printkeg.com/cheap-business-cards.php