Re: Weird Unicorn Timeout Issues (Hibernation problem?)

2014-08-06 Thread Tony Devlin
Eric,

The problem is a firewall that sits between the servers and the database.
 It is an idle session timeout of 30 minutes, so it is silently killing the
connection.  I have reached out to our Network Engineering department but
they are saying they can not change that idle session timeout, nor create a
special rule to allow this connection to bypass that rule.

Currently, I setup a polling device that calls the applications URL every
20 minutes.  This causes the connection between the server and DB to
refresh it's idle timeout.  This is obviously a very hacky way to handle
it, so I am trying to look into AR and Oracle_Enhanced to see if they have
some sort of keepalive option for the database.  I thought it would work
with the reaping_frequency, but apparently that does not work out as I had
expected when you are not running in pools or a thread.  So I'm still on
the lookout for something to handle that.




On Wed, Aug 6, 2014 at 5:45 AM, Eric Wong e...@80x24.org wrote:


 Any update?  It looks like your DB driver is not using/respecting any
 timeout at all[1].  It is bad to not have a timeout there.  There should
 be a way to set a timeout so you can at least tell the user the DB
 connection dropped or maybe get your app to disconnect+retry once.

 A better looking strace would be something like:

 write(fd, ...); = success
 (poll|select|ppoll) syscall ...
 read(fd, ...); /* only if (poll|select|ppoll) was successful[2] */

 This goes for configuring all connections/services for any app.

 [1] or if it's relying on SO_RCVTIMEO socket option(rare), that's set
 way too high.  Any timeout set for any external connection should
 be lower than the unicorn (last-resort) timeout feature.

 [2] any read() syscall after (poll|select|ppoll) should be non-blocking,
 because (poll|select|ppoll) may spuriously wakeup.




Re: Weird Unicorn Timeout Issues (Hibernation problem?)

2014-08-05 Thread Tony Devlin
I appreciate all your help Eric and Daniel.  I have not solved this yet,
but I think I have narrowed it down to a Firewall timeout issue.  One app
uses a database connection to Oracle, the other app uses a 3rd Party API
(still on location, but across the network).  The ping times to both of
these devices are extremely fast, however 30 minutes of inactivity across
the Firewall seems to disconnect these connections.  At least that appears
to be what the strace is telling me.  The place in the strace that the
timeout occurs is consistent, every time.  For example the strace of the
app that connects to Oracle shows this:

pid  7825] write(14,
\0\373\0\0\6\0\0\0\0\0\21iB\376\377\377\377\377\377\377\377\1\0\0\0\0\0\0\0\v\0\0\0\3^Ca\201\0\0\0\0\0\0\376\377\377\377\377\377\377\377\22\0\0\0\376\377\377\377\377\377\377\377\r\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0d\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\22select
1 from
dual\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0,
251) = 251
[pid  7825] read(14,  unfinished ...
[pid  7827] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 7827 leader 7825
[pid  7846] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 7846 leader 7825
+++ killed by SIGKILL +++

Clearly that is a database query 'select 1 from dual'.  It times out trying
to read the response.  At the same time if I watch the lsof -p pid, I see
that the database connection drops after 30 minutes.

I'll update this thread again once it is solved, for historical and future
issues (in case someone else experiences something similar).

Again thank you for your help Eric!


On Mon, Aug 4, 2014 at 4:46 PM, Eric Wong e...@80x24.org wrote:

 Eric Wong e...@80x24.org wrote:
  Did you try strace-ing for 30 minutes and reproducing the error?

 You can also try setting the unicorn timeout to longer than 30
 minutes and get a longer/stalled strace.





Weird Unicorn Timeout Issues (Hibernation problem?)

2014-08-04 Thread Tony Devlin
Current Setup:  (the problem existed before updating nginx, unicorn and
rails; but in an attempt to solve the problem I updated them).

CentOS 6.3 (x64)
Ruby 2.1.2p95
Rails 4.0.0
Nginx 1.6.0
Unicorn 4.8.3



We have an issue where if a site is not accessed for around (average) 30
minutes the next query will timeout, and it will timeout on all the workers
opened.  IE:   If I have two workers, then both of those workers will
timeout, even if the first one,
after timeout, starts to work.  As soon as the second worker is called upon
it will timeout.  Then everything runs perfectly good and great until the
site is not accessed for 30 minutes or more.  Then the timeout issue starts
all over again.

This happens on our dev machine on two different applications (only apps on
the server) and on our production machine (also running two apps).

I can't figure out exactly what is going on.  I tried strace the master
unicorn process, but I get no relevant information.   One of the
applications is also tied to a New Relic account and it gives no
information either.

Nginx logs show this: (sorry, obfuscated certain details due to the
internal enterprise nature)

2014/08/04 11:31:21 [error] 6353#0: *339 upstream prematurely closed
connection while reading response header from upstream, client: *.*.*.*,
server: .org, request: GET /assets/application.css HTTP/1.1,
upstream:
http://unix:/var/www/sites//shared/sockets/.unicorn.sock.0:/assets/application.css;,
host: .org, referrer: http://.org/outages;

2014/08/04 11:31:53 [error] 31058#0: *1 upstream prematurely closed
connection while reading response header from upstream, client: *.*.*.*,
server: .org, request: GET /outages HTTP/1.1, upstream:
http://unix:/var/www/sites//shared/sockets/.unicorn.sock.0:/outages;,
host: .org, referrer: http://.org/outages;

Unicorn stderr shows this: (sorry, obfuscated certain details due to the
internal enterprise nature)

E, [2014-08-04T11:31:21.620379 #11991] ERROR -- : worker=1 PID:29701
timeout (21s  20s), killing
E, [2014-08-04T11:31:21.630521 #11991] ERROR -- : reaped #Process::Status:
pid 29701 SIGKILL (signal 9) worker=1
I, [2014-08-04T11:31:21.639881 #30521]  INFO -- : worker=1 ready

E, [2014-08-04T11:31:53.666300 #11991] ERROR -- : worker=0 PID:29705
timeout (21s  20s), killing
E, [2014-08-04T11:31:53.676984 #11991] ERROR -- : reaped #Process::Status:
pid 29705 SIGKILL (signal 9) worker=0
I, [2014-08-04T11:31:53.687157 #30528]  INFO -- : worker=0 ready

Its interesting that the timeout is occurring on different stages, for
example if you look at the two nginx errors, one timeout at
/assets/application.css and the other at the root /outages/, I've had it be
small gifs as well.   I'd also like to point at
that I have changed the timeout from 30s to 60s to 300s to 20s and it
occurs at every interval.  So it is not a true timeout issue.Also the
two apps on the server are completely different, with only Ruby and Rails
being the same process.  The
gems are different in both, the database is different in both, in fact one
of them has no assets, no css and no ui files at all.  The only
similarities are Ruby, Rails, Nginx and Unicorn.

It's like there is some sort of hibernation problem with the unicorn
workers.  The hibernation thought comes from these debug messages (I
dropped to recording debug messages trying to locate the problem).  I get
these every now and then:

D, [2014-08-03T22:14:49.776538 #11991] DEBUG -- : waiting 11.0s after
suspend/hibernation

But then they cease to occur for days, example is that message was the last
time that occurred (non in the logs for today).

I've been trying for the past 4 days to solve this, make a large number of
changes to no avail.  Any help would be GREATLY appreciated.




Re: Weird Unicorn Timeout Issues (Hibernation problem?)

2014-08-04 Thread Tony Devlin
Eric,

Thank you for responding.  We use a database on only one of the apps, it is
a Oracle 11G RAC Server.  I'll get the DBA to double check the idle timeout
for that DB.  Though the other app does not use a database.


On Mon, Aug 4, 2014 at 2:39 PM, Eric Wong e...@80x24.org wrote:

 Tony Devlin tonydev...@gmail.com wrote:
  We have an issue where if a site is not accessed for around (average) 30
  minutes the next query will timeout, and it will timeout on all the
 workers
  opened.  IE:   If I have two workers, then both of those workers will
  timeout, even if the first one,
  after timeout, starts to work.  As soon as the second worker is called
 upon
  it will timeout.  Then everything runs perfectly good and great until the
  site is not accessed for 30 minutes or more.  Then the timeout issue
 starts
  all over again.

 This sounds like the idle timeout for MySQL (or similar) kicking in.
 What database(s) or other backends are you using?

 That said, we've had problems with hibernate/suspend in the past,
 so I'll double check.




-- 

*Tony Devlin*
Founder / CTO
PrintKEG.com
(800) 676-0856
http://www.printkeg.com

Facebook: http://www.facebook.com/printkeg
Twitter: https://twitter.com/printkeg

250 business cards are only $10
http://www.printkeg.com/cheap-business-cards.php