Sending ABRT to timeout-errant process before KILL

J. Austin Hughey Thu, 08 Sep 2011 02:27:56 -0700

Hello there -

I've recently been working with a customer in my capacity as a support engineer 
at Engine Yard who's having some trouble with Unicorn.  Basically, they're 
finding their application being force-killed once it reaches the default 
timeout.  Rather than simply increasing the timeout, we're trying to find out 
_why_ their application is being killed.  Unfortunately we can't quite do that 
because the application is being force-killed without warning, making it so 
that the customer's logging can't actually be written to the disk. (This is an 
intermittent issue as opposed to something that happens 100% of the time.)


In discussing the matter internally and with our customer, I came up with a 
simple monkey patch to Unicorn that _sort of_ works, but I'm having some 
trouble with it once the number of Unicorn workers goes beyond one.  I 
originally limited it to just one worker because I wanted to limit the 
possibility that multiple workers could cause problems in figuring things out 
-- turns out I was right.

I'm going to show the patch in two ways: 1) inline, at the bottom of this post, 
and 2) by link to GitHub:
https://github.com/jaustinhughey/unicorn/blob/abort_prior_to_kill_workers/lib/unicorn/http_server.rb#L438

The general idea is that I'd like to have some way to "warn" the application 
when it's about to be killed.  I've patched murder_lazy_workers to send an 
abort signal via kill_worker, sleep for 5 seconds, then look and see if the 
process still exists by using Process.getpgid.  If so, it sends the original 
kill command, and if not, a rescue block kicks in to prevent the raised error 
from Process.getpgid from making things explode.

I've created a simulation app, built on Rails 3.0, that uses a generic "posts" 
controller to simulate a long-running request.  Instead of just throwing a 
straight-up sleep 65 in there, I have it running through sleeping 1 second on a 
decrementing counter, and doing that 65 times.  The reason is because, assuming 
I've read the code correctly, even with my "skip sleeping workers" commented 
line below, it'll skip over the process, thus rendering my simulation of a 
long-running process invalid.  However, clarification on this point is 
certainly welcome.  You can see the app here: 
https://github.com/jaustinhughey/unicorn_test/blob/master/app/controllers/posts_controller.rb

The problem I'm running into, and where I could use some help, is when I 
increase the number of Unicorn workers from one to two.  When running only one 
Unicorn worker, I can access my application's posts_controller's index action, 
which has the intentionally long-running code.  At that time I tail -f 
unicorn.log and production.log.  Those two logs look like this with one Unicorn 
worker:

WITH ONE UNICORN WORKER:
========================

production.log:
---------------
Sleeping 1 second (65 to go)…
 ... continued ...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Caught Unicorn kill exception!
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Completed 500 Internal Server Error in 65131ms

NoMethodError (undefined method `query_options' for nil:NilClass):
  app/controllers/posts_controller.rb:32:in `index'

(I think the NoMethodError issue above is due to me calling a disconnect on 
ActiveRecord in the Signal.trap block, so I think that can be safely ignored.)

As you can see, the Signal.trap block inside the aforementioned 
posts_controller is working in this case.  Corresponding log entries in 
unicorn.log concur:

unicorn.log:
------------
worker=0 ready
master process ready
[2011-09-08 00:31:01 PDT] worker=0 PID: 28921 timeout hit, sending ABRT to 
process 28921 then sleeping 5 seconds...
[2011-09-08 00:31:06 PDT] worker=0 PID:28921 timeout (61s > 60s), killing
reaped #<Process::Status: pid 28921 SIGKILL (signal 9)> worker=0
worker=0 ready

So with one worker, everything seems cool.  But with two workers?


WITH TWO UNICORN WORKERS:
=========================

production.log:
---------------
Sleeping 1 second (8 to go)...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Rendered posts/index.html.erb within layouts/application (13.2ms)
Completed 200 OK in 65311ms (Views: 16.9ms | ActiveRecord: 0.5ms)

Note that there is no notice that the ABRT signal was trapped, nor is there a 
NoMethodError (likely caused by disconnecting from the database) as above.  Odd.

unicorn.log:
------------
Nothing.  No new data whatsoever.

The only potential clue I can see at this point would be a start-up message in 
unicorn.log.  After increasing the number of Unicorn workers to two, I examined 
unicorn.log again and found this:

master complete
I, [2011-09-08T00:34:40.499437 #29572]  INFO -- : unlinking existing 
socket=/var/run/engineyard/unicorn_ut.sock
I, [2011-09-08T00:34:40.499888 #29572]  INFO -- : listening on 
addr=/var/run/engineyard/unicorn_ut.sock fd=5
I, [2011-09-08T00:34:40.504542 #29572]  INFO -- : Refreshing Gem list
worker=0 ready
master process ready
[2011-09-08 00:34:49 PDT] worker=1 PID: 29582 timeout hit, sending ABRT to 
process 29582 then sleeping 5 seconds...
[2011-09-08 00:34:50 PDT] worker=1 PID:29582 timeout (1315467289s > 60s), 
killing
reaped #<Process::Status: pid 29582 SIGIOT (signal 6)> worker=1
worker=1 ready

So it looks like Worker 1 is hitting a strange/false timeout of 1315467289 
seconds, which isn't really possible as it wasn't even running 1315467289 
seconds prior to that (which equates to roughly 41 years ago if my math is 
right).

---

Needless to say, I'm a bit stumped at this point, and would sincerely 
appreciate another point of view on this.  Am I going about this all wrong?  Is 
there a better approach I should consider?  And if I'm on the right track, how 
can I get this to work regardless of how many Unicorn workers are running?

Thank you very much for any assistance you can provide!


-- INLINE VERSION OF PATCH --

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index 78d80b4..8a2323f 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -429,6 +429,11 @@ class Unicorn::HttpServer
     proc_name 'master (old)'
   end
 
+  # A custom formatted timestamp for debugging
+  def custom_timestamp
+    return Time.now.strftime("[%Y-%m-%d %T %Z]")
+  end
+
   # forcibly terminate all workers that haven't checked in in timeout seconds. 
 The timeout is implemented using an unlinked File
   def murder_lazy_workers
     t = @timeout
@@ -436,16 +441,40 @@ class Unicorn::HttpServer
     now = Time.now.to_i
     WORKERS.dup.each_pair do |wpid, worker|
       tick = worker.tick
-      0 == tick and next # skip workers that are sleeping
+
+      # REMOVE THE FOLLOWING COMMENT WHEN TESTING PRODUCTION
+# 0 == tick and next # skip workers that are sleeping
+      # ^ needs to be active, commented here for simulation purposes
+
       diff = now - tick
       tmp = t - diff
       if tmp >= 0
         next_sleep < tmp and next_sleep = tmp
         next
       end
-      logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
-                   "(#{diff}s > #{t}s), killing"
-      kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+
+
+      # Send an ABRT signal to Unicorn and wait 5 seconds before attempting an
+      # actual kill, if and only if the process is still running.
+
+      begin
+        # Send the ABRT signal.
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} 
timeout hit, sending ABRT to process #{wpid} then sleeping 5 seconds..."
+        kill_worker(:ABRT, wpid)
+
+        sleep 5
+
+        # Now see if the process still exists after being given five
+        # seconds to terminate on its own, and if so, do a hard kill.
+        if Process.getpgid(wpid)
+          logger.error "#{custom_timestamp} worker=#{worker.nr} PID:#{wpid} 
timeout " \
+                       "(#{diff}s > #{@timeout}s), killing"
+          kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+        end
+      rescue Errno::ESRCH => e
+        # No process identified - maybe it exited on its own?
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} 
responded to ABRT on its own and no longer exists. (Received message: #{e})"
+      end
     end
     next_sleep
   end

-- END INLINE VERSION OF PATCH --

--
J. Austin Hughey
Application Support Engineer - Engine Yard
www.engineyard.com | [email protected]

_______________________________________________
Unicorn mailing list - [email protected]
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

Sending ABRT to timeout-errant process before KILL

Reply via email to