We should really remove the init.d daemon script and replace it with runit. That way a) a restart on crash is guaranteed and b) stdout/stderr is captured (and rotated) automatically. In my experience, the stdout/stderr output around these events is very useful. To switch, you need runit (obviously) and a short stanza that starts couchdb in the foreground; there's a switch for that. Alternatively, start couchdb in the foreground in a terminal (as the couchdb user) and pound the server until it crashes.
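For illustration, the runit stanza mentioned above might look roughly like the sketch below. It generates a service directory with a run script and a log run script. The directory locations, the "couchdb" system user name, and the assumption that the `couchdb` launcher stays in the foreground when started without `-b` are mine, not from a tested setup; check `couchdb -h` on your build for the exact foreground invocation.

```shell
#!/bin/sh
# Sketch: generate a runit service directory for CouchDB.
# Assumptions (hypothetical, adjust for your system):
#   - SV_DIR and LOG_DIR locations
#   - a system user named "couchdb"
#   - `couchdb` runs in the foreground when started without -b
set -eu

SV_DIR="${SV_DIR:-./sv/couchdb}"
LOG_DIR="${LOG_DIR:-/var/log/couchdb-runit}"

mkdir -p "$SV_DIR/log"

# Main run script: merge stderr into stdout so the logger sees both,
# drop privileges with chpst, and exec so runsv supervises the real process.
cat > "$SV_DIR/run" <<'EOF'
#!/bin/sh
exec 2>&1
exec chpst -u couchdb couchdb
EOF

# Log run script: svlogd timestamps (-tt) and rotates everything the
# service prints. LOG_DIR must already exist when the service starts.
cat > "$SV_DIR/log/run" <<EOF
#!/bin/sh
exec svlogd -tt "$LOG_DIR"
EOF

chmod +x "$SV_DIR/run" "$SV_DIR/log/run"
```

Linking the generated directory into the directory that runsvdir watches (its location varies by distribution) would then start the service, and runsv restarts it whenever it exits.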
I have no operational experience with the R16 series, unfortunately. What I do know is that, since R15, the new process scheduler can interact poorly with NIFs that perform work lasting over a millisecond, which I could imagine happening when encoding or decoding large JSON documents. If the server were running out of file descriptors or sockets, I would expect some useful noise in the log, but we can't rule that out yet.

B.

On 13 Sep 2013, at 23:20, James Marca <[email protected]> wrote:

> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
>
> couchdb version 1.4.0 (gentoo ebuild)
>
> erlang also from gentoo ebuild:
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
>
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as
> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
> B.
>
> All three servers have data crunching jobs running that are reading
> and writing to the database that is being replicated around.
>
> The B server, the one in the middle that is push replicating to both A
> and C, is the one that is crashing.
>
> The log looks like this:
>
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET
> /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET
> /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started
> on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start
> replication `84213867ea04ca187d64dbf447660e52+continuous+create_target`
> (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start
> replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document
> `carb_grid_state5k_hpms_push_kitty`).
>
> no warning that the server died or why, and nothing in the
> /var/log/messages about anything untoward happening (no OOM killer
> invoked or anything like that)
>
> The restart only happened because I manually did a
> /etc/init.d/couchdb restart
> Usually couchdb restarts itself, but not with this crash.
>
>
>
> I flipped the log to debug level, and still had no warning about the crash:
>
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST'
> /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
> {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
> {'Content-Length',"346"},
> {'Content-Type',"application/json"},
> {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
> {'User-Agent',"CouchDB/1.4.0"},
> {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch
> of size 128531 bytes
>
> And that was it. CouchDB was down and out.
>
> I even tried shutting off the data processing (so as to reduce the db
> load) on box B, but that didn't help (all the crashing has put it far
> behind in replicating box A and C).
>
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.
>
> Any clues or suggestions would be appreciated. I am currently going
> to try compling from source directly, but I don't have much faith that
> it will make a difference.
>
> Thanks,
> James Marca
>
> --
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.
>
