Hi Marty,

It's difficult for me to tell why couchdb is not stopping via your init script, but we had a similar issue that I fixed by patching the couchdb startup script ("executable"). The issue was that the 'shepherd' program was respawning couch after a requested shutdown.
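To illustrate that failure mode, here is a minimal Python sketch of a respawning supervisor. It is hypothetical: it is not the actual couchdb script and not the patch in the gist, and the names `supervise`, `stop_requested`, `max_starts` and `run` are made up for the example. The point is only that the wrapper has to check whether a shutdown was requested before it starts the child again, otherwise a stop is immediately undone by the respawn.

```python
import subprocess


def supervise(cmd, stop_requested, max_starts=5, run=subprocess.call):
    """Start cmd and respawn it whenever it exits, unless a stop was requested.

    cmd            -- argument list for the child process (hypothetical)
    stop_requested -- callable returning True once a shutdown has been asked for
    max_starts     -- safety cap so the sketch cannot loop forever
    run            -- callable that runs cmd and blocks until it exits

    Returns the number of times the child was started.
    """
    starts = 0
    while starts < max_starts:
        starts += 1
        run(cmd)  # blocks until the child exits or is killed
        # Without this check, a requested shutdown would just trigger a respawn.
        if stop_requested():
            break
    return starts
```

With `stop_requested` always returning False, the child is started again after every exit until `max_starts` is reached; with it returning True, the loop ends after the first exit.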
This was discussed a while ago on the list and I sent our fix out, but I don't think it was ever integrated. Anyways, here's the gist (for 1.3, though I think the file has remained the same in the newer versions): https://gist.github.com/7601778

Cheers,
Mike

On 30.04.2014 at 06:52, Marty Hu <marty...@gmail.com> wrote:

Okay, after doing a bit more work this is what I found out:

1. When I start couchdb on a fresh server, it appears to run correctly.

2. However, the conventional "sudo service couchdb stop" does not actually stop couchdb correctly. I know this because I can kill the couchdb processes with

   ps -U couchdb -o pid= | xargs kill -9

3. We use chef for configuration, so at a set interval it will queue up a "sudo service couchdb restart", which will try to stop the process (the process won't stop) and then start a new process (this process will actually try to start). However, the second process will not be able to bind to the port (the first process never got killed and still holds it), so it will throw the error.

I imagine that this is a configuration issue (and so not really a fault of you guys), but I'd welcome any tips on how to deal with this short of changing the init script to be a messy killer.

On Tue, Apr 29, 2014 at 6:54 PM, Adam Kocoloski <kocol...@apache.org> wrote:

Hi Marty, the mailing list stripped out the attachments except for spike.txt. I don't know if they're the cause of the load spikes that you see, but the eaddrinuse errors are not normal. They can be caused by another process listening on the same port as CouchDB. Fairly peculiar stuff.

The timeout trying to open the splits-v0.1.7 at 21:23 does line up with your report that the system was heavily loaded at the time, but there's really not too much to go on here.

Regards, Adam

On Apr 29, 2014, at 7:46 PM, Marty Hu <marty...@gmail.com> wrote:

Thanks for the follow-up.
I've attached nagios graphs (load, disk, and ping) of one such event, which occurred at 2:24pm (after the drop in disk) according to my nagios emails. I've also attached database logs (with some client-specific queries removed). The error was fixed around 2:30pm. Note that the log files are in GMT. Unfortunately I don't have any graphs for the event other than what's on nagios.

Are the connection errors with CouchDB normal? We get them continuously (around every minute), even during normal operation when the DB is not crashing.

On Tue, Apr 29, 2014 at 2:34 AM, Alexander Shorin <kxe...@gmail.com> wrote:

Hi Marty, thanks for following up! I see your problem, but here is what we would need:

1. CouchDB stats graphs, plus your system disk, network and memory graphs. If you cannot share them in public, feel free to send them to me privately. We need to know whether they are related. For instance, high memory usage may be caused by uploading a large number of big files: you'll easily notice that by comparing the CouchDB, network and memory graphs for the spike period.

2. CouchDB log entries for the spike event. Graphs can only show that something is going wrong, and we could only guess (we would mostly guess right, but without much precision) what exactly it is. Logs will help us find the actual requests that cause the memory spike.

After that we can start to think about the problem. For instance, if the spikes are due to large attachment uploads, there is not much to do. On the other hand, the query server may easily eat quite a big chunk of memory. We'll easily notice that by monitoring the /_active_tasks resource (if the problem is in views) or by looking through the logs for the spike period. And this case can be fixed.

I'm not sure which tools you're using for monitoring and drawing graphs, but take a look at the following projects:

- https://github.com/gws/munin-plugin-couchdb - Munin plugin for CouchDB monitoring.
Unfortunately, it doesn't handle system metrics for the CouchDB process yet - I'll add that sometime this week, but make sure you have a similar plugin for your monitoring system.
- https://github.com/etsy/skyline - anomalies detector. spikes are so
- https://github.com/etsy/oculus - metrics correlation tool. It would make it very easy to compare multiple graphs for the anomaly period.

-- ,,,^..^,,,

On Tue, Apr 29, 2014 at 8:15 AM, Marty Hu <marty...@gmail.com> wrote:

We've been running CouchDB v1.5.0 on AWS and it's been working fine. Recently AWS came out with new prices for their new m3 instances, so we switched our CouchDB instance to use an m3.large.

We have a relatively small database with < 10GB of data in it. Our steady-state metrics for it are system loads of 0.2 and memory usage of 5% or so. However, we noticed that every few hours (3-4 times per day) we get a huge spike that pushes our load to 1.5 or so and memory usage close to 100%.

We don't run any cronjobs that involve the database, and our traffic flow is about the same over the day. We do run a continuous replication from one database on the west coast to another on the east coast.

This has been stumping me for a bit - any ideas?

<spike.txt>
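
The /_active_tasks check suggested in Alexander's reply could be scripted roughly as follows. This is only a sketch under assumptions: the base URL http://localhost:5984 and the absence of authentication are guesses about the setup (a server with admins configured may require credentials for this resource), and the helper names `fetch_active_tasks` and `count_tasks_by_type` are made up for illustration. Running it periodically around a spike would show whether replication, indexing or compaction tasks are active at that time.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_active_tasks(base_url="http://localhost:5984"):
    """Fetch the JSON list of running tasks from CouchDB's /_active_tasks.

    base_url is an assumption about where the server listens.
    """
    with urlopen(base_url + "/_active_tasks", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_tasks_by_type(tasks):
    """Count task entries by their "type" field (e.g. "replication")."""
    return Counter(task.get("type", "unknown") for task in tasks)


if __name__ == "__main__":
    # Print one line per task type currently running on the server.
    for task_type, count in sorted(count_tasks_by_type(fetch_active_tasks()).items()):
        print(task_type, count)
```

Logging this output with a timestamp every minute or so would make it possible to line the task activity up against the load and memory graphs for the spike period.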