Hi Marty,

It's hard for me to tell why couchdb is not stopping when you use your
init script, but we had a similar issue that I fixed by patching the
couchdb startup script ("executable").  The issue was that the 'shepherd'
program was respawning couch after a requested shutdown.

This was discussed a while ago on the list and I sent our fix out, but I
don't think it was ever integrated.  Anyway, here's the gist (for 1.3,
though I think the file has remained the same in the newer versions):

https://gist.github.com/7601778

Cheers,
Mike

On 30.04.2014 at 06:52, Marty Hu <marty...@gmail.com> wrote:

Okay, after doing a bit more work this is what I found out:

1. When I start couchdb on a fresh server, it appears to run correctly.

2. However, the conventional "sudo service couchdb stop" does not actually
stop couchdb. I know this because the couchdb processes are still running
afterwards, and I can only get rid of them with
ps -U couchdb -o pid= | xargs kill -9

3. We use chef for configuration, so at a set interval it will queue up a
"sudo service couchdb restart", which tries to stop the process (the
process won't stop) and then starts a new one. However, the second process
cannot bind to the port (the first process never got killed and still
holds it), so it throws the error.
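
One way to confirm the situation in step 3 is to check whether anything
still accepts connections on CouchDB's port after the stop command has
returned. The following is only a minimal Python sketch, assuming CouchDB
listens on its default port 5984 on 127.0.0.1; the helper names
port_in_use and wait_until_free are hypothetical and not part of CouchDB
or the init script:

```python
import socket
import time


def port_in_use(host: str = "127.0.0.1", port: int = 5984) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def wait_until_free(host: str = "127.0.0.1", port: int = 5984,
                    timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll until nothing listens on host:port; False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not port_in_use(host, port):
            return True
        time.sleep(interval)
    return not port_in_use(host, port)
```

If wait_until_free() still returns False some seconds after "sudo service
couchdb stop", the old process (or something respawning it) is still
holding the port, and a subsequent start can be expected to fail to bind.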

I imagine that this is a configuration issue (and so not really your
fault), but I'd welcome any tips about how to deal with this short of
changing the init script to be a messy killer.


On Tue, Apr 29, 2014 at 6:54 PM, Adam Kocoloski <kocol...@apache.org> wrote:

Hi Marty, the mailing list stripped out the attachments except for
spike.txt.

I don't know if they're the cause of the load spikes that you see, but the
eaddrinuse errors are not normal. They can be caused by another process
listening on the same port as CouchDB. Fairly peculiar stuff.
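
To illustrate how such an error arises: eaddrinuse corresponds to the
POSIX EADDRINUSE error, which the operating system returns when a process
tries to bind a port that another listener already holds. A small Python
sketch of that behaviour, using hypothetical helper names and an
OS-assigned port rather than CouchDB's own:

```python
import errno
import socket


def try_bind(host: str, port: int) -> socket.socket:
    """Bind and listen on host:port, as a server does at startup."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_errno(host: str = "127.0.0.1") -> int:
    """Bind one listener, then a second on the same port.

    Returns the errno of the second attempt's failure, or 0 if it
    unexpectedly succeeds.
    """
    first = try_bind(host, 0)  # port 0: let the OS pick a free port
    port = first.getsockname()[1]
    try:
        second = try_bind(host, port)
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return 0
    finally:
        first.close()
```

On a typical Linux system, second_bind_errno() returns errno.EADDRINUSE.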


The timeout trying to open the splits-v0.1.7 at 21:23 does line up with
your report that the system was heavily loaded at the time, but there's
really not too much to go on here.


Regards, Adam


On Apr 29, 2014, at 7:46 PM, Marty Hu <marty...@gmail.com> wrote:


Thanks for the follow-up.

I've attached nagios graphs (load, disk, and ping) of one such event,
which occurred at 2:24pm (after the drop in disk) according to my nagios
emails. I've also attached database logs (with some client-specific queries
removed). The error was fixed around 2:30pm. Notably, the log files are in
GMT.

Unfortunately I don't have any graphs for the event other than what's on
nagios.

Are the connection errors with CouchDB normal? We get them continuously
(around every minute) even during normal operation with the DB not crashing.



On Tue, Apr 29, 2014 at 2:34 AM, Alexander Shorin <kxe...@gmail.com> wrote:

Hi Marty,

thanks for following up! I see your problem, but here is what we would
need:

1. CouchDB stats graphs along with your system disk, network and memory
graphs. If you cannot share them in public, feel free to send them to me
privately. We need to know whether they are related. For instance, high
memory usage may be caused by uploading a large number of big files:
you'll easily notice that by comparing the CouchDB, network and memory
graphs for the spike period.

2. CouchDB log entries for the spike event. Graphs can only show that
something is going wrong, and we could only guess at what exactly (we
usually guess right, but without much precision). Logs will help us find
the actual requests that cause the memory spike.

After that we can start to think about the problem. For instance, if the
spikes happen due to large attachment uploads, there is not much to do.
On the other hand, the query server may easily eat quite a big chunk of
memory. We'll easily notice that by monitoring the /_active_tasks resource
(if the problem is in views) or by looking through the logs for the spike
period. And this case can be fixed.
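
As a rough illustration of monitoring /_active_tasks, here is a minimal
Python sketch. It assumes CouchDB is reachable at http://127.0.0.1:5984
and that the endpoint can be read without credentials (a server with
admins configured will normally require authentication, which this
sketch does not handle). The function names are hypothetical:

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_active_tasks(base_url: str = "http://127.0.0.1:5984") -> list:
    """GET /_active_tasks and return the decoded JSON array of tasks."""
    with urlopen(base_url + "/_active_tasks", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_by_type(tasks: list) -> Counter:
    """Count running tasks per "type" field (e.g. indexer, replication)."""
    return Counter(task.get("type", "unknown") for task in tasks)
```

Logging count_by_type(fetch_active_tasks()) at a regular interval and
lining the result up with the memory graph for the spike period would
show whether view indexing or replication tasks are active during the
spikes.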


Not sure which tools you're using for monitoring and drawing graphs, but
take a look at the following projects:

- https://github.com/gws/munin-plugin-couchdb - Munin plugin for CouchDB
monitoring. Unfortunately, it doesn't handle system metrics for the
CouchDB process - I'll only add this during this week, but make sure you
have a similar plugin for your monitoring system.

- https://github.com/etsy/skyline - anomaly detector. spikes are so

- https://github.com/etsy/oculus - metrics correlation tool. It would make
it very easy to compare multiple graphs for the anomaly period.

--
,,,^..^,,,



On Tue, Apr 29, 2014 at 8:15 AM, Marty Hu <marty...@gmail.com> wrote:

We've been running CouchDB v1.5.0 on AWS and it's been working fine.
Recently AWS came out with new prices for their new m3 instances, so we
switched our CouchDB instance to an m3.large. We have a relatively small
database with < 10GB of data in it.

Our steady-state metrics for it are system loads of 0.2 and memory usage
of 5% or so. However, we noticed that every few hours (3-4 times per day)
we get a huge spike that pushes our load to 1.5 or so and memory usage to
close to 100%.

We don't run any cronjobs that involve the database, and our traffic flow
is about the same over the day. We do run a continuous replication from
one database on the west coast to another on the east coast.

This has been stumping me for a bit - any ideas?



<spike.txt>
