Hi Marty,

The mailing list stripped out the attachments except for spike.txt.

I don't know if those connection errors are the cause of the load spikes you're seeing, but the eaddrinuse errors are not normal. They can be caused by another process listening on the same port as CouchDB. Fairly peculiar stuff.
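If you want to rule out a port conflict, one quick check, run while CouchDB is stopped, is to try binding the port yourself. A minimal sketch, assuming the default port 5984 (adjust if you've changed [httpd] port in local.ini):

    import socket

    # Try to bind CouchDB's port ourselves. bind() failing with EADDRINUSE
    # is the same condition the Erlang VM logs as eaddrinuse.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", 5984))
        print("port 5984 is free")
    except OSError as e:
        print("port 5984 is in use:", e)
    finally:
        s.close()

If the bind fails while CouchDB is down, something else owns the port; lsof -i :5984 (or netstat -tlnp on Linux) will name the process, and anything other than beam/beam.smp is your culprit.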
The timeout trying to open splits-v0.1.7 at 21:23 does line up with your report that the system was heavily loaded at the time, but there's really not much to go on here.

Regards,
Adam
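P.S. When the next spike hits, it's worth capturing what CouchDB itself thinks it is doing. Below is a minimal polling sketch, not a drop-in tool; it assumes the instance answers on localhost:5984 without auth, so add credentials to the URL if yours needs them:

    import json, time, urllib.request

    BASE = "http://127.0.0.1:5984"  # assumed host/port; adjust for your setup

    while True:
        with urllib.request.urlopen(BASE + "/_active_tasks") as resp:
            tasks = json.load(resp)
        stamp = time.strftime("%H:%M:%S")
        for t in tasks:
            # every task reports a type (indexer, replication,
            # database_compaction, ...) and usually a progress percentage
            print(stamp, t.get("type"), t.get("progress"), sep="\t")
        time.sleep(30)

If an indexer or compaction task lines up with each spike, that's your answer; logging /_stats alongside gives the request-level view.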
On Apr 29, 2014, at 7:46 PM, Marty Hu <marty...@gmail.com> wrote:

> Thanks for the follow-up.
>
> I've attached nagios graphs (load, disk, and ping) of one such event,
> which occurred at 2:24pm (after the drop in disk) according to my
> nagios emails. I've also attached database logs (with some
> client-specific queries removed). The error was fixed around 2:30pm.
> Note that the log files are in GMT.
>
> Unfortunately I don't have any graphs for the event other than what's
> on nagios.
>
> Are the connection errors with CouchDB normal? We get them
> continuously (around every minute) even during normal operation with
> the DB not crashing.
>
>
> On Tue, Apr 29, 2014 at 2:34 AM, Alexander Shorin <kxe...@gmail.com> wrote:
> Hi Marty,
>
> thanks for following up! I see your problem, but here is what we need:
>
> 1. CouchDB stats graphs plus your system disk, network and memory
> ones. If you cannot share them in public, feel free to send them to
> me in private. We need to see how they are related. For instance,
> high memory usage may be caused by uploading a large number of big
> files: you'll notice that easily by comparing the CouchDB, network
> and memory graphs for the spike period.
>
> 2. CouchDB log entries for the spike event. Graphs can only show that
> something is going wrong, and we could only guess (mostly we'd guess
> right, but without much precision) what exactly is going wrong. Logs
> will help us find the actual requests that cause the memory spike.
>
> After that we can start to think about the problem. For instance, if
> the spikes happen due to large attachment uploads, there is not much
> to do. On the other hand, the query server may easily eat quite a big
> chunk of memory. We'll notice that easily by monitoring the
> /_active_tasks resource (if the problem is in views) or by looking
> through the logs for the spike period. That case can be fixed.
>
> Not sure which tools you're using for monitoring and graphing, but
> take a look at these projects:
> - https://github.com/gws/munin-plugin-couchdb - Munin plugin for
> CouchDB monitoring. Unfortunately, it doesn't handle system metrics
> for the CouchDB process yet - I'll add that this week - but make sure
> you have a similar plugin for your monitoring system.
> - https://github.com/etsy/skyline - anomaly detector; spikes are
> exactly the kind of anomaly it flags.
> - https://github.com/etsy/oculus - metrics correlation tool. It makes
> it very easy to compare multiple graphs for the anomaly period.
>
> --
> ,,,^..^,,,
>
>
> On Tue, Apr 29, 2014 at 8:15 AM, Marty Hu <marty...@gmail.com> wrote:
> > We've been running CouchDB v1.5.0 on AWS and it's been working fine.
> > Recently AWS came out with new prices for their new m3 instances,
> > so we switched our CouchDB instance to an m3.large. We have a
> > relatively small database with < 10GB of data in it.
> >
> > Our steady-state metrics are a system load of 0.2 and memory usage
> > of 5% or so. However, we noticed that every few hours (3-4 times
> > per day) we get a huge spike that drives our load up to 1.5 or so
> > and memory usage to close to 100%.
> >
> > We don't run any cronjobs that involve the database, and our
> > traffic flow is about the same over the day. We do run a continuous
> > replication from one database on the west coast to another on the
> > east coast.
> >
> > This has been stumping me for a bit - any ideas?
>
> <spike.txt>
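One more thing worth ruling out is the continuous replication Marty mentions above: it shows up in _active_tasks with type "replication", so you can check whether its checkpointed sequence keeps advancing through a spike. A variation on the same sketch, under the same localhost/no-auth assumptions:

    import json, urllib.request

    # Pull only the replication tasks; checkpointed_source_seq stalling
    # while a spike is underway would point at the replicator.
    with urllib.request.urlopen("http://127.0.0.1:5984/_active_tasks") as resp:
        tasks = json.load(resp)
    for t in tasks:
        if t.get("type") == "replication":
            print(t.get("source"), "->", t.get("target"),
                  "seq:", t.get("checkpointed_source_seq"))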