Re: Really odd issue (AWS related?)

2013-04-30 Thread Ben Chobot
We've also had issues with ephemeral drives in a single AZ in us-east-1, so much so that we no longer use that AZ. Though our issues tended to be obvious from instance boot - they wouldn't suddenly degrade. On Apr 28, 2013, at 2:27 PM, Alex Major wrote: > Hi Mike, > > We had issues with the ep

Re: Really odd issue (AWS related?)

2013-04-28 Thread Alex Major
Hi Mike, We had issues with the ephemeral drives when we first got started, although we never got to the bottom of it, so unfortunately I can't help much with troubleshooting. Contrary to a lot of the comments on the mailing list, we've actually had a lot more success with EBS drives (PIOPS!). I'd d

Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
I forgot to mention: when things go really bad, I'm seeing I/O waits in the 80-95% range. I restarted Cassandra once when a node was in this situation, and it took 45 minutes to start (primarily reading SSTables). Typically, a node starts in about 5 minutes. Thanks, -Mike On Apr 28, 2

Re: Really odd issue (AWS related?)

2013-04-28 Thread Michael Theroux
Hello, We've done some additional monitoring, and I think we have more information. We've been collecting vmstat information every minute, attempting to catch a node with issues. So it appears that the Cassandra node runs fine. Then suddenly, without any correlation to any event that I c

Re: Really odd issue (AWS related?)

2013-04-26 Thread Michael Theroux
Thanks. We weren't monitoring this value when the issue occurred, and this particular issue has not appeared for a couple of days (knock on wood). Will keep an eye out though, -Mike On Apr 26, 2013, at 5:32 AM, Jason Wee wrote: > top command? st : time stolen from this vm by the hypervisor >

Re: Really odd issue (AWS related?)

2013-04-26 Thread Jason Wee
top command? st : time stolen from this vm by the hypervisor jason On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux wrote: > Sorry, Not sure what CPU steal is :) > > I have AWS console with detailed monitoring enabled... things seem to > track close to the minute, so I can see the CPU load go t
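
The "st" column in top is derived from the steal counter on the aggregate "cpu" line of /proc/stat on Linux. As a minimal sketch (the helper names parse_cpu_line, percent_between and read_cpu_line are illustrative, not part of any tool mentioned in this thread), the share of CPU time stolen by the hypervisor between two samples could be computed as shown here. The same counters also carry iowait, so the function takes the field name as a parameter.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
# Guest time is already counted inside "user", so it is left out of the total.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return the first eight jiffy counters of a 'cpu ...' line as a dict."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) < len(FIELDS):
        raise ValueError("this kernel does not report a steal counter")
    return dict(zip(FIELDS, values))


def percent_between(before, after, field):
    """Percentage of elapsed CPU time spent in `field` between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas[field] / total if total > 0 else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line, which is the first line of the file."""
    with open(path) as handle:
        return handle.readline()
```

To use it, call parse_cpu_line(read_cpu_line()) twice, some seconds apart, and pass both results to percent_between with "steal" or "iowait". A steal percentage that stays high would suggest the host is giving CPU time to other guests, though this sketch alone cannot confirm the cause.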

Re: Really odd issue (AWS related?)

2013-04-25 Thread Michael Theroux
Sorry, Not sure what CPU steal is :) I have AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages, -Mike On Apr 25, 2013, at 9:50 PM, aaron morton wrote

Re: Really odd issue (AWS related?)

2013-04-25 Thread aaron morton
> The messages appear right after the node "wakes up". Are you tracking CPU steal ? - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 25/04/2013, at 4:15 AM, Robert Coli wrote: > On Wed, Apr 24, 2013 at 5:03 AM, Michael Ther

Re: Really odd issue (AWS related?)

2013-04-24 Thread Robert Coli
On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux wrote: > Another related question. Once we see messages being dropped on one node, > our cassandra client appears to see this, reporting errors. We use > LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see > an error? I
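
For context on the question about LOCAL_QUORUM with an RF of 3, here is a minimal sketch of the quorum arithmetic, assuming Cassandra's standard formula of floor(RF / 2) + 1 applied to the local data center's replication factor. The function names are illustrative only.

```python
def quorum(replication_factor):
    """Replicas that must acknowledge a quorum read or write."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor):
    """Replicas that can be unresponsive while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


print(quorum(3))                      # 2
print(tolerated_replica_failures(3))  # 1
```

With RF 3, two replicas must respond, so one slow or unresponsive replica is tolerated in principle. Whether that holds in a given incident depends on the other replicas answering within the timeout, which this arithmetic does not capture.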

Really odd issue (AWS related?)

2013-04-24 Thread Michael Theroux
Hello, Since Sunday, we've been experiencing a really odd issue in our Cassandra cluster. We recently started receiving errors that messages are being dropped. But here is the odd part... When looking in the AWS console, instead of seeing statistics being elevated during this time, we actual