Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm
On Mon, Dec 29, 2014 at 3:24 PM, mck wrote:

> Especially in CASSANDRA-6285 i see some scary stuff went down.
>
> But there are no outstanding bugs that we know of, are there?

Right, the question is whether you believe that 6285 has actually been fully resolved. It's relatively plausible that it finally was, which is why I describe my feelings about the HSHA "corrupter" implementation as FUD.

Really, the huge mistake was to rewrite "hsha" despite the fact that this is one of the rare pluggable interfaces, thereby breaking existing users. If it had been called "hsha2" or something, I'd have a lot less FUD about it... because people would not have hit corruption on upgrade, which I view as Super Bad.

IMO, probably the only people who should use HSHA are people who have a real need for it, specifically people with huge numbers of client threads they can't reduce.

=Rob
Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm
> Perf is better, correctness seems less so. I value latter more than
> former.

Yeah no doubt. Especially in CASSANDRA-6285 i see some scary stuff went down.

But there are no outstanding bugs that we know of, are there? (CASSANDRA-6815 remains just a wrap up of how options are to be presented in cassandra.yaml?)

~mck
Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm
On Mon, Dec 29, 2014 at 2:03 PM, mck wrote:

> We saw an improvement when we switched to HSHA, particularly for our
> offline (hadoop/spark) nodes.
> Sorry i don't have the data anymore to support that statement, although
> i can say that improvement paled in comparison to cross_node_timeout
> which we enabled shortly afterwards.

Perf is better, correctness seems less so. I value latter more than former.

=Rob
Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm
> > Should I stick to 2048 or try
> > with something closer to 128 or even something else ?

2048 worked fine for us.

> > About HSHA,
>
> I anti-recommend hsha, serious apparently unresolved problems exist with
> it.

We saw an improvement when we switched to HSHA, particularly for our offline (hadoop/spark) nodes.

Sorry i don't have the data anymore to support that statement, although i can say that improvement paled in comparison to cross_node_timeout which we enabled shortly afterwards.

~mck
Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm
On Mon, Dec 29, 2014 at 2:29 AM, Alain RODRIGUEZ wrote:

> Sorry about the gravedigging, but what would be a good start value to tune
> "rpc_max_threads" ?

Depends on whether you prefer that clients get a slow thread or none.

> I mean, default is unlimited, the value commented is 2048. Native protocol
> seems to only allow 128 simultaneous threads. Should I stick to 2048 or try
> with something closer to 128 or even something else ?

Probably closer to 2048 than unlimited.

> About HSHA, I have tried this mode from time to time since C* 0.8 and
> always faced the "ERROR 12:02:18,971 Read an invalid frame size of 0. Are
> you using TFramedTransport on the client side?" error. I haven't tried for
> a while (1 year maybe), has this been fixed, or is this due to my
> configuration somehow ?

I anti-recommend hsha, serious apparently unresolved problems exist with it. I understand this is FUD, but fool me once shame on you/fool me twice shame on me.

=Rob
Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm
Hi,

Sorry about the gravedigging, but what would be a good start value to tune "rpc_max_threads" ?

I mean, default is unlimited, the value commented is 2048. Native protocol seems to only allow 128 simultaneous threads. Should I stick to 2048 or try with something closer to 128 or even something else ?

About HSHA, I have tried this mode from time to time since C* 0.8 and always faced the "ERROR 12:02:18,971 Read an invalid frame size of 0. Are you using TFramedTransport on the client side?" error. I haven't tried for a while (1 year maybe), has this been fixed, or is this due to my configuration somehow ?

C*heers

Alain

2014-10-29 16:07 GMT+01:00 Peter Haggerty:

> That definitely appears to be the issue. Thanks for pointing that out!
>
> https://issues.apache.org/jira/browse/CASSANDRA-8116
> It looks like 2.0.12 will check for the default and throw an exception
> (thanks Mike Adamson) and also includes a bit more text in the config
> file but I'm thinking that 2.0.12 should be pushed out sooner rather
> than later as anyone using hsha and the default settings will simply
> have their cluster stop working a few minutes after the upgrade and
> without any indication of the actual problem.
>
> Peter
>
> On Wed, Oct 29, 2014 at 5:23 AM, Duncan Sands wrote:
> > Hi Peter, are you using the hsha RPC server type on this node? If you are,
> > then it looks like rpc_max_threads threads will be allocated on startup in
> > 2.0.11 while this wasn't the case before. This can exhaust your heap if the
> > value of rpc_max_threads is too large (eg if you use the default).
> >
> > Ciao, Duncan.
> >
> > On 29/10/14 01:08, Peter Haggerty wrote:
> >>
> >> On a 3 node test cluster we recently upgraded one node from 2.0.10 to
> >> 2.0.11. This is a cluster that had been happily running 2.0.10 for
> >> weeks and that has very little load and very capable hardware. The
> >> upgrade was just your typical package upgrade:
> >>
> >> $ dpkg -s cassandra | egrep '^Ver|^Main'
> >> Maintainer: Eric Evans
> >> Version: 2.0.11
> >>
> >> Immediately after it started, it ran a couple of ParNews and then started
> >> executing CMS runs. In 10 minutes the node had become unreachable and
> >> was marked as down by the two other nodes in the ring, which are still
> >> 2.0.10.
> >>
> >> We have jstack output and the server logs but nothing seems to be
> >> jumping out. Has anyone else run into this? What should we be looking
> >> for?
> >>
> >> Peter
Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm
That definitely appears to be the issue. Thanks for pointing that out!

https://issues.apache.org/jira/browse/CASSANDRA-8116

It looks like 2.0.12 will check for the default and throw an exception (thanks Mike Adamson) and also includes a bit more text in the config file but I'm thinking that 2.0.12 should be pushed out sooner rather than later as anyone using hsha and the default settings will simply have their cluster stop working a few minutes after the upgrade and without any indication of the actual problem.

Peter

On Wed, Oct 29, 2014 at 5:23 AM, Duncan Sands wrote:

> Hi Peter, are you using the hsha RPC server type on this node? If you are,
> then it looks like rpc_max_threads threads will be allocated on startup in
> 2.0.11 while this wasn't the case before. This can exhaust your heap if the
> value of rpc_max_threads is too large (eg if you use the default).
>
> Ciao, Duncan.
>
> On 29/10/14 01:08, Peter Haggerty wrote:
>>
>> On a 3 node test cluster we recently upgraded one node from 2.0.10 to
>> 2.0.11. This is a cluster that had been happily running 2.0.10 for
>> weeks and that has very little load and very capable hardware. The
>> upgrade was just your typical package upgrade:
>>
>> $ dpkg -s cassandra | egrep '^Ver|^Main'
>> Maintainer: Eric Evans
>> Version: 2.0.11
>>
>> Immediately after it started, it ran a couple of ParNews and then started
>> executing CMS runs. In 10 minutes the node had become unreachable and
>> was marked as down by the two other nodes in the ring, which are still
>> 2.0.10.
>>
>> We have jstack output and the server logs but nothing seems to be
>> jumping out. Has anyone else run into this? What should we be looking
>> for?
>>
>> Peter
Re: 2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm
Hi Peter, are you using the hsha RPC server type on this node? If you are, then it looks like rpc_max_threads threads will be allocated on startup in 2.0.11 while this wasn't the case before. This can exhaust your heap if the value of rpc_max_threads is too large (eg if you use the default).

Ciao, Duncan.

On 29/10/14 01:08, Peter Haggerty wrote:

> On a 3 node test cluster we recently upgraded one node from 2.0.10 to
> 2.0.11. This is a cluster that had been happily running 2.0.10 for
> weeks and that has very little load and very capable hardware. The
> upgrade was just your typical package upgrade:
>
> $ dpkg -s cassandra | egrep '^Ver|^Main'
> Maintainer: Eric Evans
> Version: 2.0.11
>
> Immediately after it started, it ran a couple of ParNews and then started
> executing CMS runs. In 10 minutes the node had become unreachable and
> was marked as down by the two other nodes in the ring, which are still
> 2.0.10.
>
> We have jstack output and the server logs but nothing seems to be
> jumping out. Has anyone else run into this? What should we be looking
> for?
>
> Peter
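For anyone wanting to check a node before upgrading, here is a minimal sketch of a test for the combination described above: rpc_server_type set to hsha while rpc_max_threads is left at its default (absent or commented out). The function name and the line-based parsing are illustrative assumptions, not part of Cassandra or its tooling; it only looks at top-level "key: value" lines and does not handle full YAML syntax.

```python
import re


def hsha_default_threads_risk(yaml_text):
    """Return True if the text selects hsha without an explicit rpc_max_threads.

    Hypothetical helper: scans plain "key: value" lines only, not full YAML.
    """
    settings = {}
    for line in yaml_text.splitlines():
        stripped = line.strip()
        # A blank or commented-out line sets nothing, so the default applies.
        if not stripped or stripped.startswith("#"):
            continue
        match = re.match(r"(rpc_server_type|rpc_max_threads)\s*:\s*(\S+)", stripped)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings.get("rpc_server_type") == "hsha" and "rpc_max_threads" not in settings


if __name__ == "__main__":
    sample = "rpc_server_type: hsha\n# rpc_max_threads: 2048\n"
    if hsha_default_threads_risk(sample):
        print("hsha is enabled but rpc_max_threads is not set explicitly")
```

To run it against a real node, read the contents of that node's cassandra.yaml into a string and pass it to the function; the file location depends on how Cassandra was installed.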
2.0.10 to 2.0.11 upgrade and immediate ParNew and CMS GC storm
On a 3 node test cluster we recently upgraded one node from 2.0.10 to 2.0.11. This is a cluster that had been happily running 2.0.10 for weeks and that has very little load and very capable hardware. The upgrade was just your typical package upgrade:

$ dpkg -s cassandra | egrep '^Ver|^Main'
Maintainer: Eric Evans
Version: 2.0.11

Immediately after it started, it ran a couple of ParNews and then started executing CMS runs. In 10 minutes the node had become unreachable and was marked as down by the two other nodes in the ring, which are still 2.0.10.

We have jstack output and the server logs but nothing seems to be jumping out. Has anyone else run into this? What should we be looking for?

Peter