On 11 October 2012 00:13, Konstantin Boudnik <[email protected]> wrote:
> Steve,
>
> great stuff. Here's my initial feedback:
>
> 1. I am not passing judgement about how the monitoring is done, although
> something like Nagios would fill the bill good enough, IMO.

Nagios can over-react and send 4000 emails out when a service isn't
responding, without noting that it's down for another reason -and it's
biased towards those email notifications.

> Anyway... It seems like this monitoring is very Hadoop HA specific,

It could actually monitor any service with one or more of

  pid
  port
  URL

The Hadoop-ness currently comes from

  1. specific probes for HDFS and JT
  2. use of hadoop XML config for settings (trivial fix)
  3. probe order fixed in source
  4. no current support for adding new probes just by putting them on
     the classpath and declaring them
  5. an installation that goes under /usr/lib/hadoop and picks up the
     hadoop classpath and native lib so its hadoop probes are always in
     sync with the runtime

I'd fix 2 & 3 by having a better config language that lets you specify
an order of operations.

> I would say that it is better be kept in Hadoop in one form or
> another - hadoop/contrib seems like a good place to start. In other
> words, I don't think this is generic enough monitoring software to be
> included into the BigTop.

OK

> Say, I'd be happy to include Ganglia or some Nagios hooks for the same
> purposes. Packaging for this monitoring software can be of course
> added to the BigTop stack like we are doing this for many other
> components - it looks very reasonable approach.
>
> 2. The failure inducing library seems like a great addition to the
> iTest. In fact, if I were doing Hadoop fault injection again I would
> certainly go with MOP'ping and Groovy-based framework, instead of
> AspectJ boredom. So, I like the idea and it seems to fit very well
> with the original design ideas of the iTest - let's add the library to
> the BigTop. There are things to look at and discuss of course but I
> like the overall idea!

OK. This bit of it is very immature and might ultimately go into its
own module, so that hadoop HA tests can use it too.

> To summarize: I'm rather negative on keeping the monitoring software
> as a part of the BigTop; and I am quite positive on bring the testing
> lib as a part of the iTest.

I'll have a look at iTest and see where it fits in, then we can start
thinking about what a good test framework for triggering infrastructures
would be. I think what I've got is just a starting point.

FWIW jclouds is looking at vbox integration too, via its Web Service
API. It could be used to trigger VM death in any virtual infrastructure;
we'd just need to add back ends for physical infrastructures (for now:
dialog boxes & fencing scripts), and the code to cause trouble inside
the VM itself.

BTW, one thing you can do with virtual infrastructure is forced volume
unmounts, umount -f, which could be used to simulate disk, disk
controller or disk driver problems. Something like that would be really
good for generating stress tests of all the storage layers.

-steve
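
For illustration only, the idea above of checking a service by pid, port
or URL, with the probe order taken from configuration rather than fixed
in source, could be sketched roughly as follows. This is a minimal
Python sketch and not the monitor's actual code: the names pid_probe,
port_probe, url_probe and first_failure are hypothetical, and the pid
check assumes a POSIX system.

```python
import os
import socket
import urllib.request
from typing import Callable, List, Optional, Tuple

# A probe is a (name, check) pair; check() returns True when healthy.
Probe = Tuple[str, Callable[[], bool]]


def pid_probe(pid: int) -> Callable[[], bool]:
    """Build a check that the process with this pid exists (POSIX only)."""
    def check() -> bool:
        try:
            os.kill(pid, 0)  # signal 0 tests for existence, sends nothing
        except ProcessLookupError:
            return False
        except PermissionError:
            return True  # process exists but is owned by another user
        return True
    return check


def port_probe(host: str, port: int, timeout: float = 2.0) -> Callable[[], bool]:
    """Build a check that a TCP connection to host:port can be opened."""
    def check() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False
    return check


def url_probe(url: str, timeout: float = 2.0) -> Callable[[], bool]:
    """Build a check that an HTTP GET of url succeeds."""
    def check() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 400
        except (OSError, ValueError):  # URLError/HTTPError are OSErrors
            return False
    return check


def first_failure(probes: List[Probe]) -> Optional[str]:
    """Run probes in the given order; return the first failing name, or None."""
    for name, check in probes:
        if not check():
            return name
    return None
```

The order of the list passed to first_failure is the order of
operations, so a config file would only need to produce that list, for
example [("pid", pid_probe(1234)), ("port", port_probe("localhost", 8020))],
where the pid and port values are placeholders.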
