Michael Dalton wrote:
> Hi all,

Hi, sorry I've not replied; I've been on holiday/vacation for two weeks.

>
> I have a quick question regarding SmartFrog and managing Hadoop
> clusters. I'm setting up a reasonable-sized Hadoop cluster that's
> expected to grow fairly quickly. However, I'm not really sure what the
> appropriate cluster management and administration tools are. From what
> I understand, I'll need tools to manage software configurations
> (packages+config files), obtain hadoop-specific metrics, obtain
> overall machine metrics (CPU load, memory usage, etc), and continuously
> perform process health/service lifecycle management (i.e. restarting
> datanodes that crash, reporting on critical errors like EDAC errors in
> Linux indicating uncorrectable DRAM flaws).
>
> It appears right now folks like Yahoo! use Yum (or apt) + bcfg2 for
> the package/configuration, some variant of Ganglia for Hadoop, Nagios
> for overall metrics, and there is nothing publicly available for
> process health. I'm fine using Ganglia and Nagios for metrics unless
> someone can point me to better tools, but I'd rather not cobble
> together something using bcfg2 and hacked up shell scripts for
> configuration/process health management.

Ganglia and Nagios are the way to go for monitoring. Todd Lipcon,
author of Ganglia, now works for Cloudera; we've been discussing adding
better monitoring/logging. Specifically, we want more stable/parseable
events logged.

>
> It looks like SmartFrog in conjunction with HADOOP-3628 would provide
> both reasonable configuration and process health/service lifecycle
> management. If I understand correctly, SmartFrog will allow me to
> manage Hadoop configuration files, manage the installation of my
> Hadoop packages, _and_ also provide monitoring of the health of Hadoop
> nodes so I can automatically log and restart hadoop nodes when they
> crash, etc.
>

Yes, and on VM infrastructure, it lets you ask for new VMs.

> Is this a reasonably accurate description of the state of the art with
> Hadoop and cluster administration?
> Also, are there any estimates
> on when the HADOOP-3628 branch will be committed into SVN trunk, and
> when that occurs will the SmartFrog project still need to maintain its
> own Hadoop branch?

Once that change goes in, no more branch, which is good as it takes up
too much time. We would still have our own subclasses of the main
Hadoop objects and other things which would need to stay in sync, but
that is easier to handle, and would be in our codebase.

That said, once the lifecycle patch goes in, I have some other things I
want to deal with:
 * long-haul job submission/monitoring
 * explicit support for different configuration backends

The current state of the branch is that everything worked up to the big
three-way -core -hdfs -mapred split, and I've not sat down to update it
for the changes. That is down to a combination of the two-week break
and local pressure to get that version of Hadoop working on dynamic VM
clusters: clusters where things like hostnames aren't known until
startup, which makes configuration that much trickier/more fun. I don't
plan to sit down and deal with the merge for another couple of weeks,
though I will start looking at it. My code was fairly stable; the big
problem is more that the Hadoop code tends to move around when you are
not looking.

> Lastly, I was unable to find a list of real-world
> projects running SmartFrog: are there any large-scale (> 1000 node)
> clusters running SF? Thanks for your help

We're using it for smaller, short-lived clusters. Once the OpenCirrus
cloud testbed is live I am going to try it on thousands of nodes; as it
is, we are only working at the 1-10-100 scale. Short cluster life also
stops me having to worry about how stable the filesystem is; keep the
data on other filestores and I can run against SVN_HEAD code without
worrying about what happens if HDFS goes wrong.

If you want to use SF with Hadoop, talk to me direct about your cluster
and it will give me extra motivation to get the branch in, and extra
pressure on Y!
to accept it; Julio and I can help you get started with SmartFrog. It
also generates extra motivation for me to create better documentation,
which is always something on my TODO list.

-Steve

------------------------------------------------------------------------------
_______________________________________________
Smartfrog-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/smartfrog-users
