I'm fairly sure we are affected by this bug too. I'm happy to help in the hunt, and I have looked through the code more than once.
Which version of gridengine are you trying to fix? I haven't been following gridengine development too closely. Do we still have multiple forks?

On Wed, Feb 6, 2013 at 12:07 PM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:

> On Wed, 6 Feb 2013, Orlando Richards wrote:
> ...
>
>> I've had a go at digging through the code, but couldn't really make head
>> nor tail of it - no doubt in large part due to my not being much of a
>> coder :( Any pointers to get me bootstrapped would be most welcome.
>
> General comments about the source...
>
> Don't be intimidated. It's a large code base, but spend a little time and
> it'll start to make sense. Pick a little bit of it to focus on initially.
>
> Gridengine's source code is layered. The source distribution has a few
> HTML files describing them (some of which still need updating from the 6.0
> days...). Functions near the very top and very bottom of the stack are
> relatively well commented, but the rest can be a little hit and miss.
>
> Ignoring most of the layers, you've essentially got:
>
> At the bottom you've got the wonderful CULL layer: it's very solid and
> provides gridengine with safe complicated data structures. I'd like to pat
> the person who wrote it on the back, although I admit I've yet to get my
> head round the advanced search functionality. State data for jobs and so on
> tend to use it. Use of it can be identified by the data types or functions
> prefixed with "l".
>
> While I'm on data structures, there are also "dstrings" - which provide
> safe string handling.
>
> In the middle you've got the GDI, which is the set of libraries used by
> the different components to communicate with each other over the network.
>
> At the top you've got the qmaster, execd, etc., which can be thought of as
> loosely coupled applications that all use the same underlying
> libraries/layers to coordinate.
>
> I've spent most of my time in the execd, which is pretty easy but messy [a
> very large number of special cases - not totally unexpected with the number
> of platforms supported over the years, but ripe for some refactoring]. I've
> had a brief play in the qmaster and my first impression is that it's more
> consistent and "solid" than the execd, but more complicated.
>
>
> General tips for debugging gridengine...
>
> 1) Play with the loglevel setting in "qconf -sconf" and read the messages
> files.
>
>
> 2) Figure out how to stick gridengine into debug mode.
> https://blogs.oracle.com/templedf/entry/using_debugging_output
>
> Essentially something like:
> * Setup sge environment (SGE_ROOT, SGE_QMASTER_PORT, etc.)
> * Execute: . $SGE_ROOT/util/dl.sh
> * Execute: dl 1
> * Execute: $SGE_ROOT/bin/lx-amd64/sge_execd
>
> The program will not daemonise and will print lots of interesting stuff.
> Different 'dl' values will give you different output. I generally find that
> anything greater than 1 is "too much".
>
> This technique will work for pretty much any gridengine component. Even
> qsub.
>
>
> 3) Run gridengine under gdb.
>
> I don't know if you've had much experience with gdb but, once you've got
> the hang of it, it's very useful in figuring out what some code generally
> does without actually understanding the details. Once you've followed your
> nose to something that doesn't look right, you can then spend time figuring
> things out.
>
> I think some of the gridengine forks try to provide builds with enough
> debugging information for this to work, but I tend to build my own
> gridengine so that I can easily recompile after editing the source with
> potential fixes.
>
> Make sure you build with the "-no-opt" and "-debug" flags to aimk
> (disables optimisation and enables debugging symbols) and keep the source
> tree kicking around for gdb to read. I run our production gridengine with
> those flags and haven't noticed any serious performance problems.
>
> Once you have gridengine running under gdb and playing with breakpoints
> and the rest, you can easily examine interesting data structures with
> commands like "p lWriteList(ptr)", "p lWriteElem(ptr)" and
> "p sge_dstring_get_string(ptr)" (where ptr is a lList*, lListElem* or
> dstring*, respectively).
>
>
> ...
>
>> At the moment, I'm trying to get a reproducible test case together to
>> allow for useful debugging - basic tests (sleep 60s) don't show an
>> obvious triggering of the issue, so I'm moving onto more complicated
>> tasks. Certainly, the issue does seem to create orders-of-magnitude
>> differences in reported usage. Current offenders include BLAST jobs (run
>> by our Biology users) - which are fairly memory heavy.
>
> ...
>
> Being able to reproduce the problem will obviously make things far, far
> easier! If you cannot, you're probably reduced to littering the relevant
> qmaster code with INFO(())/WARNING(())/ERROR(()) statements (and checking
> that loglevel in "qconf -sconf" is set to the appropriate value) and seeing
> what appears in the messages files in production.
>
> If you're lucky, the problem might be evident in the usage information
> being sent from the execd to the qmaster. Running the execd in debug mode
> with "dl 1" will reveal what CPU/MEM/IO values the qmaster is being given
> to be used in the accounting file and the share tree.
>
> If you're unlucky, the problem is in how the qmaster aggregates, records
> and decays the share tree values over time.
>
> If you're really unlucky, the problem might only occur if the various
> gridengine components are under severe stress.
>
> I find that having a non-production installation of gridengine kicking
> around, perhaps in virtual machines, is very handy :)
>
> Hope this helps...
>
>
> Mark
> --
> -----------------------------------------------------------------
> Mark Dixon                       Email    : m.c.di...@leeds.ac.uk
> HPC/Grid Systems Support         Tel (int): 35429
> Information Systems Services     Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK
> -----------------------------------------------------------------
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
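
For anyone wanting to try tips 1 and 2 from Mark's message, here is a rough sketch of them as two small shell helpers. This is only an illustration under stated assumptions: the function names sge_loglevel and sge_debug_run and the SGE_ARCH_DIR variable are made up for this example and are not part of gridengine, and the lx-amd64 default is simply the architecture directory used in the message. It also assumes "qconf -sconf" prints the setting as a line starting with "loglevel" followed by the value.

```shell
#!/bin/sh
# Sketch only: helper names and SGE_ARCH_DIR are hypothetical, not gridengine APIs.

# Tip 1: pull the loglevel value out of "qconf -sconf" output read on stdin.
# Assumes a line of the form "loglevel   <value>".
sge_loglevel() {
    awk '$1 == "loglevel" { print $2 }'
}

# Tip 2: run a gridengine component in the foreground with debug output.
#   $1 = level passed to dl (default 1)
#   $2 = binary name under $SGE_ROOT/bin/<arch> (default sge_execd)
sge_debug_run() {
    level="${1:-1}"
    component="${2:-sge_execd}"
    arch_dir="${SGE_ARCH_DIR:-lx-amd64}"   # assumption: same directory as in the message

    # Refuse to continue if the sge environment has not been set up.
    if [ -z "${SGE_ROOT:-}" ] || [ ! -r "$SGE_ROOT/util/dl.sh" ]; then
        echo "sge_debug_run: SGE_ROOT is unset or util/dl.sh is not readable" >&2
        return 1
    fi

    . "$SGE_ROOT/util/dl.sh"               # provides the dl command used below
    dl "$level"                            # select the debug level
    "$SGE_ROOT/bin/$arch_dir/$component"   # stays in the foreground and prints debug output
}
```

Possible usage, once the sge environment is set up: "qconf -sconf | sge_loglevel" to check the current loglevel, and "sge_debug_run 1 sge_execd" to start the execd in debug mode. The guard at the top of sge_debug_run is there so the helper fails early rather than running a half-configured daemon.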