I'm fairly sure we are affected by this bug too. I am happy to help in the
hunt, and I have looked through the code more than once.

Which version of grid are you trying to fix? I haven't been following grid
dev too closely; do we still have multiple forks?



On Wed, Feb 6, 2013 at 12:07 PM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:

> On Wed, 6 Feb 2013, Orlando Richards wrote:
> ...
>
>> I've had a go at digging through the code, but couldn't really make head
>> nor tail of it - no doubt in large part due to my not being much of a
>> coder :( Any pointers to get me bootstrapped would be most welcome.
>>
>
> General comments about the source...
>
> Don't be intimidated. It's a large code base, but spend a little time and
> it'll start to make sense. Pick a little bit of it to focus on initially.
>
> Gridengine's source code is layered. The source distribution has a few
> HTML files describing them (some of which still need updating from the 6.0
> days...). Functions near the very top and very bottom of the stack are
> relatively well commented, but the rest can be a little hit and miss.
>
> Ignoring most of the layers, you've essentially got:
>
> At the bottom you've got the wonderful CULL layer: it's very solid and
> provides gridengine with safe complicated data structures. I'd like to pat
> the person who wrote it on the back, although I admit I've yet to get my
> head round the advanced search functionality. State data for jobs and so on
> tend to use it. Use of it can be identified by the data types or functions
> prefixed with "l".
>
> While I'm on data structures, there are also "dstrings" - which provide
> safe string handling.
>
> In the middle you've got the GDI, which is the set of libraries used by
> the different components to communicate with each other over the network.
>
> At the top you've got the qmaster, execd, etc., which can be thought of as
> loosely coupled applications that all use the same underlying
> libraries/layers to coordinate.
>
> I've spent most of my time in the execd, which is pretty easy but messy [a
> very large number of special cases - not totally unexpected with the number
> of platforms supported over the years, but ripe for some refactoring]. I've
> had a brief play in the qmaster and my first impression is that it's more
> consistent and "solid" than the execd, but more complicated.
>
>
> General tips for debugging gridengine...
>
> 1) Play with the loglevel setting in "qconf -sconf" and read the messages
> files.
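
For what it's worth, tip 1 could be sketched roughly like this. It only assumes that "qconf -sconf" prints the global configuration as "name value" lines, one of which is loglevel; the function name sge_loglevel is made up for illustration and is not part of gridengine.

```shell
# Sketch only: print the loglevel currently set in the global configuration.
# Assumption: "qconf -sconf" prints "name value" pairs, one per line.
# sge_loglevel is a hypothetical helper name, not part of gridengine.
sge_loglevel() {
    qconf -sconf | awk '$1 == "loglevel" { print $2 }'
}
```

Calling sge_loglevel should print a value such as log_warning or log_info; the setting itself can be changed with "qconf -mconf".
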
>
>
> 2) Figure out how to stick gridengine into debug mode.
> https://blogs.oracle.com/templedf/entry/using_debugging_output
>
> Essentially something like:
>   * Set up the SGE environment (SGE_ROOT, SGE_QMASTER_PORT, etc.)
>   * Execute: . $SGE_ROOT/util/dl.sh
>   * Execute: dl 1
>   * Execute: $SGE_ROOT/bin/lx-amd64/sge_execd
>
> The program will not daemonise and will print lots of interesting stuff.
> Different 'dl' values will give you different output. I generally find that
> anything greater than 1 is "too much".
>
> This technique will work for pretty much any gridengine component. Even
> qsub.
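
The debug-mode steps from tip 2 can be wrapped up as below. This is a sketch under assumptions: the SGE environment is already exported, the binaries sit under bin/lx-amd64 as in Mark's example, and both run_sge_debug and the SGE_ARCH override are names invented here.

```shell
# Sketch only: run a gridengine component in the foreground with debug output.
# Assumptions: SGE_ROOT (and the other SGE_* variables) are already exported,
# and binaries live in $SGE_ROOT/bin/lx-amd64 as in the example above.
# run_sge_debug and SGE_ARCH are hypothetical names used for illustration.
run_sge_debug() {
    component="${1:-sge_execd}"     # e.g. sge_execd or qsub
    level="${2:-1}"                 # the text suggests 1 is usually enough
    arch="${SGE_ARCH:-lx-amd64}"
    if [ -z "${SGE_ROOT:-}" ] || [ ! -r "$SGE_ROOT/util/dl.sh" ]; then
        echo "SGE_ROOT is unset or util/dl.sh is not readable" >&2
        return 1
    fi
    . "$SGE_ROOT/util/dl.sh"        # defines the dl function
    dl "$level"                     # select the debug level
    "$SGE_ROOT/bin/$arch/$component"
}
```

Something like "run_sge_debug sge_execd 1" then stays in the foreground and prints the debug output until interrupted.
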
>
>
> 3) Run gridengine under gdb.
>
> I don't know if you've had much experience with gdb but, once you've got
> the hang of it, it's very useful in figuring out what some code generally
> does without actually understanding the details. Once you've followed your
> nose to something that doesn't look right, you can then spend time figuring
> things out.
>
> I think some of the gridengine forks try to provide builds with enough
> debugging information for this to work, but I tend to build my own
> gridengine so that I can easily recompile after editing the source with
> potential fixes.
>
> Make sure you build with the "-no-opt" and "-debug" flags to aimk
> (disables optimisation and enables debugging symbols) and keep the source
> tree kicking around for gdb to read. I run our production gridengine with
> those flags and haven't noticed any serious performance problems.
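
A minimal sketch of that build step follows. It assumes aimk is run from the "source" directory of the unpacked tree and leaves out any other flags a particular site needs; build_sge_debug is a made-up name.

```shell
# Sketch only: build gridengine without optimisation and with debug symbols.
# Assumption: aimk is in the directory passed as the first argument, and any
# other flags your site needs (not shown here) are added by hand.
# build_sge_debug is a hypothetical helper name.
build_sge_debug() {
    src_dir="${1:-}"
    if [ -z "$src_dir" ] || [ ! -x "$src_dir/aimk" ]; then
        echo "usage: build_sge_debug /path/to/gridengine/source" >&2
        return 1
    fi
    # Subshell so the caller's working directory is left alone.
    ( cd "$src_dir" && ./aimk -no-opt -debug )
}
```

Keeping the source directory in place afterwards lets gdb find the files it needs to show code alongside the breakpoints.
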
>
> Once you have gridengine running under gdb and playing with breakpoints
> and the rest, you can easily examine interesting data structures with
> commands like "p lWriteList(ptr)", "p lWriteElem(ptr)" and
> "p sge_dstring_get_string(ptr)" (where ptr is a lList*, lListElem* or
> dstring*, respectively).
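
To save some typing, those three calls could go into a gdb command file. The sketch below writes one; the shortcut names sge_plist, sge_pelem and sge_pdstr are invented here, while the functions they call are the ones Mark names.

```shell
# Sketch only: write a gdb command file with shortcuts for the calls above.
# sge_plist, sge_pelem and sge_pdstr are hypothetical names chosen here;
# lWriteList, lWriteElem and sge_dstring_get_string come from the text.
write_sge_gdb_helpers() {
    out_file="${1:-sge-helpers.gdb}"
    cat > "$out_file" <<'EOF'
define sge_plist
  print lWriteList($arg0)
end
define sge_pelem
  print lWriteElem($arg0)
end
define sge_pdstr
  print sge_dstring_get_string($arg0)
end
EOF
}
```

After "write_sge_gdb_helpers /tmp/sge-helpers.gdb", starting gdb with "gdb -x /tmp/sge-helpers.gdb --args $SGE_ROOT/bin/lx-amd64/sge_execd" should make "sge_plist ptr" available at a breakpoint, given a build with debugging symbols.
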
>
>
> ...
>
>> At the moment, I'm trying to get a reproducible test case together to
>> allow for useful debugging - basic tests (sleep 60s) don't show an
>> obvious triggering of the issue, so I'm moving onto more complicated
>> tasks. Certainly, the issue does seem to create orders-of-magnitude
>> differences in reported usage. Current offenders include BLAST jobs (run
>> by our Biology users) - which are fairly memory heavy.
>>
> ...
>
> Being able to reproduce the problem will obviously make things far, far
> easier! If you cannot, you're probably reduced to littering the relevant
> qmaster code with INFO(())/WARNING(())/ERROR(()) statements (and checking
> that loglevel in "qconf -sconf" is set to the appropriate value) and seeing
> what appears in the messages files in production.
>
> If you're lucky, the problem might be evident in the usage information
> being sent from the execd to the qmaster. Running the execd in debug mode
> with "dl 1" will reveal what CPU/MEM/IO values the qmaster is being given
> to be used in the accounting file and the share tree.
>
> If you're unlucky, the problem is in how the qmaster aggregates, records
> and decays the share tree values over time.
>
> If you're really unlucky, the problem might only occur if the various
> gridengine components are under severe stress.
>
> I find that having a non-production installation of gridengine kicking
> around, perhaps in virtual machines, is very handy :)
>
> Hope this helps...
>
>
> Mark
> --
> -----------------------------------------------------------------
> Mark Dixon                       Email    : m.c.di...@leeds.ac.uk
> HPC/Grid Systems Support         Tel (int): 35429
> Information Systems Services     Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK
> -----------------------------------------------------------------
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
>
