On Fri, 2007-04-13 at 01:14 +0900, Naoya Maruyama wrote:
> On 4/12/07, Ashley Pittman <[EMAIL PROTECTED]> wrote:
> > My advice would be first and foremost to look at the core file, I assume
> > your program is receiving a SEGV and exiting? core files can be
> > problematical, partly because they aren't always enabled and partly
> > because to extract anything useful out of them you need to run the
> > debugger with the same environment as the application was, this isn't
> > always as easy as it sounds if you are using modules or something like
> > that.
>
> One question. When the debuggee app was a 32-PE MPI job, you would end
> up with 32 core files. Would you check each of them manually? Or do
> you have any trick to parallelize the checking process? Say, using a
> parallel debugger?

Typically the job is torn down after the first process has exited, so
only one or two core dumps are preserved; I've never needed to examine
every core dump from a job. RMS has automatic core file analysis, so for
every "core" file there is a corresponding "core.out" containing all the
information I'm likely to need. You could do this yourself with a wrapper
script around the application if required.

It's also quite common for jobs to hang, which is where debuggers become
more useful. The trick here is not to look at every process but just the
interesting ones; we have a tool developed in-house for doing just this.

Ashley,

_______________________________________________
Beowulf mailing list, [EMAIL PROTECTED]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
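
As a rough illustration of the wrapper-script idea mentioned above, here is
a minimal sketch in Python. It is not how RMS does it; it only assumes that
gdb is installed, that core dumps are enabled (for example via "ulimit -c"),
and that core files land in the working directory with names starting with
"core". The function names and the "core*" pattern are made up for this
example, and stale core files from earlier runs are not filtered out.

```python
#!/usr/bin/env python3
"""Run a program; if a signal kills it, write a gdb backtrace to <core>.out."""
import glob
import os
import subprocess
import sys


def build_gdb_command(exe, core):
    # -batch makes gdb exit after running its commands;
    # -ex supplies one gdb command on the command line.
    return ["gdb", "-batch", "-ex", "thread apply all bt", exe, core]


def find_core_files(directory="."):
    # ASSUMPTION: core files are named "core" or "core.<something>".
    # The real naming is system dependent. Skip our own ".out" reports.
    paths = glob.glob(os.path.join(directory, "core*"))
    return sorted(p for p in paths if not p.endswith(".out"))


def run_and_analyse(argv, directory="."):
    # On POSIX, subprocess.call returns -N if the child died from signal N.
    returncode = subprocess.call(argv)
    if returncode >= 0:
        return returncode  # normal exit, nothing to analyse
    for core in find_core_files(directory):
        with open(core + ".out", "w") as out:
            # argv[0] must be something gdb can resolve to the executable.
            subprocess.call(build_gdb_command(argv[0], core),
                            stdout=out, stderr=subprocess.STDOUT)
    return returncode


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: core_wrapper.py PROGRAM [ARGS...]")
    rc = run_and_analyse(sys.argv[1:])
    # Follow the shell convention of 128 + signal number for signal deaths.
    sys.exit(128 - rc if rc < 0 else rc)
```

The job launcher would then start this wrapper (saved under a name of your
choosing, "core_wrapper.py" here) in place of the application itself, so each
process that crashes leaves a readable "core.out" beside its core file.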
