Very strange, but I want to make sure I understand what I'm seeing. If I look at process 20221, I see its command swapping between pw.x, sh, vasp5 and sh depending on what time you look at it, right? You also mention pid 3454 is collectl, but that doesn't make a lot of sense to me because while that pid is the parent, collectl doesn't do anything with children like pw.x or vasp5 and I don't think it ever runs a shell either.
You then show some of the raw data for 22377, but none of the earlier examples are for that process so I'm not sure what you're trying to tell me. Here are a few things I do know about process data. The actual process name can be changed at runtime, though I'm not entirely sure how or if this is actually happening or not. One thing that could help is to include the switch --procopts w, which will print out the full 'wide' value of the cmd, rather than just the first piece. At least that way you could see the arguments to the 'sh' command which might offer a clue. One thing - and I've said this on a variety of different occasions - collectl is a rather dumb program in that all it does is print the data it records. In other words I suspect if you look at the raw data you'll see it does match what is being reported (otherwise it IS a bug). Also, if you want to avoid posting usernames, you can tell collectl to use an alternate UID file by pointing it to an alternate password file with --passwd. If you make it an empty file I believe collectl will simply report ??? for usernames. I have no idea if any of this helps or not. It's also not clear whether this is happening for ALL processes or just the few you've pointed out. If just those few I wonder if it always happens to those few and only those AND if it's reproducible. Clearly more data is needed to get to the bottom of this, but if it is reproducible I'm sure we will be able to. one other thought - if it is reproducible, perhaps try doing a ps or just do a tail on the /proc/pid/stat and /proc/pid/cmd files to see if the process name is indeed changing dynamically. -mark On Fri, Oct 28, 2011 at 9:04 AM, Andreas Johansson <[email protected]> wrote: > A compute node in one of our clusters crashed the other day, and while > looking at the stored collectl data I found an anomaly. > > Consider the following data, just varying the starting point in time. > Process 3454 is collectl, so the shell spawned by collectl is somehow > mixed up with data from older processes. > > When the start time is moved beyond the end of the run time of these > processes, they disappear from view. As you can see by comparing 14:02 > and 14:04, the ordering is not consistent, but in this sample it's > always every other process at least. > > (I've redacted the user names below.) > > # collectl -p m85-20111027-000100.raw.gz --from 09:00 | grep "0 Z" | tail > -n 10 > 20221 root 16 3454 0 Z 16G 167M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user1/espresso/PW/pw.x > 20233 root 18 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20248 root 19 3454 0 Z 16G 167M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user1/espresso/PW/pw.x > 20260 root 19 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20274 root 19 3454 0 Z 16G 167M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user1/espresso/PW/pw.x > 20287 root 19 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20301 root 19 3454 0 Z 16G 167M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user1/espresso/PW/pw.x > 20313 root 19 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20327 root 16 3454 0 Z 16G 167M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user1/espresso/PW/pw.x > 20340 root 17 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > > # collectl -p m85-20111027-000100.raw.gz --from 14:02 | grep "0 Z" | tail > -n 10 > 20221 root 16 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20233 root 18 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20248 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20260 root 19 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20274 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20287 root 19 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20301 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20313 root 19 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20327 root 16 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20340 root 17 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > > # collectl -p m85-20111027-000100.raw.gz --from 14:04 | grep "0 Z" | tail > -n 10 > 20221 root 16 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20233 root 18 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20248 root 19 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20260 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20274 root 19 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20287 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20301 root 19 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20313 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20327 root 16 3454 0 Z 16G 95M 0 0.00 0.00 0 0:00.00 > 0 0 0 7 /home/user2/bin/vasp5 > 20340 root 17 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > > # collectl -p m85-20111027-000100.raw.gz --from 14:06 | grep "0 Z" | tail > -n 10 > 20221 root 16 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20233 root 18 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20248 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20260 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20274 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20287 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20301 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20313 root 19 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20327 root 16 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > 20340 root 17 3454 0 Z 0 0 0 0.00 0.00 0 0:00.00 > 0 0 0 7 sh > > This is a CentOS 5.7 system with collectl 3.5.1 from EPEL. I've > tried playing back the data with 3.6.0-3 downloaded from > Sourceforge and gotten the same results. > > After the processes have ended, there is no mention of them in the > raw data file. > > While looking at this together with my colleague Peter Kjellström, > we couldn't find anything obviously wrong in the raw data. The only > idea we could come up with was that something went wrong while > parsing the orted command that was the parent of the ghost > processes. Below is an example > > proc:22377 stat 22377 (orted) S 22374 22377 22373 0 -1 4202752 1079 0 0 0 > 1 1 0 0 15 0 1 0 6279222 52596736 575 18446744073709551615 4194304 4197348 > 140737418411600 18446744073709551615 47955180684895 0 0 4096 85730 0 0 0 17 > 6 0 0 0 > proc:22377 Tgid: 22377 > proc:22377 Uid: 2536 2536 2536 2536 > proc:22377 VmPeak: 53416 kB > proc:22377 VmSize: 51364 kB > proc:22377 VmLck: 0 kB > proc:22377 VmHWM: 2300 kB > proc:22377 VmRSS: 2300 kB > proc:22377 VmData: 624 kB > proc:22377 VmStk: 88 kB > proc:22377 VmExe: 4 kB > proc:22377 VmLib: 6476 kB > proc:22377 cmd /software/mpi/openmpi/1.4.1/i101011/bin/orted -mca ess > slurm -mca orte_ess_jobid 949420032 -mca orte_ess_vpid 1 -mca > orte_ess_num_procs 8 --hnp-uri "949420032.0;tcp://10.5.10.15:34202;tcp:// > 10.7.0.15:34202" > proc:22377 io rchar: 191142 > proc:22377 io wchar: 135028 > proc:22377 io syscr: 541 > proc:22377 io syscw: 211 > proc:22377 io read_bytes: 0 > proc:22377 io write_bytes: 0 > proc:22377 io cancelled_write_bytes: 0 > proc:22382 stat 22382 (pw.x) R 22377 22377 22373 0 -1 4202496 526606287 0 > 50 0 2031546 75590 0 0 25 0 3 0 6279289 17620967424 44600 > 18446744073709551615 4194304 12175148 140735068407008 18446744073709551615 > 47044798491665 0 0 0 83182 18446744073709551615 0 0 17 6 0 0 637 > proc:22382 Tgid: 22382 > proc:22382 Uid: 2536 2536 2536 2536 > proc:22382 VmPeak: 17654764 kB > proc:22382 VmSize: 17207968 kB > proc:22382 VmLck: 28876 kB > proc:22382 VmHWM: 623160 kB > proc:22382 VmRSS: 178400 kB > proc:22382 VmData: 16975836 kB > proc:22382 VmStk: 1484 kB > proc:22382 VmExe: 7796 kB > proc:22382 VmLib: 54116 kB > proc:22382 cmd /home/user1/espresso/PW/pw.x -npool 8 > proc:22382 io rchar: 9745700495 > proc:22382 io wchar: 4720168364 > proc:22382 io syscr: 1205 > proc:22382 io syscw: 16451 > proc:22382 io read_bytes: 0 > proc:22382 io write_bytes: 5842436096 > proc:22382 io cancelled_write_bytes: 0 > > Any ideas about what can be wrong? > > Best regards, > > -- > Andreas Johansson [email protected] > National Supercomputer Centre +46-13-285778 > Linköping University > > > ------------------------------------------------------------------------------ > The demand for IT networking professionals continues to grow, and the > demand for specialized networking skills is growing even more rapidly. > Take a complimentary Learning@Cisco Self-Assessment and learn > about Cisco certifications, training, and career opportunities. > http://p.sf.net/sfu/cisco-dev2dev > _______________________________________________ > Collectl-interest mailing list > [email protected] > https://lists.sourceforge.net/lists/listinfo/collectl-interest >
------------------------------------------------------------------------------ Get your Android app more play: Bring it to the BlackBerry PlayBook in minutes. BlackBerry App World™ now supports Android™ Apps for the BlackBerry® PlayBook™. Discover just how easy and simple it is! http://p.sf.net/sfu/android-dev2dev
_______________________________________________ Collectl-interest mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/collectl-interest
