Dear Prof. Marks,

As I suspected, users cannot use Ganglia. Our administrators are very jealous!!
Dear Elias Assmann,

Many thanks for your comments. I will try to comment on some of them.

> First of all, I wonder: To what extent is this problem reproducible?
> E.g., does your job always run on the same 4 nodes?

Yes.

> Is it always the
> same node(s) that are slow?

Yes.

> Does the problem also show up in other
> calculations (maybe just changing the number of k-points, or
> restarting the same case from scratch).

The strangest part: at the beginning of this month, the same calculation was running properly. I had a crash due to convergence problems, and when I reduced the "mixing factor" in case.inm (it is now 0.04 in the pre-convergence SCF cycle) the problems started. Obviously, I do not believe that the mixing factor is the problem.

> Is it only lapw1 that is slow?

No. All the executables are running slowly on the problematic node.

> Second, how did you make those ‘top’s? As for ‘lapw0’ and ‘lapw1’, I
> am guessing that this is just because the snapshots were taken at
> different times (notice that the CPU times of lapw0 on the two nodes
> are quite different, too).

Users can do nothing. The administrator sent me the "tops" and I have asked him for simultaneous ones.

> About the CPU usage on ‘n2’, I find this very suspicious. If it is as
> Peter said that the jobs are in the initialization and therefore not
> computing much, that may be fine; but I have to disagree with his
> assessment, because the memory usage of lapw1 on the two nodes is
> basically the same (if anything, the image sizes on ‘n2’ are slightly
> larger). Note also that it is *not* the case that other processes are
> using the CPU; the total usage is at 7.5 %.
>
> It would be good to clarify that by getting a ‘top’ such that we know
> that lapw1 had been running for a while. To this end, top has an ‘-n’
> option which says how many frames to output, e.g. ‘top -bn 10’.
>
> I am also curious about the load averages. ‘n2’ has larger “mid-term”
> and “long-term” load averages than the others, and its “short-term”
> average is just as large. I am not sure what that means.
>
> On 09/23/2015 02:21 PM, Luis Ogando wrote:
> > I can not access the nodes. SSH among them is forbidden ! We have
> > to ask the administrators for anything !! It is the hell !! Of
> > course, only the PBS jobs can "travel" among the nodes.
>
> I do not know about PBS Pro, but Torque and SGE have an option (I
> think ‘-I’ in either case) to submit an interactive job where you get
> a login on a node. Of course that is only a realistic option when the
> queuing time is not too long. Otherwise, any information that a more
> sophisticated tool can give you will also be available from the
> command line (just more painful to extract!) via ‘top’, ‘ps’, ‘/proc’,
> etc. You can also put these things in a jobs script (which you
> apparently already did with ‘top’).
>
>
> Good luck,
>
> Elias

Finally, I would like to thank everyone for the comments. If I did not respond to some of them, it is because the administrators said they cannot be the origin of the problem: "everything is OK" (?).

All the best,
Luis
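
A minimal sketch of the command-line monitoring suggested above (‘top -bn’ and ‘/proc’ called from within a job script), assuming a Linux node and a POSIX shell. The function name log_node_state and its two arguments are hypothetical, chosen only for illustration; they are not part of WIEN2k or PBS.

```shell
#!/bin/sh
# Hypothetical helper: print a few timestamped snapshots of node activity.
# Usage: log_node_state [samples] [interval_seconds]

log_node_state() {
    samples=${1:-3}      # number of snapshots to take (assumed default)
    interval=${2:-5}     # seconds to wait between snapshots (assumed default)
    i=0
    while [ "$i" -lt "$samples" ]; do
        echo "=== $(uname -n) $(date '+%Y-%m-%d %H:%M:%S') ==="
        # 1-, 5- and 15-minute load averages (Linux-specific file)
        if [ -r /proc/loadavg ]; then
            cat /proc/loadavg
        fi
        # One batch-mode frame of top: summary lines plus the first processes listed
        top -bn 1 | head -n 15
        i=$((i + 1))
        # Pause between snapshots, but not after the last one
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

In a job script one might call, for example, `log_node_state 10 30 > node_state.log &` before starting the calculation, so that the snapshots are taken while lapw1 is running. Note that this only reports on the node where the job script itself executes.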
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html