Hi guys, I just ran into some strange behavior with SGE 2011.
One user's job generates more than a million small files on the local disk ($TEMPDIR). This seems to keep execd very busy: from the qmaster's side the node appears lost and unavailable, even though I can still log in to it via ssh. On the node itself, execd generates heavy I/O (from a few hundred KB/s to a few MB/s). Some nodes survive and return to normal; others eventually fail. These jobs also use a lot of memory, so it looks like those nodes failed when their RAM was used up.

Does execd process the files that a job generates? Or does execd do something else to communicate with qmaster while a job is generating a lot of files?

--
Best,
Feng
