Thanks, Reuti! Yes, the user intended to generate those millions of files in the local scratch directory provided by SGE.
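In case it helps to check how many files a job has left in its scratch directory: a plain "ls" reads and sorts the whole listing before printing anything, so a streaming count is much cheaper. Below is a minimal Python sketch, not a tested tool. The helper name count_entries is made up for illustration, and it assumes the scratch path is the one SGE exports to the job as TMPDIR.

```python
import os


def count_entries(root):
    """Count non-directory entries (files, symlinks) below root.

    os.scandir streams directory entries instead of building a sorted
    list, and on most Linux filesystems is_dir() can be answered from
    the directory entry itself, without an extra stat() call per file.
    """
    total = 0
    stack = [root]  # directories still to visit (iterative, no recursion)
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    total += 1
    return total
```

Inside a job this could be called as count_entries(os.environ["TMPDIR"]), for example right before the job exits, to see how much cleanup work is left behind.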

> Can you spot any oom-killer in the messages file of the node?

No, I did not find any useful information in the log files. The user happened to run 10 jobs on one node, and these jobs generated more than 10 million files. It took a very long time even to "ls" these files. It looks like execd needs a very long time to remove them, since I noticed heavy I/O by execd after I had deleted the jobs.

On Mon, Jan 26, 2015 at 12:04 PM, Reuti <[email protected]> wrote:
> Hi,
>
>> On 26.01.2015 at 17:15, Feng Zhang <[email protected]> wrote:
>>
>> I just found a strange behavior of SGE 2011.
>>
>> One user's job generates 1+ million small files on the local
>> disk ($TEMPDIR).
>
> Hence in the local scratch directory provided by SGE?
>
>
>> It looks like this makes execd very busy, and from
>> the side of qmaster the node is lost and unavailable, while I can still
>> log in via ssh. On the node, execd produces heavy I/O (a few hundred KB/s to a
>> few MB/s). Some nodes survive and get back to normal, some nodes
>> fail in the end (since this kind of job also uses a lot of memory,
>> it looks like these nodes failed when the RAM got used up).
>
> Can you spot any oom-killer in the messages file of the node?
>
>
>> I am
>> wondering whether execd handles the files that a job
>> generates?
>
> Not that I'm aware of. It will just remove the generated directory after the
> job.
>
> Is it intended by the user to generate this high number of files? It could be
> limited with a set disk quota.
>
> -- Reuti
>
>
>> Or does execd do something else to communicate with qmaster
>> while there are a lot of job-generated files?
>>
>>
>> --
>> Best,
>>
>> Feng
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>

--
Best,

Feng
