Hey folks,

We have a cluster running OGS 2011.11, installed from the courtesy
binaries (the Dropbox download), that runs into trouble when its NFSv4
share is hammered by heavy file access.

I'm 99% certain that this is an NFSv4/kernel/driver/Ubuntu 12.04 LTS
issue, but I wanted to check whether anyone is aware of issues with
OGS on Ubuntu 12.04 LTS, or of any other oddities with NFSv4 over
10GbE.

We used to see more error messages, but after upgrading the NIC driver
this is the only one left in the kernel log:

> xx-05: Oct  9 13:40:16 xx-05 kernel: [167190.710137] nfs4_reclaim_open_state: 
> unhandled error -13. Zeroing state
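
For reference, the kernel logs errors as negated errno values, so the
-13 here corresponds to EACCES (permission denied) on Linux. Below is a
minimal Python sketch for pulling that code out of a syslog line like
the one above. The function name and the regex are my own assumptions
based on this single sample line, not anything provided by the NFS
client or OGS:

```python
import errno
import os
import re

# Pattern is an assumption derived from the one sample message above.
RECLAIM_RE = re.compile(r"nfs4_reclaim_open_state:\s+unhandled error (-?\d+)")

def decode_reclaim_error(line):
    """Return (errno name, description) for a reclaim error line, or None."""
    match = RECLAIM_RE.search(line)
    if match is None:
        return None
    code = abs(int(match.group(1)))  # the kernel logs errno values negated
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# The log line from this post, joined back onto a single line.
sample = ("xx-05: Oct  9 13:40:16 xx-05 kernel: [167190.710137] "
          "nfs4_reclaim_open_state: unhandled error -13. Zeroing state")
print(decode_reclaim_error(sample))
```

Run against a whole syslog file, this would at least show whether -13
is the only error code the state reclaim path is hitting.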


The primary symptom is nodes that appear to hang and lots of jobs stuck
in the SGE 't' state. I think this indicates that, under the hood, SGE
is having trouble writing state and spool info when the NFSv4 share
hits a glitch, timeout, or error.
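
To put a number on the stuck jobs, a rough Python sketch like the
following can tally job states from captured qstat output. It assumes
the default plain-text qstat layout, where the state is the fifth
whitespace-separated column and job names contain no spaces. The
function name and the sample rows are made up for illustration:

```python
from collections import Counter

def count_job_states(qstat_text):
    """Tally the state column from plain-text qstat output."""
    counts = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; header and separator rows do not.
        if len(fields) >= 5 and fields[0].isdigit():
            counts[fields[4]] += 1
    return counts

# Invented sample rows in the assumed default qstat layout.
sample = """\
job-ID  prior   name       user         state submit/start at     queue          slots ja-task-ID
-------------------------------------------------------------------------------------------------
    101 0.55500 align.sh   user1        r     10/09/2012 13:02:11 all.q@xx-03    1
    102 0.55500 align.sh   user1        t     10/09/2012 13:40:02 all.q@xx-05    1
    103 0.55500 align.sh   user1        t     10/09/2012 13:40:05 all.q@xx-05    1
    104 0.00000 align.sh   user1        qw    10/09/2012 13:41:00                1
"""
print(count_job_states(sample))
```

Sampling this every minute or so and lining it up with the kernel log
timestamps should show whether the 't' pile-ups coincide with the NFS
errors.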

Like I said, this clearly feels like an NFSv4/OS/tuning issue, but out
of paranoia I wanted to check whether anyone else has info or
experience.

Next steps for us:

1. Move spooling to local disk
2. See if we can break the same way via NFSv3
3. Play with GlusterFS
4. Standard NFS tuning for OS/kernel
...
N. Maybe rebuild Grid Engine natively on this OS
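
For the NFSv3 comparison in step 2, one option is to run the same small
file-hammering script against each mount and compare the error counts.
Below is a hedged Python sketch of such a test. The function names,
file counts, and worker counts are arbitrary assumptions, and it is no
substitute for a real benchmark, but it exercises the
create/fsync/read/delete pattern from many threads at once:

```python
import errno
import os
import tempfile
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def hammer_file(directory, index, payload=b"x" * 4096):
    """Create, fsync, re-read, and delete one file; return 'ok' or an errno name."""
    path = os.path.join(directory, f"stress-{os.getpid()}-{index}.tmp")
    try:
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the write through to the server
        with open(path, "rb") as handle:
            if handle.read() != payload:
                return "MISMATCH"
        os.remove(path)
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))

def stress(directory, files=200, workers=16):
    """Run hammer_file concurrently and tally the outcomes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda i: hammer_file(directory, i), range(files))
        return Counter(results)

if __name__ == "__main__":
    # Uses a local temp directory here; point it at the NFS mount to test.
    with tempfile.TemporaryDirectory() as scratch:
        print(stress(scratch, files=50, workers=8))
```

Running it from several nodes at once against the NFSv4 mount, and then
against the same export mounted as NFSv3, should show whether the
failures are specific to v4.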


-dag

