We just had the longest buildsys outage we've had since I started (not happy). But the good news is we have lots of options to make sure this particular issue doesn't happen again.

The problem is sort of a three-fold issue:

1) NFS is running on the Xen dom0 (will be fixed next week)
2) The NFS lock service hangs, which prevents our clients from locking files
3) When it gets into this state, restarting nfslock won't fix the problem. The port stays open, so we have to restart the host.

There's not much we can do about 2) or 3) except rely on upstream to fix the problem. Seth suggested that when we fix 1) we make it a RHEL4 box. I think this is probably the best solution.

I'm also looking at other solutions for some of our other applications. Our environment isn't in terrible shape, but I think it could be better, and I'm looking at some different tools that might make things easier on us. Some of our apps don't need to be load balanced, they just need more HA. And sometimes those apps are getting beaten up by apps that do need to be properly load balanced.

I'm going to go through and see what some of our options are, and I encourage those familiar with our environment to do the same. Some of our apps certainly need work (smolt especially), but I think there are things we can do as infrastructure members to mitigate risks and ensure we're seeing fewer outages.

-Mike

_______________________________________________
Fedora-infrastructure-list mailing list
Fedora-infrastructure-list@redhat.com
https://www.redhat.com/mailman/listinfo/fedora-infrastructure-list