On Saturday 23 October 2010 19:31:53 Alan Brown wrote:
> I'm running into a few issues with enterprise-scale backups and have a
> wishlist of major/minor tweaks I'd like to see.
>
> We've recently obtained a new 7-drive Neo8000 autochanger (which "works
> great" out of the box), but it's shown up some Bacula issues relating
> to simultaneous jobs and limiting resource hogging on clients, server
> and clustered filesystem arrays.
>
> Wish 1: It would be nice to allow a Pool or Job to have a choice between
> more than "one or all" of the available drives (that's effectively what
> "Prefer Mounted Volumes" gives).
Yes, I can see the utility of that.

> Wish 2: (related) It would be useful to have maximum concurrent job
> limits per pool.

Yes, perhaps. Though with so many Maximum Concurrent Jobs settings, it
becomes a nightmare to understand what is going on.

> Wish 3: It would also be good to have Bacula take notice of the
> per-drive limits on an autochanger.

I wasn't aware that this was a problem. If it is, it is probably a bug.
The main use for the drive limits is for autochangers.

> Right now, if you choose the autochanger as the backup device, its limit
> is the only control, and that can result in one drive running 10 jobs
> (even if the individual drive entry has a lower Maximum Concurrent Jobs
> setting) while others sit idle because the Director is waiting on the
> maximum number of storage jobs.
>
> This can result in (time-consuming) fileserver backups preventing
> desktop backups from taking place, etc.
>
> Wish 4: It would be good to be able to define a group of clients and
> then set a maximum number of concurrent jobs for that group.

This is a rather ambitious project.

> Why: Linux (and most other OS) NFS server code is fundamentally broken
> and unsafe for clustering, as it ignores filesystem locks set by other
> processes. As a result it is unsafe (risk of data corruption) to allow
> clustered filesystems to accept activity from any node OTHER than the
> one acting as the NFS server. (This raises the question of why one
> should bother with clustered filesystems at all; the answer is that
> they are useful in a high-availability environment because the NFS
> service can be transferred to another node in seconds.)
>
> (This same problem also raises a risk of data corruption on any Linux
> system acting as an NFS fileserver, clustered or not!
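For reference, the per-drive limits Wish 3 talks about are normally expressed in bacula-sd.conf by giving each Device resource its own Maximum Concurrent Jobs, with an Autochanger resource tying the drives together. A minimal sketch (the names, device paths, media type and limits here are illustrative placeholders, not Alan's actual setup):

```conf
# bacula-sd.conf -- sketch only; names and paths are hypothetical.

Autochanger {
  Name = Neo8000
  Device = Drive-0, Drive-1          # the individual tape drives below
  Changer Device = /dev/sg10
  Changer Command = "/usr/lib/bacula/scripts/mtx-changer %c %o %S %a %d"
}

Device {
  Name = Drive-0
  Drive Index = 0
  Archive Device = /dev/nst0
  Media Type = LTO-4
  Autochanger = yes
  Maximum Concurrent Jobs = 2        # per-drive limit the Director should honour
}

Device {
  Name = Drive-1
  Drive Index = 1
  Archive Device = /dev/nst1
  Media Type = LTO-4
  Autochanger = yes
  Maximum Concurrent Jobs = 2
}
```

The symptom described above is that the Storage resource's Maximum Concurrent Jobs in bacula-dir.conf ends up being the only limit actually enforced, so jobs can pile onto one drive despite these per-Device settings.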
> The only safe way to simultaneously export multiple protocols is to
> export them from an NFS client and not run any processes on the NFS
> fileserver which directly manipulate the NFS-exported filesystem.)
>
> On top of the above problems, with GFS (and most other clustered
> filesystems) a read/write lock must be propagated across the cluster
> for every file being opened on each cluster node (NFS ignores this!),
> which can drive network load up dramatically during an incremental
> backup, as well as hitting actual backup rates quite hard. One node is
> usually notionally the master for any filesystem, and in general the
> master is decided dynamically by whichever node is making the most
> lock requests.
>
> In order to accommodate the problem I've had to define a virtual
> Bacula client per NFS service. That client follows the filesystem's
> location, but it breaks the restrictions I was previously able to
> enforce using per-client job limits.
>
> This is a major problem, because most of our filesystems live on a
> couple of 40 TB nearline storage arrays. As the number of simultaneous
> backups increases, their performance falls away rapidly.
>
> The current situation allows backups to badly affect NFS server
> performance - which users have noticed and are complaining loudly
> about. I really need to restrict the number of simultaneous backups
> coming out of any given array, and the only way this seems feasible is
> to be able to group clients and then impose a simultaneous job limit
> across them.
>
> Wish 5: Better optimisation/caching of directory lists (is this
> possible?)

Not much chance this will be done, as it is an OS filesystem problem,
not Bacula's ...

> Most of us are aware that the more entries there are in a directory,
> the slower it is to load.
>
> Users are not aware of this - and they resent being told to keep
> things in hierarchical layouts instead of one large flat space.
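A virtual client per NFS service of the kind described above is usually just a Client resource in bacula-dir.conf whose Address is the floating HA service address rather than a physical node, so the definition follows the filesystem wherever the cluster moves it. A rough sketch, with made-up names, address and password, and a per-client limit standing in for the per-group limit that Wish 4 asks for:

```conf
# bacula-dir.conf -- sketch only; name, address and password are
# hypothetical placeholders.

Client {
  Name = nfs-home-fd                # one virtual client per NFS service
  Address = nfs-home.example.org    # floating service address, not a node
  FDPort = 9102
  Catalog = MyCatalog
  Password = "changeme"
  Maximum Concurrent Jobs = 1       # per-client only; cannot span a group
}
```

The limitation behind Wish 4 is visible here: Maximum Concurrent Jobs is enforced per Client, so several such virtual clients backed by the same 40 TB array can still run at the same time.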
> GFS and GFS2 behave incredibly badly if there are a lot of files in a
> directory. I've seen them take 5-6 minutes to open a directory with
> 10,000 entries, and up to 30 minutes to open a directory with 100,000
> files in it (this not only affects the process concerned, it also
> causes the entire filesystem to slow down for all users). When Bacula
> hits a directory with a lot of files in it on an incremental backup,
> things get even slower. :-(
>
> Feedback and ideas are welcome. Telling me not to use GFS doesn't tell
> me anything I don't already know; however, I'm stuck with it for the
> moment and am avoiding any more cluster deployments until Red Hat
> makes it work properly (I have a 400 TB deployment to handle in 2
> weeks, which will be XFS or Ext4 for the meantime).

Best regards,

Kern

------------------------------------------------------------------------------
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
