On Saturday 23 October 2010 19:31:53 Alan Brown wrote:
> I'm running into a few issues with enterprise-scale backups and have a
> wishlist of major/minor tweaks I'd like to see.
>
> We've recently obtained a new 7-drive Neo8000 autochanger (which "works
> great" out of the box), but it's shown up some Bacula issues relating
> to simultaneous jobs and limiting resource hogging on clients, server
> and clustered filesystem arrays.
>
> Wish 1: It would be nice to allow a Pool or Job to have a choice between
> more than "one or all" of the available drives (that's effectively what
> "Prefer Mounted Volumes" gives).
Yes, I can see the utility of that.

> Wish 2: (related) It would be useful to have maximum concurrent job
> limits per pool.

Yes, perhaps. Though with so many Maximum Concurrent Jobs settings, it
becomes a nightmare to understand what is going on.

> Wish 3: It would also be good to have Bacula take notice of the
> per-drive limits on an autochanger.

I wasn't aware that this was a problem. If it is, it is probably a bug.
The main use for the drive limits is for autochangers.

> Right now, if you choose the autochanger as the backup device, its limit
> is the only control, and that can result in one drive running 10 jobs
> (even if the individual drive entry has a lower Maximum Concurrent Jobs
> setting) while others sit idle because the Director is waiting on the
> maximum number of storage jobs.
>
> This can result in (time-consuming) fileserver backups preventing
> desktop backups from taking place, etc.
>
> Wish 4: It would be good to be able to define a group of clients and
> then set a maximum number of concurrent jobs for that group.

This is a rather ambitious project.

> Why: Linux (and most other OS) NFS server code is fundamentally broken
> and unsafe for clustering, as it ignores filesystem locks set by other
> processes. As a result it is unsafe (risk of data corruption) to allow
> clustered filesystems to accept activity from any node OTHER than the
> one acting as the NFS server. (This raises the question of why one
> should bother with clustered filesystems at all; the answer is that
> they are useful in a high-availability environment because the NFS
> service can be transferred to another node in seconds.)
>
> (This same problem also raises a risk of data corruption on any Linux
> system acting as an NFS fileserver, clustered or not!
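For reference, the per-drive limits Wish 3 talks about are normally expressed in bacula-sd.conf by giving each Device resource its own Maximum Concurrent Jobs, with an Autochanger resource tying the drives together. A minimal sketch (the names, device paths, media type and limits here are illustrative placeholders, not Alan's actual setup):

```conf
# bacula-sd.conf -- sketch only; names and paths are hypothetical.

Autochanger {
  Name = Neo8000
  Device = Drive-0, Drive-1          # the individual tape drives below
  Changer Device = /dev/sg10
  Changer Command = "/usr/lib/bacula/scripts/mtx-changer %c %o %S %a %d"
}

Device {
  Name = Drive-0
  Drive Index = 0
  Archive Device = /dev/nst0
  Media Type = LTO-4
  Autochanger = yes
  Maximum Concurrent Jobs = 2        # per-drive limit the Director should honour
}

Device {
  Name = Drive-1
  Drive Index = 1
  Archive Device = /dev/nst1
  Media Type = LTO-4
  Autochanger = yes
  Maximum Concurrent Jobs = 2
}
```

The symptom described above is that the Storage resource's Maximum Concurrent Jobs in bacula-dir.conf ends up being the only limit actually enforced, so jobs can pile onto one drive despite these per-Device settings.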
> The only safe way to simultaneously export multiple protocols is to
> export them from an NFS client and not run any processes on the NFS
> fileserver which directly manipulate the NFS-exported filesystem.)
>
> On top of the above problems, with GFS (and most other clustered
> filesystems) a read/write lock must be propagated across the cluster
> for every file being opened on each cluster node (NFS ignores this!),
> which can drive network load up dramatically during an incremental
> backup, as well as hitting actual backup rates quite hard. One node is
> usually notionally the master for any filesystem, and in general the
> master is decided dynamically by whichever node is making the most
> lock requests.
>
> In order to accommodate the problem I've had to define a virtual
> Bacula client per NFS service. That client follows the filesystem's
> location, but it breaks the restrictions I was previously able to
> enforce using per-client job limits.
>
> This is a major problem, because most of our filesystems live on a
> couple of 40 TB nearline storage arrays. As the number of simultaneous
> backups increases, their performance falls away rapidly.
>
> The current situation allows backups to badly affect NFS server
> performance - which users have noticed and are complaining loudly
> about. I really need to restrict the number of simultaneous backups
> coming out of any given array, and the only way this seems feasible is
> to be able to group clients and then impose a simultaneous job limit
> across them.
>
> Wish 5: Better optimisation/caching of directory lists (is this
> possible?)

Not much chance this will be done, as it is an OS filesystem problem,
not Bacula's ...

> Most of us are aware that the more entries there are in a directory,
> the slower it is to load.
>
> Users are not aware of this - and they resent being told to keep
> things in hierarchical layouts instead of one large flat space.
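A virtual client per NFS service of the kind described above is usually just a Client resource in bacula-dir.conf whose Address is the floating HA service address rather than a physical node, so the definition follows the filesystem wherever the cluster moves it. A rough sketch, with made-up names, address and password, and a per-client limit standing in for the per-group limit that Wish 4 asks for:

```conf
# bacula-dir.conf -- sketch only; name, address and password are
# hypothetical placeholders.

Client {
  Name = nfs-home-fd                # one virtual client per NFS service
  Address = nfs-home.example.org    # floating service address, not a node
  FDPort = 9102
  Catalog = MyCatalog
  Password = "changeme"
  Maximum Concurrent Jobs = 1       # per-client only; cannot span a group
}
```

The limitation behind Wish 4 is visible here: Maximum Concurrent Jobs is enforced per Client, so several such virtual clients backed by the same 40 TB array can still run at the same time.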
> GFS and GFS2 behave incredibly badly if there are a lot of files in a
> directory. I've seen them take 5-6 minutes to open a directory with
> 10,000 entries, and up to 30 minutes to open a directory with 100,000
> files in it (this not only affects the process concerned, it also
> causes the entire filesystem to slow down for all users). When Bacula
> hits a directory with a lot of files in it on an incremental backup,
> things get even slower. :-(
>
> Feedback and ideas are welcome. Telling me not to use GFS doesn't tell
> me anything I don't already know; however, I'm stuck with it for the
> moment and am avoiding any more cluster deployments until Red Hat
> makes it work properly (I have a 400 TB deployment to handle in 2
> weeks, which will be XFS or Ext4 for the meantime).

Best regards,

Kern

------------------------------------------------------------------------------
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
