> My thoughts on this would be to make the SD-MUX a totally separate
> daemon with perhaps its own DB. And the mux logic be left completely
> out of the Director.
The director has to be involved to some degree to ensure that device reservations are properly registered (to prevent it from making conflicting reservations for devices for non-mux jobs). If we're that far down the road, then having the director tell the sd-mux how to set up the sessions isn't much further to go. I do agree that the sd-mux has to be a separate daemon, though -- it can borrow a lot of code from the existing SD and FD.

I think there are several key problems to solve here:

1) having the database record multiple locations for a file
2) having the sd-mux daemon
3) having the director understand how to use the sd-mux (e.g., how to know when one is needed, and how to instruct it what to do)
4) modifying the restore process to understand multiple copies and restore from the most preferred one

#1 is (IMHO) the least difficult problem: the last major rev of the database schema provided the data structure to record multiple locations. AFAIK, none of the code references anything beyond the first entry, but the space is there to record things once there is code to do so.

#2 is essentially a meld of an SD and an FD, plus a setup connection to the director. I'd suggest this be a daemon controlled by inetd, triggered by a connection request from the director to the control session port (minimizing the number of static ports needed to one new port). Inetd would spin off a copy of the sd-mux for the director. The director would then instruct the sd-mux about the number of streams required and which actual SDs are involved, and would go about the usual device reservation and volume selection process already in place for normal jobs. Once the actual SDs report ready, the director informs the real FD of the address and port number of the sd-mux, and backups occur as normal, with the sd-mux as the target SD for the real FD. The sd-mux acts like an FD to the real SDs, thus requiring no protocol changes on the real FD or SDs.
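To make the setup step concrete, here is a minimal sketch of what the director would have to hand the sd-mux over the inetd-spawned control session. None of these field names or the JSON framing exist in Bacula today; they are assumptions purely for illustration.

```python
# Hypothetical director -> sd-mux setup message. The field names
# ("streams", "sds", "fd_listen_port") are invented for this sketch.
import json

def build_mux_setup(job_id, real_sds, data_port):
    """Everything the sd-mux needs before the real FD connects:
    stream count, the reserved real SDs, and the data port the
    real FD will be told to use."""
    return json.dumps({
        "job": job_id,
        "streams": len(real_sds),      # one stream per real SD
        "sds": real_sds,               # (host, port) per reservation
        "fd_listen_port": data_port,   # where the real FD connects
    })

msg = build_mux_setup(
    "example-job-1",
    [("sd1.example.org", 9103), ("sd2.example.org", 9103)],
    9104,
)
setup = json.loads(msg)
```

The point of the sketch is only that the director already knows all of this at reservation time, so no new negotiation is needed on the data path.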
The SDs handle media as normal, signaling the director to notify it of volume changes as required. The sd-mux receives data, writes it to each real SD, and returns status when all the writes complete. At EOJ, the sd-mux handles the shutdown of the sessions to the real SDs, and then shuts down the session to the real FD. It then informs the director of the EOJ state, and exits. This would also require some minor updates to the real SD logic to test for the presence of a file and update its media record rather than inserting it (if such code doesn't already exist now).

#3 is somewhat covered in the above description. The sd-mux would need to know how many streams to prepare (3 is about the practical maximum, based on experience with mainframe apps that do this type of work now), and the hostname/IP address and port numbers of the real SDs to use for this job, based on the reservations made by the director. The sd-mux would also need to know how to abort a job if a session to a real SD failed during the job. It would also need to know the range of ports valid on the sd-mux host (note that the host running the sd-mux may NOT be the same host running the director, and we should design accordingly), and there may be a good reason to constrain the available ports on the sd-mux host for firewall-friendliness reasons.

#4 is pretty simple once all the other things are done... 8-)

Your idea of a priority in the pool definition is a good one; I'd argue that there is an implicit method of defining this priority. If the file is available in a disk pool (or other random-access storage), then we should prefer to pull the restored file from disk. Media pools in the same location should have a lower priority, and media with a different location value should have an even lower priority. If a volume is marked missing or unavailable, it should be automatically skipped.
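The implicit priority above can be sketched as a simple ranking function. The Copy records and location values here are assumptions for illustration, not Bacula structures.

```python
# Illustrative restore-copy ranking: skip missing/unavailable volumes,
# prefer random-access (disk) copies, then media at the same location,
# then off-site media. All field names are invented for this sketch.

def restore_rank(copy, local_location):
    """Return a priority tier (lower is better), or None if the
    volume must be skipped entirely."""
    if copy["volstatus"] in ("Missing", "Unavailable"):
        return None                          # never considered
    if copy["random_access"]:
        return 0                             # disk pool: best choice
    if copy["location"] == local_location:
        return 1                             # media at the same site
    return 2                                 # media at a different site

def pick_copy(copies, local_location):
    """Choose the most preferred usable copy, or None if none remain."""
    usable = [(restore_rank(c, local_location), c) for c in copies]
    usable = [(r, c) for r, c in usable if r is not None]
    return min(usable, key=lambda rc: rc[0])[1] if usable else None

copies = [
    {"vol": "OFF-01",  "volstatus": "Full",    "random_access": False, "location": "dr-site"},
    {"vol": "DISK-01", "volstatus": "Missing", "random_access": True,  "location": "main"},
    {"vol": "TAPE-07", "volstatus": "Full",    "random_access": False, "location": "main"},
]
best = pick_copy(copies, "main")  # TAPE-07: the disk copy is marked missing
```

Note that the skip rule runs first, so a missing disk copy never shadows a usable tape at the same site.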
An alternative method that would require more work, but would ultimately be better in terms of self-management, would be to have the director measure the response time of storage daemons over the last 10-20 requests (e.g., time from start of reservation to SD ready), record it in the director database, and choose the fastest-responding SD that contains a copy of the file (subject to the location conditions listed above). This would tend, over time, to spread the load over multiple SDs at the same site. In a more general sense, this kind of approach would also be helpful in implementing multi-site migration jobs (an sd-mux could be used to move files between SDs, if a migration job spun off a daemon copy to act as a restore FD that immediately turned around and resent the data to an sd-mux).
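The self-tuning selection could look something like the following. This keeps the rolling window in memory for brevity; the post proposes storing it in the director database. Names and the window size are illustrative.

```python
# Sketch of response-time-based SD selection: keep the last N
# reservation-to-ready times per SD and pick the fastest candidate.
from collections import defaultdict, deque

WINDOW = 20  # "last 10-20 requests" from the discussion above

class SdTimings:
    def __init__(self):
        # deque(maxlen=WINDOW) silently drops the oldest sample
        self.samples = defaultdict(lambda: deque(maxlen=WINDOW))

    def record(self, sd, seconds):
        """Record one reservation-start-to-SD-ready duration."""
        self.samples[sd].append(seconds)

    def fastest(self, candidates):
        """Among SDs that hold a copy (and pass the location rules),
        pick the one with the lowest average response time; SDs with
        no history sort last."""
        def avg(sd):
            s = self.samples[sd]
            return sum(s) / len(s) if s else float("inf")
        return min(candidates, key=avg)

t = SdTimings()
for sec in (4.0, 5.0, 6.0):
    t.record("sd-east", sec)
for sec in (2.0, 9.0, 2.5):
    t.record("sd-west", sec)
choice = t.fastest(["sd-east", "sd-west"])  # sd-west: 4.5s avg vs 5.0s
```

Because the window is bounded, a temporarily slow SD recovers its ranking as new, faster samples push the old ones out, which is what produces the load-spreading effect described above.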