> My thoughts on this would be to make the SD-MUX a totally separate
> daemon with perhaps its own DB, and the mux logic be left completely
> out of the Director.

The director has to be involved to some degree to ensure that device
reservations are properly registered (to prevent it from making
conflicting reservations for devices for non-mux jobs). If we're that
far down the road, then having the director tell the sd-mux how to set
up the sessions isn't much further to go. I do agree that the sd-mux
has to be a separate daemon, though; it can borrow a lot of code from
the existing SD and FD.

I think there are several key problems to solve here: 

1) having the database record multiple locations for a file
2) having the sd-mux daemon
3) having the director understand how to use the sd-mux (eg, how to know
when one is needed, and how to instruct it what to do)
4) modifying the restore process to understand multiple copies and
restore from the most preferred one

#1 is (IMHO) the least difficult problem: the last major rev of the
database schema provided the data structure to record multiple
locations. AFAIK, none of the code references anything beyond the first
entry, but the space is there to record things once there is code to do
so. 
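As a rough illustration of what "code that reads past the first entry" might look like, here is a minimal sketch using an assumed, simplified schema (the table and column names are illustrative, not Bacula's actual catalog):

```python
# Sketch (assumed schema): a file may have rows recorded against several
# media/locations; the restore side just has to read all of them, not
# only the first.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE file_location (
    fileid INTEGER, mediaid INTEGER, location TEXT)""")
con.executemany(
    "INSERT INTO file_location VALUES (?, ?, ?)",
    [(1, 10, "onsite-disk"),
     (1, 20, "onsite-tape"),
     (1, 30, "offsite-tape")])

def locations_for(fileid):
    """Return every recorded copy of a file, not just the first row."""
    cur = con.execute(
        "SELECT mediaid, location FROM file_location WHERE fileid = ?",
        (fileid,))
    return cur.fetchall()

copies = locations_for(1)  # three copies, one per location
```

The point is only that the schema already has room for multiple rows per file; the missing piece is code that writes and reads beyond the first one.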

#2 is essentially a meld of an SD and FD, plus a setup connection to the
director. I'd suggest this be a daemon controlled by inetd, triggered by
a connection request from the director to the control session port
(minimizing the number of new static ports needed to one). Inetd would
spin off a copy of the sd-mux for the director, and the director would
then instruct the sd-mux about the number of streams required and which
actual SDs are involved. The director would then go about the usual
device reservation and volume selection process already in place for
normal jobs. Once the actual SDs report ready, the director informs the
real FD of the address and port number of the sd-mux, and backups occur
as normal, with the sd-mux as the target SD for the real FD. The sd-mux
acts like an FD to the real SDs, thus requiring no protocol changes on
the real FD or SDs. The SDs handle media as normal, signaling the
director of volume changes as required. The sd-mux receives data, writes
it to each real SD, and returns status when all the writes complete. At
EOJ, the sd-mux shuts down the sessions to the real SDs, then shuts
down the session to the real FD. Finally, it informs the director of the
EOJ state and exits. 
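The core of the sd-mux data path — receive a block, write it to every real SD, report success only when all writes complete — can be sketched as follows (function and stream names are illustrative, not actual Bacula code):

```python
# Sketch of the sd-mux fan-out write: one incoming block from the real
# FD is written to every real SD stream, and status is reported only
# after all writes are attempted.
import io

def mux_write(block, sd_streams):
    """Write one data block to each real SD; return per-stream status."""
    status = {}
    for name, stream in sd_streams.items():
        try:
            stream.write(block)
            status[name] = "ok"
        except OSError as exc:
            status[name] = f"error: {exc}"
    return status

def all_ok(status):
    """The job may proceed only if every real SD accepted the block."""
    return all(s == "ok" for s in status.values())

# Demo with two simulated SD streams (in-memory buffers stand in for
# the real SD data sessions).
streams = {"sd-a": io.BytesIO(), "sd-b": io.BytesIO()}
status = mux_write(b"data-block-1", streams)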

This would also require some minor updates to the real SD logic to test
for the presence of a file and update its media record rather than
inserting it (if such code doesn't already exist).
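That "test for presence, update rather than insert" pattern is just an upsert; a minimal sketch against an assumed media table (names are illustrative, and the real SD would do this through its catalog update path):

```python
# Sketch of the update-or-insert behavior the real SD would need: if a
# row for this volume already exists, update it; otherwise insert one.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE media (volname TEXT PRIMARY KEY, volbytes INTEGER)")

def record_media(volname, volbytes):
    # SQLite upsert: requires a unique key on volname.
    con.execute(
        "INSERT INTO media (volname, volbytes) VALUES (?, ?) "
        "ON CONFLICT(volname) DO UPDATE SET volbytes = excluded.volbytes",
        (volname, volbytes))

record_media("Vol001", 1000)
record_media("Vol001", 2000)   # second call updates, not duplicates
```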

#3 is somewhat covered in the above description. The sd-mux would need
to know how many streams to prepare (3 is about the practical maximum
based on experience with mainframe apps that do this type of work now),
and the hostname/ip address and port numbers for the real SDs to use for
this job, based on the reservations made by the director. The sd-mux
would also need to know how to abort a job if a session to a real SD
failed during the job. 
The sd-mux would also need to know the range of valid ports on the
sd-mux host (note that the host running the sd-mux may NOT be the same
host running the director, and we should design accordingly), and there
may be a good reason to constrain the available ports on the sd-mux host
for firewall friendliness. 
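Constraining the data ports to a configured range might look like this sketch (the range values and host are illustrative assumptions, not proposed defaults):

```python
# Sketch: pick a data port for the sd-mux from a configured,
# firewall-friendly range on the sd-mux host.
import socket

PORT_RANGE = range(9110, 9120)  # assumed configurable range

def bind_in_range(host="127.0.0.1"):
    """Try each allowed port in order; return (socket, port) or raise."""
    for port in PORT_RANGE:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
            return s, port
        except OSError:
            s.close()
    raise RuntimeError("no free port in the configured range")
```

The director would then be told the chosen port as part of the setup exchange, so only the control port needs a static firewall rule.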

#4 is pretty simple once all the other things are done...8-) Your idea
of a priority in the pool definition is a good one; I'd argue that there
is an implicit method of defining this priority. If the file is
available in a disk pool (or other random-access storage), then we
should prefer to pull the restored file from disk. Media pools in the
same location should have a lower priority, and media with a different
location value an even lower priority. If a volume is marked missing or
unavailable, it should be automatically skipped. 
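That implicit ordering — disk first, then same-location media, then offsite media, skipping missing/unavailable volumes — can be sketched directly (the field names are illustrative, not actual catalog columns):

```python
# Sketch of the implicit restore priority described above. Lower rank
# is preferred; None means the copy must be skipped entirely.

def restore_rank(copy, local_location):
    if copy["status"] in ("missing", "unavailable"):
        return None                  # skip marked volumes
    if copy["media_type"] == "disk":
        return 0                     # random access: most preferred
    if copy["location"] == local_location:
        return 1                     # media in the same location
    return 2                         # media at a different location

def pick_copy(copies, local_location="site-a"):
    """Choose the most preferred usable copy, or None if none remain."""
    ranked = [(restore_rank(c, local_location), c) for c in copies]
    ranked = [(r, c) for r, c in ranked if r is not None]
    return min(ranked, key=lambda rc: rc[0])[1] if ranked else None

copies = [
    {"media_type": "tape", "location": "site-b", "status": "ok"},
    {"media_type": "tape", "location": "site-a", "status": "ok"},
    {"media_type": "disk", "location": "site-a", "status": "missing"},
]
best = pick_copy(copies)  # disk copy is missing, so local tape wins
```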

An alternative method that would require more work, but would ultimately
be better in terms of self-management, would be to measure the response
time of storage daemons over the last 10-20 requests (eg, time from
start of reservation to SD ready) in the director database, and choose
the fastest-responding SD that contains a copy of the file (subject to
the location conditions listed above). This would tend, over time, to
spread the load over multiple SDs at the same site.  
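A sliding-window average over recent reservation-to-ready times is enough to implement this; a sketch (the window size and names are assumptions, and a real version would persist the history in the director database):

```python
# Sketch: choose the fastest-responding SD using a sliding window of
# the last N reservation-to-ready times per SD.
from collections import defaultdict, deque

WINDOW = 20  # assumed: "last 10-20 requests"
_history = defaultdict(lambda: deque(maxlen=WINDOW))

def record_response(sd_name, seconds):
    """Record one reservation-to-ready time for an SD."""
    _history[sd_name].append(seconds)

def fastest_sd(candidates):
    """Among SDs holding a copy, pick the lowest average response time."""
    def avg(sd):
        h = _history[sd]
        return sum(h) / len(h) if h else float("inf")
    return min(candidates, key=avg)

record_response("sd-a", 2.0)
record_response("sd-a", 3.0)
record_response("sd-b", 1.0)
```

Because a slow (busy) SD accumulates worse averages, selection naturally drifts toward the less loaded daemons, which is what spreads the load across SDs at the same site.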

In a more general sense, this kind of approach would also be helpful in
implementing multiple-site migration jobs: an sd-mux could be used to
move files between SDs, if a migration job spun off a daemon copy to
act as a restore FD that immediately turned around and resent the data
to an sd-mux. 
