Hello Howard,

On Thursday 01 July 2010 18:42:37 Howard Thomson wrote:
> Hi Kern & Bacula-developers,
>
> I have been working on changes to Bacula to enable chunked backup for large
> files, such as multi-gigabyte virtual disks [which I have], and possibly
> database files etc.
What does "chunked" backup mean exactly? I am not sure what the high-level
concept is here. Bacula can already back up multi-gigabyte virtual disks, so
obviously you are thinking about something different.

>
> I need to establish how and when the per-chunk hash values are retrieved
> from the database and stored/updated to the database.

It sounds a bit like you are trying to implement some sort of deduplication
code, but I am not sure.

>
> I am starting with backup changes, for obvious reasons, and note that the
> data stream from FD -> SD is a single contiguous stream, albeit transferred
> in record sized pieces.

Actually, it is not a single contiguous stream -- it is lots of packets.
Within those packets are contained multiple streams of data (one at a time).
The protocol is adaptable to pretty much anything.

>
> I was envisaging alternate data-chunk / chunk-hash transfers, but that does
> not fit as easily into the existing code as I had hoped
> [src/stored/append.c and src/filed/backup.c].

Transferring blocks of data (or chunks) should really be no problem. Bacula
currently does essentially that, but it is designed to use very little memory,
so it does not accumulate the whole contents of a file before writing it
out -- it simply writes out what it has as it comes in.

>
> Does the per-chunk hash value info also need to go onto the storage media
> as to the database ?

Sorry, I cannot answer that until I understand what a chunk is and what the
hash code is used for -- i.e. when it is needed and why. If you are talking
about deduplication, it is a very big project which will need a lot of
careful design work before implementing.

>
> If it does, then I could simply accumulate the file-offset/hash-value pairs
> and send them as a separate stream after the data, although that may be
> less than ideal in memory consumption terms.
There are some specific cases where Bacula accumulates things such as hash
codes, but we try to avoid it if at all possible, because as soon as it does
so, it imposes limits on what Bacula can handle.

>
> For restore, the current code is configured such that the SD is unaware of
> the file-offset values for a sparse data stream, which means that the SD
> would be unable to be selective about the data which it sends to the FD,
> which is somewhat link-inefficient.

Yes, we have tried to make the SD know as little as possible about the format
of the data. Its job is to store data on disk or tape, then to restore it and
send it to the FD. It is the Director which tells the SD what data to
retrieve.

>
> Any comments ?
>
> Will you [Kern] be at the Amsterdam meeting at all ?

No, I will not be attending. Arno will be there though.

Best regards,

Kern

>
> Regards,
>
> Howard
>
> --
> "Only two things are infinite, the universe and human stupidity,
> and I'm not sure about the former." -- Albert Einstein

_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel
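For illustration only, here is a minimal Python sketch of the "alternate
data-chunk / chunk-hash" idea raised in the thread, written so that only one
chunk is held in memory at a time and no offset/hash pairs are accumulated.
It is not Bacula code and does not use Bacula's record or stream formats;
the name iter_chunk_records, the record layout, the 64 KiB chunk size and the
choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import io
from typing import BinaryIO, Iterator, Tuple

# Hypothetical chunk size; the thread does not specify one.
CHUNK_SIZE = 64 * 1024

# Hypothetical record layout: (kind, file offset, payload).
Record = Tuple[str, int, bytes]


def iter_chunk_records(src: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[Record]:
    """Yield a data record followed by a hash record for each chunk of src.

    Only the current chunk is kept in memory. Assumes src.read() returns
    full chunks until end of file, as regular buffered files and BytesIO do.
    """
    offset = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        yield ("data", offset, chunk)
        # SHA-256 is an arbitrary choice for this sketch.
        yield ("hash", offset, hashlib.sha256(chunk).digest())
        offset += len(chunk)


if __name__ == "__main__":
    demo = io.BytesIO(b"a" * 10 + b"b" * 10 + b"c" * 5)
    for kind, offset, payload in iter_chunk_records(demo, chunk_size=10):
        shown = payload if kind == "data" else payload.hex()[:16]
        print(kind, offset, shown)
```

Because each hash record follows its data record immediately, a sender built
this way never needs to buffer the whole file or a per-file table of hashes.
Whether such records belong on the storage media, in the catalog, or both is
the design question left open in the thread.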
