Re: [Bacula-users] [Bacula-devel] Project for strong incremental backup assurance

Kern Sibbald Wed, 06 Jun 2007 02:57:45 -0700

On Wednesday 06 June 2007 11:28, Andre Noll wrote:
> On 10:40, Kern Sibbald wrote:
> 
> > > So I think it would be _really_ nice to store information about
> > > deleted files and directories in the database which would make it
> > > possible to get rid of all deleted files and directories automatically
> > > during restore.
> > > 
> > > The dar backup tool for example has this feature. Are there any
> > > plans to include such a feature also in bacula?
> > 
> > Yes, but no one is currently working on it.  There have been a number of 
> > emails on this subject on the bacula-users' list recently.
> 
> In February you said Robert will be working on this project. Do you
> have any pointers to this work? I would be interested to look at the
> strategy for implementing this and at the work that has been done so
> far, if any.


Robert quit the project, so currently there is no one assigned to it.

I would be *extremely* happy to see someone interested in this project.  If 
you are offering help with algorithms, please see Algorithms below.  If your 
offer includes programming (i.e. you are a C or C++ programmer) and you are 
interested in working on it, please let me know (either off list or, if you 
wish, copying the bacula-devel list) and we can discuss the project.  I 
recommend starting by reading the Developer notes in the Developer's Guide 
on the web site.  It will give you a broad overview of developing for the 
Bacula project.

This project and the project to store only one copy of a file (Base project) 
are closely related because they both require *much* more communication 
between the Dir and the FD -- essentially the Dir must send the current state 
as known in the catalog to the client, which can then determine which files 
to back up.

Algorithms:
This requires potentially sending a *lot* of data (i.e. millions of filenames 
and attribute data), which will require hash coding the names for performance 
reasons.  If we want to handle up to 20 million filenames as we are starting 
to see on some systems, we will probably at some point need a good file 
paging algorithm.
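
To make the idea concrete, here is a minimal sketch of how a client could 
compare the catalog's file list against what it finds on disk.  It is not 
the code in src/lib/htable.c; the names (struct table, table_add, 
table_mark_seen, table_count_deleted) and the choice of FNV-1a as the hash 
function are assumptions made purely for illustration.  Each name received 
from the catalog goes into a hash table; every file found on disk is looked 
up and marked; a lookup miss means a new file, and any entry still unmarked 
at the end has been deleted on the client.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical catalog entry: one file name known to the Director. */
struct entry {
    const char *name;   /* not owned by the table */
    int seen;           /* set once the client finds the file on disk */
    struct entry *next; /* chain for names that share a bucket */
};

struct table {
    struct entry **buckets;
    size_t nbuckets;
};

/* 32-bit FNV-1a, used here only as a stand-in hash function. */
uint32_t hash_name(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

struct table *table_new(size_t nbuckets)
{
    assert(nbuckets > 0);
    struct table *t = malloc(sizeof *t);
    assert(t != NULL);
    t->buckets = calloc(nbuckets, sizeof *t->buckets);
    assert(t->buckets != NULL);
    t->nbuckets = nbuckets;
    return t;
}

/* Insert a name sent from the catalog; it starts out unseen. */
void table_add(struct table *t, const char *name)
{
    size_t i = hash_name(name) % t->nbuckets;
    struct entry *e = malloc(sizeof *e);
    assert(e != NULL);
    e->name = name;
    e->seen = 0;
    e->next = t->buckets[i];
    t->buckets[i] = e;
}

/* Returns 1 if name is in the catalog (and marks it), 0 if it is new. */
int table_mark_seen(struct table *t, const char *name)
{
    struct entry *e = t->buckets[hash_name(name) % t->nbuckets];
    for (; e != NULL; e = e->next) {
        if (strcmp(e->name, name) == 0) {
            e->seen = 1;
            return 1;
        }
    }
    return 0;
}

/* Counts entries never marked: files that no longer exist on the client. */
size_t table_count_deleted(const struct table *t)
{
    size_t n = 0;
    for (size_t i = 0; i < t->nbuckets; i++)
        for (const struct entry *e = t->buckets[i]; e != NULL; e = e->next)
            if (!e->seen)
                n++;
    return n;
}
```

With 20 million names the table itself is what may no longer fit in memory, 
which is where the paging question comes in; this sketch keeps everything in 
RAM and ignores attribute data entirely.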

Some years ago, I wrote hashing routines specifically for this, but they have 
never been used, and so I am now looking at bringing them up to date -- 
in particular adding a Bloom filter to improve performance (I am currently 
researching Bloom filters).  Where I could use a bit of advice is:

Now:
- Reviewing my hash table code (particularly the hash function) in 
  src/lib/htable.h and src/lib/htable.c
- Proposing how to size a Bloom filter (n bits) and how many hash functions 
  to use.
- Proposing what hash functions to use for the Bloom filter.
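
As one possible starting point for the sizing and hash-function questions, 
the sketch below uses the standard Bloom filter results: for m bits and n 
items the number of hash functions that minimizes false positives is about 
k = (m/n) * ln 2, and the false-positive rate is then roughly 
0.6185^(m/n).  At 10 bits per item that gives k = 7 and a rate a little 
under 1%; for 20 million filenames it means 200 million bits, i.e. about 
25 MB.  Everything in the code is an assumption for illustration, not 
existing Bacula code: the struct and function names are hypothetical, 
64-bit FNV-1a is only a stand-in hash, and the k probe positions are derived 
from that single hash by double hashing (h1 + i*h2 mod m) rather than from 
k independent hash functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct bloom {
    uint8_t *bits;   /* bit array holding nbits bits */
    uint64_t nbits;  /* m: total number of bits */
    unsigned nhash;  /* k: probes per item */
};

/* 64-bit FNV-1a; a stand-in, not a recommendation made in this thread. */
uint64_t fnv1a64(const char *s)
{
    uint64_t h = 14695981039346656037ULL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Size the filter for n expected items; k = round(bits_per_item * ln 2). */
struct bloom *bloom_new(uint64_t n, unsigned bits_per_item)
{
    assert(n > 0 && bits_per_item > 0);
    struct bloom *b = malloc(sizeof *b);
    assert(b != NULL);
    b->nbits = n * bits_per_item;
    b->nhash = (unsigned)(bits_per_item * 0.693147 + 0.5);
    if (b->nhash == 0)
        b->nhash = 1;
    b->bits = calloc((size_t)((b->nbits + 7) / 8), 1);
    assert(b->bits != NULL);
    return b;
}

/* i-th probe position by double hashing: (h1 + i*h2) mod m. */
static uint64_t probe(const struct bloom *b, uint64_t h, unsigned i)
{
    uint64_t h1 = h & 0xffffffffu;
    uint64_t h2 = (h >> 32) | 1;  /* forced odd, so never zero */
    return (h1 + (uint64_t)i * h2) % b->nbits;
}

void bloom_add(struct bloom *b, const char *s)
{
    uint64_t h = fnv1a64(s);
    for (unsigned i = 0; i < b->nhash; i++) {
        uint64_t pos = probe(b, h, i);
        b->bits[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }
}

/* 0 means definitely absent; 1 means probably present. */
int bloom_maybe_contains(const struct bloom *b, const char *s)
{
    uint64_t h = fnv1a64(s);
    for (unsigned i = 0; i < b->nhash; i++) {
        uint64_t pos = probe(b, h, i);
        if (!(b->bits[pos / 8] & (1u << (pos % 8))))
            return 0;
    }
    return 1;
}
```

The filter would sit in front of the hash table: a "definitely absent" 
answer identifies a new file without touching the table at all, and only 
"probably present" answers need the full lookup.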

Later:
- Reviewing the overall strategy.

Since these two projects (de-duplication of files, tracking new and deleted 
files) are quite hot topics lately, over the next week, I will write up a 
sort of proposal for implementation outlining my general ideas for how to 
implement them within the existing Bacula framework (i.e. without too many 
modifications to the database, ...).

Thanks for your interest in this.

Best regards,

Kern

