On Tuesday 28 August 2007 10:24:03 Kern Sibbald, you wrote:
> On Sunday 26 August 2007 09:17, Marc Cousin wrote:
> > On Sunday 26 August 2007 07:43:25 Kern Sibbald wrote:
> > > Hello Marc,
> > >
> > > I don't yet understand the details of how you use these tables, but it
> > > seems to me that we could modify the Bacula table structure as follows:
> > >
> > > - Split the File table into two new tables:
> > >   - Files containing only file entries
> > >   - Dirs containing the equivalent of a current File entry for
> > >     each directory, except that it would have a pointer to the
> > >     parent Path entry, and possibly a pointer to the parent Dirs
> > >     entry (however that entry may not exist in each backup).
> > >     If necessary we could add the visibility flag -- I don't see
> > >     its use yet.
> >
> > For the visibility flag, it may not be that easy: a directory may be
> > visible even if it's not in a backup. For instance, /home should be
> > displayed if /home/marc is backed up, so we add an entry for /home in
> > pathvisibility for the job where /home/marc is backed up.
>
> Isn't the visibility rather easily deduced from the first path in the
> backup?

I'm not sure I understand what you mean. Do you mean: if I have backed up
/home/marc, I know I must display /home and / ? Or that, since I have decided
to back up /home/marc, the first entry for the associated jobid will be
/home/marc, and I can very easily deduce from this that /home and / are
visible too?

That works as long as your job has only one such directory. If you have, for
instance, /home/marc and /usr, you won't be able to find /usr that easily.
I also feel that it is too implementation-dependent: if it were decided to
save things in a different order, or to insert records in a different order
(I don't know why, it's just for the sake of giving an example), you might
end up breaking the display algorithm.

I'm documenting below the way we build the visibility and hierarchy tables
(Eric tells me he has problems with his home computer, so it would take him
time...)

Here we go...

First, two things:

- pathhierarchy is just a link between ppathid and pathid. There is no jobid
  information in it. So it really is a list of the subdirectories that may be
  found in a directory, for ALL servers mixed. The idea is that most of the
  time, either a directory exists on several servers and will contain almost
  the same thing, or the directory will be on only one server or a few, so
  there is no point in storing a jobid, a client id, or anything like that.
  We save a lot of space that way (the first try was with jobids in that
  table, and it got very big very fast).
- pathvisibility just tells us: this pathid should be displayed for this
  jobid.

The way we do it right now in brestore is: we get all 'directly' visible
dirs from the File table, insert them into pathvisibility, build the missing
parts of the hierarchy, and then iteratively insert the missing parent
directories into pathvisibility with a query. The reasoning is that if a
directory is visible, its parent is visible. For a jobid, we insert all
parent directories of a directory with a query.
We call this query iteratively until it reports that it has inserted nothing
(usually that means calling it 4 or 5 times...).

All this is done for a given, missing, jobid (we do them one by one).

The first query is:

    INSERT INTO brestore_pathvisibility (PathId, JobId)
      (SELECT DISTINCT PathId, JobId FROM File WHERE JobId = $job)

This one gives us all the 'directly' visible paths from the File table: a
path is visible as long as the path itself, or one of its files, is in the
backup.

Then we do this:

    SELECT brestore_pathvisibility.PathId, Path
    FROM brestore_pathvisibility
    JOIN Path ON (brestore_pathvisibility.PathId = Path.PathId)
    LEFT JOIN brestore_pathhierarchy
      ON (brestore_pathvisibility.PathId = brestore_pathhierarchy.PathId)
    WHERE brestore_pathvisibility.JobId = $job
      AND brestore_pathhierarchy.PathId IS NULL
    ORDER BY Path

This one means: for a given jobid, give me all PathId/Path pairs which are
visible but do not yet have an entry (a leaf) in pathhierarchy. The purpose
is just to save us as many individual selects as possible in the function
that builds the missing parts of the hierarchy.

For each of these paths, we:
- add its parent to pathhierarchy if it's missing (we've done a bit of
  caching in brestore to speed that up, in case we ask for the creation of a
  branch of the hierarchy that has already been built in this run).
- take the parent of this path (strip its last component) and loop, until
  the path is ''

When we've done this, we have built the missing entries in pathhierarchy.
It is slow the first time, as there is a lot to build. After a few runs, it
is extremely fast, as there is almost nothing to do: most of the hierarchy
has already been learnt from other jobids. And knowing that a leaf exists in
pathhierarchy means that everything above it in the tree is built, all the
way to the root dir (because we use transactions).
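
To make the hierarchy-building loop above more concrete, here is a minimal Python sketch. It is not the brestore code (brestore is written in Perl and works against the catalog): plain dicts stand in for the Path and brestore_pathhierarchy tables, and the names parent_dir, build_hierarchy, path_ids and hierarchy are made up for the illustration. It also assumes that directory paths end with '/' and that the parent of '/' is the empty path ''.

```python
def parent_dir(path):
    """Return the parent of a directory path, e.g. '/home/marc/' -> '/home/'.

    Assumption: directory paths end with '/', and the parent of '/' is ''.
    """
    stripped = path.rstrip('/')
    if stripped == '':
        return ''
    return stripped[:stripped.rfind('/') + 1]


def build_hierarchy(paths, path_ids, hierarchy):
    """Add the missing child -> parent links for every path in `paths`.

    path_ids:  dict path -> PathId (hypothetical stand-in for the Path table);
               ids for parents that are not known yet are allocated on demand.
    hierarchy: dict PathId -> PPathId (hypothetical stand-in for
               brestore_pathhierarchy); it also plays the role of the cache.
    """
    def get_id(path):
        if path not in path_ids:
            path_ids[path] = max(path_ids.values(), default=0) + 1
        return path_ids[path]

    for path in paths:
        while path != '':
            pid = get_id(path)
            if pid in hierarchy:
                # This entry exists, so everything above it is already built.
                break
            parent = parent_dir(path)
            hierarchy[pid] = get_id(parent)
            path = parent


# Two directly visible directories, nothing in the hierarchy yet.
path_ids = {'/home/marc/': 1, '/usr/': 2}
hierarchy = {}
build_hierarchy(['/home/marc/', '/usr/'], path_ids, hierarchy)
```

The second path stops as soon as it reaches '/', because that link was already created while walking up from /home/marc/. This is the shortcut that makes later runs fast.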

Now we run this:

    INSERT INTO brestore_pathvisibility (PathId, JobId) (
      SELECT a.PathId, $job
      FROM (SELECT DISTINCT h.PPathId AS PathId
            FROM brestore_pathhierarchy AS h
            JOIN brestore_pathvisibility AS p ON (h.PathId = p.PathId)
            WHERE p.JobId = $job) AS a
      LEFT JOIN (SELECT PathId
                 FROM brestore_pathvisibility
                 WHERE JobId = $job) AS b
        ON (a.PathId = b.PathId)
      WHERE b.PathId IS NULL)

This query inserts into the pathvisibility table, for the given jobid, all
parent directories that are not already marked as visible.

We run this query until it tells us that 0 rows were affected (no visibility
missing). It usually goes very fast (2 to 5 iterations).

When we get here, the job's (finally) done.

I hope I've explained this well enough.

To give an idea of the volumes, here are the respective sizes of the main
tables in our postgres database (indexes included):

 table | public.file                    | 44 GB
 table | public.path                    | 533 MB
 table | public.filename                | 1660 MB
 table | public.brestore_pathvisibility | 1678 MB
 table | public.brestore_pathhierarchy  | 178 MB

So the 2 brestore tables aren't very small, but they are almost negligible
compared to the file table (about a 5% increase in size for the database).

One last thing: we also have a brestore_knownjobid table, which lets us know
which jobids have already been calculated in our 2 tables. When the client
requests an unknown jobid, we add it 'on the fly', and at client startup we
check that the known jobids haven't been purged from the real Bacula tables.
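
As an illustration of the "run the query until 0 rows are affected" loop described above, here is a small self-contained Python sketch. It uses an in-memory SQLite database as a stand-in for our PostgreSQL catalog, so the table definitions, the sample ids and the names propagate_visibility and demo are assumptions made for the example, not brestore code. The outer parentheses around the SELECT are dropped for SQLite.

```python
import sqlite3

# Same logic as the query above, with :job as a bound parameter.
PROPAGATE = """INSERT INTO brestore_pathvisibility (PathId, JobId)
SELECT a.PathId, :job
FROM (SELECT DISTINCT h.PPathId AS PathId
      FROM brestore_pathhierarchy AS h
      JOIN brestore_pathvisibility AS p ON h.PathId = p.PathId
      WHERE p.JobId = :job) AS a
LEFT JOIN (SELECT PathId
           FROM brestore_pathvisibility
           WHERE JobId = :job) AS b
  ON a.PathId = b.PathId
WHERE b.PathId IS NULL"""


def propagate_visibility(conn, job):
    """Run the parent-insertion query until it inserts nothing.

    Returns the number of times the query was executed.
    """
    passes = 0
    while True:
        passes += 1
        cur = conn.execute(PROPAGATE, {"job": job})
        if cur.rowcount == 0:
            return passes


def demo():
    conn = sqlite3.connect(":memory:")
    # Simplified, assumed schema: only the columns the query needs.
    conn.executescript("""
        CREATE TABLE brestore_pathhierarchy (
            PathId INTEGER PRIMARY KEY, PPathId INTEGER NOT NULL);
        CREATE TABLE brestore_pathvisibility (
            PathId INTEGER, JobId INTEGER, PRIMARY KEY (JobId, PathId));
    """)
    # Made-up ids: 1 = /home/marc/, 2 = /usr/, 3 = /home/, 4 = /
    conn.executemany("INSERT INTO brestore_pathhierarchy VALUES (?, ?)",
                     [(1, 3), (3, 4), (2, 4)])
    # Job 7 backed up /home/marc/ and /usr/ directly.
    conn.executemany("INSERT INTO brestore_pathvisibility VALUES (?, ?)",
                     [(1, 7), (2, 7)])
    passes = propagate_visibility(conn, 7)
    visible = sorted(row[0] for row in conn.execute(
        "SELECT PathId FROM brestore_pathvisibility WHERE JobId = 7"))
    return passes, visible
```

In this toy data set the first pass adds /home/ and /, and the second pass finds nothing left to add, so the loop stops. Each pass only climbs one level from what is already visible, which is why the number of iterations follows the depth of the tree rather than the number of paths.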