Re: [Bacula-devel] Query changes in the catalog browser and indexes

Cousin Marc Tue, 28 Aug 2007 02:59:53 -0700

On Tuesday 28 August 2007 10:24:03, Kern Sibbald wrote:
> On Sunday 26 August 2007 09:17, Marc Cousin wrote:
> > On Sunday 26 August 2007 07:43:25 Kern Sibbald wrote:
> > > Hello Marc,
> > >
> > > I don't yet understand the details of how you use these tables, but it
> > > seems to me that we could modify the Bacula table structure as follows:
> > >
> > > - Split the File table into two new tables:
> > >     - Files containing only file entries
> > >     - Dirs containing the equivalent of a current File entry for
> > >         each directory, except that it would have a pointer to the
> > >         parent Path entry, and possibly a pointer to the parent Dirs
> > >         entry (however that entry may not exist in each backup).
> > >         If necessary we could add the visibility flag -- I don't see
> > > its use yet.
> >
> > For the visibility flag, it may not be that easy: a directory may be
> > visible even if it's not in a backup. For instance, /home should be
> > displayed if /home/marc is backed up, so we add an entry for /home in
> > pathvisibility for the job where /home/marc is backed up.
>
> Isn't the visibility rather easily deduced from the first path in the
> backup?


I'm not sure I understand what you mean.
Do you mean that if I have backed up /home/marc, I know I must display /home
and / ?
Or that, since I have decided to back up /home/marc, the first entry for the
associated jobid will be /home/marc, and I can very easily deduce from this
that /home and / are visible too?

That works as long as your job has only one such directory. If you have,
for instance, /home/marc and /usr, you won't be able to find /usr that easily.
I also feel that it's too implementation-dependent: if it were decided to
save things in a different order, or to insert records in a different order (I
don't know why, it's just for the sake of giving an example), you might end up
breaking the display algorithm.
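
To illustrate the point about ordering, here is a minimal Python sketch (not
brestore code) that derives the directories to display from every backed-up
path, rather than from the first one only. The function names `ancestors` and
`visible_dirs` are hypothetical, and the sketch assumes directory paths are
stored with a trailing '/'.

```python
def ancestors(path):
    """Return the path and all its parents, root first.

    Assumes Unix-style directory paths that end with '/'.
    """
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) + "/"
                    for i in range(1, len(parts) + 1)]


def visible_dirs(backed_up_paths):
    """Union of the ancestors of every backed-up path, in any order."""
    visible = set()
    for path in backed_up_paths:
        visible.update(ancestors(path))
    return visible
```

Because the result is a union over all paths, it does not depend on which
record happens to come first for the job.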

I'm documenting the way we build the visibility and hierarchy tables below 
(Eric tells me he has problems with his home computer, so it would take him 
time...)

Here we go...


First, two things:
- pathhierarchy is just a link between ppathid and pathid. There is no jobid
information in it, so it really is a list of the subdirectories that may be
found in a directory, for ALL servers mixed. The idea behind this is that,
most of the time, either a directory exists on several servers and will
contain almost the same thing, or the directory will exist on only one or
a few servers, so there is no point in storing a jobid, a client id, or
anything like that. We save a lot of space that way (the first try was with
jobids in that table, and it got very big very fast).
- pathvisibility just tells us: this pathid should be displayed for this
jobid.
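
As a rough sketch of how the two tables fit together, the Python/sqlite3
snippet below creates them and lists the subdirectories to display for one
job. Only the table and column names come from the queries in this mail; the
column types, the keys, the `children` helper and its browsing query are
assumptions made for illustration.

```python
import sqlite3

# Column types and keys are assumptions, not the real brestore schema.
SCHEMA = """
CREATE TABLE brestore_pathhierarchy (
    PathId  INTEGER PRIMARY KEY,   -- a directory
    PPathId INTEGER NOT NULL       -- its parent directory (no JobId here)
);
CREATE TABLE brestore_pathvisibility (
    PathId INTEGER NOT NULL,       -- a directory to display...
    JobId  INTEGER NOT NULL,       -- ...for this job
    PRIMARY KEY (JobId, PathId)
);
"""


def make_catalog():
    """Create an in-memory database holding the two tables."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def children(db, jobid, ppathid):
    """PathIds of the subdirectories of ppathid to display for jobid."""
    rows = db.execute(
        """SELECT h.PathId
           FROM brestore_pathhierarchy AS h
           JOIN brestore_pathvisibility AS v ON (v.PathId = h.PathId)
           WHERE h.PPathId = ? AND v.JobId = ?
           ORDER BY h.PathId""",
        (ppathid, jobid),
    )
    return [row[0] for row in rows]
```

The hierarchy rows are shared by all jobs; only the visibility rows are
per-job, which is where the space saving comes from.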


The way we do it right now in brestore is that we get all 'directly' visible
dirs from the File table, insert them into pathvisibility, build the missing
parts of the hierarchy, and then iteratively insert the missing parent
directories into pathvisibility with a query. The reasoning is that if a
directory is visible, its parent is visible. For a jobid, we insert the parent
directories of the visible directories with a query. We call this query
iteratively until it reports that it has inserted nothing (usually this means
calling it 4 or 5 times...).

All of this is done for a given, missing jobid (we do them one by one):

The first query is :
INSERT INTO brestore_pathvisibility (PathId, JobId)
      (SELECT DISTINCT PathId, JobId FROM File WHERE JobId = $job)
This one gives us all the 'directly' visible paths from the File table: a path
is visible as long as it is itself in the backup, or one of its files is...

Then we do this :
SELECT brestore_pathvisibility.PathId, Path 
FROM brestore_pathvisibility 
JOIN Path 
        ON( brestore_pathvisibility.PathId = Path.PathId)
LEFT JOIN brestore_pathhierarchy 
        ON (brestore_pathvisibility.PathId = brestore_pathhierarchy.PathId)
WHERE brestore_pathvisibility.JobId = $job
AND brestore_pathhierarchy.PathId IS NULL
ORDER BY Path

This one means: for a given jobid, give me every PathId/Path that is visible
but has no entry (is not a leaf) in pathhierarchy yet. The purpose is just to
save us as many individual selects as possible in the function that builds the
missing parts of the hierarchy.

For each of these paths, we
- add its parent to pathhierarchy if it's missing (we've done a bit of
caching in brestore to speed that up, in case we ask for the creation of a
branch of the hierarchy that has already been built in this run).
- move up to the parent of this path (strip its last component) and loop,
until the path is ''

When we've done this, we have built the missing entries in pathhierarchy. It is
slow the first time, as there is a lot to build. After a few runs, it is
extremely fast, as there is almost nothing to do: most of the hierarchy has
already been learnt from other jobids. And knowing that a leaf exists in
pathhierarchy means that everything is built in the tree all the way to the
root dir (because we use transactions).
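
The loop described above could look roughly like the following Python sketch.
It is not the brestore code: plain dicts stand in for the Path and
pathhierarchy tables, `parent_dir` and `build_hierarchy` are hypothetical
names, paths are assumed to end with '/', and allocating a new PathId for a
parent missing from Path is an assumption.

```python
def parent_dir(path):
    """Parent of a directory path, or '' once we are above the root."""
    stripped = path.rstrip("/")
    if not stripped:                      # path was '/'
        return ""
    return stripped[: stripped.rfind("/") + 1]


def build_hierarchy(paths, path_ids, hierarchy):
    """Add the missing child -> parent links for each path.

    path_ids:  dict path -> PathId (stand-in for the Path table)
    hierarchy: dict PathId -> PPathId (stand-in for pathhierarchy)
    """
    def get_id(path):
        # Assumption: unknown parent paths get a fresh PathId.
        if path not in path_ids:
            path_ids[path] = max(path_ids.values(), default=0) + 1
        return path_ids[path]

    for path in paths:
        while path:
            pathid = get_id(path)
            if pathid in hierarchy:
                break                     # branch already built up to the root
            parent = parent_dir(path)
            if not parent:
                break                     # the root has no parent to record
            hierarchy[pathid] = get_id(parent)
            path = parent
```

Stopping as soon as an existing entry is found relies on the property stated
above: if a leaf is present, the rest of its branch is present too.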

Now we run this :

INSERT INTO brestore_pathvisibility (PathId, JobId) (
SELECT a.PathId,$job
FROM
          (SELECT DISTINCT h.PPathId AS PathId
          FROM brestore_pathhierarchy AS h
          JOIN  brestore_pathvisibility AS p ON (h.PathId=p.PathId)
          WHERE p.JobId=$job) AS a
          LEFT JOIN
          (SELECT PathId
          FROM brestore_pathvisibility
          WHERE JobId=$job) AS b
          ON (a.PathId = b.PathId)
WHERE b.PathId IS NULL)

This query inserts into the pathvisibility table, for the given jobid, all
parent directories that are not already marked as visible. We run this query
until it tells us that 0 rows were affected (no visibility missing). It
usually goes very fast (2 to 5 iterations).
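
The "run it until 0 rows are affected" loop can be sketched in Python with
sqlite3 as below. This is an illustration under assumptions, not the brestore
code: the query is the one above with `$job` turned into bound parameters and
the outer parentheses dropped, and `propagate_visibility` is a hypothetical
helper name.

```python
import sqlite3

ADD_PARENTS = """
INSERT INTO brestore_pathvisibility (PathId, JobId)
SELECT a.PathId, ?
FROM (SELECT DISTINCT h.PPathId AS PathId
      FROM brestore_pathhierarchy AS h
      JOIN brestore_pathvisibility AS p ON (h.PathId = p.PathId)
      WHERE p.JobId = ?) AS a
LEFT JOIN (SELECT PathId
           FROM brestore_pathvisibility
           WHERE JobId = ?) AS b
       ON (a.PathId = b.PathId)
WHERE b.PathId IS NULL
"""


def propagate_visibility(db, jobid):
    """Insert missing parents until a pass adds nothing; return pass count."""
    passes = 0
    while True:
        cur = db.execute(ADD_PARENTS, (jobid, jobid, jobid))
        passes += 1
        if cur.rowcount == 0:             # nothing inserted: we are done
            return passes
```

Each pass climbs one level, so the number of passes is bounded by the depth of
the deepest visible directory, plus one final pass that inserts nothing.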

When we get here, job's (finally) done.

I hope I've explained this well enough.

To give an idea of the volumes, here are the respective sizes of the main
tables in our postgres database (indexes included):

table | public.file                                  | 44 GB
table | public.path                                  | 533 MB
table | public.filename                              | 1660 MB
table | public.brestore_pathvisibility               | 1678 MB
table | public.brestore_pathhierarchy                | 178 MB

So the 2 brestore tables aren't very small, but almost negligible compared to 
the file table (about 5% increase in size for the database).

One last thing:
we also have a brestore_knownjobid table, which lets us know which jobids have
already been computed in our 2 tables. When the client requests an unknown
jobid, we add it 'on the fly', and at client startup we check that the known
jobids haven't been purged from the real Bacula tables.
