I wrote:
> That should fix the most common case of the multiple small jobs starting
> spooling at the same time. I've applied it, and will check that this
> does, in fact, happen, and report back.
It worked as expected. Multiple jobs spooling at the same time, and
despooling sequentially, now have their tape usage properly reflected in
the JOBMEDIA table. Here's an example:
jobmediaid | jobid | mediaid | firstindex | lastindex | startfile | endfile |
startblock | endblock
------------+-------+---------+------------+-----------+-----------+---------+------------+----------
184 | 87 | 3 | 1 | 743 | 58 | 58 |
0 | 192
185 | 88 | 3 | 1 | 103 | 58 | 58 |
193 | 843
186 | 89 | 3 | 1 | 282 | 58 | 58 |
844 | 6341
187 | 90 | 3 | 1 | 76 | 58 | 58 |
6342 | 6389
Kern Sibbald <[EMAIL PROTECTED]> writes:
> Still I have some doubts about what gets put in the jobmedia for the start of
> a second despooling call.
That's the other failure scenario, and no, your fix doesn't address it.
> The session labels don't contain any position dependent data, so that is not
> important.
Ah, great! I didn't read the code carefully; I assumed that since the
function that creates the label copies the tape position into the dcr,
it must also need it.
> If you can convince/show me that something is going wrong with multiple
> spoolings, then your suggestion becomes much more important.
The sequence of events that causes that error is quite simple. It will
"bite" when two jobs are spooling at the same time, and the one that
despools first is big enough to need two rounds of spooling.
I'll give a run-down of what happens (the example is from a run before
your latest fix, but that fix cannot possibly have any effect, since the
fix involves StartFile/StartBlock, not EndFile/EndBlock).
Two jobs are shown. Job 1 is a large job from a fast host, job 2 is a
somewhat smaller one from a slower host. Both start spooling at the
same time, then job 1 hits its limit, and despools. By the time it is
done despooling the first batch, job 2 is ready to despool. While job 2
despools, job 1 completes its second spooling run, and waits for access
to the tape before despooling.
The problem is in the call write_session_label() that job 1 performs at
the end of its second spooling run, to generate the end label. This
causes EndFile and EndBlock in job 1's dcr to be set to the current tape
position (while job 2 is despooling). The reason this is a problem is
that this overwrites the information about where job 1's first
despooling actually finished, and this data is still needed: the last
JOBMEDIA record for the first despooling run has not been written yet,
but will be written when the second despooling run starts.
First, the relevant rows from the FILEMEDIA table (the bogus row is the
one that has jobmediaid 28):
jobmediaid | jobid | mediaid | firstindex | lastindex | startfile | endfile |
startblock | endblock
------------+-------+---------+------------+-----------+-----------+---------+------------+---------
1 | 1 | 1 | 1 | 5959 | 0 | 0 |
1 | 15499
2 | 1 | 1 | 5959 | 9885 | 1 | 1 |
0 | 15499
3 | 1 | 1 | 9885 | 15463 | 2 | 2 |
0 | 15499
4 | 1 | 1 | 15463 | 15673 | 3 | 3 |
0 | 15499
5 | 1 | 1 | 15673 | 15903 | 4 | 4 |
0 | 15499
6 | 1 | 1 | 15903 | 16161 | 5 | 5 |
0 | 15499
7 | 1 | 1 | 16161 | 16386 | 6 | 6 |
0 | 15499
8 | 1 | 1 | 16386 | 16642 | 7 | 7 |
0 | 15499
9 | 1 | 1 | 16642 | 16863 | 8 | 8 |
0 | 15499
10 | 1 | 1 | 16863 | 17048 | 9 | 9 |
0 | 15499
11 | 1 | 1 | 17048 | 17254 | 10 | 10 |
0 | 15499
12 | 1 | 1 | 17254 | 17472 | 11 | 11 |
0 | 15499
13 | 1 | 1 | 17472 | 17682 | 12 | 12 |
0 | 15499
14 | 1 | 1 | 17682 | 17904 | 13 | 13 |
0 | 15499
15 | 1 | 1 | 17904 | 21904 | 14 | 14 |
0 | 15499
16 | 1 | 1 | 21904 | 28550 | 15 | 15 |
0 | 15499
17 | 1 | 1 | 28550 | 28556 | 16 | 16 |
0 | 15499
18 | 2 | 1 | 1 | 4229 | 17 | 17 |
2756 | 15499
19 | 2 | 1 | 4229 | 4905 | 18 | 18 |
0 | 15499
20 | 2 | 1 | 4905 | 9033 | 19 | 19 |
0 | 15499
21 | 2 | 1 | 9033 | 11691 | 20 | 20 |
0 | 15499
22 | 2 | 1 | 11691 | 14411 | 21 | 21 |
0 | 15499
23 | 2 | 1 | 14411 | 15215 | 22 | 22 |
0 | 15499
24 | 2 | 1 | 15215 | 15942 | 23 | 23 |
0 | 15499
25 | 2 | 1 | 15942 | 16795 | 24 | 24 |
0 | 15499
26 | 2 | 1 | 16795 | 17586 | 25 | 25 |
0 | 15499
27 | 2 | 1 | 17586 | 17589 | 26 | 26 |
0 | 11094
28 | 1 | 1 | 28556 | 28558 | 17 | 20 |
0 | 3397
29 | 1 | 1 | 28558 | 28561 | 26 | 26 |
11095 | 15499
30 | 1 | 1 | 28561 | 28707 | 27 | 27 |
0 | 15499
31 | 1 | 1 | 28707 | 31670 | 28 | 28 |
0 | 15499
32 | 1 | 1 | 31670 | 31684 | 29 | 29 |
0 | 15499
33 | 1 | 1 | 31684 | 31812 | 30 | 30 |
0 | 15499
34 | 1 | 1 | 31812 | 32824 | 31 | 31 |
0 | 15499
35 | 1 | 1 | 32824 | 56664 | 32 | 32 |
0 | 15499
36 | 1 | 1 | 56664 | 56668 | 33 | 33 |
0 | 717
Here are some entries from the log file, with a couple of comments interspersed:
24-Mar 17:06 sirius-dir JobId 1: Start Backup JobId 1,
Job=athene_User.2008-03-24_17.06.03
24-Mar 17:06 sirius-sd JobId 1: Spooling data ...
24-Mar 17:06 sirius-dir JobId 2: Start Backup JobId 2,
Job=barsoom_User.2008-03-24_17.06.04
24-Mar 17:06 sirius-sd JobId 2: Spooling data ...
24-Mar 17:34 sirius-sd JobId 1: User specified spool size reached.
24-Mar 17:34 sirius-sd JobId 1: Writing spooled data to Volume. Despooling
17,179,901,776 bytes ...
*** job 1 logs correct tape positions
24-Mar 17:56 sirius-sd JobId 2: Job write elapsed time = 00:50:10, Transfer
rate = 3.166 M bytes/second
24-Mar 17:56 sirius-sd JobId 2: Committing spooled data to Volume "DLT0001".
Despooling 9,539,116,533 bytes ...
24-Mar 18:36 sirius-sd JobId 1: Despooling elapsed time = 01:02:06, Transfer
rate = 4.610 M bytes/second
24-Mar 18:36 sirius-sd JobId 1: Spooling data again ...
24-Mar 18:36 sirius-sd JobId 2: [starts physically despooling to tape]
*** job 2 logs correct tape positions
24-Mar 18:46 sirius-sd JobId 1: Job write elapsed time = 01:40:20, Transfer
rate = 3.901 M bytes/second
*** job 1 erroneously sets end file/block in dcr to current tape position
*** this is done by the write_session_label() call at stored/append.c:291
*** this will cause the old, non-flushed jobmedia record to get the wrong
data when it gets flushed later
24-Mar 18:46 sirius-sd JobId 1: Committing spooled data to Volume "DLT0001".
Despooling 6,331,234,514 bytes ...
24-Mar 19:08 sirius-sd JobId 2: Despooling elapsed time = 00:32:30, Transfer
rate = 4.891 M bytes/second
24-Mar 19:08 sirius-sd JobId 1: [starts physically despooling to tape]
*** job 1 flushes that old record, giving it the wrong end file/block from
the dcr
*** job 1 then logs correct tape positions
24-Mar 19:31 sirius-sd JobId 1: Despooling elapsed time = 00:22:48, Transfer
rate = 4.628 M bytes/second
> However, I have always been unhappy that capturing the Volume position was
> done in the session label subroutine (it is sort of hidden there), so I would
> be more inclined to move the end code out into a new subroutine, like I have
> already done for the set_start_vol_position(), then remove it from the
> session_label() subroutine, and call the two subroutines at the appropriate
> times -- that makes what is actually going on much clearer.
That would be the correct fix, of course. Put it on your to-do list! ;)
Meanwhile, another quick fix (and a sufficient one) would be to add code
to write that final JOBMEDIA row to the database at the end of the first
despooling round. In the case where a large job spools in two or more
rounds, this would lead to a slight inelegancy, in that there would be
two rows together showing that the job used all of the tape file it was
in at the intersection between spooling rounds. However, that data
would still be correct, whereas the current code produces erroneous data.
-tih
--
Self documenting code isn't. User application constraints don't. --Ed Prochak
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
_______________________________________________
Bacula-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/bacula-devel