Hi,

04.10.2007 00:30, Ross Boylan wrote:
> On Sun, 2007-09-30 at 23:19 +0200, Arno Lehmann wrote:
>> Hello,
>>
>> 30.09.2007 01:36, Ross Boylan wrote:
>>> On Sat, 2007-09-29 at 13:15 -0700, Ross Boylan wrote:
>>>> On Fri, 2007-09-28 at 08:46 +0200, Arno Lehmann wrote:
>>>>> Hello,
>>>>>
>>>>> 27.09.2007 22:47, Ross Boylan wrote:
>>>>>> On Thu, 2007-09-27 at 09:19 +0200, Arno Lehmann wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> 27.09.2007 01:17, Ross Boylan wrote:
>>>>>>>> I've been having really slow backups (13 hours) when I backup a large
>>>>>>>> mail spool.  I've attached a run report.  There are about 1.4M files
>>>>>>>> with a compressed size of 4G.  I get much better throughput (e.g.,
>>>>>>>> 2,000KB/s vs 86KB/s for this job!) with other jobs.
>>>>>>> 2MB/s is still not especially fast for a backup to disk, I think. So 
>>>>>>> your storage disk might also be a factor here.
>>> .....
>>>>> vmstat during a backup would be a good next step in this case, I think.
>>>>>
>>>> Here are the results of a test job.  The first vmstat was shortly after
>>>> I started the job
>>>> # vmstat 15
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>>>  1  2   7460  50760 204964 667288    0    0    43    32  197   15 18  5 75  2
>>>>  1  1   6852  51476 195492 675524   28    0  1790   358  549 1876 20  6 36 38
>>>>  0  2   6852  51484 189332 682612    0    0  1048   416  470 1321 12  4 41 43
>>>>  2  0   6852  52508 187344 685328    0    0   303   353  485 1369 16  4 68 12
>>>>  1  0   6852  52108 187352 685464    0    0     1   144  468 1987 12  4 84  0
>>>>
>>>> This clearly shows about 40% of the CPU
>>>> time spent in IO wait during the backup.  Another 40% is idle.  I'm not
>>>> sure if the reports are being thrown off by the fact that I have 2
>>>> virtual CPU's (not really: it's P4 with hyperthreading).  If that's the
>>>> case, the 40% might really mean 80%.
>> Interesting question... I never thought about that, and the man page 
>> writers for vmstat on my system didn't either. I suppose that vmstat 
>> bases its output on the overall available CPU time, i.e. you have 40% 
>> of all available CPU time spent in IOwait. Like, one (HT) CPU spends 
>> 80% waiting, the other no time at all.
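
By the way, if sysstat is installed, mpstat could settle that question, 
since it reports the figures per CPU. Something like (assuming the 
package is there):

mpstat -P ALL 15

run alongside vmstat should show whether one HT sibling sits in iowait 
while the other is mostly idle.
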
>>
>>>> During the run I observed little CPU or memory usage above where I was
>>>> before it.  None of the bacula daemons, postgres or bzip got anywhere
>>>> near the top of my cpu use list (using ksysguard).
>>>>
>>>> A second run went much faster: 14 seconds (1721.6 KB/s) vs 64 seconds
>>>> (376.6 KB/s) the first time.  Both are much better than I got with my
>>>> original, bigger jobs.  It was so quick I think vmstat missed it
>>>> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>>>>  1  0   6852  56496 184148 683932    0    0    43    32  197   19 18  5 75  2
>>>>  3  0   6852  56016 178604 690024    0  113     0   429  524 3499 35 10 55  0
>>>>  2  0   6852  51988 172476 701556    0    0     1  2023  418 3827 33 11 55  1
>>>>
>>>> It looks as if the 2nd run only hit the cache, not the disk, while
>>>> reading the directory (bi is very low)--if I understand the output,
>>>> which is a big if.
>> I agree with your assumption.
>>
>>> Here are some more stats, systematically varying the software.  I'll
>>> just give the total backup times, in seconds, starting with the 2 reported above:
>>> 64
>>> 14
>>> Upgrade to Postgresql 8.2 from 8.1
>>> 41
>>> 13
>>> upgrade to bacula 2.2.4
>>> 13
>>> 12
>> This helped a lot, I think, though it still might be the buffer 
>> cache's performance you're measuring. 
> Yes, I think the good first time performance reflects the fact that the
> disk sectors were already in memory.  That's why I tried a different
> directory in the next test; it showed the first time performance right
> back to very slow.
>> Anyway, given the relative increase you 
>> observe, I'm rather sure that at least part of it is due to Bacula 
>> performing better.
>>
> Comparing times with the disk apparently in the cache, it was 13 seconds
> before the bacula upgrade and 13 and 12 seconds after.  That doesn't
> seem like much evidence for improvement.  I presume the part the upgrade
> should have helped (database transactions) is the same regardless of the
> issues reading files to backup.

Hmm... I might have misread your test results then - I assumed that, 
for the first runs in sequence, the times were 41 and 13 seconds. 
Assuming both jobs started with buffers not containing the test data, 
the improvement should be due to the faster catalog access.

But a larger job might be more interesting...
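
If you want to rule out the buffer cache for such comparisons, you could 
also flush it before each run. A rough sketch, assuming a kernel recent 
enough to provide /proc/sys/vm/drop_caches (untested here):

sync
echo 3 > /proc/sys/vm/drop_caches

That way every run starts cold and the first-run numbers become comparable.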

> Since the first and 2nd runs share the same need to populate the
> catalog, these results seem to show that the speed of the catalog is not
> the issue.

That would be a pity, because otherwise, the solution would have been 
found :-)

>>> switch to a new directory for source of backup
>>> old one has 1,606 files = 24MB compressed
>>> new one has 4,496 files = 27MB
>>> 92
>>> 22
>>>
> With a new directory, first time is up to 92 seconds (slightly more MB,
> but way more files than the earlier tests).
>>> In the slow cases vmstat shows lots of blocks in and major (40%) CPU
>>> time in iowait.
>>>
>>> I suspect the relatively good first try time with Postgresql 8.2 was a
>>> result of having some of the disk still in the cache.
>>>
>>> Even the best transfer rates were not particularly impressive (1,854
>>> KB/s), but the difference between the first and 2nd runs (and the
>>> sensitivity to the number of files) seems to indicate that simply finding
>>> and opening the files causes a huge slowdown.
>> Which is more or less to be expected. You can try to emulate this 
>> using something like (untested!) 'time find ./ -type f -exec ls -l 
>> \{\} >/dev/null \;'; this should more or less measure the time needed 
>> to walk the directory tree and read the file metadata.
> # time find upj -ls > /dev/null
> 
> real    0m3.383s
> user    0m0.060s
> sys     0m0.192s
> 
> That's the same directory as the earlier tests.  I figured I'd skip the
> overhead of forking processes (-exec).

Good idea, in fact... I'm simply not sure how POSIX-compliant find's 
-ls option is...
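
The POSIX way to batch the arguments without forking once per file 
would be the '-exec ... {} +' form, so something like this (untested) 
should be roughly equivalent:

time find upj -type f -exec ls -l {} + > /dev/null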

> Of course, the backup needs to open and read the file as well as the
> metadata.
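
True - so to emulate the backup's read load more closely, you could 
time reading the file contents as well, e.g. (untested, using your test 
directory):

time find upj -type f -exec cat {} + > /dev/null

If that is about as slow as the backup itself, the bottleneck is on the 
reading side rather than in Bacula or the catalog.
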
>>> I guess it's not surprising that hitting the cache is faster than
>>> hitting the disk, but the very slow speed of the disk is striking.
>> Yes and yes. How fast can you read the disk raw, i.e. 'dd 
>> if=/dev/evms/CyrusSpool bs=4096 of=/dev/null' or something?
>>
>>> The mount for that partition is
>>> /dev/evms/CyrusSpool /var/spool/cyrus ext3    rw,noatime 0       2
>>> The partition is sitting on top of a lot of layers, since it's from an
>>> LVM container, with evms on top of that.
>> In my experience, that doesn't matter much. On an Athlon 500, with an 
>> LVM volume using old PATA drives connected with a really horrible IDE 
>> controller, I get a read rate of about 1GB/147s, ~7MB/s:
>>
>> time dd if=/dev/mapper/daten-lvhome bs=4096 skip=16k count=256k \
>> of=/dev/null
>> 262144+0 records in
>> 262144+0 records out
>>
>> real    2m27.031s
>> user    0m0.211s
>> sys     0m11.142s
>>
> My machine gives
> # time dd if=/dev/evms/CyrusSpool bs=4096 skip=16k count=256k
> of=/dev/null
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 26.4824 seconds, 40.5 MB/s
> so the bottleneck isn't there.

Not lightning-fast, but better than the backup speed...

>> With up-to-date hardware, I'd expect much better throughput. This is 
>> from a dualcore Opteron with a single SATA disk, also an LVM volume:
>>
>> dd if=/dev/mapper/data-squid bs=4096 skip=16k count=256k of=/dev/null
>> 262144+0 records in
>> 262144+0 records out
>> 1073741824 bytes (1.1 GB) copied, 14.9858 s, 71.7 MB/s
>>
>>> Probably trying with the catalog on a different physical disk would be a
>>> good idea.
>> That would be my next step at least.
> With the catalog on the same disk as the  directory being backed up, I
> suppose this could mess up the caching, both by the OS and the hardware.

Caching is one thing, but the seek times and read delays increase a lot.
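
For a quick test with the catalog on another spindle, moving the whole 
database directory and symlinking it back is usually enough. A rough 
sketch - the paths are only an example for a Debian-style PostgreSQL 
layout, so adjust them to your setup:

/etc/init.d/postgresql stop
mv /var/lib/postgresql /mnt/otherdisk/postgresql
ln -s /mnt/otherdisk/postgresql /var/lib/postgresql
/etc/init.d/postgresql start

A tablespace would be the cleaner long-term solution, but the symlink is 
fine for measuring.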

> Also, in the tests (though not in the real backup) the cyrus server was
> active.  Both it and postgres are probably doing a lot of syncs, which
> can't be helping performance.  It's hard to believe the effect could be
> that big, but then again it's hard to believe anything could make the
> backup so slow.

Try the 'time dd...' while the server is active and being backed up... 
perhaps even on a file-by-file basis, not simply dumping the device's 
raw data.

I'm sure it should be possible to find the actual bottleneck...
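
Watching the disks with iostat during a slow run might help there, too. 
If sysstat is installed, something like

iostat -x 15

shows per-device utilization and average wait times, which should make 
it obvious whether the spool disk is saturated with seeks while Bacula 
reads the mail files.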

Arno

> Ross
> 

-- 
Arno Lehmann
IT-Service Lehmann
www.its-lehmann.de
