Re: R: Re: Slow startup of systemd-journal on BTRFS

2014-06-13 Thread Goffredo Baroncelli
Hi Dave

On 06/13/2014 01:24 AM, Dave Chinner wrote:
> On Thu, Jun 12, 2014 at 12:37:13PM +, Duncan wrote:
>> Goffredo Baroncelli kreij...@libero.it posted on Thu, 12 Jun 2014
>> 13:13:26 +0200 as excerpted:
>>
>>>> systemd has a very stupid journal write pattern. It checks if there is
>>>> space in the file for the write, and if not it fallocates the small
>>>> amount of space it needs (it does *4 byte* fallocate calls!) and then
>>>> does the write to it.  All this does is fragment the crap out of the log
>>>> files because the filesystems cannot optimise the allocation patterns.
>>>
>>> I checked the code, and to me it seems that the fallocate() are done in
>>> FILE_SIZE_INCREASE unit (actually 8MB).
>>
>> FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
>> actually pretty much equally bad without NOCOW set on the file.
>
> So maybe it's been fixed in systemd since the last time I looked.
> Yup:
>
> http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58
>
> The reason it was changed? To save a syscall per append, not to
> prevent fragmentation of the file, which was the problem everyone
> was complaining about...

Thanks for pointing that out. However, I am performing my tests on a Fedora 20
system with systemd-208, which already seems to have this change.
 
>> Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for
>> either 4 byte or 8 MiB file allocations is going to end up being the
>> same, forcing (repeated until full) rewrite of each 4 KiB block into its
>> own extent.


I am reaching the conclusion that fallocate() is not the problem. Each
fallocate() call grows the file by about 8 MB, which is enough for a fair
amount of logging, so it is not called very often.
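
For what it's worth, this allocation pattern can be simulated from the shell;
a rough sketch, assuming util-linux fallocate(1) and that /var/tmp is on btrfs
(the file name and sizes are mine, not journald's):

  f=/var/tmp/fallocate-test
  rm -f "$f"
  for i in 0 1 2 3; do
      # Preallocate the next 8 MiB, as journald's FILE_SIZE_INCREASE does,
      fallocate -o $((i * 8 * 1024 * 1024)) -l $((8 * 1024 * 1024)) "$f"
      # then fill it with 4 KiB writes (2048 blocks = 8 MiB).
      dd if=/dev/zero of="$f" bs=4k count=2048 seek=$((i * 2048)) \
         conv=notrunc status=none
  done
  filefrag "$f"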

I have to investigate further what happens when the logs are copied from /run
to /var/log/journal: this is when journald seems to slow everything down.
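
(For anyone wanting to reproduce this: systemd-journald(8) documents that
SIGUSR1 asks journald to flush the journal from /run to /var, so the copy can
be triggered by hand; a sketch:)

  # Ask journald to flush /run/log/journal to /var/log/journal now.
  kill -USR1 "$(pidof systemd-journald)"
  # Then inspect the freshly written file.
  filefrag /var/log/journal/*/system.journal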

I have prepared a PC which reboots continuously; I am collecting the time
required to finish the boot vs. the fragmentation of the system.journal file
vs. the number of boots. The results are dramatic: after 20 reboots, the boot
time increases by 20-30 seconds. Defragmenting system.journal brings the boot
time back to its original value, but after another 20 reboots it again takes
20-30 seconds more.
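
One way to collect these numbers at each boot; a sketch, not necessarily the
exact script used, with the log path chosen arbitrarily:

  # Append one record per boot: boot time plus journal extent count.
  j=$(echo /var/log/journal/*/system.journal)
  {
      systemd-analyze           # reports the time spent to boot
      filefrag "$j"             # reports the extent count
  } >> /var/log/boot-frag.log

  # The "defrag" referred to above is btrfs' online defragmenter:
  btrfs filesystem defragment "$j"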

It is a slow PC, but I saw the same behaviour on a more modern PC as well (an
i5 with 8 GB of RAM).

In both cases the disk is a mechanical one...

 
> And that's now a btrfs problem :/

Are you sure?

ghigo@venice:/var/log$ sudo filefrag messages
messages: 29 extents found

ghigo@venice:/var/log$ sudo filefrag journal/*/system.journal
journal/41d686199835445395ac629d576dfcb9/system.journal: 1378 extents found

So the old rsyslog creates files with far fewer fragments. BTRFS (and it seems
XFS too) certainly exposes this problem more than other filesystems do, but
systemd also seems to create a lot of extents.

BR
G.Baroncelli



 
> Cheers,
>
> Dave.
 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it)
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: R: Re: Slow startup of systemd-journal on BTRFS

2014-06-12 Thread Duncan
Goffredo Baroncelli kreij...@libero.it posted on Thu, 12 Jun 2014
13:13:26 +0200 as excerpted:

>> systemd has a very stupid journal write pattern. It checks if there is
>> space in the file for the write, and if not it fallocates the small
>> amount of space it needs (it does *4 byte* fallocate calls!) and then
>> does the write to it.  All this does is fragment the crap out of the log
>> files because the filesystems cannot optimise the allocation patterns.
>
> I checked the code, and to me it seems that the fallocate() are done in
> FILE_SIZE_INCREASE unit (actually 8MB).

FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think 
actually pretty much equally bad without NOCOW set on the file.

Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for 
either 4 byte or 8 MiB file allocations is going to end up being the 
same, forcing (repeated until full) rewrite of each 4 KiB block into its 
own extent.
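
Setting NOCOW on the journal directory is the usual workaround; a sketch (the
machine-id path is a placeholder, and it only helps files created after the
attribute is set, while also disabling btrfs checksums for them):

  # Set NOCOW on the journal directory; new files inherit the attribute.
  chattr +C /var/log/journal/<machine-id>/
  # Existing journal files keep COW; rotate or recreate them to benefit.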

Turning off the fallocate should allow btrfs to at least consolidate a 
bit, tho to the extent that multiple 4 KiB blocks cannot be written, 
repeated fsync will still cause issues.
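
A quick way to see this effect; a sketch, assuming /var/tmp is on a
COW-enabled btrfs mount:

  f=/var/tmp/fsync-test
  rm -f "$f"
  for i in $(seq 0 255); do
      # One 4 KiB append, fsynced before the next write arrives.
      dd if=/dev/zero of="$f" bs=4k count=1 seek="$i" \
         conv=notrunc,fsync status=none
  done
  filefrag "$f"    # extent count tends toward the number of appends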

80-100 MiB logs (size mentioned in another reply) should be reasonably 
well handled by btrfs autodefrag, however, if it's turned on.  I'd be 
worried if sizes were > 256 MiB and certainly as sizes approached a GiB, 
but it should handle 80-100 MiB just fine.
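
For anyone wanting to try it, autodefrag is a btrfs mount option; an example
(mine; adjust the mount point to wherever /var/log/journal lives):

  # Enable autodefrag on the filesystem that holds the journal.
  mount -o remount,autodefrag /
  # Or persistently in /etc/fstab:
  #   UUID=<fs-uuid>  /  btrfs  defaults,autodefrag  0 0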

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman
