Re: price to pay for nocow file bit?
On Thu, Jan 8, 2015 at 11:53 AM, Lennart Poettering lenn...@poettering.net wrote:

On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8...@umail.furryterror.org) wrote:

On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:

Heya!

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers in the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

Now, to improve things a bit, I yesterday made a change to journald, to issue the btrfs defrag ioctl when a journal file is rotated, i.e. when we know that no further writes will ever be done on the file.

However, I wonder now if I should go one step further even, and use the equivalent of chattr -C (i.e. nocow) on all journal files. I am wondering what price I would precisely have to pay for that. Judging by this earlier thread:

http://www.spinics.net/lists/linux-btrfs/msg33134.html

it's mostly about data integrity, which is something I can live with, given the conservative write patterns of journald, and the fact that we do our own checksumming and careful data validation. I mean, if btrfs in this mode provides no worse data integrity semantics than ext4 I am fully fine with losing this feature for these files.

This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.

We already use fallocate(), but this is not enough on cow file systems. With fallocate() you can certainly improve fragmentation when appending things to a file. But on a COW file system this will help little if we change things in the beginning of the file, since COW means that it will then make a copy of those blocks and alter the copy, but leave the original version unmodified. And if we do that all the time the files get heavily fragmented, even though all the blocks we modify have been fallocate()d initially...

This would work on ext4, xfs, and others, and provide the same benefit (or even better) without filesystem-specific code. journald would preallocate a contiguous chunk past the end of the file for appends, and

That's precisely what we do. But journald's write pattern is not purely appending to files, it's append something to the end, then link it up in the beginning. And for the append part we are fine with fallocate(). It's the link-up part that completely fucks up fragmentation so far.

I think a per-file autodefrag flag would help a lot here.

We've made some improvements for autodefrag and slowly growing log files because we noticed that compression ratios on slowly growing files really weren't very good. The problem was we'd never have more than a single block to compress, so the compression code would give up and write the raw data. compression + autodefrag on the other hand would take 64-128K and recow it down, giving very good results.

The second problem we hit was with stable page writes. If bdflush decides to write the last block in the file, it's really a wasted IO unless the block is fully filled. We've been experimenting with a patch to leave the last block out of writepages unless it's an fsync/O_SYNC.

I'll code up the per-file autodefrag, we've hit a few use cases that make sense.

-chris

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: price to pay for nocow file bit?
On Thu, Jan 8, 2015 at 6:30 AM, Lennart Poettering lenn...@poettering.net wrote:

On Wed, 07.01.15 15:10, Josef Bacik (jba...@fb.com) wrote:

On 01/07/2015 12:43 PM, Lennart Poettering wrote:

Heya!

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers in the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this problem but I haven't had the chance to look into it.

Hmm, I am kinda interested in a solution that I can just implement in systemd/journald now and that will then just make things work for people suffering from the problem. I mean, I can hardly make systemd patch the mount options of btrfs just because I place a journal file on some fs... Is autodefrag supposed to become a default one day?

Anyway, given the pros and cons I have now changed journald to set the nocow bit on newly created journal files. When files are rotated (and we hence know we will never ever write to them again) we try to unset the bit again, and a defrag ioctl is invoked right after. btrfs currently silently ignores that we unset the bit, and leaves it set, but I figure I should try to unset it anyway, in case it learns that one day. After all, after rotating the files there's no reason to treat the files specially anymore...

I don't think it makes sense to unset nocow on a non-zero-byte file any more than it makes sense to set it. The functional equivalent that'd need to be done is:

touch system@blah.journal~
chattr -C system@blah.journal~
cp system@blah.journal system@blah.journal~

The copy won't have nocow set. I suggest just leaving it alone. +C the /var/log/journal/ directory before machine-name directories are created, and then everything in there automatically inherits +C upon creation. No need to unset or defrag; in particular on SSDs I think it's sorta pointless excess writing.

Set it and forget it policy.

-- Chris Murphy
Re: price to pay for nocow file bit?
On Friday, 9 January 2015, 16:52:59 David Sterba wrote:

On Thu, Jan 08, 2015 at 02:30:36PM +0100, Lennart Poettering wrote:

On Wed, 07.01.15 15:10, Josef Bacik (jba...@fb.com) wrote:

On 01/07/2015 12:43 PM, Lennart Poettering wrote:

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers in the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this problem but I haven't had the chance to look into it.

Hmm, I am kinda interested in a solution that I can just implement in systemd/journald now and that will then just make things work for people suffering from the problem. I mean, I can hardly make systemd patch the mount options of btrfs just because I place a journal file on some fs... Is autodefrag supposed to become a default one day?

Maybe. The option brings a performance hit because reading a block that's out of sequential order with its neighbors will also require reading the neighbors. Then the group (like 8 blocks) will be written sequentially to a new location. It's an increased read latency in the fragmented case and more stress to the block allocator. Practically it's not that bad for general use, eg. a root partition, but for now it's still the users' decision whether to use it or not.

I am concerned about flash-based storage as probably not needing it, and about the additional writes it causes. And about free space fragmentation due to regular defragmenting. I read on the XFS mailing list more than once not to run xfs_fsr, the XFS online defrag tool, regularly from a cron job, as it can make free space fragmentation worse.

And given the issues BTRFS still has with free space handling (see the thread I started about it and the kernel bug report 90401), I am wary of anything that could add more free space fragmentation by default, especially when it's not needed, like on an SSD.

I have:

merkaba:/home/martin/.local/share/akonadi/db_data/akonadi filefrag parttable.ibd
parttable.ibd: 8039 extents found

And I had this up to 4 extents already; I did try manual defragmenting with various options to see whether there was any effect: none. Same with the desktop search database of KDE. On my dual-SSD BTRFS RAID 1 setup the amount of extents simply does not seem to matter at all, except for journalctl, where I saw some noticeable delays on initially calling it. But right now it's just about one second there, which on the one hand is little, but on the other hand is a lot given it's an SSD RAID 1.

But heck, the fragmentation of some of those files in there is abysmal considering the small size of the files:

merkaba:/var/log/journal/1354039e4d4bb8de4f97ac840004 filefrag *
system@00050bbcaeb23ff2-c7230ef5d29df634.journal~: 2030 extents found
system@00050be4b7106b25-a4ab21cd18c0424c.journal~: 1859 extents found
system@00050bf84d2efb2c-1e4e85dacaf1252c.journal~: 1803 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-0001-00050bf84d2ae7be.journal: 1076 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-001b22f7-00050bfb82b379f8.journal: 84 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-001b22fb-00050bfb8657c8b0.journal: 1036 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-001b2693-00050c0d8075ea4b.journal: 1478 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-001b4136-00050c3782b1c527.journal: 2 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-001b4137-00050c378666837a.journal: 142 extents found
system@2f7df24c6b70488fa9724b00ab6e6043-001b414c-00050c37c7883228.journal: 574 extents found
system@5ee315765b1a4c6d9ed2fe833dec7094-00010fdd-00050b56fa20f846.journal: 2309 extents found
system.journal: 783 extents found
user-1000@cc345f87cb404df6a9588b0b1c707007-00011061-00050b56fa223006.journal: 340 extents found
user-1000@cc345f87cb404df6a9588b0b1c707007-001ad624-00050ba77c734a3b.journal: 564 extents found
user-1000@cc345f87cb404df6a9588b0b1c707007-001b297c-00050c0d8077447c.journal: 105 extents found
user-1000.journal: 133 extents found
user-120.journal: 5 extents found
user-2012.journal: 2 extents found
user-65534.journal: 222 extents found

merkaba:/var/log/journal/1354039e4d4bb8de4f97ac840004 du -sh * | cut -c1-72
16M	system@00050bbcaeb23ff2-c7230ef5d29df634.journal~
16M	system@00050be4b7106b25-a4ab21cd18c0424c.journal~
16M	system@00050bf84d2efb2c-1e4e85dacaf1252c.journal~
8,0M	system@2f7df24c6b70488fa9724b00ab6e6043-0001-00050bf84d
8,0M	system@2f7df24c6b70488fa9724b00ab6e6043-001b22f7-00050bfb82
8,0M	system@2f7df24c6b70488fa9724b00ab6e6043-001b22fb-00050bfb86
8,0M
Re: price to pay for nocow file bit?
On Saturday, 10 January 2015, 13:00:23 you wrote:

I have seen this setting before, but I thought, well, logs would be good to keep. But for the SSD-based laptop I will try volatile storage now. I will see whether I miss a longer history, but I had already reduced it to a 14-day maximum retention time anyway, because systemd used 1,1 GiB of my root partition for logs while rsyslog + logrotate used much less[1]. And I have not yet seen an immediate benefit for me here on this laptop that would justify using up that much in resources just for logging. So for me it's a useless waste of resources currently.

(This may be different on a server or anywhere where logfiles matter more, but then, when I consider some of our server VMs with just 4 to 5 GiB VMDK files, journald on Debian in default settings could easily fill the remaining space on some of them. Which I would consider a regression.)

Okay, scratch that. journald is adaptive to the remaining space on the disk AFAIK.

-- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: price to pay for nocow file bit?
On Thursday, 8 January 2015, 06:30:59 Duncan wrote:

FWIW, I'm systemd on btrfs here, but I use syslog-ng for my non-volatile logs and have Storage=volatile in journald.conf, using journald only for current-session, where unit status including last-10-messages makes troubleshooting /so/ much easier. =:^) Once past current-session, text logs are more useful to me, which is where syslog-ng comes in. Each to its strength, and keeping the journals from wearing the SSDs[1] is a very nice bonus. =:^)

Nice, I'll try this as well. Because while journalctl provides some nice stuff to query the logs, even by field or time and what not, frankly, on my laptop I don't care.

I have seen this setting before, but I thought, well, logs would be good to keep. But for the SSD-based laptop I will try volatile storage now. I will see whether I miss a longer history, but I had already reduced it to a 14-day maximum retention time anyway, because systemd used 1,1 GiB of my root partition for logs while rsyslog + logrotate used much less[1]. And I have not yet seen an immediate benefit for me here on this laptop that would justify using up that much in resources just for logging. So for me it's a useless waste of resources currently.

(This may be different on a server or anywhere where logfiles matter more, but then, when I consider some of our server VMs with just 4 to 5 GiB VMDK files, journald on Debian in default settings could easily fill the remaining space on some of them. Which I would consider a regression.)

[1] systemd: journal is quite big compared to rsyslog output
https://bugs.debian.org/773538

-- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
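The volatile-storage-plus-bounded-retention setup Duncan and Martin describe corresponds to a few journald.conf settings. A sketch with illustrative values (the caps shown here are assumptions, not recommendations from the thread):

```
# /etc/systemd/journald.conf (illustrative values)
[Journal]
Storage=volatile        # journals live in /run only; syslog-ng keeps the text logs
RuntimeMaxUse=64M       # cap for the volatile journal in /run
#MaxRetentionSec=14day  # alternative with persistent storage: drop after 14 days
```

With Storage=volatile the journal never touches the SSD at all, which sidesteps the btrfs fragmentation question entirely for that machine.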
Re: price to pay for nocow file bit?
On Thu, Jan 08, 2015 at 01:36:21PM -0500, Zygo Blaxell wrote:

Hmmm... it seems the handwaving about tail-packing that I was previously ignoring is important after all. A few quick tests with filefrag show that btrfs isn't doing full tail-packing, only small-file allocation (i.e. files smaller than 4096 bytes get stored inline, and nothing else does, not even sparse files with a single 1-byte extent at offset != 0). Thus the inline storage avoids fragmentation only to the minimum extent possible.

That's right, btrfs does not do the reiserfs-style tail packing, and IMHO never will. That brings more code complexity than it's worth in the end.

Short appends to the end of the file effectively become modifications of the last block of the file. That triggers CoW on the append, and if we're doing lots of tiny writes the file becomes extremely fragmented (exactly the worst case of one fragment per block). A mix of big and small appends seems to use fallocated space for those writes that cover complete blocks, which is arguably worse than not fallocating at all. So fallocate will not help until btrfs learns to do tail-packing, or finds some other way to avoid this problem.

This would work on ext4, xfs, and others, and provide the same benefit (or even better) without filesystem-specific code. journald would preallocate a contiguous chunk past the end of the file for appends, and

That's precisely what we do. But journald's write pattern is not purely appending to files, it's append something to the end, then link it up in the beginning. And for the append part we are fine with fallocate(). It's the link-up part that completely fucks up fragmentation so far.

Wrong theory but same result. The writes at the beginning just keep replacing a single extent over and over, which has a worst-case effect of adding a single fragment to the beginning of a file that would not otherwise be fragmented. The appends are causing fragmentation all by themselves. :-P

OTOH, the appending write and the header rewrite happen at roughly the same time, so the actual block allocations may end up close to each other as well. But yes, one cannot rely on that.
Re: price to pay for nocow file bit?
On Fri, Jan 09, 2015 at 04:41:03PM +0100, David Sterba wrote:

On Thu, Jan 08, 2015 at 01:36:21PM -0500, Zygo Blaxell wrote:

Hmmm... it seems the handwaving about tail-packing that I was previously ignoring is important after all. A few quick tests with filefrag show that btrfs isn't doing full tail-packing, only small-file allocation (i.e. files smaller than 4096 bytes get stored inline, and nothing else does, not even sparse files with a single 1-byte extent at offset != 0). Thus the inline storage avoids fragmentation only to the minimum extent possible.

That's right, btrfs does not do the reiserfs-style tail packing, and IMHO never will. That brings more code complexity than it's worth in the end.

If the file has been fallocated past EOF, it may make sense to do the extra work of maintaining a tail fragment in metadata until it's bigger than a block, and therefore large enough to write to the fallocated extent. At least in that case the application has explicitly asked the filesystem for more optimization than in the general append case. Otherwise, what are fallocations past EOF for?

If the application appends 4K blocks all the time everything is fine, but that requirement might not work for journald, and doesn't work for rsyslog, mboxes, and many other long-running small-write use cases that append in non-block-sized units.

On the other hand... it could be easier to handle such cases with a special case of autodefrag -- one that focuses on appends, so it can be enabled by default earlier than the other problematic autodefrag use cases. It may even be faster to defragment in small batches (coalescing a few hundred blocks at a time near the end of file) than to do tail-packing on every append, especially if metadata blocks have much more overhead than data blocks (e.g. dup metadata with single data on spinning rust). The fallocate would be wasted in this case, but the number of fragments in the final file would be reasonably sane.

This would work on ext4, xfs, and others, and provide the same benefit (or even better) without filesystem-specific code. journald would preallocate a contiguous chunk past the end of the file for appends, and

That's precisely what we do. But journald's write pattern is not purely appending to files, it's append something to the end, then link it up in the beginning. And for the append part we are fine with fallocate(). It's the link-up part that completely fucks up fragmentation so far.

Wrong theory but same result. The writes at the beginning just keep replacing a single extent over and over, which has a worst-case effect of adding a single fragment to the beginning of a file that would not otherwise be fragmented. The appends are causing fragmentation all by themselves. :-P

OTOH, the appending write and the header rewrite happen at roughly the same time, so the actual block allocations may end up close to each other as well. But yes, one cannot rely on that.

The header rewrite is close to the last append, but that's not really useful. There will be one header near one appending write, but there are also thousands of other appending writes separated in time and space on the disk, even after fallocate preallocated contiguous space for the file.
Re: price to pay for nocow file bit?
On Thu, Jan 08, 2015 at 02:30:36PM +0100, Lennart Poettering wrote:

On Wed, 07.01.15 15:10, Josef Bacik (jba...@fb.com) wrote:

On 01/07/2015 12:43 PM, Lennart Poettering wrote:

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers in the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this problem but I haven't had the chance to look into it.

Hmm, I am kinda interested in a solution that I can just implement in systemd/journald now and that will then just make things work for people suffering from the problem. I mean, I can hardly make systemd patch the mount options of btrfs just because I place a journal file on some fs... Is autodefrag supposed to become a default one day?

Maybe. The option brings a performance hit because reading a block that's out of sequential order with its neighbors will also require reading the neighbors. Then the group (like 8 blocks) will be written sequentially to a new location. It's an increased read latency in the fragmented case and more stress to the block allocator. Practically it's not that bad for general use, eg. a root partition, but for now it's still the users' decision whether to use it or not.

Anyway, given the pros and cons I have now changed journald to set the nocow bit on newly created journal files. When files are rotated (and we hence know we will never ever write to them again) we try to unset the bit again, and a defrag ioctl is invoked right after. btrfs currently silently ignores that we unset the bit, and leaves it set, but I figure I should try to unset it anyway, in case it learns that one day. After all, after rotating the files there's no reason to treat the files specially anymore...

I'll keep an eye on this, and see if I still get user complaints about it. Should autodefrag become the default eventually we can get rid of this code in journald again.

One question regarding the btrfs defrag ioctl: playing around with it, it appears to be asynchronous; the defrag request is simply queued and the ioctl returns immediately. Which is great for my use case. However, I was wondering if it was always async like this? I googled a bit, and found reports that defrag might take a while, but I am not sure if those reports were about the ioctl taking so long, or the effect of defrag actually hitting the disk...

Defrag can be both sync and async; that's what the option -f is for: schedule the file's blocks for write and flush them, then go to the next file. This avoids the hit in the async mode when tons of data can get redirtied at once.
Re: price to pay for nocow file bit?
On Wed, Jan 7, 2015 at 1:10 PM, Josef Bacik jba...@fb.com wrote:

On 01/07/2015 12:43 PM, Lennart Poettering wrote:

Heya!

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers in the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this problem but I haven't had the chance to look into it.

I've been using autodefrag and haven't run into journal corruptions that I can attribute to btrfs since the last one was fixed over a year ago. Chris Mason has suggested a preference for using autodefrag for this use case rather than xattr +C. But I don't know the time frame for autodefrag by default; it's come up a couple of times but it's not the default yet.

I've found autodefrag journals are less than 200 fragments, and average between 50-150 fragments. Without it, this spirals into the thousands quite quickly. Searches don't seem slower when journal files are made of a few extents vs ~100, but beyond several hundred, let alone several thousand, it becomes noticeable.

A somewhat minor negative of +C: in case of RAID 1 or higher and silent data corruption, there will be no Btrfs detection due to the lack of checksums, and therefore no correction. In the case where a drive reports a read error, it is corrected, same as with md or lvm raid1+.

-- Chris Murphy
Re: price to pay for nocow file bit?
Chris Murphy wrote on 08-01-15 at 09:24:

On Wed, Jan 7, 2015 at 1:10 PM, Josef Bacik jba...@fb.com wrote:

On 01/07/2015 12:43 PM, Lennart Poettering wrote:

Heya!

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers in the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this problem but I haven't had the chance to look into it.

I've been using autodefrag and haven't run into journal corruptions that I can attribute to btrfs since the last one was fixed over a year ago. Chris Mason has suggested a preference for using autodefrag for this use case rather than xattr +C. But I don't know the time frame for autodefrag by default; it's come up a couple of times but it's not the default yet.

Same here, no issues with using autodefrag and journals.

regards,
Koen
Re: price to pay for nocow file bit?
On Wed, 07.01.15 15:10, Josef Bacik (jba...@fb.com) wrote:

On 01/07/2015 12:43 PM, Lennart Poettering wrote:

Heya!

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers in the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this problem but I haven't had the chance to look into it.

Hmm, I am kinda interested in a solution that I can just implement in systemd/journald now and that will then just make things work for people suffering from the problem. I mean, I can hardly make systemd patch the mount options of btrfs just because I place a journal file on some fs... Is autodefrag supposed to become a default one day?

Anyway, given the pros and cons I have now changed journald to set the nocow bit on newly created journal files. When files are rotated (and we hence know we will never ever write to them again) we try to unset the bit again, and a defrag ioctl is invoked right after. btrfs currently silently ignores that we unset the bit, and leaves it set, but I figure I should try to unset it anyway, in case it learns that one day. After all, after rotating the files there's no reason to treat the files specially anymore...

I'll keep an eye on this, and see if I still get user complaints about it. Should autodefrag become the default eventually we can get rid of this code in journald again.

One question regarding the btrfs defrag ioctl: playing around with it, it appears to be asynchronous; the defrag request is simply queued and the ioctl returns immediately. Which is great for my use case. However, I was wondering if it was always async like this? I googled a bit, and found reports that defrag might take a while, but I am not sure if those reports were about the ioctl taking so long, or the effect of defrag actually hitting the disk...

Lennart

-- Lennart Poettering, Red Hat
Re: price to pay for nocow file bit?
On 8/1/2015 3:30 PM, Lennart Poettering wrote:

On Wed, 07.01.15 15:10, Josef Bacik (jba...@fb.com) wrote:

On 01/07/2015 12:43 PM, Lennart Poettering wrote:

Heya!

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers in the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this problem but I haven't had the chance to look into it.

Hmm, I am kinda interested in a solution that I can just implement in systemd/journald now and that will then just make things work for people suffering from the problem. I mean, I can hardly make systemd patch the mount options of btrfs just because I place a journal file on some fs... Is autodefrag supposed to become a default one day?

Anyway, given the pros and cons I have now changed journald to set the nocow bit on newly created journal files. When files are rotated (and we hence know we will never ever write to them again) we try to unset the bit again, and a defrag ioctl is invoked right after. btrfs currently silently ignores that we unset the bit, and leaves it set, but I figure I should try to unset it anyway, in case it learns that one day. After all, after rotating the files there's no reason to treat the files specially anymore...

Can this behaviour be optional? I don't mind some fragmentation if I can keep having checksums and the ability for raid 1 to repair those files.

I'll keep an eye on this, and see if I still get user complaints about it. Should autodefrag become the default eventually we can get rid of this code in journald again.

One question regarding the btrfs defrag ioctl: playing around with it, it appears to be asynchronous; the defrag request is simply queued and the ioctl returns immediately. Which is great for my use case. However, I was wondering if it was always async like this? I googled a bit, and found reports that defrag might take a while, but I am not sure if those reports were about the ioctl taking so long, or the effect of defrag actually hitting the disk...

Lennart
Re: price to pay for nocow file bit?
On Thu, Jan 08, 2015 at 05:53:21PM +0100, Lennart Poettering wrote:

On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8...@umail.furryterror.org) wrote:

On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:

Heya!

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers in the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

Now, to improve things a bit, I yesterday made a change to journald, to issue the btrfs defrag ioctl when a journal file is rotated, i.e. when we know that no further writes will ever be done on the file.

However, I wonder now if I should go one step further even, and use the equivalent of chattr -C (i.e. nocow) on all journal files. I am wondering what price I would precisely have to pay for that. Judging by this earlier thread:

http://www.spinics.net/lists/linux-btrfs/msg33134.html

it's mostly about data integrity, which is something I can live with, given the conservative write patterns of journald, and the fact that we do our own checksumming and careful data validation. I mean, if btrfs in this mode provides no worse data integrity semantics than ext4 I am fully fine with losing this feature for these files.

This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.

We already use fallocate(), but this is not enough on cow file systems. With fallocate() you can certainly improve fragmentation when appending things to a file. But on a COW file system this will help little if we change things in the beginning of the file, since COW means that it will then make a copy of those blocks and alter the copy, but leave the original version unmodified. And if we do that all the time the files get heavily fragmented, even though all the blocks we modify have been fallocate()d initially...

Hmmm... it seems the handwaving about tail-packing that I was previously ignoring is important after all.

A few quick tests with filefrag show that btrfs isn't doing full tail-packing, only small-file allocation (i.e. files smaller than 4096 bytes get stored inline, and nothing else does, not even sparse files with a single 1-byte extent at offset != 0). Thus the inline storage avoids fragmentation only to the minimum extent possible.

Short appends to the end of the file effectively become modifications of the last block of the file. That triggers CoW on the append, and if we're doing lots of tiny writes the file becomes extremely fragmented (exactly the worst case of one fragment per block). A mix of big and small appends seems to use fallocated space for those writes that cover complete blocks, which is arguably worse than not fallocating at all. So fallocate will not help until btrfs learns to do tail-packing, or finds some other way to avoid this problem.

This would work on ext4, xfs, and others, and provide the same benefit (or even better) without filesystem-specific code. journald would preallocate a contiguous chunk past the end of the file for appends, and

That's precisely what we do. But journald's write pattern is not purely appending to files, it's append something to the end, then link it up in the beginning. And for the append part we are fine with fallocate(). It's the link-up part that completely fucks up fragmentation so far.

Wrong theory but same result. The writes at the beginning just keep replacing a single extent over and over, which has a worst-case effect of adding a single fragment to the beginning of a file that would not otherwise be fragmented. The appends are causing fragmentation all by themselves. :-P

Lennart

-- Lennart Poettering, Red Hat
Re: price to pay for nocow file bit?
On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8...@umail.furryterror.org) wrote:
> On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
>> Currently, systemd-journald's disk access patterns (appending to the
>> end of files, then updating a few pointers in the front) result in
>> awfully fragmented journal files on btrfs [...]
>
> This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.

We already use fallocate(), but this is not enough on cow file systems.
With fallocate() you can certainly improve fragmentation when appending
things to a file. But on a COW file system this will help little if we
change things in the beginning of the file, since COW means that it
will then make a copy of those blocks and alter the copy, but leave the
original version unmodified. And if we do that all the time the files
get heavily fragmented, even though all the blocks we modify have been
fallocate()d initially...

> This would work on ext4, xfs, and others, and provide the same benefit
> (or even better) without filesystem-specific code. journald would
> preallocate a contiguous chunk past the end of the file for appends,
> and

That's precisely what we do. But journald's write pattern is not purely
appending to files, it's append something to the end, then link it up
in the beginning. And for the append part we are fine with fallocate().
It's the link up part that completely fucks up fragmentation so far.

Lennart

--
Lennart Poettering, Red Hat
Re: price to pay for nocow file bit?
On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
> Heya!
>
> Currently, systemd-journald's disk access patterns (appending to the
> end of files, then updating a few pointers in the front) result in
> awfully fragmented journal files on btrfs, which has a pretty negative
> effect on performance when accessing them.
>
> Now, to improve things a bit, I yesterday made a change to journald,
> to issue the btrfs defrag ioctl when a journal file is rotated,
> i.e. when we know that no further writes will be ever done on the
> file.
>
> However, I wonder now if I should go one step further even, and use
> the equivalent of chattr -C (i.e. nocow) on all journal files. I am
> wondering what price I would precisely have to pay for that. Judging
> by this earlier thread:
>
> http://www.spinics.net/lists/linux-btrfs/msg33134.html
>
> it's mostly about data integrity, which is something I can live with,
> given the conservative write patterns of journald, and the fact that
> we do our own checksumming and careful data validation. I mean, if
> btrfs in this mode provides no worse data integrity semantics than
> ext4 I am fully fine with losing this feature for these files.

This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.
This would work on ext4, xfs, and others, and provide the same benefit
(or even better) without filesystem-specific code. journald would
preallocate a contiguous chunk past the end of the file for appends,
and on btrfs the first write to each block will not be COWed or
compressed (I'm hand-waving away some details here related to small
writes, file tails, and inline storage, but the end result is the
same). If there's a configured target size for journals then allocate
that amount; otherwise, double the allocated size each time the
visible file size reaches a power of two so that the number of
fragments is logarithmic over file size.

This should get you what you want without all the dangerous messing
around with data integrity controls and defragmentation.
Defragmentation has a number of negative side-effects of its own: it
searches for free space aggressively and holds locks that can block
writes for a long time (I've learned the hard way that this can be
over 20 minutes for a 1GB file, long enough to trigger hardware
watchdog resets). There are some other good reasons to never
defragment, but they don't arise in journald's use cases.

I, for one, use btrfs scrub to detect data corruption that occurs
during early stages of disk failure. I'd object strongly to
applications randomly turning off data integrity features without
being explicitly configured to do so, especially those that do most of
the writing. It would create areas of the disk that are blind spots
when testing for storage corruption errors, and in journald's case
those blind spots would be among the most significant sources of data
about storage corruption. I don't really care if applications can
survive corrupted data--as the owner of the storage, I need to be
aware that storage-level corruption is happening. I don't want to have
to test different areas of the filesystem with a dozen different
application-specific tools. That particular insanity is one of the
reasons why I now use btrfs and not ext4.

> Hence I am mostly interested in what else is lost if this flag is
> turned on by default for all journal files journald creates: [...]
>
> I am trying to understand the pros and cons of turning this bit on,
> before I can make this change. So far I see one big pro, but I wonder
> if there's any major con I should think about?
>
> Thanks,
>
> Lennart
Re: price to pay for nocow file bit?
On 2015-01-08 19:24, Konstantinos Skarlatos wrote:
>> Anyway, given the pros and cons I have now changed journald to set
>> the nocow bit on newly created journal files. When files are rotated
>> (and we hence know we will never ever write again to them) the bit
>> is tried to be unset again, and a defrag ioctl will be invoked right
>> after. btrfs currently silently ignores that we unset the bit, and
>> leaves it set, but I figure I should try to unset it anyway, in case
>> it learns that one day. After all, after rotating the files there's
>> no reason to treat the files special anymore...
>
> Can this behaviour be optional? I dont mind some fragmentation if i
> can keep having checksums and the ability for raid 1 to repair those
> files.

I agree with Konstantinos's request: please make this behavior
optional.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: price to pay for nocow file bit?
On 01/08/2015 08:53 AM, Lennart Poettering wrote:
> this will help little if we change things in the beginning of the
> file,

Have you considered changing the format so that those pointers are
stored at the end of the file, letting data always be append only?
While it is traditional to have things at the beginning as headers,
there are formats like zip where metadata is stored at the end
instead, providing other benefits.

Roger
Re: price to pay for nocow file bit?
Josef Bacik posted on Wed, 07 Jan 2015 15:10:06 -0500 as excerpted:

>> Does this have any effect on functionality? As I understood snapshots
>> still work fine for files marked like that, and so do reflinks. Any
>> drawback functionality-wise? Apparently file compression support is
>> lost if the bit is set? (which I can live with too, journal files are
>> internally compressed anyway)
>
> Yeah no compression, no checksums. If you do reflink then you'll COW
> once and then the new COW will be nocow so it'll be fine. Same goes
> for snapshots. So you'll likely incur some fragmentation but less than
> before, but I'd measure to just make sure if it's that big of a deal.
>
>> What about performance? Do any operations get substantially slower by
>> setting this bit? For example, what happens if I take a snapshot of
>> files with this bit set and then modify the file, does this result in
>> a full (and hence slow) copy of the file on that occasion?
>
> Performance is the same.

The on-snapshot cow1 of an otherwise-nocow file is per-block (4096
bytes AFAIK), so there's some fragmentation, and it's slower. The
perfect storm situation is people doing automated per-minute snapshots
or similar (some people go to extremes with snapper or the like...),
in which case setting nocow often doesn't help a whole lot, depending
on how active the file-writing is, of course.

But for something like append-plus-pointer-update-pattern log files
with something like per-day snapshotting, nocow should at least in
theory help quite a bit, since the write-frequency and thus the
prevented cows should be MUCH higher than the daily snapshot and thus
the forced-block-cow1s.

FWIW, I'm systemd on btrfs here, but I use syslog-ng for my
non-volatile logs and have Storage=volatile in journald.conf, using
journald only for the current session, where unit status including
last-10-messages makes troubleshooting /so/ much easier. =:^) Once
past the current session, text logs are more useful to me, which is
where syslog-ng comes in. Each to its strength, and keeping the
journals from wearing the SSDs[1] is a very nice bonus. =:^)

[1] I can and do filter what syslog-ng writes, but couldn't find a way
to filter journald's writes, only queries/reads. That alone saves
writes for repeated noise I'm filtering out with syslog before it's
ever written, that journald would still be writing if I let it write
non-volatile.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: price to pay for nocow file bit?
>> I am trying to understand the pros and cons of turning this bit on,
>> before I can make this change. So far I see one big pro, but I
>> wonder if there's any major con I should think about?
>
> Nope there's no real con other than you don't get csums, but that
> doesn't really matter for you.

In a btrfs-raid setup, in case of a corrupted sector, is BTRFS able to
rebuild the sector? I suppose not; if so, this has to be added to the
cons, I think.

From my tests [1][2] I was unable to find any significant difference
between doing a defrag and setting chattr -C on the log directory. Did
you get other results? If so, I am interested to know more.

BR
G.Baroncelli

[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
[2] http://lists.freedesktop.org/archives/systemd-devel/2014-June/020141.html
Re: price to pay for nocow file bit?
On 01/07/2015 12:43 PM, Lennart Poettering wrote:
> Heya!
>
> Currently, systemd-journald's disk access patterns (appending to the
> end of files, then updating a few pointers in the front) result in
> awfully fragmented journal files on btrfs, which has a pretty
> negative effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this
problem but I haven't had the chance to look into it.

> Now, to improve things a bit, I yesterday made a change to journald,
> to issue the btrfs defrag ioctl when a journal file is rotated,
> i.e. when we know that no further writes will be ever done on the
> file.
>
> However, I wonder now if I should go one step further even, and use
> the equivalent of chattr -C (i.e. nocow) on all journal files. I am
> wondering what price I would precisely have to pay for that. Judging
> by this earlier thread:
>
> http://www.spinics.net/lists/linux-btrfs/msg33134.html
>
> it's mostly about data integrity, which is something I can live with,
> given the conservative write patterns of journald, and the fact that
> we do our own checksumming and careful data validation. I mean, if
> btrfs in this mode provides no worse data integrity semantics than
> ext4 I am fully fine with losing this feature for these files.

Yup, it's no worse than ext4.

> Hence I am mostly interested in what else is lost if this flag is
> turned on by default for all journal files journald creates:
>
> Does this have any effect on functionality? As I understood snapshots
> still work fine for files marked like that, and so do reflinks. Any
> drawback functionality-wise? Apparently file compression support is
> lost if the bit is set? (which I can live with too, journal files are
> internally compressed anyway)

Yeah no compression, no checksums. If you do reflink then you'll COW
once and then the new COW will be nocow so it'll be fine. Same goes
for snapshots. So you'll likely incur some fragmentation but less than
before, but I'd measure to just make sure if it's that big of a deal.

> What about performance? Do any operations get substantially slower by
> setting this bit? For example, what happens if I take a snapshot of
> files with this bit set and then modify the file, does this result in
> a full (and hence slow) copy of the file on that occasion?

Performance is the same.

> I am trying to understand the pros and cons of turning this bit on,
> before I can make this change. So far I see one big pro, but I wonder
> if there's any major con I should think about?

Nope, there's no real con other than you don't get csums, but that
doesn't really matter for you.

Thanks,

Josef
Re: price to pay for nocow file bit?
On 01/07/2015 04:05 PM, Goffredo Baroncelli wrote:
> In a btrfs-raid setup, in case of a corrupted sector, is BTRFS able
> to rebuild the sector? I suppose not; if so, this has to be added to
> the cons, I think.

It won't know it's corrupted, but it can rebuild if, say, you yank a
drive and add a new one. RAID5/RAID6 would catch corruption of course.

Thanks,

Josef
price to pay for nocow file bit?
Heya!

Currently, systemd-journald's disk access patterns (appending to the
end of files, then updating a few pointers in the front) result in
awfully fragmented journal files on btrfs, which has a pretty negative
effect on performance when accessing them.

Now, to improve things a bit, I yesterday made a change to journald to
issue the btrfs defrag ioctl when a journal file is rotated, i.e. when
we know that no further writes will ever be done on the file.

However, I wonder now if I should go one step further even, and use
the equivalent of chattr -C (i.e. nocow) on all journal files. I am
wondering what price I would precisely have to pay for that. Judging
by this earlier thread:

http://www.spinics.net/lists/linux-btrfs/msg33134.html

it's mostly about data integrity, which is something I can live with,
given the conservative write patterns of journald, and the fact that
we do our own checksumming and careful data validation. I mean, if
btrfs in this mode provides no worse data integrity semantics than
ext4, I am fully fine with losing this feature for these files.

Hence I am mostly interested in what else is lost if this flag is
turned on by default for all journal files journald creates:

Does this have any effect on functionality? As I understood, snapshots
still work fine for files marked like that, and so do reflinks. Any
drawback functionality-wise? Apparently file compression support is
lost if the bit is set? (Which I can live with too, journal files are
internally compressed anyway.)

What about performance? Do any operations get substantially slower by
setting this bit? For example, what happens if I take a snapshot of
files with this bit set and then modify the file, does this result in
a full (and hence slow) copy of the file on that occasion?

I am trying to understand the pros and cons of turning this bit on,
before I can make this change. So far I see one big pro, but I wonder
if there's any major con I should think about?

Thanks,

Lennart