Re: Slow startup of systemd-journal on BTRFS
On 06/14/2014 04:53 AM, Duncan wrote: Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as excerpted: On 06/13/2014 01:24 AM, Dave Chinner wrote: On Thu, Jun 12, 2014 at 12:37:13PM +, Duncan wrote: FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think actually pretty much equally bad without NOCOW set on the file. So maybe it's been fixed in systemd since the last time I looked. Yup: http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal- file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58 The reason it was changed? To save a syscall per append, not to prevent fragmentation of the file, which was the problem everyone was complaining about... thanks for pointing that. However I am performing my tests on a fedora 20 with systemd-208, which seems have this change Why? Because btrfs data blocks are 4 KiB. With COW, the effect for either 4 byte or 8 MiB file allocations is going to end up being the same, forcing (repeated until full) rewrite of each 4 KiB block into its own extent. I am reaching the conclusion that fallocate is not the problem. The fallocate increase the filesize of about 8MB, which is enough for some logging. So it is not called very often. But... If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with nodatacow), then an fallocate of 8 MiB will increase the file size by 8 MiB and write that out. So far so good as at that point the 8 MiB should be a single extent. But then, data gets written into 4 KiB blocks of that 8 MiB one at a time, and because btrfs is COW, the new data in the block must be written to a new location. Which effectively means that by the time the 8 MiB is filled, each 4 KiB block has been rewritten to a new location and is now an extent unto itself. So now that 8 MiB is composed of 2048 new extents, each one a single 4 KiB block in size. Several people pointed fallocate as the problem. But I don't understand the reason. 1) 8MB is a quite huge value, so fallocate is called (at worst) 1 time during the boot. Often never because the log are less than 8MB. 2) it is true that btrfs rewrite almost 2 times each 4kb page with fallocate. But the first time is a big write of 8MB; instead the second write would happen in any case. What I mean is that without the fallocate in any case journald would make small write. To be honest, I fatigue to see the gain of having a fallocate on a COW filesystem... may be that I don't understand very well the fallocate() call. =:^( Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the btrfs autodefrag mount option should at least catch that and rewrite (again), this time sequentially. I have to investigate more what happens when the log are copied from /run to /var/log/journal: this is when journald seems to slow all. That's an interesting point. At least in theory, during normal operation journald will write to /var/log/journal, but there's a point during boot at which it flushes the information accumulated during boot from the volatile /run location to the non-volatile /var/log location. /That/ write, at least, should be sequential, since there will be 4 KiB of journal accumulated that needs to be transferred at once. However, if it's being handled by the forced pre-write fallocate described above, then that's not going to be the case, as it'll then be a rewrite of already fallocated file blocks and thus will get COWed exactly as I described above. =:^( I am prepared a PC which reboot continuously; I am collecting the time required to finish the boot vs the fragmentation of the system.journal file vs the number of boot. The results are dramatic: after 20 reboot, the boot time increase of 20-30 seconds. Doing a defrag of system.journal reduces the boot time to the original one, but after another 20 reboot, the boot time still requires 20-30 seconds more It is a slow PC, but I saw the same behavior also on a more modern pc (i5 with 8GB). For both PC the HD is a mechanical one... The problem's duplicable. That's the first step toward a fix. =:^) I Hope so And that's now a btrfs problem :/ Are you sure ? As they say, Whoosh! [...] Another alternative is that distros will start setting /var/log/journal NOCOW in their setup scripts by default when it's btrfs, thus avoiding the problem. (Altho if they do automated snapshotting they'll also have to set it as its own subvolume, to avoid the first-write-after-snapshot- is-COW problem.) Well, that, and/or set autodefrag in the default mount options. Pay attention, that this remove also the checksum, which are very useful in a RAID configuration. Meanwhile, there's some focus on making btrfs behave better with such rewrite-pattern files, but while I think the problem can be made /some/ better, hopefully enough that the defaults bother far fewer people in far
Re: [PATCH 2/2] Remove a leading '+'.
On Fri, Jun 13, 2014 at 04:46:25PM -0500, Eric Sandeen wrote: -+ fprintf(stderr, \t-F : attempt to dump superblocks with bad magic\n); + fprintf(stderr, \t-F : attempt to dump superblocks with bad magic\n); Whoa, git blame blames me, but that's not the patch I sent. ;) My mistak while merging two conflicting show-super patches, sorry. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Transaction commit: none (default)
Taking snapshots is now output this in addition of the snapshot operation. Transaction commit: none (default) 1) Is that expected/normal? It looks kind of spamming/useless to me? 2) If it's useful, what's the use I'm not getting? Thanks, Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs balance crash BUG ON fs/btrfs/relocation.c:1062 or RIP build_backref_tree+0x9fc/0xcc4
On Mon, Jun 09, 2014 at 04:40:07PM -0700, Marc MERLIN wrote: I did a balance on a system that had 3.11 (yes, I know, it's old). It hung. So, I rebooted with 3.13, and it failed in fs/btrfs/relocation Problem #1: I cannot stop the relocation. It starts on its own as soon as I mount the FS, and I can't stop it. Is there a bug to fix that? Chris, As Russell pointed out, I can stop the balance at mount, but back to the bug, can you make btrfs not crash like this if it gets input it doesn't expect? Separately, I'll send one lsat offer to give data off the FS to give you a chance to figure out how it got into that state and potentially find another bug. That is unless there were one or more known corruption bugs that have been fixed and that there is nothing to be learned from looking into this. Thanks, Marc Problem #2: I rebooted with 3.15rc5, and now it's worse. [ 1569.598026] kernel BUG at fs/btrfs/relocation.c:1064! then leads to [ 1569.613240] RIP [81267ef0] build_backref_tree+0x9fc/0xcc4 [ 1569.613491] RSP 880fde599ad8 [ 1569.614119] ---[ end trace da0f24875bbde960 ]--- [ 1569.614398] Kernel panic - not syncing: Fatal exception (full trace below) I'm sure that filesystem is damaged in some way, but the kernel of course should not crash. 3.15 dies here: struct backref_node *build_backref_tree(struct reloc_control *rc, (...) if (!RB_EMPTY_NODE(upper-rb_node)) { if (upper-lowest) { list_del_init(upper-lower); upper-lowest = 0; } list_add_tail(edge-list[UPPER], upper-lower); continue; } BUG_ON(!upper-checked); here So I'm sure I hit a bug and my FS has issues, but can't the kernel do something better like abort the rebalance instead of crashing? In the meantime, does anyone want anything off that filesystem before I wipe it? = Crash on 3.13: btrfs: found 4930 extents btrfs: relocating block group 82699091968 flags 1 btrfs: found 3719 extents [ cut here ] kernel BUG at /build/buildd/linux-lts-trusty-3.13.0/fs/btrfs/relocation.c:1062! invalid opcode: [#1] SMP Modules linked in: rfcomm parport_pc ppdev bnep binfmt_misc rpcsec_gss_krb5 nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc snd_hda_codec_hdmi nvidia(POF) snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_usb_audio snd_pcm snd_hwdep snd_usbmidi_lib snd_seq_midi snd_seq_midi_event snd_seq btusb bluetooth uvcvideo videobuf2_core videodev videobuf2_vmalloc videobuf2_memops snd_rawmidi snd_timer snd_seq_device drm snd psmouse gpio_ich sb_edac hp_wmi serio_raw edac_core mei_me mei mac_hid soundcore sparse_keymap snd_page_alloc lpc_ich wmi tpm_infineon lp parport btrfs raid6_pq xor libcrc32c hid_generic usbhid hid usb_storage firewire_ohci isci e1000e firewire_core libsas crc_itu_t ptp ahci libahci pps_core scsi_transport_sas CPU: 4 PID: 1710 Comm: btrfs-balance Tainted: PF O 3.13.0-29-generic #53~precise1-Ubuntu Hardware name: Hewlett-Packard HP Z620 Workstation/158A, BIOS J61 v01.17 11/05/2012 task: 881000bec7d0 ti: 8810018a2000 task.ti: 8810018a2000 RIP: 0010:[a01c04a8] [a01c04a8] build_backref_tree+0x1228/0x1290 [btrfs] RSP: 0018:8810018a3ab8 EFLAGS: 00010246 RAX: RBX: 880801a8fa20 RCX: 880801d2cf50 RDX: 880801d2c390 RSI: 880801d2c640 RDI: 8807ecad9c80 RBP: 8810018a3bb8 R08: 8807ecad9c80 R09: 0001 R10: 0001 R11: R12: 880801d2c650 R13: 8807ecad9900 R14: R15: 88003582a800 FS: () GS:88080fc8() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f10ee50a370 CR3: 01c0d000 CR4: 000407e0 Stack: 8807ecad9780 01ffa01bded0 880801d2c620 88003582a920 8807ecad9c80 8807ff72e800 8807ecad9240 8807ecad9200 880801a8fa20 880801a8fab0 88003582a924 88003582a820 Call Trace: [a01c064b] relocate_tree_blocks+0x13b/0x1e0 [btrfs] [a01c12b9] relocate_block_group+0x199/0x550 [btrfs] [a01c182a] btrfs_relocate_block_group+0x1ba/0x300 [btrfs] [a01999f6] btrfs_relocate_chunk.isra.62+0x56/0x3f0 [btrfs] [a01556b3] ? block_group_cache_tree_search+0xb3/0xf0 [btrfs] [a018dfe6] ? release_extent_buffer+0x36/0xe0 [btrfs] [a019c60c] __btrfs_balance+0x32c/0x420 [btrfs] [a019ca38] btrfs_balance+0x338/0x5d0 [btrfs] [a019cd54] balance_kthread+0x84/0x90 [btrfs] [a019ccd0] ? btrfs_balance+0x5d0/0x5d0 [btrfs] [8108f9a9] kthread+0xc9/0xe0 [8108f8e0] ? flush_kthread_worker+0xb0/0xb0 [817665fc] ret_from_fork+0x7c/0xb0 [8108f8e0] ?
Re: Slow startup of systemd-journal on BTRFS
Duncan 1i5t5.dun...@cox.net schrieb: As they say, Whoosh! At least here, I interpreted that remark as primarily sarcastic commentary on the systemd devs' apparent attitude, which can be (controversially) summarized as: Systemd doesn't have problems because it's perfect. Therefore, any problems you have with systemd must instead be with other components which systemd depends on. Come on, sorry, but this is fud. Really... ;-) IOW, it's a btrfs problem now in practice, not because it is so in a technical sense, but because systemd defines it as such and is unlikely to budge, so the only way to achieve progress is for btrfs to deal with it. I think that systemd is even one of the early supporters of btrfs because it will defragment readahead files on boot from btrfs. I'd suggest the problem is to be found in the different semantics with COW filesystems. And if someone loudly complains to the systemd developers how bad they are at doing their stuff - hmm, well, I would be disapointed/offended, too, as a programmer because much very well done work has been put into systemd and I'd start ignoring such people. In Germany we have a saying for this: Wie man in den Wald hineinruft, so schallt es heraus. [1] They are doing many things right that have not been adopted to modern systems in the last twenty years (or so) with the legacy init systems. So let's start with my journals, on btrfs: $ sudo filefrag * system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found system@806cd49faa074a49b6cde5ff6fca8adc-0008e4cc-0004f82580cdcb45.journal: 5 extents found system@806cd49faa074a49b6cde5ff6fca8adc-00097959-0004f89c2e8aff87.journal: 5 extents found system@806cd49faa074a49b6cde5ff6fca8adc-000a166d-0004f98d7e04157c.journal: 5 extents found system@806cd49faa074a49b6cde5ff6fca8adc-000aad59-0004fa379b9a1fdf.journal: 5 extents found system@ec16f60db38f43619f8337153a1cc024-0001-0004fae8e5057259.journal: 5 extents found system@ec16f60db38f43619f8337153a1cc024-92b1-0004fb59b1d034ad.journal: 5 extents found system.journal: 9 extents found user-500@e4209c6628ed4a65954678b8011ad73f-00085b7a-0004f77d25ebba04.journal: 2 extents found user-500@e4209c6628ed4a65954678b8011ad73f-0008e7fb-0004f83c7bf18294.journal: 2 extents found user-500@e4209c6628ed4a65954678b8011ad73f-00097fe4-0004f8ae69c198ca.journal: 2 extents found user-500@e4209c6628ed4a65954678b8011ad73f-000a1a7e-0004f9966e9c69d8.journal: 2 extents found user-500.journal: 2 extents found I don't think these are too bad values, eh? Well, how did I accomblish that? First, I've set the journal directories nocow. Of course, systemd should do this by default. I'm not sure if this is a packaging or systemd code issue, tho. But I think the systemd devs are in common that for cow fs, the journal directories should be set nocow. After all, the journal is a transactional database - it does not need cow protection at all costs. And I think they have their own checksumming protection. So, why let systemd bother with that? A lot of other software has the same semantic problems with btrfs, too (ex. MySQL) where nobody shouts at the inabilities of the programmers. So why for systemd? Just because it's intrusive by its nature for being a radically and newly designed init system and thus requires some learning by its users/admins/packagers? Really? Come on... As admin and/or packager you have to stay with current technologies and developments anyways. It's only important to hide the details from the users. Back to the extents counts: What I did next was implementing a defrag job that regularly defrags the journal (actually, the complete log directory as other log files suffer the same problem): $ cat /usr/local/sbin/defrag-logs.rb #!/bin/sh exec btrfs filesystem defragment -czlib -r /var/log It can be easily converted into a timer job with systemd. This is left as an excercise to the reader. BTW: Actually, that job isn't currently executed on my system which makes the numbers above pretty impressive... However, autodefrag is turned on which may play into the mix. I'm not sure. I stopped automatically running those defrag jobs a while ago (I have a few more). An arguably fairer and more impartial assessment of this particular situations suggests that neither btrfs, which as a COW-based filesystem, like all COW-based filesystems has the existing-file-rewrite as a major technical challenge that it must deal with /somehow/, nor systemd, which in choosing to use fallocate is specifically putting itself in that existing-file-rewrite class,
Re: Slow startup of systemd-journal on BTRFS
Kai Krakow posted on Sat, 14 Jun 2014 12:59:31 +0200 as excerpted: Duncan 1i5t5.dun...@cox.net schrieb: As they say, Whoosh! At least here, I interpreted that remark as primarily sarcastic commentary on the systemd devs' apparent attitude, which can be (controversially) summarized as: Systemd doesn't have problems because it's perfect. Therefore, any problems you have with systemd must instead be with other components which systemd depends on. Come on, sorry, but this is fud. Really... ;-) I should make clear that did recently switch to systemd myself -- by choice as I'm on gentoo and it defaults to openrc, so I'm obviously not entirely anti-systemd. And I _did_ say (controversially), which means I do recognize that there are two sides to the story. That said, I've certainly seen what happens when non-systemd devs are on the receiving end of things -- including kernel devs, see the recent hubbub over the debug kernel commandline option and the somewhat longer ago firmware loading issue, among others. But sarcasm implies neither absolute truth, or it'd be speaking truth not sarcasm, nor absolute untruth, because without a kernel of truth there it'd be simply stupid, not sarcastic. And certainly that's how I read the comment. But in any case, I read it as not to be taken literally, which is what it appeared to me the person to which I was directly replying was doing. I was simply warning him off of reading it too literally as at least from here, it seemed more like sarcasm. Of course if DaveC wishes to clarify one way or another he certainly can... tho I'd not expect it at this point since if it is sarcasm as I believe, that's kind of having to explain the joke... I think that systemd is even one of the early supporters of btrfs because it will defragment readahead files on boot from btrfs. I'd suggest the problem is to be found in the different semantics with COW filesystems. Which is actually what I was saying. In reality it's an interaction between the nature of COW filesystems, where fallocate tends to be a problem, and an application using fallocate because of its benefits on overwrite-in-place filesystems, which happen to be the norm at this point. So neither one is to blame, it's simply a bad interaction that ultimately needs to be made better on one side or the other. But btrfs is still relatively young and COW-based filesystems not that widespread yet, so that the problem hasn't been worked out to be handled automatically, just yet, isn't a big surprise. Tho I think it'll come. Meanwhile, as you point out below and as I've repeatedly said in this thread already myself, NOCOW and/or autodefrag are tools available to an admin faced with the problem, that together actually solve it reasonably well. All an admin has to do is make use of the tools already available. =:^) So let's start with my journals, on btrfs: $ sudo filefrag * system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found system@806cd49faa074a49b6cde5ff6fca8adc-0008e4cc-0004f82580cdcb45.journal: 5 extents found system@806cd49faa074a49b6cde5ff6fca8adc-00097959-0004f89c2e8aff87.journal: 5 extents found system@806cd49faa074a49b6cde5ff6fca8adc-000a166d-0004f98d7e04157c.journal: 5 extents found system@806cd49faa074a49b6cde5ff6fca8adc-000aad59-0004fa379b9a1fdf.journal: 5 extents found system@ec16f60db38f43619f8337153a1cc024-0001-0004fae8e5057259.journal: 5 extents found system@ec16f60db38f43619f8337153a1cc024-92b1-0004fb59b1d034ad.journal: 5 extents found system.journal: 9 extents found user-500@e4209c6628ed4a65954678b8011ad73f-00085b7a-0004f77d25ebba04.journal: 2 extents found user-500@e4209c6628ed4a65954678b8011ad73f-0008e7fb-0004f83c7bf18294.journal: 2 extents found user-500@e4209c6628ed4a65954678b8011ad73f-00097fe4-0004f8ae69c198ca.journal: 2 extents found user-500@e4209c6628ed4a65954678b8011ad73f-000a1a7e-0004f9966e9c69d8.journal: 2 extents found user-500.journal: 2 extents found I don't think these are too bad values, eh? Well, how did I accomblish that? First, I've set the journal directories nocow. Of course, systemd should do this by default. I'm not sure if this is a packaging or systemd code issue, tho. But I think the systemd devs are in common that for cow fs, the journal directories should be set nocow. After all, the journal is a transactional database - it does not need cow protection at all costs. And I think they have their own checksumming protection. So, why let systemd bother with that? A lot of other software has the same semantic problems with btrfs, too (ex. MySQL) where nobody shouts at the inabilities of the programmers. So why for systemd? Just because it's intrusive by its nature for being a radically and newly designed init system and thus
Re: Slow startup of systemd-journal on BTRFS
Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as excerpted: On 06/14/2014 04:53 AM, Duncan wrote: Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as excerpted: thanks for pointing that. However I am performing my tests on a fedora 20 with systemd-208, which seems have this change I am reaching the conclusion that fallocate is not the problem. The fallocate increase the filesize of about 8MB, which is enough for some logging. So it is not called very often. Right. But... Exactly, _but_... [A]n fallocate of 8 MiB will increase the file size by 8 MiB and write that out. So far so good as at that point the 8 MiB should be a single extent. But then, data gets written into 4 KiB blocks of that 8 MiB one at a time, and because btrfs is COW, the new data in the block must be written to a new location. Which effectively means that by the time the 8 MiB is filled, each 4 KiB block has been rewritten to a new location and is now an extent unto itself. So now that 8 MiB is composed of 2048 new extents, each one a single 4 KiB block in size. Several people pointed fallocate as the problem. But I don't understand the reason. 1) 8MB is a quite huge value, so fallocate is called (at worst) 1 time during the boot. Often never because the log are less than 8MB. 2) it is true that btrfs rewrite almost 2 times each 4kb page with fallocate. But the first time is a big write of 8MB; instead the second write would happen in any case. What I mean is that without the fallocate in any case journald would make small write. To be honest, I fatigue to see the gain of having a fallocate on a COW filesystem... may be that I don't understand very well the fallocate() call. The base problem isn't fallocate per se, rather, tho it's the trigger in this case. The base problem is that for COW-based filesystems, *ANY* rewriting of existing file content results in fragmentation. It just so happens that the only reason there's existing file content to be rewritten (as opposed to simply appending) in this case, is because of the fallocate. The rewrite of existing file content is the problem, but the existing file content is only there in this case because of the fallocate. Taking a step back... On a non-COW filesystem, allocating 8 MiB ahead and writing into it rewrites into the already allocated location, thus guaranteeing extents of 8 MiB each, since once the space is allocated it's simply rewritten in- place. Thus, on a non-COW filesystem, pre-allocating in something larger than single filesystem blocks when an app knows the data is eventually going to be written in to fill that space anyway is a GOOD thing, which is why systemd is doing it. But on a COW-based filesystem fallocate is the exact opposite, a BAD thing, because an fallocate forces the file to be written out at that size, effectively filled with nulls/blanks. Then the actual logging comes along and rewrites those nulls/blanks with actual data, but it's now a rewrite, which on a COW, copy-on-write, based filesystem, the rewritten block is copied elsewhere, it does NOT overwrite the existing null/blank block, and elsewhere by definition means detached from the previous blocks, thus in an extent all by itself. Once the full 2048 original blocks composing that 8 MiB are filled in with actual data, because they were rewrites from null/blank blocks that fallocate had already forced to be allocated, that's now 2048 separate extents, 2048 separate file fragments, where without the forced fallocate, the writes would have all been appends, and there would have been at least /some/ chance of some of those 2048 separate blocks being written at close enough to the same time that they would have been written together as a single extent. So while the 8 MiB might not have been a single extent as opposed to 2048 separate extents, it might have been perhaps 512 or 1024 extents, instead of the 2048 that it ended up being because fallocate meant that each block was a rewrite into an existing file, not a new append-write at the end of an existing file. [...] Another alternative is that distros will start setting /var/log/journal NOCOW in their setup scripts by default when it's btrfs, thus avoiding the problem. (Altho if they do automated snapshotting they'll also have to set it as its own subvolume, to avoid the first-write-after-snapshot- is-COW problem.) Well, that, and/or set autodefrag in the default mount options. Pay attention, that this remove also the checksum, which are very useful in a RAID configuration. Well, it can be. But this is only log data, not executable or the like data, and (as Kai K points out) journald has its own checksumming method in any case. Besides which, you still haven't explained why you can't either set the autodefrag mount option and be done with it, or run a systemd-timer- triggered or cron-triggered defrag script to defrag them