Re: Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Goffredo Baroncelli
On 06/14/2014 04:53 AM, Duncan wrote:
 Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
 excerpted:
 
 On 06/13/2014 01:24 AM, Dave Chinner wrote:
 On Thu, Jun 12, 2014 at 12:37:13PM +, Duncan wrote:

 FWIW, either 4 byte or 8 MiB fallocate calls would be bad, I think
 actually pretty much equally bad without NOCOW set on the file.

 So maybe it's been fixed in systemd since the last time I looked.
 Yup:

 http://cgit.freedesktop.org/systemd/systemd/commit/src/journal/journal-
 file.c?id=eda4b58b50509dc8ad0428a46e20f6c5cf516d58

 The reason it was changed? To save a syscall per append, not to
 prevent fragmentation of the file, which was the problem everyone was
 complaining about...

 thanks for pointing that. However I am performing my tests on a fedora
 20 with systemd-208, which seems have this change

 Why?  Because btrfs data blocks are 4 KiB.  With COW, the effect for
 either 4 byte or 8 MiB file allocations is going to end up being the
 same, forcing (repeated until full) rewrite of each 4 KiB block into
 its own extent.

 I am reaching the conclusion that fallocate is not the problem. The
 fallocate increase the filesize of about 8MB, which is enough for some
 logging. So it is not called very often.
 
 But... 
 
 If a file isn't (properly[1]) set NOCOW (and the btrfs isn't mounted with 
 nodatacow), then an fallocate of 8 MiB will increase the file size by 8 
 MiB and write that out.  So far so good as at that point the 8 MiB should 
 be a single extent.  But then, data gets written into 4 KiB blocks of 
 that 8 MiB one at a time, and because btrfs is COW, the new data in the 
 block must be written to a new location.
 
 Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
 block has been rewritten to a new location and is now an extent unto 
 itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
 single 4 KiB block in size.

Several people pointed fallocate as the problem. But I don't understand the 
reason.
1) 8MB is a quite huge value, so fallocate is called (at worst) 1 time during 
the boot. Often never because the log are less than 8MB.
2) it is true that btrfs rewrite almost 2 times each 4kb page with fallocate. 
But the first time is a big write of 8MB; instead the second write would 
happen in any case. What I mean is that without the fallocate in any case 
journald would make small write.

To be honest, I fatigue to see the gain of having a fallocate on a COW 
filesystem... may be that I don't understand very well the fallocate() call.

 
 =:^(
 
 Tho as I already stated, for file sizes upto 128 MiB or so anyway[2], the 
 btrfs autodefrag mount option should at least catch that and rewrite 
 (again), this time sequentially.
 
 I have to investigate more what happens when the log are copied from
 /run to /var/log/journal: this is when journald seems to slow all.
 
 That's an interesting point.
 
 At least in theory, during normal operation journald will write to 
 /var/log/journal, but there's a point during boot at which it flushes the 
 information accumulated during boot from the volatile /run location to 
 the non-volatile /var/log location.  /That/ write, at least, should be 
 sequential, since there will be  4 KiB of journal accumulated that needs 
 to be transferred at once.  However, if it's being handled by the forced 
 pre-write fallocate described above, then that's not going to be the 
 case, as it'll then be a rewrite of already fallocated file blocks and 
 thus will get COWed exactly as I described above.
 
 =:^(
 
 
 I am prepared a PC which reboot continuously; I am collecting the time
 required to finish the boot vs the fragmentation of the system.journal
 file vs the number of boot. The results are dramatic: after 20 reboot,
 the boot time increase of 20-30 seconds. Doing a defrag of
 system.journal reduces the boot time to the original one, but after
 another 20 reboot, the boot time still requires 20-30 seconds more

 It is a slow PC, but I saw the same behavior also on a more modern pc
 (i5 with 8GB).

 For both PC the HD is a mechanical one...
 
 The problem's duplicable.  That's the first step toward a fix. =:^)
I Hope so
 
 And that's now a btrfs problem :/

 Are you sure ?
 
 As they say, Whoosh!
 
[...] 
 Another alternative is that distros will start setting /var/log/journal 
 NOCOW in their setup scripts by default when it's btrfs, thus avoiding 
 the problem.  (Altho if they do automated snapshotting they'll also have 
 to set it as its own subvolume, to avoid the first-write-after-snapshot-
 is-COW problem.)  Well, that, and/or set autodefrag in the default mount 
 options.

Pay attention, that this remove also the checksum, which are very useful in a 
RAID configuration.

 
 Meanwhile, there's some focus on making btrfs behave better with such 
 rewrite-pattern files, but while I think the problem can be made /some/ 
 better, hopefully enough that the defaults bother far fewer people in far 

Re: [PATCH 2/2] Remove a leading '+'.

2014-06-14 Thread David Sterba
On Fri, Jun 13, 2014 at 04:46:25PM -0500, Eric Sandeen wrote:
  -+  fprintf(stderr, \t-F : attempt to dump superblocks with bad magic\n);
  +   fprintf(stderr, \t-F : attempt to dump superblocks with bad magic\n);
 Whoa, git blame blames me, but that's not the patch I sent.  ;)

My mistak while merging two conflicting show-super patches, sorry.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Transaction commit: none (default)

2014-06-14 Thread Marc MERLIN
Taking snapshots is now output this in addition of the snapshot
operation.

Transaction commit: none (default)

1) Is that expected/normal? It looks kind of spamming/useless to me?
2) If it's useful, what's the use I'm not getting?

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs balance crash BUG ON fs/btrfs/relocation.c:1062 or RIP build_backref_tree+0x9fc/0xcc4

2014-06-14 Thread Marc MERLIN
On Mon, Jun 09, 2014 at 04:40:07PM -0700, Marc MERLIN wrote:
 I did a balance on a system that had 3.11 (yes, I know, it's old). It hung.
 So, I rebooted with 3.13, and it failed in fs/btrfs/relocation
 
 Problem #1: I cannot stop the relocation. It starts on its own as soon as I
 mount the FS, and I can't stop it.
 Is there a bug to fix that?

Chris,

As Russell pointed out, I can stop the balance at mount, but back to the
bug, can you make btrfs not crash like this if it gets input it doesn't
expect?

Separately, I'll send one lsat offer to give data off the FS to give you
a chance to figure out how it got into that state and potentially find
another bug.
That is unless there were one or more known corruption bugs that have
been fixed and that there is nothing to be learned from looking into
this.

Thanks,
Marc
 
 Problem #2: I rebooted with 3.15rc5, and now it's worse.
 [ 1569.598026] kernel BUG at fs/btrfs/relocation.c:1064!
 then leads to
 [ 1569.613240] RIP  [81267ef0] build_backref_tree+0x9fc/0xcc4
 [ 1569.613491]  RSP 880fde599ad8
 [ 1569.614119] ---[ end trace da0f24875bbde960 ]---
 [ 1569.614398] Kernel panic - not syncing: Fatal exception
 (full trace below)
 
 I'm sure that filesystem is damaged in some way, but the kernel of course
 should not crash.
 
 3.15 dies here:
 struct backref_node *build_backref_tree(struct reloc_control *rc,
 (...)
   if (!RB_EMPTY_NODE(upper-rb_node)) {
   if (upper-lowest) {
   list_del_init(upper-lower);
   upper-lowest = 0;
   }
 
   list_add_tail(edge-list[UPPER], upper-lower);
   continue;
   }
 
   BUG_ON(!upper-checked);  here
 
 So I'm sure I hit a bug and my FS has issues, but can't the kernel do 
 something better
 like abort the rebalance instead of crashing?
 
 In the meantime, does anyone want anything off that filesystem before I wipe 
 it?
 
 
 = Crash on 3.13:
 btrfs: found 4930 extents
 btrfs: relocating block group 82699091968 flags 1
 btrfs: found 3719 extents
 [ cut here ]
 kernel BUG at 
 /build/buildd/linux-lts-trusty-3.13.0/fs/btrfs/relocation.c:1062!
 invalid opcode:  [#1] SMP 
 Modules linked in: rfcomm parport_pc ppdev bnep binfmt_misc rpcsec_gss_krb5 
 nfsd nfs_acl auth_rpcgss nfs fscache lockd sunrpc snd_hda_codec_hdmi 
 nvidia(POF) snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_usb_audio 
 snd_pcm snd_hwdep snd_usbmidi_lib snd_seq_midi snd_seq_midi_event snd_seq 
 btusb bluetooth uvcvideo videobuf2_core videodev videobuf2_vmalloc 
 videobuf2_memops snd_rawmidi snd_timer snd_seq_device drm snd psmouse 
 gpio_ich sb_edac hp_wmi serio_raw edac_core mei_me mei mac_hid soundcore 
 sparse_keymap snd_page_alloc lpc_ich wmi tpm_infineon lp parport btrfs 
 raid6_pq xor libcrc32c hid_generic usbhid hid usb_storage firewire_ohci isci 
 e1000e firewire_core libsas crc_itu_t ptp ahci libahci pps_core 
 scsi_transport_sas
 CPU: 4 PID: 1710 Comm: btrfs-balance Tainted: PF  O 3.13.0-29-generic 
 #53~precise1-Ubuntu
 Hardware name: Hewlett-Packard HP Z620 Workstation/158A, BIOS J61 v01.17 
 11/05/2012
 task: 881000bec7d0 ti: 8810018a2000 task.ti: 8810018a2000
 RIP: 0010:[a01c04a8]  [a01c04a8] 
 build_backref_tree+0x1228/0x1290 [btrfs]
 RSP: 0018:8810018a3ab8  EFLAGS: 00010246
 RAX:  RBX: 880801a8fa20 RCX: 880801d2cf50
 RDX: 880801d2c390 RSI: 880801d2c640 RDI: 8807ecad9c80
 RBP: 8810018a3bb8 R08: 8807ecad9c80 R09: 0001
 R10: 0001 R11:  R12: 880801d2c650
 R13: 8807ecad9900 R14:  R15: 88003582a800
 FS:  () GS:88080fc8() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 7f10ee50a370 CR3: 01c0d000 CR4: 000407e0
 Stack:
  8807ecad9780 01ffa01bded0 880801d2c620 88003582a920
  8807ecad9c80 8807ff72e800 8807ecad9240 8807ecad9200
  880801a8fa20 880801a8fab0 88003582a924 88003582a820
 Call Trace:
  [a01c064b] relocate_tree_blocks+0x13b/0x1e0 [btrfs]
  [a01c12b9] relocate_block_group+0x199/0x550 [btrfs]
  [a01c182a] btrfs_relocate_block_group+0x1ba/0x300 [btrfs]
  [a01999f6] btrfs_relocate_chunk.isra.62+0x56/0x3f0 [btrfs]
  [a01556b3] ? block_group_cache_tree_search+0xb3/0xf0 [btrfs]
  [a018dfe6] ? release_extent_buffer+0x36/0xe0 [btrfs]
  [a019c60c] __btrfs_balance+0x32c/0x420 [btrfs]
  [a019ca38] btrfs_balance+0x338/0x5d0 [btrfs]
  [a019cd54] balance_kthread+0x84/0x90 [btrfs]
  [a019ccd0] ? btrfs_balance+0x5d0/0x5d0 [btrfs]
  [8108f9a9] kthread+0xc9/0xe0
  [8108f8e0] ? flush_kthread_worker+0xb0/0xb0
  [817665fc] ret_from_fork+0x7c/0xb0
  [8108f8e0] ? 

Re: Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Kai Krakow
Duncan 1i5t5.dun...@cox.net schrieb:

 As they say, Whoosh!
 
 At least here, I interpreted that remark as primarily sarcastic
 commentary on the systemd devs' apparent attitude, which can be
 (controversially) summarized as: Systemd doesn't have problems because
 it's perfect.  Therefore, any problems you have with systemd must instead
 be with other components which systemd depends on.

Come on, sorry, but this is fud. Really... ;-)

 IOW, it's a btrfs problem now in practice, not because it is so in a
 technical sense, but because systemd defines it as such and is unlikely
 to budge, so the only way to achieve progress is for btrfs to deal with
 it.

I think that systemd is even one of the early supporters of btrfs because it 
will defragment readahead files on boot from btrfs. I'd suggest the problem 
is to be found in the different semantics with COW filesystems. And if 
someone loudly complains to the systemd developers how bad they are at doing 
their stuff - hmm, well, I would be disapointed/offended, too, as a 
programmer because much very well done work has been put into systemd and 
I'd start ignoring such people. In Germany we have a saying for this: Wie 
man in den Wald hineinruft, so schallt es heraus. [1] They are doing many 
things right that have not been adopted to modern systems in the last twenty 
years (or so) with the legacy init systems.

So let's start with my journals, on btrfs:

$ sudo filefrag *
system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-0008e4cc-0004f82580cdcb45.journal:
 
5 extents found
system@806cd49faa074a49b6cde5ff6fca8adc-00097959-0004f89c2e8aff87.journal:
 
5 extents found 
 
system@806cd49faa074a49b6cde5ff6fca8adc-000a166d-0004f98d7e04157c.journal:
 
5 extents found 
 
system@806cd49faa074a49b6cde5ff6fca8adc-000aad59-0004fa379b9a1fdf.journal:
 
5 extents found
system@ec16f60db38f43619f8337153a1cc024-0001-0004fae8e5057259.journal:
 
5 extents found
system@ec16f60db38f43619f8337153a1cc024-92b1-0004fb59b1d034ad.journal:
 
5 extents found
system.journal: 9 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00085b7a-0004f77d25ebba04.journal:
 
2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-0008e7fb-0004f83c7bf18294.journal:
 
2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-00097fe4-0004f8ae69c198ca.journal:
 
2 extents found
user-500@e4209c6628ed4a65954678b8011ad73f-000a1a7e-0004f9966e9c69d8.journal:
 
2 extents found
user-500.journal: 2 extents found

I don't think these are too bad values, eh?

Well, how did I accomblish that?

First, I've set the journal directories nocow. Of course, systemd should do 
this by default. I'm not sure if this is a packaging or systemd code issue, 
tho. But I think the systemd devs are in common that for cow fs, the journal 
directories should be set nocow. After all, the journal is a transactional 
database - it does not need cow protection at all costs. And I think they 
have their own checksumming protection. So, why let systemd bother with 
that? A lot of other software has the same semantic problems with btrfs, too 
(ex. MySQL) where nobody shouts at the inabilities of the programmers. So 
why for systemd? Just because it's intrusive by its nature for being a 
radically and newly designed init system and thus requires some learning by 
its users/admins/packagers? Really? Come on... As admin and/or packager you 
have to stay with current technologies and developments anyways. It's only 
important to hide the details from the users.

Back to the extents counts: What I did next was implementing a defrag job 
that regularly defrags the journal (actually, the complete log directory as 
other log files suffer the same problem):

$ cat /usr/local/sbin/defrag-logs.rb 
#!/bin/sh
exec btrfs filesystem defragment -czlib -r /var/log

It can be easily converted into a timer job with systemd. This is left as an 
excercise to the reader.

BTW: Actually, that job isn't currently executed on my system which makes 
the numbers above pretty impressive... However, autodefrag is turned on 
which may play into the mix. I'm not sure. I stopped automatically running 
those defrag jobs a while ago (I have a few more).
 
 An arguably fairer and more impartial assessment of this particular
 situations suggests that neither btrfs, which as a COW-based filesystem,
 like all COW-based filesystems has the existing-file-rewrite as a major
 technical challenge that it must deal with /somehow/, nor systemd, which
 in choosing to use fallocate is specifically putting itself in that
 existing-file-rewrite class, 

Re: Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Duncan
Kai Krakow posted on Sat, 14 Jun 2014 12:59:31 +0200 as excerpted:

 Duncan 1i5t5.dun...@cox.net schrieb:
 
 As they say, Whoosh!
 
 At least here, I interpreted that remark as primarily sarcastic
 commentary on the systemd devs' apparent attitude, which can be
 (controversially) summarized as: Systemd doesn't have problems because
 it's perfect.  Therefore, any problems you have with systemd must
 instead be with other components which systemd depends on.
 
 Come on, sorry, but this is fud. Really... ;-)

I should make clear that did recently switch to systemd myself -- by 
choice as I'm on gentoo and it defaults to openrc, so I'm obviously not 
entirely anti-systemd.  And I _did_ say (controversially), which means 
I do recognize that there are two sides to the story.

That said, I've certainly seen what happens when non-systemd devs are on 
the receiving end of things -- including kernel devs, see the recent 
hubbub over the debug kernel commandline option and the somewhat longer 
ago firmware loading issue, among others.

But sarcasm implies neither absolute truth, or it'd be speaking truth not 
sarcasm, nor absolute untruth, because without a kernel of truth there 
it'd be simply stupid, not sarcastic.

And certainly that's how I read the comment.

But in any case, I read it as not to be taken literally, which is what it 
appeared to me the person to which I was directly replying was doing.  I 
was simply warning him off of reading it too literally as at least from 
here, it seemed more like sarcasm.

Of course if DaveC wishes to clarify one way or another he certainly 
can... tho I'd not expect it at this point since if it is sarcasm as I 
believe, that's kind of having to explain the joke...

 I think that systemd is even one of the early supporters of btrfs
 because it will defragment readahead files on boot from btrfs. I'd
 suggest the problem is to be found in the different semantics with COW
 filesystems.

Which is actually what I was saying.  In reality it's an interaction 
between the nature of COW filesystems, where fallocate tends to be a 
problem, and an application using fallocate because of its benefits on 
overwrite-in-place filesystems, which happen to be the norm at this 
point.  So neither one is to blame, it's simply a bad interaction that 
ultimately needs to be made better on one side or the other.

But btrfs is still relatively young and COW-based filesystems not that 
widespread yet, so that the problem hasn't been worked out to be handled 
automatically, just yet, isn't a big surprise.  Tho I think it'll come.

Meanwhile, as you point out below and as I've repeatedly said in this 
thread already myself, NOCOW and/or autodefrag are tools available to an 
admin faced with the problem, that together actually solve it reasonably 
well.  All an admin has to do is make use of the tools already available. 
=:^)

 So let's start with my journals, on btrfs:
 
 $ sudo filefrag *
 system@0004fad12dae7676-98627a3d7df4e35e.journal~: 2 extents found
 system@0004fae8ea4b84a4-3a2dc4a93c5f7dc9.journal~: 2 extents found
 
system@806cd49faa074a49b6cde5ff6fca8adc-0008e4cc-0004f82580cdcb45.journal:
 5 extents found
 
system@806cd49faa074a49b6cde5ff6fca8adc-00097959-0004f89c2e8aff87.journal:
 5 extents found
 
system@806cd49faa074a49b6cde5ff6fca8adc-000a166d-0004f98d7e04157c.journal:
 5 extents found
 
system@806cd49faa074a49b6cde5ff6fca8adc-000aad59-0004fa379b9a1fdf.journal:
 5 extents found
 
system@ec16f60db38f43619f8337153a1cc024-0001-0004fae8e5057259.journal:
 5 extents found
 
system@ec16f60db38f43619f8337153a1cc024-92b1-0004fb59b1d034ad.journal:
 5 extents found system.journal: 9 extents found
 
user-500@e4209c6628ed4a65954678b8011ad73f-00085b7a-0004f77d25ebba04.journal:
 2 extents found
 
user-500@e4209c6628ed4a65954678b8011ad73f-0008e7fb-0004f83c7bf18294.journal:
 2 extents found
 
user-500@e4209c6628ed4a65954678b8011ad73f-00097fe4-0004f8ae69c198ca.journal:
 2 extents found
 
user-500@e4209c6628ed4a65954678b8011ad73f-000a1a7e-0004f9966e9c69d8.journal:
 2 extents found user-500.journal: 2 extents found
 
 I don't think these are too bad values, eh?
 
 Well, how did I accomblish that?
 
 First, I've set the journal directories nocow. Of course, systemd should
 do this by default. I'm not sure if this is a packaging or systemd code
 issue,
 tho. But I think the systemd devs are in common that for cow fs, the
 journal directories should be set nocow. After all, the journal is a
 transactional database - it does not need cow protection at all costs.
 And I think they have their own checksumming protection. So, why let
 systemd bother with that? A lot of other software has the same semantic
 problems with btrfs, too (ex. MySQL) where nobody shouts at the
 inabilities of the programmers. So why for systemd? Just because it's
 intrusive by its nature for being a radically and newly designed init
 system and thus 

Re: Slow startup of systemd-journal on BTRFS

2014-06-14 Thread Duncan
Goffredo Baroncelli posted on Sat, 14 Jun 2014 09:52:39 +0200 as
excerpted:

 On 06/14/2014 04:53 AM, Duncan wrote:
 Goffredo Baroncelli posted on Sat, 14 Jun 2014 00:19:31 +0200 as
 excerpted:
 
 thanks for pointing that. However I am performing my tests on a fedora
 20 with systemd-208, which seems have this change

 I am reaching the conclusion that fallocate is not the problem. The
 fallocate increase the filesize of about 8MB, which is enough for some
 logging. So it is not called very often.

Right.

 But...

Exactly, _but_...

 [A]n fallocate of 8 MiB will increase the file size
 by 8 MiB and write that out.  So far so good as at that point the 8 MiB
 should be a single extent.  But then, data gets written into 4 KiB
 blocks of that 8 MiB one at a time, and because btrfs is COW, the new
 data in the block must be written to a new location.
 
 Which effectively means that by the time the 8 MiB is filled, each 4
 KiB block has been rewritten to a new location and is now an extent
 unto itself.  So now that 8 MiB is composed of 2048 new extents, each
 one a single 4 KiB block in size.
 
 Several people pointed fallocate as the problem. But I don't understand
 the reason.
 1) 8MB is a quite huge value, so fallocate is called (at worst) 1 time
 during the boot. Often never because the log are less than 8MB.
 2) it is true that btrfs rewrite almost 2 times each 4kb page with
 fallocate. But the first time is a big write of 8MB; instead the
 second write would happen in any case. What I mean is that without the
 fallocate in any case journald would make small write.
 
 To be honest, I fatigue to see the gain of having a fallocate on a COW
 filesystem... may be that I don't understand very well the fallocate()
 call.

The base problem isn't fallocate per se, rather, tho it's the trigger in 
this case.  The base problem is that for COW-based filesystems, *ANY* 
rewriting of existing file content results in fragmentation.

It just so happens that the only reason there's existing file content to 
be rewritten (as opposed to simply appending) in this case, is because of 
the fallocate.  The rewrite of existing file content is the problem, but 
the existing file content is only there in this case because of the 
fallocate.

Taking a step back...

On a non-COW filesystem, allocating 8 MiB ahead and writing into it 
rewrites into the already allocated location, thus guaranteeing extents 
of 8 MiB each, since once the space is allocated it's simply rewritten in-
place.  Thus, on a non-COW filesystem, pre-allocating in something larger 
than single filesystem blocks when an app knows the data is eventually 
going to be written in to fill that space anyway is a GOOD thing, which 
is why systemd is doing it.

But on a COW-based filesystem fallocate is the exact opposite, a BAD 
thing, because an fallocate forces the file to be written out at that 
size, effectively filled with nulls/blanks.  Then the actual logging 
comes along and rewrites those nulls/blanks with actual data, but it's 
now a rewrite, which on a COW, copy-on-write, based filesystem, the 
rewritten block is copied elsewhere, it does NOT overwrite the existing 
null/blank block, and elsewhere by definition means detached from the 
previous blocks, thus in an extent all by itself.

Once the full 2048 original blocks composing that 8 MiB are filled in 
with actual data, because they were rewrites from null/blank blocks that 
fallocate had already forced to be allocated, that's now 2048 separate 
extents, 2048 separate file fragments, where without the forced fallocate, 
the writes would have all been appends, and there would have been at 
least /some/ chance of some of those 2048 separate blocks being written 
at close enough to the same time that they would have been written 
together as a single extent.  So while the 8 MiB might not have been a 
single extent as opposed to 2048 separate extents, it might have been 
perhaps 512 or 1024 extents, instead of the 2048 that it ended up being 
because fallocate meant that each block was a rewrite into an existing 
file, not a new append-write at the end of an existing file.

 [...]
 Another alternative is that distros will start setting /var/log/journal
 NOCOW in their setup scripts by default when it's btrfs, thus avoiding
 the problem.  (Altho if they do automated snapshotting they'll also
 have to set it as its own subvolume, to avoid the
 first-write-after-snapshot-
 is-COW problem.)  Well, that, and/or set autodefrag in the default
 mount options.
 
 Pay attention, that this remove also the checksum, which are very useful
 in a RAID configuration.

Well, it can be.  But this is only log data, not executable or the like 
data, and (as Kai K points out) journald has its own checksumming method 
in any case.

Besides which, you still haven't explained why you can't either set the 
autodefrag mount option and be done with it, or run a systemd-timer-
triggered or cron-triggered defrag script to defrag them