On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
> 18.04.2017 06:50, Chris Murphy wrote:
>>> What exactly does "changes" mean? Write() syscall?
>>
>> filefrag reported entries increase; it's using FIEMAP.
>>
> So far it sounds like btrfs allocates a new extent on every write to
> the journal file.

Each journal record itself is relatively small, indeed. That's why it
would be better if there were no fsync, so that Btrfs could accumulate
these writes and commit them on its own schedule (30s by default).

It's likely that the ssd allocation option on these SSDs is a factor in
the fragmentation, because it tries to allocate each write into a fresh
2MB region based on the expected erase block size. There's a lot of
discussion going on right now on the Btrfs list about whether these
assumptions are still true, and in which cases we should be using nossd
on higher-end SSDs and NVMe. What's for sure, though, is that with any
of these allocators, nocow is not good for lower-end flash like SD
cards: all it does is ask to write to the same LBAs over and over again
for a journal, and it just increases write amplification unnecessarily.
So I'm beginning to think that on SSDs it's better if journald set +c
(compress) rather than +C (nocow) on journals. But there's still some
researching to do.

I definitely think /var/log/journal/<machineid> should be a subvolume,
to avoid its contents being snapshot; that does make the fragmentation
problem worse. And I also think the defragmentation feature should be
disabled, at least on SSDs, or should include zlib compression: the
write amplification of defragmenting on an SSD is worse than just
leaving the file fragmented.

>> Also with stat I see the times (all three) change on the file. If I go
>> to GNOME Terminal and just sudo some command, that itself causes the
>> current system.journal file to get all three times modified. It
>> happens immediately, there's no delay. So if I'm doing something like
>> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
>> the journal, it's just constantly writing stuff to the journal.
>> This is without anything running journalctl -f or reading the journal.
>>
>>>> #Storage=auto
>>>> #Compress=yes
>>>> #Seal=yes
>>>> #SplitMode=uid
>>>> #SyncIntervalSec=5m
>>>
>>> This controls how often systemd calls fsync() on the currently active
>>> journal file. Do you see fsync() every 3 seconds?
>>
>> I have no idea if it's fsync or what. How can I tell?
>>
> strace -p $(pgrep systemd-journal)
>
> You will not see the actual writes, as the file is memory mapped, but
> it definitely does not do any fsync() every so often.
>
> Is it possible that the btrfs behavior you observe is specific to
> memory-mapped file handling?

Maybe. But even after a reboot I see the same extent entries in the
file. Granted, a good number of these one-block entries have addresses
that follow one after the other, so they often make up larger contiguous
extents, but they still have separate entries.

>> Also, I don't think these journal files are being compressed.
>>
>> Using the btrfs-progs/btrfs-debugfs script on a few user journal
>> files, I'm seeing massive compression ratios. Maybe I'll try
>> Compress=No and see if there's a change.
>>
> Only the actual message payload above some threshold (I think 256 or
> 512 bytes, not sure) is compressed; everything else is not. For average
> syslog-type messages the payload is far too small. This is really only
> interesting when you store a core dump or similar.

Interesting, I see. Thanks. I'll try strace and see what's going on.

-- 
Chris Murphy
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
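As a rough illustration of the threshold behavior described above (only payloads above a size threshold get compressed, so typical syslog lines are stored verbatim), here is a minimal sketch. This is not journald's actual code: the 512-byte threshold is an assumption from the thread, the function names are made up, and lzma stands in for whichever codec journald uses (XZ historically, LZ4 in later versions).

```python
import lzma

# Assumed threshold for illustration; journald's real default varies by version.
COMPRESS_THRESHOLD = 512

def store_payload(payload: bytes):
    """Threshold-based field compression: short payloads (typical syslog
    lines) are stored verbatim; large ones (core dumps, long messages)
    are compressed. Returns (stored_bytes, was_compressed)."""
    if len(payload) < COMPRESS_THRESHOLD:
        return payload, False            # stored as-is
    return lzma.compress(payload), True  # stored compressed

def load_payload(stored: bytes, compressed: bool) -> bytes:
    """Reverse of store_payload: decompress only if the flag says so."""
    return lzma.decompress(stored) if compressed else stored
```

Since an average syslog-type message is well under the threshold, it is written out unchanged, which is why Compress= makes little difference for ordinary journal traffic.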