On Mon, Jun 16, 2014 at 12:13:07AM +0200, Lennart Poettering wrote:
> On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreij...@libero.it) wrote:
> 
> > > Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
> > > block has been rewritten to a new location and is now an extent unto 
> > > itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
> > > single 4 KiB block in size.
> > 
> > Several people pointed to fallocate as the problem. But I don't
> > understand the reason.
> 
> BTW, the reason we use fallocate() in journald is not about trying to
> optimize anything. It's only used for one reason: to avoid SIGBUS on
> disk/quota full, since we actually write everything to the files using
> mmap().

FWIW, fallocate() doesn't absolutely guarantee you that. When at
ENOSPC, a write into that reserved range can still require
un-reserved metadata blocks to be allocated. e.g. splitting a
"reserved" data extent into two extents (used and reserved) requires
an extra btree record, which can cause a split, which can require
allocation. This tends to be pretty damn rare, though, and some
filesystems have reserved block pools specifically for handling this
sort of ENOSPC corner case. Hence, in practice the filesystems
never actually fail with ENOSPC in ranges that have been
fallocate()d.
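
To make the pattern under discussion concrete, here is a minimal sketch of
"reserve first, then write through mmap()". It is not journald's code: the
helper name preallocate_and_map and the demo file name are made up for
illustration, and Python's os.posix_fallocate() stands in for the
fallocate() call discussed here.

```python
import mmap
import os
import tempfile


def preallocate_and_map(path, size):
    """Reserve `size` bytes for `path`, then map the file for writing.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # A disk-full or quota-full condition is reported here as an
        # OSError (ENOSPC/EDQUOT), before any data is written via mmap.
        os.posix_fallocate(fd, 0, size)
        # Writes through the mapping now land in reserved space. As noted
        # above, this is not an absolute guarantee: metadata allocation
        # can still fail in rare corner cases.
        return mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # mmap keeps its own duplicate of the descriptor.
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        m = preallocate_and_map(os.path.join(tmp, "demo.journal"),
                                8 * 1024 * 1024)
        m[0:5] = b"hello"
        m.flush()
        m.close()
```

The point of ordering it this way is that the failure shows up as an
ordinary error return at reservation time rather than as a signal in the
middle of a memory store.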

> I mean, writing things with mmap() is always problematic, and
> handling write errors is awfully difficult, but at least two of the most
> common reasons for failure we'd like to protect against in advance, under
> the assumption that disk/quota full will be reported immediately by the
> fallocate(), and the mmap writes later on will then necessarily succeed.
> 
> I am not really following why this trips up btrfs though. I am
> not sure I understand why this breaks btrfs COW behaviour. I mean,
> fallocate() isn't necessarily supposed to write anything really, it's
> mostly about allocating disk space in advance. I would claim that
> journald's usage of it is very much within the entire reason why it
> exists...
> 
> Anyway, happy to change these things around if necessary, but first I'd
> like to have a very good explanation why fallocate() wouldn't be the
> right thing to invoke here, and a suggestion what we should do instead
> to cover this use case...

fallocate() of 8MB should be more than sufficient for non-COW
filesystems - 1MB would be enough to prevent performance degradation
due to fragmentation in most cases. The current problems seem to be
with the way btrfs does rewrites, not the use of fallocate() in
systemd.
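
For reference, the extent count quoted at the top of this mail is simple
arithmetic: if every 4 KiB block of a preallocated range is rewritten to a
new location, the worst case is one extent per block. A small sketch (the
function name is made up for illustration; the sizes are the ones quoted
in this thread):

```python
BLOCK_SIZE = 4 * 1024  # 4 KiB block size, as quoted above


def worst_case_extents(prealloc_bytes, block_size=BLOCK_SIZE):
    """Extent count if every block ends up as its own extent."""
    return prealloc_bytes // block_size


if __name__ == "__main__":
    print(worst_case_extents(8 * 1024 * 1024))  # the 8 MiB case
    print(worst_case_extents(1 * 1024 * 1024))  # the 1 MiB case
```

For 8 MiB this gives the 2048 extents mentioned in the quoted text.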

Thanks for the explanation, Lennart.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com