Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-11 Thread Austin S. Hemmelgarn

On 2017-04-11 05:55, Adam Borowski wrote:

On Tue, Apr 11, 2017 at 06:01:19AM +0200, Kai Krakow wrote:

Yes, I know all this. But I don't see why you still want noatime or
relatime if you use lazytime, except for super-optimizing. Lazytime
gives you POSIX conformity for a problem that the other options only
tried to solve.


(Besides lazytime also working on mtime, and, technically, ctime.)
Nope, it by definition can't work on ctime because a ctime update means 
something else changed in the inode, which in turn will cause it to be 
flushed to disk normally (lazytime only defers the flush as long as 
nothing else in the inode is different, so it won't help much on stuff 
like traditional log files because their size is changing regularly 
(which updates the inode, which then causes it to get flushed)).


First: atime, in any form, murders snapshots.  On any filesystem that has
them, not just btrfs -- I've tested zfs and LVM snapshots, there's also
qcow2/vdi and so on.  On all of them, every single read-everything operation
costs you 5% disk space.  For a _read_ operation!

I've tested /usr-y mix of files, for consistency with the guy who mentioned
this problem first.  Your mileage will vary depending on whether you store
100GB disk images or a news spool.

Read-everything is quite rare, but most systems have at least one
stat-everything cronjob.  That touches only diratime, but that's still
1-in-11 inodes (remarkably consistent: I've checked a few machines with
drastically different purposes, and somehow the min was 10, max 12).

And no, marking snapshots as ro doesn't help: reading the live version still
breaks CoW.


Second: atime murders media with limited write endurance.  Modern SSDs can
cope well, but I for one work a lot with SD and eMMC.  Every single SoC
image I've seen uses noatime for this reason.
Even on SSDs it's still an issue, especially with something like 
ext4, which uses inode tables (updating one inode will usually require a 
read-modify-write of an erase block regardless, but inode tables mean that 
this happens _all the time_).



Third: relatime/lazytime don't eliminate the performance cost.  They fix
only frequently read files -- if you have a big filesystem where you read a
lot but individual files tend to be read rarely, relatime is as bad as
strictatime, and lazytime actually worse.  Both will do an unnecessary write
of all inodes.


Four: why?  Besides being POSIXLY_CORRECT, what do you actually gain from
atime?  I can think only of:
* new mail notification with mbox.  Just patch the mail reader to manually
  futimens(..., {UTIME_NOW,UTIME_OMIT}), it has no extra cost on !noatime
  mounts.  I've personally done so for mutt; the updated version will ship
  in Debian stretch; you can patch other mail readers although they tend
  to be rarely used in conjunction with shell access (and thus they have
  no need for atime at all).
* Debian's popcon's "vote" field.  Use "inst", and there's no gain from
  popcon for you personally.
* some intrusion detection forensics (broken by open(..., O_NOATIME))

On top of all that:
Five:
Handling of atime slows down stat and a handful of other things.  If you 
take a source tree the size of the Linux kernel, write a patch that 
changes every file (even just one character), and then go to commit it 
in Git (or SVN, or Bazaar, or Mercurial), you'll see a pretty serious 
difference in the time it takes to commit because almost all VCS 
software calls stat() on the entire tree.  relatime won't help much here 
because the check to determine whether or not to update the atime still 
has to happen (in fact, it can hurt slightly; strictatime eliminates 
that check).
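The stat-everything behaviour is trivial to reproduce; here's a sketch
(helper names made up) of the kind of full-tree stat() sweep a VCS
status check performs:

```c
#define _XOPEN_SOURCE 700  /* for nftw() */
#include <ftw.h>

/* Count the stat() calls a full tree walk performs -- roughly what
 * "git status" does to detect changed files.  On a strictatime mount
 * every one of these stats is also a candidate inode write; with
 * noatime it stays a pure read. */
static long nstats;

static int count_one(const char *path, const struct stat *sb,
                     int typeflag, struct FTW *ftwbuf)
{
    (void)path; (void)sb; (void)typeflag; (void)ftwbuf;
    nstats++;   /* nftw() has already stat()ed this entry for us */
    return 0;   /* keep walking */
}

/* Returns the number of entries stat()ed under root, or -1 on error. */
long stat_everything(const char *root)
{
    nstats = 0;
    if (nftw(root, count_one, 16, FTW_PHYS) != 0)
        return -1;
    return nstats;
}
```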


Six:
It doesn't behave how most users would inherently expect, partly because 
there are ways to bypass it even if the FS is mounted with strictatime.
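One such bypass is per-open and available to any process that owns the
file, no matter what the mount options say; a minimal Linux-specific
sketch (helper name made up):

```c
#define _GNU_SOURCE  /* O_NOATIME is Linux-specific */
#include <fcntl.h>

/* Sketch of the bypass: read a file without updating its atime, even on
 * a strictatime mount.  O_NOATIME is only allowed if the caller owns
 * the file (or has CAP_FOWNER); otherwise open() fails with EPERM, so
 * fall back to a normal open and take the atime update. */
int open_noatime(const char *path)
{
    int fd = open(path, O_RDONLY | O_NOATIME);
    if (fd < 0)
        fd = open(path, O_RDONLY);
    return fd;
}
```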



Conclusion: death to atime!



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-11 Thread Adam Borowski
On Tue, Apr 11, 2017 at 06:01:19AM +0200, Kai Krakow wrote:
> Yes, I know all this. But I don't see why you still want noatime or
> relatime if you use lazytime, except for super-optimizing. Lazytime
> gives you POSIX conformity for a problem that the other options only
> tried to solve.

(Besides lazytime also working on mtime, and, technically, ctime.)

First: atime, in any form, murders snapshots.  On any filesystem that has
them, not just btrfs -- I've tested zfs and LVM snapshots, there's also
qcow2/vdi and so on.  On all of them, every single read-everything operation
costs you 5% disk space.  For a _read_ operation!

I've tested /usr-y mix of files, for consistency with the guy who mentioned
this problem first.  Your mileage will vary depending on whether you store
100GB disk images or a news spool.

Read-everything is quite rare, but most systems have at least one
stat-everything cronjob.  That touches only diratime, but that's still
1-in-11 inodes (remarkably consistent: I've checked a few machines with
drastically different purposes, and somehow the min was 10, max 12).

And no, marking snapshots as ro doesn't help: reading the live version still
breaks CoW.


Second: atime murders media with limited write endurance.  Modern SSDs can
cope well, but I for one work a lot with SD and eMMC.  Every single SoC
image I've seen uses noatime for this reason.


Third: relatime/lazytime don't eliminate the performance cost.  They fix
only frequently read files -- if you have a big filesystem where you read a
lot but individual files tend to be read rarely, relatime is as bad as
strictatime, and lazytime actually worse.  Both will do an unnecessary write
of all inodes.


Four: why?  Besides being POSIXLY_CORRECT, what do you actually gain from
atime?  I can think only of:
* new mail notification with mbox.  Just patch the mail reader to manually
  futimens(..., {UTIME_NOW,UTIME_OMIT}), it has no extra cost on !noatime
  mounts.  I've personally done so for mutt; the updated version will ship
  in Debian stretch; you can patch other mail readers although they tend
  to be rarely used in conjunction with shell access (and thus they have
  no need for atime at all).
* Debian's popcon's "vote" field.  Use "inst", and there's no gain from
  popcon for you personally.
* some intrusion detection forensics (broken by open(..., O_NOATIME))


Conclusion: death to atime!
-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
On Mon, 10 Apr 2017 15:43:57 -0400, "Austin S. Hemmelgarn" wrote:

> On 2017-04-10 14:18, Kai Krakow wrote:
> > On Mon, 10 Apr 2017 13:13:39 -0400, "Austin S. Hemmelgarn" wrote:
> >  
> >> On 2017-04-10 12:54, Kai Krakow wrote:  
>  [...]  
>  [...]  
> >>  [...]
> >>  [...]  
>  [...]  
> >>  [...]
> >>  [...]  
>  [...]  
>  [...]  
> >> The command-line also rejects a number of perfectly legitimate
> >> arguments that BTRFS does understand too though, so that's not much
> >> of a test.  
> >
> > Which are those? I didn't encounter any...  
> I'm not sure there are any anymore, but I know that a handful (mostly 
> really uncommon ones) used to (and BTRFS is not alone in this
> respect, some of the more esoteric ext4 options aren't accepted on
> the kernel command-line either).  I know at a minimum at some point
> in the past alloc-start, check_int, and inode_cache did not work from
> the kernel command-line.

The post from Janos explains why: The difference is with the mount
handler, depending on whether you use initrd or not.

> >> I've just finished some quick testing though, and it looks
> >> like you're right, BTRFS does not support this, which means I now
> >> need to figure out what the hell was causing the IOPS counters in
> >> collectd to change in rough correlation  with remounting
> >> (especially since it appears to happen mostly independent of the
> >> options being changed).  
> >
> > I think that noatime (which I remember you also used?), lazytime,
> > and relatime are mutually exclusive: they all handle the inode
> > updates. Maybe that is the effect you see?  
> They're not exactly exclusive.  The lazytime option will prevent
> changes to the mtime or atime fields in a file from forcing inode
> write-out for up to 24 hours (if the inode would be written out for
> some other reason (such as a file-size change or the inode being
> evicted from the cache), then the timestamps will be too), but it
> does not change the value of the timestamps.  So if you have lazytime
> enabled and use touch to update the mtime on an otherwise idle file,
> the mtime will still be correct as far as userspace is concerned, as
> long as you don't crash before the update hits the disk (but
> userspace will only see the discrepancy _after_ the crash).

Yes, I know all this. But I don't see why you still want noatime or
relatime if you use lazytime, except for super-optimizing. Lazytime
gives you POSIX conformity for a problem that the other options only
tried to solve.

> > Well, relatime is mostly the same thus not perfectly resembling the
> > POSIX standard. I think the only software that relies on atime is
> > mutt...  
> This very much depends on what you're doing.  If you have a WORM 
> workload, then yeah, it's pretty much the same.  If however you have 
> something like a database workload where a specific set of files get 
> internally rewritten regularly, then it actually has a measurable
> impact.

I think "impact" is a whole different story. I'm on your side here.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
On Tue, 11 Apr 2017 01:45:32 +0200, "Janos Toth F." wrote:

> >> The command-line also rejects a number of perfectly legitimate
> >> arguments that BTRFS does understand too though, so that's not much
> >> of a test.  
> >
> > Which are those? I didn't encounter any...  
> 
> I think this bug still stands unresolved (for 3+ years, probably
> because most people use an initrd/initramfs without ever considering
> omitting it when they don't really need it at all):
> Bug 61601 - rootflags=noatime causes kernel panic when booting
> without initrd. The last time I tried it applied to Btrfs as well:
> https://bugzilla.kernel.org/show_bug.cgi?id=61601#c18

Ah okay, so the difference is with the mount handler. I can only use
initrd here because I have multi-device btrfs on top of bcache as rootfs.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Janos Toth F.
>> The command-line also rejects a number of perfectly legitimate
>> arguments that BTRFS does understand too though, so that's not much
>> of a test.
>
> Which are those? I didn't encounter any...

I think this bug still stands unresolved (for 3+ years, probably
because most people use an initrd/initramfs without ever considering
omitting it when they don't really need it at all):
Bug 61601 - rootflags=noatime causes kernel panic when booting without initrd.
The last time I tried it applied to Btrfs as well:
https://bugzilla.kernel.org/show_bug.cgi?id=61601#c18


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Adam Borowski
On Mon, Apr 10, 2017 at 03:43:57PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-04-10 14:18, Kai Krakow wrote:

> * strictatime, lazytime: Both atime and mtime updates happen, but the
> actual update may not hit the disk for up to 24 hours (this will let mutt
> work correctly as long as your system shuts down cleanly, but still improve
> performance noticeably on at least ext4).

> > Well, relatime is mostly the same thus not perfectly resembling the
> > POSIX standard. I think the only software that relies on atime is
> > mutt...

Well, about that mutt thing...  Neomutt actually, but that's the codebase
Debian uses:

https://github.com/neomutt/neomutt/commit/816095bfdb72caafd8845e8fb28cbc8c6afc114f

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Austin S. Hemmelgarn

On 2017-04-10 14:18, Kai Krakow wrote:

On Mon, 10 Apr 2017 13:13:39 -0400, "Austin S. Hemmelgarn" wrote:


On 2017-04-10 12:54, Kai Krakow wrote:

On Mon, 10 Apr 2017 18:44:44 +0200, Kai Krakow wrote:


On Mon, 10 Apr 2017 08:51:38 -0400, "Austin S. Hemmelgarn" wrote:


 [...]
 [...]

 [...]

 [...]
 [...]


Did you put it in /etc/fstab only for the rootfs? If yes, it
probably has no effect. You would need to give it as rootflags on
the kernel cmdline.


I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
and f2fs know the flag. Kernel 4.10.

So probably you're seeing a placebo effect. If you put lazytime for
rootfs just only into fstab, it won't have an effect because on
initial mount this file cannot be opened (for obvious reasons), and
on remount, btrfs seems to happily accept lazytime but it has no
effect. It won't show up in /proc/mounts. Try using it in rootflags
kernel cmdline and you should see that the kernel won't accept the
flag lazytime.

The command-line also rejects a number of perfectly legitimate
arguments that BTRFS does understand too though, so that's not much
of a test.


Which are those? I didn't encounter any...
I'm not sure there are any anymore, but I know that a handful (mostly 
really uncommon ones) used to (and BTRFS is not alone in this respect, 
some of the more esoteric ext4 options aren't accepted on the kernel 
command-line either).  I know at a minimum at some point in the past 
alloc-start, check_int, and inode_cache did not work from the kernel 
command-line.



I've just finished some quick testing though, and it looks
like you're right, BTRFS does not support this, which means I now
need to figure out what the hell was causing the IOPS counters in
collectd to change in rough correlation  with remounting (especially
since it appears to happen mostly independent of the options being
changed).


I think that noatime (which I remember you also used?), lazytime, and
relatime are mutually exclusive: they all handle the inode updates.
Maybe that is the effect you see?
They're not exactly exclusive.  The lazytime option will prevent changes 
to the mtime or atime fields in a file from forcing inode write-out for 
up to 24 hours (if the inode would be written out for some other reason 
(such as a file-size change or the inode being evicted from the cache), 
then the timestamps will be too), but it does not change the value of 
the timestamps.  So if you have lazytime enabled and use touch to update 
the mtime on an otherwise idle file, the mtime will still be correct as 
far as userspace is concerned, as long as you don't crash before the 
update hits the disk (but userspace will only see the discrepancy 
_after_ the crash).


By comparison, relatime causes the atime not to be updated at all if it 
has already been updated in the last 24 hours (unless the mtime or ctime 
is newer than the atime), and noatime completely prevents atime 
updates.  In both cases, the atime isn't correct at all in userspace as 
far as POSIX is concerned.


So, you have the following combinations:
* strictatime, nolazytime: Both atime and mtime updates happen, and are 
flushed to disk (almost) immediately.
* relatime, nolazytime (the upstream default): atime updates happen only 
if the atime hasn't changed in 24 hours, mtime updates happen as normal, 
and both types of update are flushed to disk (almost) immediately.
* noatime, nolazytime (the default on some specific kernels (this is 
easy to patch, so a lot of people who already carry custom patches and 
don't use mutt patch it)): atime updates never happen, mtime updates 
happen as normal and are flushed to disk (almost) immediately.
* strictatime, lazytime: Both atime and mtime updates happen, but the 
actual update may not hit the disk for up to 24 hours (this will let 
mutt work correctly as long as your system shuts down cleanly, but still 
improve performance noticeably on at least ext4).
* relatime, lazytime: atime updates happen only if the atime hasn't 
changed in 24 hours, mtime updates happen as normal, and both may not 
hit the disk for up to 24 hours.
* noatime, lazytime (what I'm trying to run): atime updates never 
happen, mtime updates happen as normal, but may not hit the disk for up 
to 24 hours.


In essence, lazytime only impacts inode writeback (deferring it under 
special circumstances), while {no,rel,strict}atime impacts the actual 
value of the time-stamps.
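That split is easy to observe from userspace: stat() always reports the
in-memory timestamps, so lazytime never changes what you see, while the
*atime options change the values themselves.  A tiny sketch (helper name
made up):

```c
#include <stdio.h>
#include <sys/stat.h>

/* Print the timestamps the kernel currently reports for a path.  Under
 * lazytime these are always current in memory even when the on-disk
 * inode is stale; under noatime/relatime it is the atime value itself
 * that is frozen or only occasionally updated. */
int print_times(const char *path)
{
    struct stat sb;
    if (stat(path, &sb) != 0)
        return -1;
    printf("atime=%ld mtime=%ld ctime=%ld\n",
           (long)sb.st_atime, (long)sb.st_mtime, (long)sb.st_ctime);
    return 0;
}
```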



This is somewhat disappointing though, as supporting this would
probably help with the write-amplification issues inherent in COW
filesystems.


Well, relatime is mostly the same thus not perfectly resembling the
POSIX standard. I think the only software that relies on atime is
mutt...
This very much depends on what you're doing.  If you have a WORM 
workload, then yeah, it's pretty much the same.  If however you have 
something like a database workload where a specific set of files get 
internally rewritten regularly, then it actually has a 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
On Mon, 10 Apr 2017 13:13:39 -0400, "Austin S. Hemmelgarn" wrote:

> On 2017-04-10 12:54, Kai Krakow wrote:
> > On Mon, 10 Apr 2017 18:44:44 +0200, Kai Krakow wrote:
> >  
> >> On Mon, 10 Apr 2017 08:51:38 -0400, "Austin S. Hemmelgarn" wrote:
> >>  
>  [...]  
>  [...]  
> >>  [...]  
>  [...]  
>  [...]  
> >>
> >> Did you put it in /etc/fstab only for the rootfs? If yes, it
> >> probably has no effect. You would need to give it as rootflags on
> >> the kernel cmdline.  
> >
> > I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
> > and f2fs know the flag. Kernel 4.10.
> >
> > So probably you're seeing a placebo effect. If you put lazytime for
> > rootfs just only into fstab, it won't have an effect because on
> > initial mount this file cannot be opened (for obvious reasons), and
> > on remount, btrfs seems to happily accept lazytime but it has no
> > effect. It won't show up in /proc/mounts. Try using it in rootflags
> > kernel cmdline and you should see that the kernel won't accept the
> > flag lazytime. 
> The command-line also rejects a number of perfectly legitimate
> arguments that BTRFS does understand too though, so that's not much
> of a test.

Which are those? I didn't encounter any...

> I've just finished some quick testing though, and it looks
> like you're right, BTRFS does not support this, which means I now
> need to figure out what the hell was causing the IOPS counters in
> collectd to change in rough correlation  with remounting (especially
> since it appears to happen mostly independent of the options being
> changed).

I think that noatime (which I remember you also used?), lazytime, and
relatime are mutually exclusive: they all handle the inode updates.
Maybe that is the effect you see?

> This is somewhat disappointing though, as supporting this would
> probably help with the write-amplification issues inherent in COW
> filesystems.

Well, relatime is mostly the same thus not perfectly resembling the
POSIX standard. I think the only software that relies on atime is
mutt...

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Austin S. Hemmelgarn

On 2017-04-10 12:54, Kai Krakow wrote:

On Mon, 10 Apr 2017 18:44:44 +0200, Kai Krakow wrote:


On Mon, 10 Apr 2017 08:51:38 -0400, "Austin S. Hemmelgarn" wrote:


On 2017-04-10 08:45, Kai Krakow wrote:

On Mon, 10 Apr 2017 08:39:23 -0400, "Austin S. Hemmelgarn" wrote:


 [...]


Does btrfs really support lazytime now?


It appears to, I do see fewer writes with it than without it.  At
the very least, if it doesn't, then nothing complains about it.


Did you put it in /etc/fstab only for the rootfs? If yes, it probably
has no effect. You would need to give it as rootflags on the kernel
cmdline.


I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
and f2fs know the flag. Kernel 4.10.

So probably you're seeing a placebo effect. If you put lazytime for
rootfs just only into fstab, it won't have an effect because on initial
mount this file cannot be opened (for obvious reasons), and on remount,
btrfs seems to happily accept lazytime but it has no effect. It won't
show up in /proc/mounts. Try using it in rootflags kernel cmdline and
you should see that the kernel won't accept the flag lazytime.

The command-line also rejects a number of perfectly legitimate arguments 
that BTRFS does understand too though, so that's not much of a test. 
I've just finished some quick testing though, and it looks like you're 
right, BTRFS does not support this, which means I now need to figure out 
what the hell was causing the IOPS counters in collectd to change in 
rough correlation  with remounting (especially since it appears to 
happen mostly independent of the options being changed).


This is somewhat disappointing though, as supporting this would probably 
help with the write-amplification issues inherent in COW filesystems.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
On Mon, 10 Apr 2017 18:44:44 +0200, Kai Krakow wrote:

> On Mon, 10 Apr 2017 08:51:38 -0400, "Austin S. Hemmelgarn" wrote:
> 
> > On 2017-04-10 08:45, Kai Krakow wrote:  
> > > On Mon, 10 Apr 2017 08:39:23 -0400, "Austin S. Hemmelgarn" wrote:
> > >
>  [...]  
> > >
> > > Does btrfs really support lazytime now?
> > >
> > It appears to, I do see fewer writes with it than without it.  At
> > the very least, if it doesn't, then nothing complains about it.  
> 
> Did you put it in /etc/fstab only for the rootfs? If yes, it probably
> has no effect. You would need to give it as rootflags on the kernel
> cmdline.

I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
and f2fs know the flag. Kernel 4.10.

So probably you're seeing a placebo effect. If you put lazytime for
rootfs just only into fstab, it won't have an effect because on initial
mount this file cannot be opened (for obvious reasons), and on remount,
btrfs seems to happily accept lazytime but it has no effect. It won't
show up in /proc/mounts. Try using it in rootflags kernel cmdline and
you should see that the kernel won't accept the flag lazytime.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
On Mon, 10 Apr 2017 08:51:38 -0400, "Austin S. Hemmelgarn" wrote:

> On 2017-04-10 08:45, Kai Krakow wrote:
> > On Mon, 10 Apr 2017 08:39:23 -0400, "Austin S. Hemmelgarn" wrote:
> >  
> >> They've been running BTRFS
> >> with LZO compression, the SSD allocator, atime disabled, and mtime
> >> updates deferred (lazytime mount option) the whole time, so it may
> >> be a slightly different use case than the OP from this thread.  
> >
> > Does btrfs really support lazytime now?
> >  
> It appears to, I do see fewer writes with it than without it.  At the 
> very least, if it doesn't, then nothing complains about it.

Did you put it in /etc/fstab only for the rootfs? If yes, it probably
has no effect. You would need to give it as rootflags on the kernel
cmdline.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Austin S. Hemmelgarn

On 2017-04-10 08:45, Kai Krakow wrote:

On Mon, 10 Apr 2017 08:39:23 -0400, "Austin S. Hemmelgarn" wrote:


They've been running BTRFS
with LZO compression, the SSD allocator, atime disabled, and mtime
updates deferred (lazytime mount option) the whole time, so it may be
a slightly different use case than the OP from this thread.


Does btrfs really support lazytime now?

It appears to, I do see fewer writes with it than without it.  At the 
very least, if it doesn't, then nothing complains about it.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
On Mon, 10 Apr 2017 08:39:23 -0400, "Austin S. Hemmelgarn" wrote:

> They've been running BTRFS 
> with LZO compression, the SSD allocator, atime disabled, and mtime 
> updates deferred (lazytime mount option) the whole time, so it may be
> a slightly different use case than the OP from this thread.

Does btrfs really support lazytime now?

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Austin S. Hemmelgarn

On 2017-04-09 19:23, Hans van Kranenburg wrote:

On 04/08/2017 01:16 PM, Hans van Kranenburg wrote:

On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:

Ok, I'm going to revive a year old mail thread here with interesting new
info:

[...]

Now, another surprise:

From the exact moment I did mount -o remount,nossd on this filesystem,
the problem vanished.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png

I don't have a new video yet, but I'll set up a cron tonight and post it
later.

I'm going to send another mail specifically about the nossd/ssd
behaviour and other things I found out last week, but that'll probably
be tomorrow.


Well, there it is:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

Amazing... :) I'll update the file later with extra frames.


Added all new pngs up until now to the video, same link to the mp4.

Looks great! It just keeps reusing the same spots of space all the time.

When looking at this, I can understand that this is an unwanted write
pattern on a low-end ssd that was available for sale in 2008.

But, how does this apply to an SSD you can buy in 2017?

Depends on what brand and how cheap you go.  For a decent brand (Intel, 
Samsung, Crucial) and a reasonably good SSD (I'm partial to the Crucial 
MX series), this really doesn't hurt as much as it used to.


I've got a couple of Crucial MX300s (released middle of last year IIRC) 
which see roughly 200kB/s of writes constantly 24/7 (average write IOPS 
is about 15-20, so most of the writes are around 16kB), and after about 
6 months of this none of their wear-out indicators have changed since I 
first checked them when I installed them.  They've been running BTRFS 
with LZO compression, the SSD allocator, atime disabled, and mtime 
updates deferred (lazytime mount option) the whole time, so it may be a 
slightly different use case than the OP from this thread.


Given this though, combined with the fact that Crucial SSDs are decent 
(they're not quite on par with Samsung EVOs or the good Intel SSDs, 
but they're still pretty good for the price), I'd be willing to say that 
they're not anywhere near as workload sensitive as they used to be.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-09 Thread Hans van Kranenburg
On 04/08/2017 01:16 PM, Hans van Kranenburg wrote:
> On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
>> Ok, I'm going to revive a year old mail thread here with interesting new
>> info:
>>
>> [...]
>>
>> Now, another surprise:
>>
>> From the exact moment I did mount -o remount,nossd on this filesystem,
>> the problem vanished.
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
>>
>> I don't have a new video yet, but I'll set up a cron tonight and post it
>> later.
>>
>> I'm going to send another mail specifically about the nossd/ssd
>> behaviour and other things I found out last week, but that'll probably
>> be tomorrow.
> 
> Well, there it is:
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
> 
> Amazing... :) I'll update the file later with extra frames.

Added all new pngs up until now to the video, same link to the mp4.

Looks great! It just keeps reusing the same spots of space all the time.

When looking at this, I can understand that this is an unwanted write
pattern on a low-end ssd that was available for sale in 2008.

But, how does this apply to an SSD you can buy in 2017?

-- 
Hans van Kranenburg


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-08 Thread Hans van Kranenburg
On 04/08/2017 01:16 PM, Hans van Kranenburg wrote:
> On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
>> Ok, I'm going to revive a year old mail thread here with interesting new
>> info:
>>
>> [...]
>>
>> Now, another surprise:
>>
>> From the exact moment I did mount -o remount,nossd on this filesystem,
>> the problem vanished.
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
>>
>> I don't have a new video yet, but I'll set up a cron tonight and post it
>> later.
>>
>> I'm going to send another mail specifically about the nossd/ssd
>> behaviour and other things I found out last week, but that'll probably
>> be tomorrow.
> 
> Well, there it is:
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
> 
> Amazing... :) I'll update the file later with extra frames.

By the way,

1. For the log files in /var/log... logrotate behaves as a defrag tool
of course. The small free space gaps left behind when scraping the
current log file together and rewriting it as 1 big gzipped file can be
reused throughout the next day or whatever interval by the slow writes
again.

2. For the /var/spool/postfix... small files come and go, and that's
fine now.

3. For the mailman mbox files, which get appended all the time... They
can either stay where they are, having some more extents scattered
around, or, an entry in the monthly cron to point defrag at the files of
last month (which will never change again) will solve that efficiently.
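A monthly cron script along these lines could do that; the path and the archive file naming are hypothetical here (adjust to the actual mailman layout), and `date -d` is GNU date:

```shell
#!/bin/sh
# /etc/cron.monthly/defrag-mailman  (hypothetical path)
# Defragment last month's finished mbox archives once. After the month
# rolls over they never change again, so a single pass is enough and the
# scattered append-extents get rewritten contiguously.
last_month=$(date -d "last month" +%Y-%B)
btrfs filesystem defragment -v \
    /var/lib/mailman/archives/private/*/"$last_month".txt
```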

None of that sounds like an abnormal thing to do when punishing the
filesystem with a 'slow small write' workload.

I'm happy to be able to keep this thing on btrfs. When moving all the
mailman stuff over from a previous VM, I first made it ext4 again, then
immediately ended up with no inodes left (of course!) while copying the
mailman archive, and then thought .. arg .. mkfs.btrfs, yay, unlimited
inodes! :) I was almost at the point of converting it back to ext4 after
all because of the exploding unused free space problems, but now that's
prevented just in time. :D

Moo,
-- 
Hans van Kranenburg


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-08 Thread Hans van Kranenburg
On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
> Ok, I'm going to revive a year old mail thread here with interesting new
> info:
> 
> [...]
> 
> Now, another surprise:
> 
> From the exact moment I did mount -o remount,nossd on this filesystem,
> the problem vanished.
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
> 
> I don't have a new video yet, but I'll set up a cron tonight and post it
> later.
> 
> I'm going to send another mail specifically about the nossd/ssd
> behaviour and other things I found out last week, but that'll probably
> be tomorrow.

Well, there it is:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

Amazing... :) I'll update the file later with extra frames.

-- 
Hans van Kranenburg


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-08 Thread Duncan
Hans van Kranenburg posted on Fri, 07 Apr 2017 23:25:29 +0200 as
excerpted:

> So, this is why putting your /var/log, /var/lib/mailman and /var/spool
> on btrfs is a terrible idea.
> 
> Because the allocator keeps walking forward every file that is created
> and then removed leaves a blank spot behind.
> 
> Autodefrag makes the situation only a little bit better, changing the
> resulting pattern from a sky full of stars into a snowstorm. The result
> of taking a few small writes and rewriting them again is that again the
> small parts of free space are left behind.

> [... B]ecause of the pattern we end
> up with, a large write apparently fails (the files downloaded when doing
> apt-get update by daily cron) which causes a new chunk allocation. This
> is clearly visible in the videos. Directly after that, the new chunk
> gets filled with the same pattern, because the extent allocator now
> continues there and next day same thing happens again etc...

> Now, another surprise:
> 
> From the exact moment I did mount -o remount,nossd on this filesystem,
> the problem vanished.

That large write in the middle of small writes pattern might be why I've 
not seen the problem on my btrfs', on ssd, here.

Remember, I'm the guy who keeps advocating multiple independent small 
btrfs on partitioned-up larger devices, with the splits between 
independent btrfs' based on tasks.

So I have a quite tiny sub-GiB independent log btrfs handling those slow 
incremental writes to generally smaller files, a separate / with the main 
system on it that's mounted read-only unless I'm actively updating it, a 
separate home with my reasonably small size but written at-once non-media 
user files, a separate media partition/fs with my much larger but very 
seldom rewritten media files, and a separate update partition/fs with the 
local cache of the distro tree and overlays, sources (since it's gentoo), 
built binpkg cache, etc, with small to medium-large files that are 
comparatively frequently replaced.

So the relatively small slow-written and frequently rotated log files are 
isolated to their own partition/fs, undisturbed by the much larger update-
writes to the updates and / partitions/fs, isolating them from the update-
trigger that triggers the chunk allocations on your larger single general 
purpose filesystem/image, amongst all those fragmenting slow logfile 
writes.

Very interesting and informative thread, BTW.  I'm learning quite a bit. 
=:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-07 Thread Peter Grandi
[ ... ]
>>> I've got a mostly inactive btrfs filesystem inside a virtual
>>> machine somewhere that shows interesting behaviour: while no
>>> interesting disk activity is going on, btrfs keeps
>>> allocating new chunks, a GiB at a time.
[ ... ]
> Because the allocator keeps walking forward every file that is
> created and then removed leaves a blank spot behind.

That is typical "log-structured" filesystem behaviour; I am not
really surprised that Btrfs, being COW, does something like that.
NILFS2 works like that and requires a compactor (which does the
equivalent of 'balance' and 'defrag'). It is all about
tradeoffs.

With Btrfs I figured out that fairly frequent 'balance' is
really quite important, even with low percent values like
"usage=50", and usually even 'usage=90' does not take a long
time (while the default often takes a long time, I suspect
needlessly).
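For reference, a usage-filtered balance as described (mount point is a placeholder; filters per btrfs-balance(8)):

```shell
# Compact only block groups that are at most 50% used -- usually quick:
btrfs balance start -dusage=50 -musage=50 /mnt
# A more thorough pass, still much cheaper than an unfiltered balance:
btrfs balance start -dusage=90 /mnt
```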

>> From the exact moment I did mount -o remount,nossd on this
>> filesystem, the problem vanished.

Haha. Indeed. So it switches from "COW" to more like "log
structured" with the 'ssd' option. F2FS can switch like that
too, with some tunables IIRC. Except that modern flash SSDs
already do the "log structured" bit internally, so doing it in
Btrfs does not really help that much.

>> And even I saw some early prototypes inside the codes to
>> allow btrfs do allocation smaller extent than required.
>> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)

I am surprised that this is not already there, but it is a
terrible fix to a big mistake. The big mistake, which nearly all
filesystem designers make, is to assume that contiguous
allocation must be done by writing contiguous large blocks or
extents.

This big mistake was behind the stupid idea of the BSD FFS to
raise the block size from 512B to 4096B plus 512B "tails", and
endless stupid proposals to raise page and block sizes that get
done all the time, and is behind the stupid idea of doing
"delayed allocation", so large extents can be written in one go.

The ancient, tried and obvious idea is to preallocate space
ahead of it being written, so that a file's physical size may be
larger than its logical length; by how much depends on some
adaptive logic, or on hinting from the application (if the file
size is known in advance, the whole file can be preallocated).
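As a concrete sketch of such application hinting (Python on Linux; the helper name, sizes and write pattern are made up for illustration), posix_fallocate(2) reserves the full extent before the slow writes start:

```python
import os

def preallocate_then_write(path, final_size, chunk=4096):
    """Reserve final_size bytes up front, then dribble the data in
    slowly. With the hint, the filesystem can pick one contiguous
    region instead of allocating a tiny tail extent per write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.posix_fallocate(fd, 0, final_size)   # the size hint
        written = 0
        while written < final_size:
            written += os.write(fd, b"x" * min(chunk, final_size - written))
    finally:
        os.close(fd)
    return os.path.getsize(path)
```

os.posix_fallocate is POSIX-only; on filesystems without native fallocate support the C library emulates it, so the hint degrades gracefully.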

> [ ... ] So, this is why putting your /var/log, /var/lib/mailman and
> /var/spool on btrfs is a terrible idea. [ ... ]

That is just the old "writing a file slowly" issue, and many if
not most filesystems have this issue:

  http://www.sabi.co.uk/blog/15-one.html?150203#150203

and as that post shows it was already reported for Btrfs here:

  http://kreijack.blogspot.co.uk/2014/06/btrfs-and-systemd-journal.html

> [ ... ] The fun thing is that this might work, but because of
> the pattern we end up with, a large write apparently fails
> (the files downloaded when doing apt-get update by daily cron)
> which causes a new chunk allocation. This is clearly visible
> in the videos. Directly after that, the new chunk gets filled
> with the same pattern, because the extent allocator now
> continues there and next day same thing happens again etc... [
> ... ]

The general problem is that filesystems have a very difficult
job, especially on rotating media, and cannot avoid large,
important degenerate corner cases by using any adaptive logic.

Only predictive logic can avoid them, and since psychic code is
not possible yet, "predictive" means hints from applications and
users, and application developers and users are usually not
going to give them, or give them wrong.

Consider the "slow writing" corner case, common to logging or
downloads, that you mention: the filesystem logic cannot do well
in the general case because it cannot predict how large the
final file will be, or what the rate of writing will be.

However if the applications or users hint the total final size
or at least a suitable allocation size things are going to be
good. But it is already difficult to expect applications to give
absolutely necessary 'fsync's, so explicit file size or access
pattern hints are a bit of an illusion. It is the ancient
'O_PONIES' issue in one of its many forms.

Fortunately it is possible and even easy to do much better
*synthetic* hinting than most libraries and kernels do today:

  http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d
  http://www.sabi.co.uk/blog/anno05-4th.html?051011b#051011b
  http://www.sabi.co.uk/blog/anno05-4th.html?051011#051011
  http://www.sabi.co.uk/blog/anno05-4th.html?051010#051010

But that has not happened because it is no developer's itch to
fix. I was instead partially impressed that recently the
'vm_cluster' implementation was "fixed", only a decade or two
after it was first reported:

  http://sabi.co.uk/blog/anno05-3rd.html?050923#050923
  https://lwn.net/Articles/716296/
  https://lkml.org/lkml/2001/1/30/160

And still the author(s) of the fix don't seem to be persuaded by
many decades of 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-07 Thread Hans van Kranenburg
Ok, I'm going to revive a year old mail thread here with interesting new
info:

On 05/31/2016 03:36 AM, Qu Wenruo wrote:
> 
> 
> Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
>> Hi,
>>
>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>> somewhere that shows interesting behaviour: while no interesting disk
>> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>>
>> A picture, telling more than 1000 words:
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>> (when the amount of allocated/unused goes down, I did a btrfs balance)

That picture is still there, for the idea.

> Nice picture.
> Really better than 1000 words.
> 
> AFAIK, the problem may be caused by fragments.

Free space fragmentation is a key thing here indeed.

The major two things involved here are 1) the extent allocator, which
causes the free space fragmentation 2) the extent allocator, which
doesn't handle the fragmentation it just caused really well.

Let's start with the pictures, instead of too many words. The following
two videos are png images of the 4 block groups with highest vaddr.
Every 15 minutes a picture is created, and then they're added together:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

And, with autodefrag enabled, which was the first thing I tried as a change:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-13-autodefrag-ichiban.mp4

So, this is why putting your /var/log, /var/lib/mailman and /var/spool
on btrfs is a terrible idea.

Because the allocator keeps walking forward every file that is created
and then removed leaves a blank spot behind.

Autodefrag makes the situation only a little bit better, changing the
resulting pattern from a sky full of stars into a snowstorm. The result
of taking a few small writes and rewriting them again is that again the
small parts of free space are left behind.

Just a random idea.. for this write pattern, always putting new writes
in the first free available spot at the beginning of the block group
would make a total difference, since the little 4/8KiB parts would be
filled up again all the time, preventing the shotgun blast to spread all
over.
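A toy model can illustrate the difference (purely illustrative Python, not kernel code; "next-fit" stands in for the forward-walking cursor, "first-fit" for the idea above):

```python
# Model a block group as a row of equal-sized slots. The workload writes
# a small file each step and deletes the one written two steps earlier,
# like slowly-growing, frequently-rotated logs.

def run(strategy, n_ops=1000, capacity=10_000):
    """Return the highest slot index ever touched, i.e. how far the
    'shotgun blast' spreads across the block group."""
    used = [False] * capacity
    cursor = 0          # next-fit: remember where the last write ended
    live = []           # FIFO of currently-allocated slots
    high_water = 0
    for _ in range(n_ops):
        if strategy == "first-fit":
            i = used.index(False)        # always reuse the lowest hole
        else:                            # "next-fit": keep walking forward
            i = cursor
            while used[i % capacity]:
                i += 1
            i %= capacity
            cursor = i + 1
        used[i] = True
        high_water = max(high_water, i)
        live.append(i)
        if len(live) > 2:                # delete the oldest small file
            used[live.pop(0)] = False
    return high_water

print("first-fit spread:", run("first-fit"))  # stays tiny: holes are refilled
print("next-fit spread:", run("next-fit"))    # walks the whole block group
```

With this workload, first-fit keeps cycling through the same few low slots, while next-fit touches a fresh slot on nearly every write, leaving the freed ones behind as scattered holes.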

> And even I saw some early prototypes inside the codes to allow btrfs do
> allocation smaller extent than required.
> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)
> 
> But it's still prototype and seems no one is really working on it now.
> 
> So when btrfs is writing new data, for example, to write about 16M data,
> it will need to allocate a 16M continuous extent, and if it can't find
> large enough space to allocate, then create a new data chunk.
> 
> [...]

That's the cluster idea right? Combining free space fragments into a
bigger piece of space to fill with writes?

The fun thing is that this might work, but because of the pattern we end
up with, a large write apparently fails (the files downloaded when doing
apt-get update by daily cron) which causes a new chunk allocation. This
is clearly visible in the videos. Directly after that, the new chunk
gets filled with the same pattern, because the extent allocator now
continues there and next day same thing happens again etc...

And voila, there's the answer to my original question.

Now, another surprise:

From the exact moment I did mount -o remount,nossd on this filesystem,
the problem vanished.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png

I don't have a new video yet, but I'll set up a cron tonight and post it
later.

I'm going to send another mail specifically about the nossd/ssd
behaviour and other things I found out last week, but that'll probably
be tomorrow.

-- 
Hans van Kranenburg


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-11 Thread Hans van Kranenburg

On 06/10/2016 07:07 PM, Henk Slager wrote:
> On Thu, Jun 9, 2016 at 5:41 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Hans van Kranenburg posted on Thu, 09 Jun 2016 01:10:46 +0200 as
>> excerpted:
>>
>>> The next question is what files these extents belong to. To find out, I
>>> need to open up the extent items I get back and follow a backreference
>>> to an inode object. Might do that tomorrow, fun.
>>>
>>> To be honest, I suspect /var/log and/or the file storage of mailman to
>>> be the cause of the fragmentation, since there's logging from postfix,
>>> mailman and nginx going on all day long in a slow but steady tempo.
>>> While using btrfs for a number of use cases at work now, we normally
>>> don't use it for the root filesystem. And the cases where it's used as
>>> root filesystem don't do much logging or mail.
>>
>> FWIW, that's one reason I have a dedicated partition (and filesystem) for
>> logs, here.  (The other reason is that should something go runaway log-
>> spewing, I get a warning much sooner when my log filesystem fills up, not
>> much later, with much worse implications, when the main filesystem fills
>> up!)

Well, there it is:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2016-06-11-extents_ichiban_77621886976.txt

Playing around a bit with the search ioctl:
https://github.com/knorrie/btrfs-heatmap/blob/master/chunk-contents.py

This is clearly primarily logging and mailman mbox files. All kinds of
small extents, and a huge amount of fragmented free space in between.

>>> And no, autodefrag is not in the mount options currently. Would that be
>>> helpful in this case?
>>
>> It should be helpful, yes.  Be aware that autodefrag works best with
>> smaller (sub-half-gig) files, however, and that it used to cause
>> performance issues with larger database and VM files, in particular.
>
> I don't know why you relate filesize and autodefrag. Maybe because you
> say '... used to cause ...'.

Log files grow to a few tens of MBs and logrotate will copy the contents
into gzipped files (defragging everything as a side effect) every once
in a while, so the only concern is the current logfiles.

> autodefrag detects random writes and then tries to defrag a certain
> range. Its scope size is 256K as far as I see from the code and over
> time you see VM images that are on a btrfs fs (CoW, hourly ro
> snapshots) having a lot of 256K (or a bit less) sized extents
> according to what filefrag reports. I once wanted to try and change
> the 256K to 1M or even 4M, but I haven't come to that.
> A 32G VM image would consist of 131072 extents for 256K, 32768 extents
> for 1M, 8192 extents for 4M.

Aha.

>> There used to be a warning on the wiki about that, that was recently
>> removed, so apparently it's not the issue that it was, but you might wish
>> to monitor any databases or VMs with gig-plus files to see if it's going
>> to be a performance issue, once you turn on autodefrag.
>
> For very active databases, I don't know what the effects are, with or
> without autodefrag (either on SSD and/or HDD).
> At least on HDD-only, so no persistent SSD caching and noautodefrag,
> VMs will result in unacceptable performance soon.

>> The other issue with autodefrag is that if it hasn't been on and things
>> are heavily fragmented, it can at first drive down performance as it
>> rewrites all these heavily fragmented files, until it catches up and is
>> mostly dealing only with the normal refragmentation load.
>
> I assume you mean that one only gets a performance drop if you
> actually do new writes to the fragmented files since autodefrag on. It
> shouldn't start defragging by itself AFAIK.

As far as I understand, it only considers new writes yes.

So I can manually defrag the mbox files (which get data appended slowly
all the time) and turn on autodefrag, which will also take care of the
log files, and after the next logrotate, all old fragmented extents will
be freed.

>> Of course the
>> best way around that is to run autodefrag from the first time you mount
>> the filesystem and start writing to it, so it never gets overly
>> fragmented in the first place.  For a currently in-use and highly
>> fragmented filesystem, you have two choices, either backup and do a fresh
>> mkfs.btrfs so you can start with a clean filesystem and autodefrag from
>> the beginning, or doing manual defrag.
>>
>> However, be aware that if you have snapshots locking down the old extents
>> in their fragmented form, a manual defrag will copy the data to new
>> extents without releasing the old ones as they're locked in place by the
>> snapshots, thus using additional space.  Worse, if the filesystem is
>> already heavily fragmented and snapshots are locking most of those
>> fragments in place, defrag likely won't help a lot, because the free
>> space as well will be heavily fragmented.   So starting off with a clean
>> and new filesystem and using autodefrag from the beginning really is your
>> best bet.

No snapshots here.

> If it is about multi-TB fs, I think most important is to have enough
> unfragmented free space available and hopefully at the beginning of
> the device if it is flat HDD. Maybe a  

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-10 Thread Henk Slager
On Thu, Jun 9, 2016 at 5:41 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Hans van Kranenburg posted on Thu, 09 Jun 2016 01:10:46 +0200 as
> excerpted:
>
>> The next question is what files these extents belong to. To find out, I
>> need to open up the extent items I get back and follow a backreference
>> to an inode object. Might do that tomorrow, fun.
>>
>> To be honest, I suspect /var/log and/or the file storage of mailman to
>> be the cause of the fragmentation, since there's logging from postfix,
>> mailman and nginx going on all day long in a slow but steady tempo.
>> While using btrfs for a number of use cases at work now, we normally
>> don't use it for the root filesystem. And the cases where it's used as
>> root filesystem don't do much logging or mail.
>
> FWIW, that's one reason I have a dedicated partition (and filesystem) for
> logs, here.  (The other reason is that should something go runaway log-
> spewing, I get a warning much sooner when my log filesystem fills up, not
> much later, with much worse implications, when the main filesystem fills
> up!)
>
>> And no, autodefrag is not in the mount options currently. Would that be
>> helpful in this case?
>
> It should be helpful, yes.  Be aware that autodefrag works best with
> smaller (sub-half-gig) files, however, and that it used to cause
> performance issues with larger database and VM files, in particular.

I don't know why you relate filesize and autodefrag. Maybe because you
say '... used to cause ...'.

autodefrag detects random writes and then tries to defrag a certain
range. Its scope size is 256K as far as I see from the code and over
time you see VM images that are on a btrfs fs (CoW, hourly ro
snapshots) having a lot of 256K (or a bit less) sized extents
according to what filefrag reports. I once wanted to try and change
the 256K to 1M or even 4M, but I haven't  come to that.
A 32G VM image would consist of 131072 extents for 256K, 32768 extents
for 1M, 8192 extents for 4M.
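Those extent counts follow directly from the arithmetic (a quick check, assuming every extent ends up exactly at the defrag scope size):

```python
GIB = 1024 ** 3
KIB = 1024
image_size = 32 * GIB   # the 32G VM image from the example

for scope in (256 * KIB, 1024 * KIB, 4096 * KIB):
    extents = image_size // scope
    print(f"scope {scope // KIB:>4} KiB -> {extents} extents")
# -> 131072, 32768 and 8192 extents respectively
```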

> There used to be a warning on the wiki about that, that was recently
> removed, so apparently it's not the issue that it was, but you might wish
> to monitor any databases or VMs with gig-plus files to see if it's going
> to be a performance issue, once you turn on autodefrag.

For very active databases, I don't know what the effects are, with or
without autodefrag ( either on SSD and/or HDD).
At least on HDD-only, so no persistent SSD caching and noautodefrag,
VMs will result in unacceptable performance soon.

> The other issue with autodefrag is that if it hasn't been on and things
> are heavily fragmented, it can at first drive down performance as it
> rewrites all these heavily fragmented files, until it catches up and is
> mostly dealing only with the normal refragmentation load.

I assume you mean that one only gets a performance drop if you
actually do new writes to the fragmented files since autodefrag on. It
shouldn't start defragging by itself AFAIK.

> Of course the
> best way around that is to run autodefrag from the first time you mount
> the filesystem and start writing to it, so it never gets overly
> fragmented in the first place.  For a currently in-use and highly
> fragmented filesystem, you have two choices, either backup and do a fresh
> mkfs.btrfs so you can start with a clean filesystem and autodefrag from
> the beginning, or doing manual defrag.
>
> However, be aware that if you have snapshots locking down the old extents
> in their fragmented form, a manual defrag will copy the data to new
> extents without releasing the old ones as they're locked in place by the
> snapshots, thus using additional space.  Worse, if the filesystem is
> already heavily fragmented and snapshots are locking most of those
> fragments in place, defrag likely won't help a lot, because the free
> space as well will be heavily fragmented.   So starting off with a clean
> and new filesystem and using autodefrag from the beginning really is your
> best bet.

If it is about multi-TB fs, I think most important is to have enough
unfragmented free space available and hopefully at the beginning of
the device if it is flat HDD. Maybe a  balance -ddrange=1M..<20% of
device> can do that, I haven't tried.
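Spelled out, such a drange-filtered balance could look like this (sizes purely illustrative for a 1 TiB disk; suffix support per btrfs-balance(8), fall back to plain byte offsets if your btrfs-progs version rejects them):

```shell
# Relocate data block groups that overlap the first ~200 GiB of the
# device, freeing up the low-offset region:
btrfs balance start -ddrange=1M..200G /mnt
```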


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-09 Thread Chris Murphy
On Wed, Jun 8, 2016 at 5:10 PM, Hans van Kranenburg
 wrote:
> Hi list,
>
>
> On 05/31/2016 03:36 AM, Qu Wenruo wrote:
>>
>>
>>
>> Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
>>>
>>> Hi,
>>>
>>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>>> somewhere that shows interesting behaviour: while no interesting disk
>>> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>>>
>>> A picture, telling more than 1000 words:
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>>> (when the amount of allocated/unused goes down, I did a btrfs balance)
>>
>>
>> Nice picture.
>> Really better than 1000 words.
>>
>> AFAIK, the problem may be caused by fragments.
>>
>> And even I saw some early prototypes inside the codes to allow btrfs do
>> allocation smaller extent than required.
>> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)
>>
>> But it's still prototype and seems no one is really working on it now.
>>
>> So when btrfs is writing new data, for example, to write about 16M data,
>> it will need to allocate a 16M continuous extent, and if it can't find
>> large enough space to allocate, then create a new data chunk.
>>
>> Despite the already awesome chunk level usage picture, I hope there is
>> info about extent level allocation to confirm my assumption.
>>
>> You could dump it by calling "btrfs-debug-tree -t 2 ".
>> It's normally recommended to do it unmounted, but it's still possible to
>> call it mounted, although not 100% perfect though.
>> (Then I'd better find a good way to draw a picture of
>> allocate/unallocate space and how fragments the chunks are)
>
>
> So, I finally found some spare time to continue investigating. In the
> meantime, the filesystem has happily been allocating new chunks every few
> days, filling them up way below 10% with data before starting a new one.
>
> The chunk allocation primarily seems to happen during cron.daily. But,
> manually executing all the cronjobs that are in there, even multiple times,
> does not result in newly allocated chunks. Yay. :(
>
> After the previous post, I put a little script in between every two jobs in
> /etc/cron.daily that prints the output of btrfs fi df to syslog and sleeps
> for 10 minutes so I can easily find out afterwards during which one it
> happened.
>
> Bingo! The "apt" cron.daily, which refreshes package lists and triggers
> unattended-upgrades.
>
> Jun 7 04:01:46 ichiban root: Data, single: total=12.00GiB, used=5.65GiB
> [...]
> 2016-06-07 04:01:56,552 INFO Starting unattended upgrades script
> [...]
> Jun 7 04:12:10 ichiban root: Data, single: total=13.00GiB, used=5.64GiB
>
> And, this thing is clever enough to do things once a day, even if you would
> execute it multiple times... (Hehehe...)
>
> Ok, let's try doing some apt-get update then.
>
> Today, the latest added chunks look like this:
>
> # ./show_usage.py /
> [...]
> chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length
> 1073741824 used 115499008 used_pct 10
> chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 length
> 1073741824 used 36585472 used_pct 3
> chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 length
> 1073741824 used 17510400 used_pct 1
> chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length
> 268435456 used 72663040 used_pct 27
> chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 length
> 1073741824 used 86986752 used_pct 8
> chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 length
> 1073741824 used 21188608 used_pct 1
> chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 length
> 1073741824 used 64032768 used_pct 5
> chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 length
> 1073741824 used 71712768 used_pct 6
>
> Now I apt-get update...
>
> before: Data, single: total=13.00GiB, used=5.64GiB
> during: Data, single: total=13.00GiB, used=5.59GiB
> after : Data, single: total=14.00GiB, used=5.64GiB
>
> # ./show_usage.py /
> [...]
> chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length
> 1073741824 used 119279616 used_pct 11
> chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 length
> 1073741824 used 36585472 used_pct 3
> chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 length
> 1073741824 used 17510400 used_pct 1
> chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length
> 268435456 used 73170944 used_pct 27
> chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 length
> 1073741824 used 82251776 used_pct 7
> chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 length
> 1073741824 used 21188608 used_pct 1
> chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 length
> 1073741824 used 6041600 used_pct 0
> chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 length
> 1073741824 used 46178304 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-09 Thread Duncan
Hans van Kranenburg posted on Thu, 09 Jun 2016 01:10:46 +0200 as
excerpted:

> The next question is what files these extents belong to. To find out, I
> need to open up the extent items I get back and follow a backreference
> to an inode object. Might do that tomorrow, fun.
> 
> To be honest, I suspect /var/log and/or the file storage of mailman to
> be the cause of the fragmentation, since there's logging from postfix,
> mailman and nginx going on all day long in a slow but steady tempo.
> While using btrfs for a number of use cases at work now, we normally
> don't use it for the root filesystem. And the cases where it's used as
> root filesystem don't do much logging or mail.

FWIW, that's one reason I have a dedicated partition (and filesystem) for 
logs, here.  (The other reason is that should something go runaway log-
spewing, I get a warning much sooner when my log filesystem fills up, not 
much later, with much worse implications, when the main filesystem fills 
up!)

> And no, autodefrag is not in the mount options currently. Would that be
> helpful in this case?

It should be helpful, yes.  Be aware that autodefrag works best with 
smaller (sub-half-gig) files, however, and that it used to cause 
performance issues with larger database and VM files, in particular.  
There used to be a warning on the wiki about that, that was recently 
removed, so apparently it's not the issue that it was, but you might wish 
to monitor any databases or VMs with gig-plus files to see if it's going 
to be a performance issue, once you turn on autodefrag.

The other issue with autodefrag is that if it hasn't been on and things 
are heavily fragmented, it can at first drive down performance as it 
rewrites all these heavily fragmented files, until it catches up and is 
mostly dealing only with the normal refragmentation load.  Of course the 
best way around that is to run autodefrag from the first time you mount 
the filesystem and start writing to it, so it never gets overly 
fragmented in the first place.  For a currently in-use and highly 
fragmented filesystem, you have two choices, either backup and do a fresh 
mkfs.btrfs so you can start with a clean filesystem and autodefrag from 
the beginning, or doing manual defrag.

However, be aware that if you have snapshots locking down the old extents 
in their fragmented form, a manual defrag will copy the data to new 
extents without releasing the old ones as they're locked in place by the 
snapshots, thus using additional space.  Worse, if the filesystem is 
already heavily fragmented and snapshots are locking most of those 
fragments in place, defrag likely won't help a lot, because the free 
space as well will be heavily fragmented.   So starting off with a clean 
and new filesystem and using autodefrag from the beginning really is your 
best bet.
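As a sketch of the manual route, the steps above might look like the 
following. The mount point, snapshot path, and the defrag/balance size 
thresholds are illustrative assumptions, not values from this thread, and 
RUN=echo keeps it a dry run that only prints the commands:

```shell
#!/bin/sh
# Dry-run sketch of the manual defrag option described above.
# Paths and thresholds are hypothetical; set RUN= (empty) to really execute.
RUN=echo

# 1. Remove snapshots pinning old extents, or defrag will duplicate data
#    into new extents while the snapshots keep the fragmented ones alive.
$RUN btrfs subvolume delete /mnt/data/.snapshots/daily-2016-06-01

# 2. Recursively defragment the live files, targeting 32M extents.
$RUN btrfs filesystem defragment -r -t 32M /mnt/data

# 3. Compact the now mostly-empty chunks back into fewer full ones.
$RUN btrfs balance start -dusage=20 /mnt/data
```

Run with RUN left at echo first and read the printed commands before 
pointing it at a real filesystem.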


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-09 Thread Hans van Kranenburg

On 06/09/2016 10:52 AM, Marc Haber wrote:

On Thu, Jun 09, 2016 at 01:10:46AM +0200, Hans van Kranenburg wrote:

So, instead of being the cause, apt-get update causing a new chunk to be
allocated might as well be the result of existing ones already filled up
with too many fragments.

The next question is what files these extents belong to. To find out, I need
to open up the extent items I get back and follow a backreference to an
inode object. Might do that tomorrow, fun.


Does your apt use pdiffs to update the packages lists? If yes, I'd try
turning it off just for the fun of it and to see whether this changes
btrfs' allocation behavior. I have never looked at apt's pdiff stuff
in detail, but I guess that it creates many tiny temporary files.


No, it does not:

Acquire::Pdiffs "false";
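(For reference, that fragment lives in a file under /etc/apt/apt.conf.d/; 
the file name below is just a conventional choice, and apt option names are 
case-insensitive:)

```
# /etc/apt/apt.conf.d/99no-pdiffs -- any name under apt.conf.d works.
# Verify the effective value with:  apt-config dump | grep -i pdiff
Acquire::PDiffs "false";
```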

--
Hans van Kranenburg - System / Network Engineer
Mendix | Driving Digital Innovation | www.mendix.com


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-09 Thread Marc Haber
On Thu, Jun 09, 2016 at 01:10:46AM +0200, Hans van Kranenburg wrote:
> So, instead of being the cause, apt-get update causing a new chunk to be
> allocated might as well be the result of existing ones already filled up
> with too many fragments.
> 
> The next question is what files these extents belong to. To find out, I need
> to open up the extent items I get back and follow a backreference to an
> inode object. Might do that tomorrow, fun.

Does your apt use pdiffs to update the packages lists? If yes, I'd try
turning it off just for the fun of it and to see whether this changes
btrfs' allocation behavior. I have never looked at apt's pdiff stuff
in detail, but I guess that it creates many tiny temporary files.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-08 Thread Hans van Kranenburg

Hi list,

On 05/31/2016 03:36 AM, Qu Wenruo wrote:



Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:

Hi,

I've got a mostly inactive btrfs filesystem inside a virtual machine
somewhere that shows interesting behaviour: while no interesting disk
activity is going on, btrfs keeps allocating new chunks, a GiB at a time.

A picture, telling more than 1000 words:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
(when the amount of allocated/unused goes down, I did a btrfs balance)


Nice picture.
Really better than 1000 words.

AFAIK, the problem may be caused by fragmentation.

I've even seen some early prototypes in the code to allow btrfs to
allocate smaller extents than requested.
(E.g. the caller needs a 2M extent, but btrfs returns two 1M extents.)

But it's still a prototype and it seems no one is really working on it now.

So when btrfs writes new data, for example about 16M of it, it needs to
allocate a 16M contiguous extent, and if it can't find a large enough
free region, it creates a new data chunk.

Despite the already awesome chunk-level usage picture, I hope there is
info about extent-level allocation to confirm my assumption.

You could dump it by calling "btrfs-debug-tree -t 2 <device>".
It's normally recommended to do this unmounted, but it's still possible
on a mounted filesystem, although not 100% accurate.
(Then I'd better find a good way to draw a picture of
allocated/unallocated space and how fragmented the chunks are)


So, I finally found some spare time to continue investigating. In the 
meantime, the filesystem has happily been allocating new chunks every 
few days, filling them up way below 10% with data before starting a new one.


The chunk allocation primarily seems to happen during cron.daily. But, 
manually executing all the cronjobs that are in there, even multiple 
times, does not result in newly allocated chunks. Yay. :(


After the previous post, I put a little script in between every two jobs 
in /etc/cron.daily that prints the output of btrfs fi df to syslog and 
sleeps for 10 minutes so I can easily find out afterwards during which 
one it happened.


Bingo! The "apt" cron.daily, which refreshes package lists and triggers 
unattended-upgrades.


Jun 7 04:01:46 ichiban root: Data, single: total=12.00GiB, used=5.65GiB
[...]
2016-06-07 04:01:56,552 INFO Starting unattended upgrades script
[...]
Jun 7 04:12:10 ichiban root: Data, single: total=13.00GiB, used=5.64GiB

And, this thing is clever enough to do things only once a day, even if you 
execute it multiple times... (Hehehe...)


Ok, let's try doing some apt-get update then.

Today, the latest added chunks look like this:

# ./show_usage.py /
[...]
chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length 
1073741824 used 115499008 used_pct 10
chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 
length 1073741824 used 36585472 used_pct 3
chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 
length 1073741824 used 17510400 used_pct 1
chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length 
268435456 used 72663040 used_pct 27
chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 
length 1073741824 used 86986752 used_pct 8
chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 
length 1073741824 used 21188608 used_pct 1
chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 
length 1073741824 used 64032768 used_pct 5
chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 
length 1073741824 used 71712768 used_pct 6


Now I apt-get update...

before: Data, single: total=13.00GiB, used=5.64GiB
during: Data, single: total=13.00GiB, used=5.59GiB
after : Data, single: total=14.00GiB, used=5.64GiB

# ./show_usage.py /
[...]
chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length 
1073741824 used 119279616 used_pct 11
chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 
length 1073741824 used 36585472 used_pct 3
chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 
length 1073741824 used 17510400 used_pct 1
chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length 
268435456 used 73170944 used_pct 27
chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 
length 1073741824 used 82251776 used_pct 7
chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 
length 1073741824 used 21188608 used_pct 1
chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 
length 1073741824 used 6041600 used_pct 0
chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 
length 1073741824 used 46178304 used_pct 4
chunk vaddr 71280099328 type 1 stripe 0 devid 1 offset 19595788288 
length 1073741824 used 84770816 used_pct 7


Interesting. There's a new one at 71280099328, 7% filled, and the usage 
of the 4 previous ones went down a bit.
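(Comparing two dumps by eye gets tedious; a small parser over the 
show_usage.py output format shown above can automate it. The line format 
is copied from the dumps; the helper itself is just an illustration, not 
part of show_usage.py:)

```python
import re

# One record per chunk, matching the show_usage.py output above; records
# wrap across lines in the mail, so whitespace is normalized first.
CHUNK_RE = re.compile(
    r'chunk vaddr (\d+) type (\d+) stripe \d+ devid \d+ offset \d+ '
    r'length (\d+) used (\d+) used_pct \d+')

def parse_chunks(dump):
    """Map chunk vaddr -> (type, length, used bytes)."""
    text = ' '.join(dump.split())
    return {int(m.group(1)): (int(m.group(2)), int(m.group(3)),
                              int(m.group(4)))
            for m in CHUNK_RE.finditer(text)}

def diff_chunks(before, after):
    """Yield (vaddr, what, bytes) for chunks that appeared or changed."""
    for vaddr, (_, _, used) in sorted(after.items()):
        if vaddr not in before:
            yield vaddr, 'new', used
        elif before[vaddr][2] != used:
            yield vaddr, 'changed', used - before[vaddr][2]
```

Fed the two dumps above, it would flag 71280099328 as new and show 
69132615680 dropping by roughly 55 MiB.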


Now I want to know what the distribution of data inside these chunks is, 
to find 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-05-30 Thread Qu Wenruo



Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:

Hi,

I've got a mostly inactive btrfs filesystem inside a virtual machine
somewhere that shows interesting behaviour: while no interesting disk
activity is going on, btrfs keeps allocating new chunks, a GiB at a time.

A picture, telling more than 1000 words:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
(when the amount of allocated/unused goes down, I did a btrfs balance)


Nice picture.
Really better than 1000 words.

AFAIK, the problem may be caused by fragmentation.

I've even seen some early prototypes in the code to allow btrfs to 
allocate smaller extents than requested.

(E.g. the caller needs a 2M extent, but btrfs returns two 1M extents.)

But it's still a prototype and it seems no one is really working on it now.

So when btrfs writes new data, for example about 16M of it, it needs to 
allocate a 16M contiguous extent, and if it can't find a large enough 
free region, it creates a new data chunk.


Despite the already awesome chunk-level usage picture, I hope there is 
info about extent-level allocation to confirm my assumption.


You could dump it by calling "btrfs-debug-tree -t 2 <device>".
It's normally recommended to do this unmounted, but it's still possible 
on a mounted filesystem, although not 100% accurate.
(Then I'd better find a good way to draw a picture of 
allocated/unallocated space and how fragmented the chunks are)


Thanks,
Qu


Linux ichiban 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1
(2016-04-20) x86_64 GNU/Linux

# btrfs fi show /
Label: none  uuid: 9881fc30-8f69-4069-a8c8-c057b842b0c4
Total devices 1 FS bytes used 6.17GiB
devid1 size 20.00GiB used 16.54GiB path /dev/xvda

# btrfs fi df /
Data, single: total=15.01GiB, used=5.16GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=1.50GiB, used=1.01GiB
GlobalReserve, single: total=144.00MiB, used=0.00B

I'm a bit puzzled, since I haven't seen this happening on other
filesystems that use 4.4 or 4.5 kernels.

If I dump the allocated chunks and their % usage, it's clear that the
last 6 newly added ones have a usage of only a few percent.

dev item devid 1 total bytes 21474836480 bytes used 17758683136
chunk vaddr 12582912 type 1 stripe 0 devid 1 offset 12582912 length
8388608 used 4276224 used_pct 50
chunk vaddr 1103101952 type 1 stripe 0 devid 1 offset 2185232384 length
1073741824 used 433127424 used_pct 40
chunk vaddr 3250585600 type 1 stripe 0 devid 1 offset 4332716032 length
1073741824 used 764391424 used_pct 71
chunk vaddr 9271508992 type 1 stripe 0 devid 1 offset 12079595520 length
1073741824 used 270704640 used_pct 25
chunk vaddr 12492734464 type 1 stripe 0 devid 1 offset 13153337344
length 1073741824 used 866574336 used_pct 80
chunk vaddr 13566476288 type 1 stripe 0 devid 1 offset 11005853696
length 1073741824 used 1028059136 used_pct 95
chunk vaddr 14640218112 type 1 stripe 0 devid 1 offset 3258974208 length
1073741824 used 762466304 used_pct 71
chunk vaddr 26250051584 type 1 stripe 0 devid 1 offset 19595788288
length 1073741824 used 114982912 used_pct 10
chunk vaddr 31618760704 type 1 stripe 0 devid 1 offset 15300820992
length 1073741824 used 488902656 used_pct 45
chunk vaddr 32692502528 type 4 stripe 0 devid 1 offset 5406457856 length
268435456 used 209272832 used_pct 77
chunk vaddr 32960937984 type 4 stripe 0 devid 1 offset 5943328768 length
268435456 used 251199488 used_pct 93
chunk vaddr 33229373440 type 4 stripe 0 devid 1 offset 7419723776 length
268435456 used 248709120 used_pct 92
chunk vaddr 33497808896 type 4 stripe 0 devid 1 offset 8896118784 length
268435456 used 247791616 used_pct 92
chunk vaddr 33766244352 type 4 stripe 0 devid 1 offset 8627683328 length
268435456 used 93061120 used_pct 34
chunk vaddr 34303115264 type 2 stripe 0 devid 1 offset 6748635136 length
33554432 used 16384 used_pct 0
chunk vaddr 34336669696 type 1 stripe 0 devid 1 offset 16374562816
length 1073741824 used 105054208 used_pct 9
chunk vaddr 35410411520 type 1 stripe 0 devid 1 offset 20971520 length
1073741824 used 10899456 used_pct 1
chunk vaddr 36484153344 type 1 stripe 0 devid 1 offset 1094713344 length
1073741824 used 441778176 used_pct 41
chunk vaddr 37557895168 type 4 stripe 0 devid 1 offset 5674893312 length
268435456 used 33439744 used_pct 12
chunk vaddr 37826330624 type 1 stripe 0 devid 1 offset 9164554240 length
1073741824 used 32096256 used_pct 2
chunk vaddr 38900072448 type 1 stripe 0 devid 1 offset 14227079168
length 1073741824 used 40140800 used_pct 3
chunk vaddr 39973814272 type 1 stripe 0 devid 1 offset 17448304640
length 1073741824 used 58093568 used_pct 5
chunk vaddr 41047556096 type 1 stripe 0 devid 1 offset 18522046464
length 1073741824 used 119701504 used_pct 11

The only things this host does is
 1) being a webserver for a small internal debian packages repository
 2) running low-volume mailman with a few lists, no archive-gzipping
mega cronjobs or anything enabled.
 3) some little legacy php 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-05-30 Thread Duncan
Hans van Kranenburg posted on Mon, 30 May 2016 23:18:20 +0200 as
excerpted:

>> Snip the dump, but curious as a user (not a dev) what command you used.
>> Presumably one of the debug commands which I'm not particularly
>> familiar with, but I wasn't aware it was even possible.
> 
> It's the output of a little programming exercise calling the search
> ioctl from python. https://github.com/knorrie/btrfs-heatmap
> 
> While using balance I got interested in knowing where balance got the
> information from to find how much % a chunk is used. I want to see that
> list in advance, so I can see what -dusage the most effective would be.
> My munin graphs show the stacked total value, which does not give you an
> idea about how badly the unused space is fragmented over already
> allocated chunks.
> 
> So, with some help of Hugo on IRC to get started, I ended up with this
> PoC, which can create nice movies of your data moving around over the
> physical space of the filesystem over time, like this one:
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/heatmap.gif
> 
> Seeing the chunk allocator work its way around the two devices, choosing
> the one with the most free space, and reusing the gaps left by balance
> is super interesting. :-]

Very cool indeed.  Reminds me of the nice eye-candy dynamic graphics 
that MS defrag had back in the 9x days.  (I've no idea what they have now, 
as I've been off the platform for a decade and a half.)

I may have to play with it a bit, when I have more time (I'm moving in a 
couple days...).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-05-30 Thread Hans van Kranenburg

On 05/30/2016 09:55 PM, Duncan wrote:

Hans van Kranenburg posted on Mon, 30 May 2016 13:07:26 +0200 as
excerpted:

[Please don't post "upside down".  Reply in context under the quoted
point, here the whole post, you're replying to.  It makes further replies
in context far easier. =:^)  I've pasted your update at the bottom here.]


Sure, thanks.


On 05/06/2016 11:28 PM, Hans van Kranenburg wrote:


I've got a mostly inactive btrfs filesystem inside a virtual machine
somewhere that shows interesting behaviour: while no interesting disk
activity is going on, btrfs keeps allocating new chunks, a GiB at a
time.

A picture, telling more than 1000 words:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
(when the amount of allocated/unused goes down, I did a btrfs balance)


Agreed, that shows something strange going on.


Linux ichiban 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1
(2016-04-20) x86_64 GNU/Linux


So the kernel is/was current...


Running a slightly newer one now:

Linux ichiban 4.5.0-0.bpo.2-amd64 #1 SMP Debian 4.5.4-1~bpo8+1 
(2016-05-13) x86_64



# btrfs fi show /
Label: none  uuid: 9881fc30-8f69-4069-a8c8-c057b842b0c4
  Total devices 1 FS bytes used 6.17GiB
  devid1 size 20.00GiB used 16.54GiB path /dev/xvda

# btrfs fi df /
Data, single: total=15.01GiB, used=5.16GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=1.50GiB, used=1.01GiB
GlobalReserve, single: total=144.00MiB, used=0.00B

I'm a bit puzzled, since I haven't seen this happening on other
filesystems that use 4.4 or 4.5 kernels.


Nor have I, either reported (save for you) or personally.


If I dump the allocated chunks and their % usage, it's clear that the
last 6 newly added ones have a usage of only a few percent.


Snip the dump, but curious as a user (not a dev) what command you used.
Presumably one of the debug commands which I'm not particularly familiar
with, but I wasn't aware it was even possible.


It's the output of a little programming exercise calling the search 
ioctl from python. https://github.com/knorrie/btrfs-heatmap


While using balance I got interested in knowing where balance got the 
information from to find how much % a chunk is used. I want to see that 
list in advance, so I can see what -dusage the most effective would be. 
My munin graphs show the stacked total value, which does not give you an 
idea about how badly the unused space is fragmented over already 
allocated chunks.


So, with some help of Hugo on IRC to get started, I ended up with this 
PoC, which can create nice movies of your data moving around over the 
physical space of the filesystem over time, like this one:


https://syrinx.knorrie.org/~knorrie/btrfs/heatmap.gif

Seeing the chunk allocator work its way around the two devices, choosing 
the one with the most free space, and reusing the gaps left by balance 
is super interesting. :-]



The only things this host does is
   1) being a webserver for a small internal debian packages repository
   2) running low-volume mailman with a few lists, no archive-gzipping
mega cronjobs or anything enabled.
   3) some little legacy php thingies

Interesting fact is that most of the 1GiB increases happen at the same
time as cron.daily runs. However, there's only a few standard things in
there. An occasional package upgrade by unattended-upgrade, or some
logrotate. The total contents of /var/log/ together is only 66MB...
Graphs show only less than about 100 MB reads/writes in total around
this time.


The cron.daily timing is interesting.  I'll come back to that below.


Well, it obviously has a very large sign saying "LOOK HERE" directly 
next to it, yes.



As you can see in the graph the amount of used space is even
decreasing, because I cleaned up a bunch of old packages in the
repository, and still, btrfs keeps allocating new data chunks like a
hungry beast.

Why would this happen?



since it didn't get any followup and since I'm bold enough to bump it one more
time... :)

I really don't understand the behaviour I described. Does it ring a bell
with anyone? This system is still allocating new 1GB data chunks every 1
or 2 days without using them at all, and I have to use balance every
week to get them away again.


Honestly I can only guess, and it's a new guess I didn't think of the
first time around, thus my lack of response the first time around.  But
lacking anyone else replying with better theories, given that I do have a
guess, I might as well put it out there.

Is it possible something in that daily cron allocates/writes a large but
likely sparse file, perhaps a gig or more, probably fsyncing to lock the
large size in place, then truncates it to actual size, which might be
only a few kilobytes?

That sort of behavior could at least in theory trigger the behavior you
describe, tho not being a dev and not being a Linux filesystem behavior
expert by any means, I'm admittedly fuzzy on exactly what details might
translate that theory into 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-05-30 Thread Duncan
Hans van Kranenburg posted on Mon, 30 May 2016 13:07:26 +0200 as
excerpted:

[Please don't post "upside down".  Reply in context under the quoted 
point, here the whole post, you're replying to.  It makes further replies 
in context far easier. =:^)  I've pasted your update at the bottom here.]

> On 05/06/2016 11:28 PM, Hans van Kranenburg wrote:
>>
>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>> somewhere that shows interesting behaviour: while no interesting disk
>> activity is going on, btrfs keeps allocating new chunks, a GiB at a
>> time.
>>
>> A picture, telling more than 1000 words:
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>> (when the amount of allocated/unused goes down, I did a btrfs balance)

Agreed, that shows something strange going on.

>> Linux ichiban 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1
>> (2016-04-20) x86_64 GNU/Linux

So the kernel is/was current...

>> # btrfs fi show /
>> Label: none  uuid: 9881fc30-8f69-4069-a8c8-c057b842b0c4
>>  Total devices 1 FS bytes used 6.17GiB
>>  devid1 size 20.00GiB used 16.54GiB path /dev/xvda
>>
>> # btrfs fi df /
>> Data, single: total=15.01GiB, used=5.16GiB
>> System, single: total=32.00MiB, used=16.00KiB
>> Metadata, single: total=1.50GiB, used=1.01GiB
>> GlobalReserve, single: total=144.00MiB, used=0.00B
>>
>> I'm a bit puzzled, since I haven't seen this happening on other
>> filesystems that use 4.4 or 4.5 kernels.

Nor have I, either reported (save for you) or personally.

>> If I dump the allocated chunks and their % usage, it's clear that the
>> last 6 newly added ones have a usage of only a few percent.

Snip the dump, but curious as a user (not a dev) what command you used.  
Presumably one of the debug commands which I'm not particularly familiar 
with, but I wasn't aware it was even possible.

>> The only things this host does is
>>   1) being a webserver for a small internal debian packages repository
>>   2) running low-volume mailman with a few lists, no archive-gzipping
>> mega cronjobs or anything enabled.
>>   3) some little legacy php thingies
>>
>> Interesting fact is that most of the 1GiB increases happen at the same
>> time as cron.daily runs. However, there's only a few standard things in
>> there. An occasional package upgrade by unattended-upgrade, or some
>> logrotate. The total contents of /var/log/ together is only 66MB...
>> Graphs show only less than about 100 MB reads/writes in total around
>> this time.

The cron.daily timing is interesting.  I'll come back to that below.

>> As you can see in the graph the amount of used space is even
>> decreasing, because I cleaned up a bunch of old packages in the
>> repository, and still, btrfs keeps allocating new data chunks like a
>> hungry beast.
>>
>> Why would this happen?

> since it didn't get any followup and since I'm bold enough to bump it one more
> time... :)
> 
> I really don't understand the behaviour I described. Does it ring a bell
> with anyone? This system is still allocating new 1GB data chunks every 1
> or 2 days without using them at all, and I have to use balance every
> week to get them away again.

Honestly I can only guess, and it's a new guess I didn't think of the 
first time around, thus my lack of response the first time around.  But 
lacking anyone else replying with better theories, given that I do have a 
guess, I might as well put it out there.

Is it possible something in that daily cron allocates/writes a large but 
likely sparse file, perhaps a gig or more, probably fsyncing to lock the 
large size in place, then truncates it to actual size, which might be 
only a few kilobytes?

That sort of behavior could at least in theory trigger the behavior you 
describe, tho not being a dev and not being a Linux filesystem behavior 
expert by any means, I'm admittedly fuzzy on exactly what details might 
translate that theory into the reality you're seeing.


In any event, my usual "brute force" approach to such mysteries is to 
bisect the problem space down until I know where the issue is.

First, try rescheduling your cron.daily run to a different time, and see 
if the behavior follows it, thus specifically tying it to something in 
that run.

Second, try either running all tasks it runs manually, checking which one 
triggers the problem, or if you have too many tasks for that to be 
convenient, split them into cron.daily1 and cron.daily2, scheduled at 
different times, bisecting the problem by seeing which one the behavior 
follows.

Repeat as needed until you've discovered the culprit, then examine 
exactly what it's doing to the filesystem.
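(The split step can be done mechanically; the directory names follow the 
cron.daily1/cron.daily2 suggestion above, and the demo operates on dummy 
job files in a temp directory so it is safe to try anywhere:)

```shell
#!/bin/sh
# Sketch of bisecting cron.daily as described above: alternate the job
# files between two directories, each then scheduled at a different time.
split_jobs() {
    src=$1 a=$2 b=$3 i=0
    mkdir -p "$a" "$b"
    for job in "$src"/*; do
        [ -f "$job" ] || continue
        i=$((i + 1))
        # Odd-numbered jobs go to the first half, even to the second.
        if [ $((i % 2)) -eq 1 ]; then mv "$job" "$a"; else mv "$job" "$b"; fi
    done
}

# Demo on dummy job names (the real thing would use /etc/cron.daily):
tmp=$(mktemp -d)
mkdir "$tmp/cron.daily"
for j in apt logrotate mailman man-db; do : > "$tmp/cron.daily/$j"; done
split_jobs "$tmp/cron.daily" "$tmp/cron.daily1" "$tmp/cron.daily2"
ls "$tmp/cron.daily1" "$tmp/cron.daily2"
```

Whichever half reproduces the chunk allocation gets split again, until a 
single job is left.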

And please report your results.  Besides satisfying my own personal 
curiosity, there's a fair chance someone else will have the same issue at 
some point and either post their own question, or discover this thread 
via google or whatever.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-05-30 Thread Hans van Kranenburg

Hi,

since it didn't get any followup and since I'm bold enough to bump it one more 
time... :)


I really don't understand the behaviour I described. Does it ring a bell 
with anyone? This system is still allocating new 1GB data chunks every 1 
or 2 days without using them at all, and I have to use balance every 
week to get them away again.


Hans

On 05/06/2016 11:28 PM, Hans van Kranenburg wrote:

Hi,

I've got a mostly inactive btrfs filesystem inside a virtual machine
somewhere that shows interesting behaviour: while no interesting disk
activity is going on, btrfs keeps allocating new chunks, a GiB at a time.

A picture, telling more than 1000 words:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
(when the amount of allocated/unused goes down, I did a btrfs balance)

Linux ichiban 4.5.0-0.bpo.1-amd64 #1 SMP Debian 4.5.1-1~bpo8+1
(2016-04-20) x86_64 GNU/Linux

# btrfs fi show /
Label: none  uuid: 9881fc30-8f69-4069-a8c8-c057b842b0c4
 Total devices 1 FS bytes used 6.17GiB
 devid1 size 20.00GiB used 16.54GiB path /dev/xvda

# btrfs fi df /
Data, single: total=15.01GiB, used=5.16GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=1.50GiB, used=1.01GiB
GlobalReserve, single: total=144.00MiB, used=0.00B

I'm a bit puzzled, since I haven't seen this happening on other
filesystems that use 4.4 or 4.5 kernels.

If I dump the allocated chunks and their % usage, it's clear that the
last 6 newly added ones have a usage of only a few percent.

dev item devid 1 total bytes 21474836480 bytes used 17758683136
chunk vaddr 12582912 type 1 stripe 0 devid 1 offset 12582912 length
8388608 used 4276224 used_pct 50
chunk vaddr 1103101952 type 1 stripe 0 devid 1 offset 2185232384 length
1073741824 used 433127424 used_pct 40
chunk vaddr 3250585600 type 1 stripe 0 devid 1 offset 4332716032 length
1073741824 used 764391424 used_pct 71
chunk vaddr 9271508992 type 1 stripe 0 devid 1 offset 12079595520 length
1073741824 used 270704640 used_pct 25
chunk vaddr 12492734464 type 1 stripe 0 devid 1 offset 13153337344
length 1073741824 used 866574336 used_pct 80
chunk vaddr 13566476288 type 1 stripe 0 devid 1 offset 11005853696
length 1073741824 used 1028059136 used_pct 95
chunk vaddr 14640218112 type 1 stripe 0 devid 1 offset 3258974208 length
1073741824 used 762466304 used_pct 71
chunk vaddr 26250051584 type 1 stripe 0 devid 1 offset 19595788288
length 1073741824 used 114982912 used_pct 10
chunk vaddr 31618760704 type 1 stripe 0 devid 1 offset 15300820992
length 1073741824 used 488902656 used_pct 45
chunk vaddr 32692502528 type 4 stripe 0 devid 1 offset 5406457856 length
268435456 used 209272832 used_pct 77
chunk vaddr 32960937984 type 4 stripe 0 devid 1 offset 5943328768 length
268435456 used 251199488 used_pct 93
chunk vaddr 33229373440 type 4 stripe 0 devid 1 offset 7419723776 length
268435456 used 248709120 used_pct 92
chunk vaddr 33497808896 type 4 stripe 0 devid 1 offset 8896118784 length
268435456 used 247791616 used_pct 92
chunk vaddr 33766244352 type 4 stripe 0 devid 1 offset 8627683328 length
268435456 used 93061120 used_pct 34
chunk vaddr 34303115264 type 2 stripe 0 devid 1 offset 6748635136 length
33554432 used 16384 used_pct 0
chunk vaddr 34336669696 type 1 stripe 0 devid 1 offset 16374562816
length 1073741824 used 105054208 used_pct 9
chunk vaddr 35410411520 type 1 stripe 0 devid 1 offset 20971520 length
1073741824 used 10899456 used_pct 1
chunk vaddr 36484153344 type 1 stripe 0 devid 1 offset 1094713344 length
1073741824 used 441778176 used_pct 41
chunk vaddr 37557895168 type 4 stripe 0 devid 1 offset 5674893312 length
268435456 used 33439744 used_pct 12
chunk vaddr 37826330624 type 1 stripe 0 devid 1 offset 9164554240 length
1073741824 used 32096256 used_pct 2
chunk vaddr 38900072448 type 1 stripe 0 devid 1 offset 14227079168
length 1073741824 used 40140800 used_pct 3
chunk vaddr 39973814272 type 1 stripe 0 devid 1 offset 17448304640
length 1073741824 used 58093568 used_pct 5
chunk vaddr 41047556096 type 1 stripe 0 devid 1 offset 18522046464
length 1073741824 used 119701504 used_pct 11
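(A dump like the one above is easier to scan when each chunk's used_pct is 
drawn as a bar; a throwaway sketch, reading the dump text as pasted, is not 
part of any tool mentioned in this thread:)

```python
def usage_bars(dump_text, width=20):
    """One '[####    ] NN%' bar per chunk, parsed from dump text in the
    'chunk vaddr ... used_pct N' format above (records may wrap lines)."""
    out = []
    # Normalize wrapped lines, then split on the per-record keyword.
    for rec in dump_text.replace('\n', ' ').split('chunk vaddr ')[1:]:
        fields = rec.split()
        vaddr = fields[0]
        pct = int(fields[fields.index('used_pct') + 1])
        filled = pct * width // 100
        out.append('%12s [%s%s] %3d%%' %
                   (vaddr, '#' * filled, ' ' * (width - filled), pct))
    return '\n'.join(out)

print(usage_bars("""\
chunk vaddr 12582912 type 1 stripe 0 devid 1 offset 12582912 length
8388608 used 4276224 used_pct 50
chunk vaddr 13566476288 type 1 stripe 0 devid 1 offset 11005853696
length 1073741824 used 1028059136 used_pct 95"""))
```

Piping the whole dump through it makes the handful of nearly-empty chunks 
stand out immediately.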

The only things this host does is
  1) being a webserver for a small internal debian packages repository
  2) running low-volume mailman with a few lists, no archive-gzipping
mega cronjobs or anything enabled.
  3) some little legacy php thingies

Interesting fact is that most of the 1GiB increases happen at the same
time as cron.daily runs. However, there's only a few standard things in
there. An occasional package upgrade by unattended-upgrade, or some
logrotate. The total contents of /var/log/ together is only 66MB...
Graphs show only less than about 100 MB reads/writes in total around
this time.

As you can see in the graph the amount of used space is even decreasing,
because I cleaned up a bunch of old packages in the repository, and
still, btrfs keeps allocating new data chunks like a hungry beast.

Why would this happen?

Hans van Kranenburg