Re: [PATCH] btrfs: scrub: use do_div() for 64-by-32 division

2017-04-08 Thread Adam Borowski
On Sat, Apr 08, 2017 at 11:07:37PM +0200, Adam Borowski wrote:
> Unbreaks ARM and possibly other 32-bit architectures.

Turns out those "other 32-bit architectures" happen to include i386.

A modular build:

ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

With the patch, i386 builds fine.

> Tested on amd64 where all is fine, and on arm (Odroid-U2) where scrub
> sometimes works, but, like most operations, randomly dies with some badness
> that doesn't look related: io_schedule, kunmap_high.  That badness wasn't
> there in 4.11-rc5, needs investigating, but since it's not connected to our
> issue at hand, I consider this patch sort-of tested.

Looks like current -next is pretty broken: while amd64 is ok, on an i386 box
(non-NX Pentium 4) it hangs very early during boot, way before filesystem
modules would be loaded.  Qemu boots but has random hangs.

So it looks like it's compile-tested only for now...

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


Re: About free space fragmentation, metadata write amplification and (no)ssd

2017-04-08 Thread Kai Krakow
On Sun, 9 Apr 2017 02:21:19 +0200, Hans van Kranenburg wrote:

> On 04/08/2017 11:55 PM, Peter Grandi wrote:
> >> [ ... ] This post is way too long [ ... ]  
> > 
> > Many thanks for your report, it is really useful, especially the
> > details.  
> 
> Thanks!
> 
> >> [ ... ] using rsync with --link-dest to btrfs while still
> >> using rsync, but with btrfs subvolumes and snapshots [1]. [
> >> ... ]  Currently there's ~35TiB of data present on the example
> >> filesystem, with a total of just a bit more than 90,000
> >> subvolumes, in groups of 32 snapshots per remote host (daily
> >> for 14 days, weekly for 3 months, monthly for a year), so
> >> that's about 2800 'groups' of them. Inside are millions and
> >> millions and millions of files. And the best part is... it
> >> just works. [ ... ]  
> > 
> > That kind of arrangement, with a single large pool and very many
> > many files and many subdirectories is a worst case scenario for
> > any filesystem type, so it is amazing-ish that it works well so
> > far, especially with 90,000 subvolumes.  
> 
> Yes, this is one of the reasons for this post. Instead of only hearing
> about problems all day on the mailing list and IRC, we need some more
> reports of success.
> 
> The fundamental functionality of doing the cow snapshots, moo, and the
> related subvolume removal on filesystem trees is so awesome. I have no
> idea how we would have been able to continue this type of backup
> system when btrfs was not available. Hardlinks and rm -rf was a total
> dead end road.

I'm absolutely no expert with arrays of the sizes that you use, but I
also stopped using the hardlink-and-remove approach: it was slow to
manage (rsync is slow with it, rm is slow with it) and it was error-prone
(due to the nature of hardlinks). I used btrfs with snapshots and rsync
for a while in my personal testbed, and experienced great slowness over
time: rsync became slower and slower, a full backup took 4 hours with
huge %IO usage, maintaining the backup history was also slow (removing
backups took a while), and rebalancing was needed due to huge amounts of
wasted space. I used rsync with --inplace and --no-whole-file to waste
as little space as possible.

What I first found was an adaptive rebalancer script which I still use
for the main filesystem:

https://www.spinics.net/lists/linux-btrfs/msg52076.html
(thanks to Lionel)

It works pretty well and does not have such a big IO overhead, thanks to
the adaptive multi-pass approach.
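
(For anyone who doesn't want to dig through the linked thread: the
general shape of such a multi-pass rebalance is roughly the following.
This is only a sketch of the idea, not Lionel's actual script, and the
mount point is invented.)

```
# Sketch: compact nearly-empty data chunks first, then progressively
# fuller ones, so each pass relocates as little data as possible.
for usage in 5 10 20 40 60; do
    btrfs balance start -dusage=$usage /mnt/backup
done
```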

But it still did not help the slowness. I now tested borgbackup for a
while, and it's fast: It does the same job in 30 minutes or less
instead of 4 hours, and it has much better backup density and comes
with easy history maintenance, too. I can now store much more backup
history in the same space. Full restore time is about the same as
copying back with rsync.

For a professional deployment I'm planning to use XFS as the storage
backend and borgbackup as the backup frontend. My findings showed that
XFS allocation groups span the disk array diagonally: if you use a
simple JBOD of your iSCSI LUNs, XFS will spread writes across all the
LUNs without you needing to do normal RAID striping. That should
eliminate the need to migrate when adding more LUNs, and the underlying
storage layer on the NetApp side will probably already do RAID for
redundancy anyway. Just feed more space to XFS using LVM.
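
(Growing that setup would then be as simple as something like the
following; the volume group, LV and mount point names are invented for
the example.)

```
# Sketch: hand a freshly added iSCSI LUN to LVM, then grow XFS online.
vgextend vg_backup /dev/mapper/new-lun
lvextend -L +16T /dev/vg_backup/lv_backup
xfs_growfs /srv/backup
```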

Borgbackup can do everything that btrfs can do for you here, but it
targets the job of doing backups only: it can compress, deduplicate,
encrypt and do history thinning. The only downside I found is that only
one backup job at a time can access a backup repository, so you'd have
to use one backup repo per source machine. That way you cannot benefit
from deduplication across multiple sources. But I'm sure NetApp can do
that. OTOH, maybe the backup duration drops to a point where you could
serialize the backups of some machines.
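
(To illustrate, a rough sketch of such a per-machine borg cycle; the
repository path, retention values and options are made up for the
example, not taken from an actual setup.)

```
# One repository per source machine, since only one job may access a
# repository at a time.
borg init --encryption=repokey /backup/borg/host1

# Nightly run: a deduplicated, compressed archive named after the date.
borg create --compression lz4 --stats /backup/borg/host1::2017-04-08 /srv/host1

# History thinning, comparable to the daily/weekly/monthly retention above.
borg prune --keep-daily 14 --keep-weekly 12 --keep-monthly 12 /backup/borg/host1
```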

> OTOH, what we do with btrfs (taking a bulldozer and driving across all
> the boundaries of sanity according to all recommendations and
> warnings) at this scale of individual remotes is something that the
> NetApp people should totally be jealous of. Backup management (manual
> create, restore etc. on top of the nightlies) is self-service
> functionality for our customers, and being able to implement the
> magic behind the APIs with just a few commands like a btrfs sub snap
> and some rsync gives us the right amount of freedom and flexibility we
> need.

This is something I'm planning here, too: Self-service backups, do a
btrfs snap, but then use borgbackup for archiving purposes.

BTW: I think the 2M size comes from the assumption that SSDs manage
their storage in groups of erase-block size. The optimization here
would be that btrfs deallocates (and maybe trims) only whole erase
blocks, which are typically 2M. This has a performance benefit. But if
your underlying storage layer is RAID anyway, this no longer maps
Re: About free space fragmentation, metadata write amplification and (no)ssd

2017-04-08 Thread Hans van Kranenburg
On 04/09/2017 02:21 AM, Hans van Kranenburg wrote:
> [...]
> Notice that everyone who has rotational 0 in /sys is experiencing this
> behaviour right now, when removing snapshots... [...]

Eh, 1

-- 
Hans van Kranenburg


Re: About free space fragmentation, metadata write amplification and (no)ssd

2017-04-08 Thread Hans van Kranenburg
On 04/08/2017 11:55 PM, Peter Grandi wrote:
>> [ ... ] This post is way too long [ ... ]
> 
> Many thanks for your report, it is really useful, especially the
> details.

Thanks!

>> [ ... ] using rsync with --link-dest to btrfs while still
>> using rsync, but with btrfs subvolumes and snapshots [1]. [
>> ... ]  Currently there's ~35TiB of data present on the example
>> filesystem, with a total of just a bit more than 90,000
>> subvolumes, in groups of 32 snapshots per remote host (daily
>> for 14 days, weekly for 3 months, monthly for a year), so
>> that's about 2800 'groups' of them. Inside are millions and
>> millions and millions of files. And the best part is... it
>> just works. [ ... ]
> 
> That kind of arrangement, with a single large pool and very many
> many files and many subdirectories is a worst case scenario for
> any filesystem type, so it is amazing-ish that it works well so
> far, especially with 90,000 subvolumes.

Yes, this is one of the reasons for this post. Instead of only hearing
about problems all day on the mailing list and IRC, we need some more
reports of success.

The fundamental functionality of doing the cow snapshots, moo, and the
related subvolume removal on filesystem trees is so awesome. I have no
idea how we would have been able to continue this type of backup system
when btrfs was not available. Hardlinks and rm -rf was a total dead end
road.

The growth has been slow but steady (oops, fast and steady, I
immediately got corrected by our sales department), but anyway, steady.
This makes it possible to just let it do its thing every day and spot
small changes in behaviour over time, detect patterns that could be a
ticking time bomb and then deal with them in a way that allows conscious
decisions, well-tested changes and continuous measurements of the result.

But, ok, it's surely not for the faint of heart, and the devil is in the
details. If it breaks, you keep the pieces. Using the NetApp hardware is
one of the relevant decisions made here. The shameful state of the most
basic case of recovering (or not being able to recover) from a failure
in a two-disk btrfs RAID1 is enough of a sign that the whole multi-disk
handling is a nice idea, but hasn't yet gotten the amount of attention
it needs for me to be able to rely on it. Having the data safe in my
NetApp filer gives me the opportunity to take regular (like, monthly)
snapshots of the complete thing, so that I have something to go back to
if disaster strikes in Linux land. Yes, it's a bit inconvenient because
I want to umount for a few minutes in a silent moment of the week, but
it's worth the effort, since I can keep the eggs in a shadow basket.

OTOH, what we do with btrfs (taking a bulldozer and driving across all
the boundaries of sanity according to all recommendations and warnings)
at this scale of individual remotes is something that the NetApp people
should totally be jealous of. Backup management (manual create, restore
etc. on top of the nightlies) is self-service functionality for our
customers, and being able to implement the magic behind the APIs with
just a few commands like a btrfs sub snap and some rsync gives us the
right amount of freedom and flexibility we need.

And, monitoring of trends is so. super. important. It's not a secret
that when I work with technology, I want to see what's going on in
there, crack the black box open and try to understand why the lights are
blinking in a specific pattern. What does this balance -dusage=75 mean?
Why does it know what's 75% full and I don't? Where does it get that
information from? The open source kernel code and the IOCTL API are a
source of many hours of happy hacking, because they allow all of this to
be done.

> As I mentioned elsewhere
> I would rather do a rotation of smaller volumes, to reduce risk,
> like "Duncan" also on this mailing list likes to do (perhaps to
> the opposite extreme).

Well, as seen in my 'keeps allocating new chunks for no apparent
reason' thread... even small filesystems can have really weird problems. :)

> As to the 'ssd'/'nossd' issue that is as described in 'man 5
> btrfs' (and I wonder whether 'ssd_spread' was tried too) but it
> is not at all obvious it should impact so much metadata
> handling. I'll add a new item in the "gotcha" list.

I suspect that the -o ssd behaviour is a decent source of the "help! my
filesystem is full but df says it's not" problems we see about every
week. But I can't back that claim up yet. Apart from the fact that it
was the very same problem that btrfs greeted me with when I tried it out
for the first time a few years ago (and it still is one of the first
problems people who start using btrfs encounter), I haven't spent time
debugging the behaviour when running fully allocated.

OTOH the two-step allocation process is also a nice thing, because I
*know* when I still have unallocated space available, which makes for
example the free space fragmentation debugging process much more 

[PATCH] btrfs-progs: Fix missing newline in man 5 btrfs

2017-04-08 Thread Hans van Kranenburg
The text compress_lzo:: would show up directly after 'bigger than the
page size' on the same line.
---
 Documentation/btrfs-man5.asciidoc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/btrfs-man5.asciidoc b/Documentation/btrfs-man5.asciidoc
index c8ef1c96..90f16057 100644
--- a/Documentation/btrfs-man5.asciidoc
+++ b/Documentation/btrfs-man5.asciidoc
@@ -455,6 +455,7 @@ big_metadata::
 (since: 3.4)
 +
 the filesystem uses 'nodesize' bigger than the page size
+
 compress_lzo::
 (since: 2.6.38)
 +
-- 
2.11.0



Re: About free space fragmentation, metadata write amplification and (no)ssd

2017-04-08 Thread Peter Grandi
> [ ... ] This post is way too long [ ... ]

Many thanks for your report, it is really useful, especially the
details.

> [ ... ] using rsync with --link-dest to btrfs while still
> using rsync, but with btrfs subvolumes and snapshots [1]. [
> ... ]  Currently there's ~35TiB of data present on the example
> filesystem, with a total of just a bit more than 90,000
> subvolumes, in groups of 32 snapshots per remote host (daily
> for 14 days, weekly for 3 months, monthly for a year), so
> that's about 2800 'groups' of them. Inside are millions and
> millions and millions of files. And the best part is... it
> just works. [ ... ]

That kind of arrangement, with a single large pool and very many
many files and many subdirectories is a worst case scenario for
any filesystem type, so it is amazing-ish that it works well so
far, especially with 90,000 subvolumes. As I mentioned elsewhere
I would rather do a rotation of smaller volumes, to reduce risk,
like "Duncan" also on this mailing list likes to do (perhaps to
the opposite extreme).

As to the 'ssd'/'nossd' issue, that is as described in 'man 5
btrfs' (and I wonder whether 'ssd_spread' was tried too), but it
is not at all obvious that it should impact metadata handling so
much. I'll add a new item to the "gotcha" list.

It is sad that 'ssd' is used by default in your case, and it is
quite perplexing that the "wandering trees" problem (that is,
"write amplification") is so large with 64KiB write clusters for
metadata (and the 'dup' profile for metadata).

* Probably the metadata and data cluster sizes should be creation
  or mount parameters instead of being implicit in the 'ssd'
  option.
* A cluster size of 2MiB for metadata and/or data presumably
  has some downsides, otherwise it would be the default. I
  wonder whether the downsides are related to barriers...


[PATCH] btrfs: scrub: use do_div() for 64-by-32 division

2017-04-08 Thread Adam Borowski
Unbreaks ARM and possibly other 32-bit architectures.

Fixes: 7d0ef8b4d: Btrfs: update scrub_parity to use u64 stripe_len
Reported-by: Icenowy Zheng 
Signed-off-by: Adam Borowski 
---
You'd probably want to squash this with Liu's commit, to be nice to future
bisects.

Tested on amd64 where all is fine, and on arm (Odroid-U2) where scrub
sometimes works, but, like most operations, randomly dies with some badness
that doesn't look related: io_schedule, kunmap_high.  That badness wasn't
there in 4.11-rc5, needs investigating, but since it's not connected to our
issue at hand, I consider this patch sort-of tested.

 fs/btrfs/scrub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b6fe1cd08048..95372e3679f3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2407,7 +2407,7 @@ static inline void __scrub_mark_bitmap(struct scrub_parity *sparity,
 
start -= sparity->logic_start;
start = div64_u64_rem(start, sparity->stripe_len, &offset);
-   offset /= sectorsize;
+   do_div(offset, sectorsize);
nsectors = (int)len / sectorsize;
 
if (offset + nsectors <= sparity->nsectors) {
-- 
2.11.0



Re: Does btrfs get nlink on directories wrong? -- was Re: [PATCH 2/4] xfstests: Add first statx test [ver #5]

2017-04-08 Thread David Howells
Eryu Guan  wrote:

> > Overlayfs uses nlink = 1 for merge dirs to silence 'find' et al.
> > Ext4 uses nlink = 1 for directories with more than 32K subdirs
> > (EXT4_FEATURE_RO_COMPAT_DIR_NLINK).
> > 
> > But in both those fs newly created directories will have nlink = 2.
> 
> Is there a conclusion on this? Seems the test should be updated
> accordingly?

I've dropped the nlink check on directories.

David


About free space fragmentation, metadata write amplification and (no)ssd

2017-04-08 Thread Hans van Kranenburg
So... today a real life story / btrfs use case example from the trenches
at work...

tl;dr 1) btrfs is awesome, but you have to carefully choose which parts
of it you want to use or avoid 2) improvements can be made, but at least
the problems relevant for this use case are manageable and the behaviour
is quite predictable.

This post is way too long, but I hope it's a fun read for a lazy Sunday
afternoon. :) Otherwise, skip some sections, they have headers.

...

The example filesystem for this post is one of the backup server
filesystems we have, running btrfs for the data storage.

== About ==

In Q4 2014, we converted all our backup storage from ext4 with rsync
--link-dest to btrfs, still using rsync, but now with btrfs subvolumes
and snapshots [1]. For every new backup, it creates a writable snapshot
of the previous backup and then uses rsync on the file tree to get the
changes from the remote.
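
(Conceptually, a nightly run plus expiry boils down to something like
this; the paths and rsync options are illustrative, the real tooling
around it is custom.)

```
# Reuse yesterday's backup as the starting point for today's...
btrfs subvolume snapshot /backups/host1/2017-04-07 /backups/host1/2017-04-08

# ...and let rsync only transfer and rewrite what actually changed.
rsync -a --inplace --delete root@host1:/ /backups/host1/2017-04-08/

# Expiry is just removing the oldest snapshot of the group.
btrfs subvolume delete /backups/host1/2016-04-08
```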

Currently there's ~35TiB of data present on the example filesystem, with
a total of just a bit more than 90,000 subvolumes, in groups of 32
snapshots per remote host (daily for 14 days, weekly for 3 months,
monthly for a year), so that's about 2800 'groups' of them. Inside are
millions and millions and millions of files.

And the best part is... it just works. Well, almost, given the title of
the post. But the effort needed for creating all backups and doing
subvolume removal for expiries scales linearly with the number of them.

== Hardware and filesystem setup ==

The actual disk storage is done using NetApp storage equipment, in this
case a FAS2552 with 1.2T SAS disks and some extra disk shelves. Storage
is exported over multipath iSCSI over ethernet, and then grouped
together again with multipathd and LVM, striping (like, RAID0) over
active/active controllers. We've been using this setup for years now in
different places, and it works really well. So, using this, we keep the
whole RAID / multiple disks / hardware disk failure part outside the
reach of btrfs. And yes, checksums are done twice, but who cares. ;]
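
(In LVM terms the grouping looks roughly like this; the device names and
stripe parameters are invented for the example.)

```
# Sketch: stripe one logical volume across two multipath iSCSI LUNs.
vgcreate vg_backup /dev/mapper/lun-a /dev/mapper/lun-b
lvcreate --stripes 2 --stripesize 256k --extents 100%FREE --name backup vg_backup
mkfs.btrfs /dev/vg_backup/backup
```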

Since the maximum iSCSI lun size is 16TiB, the maximum block device size
that we use by combining two is 32TiB. This filesystem is already
bigger, so at some point we added two new luns in a new LVM volume
group, and added the result to the btrfs filesystem (yay!):

Total devices 2 FS bytes used 35.10TiB
devid  1 size 29.99TiB used 29.10TiB path /dev/xvdb
devid  2 size 12.00TiB used 11.29TiB path /dev/xvdc

Data, single: total=39.50TiB, used=34.67TiB
System, DUP: total=40.00MiB, used=6.22MiB
Metadata, DUP: total=454.50GiB, used=437.36GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Yes, DUP metadata, more about that later...

I can also umount the filesystem for a short time, take a snapshot of
the LUNs on the NetApp level, clone them and then have a writable clone
of a 40TiB btrfs filesystem, to be able to do crazy things and tests
before really making changes, like a kernel version upgrade or
converting to the free space tree etc.

From the end of 2014 to September 2016, we used the 3.16 LTS kernel from
Debian Jessie. Since September 2016, it's 4.7.5, after torturing it for
two weeks on such a clone, replaying the daily workload on it.

== What's not so great... Allocated but unused space... ==

From the beginning, the filesystem showed a tendency to accumulate
allocated but unused space that didn't get reused by new writes.

In the last months of using kernel 3.16 the situation worsened, ending
up with about 30% allocated but unused space (11TiB...), while the
filesystem kept allocating new space all the time instead of reusing it:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q23.png

Using balance with the 3.16 kernel and space cache v1 to fight this was
nearly impossible because of the amount of scattered metadata writes and
the amplification (a 1:40 overall read/write ratio during balance), plus
the space cache information being written over and over again on every
commit.

When making the switch to the 4.7 kernel I also switched to the free
space tree, eliminating the space cache flush problems, and did a
mega-balance operation which brought it back down quite a bit.
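
(For reference, the free space tree is built on the first mount with the
right option; kernel >= 4.5 assumed, device and mount point invented for
the example.)

```
# Sketch: one-time switch to the free space tree (a.k.a. space cache v2);
# subsequent mounts keep using it.
mount -o space_cache=v2 /dev/xvdb /srv/backups
```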

Here's what it looked like for the last 6 months:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q4-17-Q1.png

This is not too bad, but also not good enough. I want my picture to
become brighter white than this:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-03-14-backups-heatmap-chunks.png

The picture shows that the unused space is scattered all around the
whole filesystem.

So about a month ago, I continued searching the kernel code for the
cause of this behaviour. This is a fun, but time-consuming and often
mind-boggling activity, because you run into 10 different interesting
things at once and want to start finding out about all of them at the
same time etc. :D

The first two things I found out about were:
  1) the 'free space cluster' code, which is responsible 

Re: Linux next-20170407 failed to build on ARM due to usage of mod in btrfs code

2017-04-08 Thread Adam Borowski
On Sat, Apr 08, 2017 at 02:45:34PM -0300, Fabio Estevam wrote:
> On Sat, Apr 8, 2017 at 1:02 PM, Icenowy Zheng  wrote:
> > Hello everyone,
> > Today I tried to build a kernel with btrfs enabled on ARM, then when linking
> > I met such an error:
> >
> > ```
> > fs/built-in.o: In function `scrub_bio_end_io_worker':
> > acl.c:(.text+0x2f0450): undefined reference to `__aeabi_uldivmod'
> > fs/built-in.o: In function `scrub_extent_for_parity':
> > acl.c:(.text+0x2f0bcc): undefined reference to `__aeabi_uldivmod'
> > fs/built-in.o: In function `scrub_raid56_parity':
> > acl.c:(.text+0x2f12a8): undefined reference to `__aeabi_uldivmod'
> > acl.c:(.text+0x2f15c4): undefined reference to `__aeabi_uldivmod'
> > ```
> >
> > These functions are found at fs/btrfs/scrub.c .
> >
> > After disabling btrfs the kernel is successfully built.
> 
> I see the same error with ARM imx_v6_v7_defconfig + btrfs support.
> 
> Looks like it is caused by commit 7d0ef8b4dbbd220 ("Btrfs: update
> scrub_parity to use u64 stripe_len").

+1, my bisect just finished, same bad commit.

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


Re: [PATCH 06/12] audit: Use timespec64 to represent audit timestamps

2017-04-08 Thread Deepa Dinamani
> I have no problem merging this patch into audit/next for v4.12, would
> you prefer me to do that so at least this patch is merged?

This would be fine.
But, I think whoever takes the last 2 deletion patches should also take them.
I'm not sure how that part works out.

> It would probably make life a small bit easier for us in the audit
> world too as it would reduce the potential merge conflict.  However,
> that's a relatively small thing to worry about.

-Deepa


Re: Linux next-20170407 failed to build on ARM due to usage of mod in btrfs code

2017-04-08 Thread Fabio Estevam
On Sat, Apr 8, 2017 at 1:02 PM, Icenowy Zheng  wrote:
> Hello everyone,
> Today I tried to build a kernel with btrfs enabled on ARM, then when linking
> I met such an error:
>
> ```
> fs/built-in.o: In function `scrub_bio_end_io_worker':
> acl.c:(.text+0x2f0450): undefined reference to `__aeabi_uldivmod'
> fs/built-in.o: In function `scrub_extent_for_parity':
> acl.c:(.text+0x2f0bcc): undefined reference to `__aeabi_uldivmod'
> fs/built-in.o: In function `scrub_raid56_parity':
> acl.c:(.text+0x2f12a8): undefined reference to `__aeabi_uldivmod'
> acl.c:(.text+0x2f15c4): undefined reference to `__aeabi_uldivmod'
> ```
>
> These functions are found at fs/btrfs/scrub.c .
>
> After disabling btrfs the kernel is successfully built.

I see the same error with ARM imx_v6_v7_defconfig + btrfs support.

Looks like it is caused by commit 7d0ef8b4dbbd220 ("Btrfs: update
scrub_parity to use u64 stripe_len").


Linux next-20170407 failed to build on ARM due to usage of mod in btrfs code

2017-04-08 Thread Icenowy Zheng

Hello everyone,
Today I tried to build a kernel with btrfs enabled on ARM, then when 
linking I met such an error:


```
fs/built-in.o: In function `scrub_bio_end_io_worker':
acl.c:(.text+0x2f0450): undefined reference to `__aeabi_uldivmod'
fs/built-in.o: In function `scrub_extent_for_parity':
acl.c:(.text+0x2f0bcc): undefined reference to `__aeabi_uldivmod'
fs/built-in.o: In function `scrub_raid56_parity':
acl.c:(.text+0x2f12a8): undefined reference to `__aeabi_uldivmod'
acl.c:(.text+0x2f15c4): undefined reference to `__aeabi_uldivmod'
```

These functions are found at fs/btrfs/scrub.c .

After disabling btrfs the kernel is successfully built.

For this problem, see also [1], which was a similar bug in the PL330
driver code.


[1] https://patchwork.kernel.org/patch/5299081/

Thanks,
Icenowy


Re: Does btrfs get nlink on directories wrong? -- was Re: [PATCH 2/4] xfstests: Add first statx test [ver #5]

2017-04-08 Thread Eryu Guan
On Wed, Apr 05, 2017 at 03:32:30PM +0300, Amir Goldstein wrote:
> On Wed, Apr 5, 2017 at 3:30 PM, David Sterba  wrote:
> > On Wed, Apr 05, 2017 at 11:53:41AM +0100, David Howells wrote:
> >> I've added a test to xfstests that exercises the new statx syscall.  
> >> However,
> >> it fails on btrfs:
> >>
> >>  Test statx on a directory
> >> +[!] stx_nlink differs, 1 != 2
> >> +Failed
> >> +stat_test failed
> >>
> >> because a new directory it creates has an nlink of 1, not 2.  Is this a 
> >> case
> >> of my making an incorrect assumption or is it an fs bug?
> >
> > Afaik nlink == 1 means that there's no accounting of subdirectories, and
> > it's a valid value. The 'find' utility can use nlink to optimize
> > directory traversal but otherwise I'm not aware of other usage.
> >
> > All directories in btrfs have nlink == 1.
> 
> FYI,
> 
> Overlayfs uses nlink = 1 for merge dirs to silence 'find' et al.
> Ext4 uses nlink = 1 for directories with more than 32K subdirs
> (EXT4_FEATURE_RO_COMPAT_DIR_NLINK).
> 
> But in both those fs newly created directories will have nlink = 2.

Is there a conclusion on this? Seems the test should be updated
accordingly?

Thanks,
Eryu


Re: [PATCH 06/12] audit: Use timespec64 to represent audit timestamps

2017-04-08 Thread Paul Moore
On Fri, Apr 7, 2017 at 8:57 PM, Deepa Dinamani  wrote:
> struct timespec is not y2038 safe.
> Audit timestamps are recorded in string format into
> an audit buffer for a given context.
> These mark the entry timestamps for the syscalls.
> Use y2038 safe struct timespec64 to represent the times.
> The log strings can handle this transition as strings can
> hold up to 1024 characters.
>
> Signed-off-by: Deepa Dinamani 
> Reviewed-by: Arnd Bergmann 
> Acked-by: Paul Moore 
> Acked-by: Richard Guy Briggs 
> ---
>  include/linux/audit.h |  4 ++--
>  kernel/audit.c| 10 +++++-----
>  kernel/audit.h|  2 +-
>  kernel/auditsc.c  |  6 +++---
>  4 files changed, 11 insertions(+), 11 deletions(-)

I have no problem merging this patch into audit/next for v4.12, would
you prefer me to do that so at least this patch is merged?

It would probably make life a small bit easier for us in the audit
world too as it would reduce the potential merge conflict.  However,
that's a relatively small thing to worry about.

> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 6fdfefc..f830508 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -332,7 +332,7 @@ static inline void audit_ptrace(struct task_struct *t)
> /* Private API (for audit.c only) */
>  extern unsigned int audit_serial(void);
>  extern int auditsc_get_stamp(struct audit_context *ctx,
> - struct timespec *t, unsigned int *serial);
> + struct timespec64 *t, unsigned int *serial);
>  extern int audit_set_loginuid(kuid_t loginuid);
>
>  static inline kuid_t audit_get_loginuid(struct task_struct *tsk)
> @@ -511,7 +511,7 @@ static inline void __audit_seccomp(unsigned long syscall, 
> long signr, int code)
>  static inline void audit_seccomp(unsigned long syscall, long signr, int code)
>  { }
>  static inline int auditsc_get_stamp(struct audit_context *ctx,
> - struct timespec *t, unsigned int *serial)
> + struct timespec64 *t, unsigned int *serial)
>  {
> return 0;
>  }
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 2f4964c..fcbf377 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1625,10 +1625,10 @@ unsigned int audit_serial(void)
>  }
>
>  static inline void audit_get_stamp(struct audit_context *ctx,
> -  struct timespec *t, unsigned int *serial)
> +  struct timespec64 *t, unsigned int *serial)
>  {
> if (!ctx || !auditsc_get_stamp(ctx, t, serial)) {
> -   *t = CURRENT_TIME;
> +   ktime_get_real_ts64(t);
> *serial = audit_serial();
> }
>  }
> @@ -1652,7 +1652,7 @@ struct audit_buffer *audit_log_start(struct 
> audit_context *ctx, gfp_t gfp_mask,
>  int type)
>  {
> struct audit_buffer *ab;
> -   struct timespec t;
> +   struct timespec64 t;
> unsigned int uninitialized_var(serial);
>
> if (audit_initialized != AUDIT_INITIALIZED)
> @@ -1705,8 +1705,8 @@ struct audit_buffer *audit_log_start(struct 
> audit_context *ctx, gfp_t gfp_mask,
> }
>
> audit_get_stamp(ab->ctx, &t, &serial);
> -   audit_log_format(ab, "audit(%lu.%03lu:%u): ",
> -t.tv_sec, t.tv_nsec/1000000, serial);
> +   audit_log_format(ab, "audit(%llu.%03lu:%u): ",
> +(unsigned long long)t.tv_sec, t.tv_nsec/1000000, serial);
>
> return ab;
>  }
> diff --git a/kernel/audit.h b/kernel/audit.h
> index 0f1cf6d..cdf96f4 100644
> --- a/kernel/audit.h
> +++ b/kernel/audit.h
> @@ -112,7 +112,7 @@ struct audit_context {
> enum audit_statestate, current_state;
> unsigned intserial; /* serial number for record */
> int major;  /* syscall number */
> -   struct timespec ctime;  /* time of syscall entry */
> +   struct timespec64   ctime;  /* time of syscall entry */
> unsigned long   argv[4];/* syscall arguments */
> longreturn_code;/* syscall return code */
> u64 prio;
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index e59ffc7..a2d9217 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -1532,7 +1532,7 @@ void __audit_syscall_entry(int major, unsigned long a1, 
> unsigned long a2,
> return;
>
> context->serial = 0;
> -   context->ctime  = CURRENT_TIME;
> +   ktime_get_real_ts64(&context->ctime);
> context->in_syscall = 1;
> context->current_state  = state;
> context->ppid   = 0;
> @@ -1941,13 +1941,13 @@ EXPORT_SYMBOL_GPL(__audit_inode_child);
>  /**
>   * auditsc_get_stamp - get local copies of audit_context values
>   * @ctx: 

Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-08 Thread Hans van Kranenburg
On 04/08/2017 01:16 PM, Hans van Kranenburg wrote:
> On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
>> Ok, I'm going to revive a year old mail thread here with interesting new
>> info:
>>
>> [...]
>>
>> Now, another surprise:
>>
>> From the exact moment I did mount -o remount,nossd on this filesystem,
>> the problem vanished.
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
>>
>> I don't have a new video yet, but I'll set up a cron tonight and post it
>> later.
>>
>> I'm going to send another mail specifically about the nossd/ssd
>> behaviour and other things I found out last week, but that'll probably
>> be tomorrow.
> 
> Well, there it is:
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4
> 
> Amazing... :) I'll update the file later with extra frames.

By the way,

1. For the log files in /var/log... logrotate behaves as a defrag tool,
of course. The small free-space gaps left behind when it scrapes the
current log file together and rewrites it as one big gzipped file can be
reused by the slow writes again throughout the next day, or whatever the
rotation interval is.

2. For the /var/spool/postfix... small files come and go, and that's
fine now.

3. For the mailman mbox files, which get appended to all the time...
They can either stay where they are, with some more extents scattered
around, or an entry in the monthly cron that points defrag at last
month's files (which will never change again) will solve that
efficiently.
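
(Something like this in the monthly cron would do; the path is made up
and just follows mailman's archive naming convention.)

```
# Sketch: defragment last month's (now immutable) archive files once a month.
btrfs filesystem defragment -v /var/lib/mailman/archives/private/*/2017-March.txt
```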

All of that doesn't sound like abnormal things to do when punishing the
filesystem with a 'slow small write' workload.

I'm happy to be able to keep this thing on btrfs. When moving all the
mailman stuff over from a previous VM, I first made it ext4 again, then
immediately ended up with no inodes left (of course!) while copying the
mailman archive, and then thought .. arg .. mkfs.btrfs, yay, unlimited
inodes! :) I was almost at the point of converting it back to ext4 after
all because of the exploding unused free space problems, but now that's
prevented just in time. :D

Moo,
-- 
Hans van Kranenburg


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-08 Thread Hans van Kranenburg
On 04/07/2017 11:25 PM, Hans van Kranenburg wrote:
> Ok, I'm going to revive a year old mail thread here with interesting new
> info:
> 
> [...]
> 
> Now, another surprise:
> 
> From the exact moment I did mount -o remount,nossd on this filesystem,
> the problem vanished.
> 
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png
> 
> I don't have a new video yet, but I'll set up a cron tonight and post it
> later.
> 
> I'm going to send another mail specifically about the nossd/ssd
> behaviour and other things I found out last week, but that'll probably
> be tomorrow.

Well, there it is:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

Amazing... :) I'll update the file later with extra frames.

-- 
Hans van Kranenburg


Re: During a btrfs balance nearly all quotas of the subvolumes became exceeded

2017-04-08 Thread Duncan
Markus Baier posted on Fri, 07 Apr 2017 16:17:10 +0200 as excerpted:

> Hello btrfs-list,
> 
> today a strange behaviour appeared during the btrfs balance process.
> 
> I started a btrfs balance operation on the /home subvolume that
> contains, as children, all the subvolumes for the home directories of
> the users, every subvolume with its own quota.
> 
> A short time after the start of the balance process no user was able to
> write into his home directory anymore.
> All users got the "your disc quota exceeded" message.
> 
> Then I checked the qgroups and got the following result:
> 
> btrfs qgroup show -r /home/
> qgroupid  rfer      excl      max_rfer
> 
> 0/5       16.00KiB  16.00KiB  none
> 0/257     16.00KiB  16.00KiB  none
> 0/258     16.00EiB  16.00EiB  200.00GiB
> 0/259     16.00EiB  16.00EiB  200.00GiB
> 0/260     16.00EiB  16.00EiB  200.00GiB
> 0/261     16.00EiB  16.00EiB  200.00GiB
> 0/267     28.00KiB  28.00KiB  200.00GiB
> 
> 1/1       16.00EiB  16.00EiB  900.00GiB
> 
> For most of the subvolumes btrfs calculated 16.00EiB (I think this is
> the maximum possible size of the filesystem)
> as the amount of used space.
> A few subvolumes, all of them nearly empty like the 0/267,
> were not affected and showed the normal size of 28.00KiB
> 
> I was able to fix the problem with the
> btrfs quota rescan /home command.
> But my question is, is this an already known bug, and what can I do to
> prevent this problem during the next balance run?
> 
> uname -a
> Linux condor-control 4.4.39-gentoo [...]

Known bug.  The btrfs quota subsystem remains somewhat buggy and 
unstable, with negative quota (IIRC, 16 EiB is the unsigned 64-bit 
integer representation of a signed-int negative, I believe -1) issues 
being one of the continuing problems.  Tho it's actively being worked on 
and you may well find that the latest current kernel release (4.10) is 
better in this regard, tho I'd still not entirely trust it and there 
remain quota-fix patches in the active submission queue (just check the 
list).

Note that quotas seriously increase btrfs scaling issues as well, 
typically increasing balance times multi-fold, particularly as they 
interact with snapshots, which have scaling issues of their own, such 
that a cap of a couple hundred snapshots per subvolume is strongly 
recommended, even without quotas on top of it.  Both memory usage and 
processing time are affected, primarily for balance and check.

As a result of btrfs-quota's long continuing accuracy issues in addition 
to the scaling issues, my recommendation has long been the following:

Generally, quota users fall into three categories, described here with my 
recommendations for each:

1) Those who know the quota issues and are actively working with the devs 
to test and correct them, helping to eventually stabilize this feature 
into practical usability, tho it has taken some years and the job, while 
getting closer to finished, remains yet unfinished.

Bless them!  Keep it up! =:^)

2) Those who may find the quota feature generally useful, but don't 
actually require it for their use-case.

I recommend that these users turn off quotas until such time as they've 
been generally demonstrated to be reliable and stable.  At this point 
they're simply not worth the hassle.  Even then, the scaling issues may 
remain.

3) Those who actually depend on quotas working correctly as a part of 
their use-case.

These users should really consider a more mature and stable filesystem 
where the quota feature is known to work as reliably as their use-case 
requires.  Btrfs is certainly stabilizing and maturing, but it's simply 
not there yet for this use-case.

One /possible/ alternative if staying with btrfs for its other features 
is desired, is the pre-quota solution of creating multiple independent 
filesystems on top of lvm or partitions, and using the size of the 
filesystems to enforce restrictions that quotas would otherwise be used 
for.  Of course, independent VM images are a more complicated variant of
this.


Unfortunately, given that you apparently have multiple users and are 
using quotas as resource-sharing enforcement, you may well fall into this 
third category. =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-08 Thread Duncan
Hans van Kranenburg posted on Fri, 07 Apr 2017 23:25:29 +0200 as
excerpted:

> So, this is why putting your /var/log, /var/lib/mailman and /var/spool
> on btrfs is a terrible idea.
> 
> Because the allocator keeps walking forward every file that is created
> and then removed leaves a blank spot behind.
> 
> Autodefrag makes the situation only a little bit better, changing the
> resulting pattern from a sky full of stars into a snowstorm. The result
> of taking a few small writes and rewriting them again is that again the
> small parts of free space are left behind.

> [... B]ecause of the pattern we end
> up with, a large write apparently fails (the files downloaded when doing
> apt-get update by daily cron) which causes a new chunk allocation. This
> is clearly visible in the videos. Directly after that, the new chunk
> gets filled with the same pattern, because the extent allocator now
> continues there and next day same thing happens again etc...

> Now, another surprise:
> 
> From the exact moment I did mount -o remount,nossd on this filesystem,
> the problem vanished.

That large write in the middle of small writes pattern might be why I've 
not seen the problem on my btrfs', on ssd, here.

Remember, I'm the guy who keeps advocating multiple independent small 
btrfs on partitioned-up larger devices, with the splits between 
independent btrfs' based on tasks.

So I have a quite tiny sub-GiB independent log btrfs handling those slow 
incremental writes to generally smaller files, a separate / with the main 
system on it that's mounted read-only unless I'm actively updating it, a 
separate home with my reasonably small size but written at-once non-media 
user files, a separate media partition/fs with my much larger but very 
seldom rewritten media files, and a separate update partition/fs with the 
local cache of the distro tree and overlays, sources (since it's gentoo), 
built binpkg cache, etc, with small to medium-large files that are 
comparatively frequently replaced.

So the relatively small slow-written and frequently rotated log files are 
isolated to their own partition/fs, undisturbed by the much larger update-
writes to the updates and / partitions/fs, isolating them from the update-
trigger that triggers the chunk allocations on your larger single general 
purpose filesystem/image, amongst all those fragmenting slow logfile 
writes.

Very interesting and informative thread, BTW.  I'm learning quite a bit. 
=:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
