Re: some free space cache corruptions

2016-12-28 Thread Duncan
Christoph Anton Mitterer posted on Thu, 29 Dec 2016 04:43:35 +0100 as
excerpted:

> On Mon, 2016-12-26 at 00:12 +, Duncan wrote:
>> By themselves, free-space cache warnings are minor and not a serious
>> issue at all -- the cache is just that, a cache, designed to speed
>> operation but not actually necessary, and btrfs can detect and route
>> around space-cache corruption on-the-fly so by itself it's not a big
>> deal.
> Well... sure about that? Haven't we recently had that serious bug in the
> FST, which could cause data corruption because btrfs used space as free
> when it wasn't?

Well, the free-space-tree (FST) itself remains experimental and not 
recommended for general use yet.  The btrfs (5) manpage (as of -progs-4.9 
at least) calls space_cache=v1 the safe default, and the wiki status page 
lists v2 (tree) as orange level (/mostly/ OK).

And note that I said free-space _cache_, not free-space _tree_.

Of course that's not to (unwisely) claim there are no bugs in the free-
space _cache_ (aka v1), but rather, to claim that its status is exactly 
the same as that of btrfs in general: stabilizing but not fully stable, 
workable for daily use as long as you keep your backups updated and 
ready to use, and stay away from the features known to be less stable... 
which do /not/ include the free-space cache (v1), but /do/ include the 
free-space tree (v2).
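
(For reference, the cache implementation is picked at mount time; a minimal 
illustration, assuming a device /dev/sdX and mountpoint /mnt, not commands 
taken from this thread:

  mount -o space_cache=v1 /dev/sdX /mnt   # the long-standing v1 cache, the safe default
  mount -o space_cache=v2 /dev/sdX /mnt   # the newer free-space tree, still rated mostly-OK

Note that once v2 has been enabled it is recorded in the filesystem, so 
going back to v1 takes more than simply remounting.)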

And that cache (as opposed to tree) functionality really /is/ quite 
stable, as it has been rather heavily tested by now.  The only exception 
would be the usual one for new code over old, where the new code hasn't 
been well tested, but that's a given for projects at this stage, so has 
little need to be explicitly stated.

> 
>> These warnings are however hints that something out of the routine has
>> happened
> Which again just likely means that there was/is some bug in btrfs...
> other than that, why should it suddenly get some corrupted cache, when
> only ro-snapshots were removed in between?

That wasn't plain to me in the message I replied to.  What I had in mind 
with that out-of-the-routine reference was an ungraceful shutdown or 
crash, which /does/ commonly leave the free-space-cache in an 
inconsistent state.  Btrfs routinely detects and deals with that, 
invalidating and not using the section of cache that doesn't match what 
it knows to be the case from the other trees.

And in such an ungraceful shutdown situation, exactly as I stated, the 
free-space-cache warning is expected and dealt with routinely, but it's a 
hint that something else might have gone wrong in the event as well -- 
something that isn't necessarily so easily fixed, that very well may 
/not/ be fixed automatically, and that can cause further damage if the 
filesystem keeps being used with the problem still lurking.

>> 2) It recently came to the attention of the devs that the existing
>> btrfs mount-option method of clearing the free-space cache only clears
>> it for block-groups/chunks it encounters on-the-fly.  It doesn't do a
>> systematic beginning-to-end clear.

> So that calls for fixing the documentation as well?!

It's documented already (in -progs 4.9) in the btrfs-check manpage, but 
you are correct in that it's not documented in the btrfs (5) manpage, 
which covers the mount options themselves.

On the wiki the manpages apparently haven't been regenerated from git 
recently, so they're missing the 4.9 content mentioned above, unless you 
follow the link in the warning at the top of each one, to the git 
version.  The git version of the manpages appears to have the same status 
as the 4.9 manpages, given above.

Of course if people are following this list as recommended, they'll know 
about it as well, because they will have seen the recent discussion.  Tho 
of course that's not going to help people who will be starting to 
investigate btrfs in some weeks' time, unless they read the list archive 
back far enough to see the discussion.  So it definitely needs to be 
documented in the btrfs (5) manpage ASAP, with the wiki manpage versions 
regenerated after it hits git.

>> 3) As a result of #2, the devs only very recently added support in
>> btrfs check for a /full/ space-cache-v1 clear, using the new
>> --clear-space-cache option.  But your btrfs-progs v4.7.3 is too old to
>> support it.  I know it's in the v4.9 I just upgraded to... checking the
>> wiki it appears the option was added in btrfs-progs v4.8.3 (v4.8.4 for
>> v2 cache).
> 
> And is the new option stable?! ;-)

The btrfs check option should be reasonably stable, yes, because it's a 
full clear on an unmounted filesystem, which has far fewer ways to go 
wrong than attempting a partial clear on a mounted filesystem.

Additionally, it has been there since 4.8.3, so thru that, 4.8.4, 4.8.5, 
and now into 4.9.0, without noted problems.  So it should be reasonably 
stable.
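
For illustration, the full clear runs offline; a minimal sketch, assuming 
the filesystem lives on /dev/sdX and is currently unmounted:

  umount /mnt
  btrfs check --clear-space-cache v1 /dev/sdX
  mount /dev/sdX /mnt    # a fresh v1 cache is rebuilt as block groups are used

(Pass v2 instead to clear the free-space tree, on progs new enough to 
support that.)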

Put it this way, unlike most of the non-read-only options in btrfs check, 
I'd be quite willing to use it on 

Re: Btrfs send does not preserve reflinked files within subvolumes.

2016-12-28 Thread Qu Wenruo

Hi,

I tried just what you did, and used "btrfs receive --dump" to examine the 
send stream.


And things work quite well:

$ sudo mount /dev/sda6  /mnt/btrfs/
$ sudo btrfs subvolume create /mnt/btrfs/subv1
$ sudo xfs_io -f -c "pwrite 0 2M" /mnt/btrfs/subv1/file1
$ sudo xfs_io -f -c "reflink /mnt/btrfs/subv1/file1 0 0 2M" 
/mnt/btrfs/subv1/file1.ref

$ sudo btrfs subv snap -r /mnt/btrfs/subv1/ /mnt/btrfs/ro_snap
$ sudo btrfs send /mnt/btrfs/ro_snap/ > /tmp/output
$ sudo btrfs receive --dump < /tmp/output

And the output shows like this:
subvol  ./ro_snap 
uuid=e788bb6e-8ec6-dd47-a452-e26196d22699 transid=9

chown   ./ro_snap/  gid=0 uid=0
chmod   ./ro_snap/  mode=755
utimes  ./ro_snap/ 
atime=2016-12-29T13:46:00+0800 mtime=2016-12-29T13:46:50+0800 
ctime=2016-12-29T13:46:50+0800
mkfile  ./ro_snap/o257-8-0
rename  ./ro_snap/o257-8-0  dest=./ro_snap/file1
utimes  ./ro_snap/ 
atime=2016-12-29T13:46:00+0800 mtime=2016-12-29T13:46:50+0800 
ctime=2016-12-29T13:46:50+0800

write   ./ro_snap/file1 offset=0 len=49152
write   ./ro_snap/file1 offset=49152 len=49152

write   ./ro_snap/file1 offset=2064384 len=32768
truncate    ./ro_snap/file1 size=2097152
chown   ./ro_snap/file1 gid=0 uid=0
chmod   ./ro_snap/file1 mode=600
utimes  ./ro_snap/file1 
atime=2016-12-29T13:46:24+0800 mtime=2016-12-29T13:46:24+0800 
ctime=2016-12-29T13:46:24+0800
mkfile  ./ro_snap/o258-8-0
rename  ./ro_snap/o258-8-0  dest=./ro_snap/file1.ref
utimes  ./ro_snap/ 
atime=2016-12-29T13:46:00+0800 mtime=2016-12-29T13:46:50+0800 
ctime=2016-12-29T13:46:50+0800
clone   ./ro_snap/file1.ref offset=0 len=2097152 
from=./ro_snap/file1 clone_offset=0

^^^ Here is the clone operation

truncate    ./ro_snap/file1.ref size=2097152
chown   ./ro_snap/file1.ref gid=0 uid=0
chmod   ./ro_snap/file1.ref mode=600
utimes  ./ro_snap/file1.ref 
atime=2016-12-29T13:46:50+0800 mtime=2016-12-29T13:47:07+0800 
ctime=2016-12-29T13:47:07+0800


And in fact, btrfs send can even handle reflinks to the parent subvolume.
(Although this behavior can be deadly for heavily reflinked files.)


So, would you please upload the send stream for us to check?

Thanks,
Qu

At 12/29/2016 10:44 AM, Glenn Washburn wrote:

I'm having a hard time getting btrfs receive to create reflinked files
and have a trivial example that I believe *should* work but doesn't.
I've attached a script that I used to perform this test, so others can
try to reproduce.  The text file is the output of the shell script
except the last command, which is a tool I wrote to print the extent
info from FIEMAP.  "btrfs fi du" would work just as well, but I'm on
Ubuntu 16.04, whose btrfs progs doesn't have that command yet.  I've
also tested on Ubuntu 16.10 with similar results, except that "btrfs fi
du" is on that version and confirms what my tool displays.

So, can send not do what I'm trying to get it to do? If it can now,
when did that feature get introduced (must have been after kernel
4.8)?  I'm very surprised that this feature wouldn't already have been
implemented, and if it hasn't, that no one seems to be complaining about it. I've
done a decent amount of searching on this and have come up with
nothing.  Any help would be greatly appreciated.

Thanks,
Glenn







Re: some free space cache corruptions

2016-12-28 Thread Christoph Anton Mitterer
On Mon, 2016-12-26 at 00:12 +, Duncan wrote:
> By themselves, free-space cache warnings are minor and not a serious 
> issue at all -- the cache is just that, a cache, designed to speed 
> operation but not actually necessary, and btrfs can detect and route 
> around space-cache corruption on-the-fly so by itself it's not a big
> deal.
Well... sure about that? Haven't we recently had that serious bug in
the FST, which could cause data corruption because btrfs used space as free
when it wasn't?


> These warnings are however hints that something out of the routine
> has happened
Which again just likely means that there was/is some bug in btrfs...
other than that, why should it suddenly get some corrupted cache, when
only ro-snapshots were removed in between?


> unless the filesystem itself, or a scrub, etc, has fixed things in
> the meantime.  (And as I said, the space-cache is only a cache,
> designed to speed things up; cache corruption is fairly common and
> btrfs can and does deal with it without issue.)
When finishing the most recent backups, the fs in question got pretty
full, and the error message I'd spotted during btrfs check appeared in
the kernel log as well:
Dec 29 03:03:11 heisenberg kernel: BTRFS warning (device dm-1): block group 
5431552376832 has wrong amount of free space
Dec 29 03:03:11 heisenberg kernel: BTRFS warning (device dm-1): failed to load 
free space cache for block group 5431552376832, rebuilding it now
(fs was NOT mounted with clear_cache)

which implies the cache was rebuilt at that point.

However, after a subsequent fsck, the same error occurs there again:
# btrfs check /dev/mapper/data-a2 ; echo $?
Checking filesystem on /dev/mapper/data-a2
UUID: f8acb432-7604-46ba-b3ad-0abe8e92c4db
checking extents
checking free space cache
block group 5431552376832 has wrong amount of free space
failed to load free space cache for block group 5431552376832
checking fs roots
checking csums
checking root refs
found 7571911602176 bytes used err is 0
total csum bytes: 7381752972
total tree bytes: 11145035776
total fs tree bytes: 2100396032
total extent tree bytes: 1137082368
btree space waste bytes: 996179488
file data blocks allocated: 7560766566400
 referenced 7681157672960
0


> 2) It recently came to the attention of the devs that the existing
> btrfs 
> mount-option method of clearing the free-space cache only clears it
> for 
> block-groups/chunks it encounters on-the-fly.  It doesn't do a
> systematic 
> beginning-to-end clear.
So that calls for fixing the documentation as well?!



> 3) As a result of #2, the devs only very recently added support in
> btrfs 
> check for a /full/ space-cache-v1 clear, using the new
> --clear-space-cache option.  But your btrfs-progs v4.7.3 is too old
> to 
> support it.  I know it's in the v4.9 I just upgraded to... checking
> the 
> wiki it appears the option was added in btrfs-progs v4.8.3 (v4.8.4
> for v2 
> cache).

And is the new option stable?! ;-)

> Tho if you haven't recently run a scrub, I'd do that as well
Well I did a full verification using my own checksums (i.e. every
regular file in the fs has a SHA512 sum attached as an XATTR)... since that
caused all data to be read, this should be equivalent to a scrub, at
least for the regular files' data (though not necessarily the metadata),
shouldn't it? 
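
(For comparison, a scrub verifies btrfs' own checksums for both data and 
metadata on the mounted filesystem; a minimal sketch, assuming the fs is 
mounted at /data:

  btrfs scrub start -Bd /data   # -B stays in the foreground, -d prints per-device stats
  btrfs scrub status /data
)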


Cheers,
Chris.



Btrfs send does not preserve reflinked files within subvolumes.

2016-12-28 Thread Glenn Washburn
I'm having a hard time getting btrfs receive to create reflinked files
and have a trivial example that I believe *should* work but doesn't.
I've attached a script that I used to perform this test, so others can
try to reproduce.  The text file is the output of the shell script
except the last command, which is a tool I wrote to print the extent
info from FIEMAP.  "btrfs fi du" would work just as well, but I'm on
Ubuntu 16.04, whose btrfs progs doesn't have that command yet.  I've
also tested on Ubuntu 16.10 with similar results, except that "btrfs fi
du" is on that version and confirms what my tool displays.

So, can send not do what I'm trying to get it to do? If it can now,
when did that feature get introduced (must have been after kernel
4.8)?  I'm very surprised that this feature wouldn't already have been
implemented, and if it hasn't, that no one seems to be complaining about it. I've
done a decent amount of searching on this and have come up with
nothing.  Any help would be greatly appreciated.

Thanks,
Glenn


sending-reflinked-files.sh
Description: application/shellscript
+ BVOL1=/media/tmp1
+ BVOL2=/media/tmp2
+ uname -a
Linux crass-Ideapad-Z570 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 
2016 x86_64 x86_64 x86_64 GNU/Linux
+ btrfs --version
btrfs-progs v4.4
+ cd /media/tmp1
+ btrfs sub create C
Create subvolume './C'
+ dd if=/dev/urandom of=C/test count=100
100+0 records in
100+0 records out
51200 bytes (51 kB, 50 KiB) copied, 0.00485396 s, 10.5 MB/s
+ cp --reflink=always -vp C/test C/test.ref
'C/test' -> 'C/test.ref'
+ btrfs property set C ro true
+ btrfs send -v C
+ btrfs receive /media/tmp2
At subvol C
BTRFS_IOC_SEND returned 0
joining genl thread
At subvol C
+ ~/development/reflink_tools.git/fmaptool.py print -P C/test C/test.ref 
/media/tmp2/C/test /media/tmp2/C/test.ref
C/test
--
0+53248: 12845056
C/test.ref
--
0+53248: 12845056
/media/tmp2/C/test
--
0+49152: NULL
49152+4096: 12845056
/media/tmp2/C/test.ref
--
0+49152: NULL
49152+4096: 12849152


Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> Hi,
> could you try to run with the following patch on top of the previous
> one? I do not think it will make a large change in your workload but
> I think we need something like that, so some testing under a workload known
> to generate high lowmem pressure would be really appreciated. If you have
> more time to play with it then running with and without the patch with
> mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> whether it makes any difference at all.
> 
> I would also appreciate if Mel and Johannes had a look at it. I am not
> yet sure whether we need the same thing for anon/file balancing in
> get_scan_count. I suspect we need but need to think more about that.
> 
> Thanks a lot again!
> ---
> From b51f50340fe9e40b68be198b012f8ab9869c1850 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Tue, 27 Dec 2016 16:28:44 +0100
> Subject: [PATCH] mm, vmscan: consider eligible zones in get_scan_count
> 
> get_scan_count considers the whole node LRU size when
> - doing SCAN_FILE due to many page cache inactive pages
> - calculating the number of pages to scan
> 
> in both cases this might lead to unexpected behavior especially on 32b
> systems where we can expect lowmem memory pressure very often.
> 
> A large highmem zone can easily distort SCAN_FILE heuristic because
> there might be only few file pages from the eligible zones on the node
> lru and we would still enforce file lru scanning which can lead to
> thrashing while we could still scan anonymous pages.

Nit:
It doesn't cause thrashing, because isolate_lru_pages filters them out,
but I agree it burns CPU pointlessly to find eligible pages.

> 
> The later use of lruvec_lru_size can be problematic as well. Especially
> when there are not many pages from the eligible zones. We would have to
> skip over many pages to find anything to reclaim but shrink_node_memcg
> would only reduce the remaining number to scan by SWAP_CLUSTER_MAX
> at maximum. Therefore we can end up going over a large LRU many times
> without actually having a chance to reclaim much if anything at all. The
> closer we are to running out of memory on the lowmem zone the worse the
> problem will be.
> 
> Signed-off-by: Michal Hocko 
> ---
>  mm/vmscan.c | 30 ++++++++++++++++++++++++++++--
>  1 file changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c98b1a585992..785b4d7fb8a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -252,6 +252,32 @@ unsigned long lruvec_zone_lru_size(struct lruvec 
> *lruvec, enum lru_list lru, int
>  }
>  
>  /*
> + * Return the number of pages on the given lru which are eligibne for the
eligible
> + * given zone_idx
> + */
> +static unsigned long lruvec_lru_size_zone_idx(struct lruvec *lruvec,
> + enum lru_list lru, int zone_idx)

Nit:

Although there is a comment, function name is rather confusing when I compared
it with lruvec_zone_lru_size.

lruvec_eligible_zones_lru_size is better?


> +{
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + unsigned long lru_size;
> + int zid;
> +
> + lru_size = lruvec_lru_size(lruvec, lru);
> + for (zid = zone_idx + 1; zid < MAX_NR_ZONES; zid++) {
> + struct zone *zone = &pgdat->node_zones[zid];
> + unsigned long size;
> +
> + if (!managed_zone(zone))
> + continue;
> +
> + size = lruvec_zone_lru_size(lruvec, lru, zid);
> + lru_size -= min(size, lru_size);
> + }
> +
> + return lru_size;
> +}
> +
> +/*
>   * Add a shrinker callback to be called from the vm.
>   */
>  int register_shrinker(struct shrinker *shrinker)
> @@ -2207,7 +2233,7 @@ static void get_scan_count(struct lruvec *lruvec, 
> struct mem_cgroup *memcg,
>* system is under heavy pressure.
>*/
>   if (!inactive_list_is_low(lruvec, true, sc) &&
> - lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
> + lruvec_lru_size_zone_idx(lruvec, LRU_INACTIVE_FILE, 
> sc->reclaim_idx) >> sc->priority) {
>   scan_balance = SCAN_FILE;
>   goto out;
>   }
> @@ -2274,7 +2300,7 @@ static void get_scan_count(struct lruvec *lruvec, 
> struct mem_cgroup *memcg,
>   unsigned long size;
>   unsigned long scan;
>  
> - size = lruvec_lru_size(lruvec, lru);
> + size = lruvec_lru_size_zone_idx(lruvec, lru, 
> sc->reclaim_idx);
>   scan = size >> sc->priority;
>  
>   if (!scan && pass && force_scan)
> -- 
> 2.10.2

Nit:

With this patch, inactive_list_is_low can use lruvec_lru_size_zone_idx
rather than its own custom calculation to filter out non-eligible pages.

Anyway, I think this patch does the right thing, so I support it.

Acked-by: Minchan Kim 

--

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Thu, Dec 29, 2016 at 09:31:54AM +0900, Minchan Kim wrote:
> On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> > On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > > 
> > > > Nils, even though this is still highly experimental, could you give it a
> > > > try please?
> > > 
> > > Yes, no problem! So I kept the very first patch you sent but had to
> > > revert the latest version of the debugging patch (the one in
> > > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > > memory cgroups enabled again, and the first thing that strikes the eye
> > > is that I get this during boot:
> > > 
> > > [1.568174] [ cut here ]
> > > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > > mem_cgroup_update_lru_size+0x118/0x130
> > > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > > not empty
> > 
> > Ohh, I can see what is wrong! a) there is a bug in the accounting in
> > my patch (I double account) and b) the detection for the empty list
> > cannot work after my change because per node zone will not match per
> > zone statistics. The updated patch is below. So I hope my brain already
> > works after it's been mostly off last few days...
> > ---
> > From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Fri, 23 Dec 2016 15:11:54 +0100
> > Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests 
> > when
> >  memcg is enabled
> > 
> > Nils Holland has reported unexpected OOM killer invocations with 32b
> > kernel starting with 4.8 kernels
> > 
> > kworker/u4:5 invoked oom-killer: 
> > gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, 
> > oom_score_adj=0
> > kworker/u4:5 cpuset=/ mems_allowed=0
> > CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
> > [...]
> > Mem-Info:
> > active_anon:58685 inactive_anon:90 isolated_anon:0
> >  active_file:274324 inactive_file:281962 isolated_file:0
> >  unevictable:0 dirty:649 writeback:0 unstable:0
> >  slab_reclaimable:40662 slab_unreclaimable:17754
> >  mapped:7382 shmem:202 pagetables:351 bounce:0
> >  free:206736 free_pcp:332 free_cma:0
> > Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
> > inactive_file:1127848kB unevictable:0kB isolated(anon):0kB 
> > isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB 
> > shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB 
> > unstable:0kB pages_scanned:0 all_unreclaimable? no
> > DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
> > inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
> > writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
> > slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
> > pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> > lowmem_reserve[]: 0 813 3474 3474
> > Normal free:41332kB min:41368kB low:51708kB high:62048kB 
> > active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
> > unevictable:0kB writepending:24kB present:897016kB managed:836248kB 
> > mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB 
> > kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB 
> > local_pcp:340kB free_cma:0kB
> > lowmem_reserve[]: 0 0 21292 21292
> > HighMem free:781660kB min:512kB low:34356kB high:68200kB 
> > active_anon:234740kB inactive_anon:360kB active_file:557232kB 
> > inactive_file:1127804kB unevictable:0kB writepending:2592kB 
> > present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB 
> > slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB 
> > free_pcp:800kB local_pcp:608kB free_cma:0kB
> > 
> > the oom killer is clearly pre-mature because there is still a
> > lot of page cache in the zone Normal which should satisfy this lowmem
> > request. Further debugging has shown that the reclaim cannot make any
> > forward progress because the page cache is hidden in the active list
> > which doesn't get rotated because inactive_list_is_low is not memcg
> > aware.
> > It simply subtracts per-zone highmem counters from the respective
> > memcg's lru sizes which doesn't make any sense. We can simply end up
> > always seeing the resulting active and inactive counts 0 and return
> > false. This issue is not limited to 32b kernels but in practice the
> > effect on systems without CONFIG_HIGHMEM would be much harder to notice
> > because we do not invoke the OOM killer for allocation requests
> > targeting < ZONE_NORMAL.
> > 
> > Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
> > and subtract per-memcg highmem counts when memcg is enabled. Introduce
> > helper 

Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Minchan Kim
On Mon, Dec 26, 2016 at 01:48:40PM +0100, Michal Hocko wrote:
> On Fri 23-12-16 23:26:00, Nils Holland wrote:
> > On Fri, Dec 23, 2016 at 03:47:39PM +0100, Michal Hocko wrote:
> > > 
> > > Nils, even though this is still highly experimental, could you give it a
> > > try please?
> > 
> > Yes, no problem! So I kept the very first patch you sent but had to
> > revert the latest version of the debugging patch (the one in
> > which you added the "mm_vmscan_inactive_list_is_low" event) because
> > otherwise the patch you just sent wouldn't apply. Then I rebooted with
> > memory cgroups enabled again, and the first thing that strikes the eye
> > is that I get this during boot:
> > 
> > [1.568174] [ cut here ]
> > [1.568327] WARNING: CPU: 0 PID: 1 at mm/memcontrol.c:1032 
> > mem_cgroup_update_lru_size+0x118/0x130
> > [1.568543] mem_cgroup_update_lru_size(f4406400, 2, 1): lru_size 0 but 
> > not empty
> 
> Ohh, I can see what is wrong! a) there is a bug in the accounting in
> my patch (I double account) and b) the detection for the empty list
> cannot work after my change because per node zone will not match per
> zone statistics. The updated patch is below. So I hope my brain already
> works after it's been mostly off last few days...
> ---
> From 397adf46917b2d9493180354a7b0182aee280a8b Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Fri, 23 Dec 2016 15:11:54 +0100
> Subject: [PATCH] mm, memcg: fix the active list aging for lowmem requests when
>  memcg is enabled
> 
> Nils Holland has reported unexpected OOM killer invocations with 32b
> kernel starting with 4.8 kernels
> 
>   kworker/u4:5 invoked oom-killer: 
> gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, 
> oom_score_adj=0
>   kworker/u4:5 cpuset=/ mems_allowed=0
>   CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
>   [...]
>   Mem-Info:
>   active_anon:58685 inactive_anon:90 isolated_anon:0
>active_file:274324 inactive_file:281962 isolated_file:0
>unevictable:0 dirty:649 writeback:0 unstable:0
>slab_reclaimable:40662 slab_unreclaimable:17754
>mapped:7382 shmem:202 pagetables:351 bounce:0
>free:206736 free_pcp:332 free_cma:0
>   Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
> inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
> mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB 
> shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB 
> pages_scanned:0 all_unreclaimable? no
>   DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
> inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
> writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
> slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
> pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
>   lowmem_reserve[]: 0 813 3474 3474
>   Normal free:41332kB min:41368kB low:51708kB high:62048kB 
> active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
> unevictable:0kB writepending:24kB present:897016kB managed:836248kB 
> mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB 
> kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB 
> local_pcp:340kB free_cma:0kB
>   lowmem_reserve[]: 0 0 21292 21292
>   HighMem free:781660kB min:512kB low:34356kB high:68200kB 
> active_anon:234740kB inactive_anon:360kB active_file:557232kB 
> inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
> managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
> kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
> free_cma:0kB
> 
> the oom killer is clearly pre-mature because there is still a
> lot of page cache in the zone Normal which should satisfy this lowmem
> request. Further debugging has shown that the reclaim cannot make any
> forward progress because the page cache is hidden in the active list
> which doesn't get rotated because inactive_list_is_low is not memcg
> aware.
> It simply subtracts per-zone highmem counters from the respective
> memcg's lru sizes which doesn't make any sense. We can simply end up
> always seeing the resulting active and inactive counts 0 and return
> false. This issue is not limited to 32b kernels but in practice the
> effect on systems without CONFIG_HIGHMEM would be much harder to notice
> because we do not invoke the OOM killer for allocation requests
> targeting < ZONE_NORMAL.
> 
> Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
> and subtract per-memcg highmem counts when memcg is enabled. Introduce
> helper lruvec_zone_lru_size which redirects to either zone counters or
> mem_cgroup_get_zone_lru_size when appropriate.
> 
> We are losing the empty LRU but non-zero lru size detection introduced by
> ca707239e8a7 ("mm: 

Incremental send receive of snapshot fails

2016-12-28 Thread Rene Wolf

Hi all


I have a problem with incremental snapshot send / receive in btrfs. Maybe 
my google-fu is weak, but I couldn't find any pointers, so here goes.



A few words about my setup first:

I have multiple clients that back up to a central server. All clients 
(and the server) are running a (K)Ubuntu 16.10 64Bit on btrfs. Backing 
up works with btrfs send / receive, either full or incremental, 
depending on what's available on the server side. All clients have the 
usual (Ubuntu) btrfs layout: 2 subvolumes, one for / and one for /home; 
explicit entries in fstab; root volume not mounted anywhere. For further 
details see the P.s. at the end.



Here's what happens:

In general I stick to the example from 
https://btrfs.wiki.kernel.org/index.php/Incremental_Backup . Backing up 
is done daily by a script, and it works successfully on all of my 
clients except one (called "lab").


I start with the first snapshot on "lab" and do a full send to the 
server. This works as expected (sending takes some hours as it is done 
over wifi+ssh). After that is done I send an incremental snapshot based 
on the previous parent. This also works as expected (no error etc). 
Sending deltas then happens once a day, with the script always keeping 
the last two snapshots on the client and many more on the server. Also 
after each run of the script I do a bit of "house keeping" to prevent 
"disk full" etc (see below p.s. for commands).


I can't exactly say when, but after some time (possibly the next day) 
snapshot sending fails with an error on the receiving end:

ERROR: unlink some/file failed. No such file or directory

Some searching around led me to this 
https://bugzilla.kernel.org/show_bug.cgi?id=60673 . So I checked to make 
sure my script doesn't use the wrong parent; and it does not. But to 
make really sure I tried a send / receive directly on "lab" without the 
server:


# btrfs subvol snap -r / /.back/new_snap

Create a readonly snapshot of '/' in '/.back/new_snap'


# btrfs subv show /.back/last_snap_by_script

/.back/last_snap_by_script
Name:   last_snap_by_script
UUID:   b4634a8b-b74b-154a-9f17-1115f6d07524
Parent UUID:b5f9a301-69f7-0646-8cf1-ba29e0c24fac
Received UUID:  196a0866-cd05-d24e-bac6-84e8e5eb037a
Creation time:  2016-12-27 17:55:10 +0100
Subvolume ID:   486
Generation: 52036
Gen at creation:51524
Parent ID:  257
Top level ID:   257
Flags:  readonly
Snapshot(s):


# btrfs subv show /.back/new_snap

/.back/new_snap
Name:   new_snap
UUID:   fca51929-8101-db45-8df6-f25935c04f98
Parent UUID:b5f9a301-69f7-0646-8cf1-ba29e0c24fac
Received UUID:  196a0866-cd05-d24e-bac6-84e8e5eb037a
Creation time:  2016-12-28 11:51:43 +0100
Subvolume ID:   506
Generation: 52271
Gen at creation:52271
Parent ID:  257
Top level ID:   257
Flags:  readonly
Snapshot(s):


# btrfs send -p /.back/last_snap_by_script /.back/new_snap > delta

At subvol /.back/new_snap


# btrfs subvol del /.back/new_snap

Delete subvolume (no-commit): '/.back/new_snap'


# cat delta | btrfs receive /.back/

At snapshot new_snap
ERROR: unlink some/file failed. No such file or directory


And the receive always fails with some ERROR similar to the above! What 
I find a bit odd is the identical "Received UUID", even before new_snap 
was sent / received ... but maybe that's normal?


If instead of "last_snap_by_script" I also create a new read only 
snapshot and send the delta between these two "new" ones, everything 
works as expected. But then there's little difference between the two 
new snaps ...


I tried to look for differences between the "lab" client and another one 
("navi") where backing up works. So far I couldn't really find anything. 
I did create both file systems at different points in time (possibly 
with different kernels). All fs were created as btrfs and not 
"converted" from ext. "lab" has an SSD, "navi" a spinning disc. Both 
systems run on Intel CPUs in 64Bit ...



So now I have a snapshot on "lab" which I cannot use as a parent, but 
why? What did I do wrong? The whole procedure does work on my other 
clients (with the exact same script), why not on the "lab" client? And 
this is a recurring problem: I tried deleting all of the snaps (on 
both ends) and starting all over again ... it will again end up with a 
"broken" snapshot eventually.



Up until now using btrfs has been a great experience and I could always 
resolve my troubles quite quickly, but this time I don't know what to do.
Thanks in advance for any suggestions and feel free to ask for other / 
missing details :-)



Regards
Rene


P.s.: 

Re: [PATCH] recursive defrag cleanup

2016-12-28 Thread Janos Toth F.
I still find the defrag tool a little bit confusing from a user perspective:
- Does the recursive defrag (-r) also defrag the specified directory's
extent tree or should one run two separate commands for completeness
(one with -r and one without -r)?
- What's the target scope of the extent tree defragmentation? Is it
recursive on the tree (regardless of the -r option) and thus
defragments all the extent trees in case one targets the root
subvolume?

In other words: What is the exact sequence of commands if one wishes
to defragment the whole filesystem as extensively as possible (all
files and extent trees included)?

There used to be a script floating around on various wikis (for
example, the Arch Linux wiki) which used the "find" tool to feed each
and every directory to the defrag command. I always thought that must
be overkill and now it's gone, but I don't see further explanations
and/or new scripts in place (other than a single command with the -r
option).
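
Roughly, that old snippet boiled down to something like the following (a 
sketch from memory, not the original script), feeding each directory to 
defrag on top of a plain recursive pass over the files:

  find /mnt -xdev -type d -exec btrfs filesystem defragment {} +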


It's also a bit of a mystery to me whether balancing the metadata chunks
is supposed to effectively defragment the metadata or not, and what the
best practice regarding that issue is.
In my personal experience Btrfs filesystems tend to get slower over
time, up to the point where it takes several minutes to mount them or
to delete some big files (observed on HDDs, not on SSDs, where the
sheer speed might mask the problem and the filesystems tend to be smaller
anyway). When it gets really bad, Gentoo's localmount script starts to
time out on boot and Samba-based network file deletions tend to freeze
the client Windows machine's file explorer.
It only takes 3-6 months and/or roughly 10-20 times the total disk
capacity's worth of write load to get there. Defrag doesn't seem to
help with that, but running a balance on each and every metadata block
group (data and system blocks can be skipped) seems to "renew" it (no
more timeouts or noticeable delays on mount, metadata operations are as
fast as expected, it works like a young filesystem...).
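
Concretely, the balance I mean is limited to metadata; a sketch, and the 
exact filters may vary:

  btrfs balance start -m /mnt            # rewrite all metadata block groups
  btrfs balance start -musage=50 /mnt    # or only the more sparsely used ones, to keep it short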

One might expect that targeting the root subvolume with a recursive
defrag will take care of metadata fragmentation as well but it doesn't
seem to be the case and I don't see anybody recommending regular
metadata balancing.


Re: [GIT PULL] btrfs fixes and cleanups

2016-12-28 Thread Qu Wenruo

Hi Liu,

At 12/15/2016 03:13 PM, Liu Bo wrote:

Hi David,

This is the collection of my patches targeting 4.10, I've
dropped patch "Btrfs: adjust len of writes if following a
preallocated extent" because of the deadlock caused by this
commit.

Patches are based on v4.9-rc8, and testing against fstests with
default mount options has been done to make sure it doesn't
break anything.

I haven't got a kernel.org git repo, so this is mainly for
tracking purpose and for testing git flow.

(cherry-pick patches might be the only way at this moment...sorry
for the inconvenience.)

Anyway, patches can be found at

https://github.com/liubogithub/btrfs-work.git for-dave

Thanks,
liubo

Liu Bo (9):
  Btrfs: add 'inode' for extent map tracepoint
  Btrfs: add truncated_len for ordered extent tracepoints
  Btrfs: use down_read_nested to make lockdep silent
  Btrfs: fix lockdep warning about log_mutex
  Btrfs: fix truncate down when no_holes feature is enabled
  Btrfs: fix btrfs_ordered_update_i_size to update disk_i_size properly
  Btrfs: fix comment in btrfs_page_mkwrite
  Btrfs: clean up btrfs_ordered_update_i_size


While testing David's for-next-20161219 branch, I found btrfs/06[0-5] 
will cause the following kernel panic when run in a row.


[ 4207.963063] assertion failed: disk_i_size < i_size, file: 
fs/btrfs//ordered-data.c, line: 1041

[ 4207.963722] [ cut here ]
[ 4207.964008] kernel BUG at fs/btrfs//ctree.h:3418!
[ 4207.964008] invalid opcode:  [#1] SMP
[ 4207.964008] Modules linked in: btrfs(O) netconsole ext4 jbd2 mbcache 
xor zlib_deflate raid6_pq xfs [last unloaded: btrfs]
[ 4207.964008] CPU: 0 PID: 3829 Comm: kworker/u4:5 Tainted: G 
O4.9.0+ #60
[ 4207.964008] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.10.1-20161122_114906-anatol 04/01/2014

[ 4207.964008] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[ 4207.964008] task: 88000bbf8040 task.stack: c90006598000
[ 4207.964008] RIP: 0010:[]  [] 
assfail.constprop.10+0x1c/0x1e [btrfs]


[ 4207.964008] Call Trace:
[ 4207.964008]  [] 
btrfs_ordered_update_i_size+0x2b1/0x2e0 [btrfs]
[ 4207.964008]  [] btrfs_finish_ordered_io+0x335/0x6b0 
[btrfs]

[ 4207.964008]  [] finish_ordered_fn+0x15/0x20 [btrfs]
[ 4207.964008]  [] btrfs_scrubparity_helper+0xef/0x610 
[btrfs]
[ 4207.964008]  [] btrfs_endio_write_helper+0xe/0x10 
[btrfs]

[ 4207.964008]  [] process_one_work+0x2af/0x720
[ 4207.964008]  [] ? process_one_work+0x22b/0x720
[ 4207.964008]  [] worker_thread+0x4b/0x4f0
[ 4207.964008]  [] ? process_one_work+0x720/0x720
[ 4207.964008]  [] ? process_one_work+0x720/0x720
[ 4207.964008]  [] kthread+0xf3/0x110
[ 4207.964008]  [] ? kthread_park+0x60/0x60
[ 4207.964008]  [] ret_from_fork+0x27/0x40
[ 4207.964008] Code: c7 00 e4 46 a0 48 89 e5 e8 c8 3c d8 e0 0f 0b 55 89 
f1 48 c7 c2 83 90 46 a0 48 89 fe 48 c7 c7 b0 e4 46 a0 48 89 e5 e8 aa 3c 
d8 e0 <0f> 0b 55 89 f1 48 c7 c2 fb 90 46 a0 48 89 fe 48 c7 c7 e8 e5 46
[ 4207.964008] RIP  [] assfail.constprop.10+0x1c/0x1e 
[btrfs]

[ 4207.964008]  RSP 
[ 4207.964008] ---[ end trace f7759d2fce14da9f ]---

Not sure if it's related to the patch or if it just exposed some bug we 
didn't find before.


Hope it helps.

Thanks,
Qu


  Btrfs: fix another race between truncate and lockless dio write

 fs/btrfs/extent-tree.c   |  3 ++-
 fs/btrfs/inode.c | 43 +++
 fs/btrfs/ordered-data.c  | 42 --
 fs/btrfs/tree-log.c  | 13 ++---
 include/trace/events/btrfs.h | 16 
 5 files changed, 83 insertions(+), 34 deletions(-)






Re: [RFC PATCH] mm, memcg: fix (Re: OOM: Better, but still there on)

2016-12-28 Thread Michal Hocko
On Tue 27-12-16 20:33:09, Nils Holland wrote:
> On Tue, Dec 27, 2016 at 04:55:33PM +0100, Michal Hocko wrote:
> > Hi,
> > could you try to run with the following patch on top of the previous
> > one? I do not think it will make a large change in your workload but
> > I think we need something like that, so some testing under a workload known
> > to generate high lowmem pressure would be really appreciated. If you have
> > more time to play with it then running with and without the patch with
> > mm_vmscan_direct_reclaim_{start,end} tracepoints enabled could tell us
> > whether it makes any difference at all.
> 
> Of course, no problem!
> 
> First, about the events to trace: mm_vmscan_direct_reclaim_start
> doesn't seem to exist, but mm_vmscan_direct_reclaim_begin does. I'm
> sure that's what you meant and so I took that one instead.

yes, sorry about the confusion
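
(For completeness, enabling those two tracepoints is just a matter of 
switching them on under tracefs; a sketch, assuming the usual 
/sys/kernel/debug/tracing mount:

  echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_direct_reclaim_begin/enable
  echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_direct_reclaim_end/enable
  cat /sys/kernel/debug/tracing/trace_pipe > vmscan-trace.log
)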

> Then I have to admit in both cases (once without the latest patch,
> once with) very little trace data was actually produced. In the case
> without the patch, the reclaim was started more often and reclaimed a
> smaller number of pages each time, in the case with the patch it was
> invoked less often, and with the last time it was invoked it reclaimed
> a rather big number of pages. I have no clue, however, if that
> happened "by chance" or if it was actually causes by the patch and
> thus an expected change.

yes that seems to be a variation of the workload I would say because if
anything the patch should reduce the number of scanned pages.

> In both cases, my test case was: Reboot, setup logging, do "emerge
> firefox" (which unpacks and builds the firefox sources), then, when
> the emerge had come so far that the unpacking was done and the
> building had started, switch to another console and untar the latest
> kernel, libreoffice and (once more) firefox sources there. After that
> had completed, I aborted the emerge build process and stopped tracing.
> 
> Here's the trace data captured without the latest patch applied:
> 
> khugepaged-22[000]    566.123383: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000] .N..   566.165520: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1100
> khugepaged-22[001]    587.515424: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    587.596035: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1029
> khugepaged-22[001]    599.879536: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    601.000812: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1100
> khugepaged-22[001]    601.228137: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    601.309952: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1081
> khugepaged-22[001]    694.935267: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001] .N..   695.081943: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1071
> khugepaged-22[001]    701.370707: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    701.372798: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1089
> khugepaged-22[001]    764.752036: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    771.047905: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1039
> khugepaged-22[000]    781.760515: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    781.826543: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> khugepaged-22[001]    782.595575: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[000]    782.638591: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> khugepaged-22[001]    782.930455: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    782.993608: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> khugepaged-22[001]    783.330378: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    783.369653: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1040
> 
> And this is the same with the patch applied:
> 
> khugepaged-22[001]    523.57: mm_vmscan_direct_reclaim_begin: 
> order=9 may_writepage=1 gfp_flags=GFP_TRANSHUGE classzone_idx=3
> khugepaged-22[001]    523.683110: mm_vmscan_direct_reclaim_end: 
> nr_reclaimed=1092
> khugepaged-22[001]    535.345477: 

Re: [PATCH] mm, vmscan: consider eligible zones in get_scan_count

2016-12-28 Thread Michal Hocko
On Wed 28-12-16 00:28:38, kbuild test robot wrote:
> Hi Michal,
> 
> [auto build test ERROR on mmotm/master]
> [also build test ERROR on v4.10-rc1 next-20161224]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Michal-Hocko/mm-vmscan-consider-eligible-zones-in-get_scan_count/20161228-000917
> base:   git://git.cmpxchg.org/linux-mmotm.git master
> config: i386-tinyconfig (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>mm/vmscan.c: In function 'lruvec_lru_size_zone_idx':
> >> mm/vmscan.c:264:10: error: implicit declaration of function 
> >> 'lruvec_zone_lru_size' [-Werror=implicit-function-declaration]
>   size = lruvec_zone_lru_size(lruvec, lru, zid);

this patch depends on the previous one
http://lkml.kernel.org/r/20161226124839.gb20...@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs