Re: File system is oddly full after kernel upgrade, balance doesn't help

2017-01-27 Thread Duncan
MegaBrutal posted on Fri, 27 Jan 2017 19:45:00 +0100 as excerpted:

> Hi,
> 
> Not sure if it is caused by the upgrade, but I only encountered this
> problem after I upgraded to Ubuntu Yakkety, which comes with a 4.8
> kernel.
> Linux vmhost 4.8.0-34-generic #36-Ubuntu SMP Wed Dec 21 17:24:18 UTC
> 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
> This is the 2nd file system which showed these symptoms, so I thought
> it's more than happenstance. I don't remember what I did with the first
> one, but I somehow managed to fix it with balance, if I remember
> correctly, but it doesn't help with this one.
> 
> FS state before any attempts to fix:
> Filesystem  1M-blocks   Used Available Use% Mounted on
> [...]curlybrace  1024   1024 0 100% /tmp/mnt/curlybrace
> 
> Resized LV, run „btrfs filesystem resize max /tmp/mnt/curlybrace”:
> [...]curlybrace  2048   1303 0 100% /tmp/mnt/curlybrace
> 
> Notice how the usage magically jumped up to 1303 MB, and despite the FS
> size is 2048 MB, the usage is still displayed as 100%.
> 
> Tried full balance (other options with -dusage had no result):
> root@vmhost:~# btrfs balance start -v /tmp/mnt/curlybrace

> Starting balance without any filters.
> ERROR: error during balancing '/tmp/mnt/curlybrace':
> No space left on device

> No space left on device? How?
> 
> But it changed the situation:
> [...]curlybrace  2048   1302   190  88% /tmp/mnt/curlybrace
> 
> This is still not acceptable. I need to recover at least 50% free space
> (since I increased the FS to the double).
> 
> A 2nd balance attempt resulted in this:
> [...]curlybrace  2048   1302   162  89% /tmp/mnt/curlybrace
> 
> So... it became slightly worse.
> 
> What's going on? How can I fix the file system to show real data?

Something seems off, yes, but...

https://btrfs.wiki.kernel.org/index.php/FAQ

Reading the whole thing will likely be useful, but especially 1.3/1.4 and 
4.6-4.9, which discuss space usage and reporting on btrfs and (primarily 
in some of the other space-related FAQs beyond those specific ones) how 
to try to fix things when space runs out.

If you read them before, read them again, because you didn't post the 
btrfs free-space reports covered in 4.7, instead posting what appears to 
be the standard (non-btrfs) df report, which for all the reasons 
explained in the FAQ, is at best only an estimate on btrfs.  That 
estimate is obviously behaving unexpectedly in your case, but without the 
btrfs specific reports, it's nigh impossible to even guess with any 
chance at accuracy what's going on, or how to fix it.
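
As an illustration of why the plain df numbers look so odd here: df derives 
Use% from used/(used+available), rounded up, not from used/size.  A small 
sketch of that arithmetic (the function name is made up, and the inputs are 
the rounded MiB figures from the report, so edge cases could round 
differently than the real tool):

```shell
#!/bin/sh
# Hypothetical helper: reproduce df's Use% from the "Used" and "Available"
# columns.  df rounds used/(used+available) up to the next whole percent.
df_use_pct() {
    used=$1
    avail=$2
    total=$((used + avail))
    if [ "$total" -eq 0 ]; then
        echo "-"
        return 0
    fi
    echo $(( (used * 100 + total - 1) / total ))
}

df_use_pct 1303 0     # after the resize
df_use_pct 1302 190   # after the first balance
df_use_pct 1302 162   # after the second balance
```

With those inputs it gives 100, 88 and 89, matching the Use% column quoted 
above, so the percentage only says that "available" is small relative to 
"used".  It says nothing about where the rest of the 2048 MiB went; that is 
what the btrfs-specific reports are for.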

A WAG would be that part of the problem might be that you were into 
global reserve before the resize, so after the filesystem got more space 
to use, the first thing it did was unload that global reserve usage, 
thereby immediately upping apparent usage.  That might explain that 
initial jump in usage after the resize.  But that's just a WAG.  Without 
at least btrfs filesystem usage, or btrfs filesystem df plus btrfs 
filesystem show, from before the resize, after, and before and after the 
balances, a WAG is what it remains.  And again, without those reports, 
there's no way to say whether balance can be expected to help, or not.
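
For reference, a minimal sketch of collecting those reports in one go, so 
they can be saved before and after each step.  The function name and the 
BTRFS override variable are just illustrative; the three subcommands are the 
ones named above, and they will generally need root:

```shell
#!/bin/sh
# Sketch only: gather the btrfs-specific space reports for one mount point.
# BTRFS may be overridden (e.g. for a dry run); it defaults to the real tool.
BTRFS="${BTRFS:-btrfs}"

collect_reports() {
    mnt="$1"
    for report in "filesystem usage" "filesystem df" "filesystem show"; do
        echo "== btrfs $report $mnt =="
        # $report is left unquoted on purpose so it splits into two words
        $BTRFS $report "$mnt"
    done
}
```

Something like "collect_reports /tmp/mnt/curlybrace > before-resize.txt", 
repeated after the resize and after each balance, would give the set of 
reports needed to say more than a WAG.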

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs recovery

2017-01-27 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 27 Jan 2017 07:58:20 -0500 as
excerpted:

> On 2017-01-27 06:01, Oliver Freyermuth wrote:
>>> I'm also running 'memtester 12G' right now, which at least tests 2/3
>>> of the memory. I'll leave that running for a day or so, but of course
>>> it will not provide a clear answer...
>>
>> A small update: while the online memtester is without any errors still,
>> I checked old syslogs from the machine and found something intriguing.

>> kernel: Corrupted low memory at 88009000 (9000 phys) = 00098d39
>> kernel: Corrupted low memory at 88009000 (9000 phys) = 00099795
>> kernel: Corrupted low memory at 88009000 (9000 phys) = 000dd64e

0x9000 = 36K...

>> This seems to be consistently happening from time to time (I have low
>> memory corruption checking compiled in).
>> The numbers always consistently increase, and after a reboot, start
>> fresh from a small number again.
>>
>> I suppose this is a BIOS bug and it's storing some counter in low
>> memory. I am unsure whether this could have triggered the BTRFS
>> corruption, nor do I know what to do about it (are there kernel quirks
>> for that?). The vendor does not provide any updates, as usual.
>>
>> If someone could confirm whether this might cause corruption for btrfs
>> (and maybe direct me to the correct place to ask for a kernel quirk for
>> this device - do I ask on MM, or somewhere else?), that would be much
>> appreciated.

> It is a firmware bug, Linux doesn't use stuff in that physical address
> range at all.  I don't think it's likely that this specific bug caused
> the corruption, but given that the firmware doesn't have its
> allocations listed correctly in the e820 table (if they were listed
> correctly, you wouldn't be seeing this message), it would not surprise
> me if the firmware was involved somehow.

Correct me if I'm wrong (I'm no kernel expert, but I've been building my 
own kernel for well over a decade now, so I have a working familiarity 
with the kernel options, and the following is my possibly incorrect 
read of them), but I believe that's only "fact check: mostly correct" 
(mostly as in yes, it's the default, but there's a mainline kernel option 
to change it).

I was just going over the related kernel options again a couple days ago, 
so they're fresh in my head, and AFAICT...

There are THREE semi-related kernel options (config UI option location is 
based on the mainline 4.10-rc5+ git kernel I'm presently running):

DEFAULT_MMAP_MIN_ADDR

Config location: Processor type and features:
Low address space to protect from user allocation

This one is virtual memory according to config help, so likely not 
directly related, but similar idea.

X86_CHECK_BIOS_CORRUPTION

Location: Same section, a few lines below the first one:
Check for low memory corruption

I guess this is the option you (OF) have enabled.  Note that according to 
help, in addition to enabling this in options, a runtime kernel 
commandline option must be given as well, to actually enable the checks.

X86_RESERVE_LOW

Location: Same section, immediately below the check option:
Amount of low memory, in kilobytes, to reserve for the BIOS

Help for this one suggests enabling the check bios corruption option 
above if there are any doubts, so the two are directly related.

All three options apparently default to 64K (as that's what I see here 
and I don't believe I've changed them), but can be changed.  See the 
kernel options help and where it points for more.
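
A quick way to see what a given kernel was actually built and booted with 
might look like the sketch below.  The function names are made up, the 
/boot/config-<release> path is an assumption (not every distro installs the 
config there), and memory_corruption_check is, as far as I know, the runtime 
parameter the kernel-parameters documentation lists for the check option:

```shell
#!/bin/sh
# Print the three options discussed above from a kernel config file.
# Default path is an assumption; pass another file as the first argument.
show_lowmem_options() {
    cfg="${1:-/boot/config-$(uname -r)}"
    grep -E '^CONFIG_(DEFAULT_MMAP_MIN_ADDR|X86_CHECK_BIOS_CORRUPTION|X86_RESERVE_LOW)=' "$cfg"
}

# Show any memory_corruption_check* parameters given on the kernel
# command line (defaults to the running kernel's /proc/cmdline).
show_check_cmdline() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep '^memory_corruption_check'
}
```

If the second function prints nothing, the checks may still be on via a 
build-time default, so the dmesg output remains the real evidence.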

My read of the above is that yes, by default the kernel won't use 
physical 0x9000 (36K), as it's well within the 64K default reserve area, 
but a blanket "Linux doesn't use stuff in that physical address range at 
all" is incorrect, as if the defaults have been changed it /could/ use 
that space (#3's minimum is 1 page, 4K, leaving that 36K address 
uncovered) -- there's a mainline-official option to do so, so it doesn't 
even require patching.
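
The arithmetic behind that, as a small sketch (the helper name is invented 
for illustration; the reserve size is given in KiB, as in the config option):

```shell
#!/bin/sh
# Report whether a physical address lies inside the first N KiB of memory.
in_low_reserve() {
    addr=$(($1))        # accepts hex such as 0x9000
    reserve_kib=$2
    if [ "$addr" -lt $((reserve_kib * 1024)) ]; then
        echo "covered"
    else
        echo "not covered"
    fi
}

echo $((0x9000 / 1024))      # 36, i.e. the address sits at 36K
in_low_reserve 0x9000 64     # covered: 36864 < 65536 (default reserve)
in_low_reserve 0x9000 4      # not covered: 36864 >= 4096 (one 4K page)
```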

Meanwhile, since the defaults cover it, no quirk should be necessary (tho 
I might increase the reserve and test coverage area to the maximum 640K 
and run for awhile to be sure it's not going above the 64K default), but 
were it outside the default 64K coverage area, I would probably file it 
as a bug (my usual method for confirmed bugs), and mark it initially as 
an arch-x86 bug, tho they may switch it to something else, later.  But 
the devs would probably suggest further debugging, possibly giving you 
debug patches to try, etc, to nail down the specific device, before 
setting up a quirk for it.  Because the problem could be an expansion 
card or something, not the mobo/factory-default-machine, too, and it'd be 
a shame to setup a quirk for the wrong hardware.

>> Additionally, I found that "btrfs restore" works on this broken FS. I
>> will take an external backup of the content within the next 24 hours
>> using that, then I am ready to try anything you suggest.

> FWIW the fact that btrfs restore works is a good 

[GIT PULL] Btrfs

2017-01-27 Thread Chris Mason
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.10

Has some fixes that we've collected from the list.  We still have one 
more pending to nail down a regression in lzo compression, but I wanted 
to get this batch out the door.

Omar Sandoval (3) commits (+2/-6):
 Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations (+0/-2)
 Btrfs: remove old tree_root case in btrfs_read_locked_inode() (+1/-4)
 Btrfs: disable xattr operations on subvolume directories (+1/-0)

Liu Bo (1) commits (+12/-1):
 Btrfs: fix truncate down when no_holes feature is enabled

Chandan Rajendra (1) commits (+2/-2):
 Btrfs: Fix deadlock between direct IO and fast fsync

Wang Xiaoguang (1) commits (+1/-0):
 btrfs: fix false enospc error when truncating heavily reflinked file

Total: (6) commits (+17/-9)

  fs/btrfs/inode.c | 26 +-
  1 file changed, 17 insertions(+), 9 deletions(-)


Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.

2017-01-27 Thread Adam Borowski
On Fri, Jan 27, 2017 at 03:03:18PM -0500, Austin S. Hemmelgarn wrote:
> On 2017-01-27 11:47, Hans Deragon wrote:
> > However, as a user, I am seeking for an easy, no maintenance raid
> > solution.  I wish that if a drive fails, the btrfs filesystem still
> > mounts rw and leaves the OS running, but warns the user of the failing
> > disk and easily allow the addition of a new drive to reintroduce
> > redundancy.

> Before I make any suggestions regarding this, I should point out that
> mounting read-write when a device is missing is what caused this issue in
> the first place.  Doing so is extremely dangerous in any RAID setup,
> regardless of your software stack.  The filesystem is expected to store
> things reliably when a write succeeds, and if you've got a broken RAID
> array, claiming that you can store things reliably is generally a lie. MD
> and LVM both have things in place to mitigate most of the risk, but even
> there it's still risky.  Yes, it's not convenient to have to deal with a
> system that won't boot, but it's at least a whole lot easier from Linux than
> it is in most other operating systems.

Now, now.  Other RAID implementations already have this feature that you're
clamoring for!  When it is degraded, they will continue without a hitch, and
perform their duties not even bothering the user.  Then a couple years
later, the other disk will fail.  Obviously, there are no backups -- "we
have RAID".  This is when I get a call.

> The second is proper monitoring.  A well set up monitoring system will let
> you know when the disk is failing before it gets to the point of just
> disappearing from the system most of the time.

No problem, the second busted disk I mentioned above will include a full
mbox with a mail from mdadm for every single day.  They were either unread,
or read by an admin who ignored them and perhaps even wrote a filter to send
them to /dev/null.  Because the system still works, what's the hurry?


Meow!
-- 
Autotools hint: to do a zx-spectrum build on a pdp11 host, type:
  ./configure --host=zx-spectrum --build=pdp11


Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.

2017-01-27 Thread Austin S. Hemmelgarn

On 2017-01-27 11:47, Hans Deragon wrote:
> On 2017-01-24 14:48, Adam Borowski wrote:
>> On Tue, Jan 24, 2017 at 01:57:24PM -0500, Hans Deragon wrote:
>>> If I remove 'ro' from the option, I cannot get the filesystem mounted
>>> because of the following error: BTRFS: missing devices(1) exceeds the
>>> limit(0), writeable mount is not allowed So I am stuck. I can only
>>> mount the filesystem as read-only, which prevents me to add a disk.
>>
>> A known problem: you get only one shot at fixing the filesystem, but
>> that's not because of some damage but because the check whether the fs
>> is in a shape is good enough to mount is oversimplistic.
>>
>> Here's a patch, if you apply it and recompile, you'll be able to mount
>> degraded rw.
>>
>> Note that it removes a safety harness: here, the harness got tangled
>> up and keeps you from recovering when it shouldn't, but it _has_ valid
>> uses that.
>>
>> Meow!
>
> Greetings,
>
> Ok, that solution will solve my problem in the short run, i.e. getting
> my raid1 up again.
>
> However, as a user, I am seeking for an easy, no maintenance raid
> solution.  I wish that if a drive fails, the btrfs filesystem still
> mounts rw and leaves the OS running, but warns the user of the failing
> disk and easily allow the addition of a new drive to reintroduce
> redundancy.  Are there any plans within the btrfs community to implement
> such a feature?  In a year from now, when the other drive will fail,
> will I hit again this problem, i.e. my OS failing to start, booting into
> a terminal, and cannot reintroduce a new drive without recompiling the
> kernel?

Before I make any suggestions regarding this, I should point out that 
mounting read-write when a device is missing is what caused this issue 
in the first place.  Doing so is extremely dangerous in any RAID setup, 
regardless of your software stack.  The filesystem is expected to store 
things reliably when a write succeeds, and if you've got a broken RAID 
array, claiming that you can store things reliably is generally a lie. 
MD and LVM both have things in place to mitigate most of the risk, but 
even there it's still risky.  Yes, it's not convenient to have to deal 
with a system that won't boot, but it's at least a whole lot easier from 
Linux than it is in most other operating systems.


Now, the first step to reliable BTRFS usage is using up-to-date kernels. 
 If you're actually serious about using BTRFS, you should be doing this 
anyway though.  Assuming you're keeping up-to-date on the kernel, then 
you won't hit this same problem again (or at least you shouldn't, since 
multiple people now have checks for this in their regression testing 
suites for BTRFS).


The second is proper monitoring.  A well set up monitoring system will 
let you know when the disk is failing before it gets to the point of 
just disappearing from the system most of the time.  There is currently 
no specific monitoring tool for BTRFS, but it's really easy to set up 
automated monitoring for stuff like this.  It's impractical for me to 
cover exact configuration here, since I don't know how much background 
you have dealing with stuff like this (and you're probably using systemd 
since it's Ubuntu, and I have near zero background dealing with 
recurring task scheduling with that).  I can however cover a list of 
what you should be monitoring and roughly how often:
1. SMART status from the storage devices.  You'll need smartmontools for 
this.  In general, I'd suggest using smartctl through cron or a systemd 
timer unit to monitor this instead of smartd.  Basic command-line that 
will work on all modern SATA disks to perform the checks you want is:

smartctl -H /dev/sda
You'll need one call for each disk, just replace /dev/sda with each 
device.  Note that this should be the device itself, not the partitions. 
 If that command spits out a warning (or returns with an exit code 
other than 0), something's wrong and you should at least investigate 
(and possibly look at replacing the disk).  I would suggest checking 
SMART status at least daily, and potentially much more frequently. 
When the self-checks in the disk firmware start failing (which is what 
this is checking), it generally means that failure is imminent, usually 
within a couple of days at most.
2. BTRFS scrub.  If you're serious about data safety, you should be 
running a scrub on the filesystem regularly.  As a general rule, once a 
week is reasonable unless you have marginal hardware or are seriously 
paranoid.  Make sure to check the results later with the 'btrfs scrub 
status' command.  It will tell you if it found any errors, and how many 
it was able to fix.  Isolated single errors are generally not a sign of 
imminent failure, it's when they start happening regularly or you see a 
whole lot at once that you're in trouble.  Scrub will also fix most 
synchronization issues between devices in a RAID set.
3. BTRFS device stats.  BTRFS stores per-device error counters in the 
filesystem.  These track cumulative errors since the last time they were 

File system is oddly full after kernel upgrade, balance doesn't help

2017-01-27 Thread MegaBrutal
Hi,

Not sure if it is caused by the upgrade, but I only encountered this
problem after I upgraded to Ubuntu Yakkety, which comes with a 4.8
kernel.
Linux vmhost 4.8.0-34-generic #36-Ubuntu SMP Wed Dec 21 17:24:18 UTC
2016 x86_64 x86_64 x86_64 GNU/Linux

This is the 2nd file system which showed these symptoms, so I thought
it's more than happenstance. I don't remember what I did with the
first one, but I somehow managed to fix it with balance, if I remember
correctly, but it doesn't help with this one.

FS state before any attempts to fix:
Filesystem                             1M-blocks   Used Available Use% Mounted on
/dev/mapper/vmdata--vg-lxc--curlybrace      1024   1024         0 100% /tmp/mnt/curlybrace

Resized LV, run „btrfs filesystem resize max /tmp/mnt/curlybrace”:
/dev/mapper/vmdata--vg-lxc--curlybrace      2048   1303         0 100% /tmp/mnt/curlybrace

Notice how the usage magically jumped up to 1303 MB, and despite the
FS size is 2048 MB, the usage is still displayed as 100%.

Tried full balance (other options with -dusage had no result):
root@vmhost:~# btrfs balance start -v /tmp/mnt/curlybrace
Dumping filters: flags 0x7, state 0x0, force is off
  DATA (flags 0x0): balancing
  METADATA (flags 0x0): balancing
  SYSTEM (flags 0x0): balancing
WARNING:

Full balance without filters requested. This operation is very
intense and takes potentially very long. It is recommended to
use the balance filters to narrow down the balanced data.
Use 'btrfs balance start --full-balance' option to skip this
warning. The operation will start in 10 seconds.
Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting balance without any filters.
ERROR: error during balancing '/tmp/mnt/curlybrace': No space left on device
There may be more info in syslog - try dmesg | tail

No space left on device? How?

But it changed the situation:
/dev/mapper/vmdata--vg-lxc--curlybrace      2048   1302       190  88% /tmp/mnt/curlybrace

This is still not acceptable. I need to recover at least 50% free
space (since I increased the FS to the double).

A 2nd balance attempt resulted in this:
/dev/mapper/vmdata--vg-lxc--curlybrace      2048   1302       162  89% /tmp/mnt/curlybrace

So... it became slightly worse.

What's going on? How can I fix the file system to show real data?


Regards,
MegaBrutal


[PATCH] xfstests: btrfs/047: check btrfs-convert with extent and non-extent source

2017-01-27 Thread Lakshmipathi.G
This test checks a source filesystem that contains a combination of ext3
files in non-extent format and ext4 extent-based files, and validates the
file md5sums before and after conversion.

btrfs/012: BTRFS_CONVERT_PROG,E2FSCK_PROG definitions reused from common/config

Signed-off-by: Lakshmipathi.G 
---
 common/config   |   3 ++
 tests/btrfs/012 |   3 --
 tests/btrfs/047 | 122 
 tests/btrfs/047.out |   2 +
 tests/btrfs/group   |   1 +
 5 files changed, 128 insertions(+), 3 deletions(-)
 create mode 100755 tests/btrfs/047
 create mode 100644 tests/btrfs/047.out

diff --git a/common/config b/common/config
index 0706aca..fa89f42 100644
--- a/common/config
+++ b/common/config
@@ -240,11 +240,14 @@ case "$HOSTOS" in
 export DUMP_F2FS_PROG="`set_prog_path dump.f2fs`"
 export BTRFS_UTIL_PROG="`set_prog_path btrfs`"
 export BTRFS_SHOW_SUPER_PROG="`set_prog_path btrfs-show-super`"
+   export BTRFS_CONVERT_PROG="`set_prog_path btrfs-convert`"
 export XFS_FSR_PROG="`set_prog_path xfs_fsr`"
 export MKFS_NFS_PROG="false"
 export MKFS_CIFS_PROG="false"
 export MKFS_OVERLAY_PROG="false"
 export MKFS_REISER4_PROG="`set_prog_path mkfs.reiser4`"
+   export E2FSCK_PROG="`set_prog_path e2fsck`"
+   export TUNE2FS_PROG="`set_prog_path tune2fs`"
 ;;
 esac
 
diff --git a/tests/btrfs/012 b/tests/btrfs/012
index 6a3cb81..85c82f0 100755
--- a/tests/btrfs/012
+++ b/tests/btrfs/012
@@ -54,9 +54,6 @@ _supported_fs btrfs
 _supported_os Linux
 _require_scratch_nocheck
 
-BTRFS_CONVERT_PROG="`set_prog_path btrfs-convert`"
-E2FSCK_PROG="`set_prog_path e2fsck`"
-
 _require_command "$BTRFS_CONVERT_PROG" btrfs-convert
 _require_command "$MKFS_EXT4_PROG" mkfs.ext4
 _require_command "$E2FSCK_PROG" e2fsck
diff --git a/tests/btrfs/047 b/tests/btrfs/047
new file mode 100755
index 000..d349d12
--- /dev/null
+++ b/tests/btrfs/047
@@ -0,0 +1,122 @@
+#! /bin/bash
+# FS QA Test 047
+#
+# Test btrfs-convert
+#
+# 1) create ext3 filesystem & populate it.
+# 2) upgrade ext3 filesystem to ext4.
+# 3) populate data.
+# 4) source has combination of non-extent and extent files.
+# 5) convert it to btrfs, mount and verify contents.
+#---
+# Copyright (c) 2017 Lakshmipathi.G  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_nocheck
+
+_require_command "$BTRFS_CONVERT_PROG" btrfs-convert
+_require_command "$MKFS_EXT4_PROG" mkfs.ext4
+_require_command "$E2FSCK_PROG" e2fsck
+_require_command "$TUNE2FS_PROG" tune2fs
+
+rm -f $seqres.full
+
+BLOCK_SIZE=`_get_block_size $TEST_DIR`
+EXT_MD5SUM="$tmp.ext43"
+BTRFS_MD5SUM="$tmp.btrfs"
+
+_populate_data(){
+   data_path=$1
+   mkdir -p $data_path
+   args=`_scale_fsstress_args -p 20 -n 100 $FSSTRESS_AVOID -d $data_path`
+   echo "Run fsstress $args" >>$seqres.full
+   $FSSTRESS_PROG $args >/dev/null 2>&1 &
+   fsstress_pid=$!
+   wait $fsstress_pid
+}
+
+# Create & populate an ext3 filesystem
+$MKFS_EXT4_PROG -F -t ext3 -b $BLOCK_SIZE $SCRATCH_DEV > $seqres.full 2>&1 || \
+   _notrun "Could not create ext3 filesystem"
+
+# mount and populate non-extent file
+mount -t ext3 $SCRATCH_DEV $SCRATCH_MNT
+_populate_data "$SCRATCH_MNT/ext3_ext4_data/ext3"
+_scratch_unmount
+
+# Upgrade it to ext4.
+$TUNE2FS_PROG -O extents,uninit_bg,dir_index $SCRATCH_DEV >> $seqres.full 2>&1
+# After Conversion, its highly recommended to run e2fsck.
+$E2FSCK_PROG -fyD $SCRATCH_DEV >> $seqres.full 2>&1
+
+# mount and populate extent file
+mount -t ext4 $SCRATCH_DEV $SCRATCH_MNT
+_populate_data "$SCRATCH_MNT/ext3_ext4_data/ext4"
+
+# Compute md5 of ext3,ext4 files.
+find "$SCRATCH_MNT/ext3_ext4_data" 

Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-27 Thread Theodore Ts'o
On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> If this ever turn out to be a problem and with the vmapped stacks we
> have good chances to get a proper stack traces on a potential overflow
> we can add the scope API around the problematic code path with the
> explanation why it is needed.

Yeah, or maybe we can automate it?  Can the reclaim code check how
much stack space is left and do the right thing automatically?

The reason why I'm nervous is that nojournal mode is not a common
configuration, and "wait until production systems start failing" is
not a strategy that I or many SRE-types find comforting.

So if we can assure ourselves that the right thing will happen
automatically, or that lockdep will detect a required GFP_NOFS when
running tests, the happier I'll be.

- Ted


Re: raid1: cannot add disk to replace faulty because can only mount fs as read-only.

2017-01-27 Thread Hans Deragon

On 2017-01-24 14:48, Adam Borowski wrote:
> On Tue, Jan 24, 2017 at 01:57:24PM -0500, Hans Deragon wrote:
>> If I remove 'ro' from the option, I cannot get the filesystem mounted
>> because of the following error: BTRFS: missing devices(1) exceeds the
>> limit(0), writeable mount is not allowed So I am stuck. I can only
>> mount the filesystem as read-only, which prevents me to add a disk.
>
> A known problem: you get only one shot at fixing the filesystem, but
> that's not because of some damage but because the check whether the fs
> is in a shape is good enough to mount is oversimplistic.
>
> Here's a patch, if you apply it and recompile, you'll be able to mount
> degraded rw.
>
> Note that it removes a safety harness: here, the harness got tangled up
> and keeps you from recovering when it shouldn't, but it _has_ valid uses
> that.
>
> Meow!


Greetings,

Ok, that solution will solve my problem in the short run, i.e. getting 
my raid1 up again.


However, as a user, I am seeking for an easy, no maintenance raid 
solution.  I wish that if a drive fails, the btrfs filesystem still 
mounts rw and leaves the OS running, but warns the user of the failing 
disk and easily allow the addition of a new drive to reintroduce 
redundancy.  Are there any plans within the btrfs community to implement 
such a feature?  In a year from now, when the other drive will fail, 
will I hit again this problem, i.e. my OS failing to start, booting into 
a terminal, and cannot reintroduce a new drive without recompiling the 
kernel?


Best regards,
Hans Deragon



Re: [RFC][PATCH v2 4/4] vfs: wrap write f_ops with file_{start,end}_write()

2017-01-27 Thread Darrick J. Wong
[adding mfasheh & btrfs list to cc]

On Fri, Jan 27, 2017 at 06:20:12PM +0200, Amir Goldstein wrote:
> On Fri, Jan 27, 2017 at 1:50 PM, Amir Goldstein  wrote:
> > On Fri, Jan 27, 2017 at 1:09 PM, Miklos Szeredi  wrote:
> >> On Mon, Jan 23, 2017 at 8:43 PM, Amir Goldstein  wrote:
> >>> Before calling write f_ops, call file_start_write() instead
> >>> of sb_start_write().
> >>>
> >>> This ensures freeze protection for both overlay and upper fs
> >>> when file is open from an overlayfs mount.
> >>>
> >>> Replace {sb,file}_start_write() for {copy,clone}_file_range() and
> >>> for fallocate().
> >>>
> >>> For dedup_file_range() there is no need for mnt_want_write_file().
> >>> File is already open for write, so we already have mnt_want_write()
> >>> and we only need file_start_write().
> >>
> >> Being opened for write is not verified if capable(CAP_SYS_ADMIN).
> >> Ugly special case, don't ask me why it's done...
> >>
> >
> > Christoph, Darrick, is that by design?
> 
> Anyway, whether it makes sense or not, that's a legacy from
> BTRFS_IOC_FILE_EXTENT_SAME, we probably have to live with.
> 
> Michael, I reckon the man page needs updating.
> 
> I'll remove this hunk from the patch.

I /think/ that behavior (CAP_SYS_ADMIN not requiring destfd to be open
for writes in order to dedupe) was intentional; it seems to date back to
the original ioctl in 2013.  My guess of the justification is that we're
not really writing to dest, so if the admin comes along with an O_RDONLY
destfd it's ok?

 Let's see if we get any bites from the btrfs developers. :)

--D


Btrfs progs release 4.9.1

2017-01-27 Thread David Sterba
Hi,

btrfs-progs version 4.9.1 has been released.

Changes:
* check:
  * use correct inode number for lost+found files
  * lowmem mode: fix false alert on dropped leaf
* size reports: negative numbers might appear in size reports during device
  deletes (previously in EiB units)
* mkfs: print device being trimmed
* defrag: v1 ioctl support dropped
* quota: print message before starting to wait for rescan
* qgroup show: new option to sync before printing the stats
* other:
  * corrupt-block enhancements
  * backtrace and co. cleanups
  * doc fixes

Changes since rc1:
* change name of one test directory, duplicate numbers

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (18):
  btrfs-progs: make negative number pretty printing optional
  btrfs-progs: enable negative numbers for unallocated device space
  btrfs-progs: mkfs: print device name while trimming
  btrfs-progs: defrag: force using v2 defrag ioctl and make default 32M 
threshold actually work
  btrfs-progs: defrag: remove v1 ioctl support
  btrfs-progs: qgroups show: clean up errno passing
  btrfs-progs: qgroup show: refine error messages
  btrfs-progs: tests: 005-qgroup-show
  btrfs-progs: kerncompat: add separate trace print for BUG_ON
  btrfs-progs: kerncompat: pass exact condition value from ASSERT
  btrfs-progs: kerncompat: disconnect assert and warning messages
  btrfs-progs: kerncompat: simplify warning_trace
  btrfs-progs: qgroup show: do not error if sync fails
  btrfs-progs: tests: add variable quotation to cli-tests
  btrfs-progs: tests: use built binaries for 004-send-parent-multi-subvol
  btrfs-progs: mkfs/convert: separate the convert part from make_btrfs
  btrfs-progs: update CHANGES for v4.9.1
  Btrfs progs v4.9.1

Esteve Fernandez (1):
  btrfs-progs: docs: fix typo in btrfs-subvolume

Goldwyn Rodrigues (2):
  btrfs-progs: check: get the highest inode for lost+found
  btrfs-progs: sanitize - Use correct source for memcpy

Jeff Mahoney (1):
  btrfs-progs: quota: fix printing during wait mode

Lakshmipathi.G (2):
  btrfs-progs: corrupt-block: Include inode nlink field
  btrfs-progs: corrupt-block: Include more inode fields

Nicholas D Steeves (1):
  btrfs-progs: Fix spelling/typos in user-facing strings

Qu Wenruo (2):
  btrfs-progs: check: fix false alert on dropped leaf in lowmem mode
  btrfs-progs: Fix disable backtrace assert error

Tsutomu Itoh (3):
  btrfs-progs: qgroup: add sync option to 'qgroup show'
  btrfs-progs: qgroup: change the value of sort option
  btrfs-progs: tests: add test for --sync option of qgroup show

Zygo Blaxell (1):
  btrfs-progs: utils: negative numbers are more plausible than sizes over 8 EiB

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs recovery

2017-01-27 Thread Austin S. Hemmelgarn

On 2017-01-27 06:01, Oliver Freyermuth wrote:

>> I'm also running 'memtester 12G' right now, which at least tests 2/3 of the
>> memory. I'll leave that running for a day or so, but of course it will not
>> provide a clear answer...
>
> A small update: while the online memtester is without any errors still, I
> checked old syslogs from the machine and found something intriguing.
> Jan 16 10:03:11 xxx kernel: Corrupted low memory at 88009000 (9000
> phys) = 00098d39
> Jan 16 10:18:33 xxx kernel: Corrupted low memory at 88009000 (9000
> phys) = 00099795
> Jan 16 17:35:48 xxx kernel: Corrupted low memory at 88009000 (9000
> phys) = 000dd64e
> This seems to be consistently happening from time to time (I have low memory
> corruption checking compiled in).
> The numbers always consistently increase, and after a reboot, start fresh from
> a small number again.
>
> I suppose this is a BIOS bug and it's storing some counter in low memory. I am
> unsure whether this could have triggered the BTRFS corruption,
> nor do I know what to do about it (are there kernel quirks for that?).
> The vendor does not provide any updates, as usual.
>
> If someone could confirm whether this might cause corruption for btrfs (and
> maybe direct me to the correct place to ask for a kernel quirk for this device
> - do I ask on MM, or somewhere else?), that would be much appreciated.

It is a firmware bug; Linux doesn't use anything in that physical address
range at all.  I don't think it's likely that this specific bug caused
the corruption, but given that the firmware doesn't have its
allocations listed correctly in the e820 table (if they were listed
correctly, you wouldn't be seeing this message), it would not surprise
me if the firmware was involved somehow.



>>> We can probably talk you through fixing this by hand with a decent
>>> hex editor. I've done it before...


>> That would be nice! Is it fine via the mailing list?
>> Potentially, the instructions could be helpful for future reference, and
>> "real" IRC is not accessible from my current location.
>>
>> Do you have suggestions for a decent hexeditor for this job? Until now, I
>> have been mainly using emacs,
>> classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's
>> graphical!), but of course these were made for a few MiB of files and are not
>> so well suited for a block device.
>>
>> The first thing to do would then probably just be to jump to the offset where
>> 0xd89500014da12000 is written (can I get that via inspect-internal, or do I
>> have to search for it?), fix that to read
>> 0x00a800014da12000
>> (if I understood correctly) and then probably adapt a checksum?


> Additionally, I found that "btrfs restore" works on this broken FS. I will take
> an external backup of the content within the next 24 hours using that, then I
> am ready to try anything you suggest.

FWIW, the fact that btrfs restore works is a good sign: it means that
the filesystem is almost certainly repairable (even though the tools
might not be able to repair it themselves).




Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-27 Thread Michal Hocko
On Fri 27-01-17 01:13:18, Theodore Ts'o wrote:
> On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote:
> > > > I'm convinced the current series is OK, only real life will tell us
> > > > whether we missed something or not ;)
> > > 
> > > I would like to extend the changelog of "jbd2: mark the transaction
> > > context with the scope GFP_NOFS context".
> > > 
> > > "
> > > Please note that setups without journal do not suffer from potential
> > > recursion problems and so they do not need the scope protection because
> > > neither ->releasepage nor ->evict_inode (which are the only fs entry
> > > points from the direct reclaim) can reenter a locked context which is
> > > doing the allocation currently.
> > > "
> > 
> > Could you comment on this Ted, please?
> 
> I guess so... there still is one way this could screw us, and it's this
> reason for GFP_NOFS:
> 
> - to prevent from stack overflows during the reclaim because
> the allocation is performed from a deep context already
> 
> The writepages call stack can be pretty deep.  (Especially if we're
> using ext4 in no journal mode over, say, iSCSI.)
> 
> How much stack space can get consumed by a reclaim?

./scripts/stackusage with allyesconfig says:

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264  static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520  static
./mm/vmscan.c:2946      try_to_free_pages       216  static
./mm/vmscan.c:2753      do_try_to_free_pages    304  static
./mm/vmscan.c:2517      shrink_node             352  static
./mm/vmscan.c:2317      shrink_node_memcg       560  static
./mm/vmscan.c:1692      shrink_inactive_list    688  static
./mm/vmscan.c:908       shrink_page_list        608  static

So this would be 3512 bytes for the standard LRU reclaim whether we have
GFP_FS or not. shrink_page_list can recurse to releasepage but there is
no NOFS protection there so it doesn't make much sense to check this
path. So we are left with the slab shrinkers path:

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264  static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520  static
./mm/vmscan.c:2946      try_to_free_pages       216  static
./mm/vmscan.c:2753      do_try_to_free_pages    304  static
./mm/vmscan.c:2517      shrink_node             352  static
./mm/vmscan.c:427       shrink_slab             336  static
./fs/super.c:56         super_cache_scan        104  static  << here we have the NOFS protection
./fs/dcache.c:1089      prune_dcache_sb         152  static
./fs/dcache.c:939       shrink_dentry_list       96  static
./fs/dcache.c:509       __dentry_kill            72  static
./fs/dcache.c:323       dentry_unlink_inode      64  static
./fs/inode.c:1527       iput                     80  static
./fs/inode.c:532        evict                    72  static

This is where the fs specific callbacks play a role and I am not sure
which paths can pass through for ext4 in the nojournal mode and how much
of the stack this can eat. But currently we are at +536 wrt. NOFS
context. This is quite a lot but still much less (2632 vs. 3512) than
the regular reclaim. So there is quite some stack space to eat... I am
wondering whether we really have to treat nojournal mode specially
just because of the stack usage?
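
For anyone who wants to re-check the arithmetic, here is a minimal sketch that sums the per-function frame sizes from the two listings above. The numbers are copied from the stackusage output; the names `LRU_PATH`, `SLAB_PATH` and `usage_below` are made up for this illustration and are not part of any kernel tooling.

```python
# Frame sizes in bytes, copied from the ./scripts/stackusage listings above.
LRU_PATH = [
    ("__alloc_pages_nodemask", 264),
    ("__alloc_pages_slowpath", 520),
    ("try_to_free_pages", 216),
    ("do_try_to_free_pages", 304),
    ("shrink_node", 352),
    ("shrink_node_memcg", 560),
    ("shrink_inactive_list", 688),
    ("shrink_page_list", 608),
]

SLAB_PATH = [
    ("__alloc_pages_nodemask", 264),
    ("__alloc_pages_slowpath", 520),
    ("try_to_free_pages", 216),
    ("do_try_to_free_pages", 304),
    ("shrink_node", 352),
    ("shrink_slab", 336),
    ("super_cache_scan", 104),  # the NOFS check happens here
    ("prune_dcache_sb", 152),
    ("shrink_dentry_list", 96),
    ("__dentry_kill", 72),
    ("dentry_unlink_inode", 64),
    ("iput", 80),
    ("evict", 72),
]


def total(path):
    """Sum of all frame sizes along a call path."""
    return sum(size for _, size in path)


def usage_below(path, func):
    """Stack consumed by the frames called after `func` (exclusive)."""
    names = [name for name, _ in path]
    idx = names.index(func)
    return sum(size for _, size in path[idx + 1:])


lru_total = total(LRU_PATH)
slab_total = total(SLAB_PATH)
below_nofs = usage_below(SLAB_PATH, "super_cache_scan")

print(f"LRU reclaim path:      {lru_total} bytes")
print(f"slab shrinker path:    {slab_total} bytes")
print(f"below the NOFS check:  {below_nofs} bytes")
```

This only adds up static frame sizes along one fixed call chain; it says nothing about what the fs specific callbacks add on top.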

If this ever turns out to be a problem (and with the vmapped stacks we
have a good chance of getting a proper stack trace on a potential
overflow), we can add the scope API around the problematic code path with
an explanation of why it is needed.

Does that make sense to you?

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: btrfs recovery

2017-01-27 Thread Oliver Freyermuth
> I'm also running 'memtester 12G' right now, which at least tests 2/3 of the 
> memory. I'll leave that running for a day or so, but of course it will not 
> provide a clear answer... 

A small update: while the online memtester is without any errors still, I 
checked old syslogs from the machine and found something intriguing. 
Jan 16 10:03:11 xxx kernel: Corrupted low memory at 88009000 (9000 
phys) = 00098d39
Jan 16 10:18:33 xxx kernel: Corrupted low memory at 88009000 (9000 
phys) = 00099795
Jan 16 17:35:48 xxx kernel: Corrupted low memory at 88009000 (9000 
phys) = 000dd64e
This seems to be consistently happening from time to time (I have low memory 
corruption checking compiled in). 
The numbers always consistently increase, and after a reboot, start fresh from 
a small number again. 
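
As a rough check of the "always increasing" observation, the three samples quoted above can be decoded as follows. This is only a sketch: the year 2017 is assumed from the date of this thread (syslog timestamps carry no year), and the names `samples` and `parse` are made up for the illustration.

```python
from datetime import datetime

# (timestamp, value) pairs copied from the three syslog lines above.
samples = [
    ("Jan 16 10:03:11", "00098d39"),
    ("Jan 16 10:18:33", "00099795"),
    ("Jan 16 17:35:48", "000dd64e"),
]


def parse(timestamp, hex_value):
    """Turn one log sample into (datetime, int); the year is an assumption."""
    when = datetime.strptime("2017 " + timestamp, "%Y %b %d %H:%M:%S")
    return when, int(hex_value, 16)


points = [parse(ts, value) for ts, value in samples]
values = [value for _, value in points]

# True if every sample is strictly larger than the one before it.
is_increasing = all(a < b for a, b in zip(values, values[1:]))
print("strictly increasing:", is_increasing)

# Increment and elapsed time between consecutive samples.
for (t0, v0), (t1, v1) in zip(points, points[1:]):
    seconds = (t1 - t0).total_seconds()
    print(f"+{v1 - v0} in {seconds:.0f} s")
```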

I suppose this is a BIOS bug and it's storing some counter in low memory. I am 
unsure whether this could have triggered the BTRFS corruption, 
nor do I know what to do about it (are there kernel quirks for that?). 
The vendor does not provide any updates, as usual. 

If someone could confirm whether this might cause corruption for btrfs (and 
maybe direct me to the correct place to ask for a kernel quirk for this device 
- do I ask on MM, or somewhere else?), that would be much appreciated. 

>> We can probably talk you through fixing this by hand with a decent
>> hex editor. I've done it before...
>>
> That would be nice! Is it fine via the mailing list? 
> Potentially, the instructions could be helpful for future reference, and 
> "real" IRC is not accessible from my current location. 
> 
> Do you have suggestions for a decent hexeditor for this job? Until now, I 
> have been mainly using emacs, 
> classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's 
> graphical!), but of course these were made for a few MiB of files and are not 
> so well suited for a block device. 
> 
> The first thing to do would then probably just be to jump to the offset where 
> 0xd89500014da12000 is written (can I get that via inspect-internal, or do I 
> have to search for it?), fix that to read 
> 0x00a800014da12000
> (if I understood correctly) and then probably adapt a checksum? 
> 
Additionally, I found that "btrfs restore" works on this broken FS. I will take
an external backup of the content within the next 24 hours using that, then I
am ready to try anything you suggest.
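
In case searching turns out to be necessary, something along these lines might do as a first, read-only pass. It is only a sketch under a few assumptions: that the value is stored as a contiguous little-endian 64-bit integer (btrfs on-disk integers are little-endian), and that it is run against an image copy rather than the live device. The function name `find_u64_le` is made up here and is not part of btrfs-progs.

```python
import io
import struct


def find_u64_le(stream, value, chunk_size=1 << 20):
    """Yield byte offsets where `value` occurs as a little-endian u64.

    `stream` is any binary file object opened for reading; nothing is written.
    """
    needle = struct.pack("<Q", value)
    overlap = len(needle) - 1   # bytes carried over between chunks
    tail = b""
    offset = 0                  # stream offset of the current chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        base = offset - len(tail)   # stream offset of buf[0]
        pos = buf.find(needle)
        while pos != -1:
            yield base + pos
            pos = buf.find(needle, pos + 1)
        # Keep the last 7 bytes so a match spanning two chunks is still found.
        tail = buf[-overlap:]
        offset += len(chunk)


# Small in-memory demonstration with the value from this thread.
demo = b"\x00" * 5 + struct.pack("<Q", 0xd89500014da12000) + b"\x00" * 20
for hit in find_u64_le(io.BytesIO(demo), 0xd89500014da12000, chunk_size=4):
    print(hex(hit))
```

On a real image one would pass a file opened with `open(path, "rb")` instead of the `io.BytesIO` demo. The value may well show up at more than one offset, or not at all if it is not stored in this form, so any hit still has to be checked against the surrounding structure before anything is changed.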

Cheers and thanks!
Oliver