Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'

2016-04-08 Thread Chris Murphy
For raid5 it's different. No single chunks are created while copying
files to a degraded volume.

And the scrub produces very noisy kernel messages. Looks like there's
a message for each missing block (or stripe?), thousands per file. And
also many uncorrectable errors like this:

[267466.792060] f23s.localdomain kernel: BTRFS error (device dm-8):
unable to fixup (regular) error at logical 3760582656 on dev /dev/dm-7
[267467.508588] f23s.localdomain kernel: scrub_handle_errored_block:
401 callbacks suppressed

[root@f23s ~]# btrfs scrub start /mnt/1/
ERROR: there are uncorrectable errors

[root@f23s ~]# btrfs scrub status /mnt/1/
scrub status for 51e1efb0-7df3-44d5-8716-9ed4bdadc93e
scrub started at Fri Apr  8 14:35:25 2016 and finished after 00:11:26
total bytes scrubbed: 3.21GiB with 45186 errors
error details: read=95 super=2 verify=8 csum=45081
corrected errors: 44935, uncorrectable errors: 249, unverified errors: 0

Subsequent balance and scrub have no messages at all. So...
uncorrectable? Really? That's confusing.

FYI, a scrub with no errors takes 4m24s, but with the same data, a scrub
where half of it needed to be rebuilt took 16m4s, so reconstruction is
about 4x slower. Seems excessive.

Chris Murphy


Re: [PATCH 10/13] btrfs: introduce helper functions to perform hot replace

2016-04-08 Thread Yauhen Kharuzhy
On Sat, Apr 02, 2016 at 09:30:48AM +0800, Anand Jain wrote:
> Hot replace / auto replace is an important volume manager feature
> and is critical to data center operations, so that the degraded
> volume can be brought back to a healthy state as early as possible
> and without manual intervention.
> 
> This modifies the existing replace code to suit the needs of auto
> replace; in the long run I hope the two code paths can be merged.
> 
> Signed-off-by: Anand Jain 
> Tested-by: Austin S. Hemmelgarn 
> ---
>  fs/btrfs/dev-replace.c | 43 +++
>  fs/btrfs/dev-replace.h |  1 +
>  2 files changed, 44 insertions(+)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 2b926867d136..ceab4c51db32 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -957,3 +957,46 @@ void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
> 				     &fs_info->fs_state));
> 	}
>  }
> +
> +int btrfs_auto_replace_start(struct btrfs_root *root,
> +				struct btrfs_device *src_device)
> +{
> +	int ret;
> +	char *tgt_path;
> +	char *src_path;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +
> +	if (fs_info->sb->s_flags & MS_RDONLY)
> +		return -EROFS;
> +
> +	btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
> +	if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
> +		btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
> +		return -EBUSY;
> +	}
> +	btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
> +
> +	if (btrfs_get_spare_device(&tgt_path)) {
> +		btrfs_err(root->fs_info,
> +			"No spare device found/configured in the kernel");
> +		return -EINVAL;
> +	}
> +
> +	rcu_read_lock();
> +	src_path = kstrdup(rcu_str_deref(src_device->name), GFP_ATOMIC);
> +	rcu_read_unlock();
> +	if (!src_path) {
> +		kfree(tgt_path);
> +		return -ENOMEM;
> +	}
> +	ret = btrfs_dev_replace_start(root, tgt_path,
> +			src_device->devid, src_path,
> +			BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
> +	if (ret)
> +		btrfs_put_spare_device(tgt_path);
> +
> +	kfree(tgt_path);
> +	kfree(src_path);
> +
> +	return 0;
> +}

Without the fs_info->mutually_exclusive_operation_running flag set in
btrfs_auto_replace_start(), device add/remove/balance etc. can be
started in parallel with auto-replace. Should these scenarios be permitted?
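
For context, a minimal caller-side sketch of how the exclusion the device
add/remove/balance/replace ioctls already rely on could wrap the auto-replace
path (placement and the -EBUSY return are my assumption, not part of the
patch):

	int ret;

	/* Sketch only: reuse mutually_exclusive_operation_running so auto
	 * replace cannot run concurrently with the other exclusive device
	 * operations. */
	if (atomic_xchg(&fs_info->mutually_exclusive_operation_running, 1))
		return -EBUSY;

	ret = btrfs_auto_replace_start(root, src_device);

	atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
	return ret;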

> diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
> index e922b42d91df..b918b9d6e5df 100644
> --- a/fs/btrfs/dev-replace.h
> +++ b/fs/btrfs/dev-replace.h
> @@ -46,4 +46,5 @@ static inline void btrfs_dev_replace_stats_inc(atomic64_t *stat_value)
>  {
>  	atomic64_inc(stat_value);
>  }
> +int btrfs_auto_replace_start(struct btrfs_root *root, struct btrfs_device *src_device);
>  #endif
> -- 
> 2.7.0

-- 
Yauhen Kharuzhy


[GIT PULL] Btrfs

2016-04-08 Thread Chris Mason
Hi Linus

We have some fixes queued up in my for-linus-4.6 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.6

These are bug fixes, including a really old fsync bug, and a few
trace points to help us track down problems in the quota code.

Mark Fasheh (2) commits (+129/-23):
btrfs: handle non-fatal errors in btrfs_qgroup_inherit() (+32/-22)
btrfs: Add qgroup tracing (+97/-1)

Filipe Manana (1) commits (+137/-0):
Btrfs: fix file/data loss caused by fsync after rename and new inode

Liu Bo (1) commits (+1/-0):
Btrfs: fix invalid reference in replace_path

Yauhen Kharuzhy (1) commits (+2/-0):
btrfs: Reset IO error counters before start of device replacing

Qu Wenruo (1) commits (+19/-2):
btrfs: Output more info for enospc_debug mount option

Davide Italiano (1) commits (+6/-3):
Btrfs: Improve FL_KEEP_SIZE handling in fallocate

Josef Bacik (1) commits (+1/-1):
Btrfs: don't use src fd for printk

David Sterba (1) commits (+8/-4):
btrfs: fallback to vmalloc in btrfs_compare_tree

Total: (9) commits (+303/-33)

 fs/btrfs/ctree.c |  12 ++--
 fs/btrfs/dev-replace.c   |   2 +
 fs/btrfs/extent-tree.c   |  21 ++-
 fs/btrfs/file.c  |   9 ++-
 fs/btrfs/ioctl.c |   2 +-
 fs/btrfs/qgroup.c|  63 +---
 fs/btrfs/relocation.c|   1 +
 fs/btrfs/tree-log.c  | 137 +++
 include/trace/events/btrfs.h |  89 +++-
 9 files changed, 303 insertions(+), 33 deletions(-)


Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'

2016-04-08 Thread Chris Murphy
On Fri, Apr 8, 2016 at 1:27 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-04-08 14:30, Chris Murphy wrote:
>>
>> On Fri, Apr 8, 2016 at 12:18 PM, Austin S. Hemmelgarn
>>  wrote:
>>>
>>> On 2016-04-08 14:05, Chris Murphy wrote:


 On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
  wrote:

> I entirely agree.  If the fix doesn't require any kind of decision to
> be
> made other than whether to fix it or not, it should be trivially
> fixable
> with the tools.  TBH though, this particular issue with devices
> disappearing
> and reappearing could be fixed easier in the block layer (at least,
> there
> are things that need to be fixed WRT it in the block layer).



 Another feature needed for transient failures with large storage, is
 some kind of partial scrub, along the lines of md partial resync when
 there's a bitmap write intent log.

>>> In this case, I would think the simplest way to do this would be to have
>>> scrub check if generation matches and not further verify anything that
>>> does
>>> (I think we might be able to prune anything below objects whose
>>> generation
>>> matches, but I'm not 100% certain about how writes cascade up the trees).
>>> I
>>> hadn't really thought about this before, but now that I do, it kind of
>>> surprises me that we don't have something to do this.
>>>
>>
>> And I need to better qualify this: this scrub (or balance) needs to be
>> initiated automatically, perhaps with some reasonable delay after the
>> block layer informs Btrfs that the missing device has reappeared. Both
>> the requirement of a full scrub and the fact that it is a manual scrub
>> are pretty big gotchas.
>>
> We would still ideally want some way to initiate it manually because:
> 1. It would make it easier to test.
> 2. We should have a way to do it on filesystems that have been reassembled
> after a reboot, not just ones that got the device back in the same boot (or
> it was missing on boot and then appeared).

I'm OK with a mount option, 'autoraidfixup' (not a proposed name!),
that permits the mechanism to happen, but which isn't yet the default.
However, one day I think it should be, because right now we already
allow mounts of devices with different generations and there is no
message indicating this at all, even though the superblocks clearly
show a discrepancy in generation.

mount with one device missing

[264466.609093] BTRFS: has skinny extents
[264912.547199] BTRFS info (device dm-6): disk space caching is enabled
[264912.547267] BTRFS: has skinny extents
[264912.606266] BTRFS: failed to read chunk tree on dm-6
[264912.621829] BTRFS: open_ctree failed

mount -o degraded

[264953.758518] BTRFS info (device dm-6): allowing degraded mounts
[264953.758794] BTRFS info (device dm-6): disk space caching is enabled
[264953.759055] BTRFS: has skinny extents

copy 800MB file
umount
lvchange -ay
mount

[265082.859201] BTRFS info (device dm-6): disk space caching is enabled
[265082.859474] BTRFS: has skinny extents

btrfs scrub start

[265260.024267] BTRFS error (device dm-6): bdev /dev/dm-7 errs: wr 0,
rd 0, flush 0, corrupt 0, gen 1

# btrfs scrub status /mnt/1
scrub status for b01b3922-4012-4de1-af42-63f5b2f68fc3
scrub started at Fri Apr  8 14:01:41 2016 and finished after 00:00:18
total bytes scrubbed: 1.70GiB with 1 errors
error details: super=1
corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

After scrubbing and fixing everything and zeroing out the counters, if
I fail the device again, I can no longer mount degraded:

[265502.432444] BTRFS: missing devices(1) exceeds the limit(0),
writeable mount is not allowed

because of this nonsense:

[root@f23s ~]# btrfs fi df /mnt/1
Data, RAID1: total=1.00GiB, used=458.06MiB
Data, single: total=1.00GiB, used=824.00MiB
System, RAID1: total=64.00MiB, used=16.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, RAID1: total=2.00GiB, used=576.00KiB
Metadata, single: total=256.00MiB, used=912.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

a.) the device I'm mounting degraded contains the single chunks; it's
not as if the single chunks are actually missing.
b.) the manual scrub only fixed the supers; it did not replicate the
newly copied data, since that was placed in new single chunks rather
than in the existing raid1 chunks.
c.) this requires a manual balance with the convert=raid1,soft filters
to actually get everything back to raid1.

Very non-obvious.

-- 
Chris Murphy


Re: 4.4.0 - no space left with >1.7 TB free space left

2016-04-08 Thread Duncan
Roman Mamedov posted on Fri, 08 Apr 2016 16:53:32 +0500 as excerpted:

> It's not in 4.4.6 either. I don't know why it doesn't get included, or
> what we need to do. Last time I asked, it was queued:
> http://www.spinics.net/lists/linux-btrfs/msg52478.html But maybe that
> meant 4.5 or 4.6 only? While the bug is affecting people on 4.4.x today.

Patches must make it to the current development kernel before they're
eligible for stable.  Additionally, they need to be cced to stable as
well, in order to be queued there.

So check 4.5 and 4.6-rc.  If it's in neither of those, it's not going to 
be in stable yet.  Once it's in the development kernel, see if it was cced 
to stable and if needed, ask the author and btrfs devs to cc it to stable.

Tho sometimes stable can get a backlog as well.  I know earlier this year 
they were dealing with one, but I follow release or development, not 
stable, and don't know what stable's current status is.

If it gets to stable, and it wasn't for a bug introduced /after/ 4.4, it 
should eventually get into 4.4, as that's an LTS kernel.  But it might 
take awhile, as the above discussion hints.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Missing device handling (was: 'unable to mount btrfs pool...')

2016-04-08 Thread Yauhen Kharuzhy
On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
> On 2016-04-08 12:17, Chris Murphy wrote:
> 
> I would personally suggest adding a per-filesystem node in sysfs to handle
> both 2 and 5. Having it open tells BTRFS to not automatically attempt
> countermeasures when degraded, select/epoll on it will return when state
> changes, reads will return (at minimum): what devices comprise the FS, per
> disk state (is it working, failed, missing, a hot-spare, etc), and what
> effective redundancy we have (how many devices we can lose and still be
> mountable, so 1 for raid1, raid10, and raid5, 2 for raid6, and 0 for
> raid0/single/dup, possibly higher for n-way replication (n-1), n-order
> parity (n), or erasure coding). This would make it trivial to write a daemon
> to monitor the filesystem, react when something happens, and handle all the
> policy decisions.

Hm, good proposal. Personally I tried to use uevents for this, but they
caused locking troubles, and I didn't continue that attempt.

In any case we need an interface for btrfs-progs to pass FS state
information (presence and IDs of missing devices, for example, and the
degraded/good state of the RAID etc.).

For testing, as a first attempt, I implemented the following interface. It
still doesn't look good to me, but it is acceptable as a starting point. In
addition to this, I changed the missing device name reported by
btrfs_ioctl_dev_info() to 'missing', to avoid interference with block
devices inserted after the failed device is closed (adding a 'missing'
field to struct btrfs_ioctl_dev_info_args may be the more correct way).
What is your opinion?

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d9b147f..f9a2fa6 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2716,12 +2716,17 @@ static long btrfs_ioctl_fs_info(struct btrfs_root *root, void __user *arg)
 
 	mutex_lock(&fs_devices->device_list_mutex);
 	fi_args->num_devices = fs_devices->num_devices;
+	fi_args->missing_devices = fs_devices->missing_devices;
+	fi_args->open_devices = fs_devices->open_devices;
+	fi_args->rw_devices = fs_devices->rw_devices;
+	fi_args->total_devices = fs_devices->total_devices;
 	memcpy(&fi_args->fsid, root->fs_info->fsid, sizeof(fi_args->fsid));
 
 	list_for_each_entry(device, &fs_devices->devices, dev_list) {
 		if (device->devid > fi_args->max_id)
 			fi_args->max_id = device->devid;
 	}
+	fi_args->state = root->fs_info->fs_state;
 	mutex_unlock(&fs_devices->device_list_mutex);
 
 	fi_args->nodesize = root->fs_info->super_copy->nodesize;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dea8931..6808bf2 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -186,8 +186,12 @@ struct btrfs_ioctl_fs_info_args {
 	__u32 nodesize;			/* out */
 	__u32 sectorsize;		/* out */
 	__u32 clone_alignment;		/* out */
-	__u32 reserved32;
-	__u64 reserved[122];		/* pad to 1k */
+	__u32 state;			/* out */
+	__u64 missing_devices;		/* out */
+	__u64 open_devices;		/* out */
+	__u64 rw_devices;		/* out */
+	__u64 total_devices;		/* out */
+	__u64 reserved[118];		/* pad to 1k */
 };
 
 struct btrfs_ioctl_feature_flags {

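For illustration only (not part of the patch), a minimal userspace sketch of
how btrfs-progs could consume the proposed fields through the existing
BTRFS_IOC_FS_INFO ioctl. It assumes a uapi header that already carries the
new members from the diff above; the field names come from the patch,
everything else here is an assumption:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>	/* patched header with the new fields */

int main(int argc, char **argv)
{
	struct btrfs_ioctl_fs_info_args fi = { 0 };
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);	/* any path on the mounted fs */
	if (fd < 0 || ioctl(fd, BTRFS_IOC_FS_INFO, &fi) < 0) {
		perror("BTRFS_IOC_FS_INFO");
		return 1;
	}
	printf("devices: %llu total, %llu missing, %llu rw, state 0x%x\n",
	       (unsigned long long)fi.total_devices,
	       (unsigned long long)fi.missing_devices,
	       (unsigned long long)fi.rw_devices, fi.state);
	close(fd);
	return 0;
}

A monitoring tool could poll this (or the sysfs node discussed above) and
flag the filesystem as degraded as soon as missing_devices becomes non-zero.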



-- 
Yauhen Kharuzhy


Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'

2016-04-08 Thread Austin S. Hemmelgarn

On 2016-04-08 14:30, Chris Murphy wrote:

On Fri, Apr 8, 2016 at 12:18 PM, Austin S. Hemmelgarn
 wrote:

On 2016-04-08 14:05, Chris Murphy wrote:


On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
 wrote:


I entirely agree.  If the fix doesn't require any kind of decision to be
made other than whether to fix it or not, it should be trivially fixable
with the tools.  TBH though, this particular issue with devices
disappearing
and reappearing could be fixed easier in the block layer (at least, there
are things that need to be fixed WRT it in the block layer).



Another feature needed for transient failures with large storage, is
some kind of partial scrub, along the lines of md partial resync when
there's a bitmap write intent log.


In this case, I would think the simplest way to do this would be to have
scrub check if generation matches and not further verify anything that does
(I think we might be able to prune anything below objects whose generation
matches, but I'm not 100% certain about how writes cascade up the trees).  I
hadn't really thought about this before, but now that I do, it kind of
surprises me that we don't have something to do this.



And I need to better qualify this: this scrub (or balance) needs to be
initiated automatically, perhaps with some reasonable delay after the
block layer informs Btrfs that the missing device has reappeared. Both
the requirement of a full scrub and the fact that it is a manual scrub
are pretty big gotchas.


We would still ideally want some way to initiate it manually because:
1. It would make it easier to test.
2. We should have a way to do it on filesystems that have been 
reassembled after a reboot, not just ones that got the device back in 
the same boot (or it was missing on boot and then appeared).




Missing device handling (was: 'unable to mount btrfs pool...')

2016-04-08 Thread Austin S. Hemmelgarn

On 2016-04-08 12:17, Chris Murphy wrote:

On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
 wrote:


I entirely agree.  If the fix doesn't require any kind of decision to be
made other than whether to fix it or not, it should be trivially fixable
with the tools.  TBH though, this particular issue with devices disappearing
and reappearing could be fixed easier in the block layer (at least, there
are things that need to be fixed WRT it in the block layer).


Right. The block layer needs a way to communicate device missing to
Btrfs and Btrfs needs to have some tolerance for transience.


Being notified when a device disappears _shouldn't_ be that hard. A 
uevent gets sent already, and we should be able to associate some kind 
of callback with that happening for devices we have mounted. The bigger 
issue is going to be handling the devices _reappearing_ (if we still 
hold a reference to the device, it appears under a different 
name/major/minor, and if it's more than one device and we have no 
references, they may appear in a different order than they were 
originally), and there is where we really need to fix things. A device 
disappearing forever is bad and all, but a device losing connection and 
reconnecting completely ruining the FS is exponentially worse.


Overall, to provide true reliability here, we need:
1. Some way for userspace to disable writeback caching per-device (this 
is needed for other reasons as well, but those are orthogonal to this 
discussion). This then needs to be used on all removable devices by 
default (Windows and OS X do this, it's part of why small transfers 
appear to complete faster on Linux, and then the disk takes _forever_ to 
unmount). This would reduce the possibility of data loss when a device 
disappears.
2. A way for userspace to be notified (instead of having to poll) of 
state changes in BTRFS. Currently, the only ways for userspace to know 
something is wrong are either parsing dmesg or polling the filesystem 
flags (and based both personal experience, and statements I've seen here 
and elsewhere, polling the FS flags is not reliable for this). Most 
normal installations are going to want to trigger handlers for specific 
state changes (be it e-mail to an admin, or some other notification 
method, or even doing some kind of maintenance on the FS automatically), 
and we need some kind of notification if we want to give userspace the 
ability to properly manage things.
3. A way to tell that a device is gone _when it happens_, not when we 
try to write to it next, not when a write fails, but the moment the 
block layer knows it's not there, we need to know as well. This is a 
prerequisite for the next two items. Sadly, we're probably the only 
thing that would directly benefit from this (LVM uses uevents and 
monitoring daemons to handle this, we don't exactly have that luxury), 
which means it may be hard to get something like this merged.
4. Transparent handling of short, transient loss of a device. This goes 
together to a certain extent with 1, if something disappears for long 
enough that the kernel notices, but it reappears before we have any I/O 
to do on it again, we shouldn't lose our lunch unless userspace tells us 
to (because we told userspace that it's gone due to item 2). In theory, 
we should be able to cache a small number of internal pending writes for 
when it reappears (so for example, if a transaction is being committed, 
and the USB disk disappears for a second, we should be able to pick up 
where we left off (after verifying the last write we sent)). We should 
also have an automatic re-sync if it's a short enough period it's gone 
for. The max timeout here should probably be configurable, but probably 
could just be one tunable for the whole system.
5. Give userspace the option to handle degraded states how it wants to, 
and keep our default of remount RO when degraded when userspace doesn't 
want to handle it itself. This needs to be configured at run-time (not 
stored on the media), and it needs to be per-filesystem, otherwise we 
open up all kinds of other issues. This is a core concept in LVM and 
many other storage management systems; namely, userspace can choose to 
handle a degraded RAID array however the hell it wants, and we'll 
provide a couple of sane default handlers for the common cases.


I would personally suggest adding a per-filesystem node in sysfs to 
handle both 2 and 5. Having it open tells BTRFS to not automatically 
attempt countermeasures when degraded, select/epoll on it will return 
when state changes, reads will return (at minimum): what devices 
comprise the FS, per disk state (is it working, failed, missing, a 
hot-spare, etc), and what effective redundancy we have (how many devices 
we can lose and still be mountable, so 1 for raid1, raid10, and raid5, 2 
for raid6, and 0 for raid0/single/dup, possibly higher for n-way 
replication (n-1), n-order parity (n), or erasure coding). This would 

FROM: MR. OLIVER SENO!!

2016-04-08 Thread AKINWUMI
Dear Sir.

I bring you greetings. My name is Mr.Oliver Seno Lim, I am a staff of Abbey 
National Plc. London and heading our regional office in West Africa. Our late 
customer named Engr.Ben W.westland, made a fixed deposit amount of 
US$7Million.He did not declare any next of kin in any of his paper work, I want 
you as a foreigner to stand as the beneficiary to transfer this funds out of my 
bank into your account, after the successful transfer, we shall share in the 
ratio of 30% for you, 70%for me. Should you be interested please send me your 
information:

1,Full names.
2,current residential address.
3,Tele/Fax numbers./your work.
 
   
All I need from you is your readiness, trustworthiness and edication. Please 
email me directly on my private email address: officeose...@yahoo.com) so we 
can begin arrangements and I would give you more information on how we would 
handle this venture and once i hear from you i will give you information of the 
bank for the transferring funds on your name.

Regards,
Mr.Oliver Seno Lim 


Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume

2016-04-08 Thread Mark Fasheh
On Fri, Apr 08, 2016 at 03:10:35PM +0200, Holger Hoffstätte wrote:
> [cc: Mark and Qu]
> 
> On 04/08/16 13:51, Holger Hoffstätte wrote:
> > On 04/08/16 13:14, Filipe Manana wrote:
> >> Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs
> >> patches, it didn't reproduce here:
> > 
> > Great, that's good to know (sort of :). Thanks also to Liu Bo.
> > 
> >> Are you sure that you are not using some patches not in 4.6?
> 
> We have a bingo!
> 
> Reverting "qgroup: Fix qgroup accounting when creating snapshot"
> from last Wednesday immediately fixes the problem.

Not surprising, I had some issues testing it out too. I'm pretty sure this
patch is corrupting memory; I just haven't found where yet, though my
educated guess is that the transaction is being reused improperly.
--Mark

--
Mark Fasheh


Re: Volume stuck after Checking UUID tree

2016-04-08 Thread Johnathan Falk
johnathan falk  gmail.com> writes:

> 
> The drive mounts perfectly fine when you mount RO, but when you mount
> it rw it gives this (and eventually locks up the system as I can't
> restart it cleanly):
> 
> kernel: BTRFS info (device sdb1): disk space caching is enabled
> Apr 06 17:09:16 kernel: BTRFS error (device sdb1): qgroup generation
> mismatch, marked as inconsistent
> Apr 06 17:09:17 kernel: BTRFS: checking UUID tree
> Apr 06 17:12:30 kernel: INFO: task btrfs-transacti:5701 blocked for
> more than 120 seconds.
> 
> Bug reported on launch pad:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1555828
> 
> When I mount the disk in journalctl -f:
> http://pastebin.com/1p6XppRb
> 
> Went through Marc Merlin's suggestions to fix my drive and the btrfsck
> -S1/2/3 never finishes and gets stuck on checking quota groups. I also
> created a btrfs-image first thing.
> 
> Any suggestions would be welcome as I don't know what to do.
> 

Paste of journalctl:
kernel: Call Trace:
Apr 02 18:55:03 theark kernel:  [] schedule+0x35/0x80
Apr 02 18:55:03 theark kernel:  [] btrfs_commit_transaction+0x382/0xa90 [btrfs]
Apr 02 18:55:03 theark kernel:  [] ? wake_atomic_t_function+0x60/0x60
Apr 02 18:55:03 theark kernel:  [] transaction_kthread+0x229/0x240 [btrfs]
Apr 02 18:55:03 theark kernel:  [] ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
Apr 02 18:55:03 theark kernel:  [] kthread+0xd8/0xf0
Apr 02 18:55:03 theark kernel:  [] ret_from_fork+0x22/0x40
Apr 02 18:55:03 theark kernel:  [] ? kthread_create_on_node+0x1a0/0x1a0
Apr 02 18:55:03 theark kernel: INFO: task umount:4672 blocked for more than 120 seconds.
Apr 02 18:55:03 theark kernel:       Not tainted 4.6.0-040600rc1-generic #201603261930
Apr 02 18:55:03 theark kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 02 18:55:03 theark kernel: umount  D 88008e323cd8 0  4672   3434 0x
Apr 02 18:55:03 theark kernel:  88008e323cd8  81e0d540 880343711c80
Apr 02 18:55:03 theark kernel:  88008e324000 88040b4579f0 88040b457800 88040b4579f0
Apr 02 18:55:03 theark kernel:   88008e323cf0 81835e45 8800bbd561b0
Apr 02 18:55:03 theark kernel: Call Trace:
Apr 02 18:55:03 theark kernel:  [] schedule+0x35/0x80
Apr 02 18:55:03 theark kernel:  [] wait_current_trans.isra.22+0xd3/0x120 [btrfs]
Apr 02 18:55:03 theark kernel:  [] ? wake_atomic_t_function+0x60/0x60
Apr 02 18:55:03 theark kernel:  [] start_transaction+0x27b/0x4c0 [btrfs]
Apr 02 18:55:03 theark kernel:  [] btrfs_attach_transaction_barrier+0x1d/0x50 [btrfs]
Apr 02 18:55:03 theark kernel:  [] btrfs_sync_fs+0x42/0x110 [btrfs]
Apr 02 18:55:03 theark kernel:  [] sync_filesystem+0x71/0xa0
Apr 02 18:55:03 theark kernel:  [] generic_shutdown_super+0x27/0x100
Apr 02 18:55:03 theark kernel:  [] kill_anon_super+0x12/0x20
Apr 02 18:55:03 theark kernel:  [] btrfs_kill_super+0x18/0x110 [btrfs]
Apr 02 18:55:03 theark kernel:  [] deactivate_locked_super+0x43/0x70
Apr 02 18:55:03 theark kernel:  [] deactivate_super+0x5c/0x60
Apr 02 18:55:03 theark kernel:  [] cleanup_mnt+0x3f/0x90
Apr 02 18:55:03 theark kernel:  [] __cleanup_mnt+0x12/0x20
Apr 02 18:55:03 theark kernel:  [] task_work_run+0x73/0x90
Apr 02 18:55:03 theark kernel:  [] exit_to_usermode_loop+0xc2/0xd0
Apr 02 18:55:03 theark kernel:  [] syscall_return_slowpath+0x4e/0x60
Apr 02 18:55:03 theark kernel:  [] entry_SYSCALL_64_fastpath+0xa6/0xa8


Currently a btrfsck using btrfs-progs 4.5.1 has run for 3 days now and is
currently using:

pri  ni  virt   res     shr  s  cpu%  mem%  time      command
20   0   22.1G  14.36G  188  D  1.9   95.5  47:16.21  btrfs check -s 2 /dev/sdb




Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'

2016-04-08 Thread Chris Murphy
On Fri, Apr 8, 2016 at 12:18 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-04-08 14:05, Chris Murphy wrote:
>>
>> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
>>  wrote:
>>
>>> I entirely agree.  If the fix doesn't require any kind of decision to be
>>> made other than whether to fix it or not, it should be trivially fixable
>>> with the tools.  TBH though, this particular issue with devices
>>> disappearing
>>> and reappearing could be fixed easier in the block layer (at least, there
>>> are things that need to be fixed WRT it in the block layer).
>>
>>
>> Another feature needed for transient failures with large storage, is
>> some kind of partial scrub, along the lines of md partial resync when
>> there's a bitmap write intent log.
>>
> In this case, I would think the simplest way to do this would be to have
> scrub check if generation matches and not further verify anything that does
> (I think we might be able to prune anything below objects whose generation
> matches, but I'm not 100% certain about how writes cascade up the trees).  I
> hadn't really thought about this before, but now that I do, it kind of
> surprises me that we don't have something to do this.
>


And I need to better qualify this: this scrub (or balance) needs to be
initiated automatically, perhaps with some reasonable delay after the
block layer informs Btrfs that the missing device has reappeared. Both
the requirement of a full scrub and the fact that it is a manual scrub
are pretty big gotchas.


-- 
Chris Murphy


Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'

2016-04-08 Thread Austin S. Hemmelgarn

On 2016-04-08 14:05, Chris Murphy wrote:

On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
 wrote:


I entirely agree.  If the fix doesn't require any kind of decision to be
made other than whether to fix it or not, it should be trivially fixable
with the tools.  TBH though, this particular issue with devices disappearing
and reappearing could be fixed easier in the block layer (at least, there
are things that need to be fixed WRT it in the block layer).


Another feature needed for transient failures with large storage, is
some kind of partial scrub, along the lines of md partial resync when
there's a bitmap write intent log.

In this case, I would think the simplest way to do this would be to have 
scrub check if generation matches and not further verify anything that 
does (I think we might be able to prune anything below objects whose 
generation matches, but I'm not 100% certain about how writes cascade up 
the trees).  I hadn't really thought about this before, but now that I 
do, it kind of surprises me that we don't have something to do this.
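
To sketch what that pruning might look like (kernel-flavoured pseudocode,
not existing code; resync_tree_block() and last_good_gen are made up for
illustration, and it leans on the assumption above that a parent's
generation is always >= its children's because of CoW):

static int resync_tree_block(struct btrfs_fs_info *fs_info,
			     struct extent_buffer *eb, u64 last_good_gen)
{
	int i;

	/* Everything at or below this node was already on the device
	 * before it went away, so there is nothing to re-verify. */
	if (btrfs_header_generation(eb) <= last_good_gen)
		return 0;

	for (i = 0; i < btrfs_header_nritems(eb); i++) {
		/* read/repair the child block this slot points to,
		 * then recurse with the same last_good_gen cutoff */
	}
	return 0;
}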




Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'

2016-04-08 Thread Chris Murphy
On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
 wrote:

> I entirely agree.  If the fix doesn't require any kind of decision to be
> made other than whether to fix it or not, it should be trivially fixable
> with the tools.  TBH though, this particular issue with devices disappearing
> and reappearing could be fixed easier in the block layer (at least, there
> are things that need to be fixed WRT it in the block layer).

Another feature needed for transient failures with large storage, is
some kind of partial scrub, along the lines of md partial resync when
there's a bitmap write intent log.


-- 
Chris Murphy


Re: btrfs send/receive using generation number as source

2016-04-08 Thread Chris Murphy
On Fri, Apr 8, 2016 at 5:01 AM, Martin Steigerwald <mar...@lichtvoll.de> wrote:
> Hello!
>
> As far as I understood, for differential btrfs send/receive – I didn´t use it
> yet – I need to keep a snapshot on the source device to then tell btrfs send
> to send the differences between the snapshot and the current state.
>
> Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep
> any snapshots except for one during rsync or borgbackup script run-time.
>
> Is it possible to tell btrfs send to use generation number xyz to calculate
> the difference? This way, I wouldn´t have to keep a snapshot around, I
> believe.
>
> I bet not, at the time cause -c wants a snapshot. Ah and it wants a snapshot
> of the same state on the destination as well. Well on the destination I let
> the script make a snapshot after the backup so… what I would need is to
> remember the generation number of the source snapshot that the script creates
> to backup from and then tell btrfs send that generation number + the
> destination snapshots.
>
> Well, or get larger SSDs or get rid of some data on them.

Well if you can't even keep one ro snapshot around, it suggests you
need more space. Otherwise the minimal strategy is:


Yesterday's source has subvols:
root.current
root.20160406

So you'd do

btrfs sub snap -r root.current root.20160407
btrfs send -p root.20160406 root.20160407  | btrfs receive xxx
btrfs sub del root.20160406

Today it's

btrfs sub snap -r root.current root.20160408
btrfs send -p root.20160407 root.20160408  | btrfs receive xxx
btrfs sub del root.20160407


Tomorrow:

btrfs sub snap -r root.current root.20160409
btrfs send -p root.20160408 root.20160409  | btrfs receive xxx
btrfs sub del root.20160408


Locally you always have one snapshot to rollback to or make selective
reflink copies of files from.


-- 
Chris Murphy


Re: btrfs send/receive using generation number as source

2016-04-08 Thread Henk Slager
On Fri, Apr 8, 2016 at 1:01 PM, Martin Steigerwald  wrote:
> Hello!
>
> As far as I understood, for differential btrfs send/receive – I didn´t use it
> yet – I need to keep a snapshot on the source device to then tell btrfs send
> to send the differences between the snapshot and the current state.

During the incremental send operation you need 2 ro snapshots
available (a parent and a current snapshot) and after that, you just
need to keep the current one and promote that to parent snapshot and
keep it around until the next incremental send. So indeed that locks
space and you might run out of free space if there is a long time
before the next incremental send|receive and changes in the filesystem
are large in volume.

Alternatively, you could do non-incremental send, if the fs is
relatively small and you have some method to dedupe on the receiving
filesystem. But the rsync method is by far preferred in this case I
would say.

> Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep
> any snapshots except for one during rsync or borgbackup script run-time.
>
> Is it possible to tell btrfs send to use generation number xyz to calculate
> the difference? This way, I wouldn´t have to keep a snapshot around, I
> believe.
>
> I bet not, at the time cause -c wants a snapshot. Ah and it wants a snapshot
> of the same state on the destination as well. Well on the destination I let
> the script make a snapshot after the backup so… what I would need is to

You can use -p for incremental send and you can also send back (new)
increments from backup to master.

> remember the generation number of the source snapshot that the script creates
> to backup from and then tell btrfs send that generation number + the
> destination snapshots.
>
> Well, or get larger SSDs or get rid of some data on them.

I switched from ext4 to a btrfs rootfs on an old netbook which has only
4G of soldered flash and no option for extension (except via USB/SD card,
which turned out not to be reliable enough over a longer period of time).
Basically the compress=lzo mount option extends the lifetime of this
netbook while still using a modern full-sized Linux distro. But I
guess you have already compressed/compacted what is possible.


Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'

2016-04-08 Thread Chris Murphy
On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
 wrote:

>> I can see this being happening automatically with up to 2 device
>> failures, so that all subsequent writes are fully intact stripe
>> writes. But the instant there's a 3rd device failure, there's a rather
>> large hole in the file system that can't be reconstructed. It's an
>> invalid file system. I'm not sure what can be gained by allowing
>> writes to continue, other than tying off loose ends (so to speak) with
>> full stripe metadata writes for the purpose of making recovery
>> possible and easier, but after that metadata is written - poof, go
>> read only.
>
> I don't mean writing partial stripes, I mean writing full stripes with a
> reduced width (so in an 8 device filesystem, if 3 devices fail, we can still
> technically write a complete stripe across 5 devices, but it will result in
> less total space we can use).

I understand what you mean, it was clear before. The problem is that
once it's below the critical number of drives, the previously existing
file system is busted. So it should go read only. But it can't, because
it doesn't yet have the concept of faulty devices, *and* also an
understanding of how many faulty devices can be tolerated before
there's a totally untenable hole in the file system.




>Whether or not this behavior is correct is
> another argument, but that appears to be what we do currently.  Ideally,
> this should be a mount option, as strictly speaking, it's policy, which
> therefore shouldn't be in the kernel.

I think we can definitely agree the current behavior is suboptimal
because in fact whatever it wrote to 16 drives was sufficiently
confusing that mounting all 20 drives again isn't possible no matter
what option is used.




>> I think considering the idea of Btrfs is to be more scalable than past
>> storage and filesystems have been, it needs to be able to deal with
>> transient failures like this. In theory all available information is
>> written on all the disks. This was a temporary failure. Once all
>> devices are made available again, the fs should be able to figure out
>> what to do, even so far as salvaging the writes that happened after
>> the 4 devices went missing if those were successful full stripe
>> writes.
>
> I entirely agree.  If the fix doesn't require any kind of decision to be
> made other than whether to fix it or not, it should be trivially fixable
> with the tools.  TBH though, this particular issue with devices disappearing
> and reappearing could be fixed easier in the block layer (at least, there
> are things that need to be fixed WRT it in the block layer).

Right. The block layer needs a way to communicate device missing to
Btrfs and Btrfs needs to have some tolerance for transience.

>>
>>

 Of course it is possible there's corruption problems with those four
 drives having vanished while writes were incomplete. But if you're
 lucky, data writes happen first, then metadata writes second, and only
 then is the super updated. So the super should point to valid metadata
 and that should point to valid data. If that order is wrong, then it's
 bad news and you have to look at backup roots. But *if* you get all
 the supers correct and on the same page, you can access the backup
 roots by using -o recovery if corruption is found with a normal mount.
>>>
>>>
>>> This though is where the potential issue is.  -o recovery will only go
>>> back
>>> so many generations before refusing to mount, and I think that may be why
>>> it's not working now..
>>
>>
>> It also looks like none of the tools are considering the stale supers
>> on the formerly missing 4 devices. I still think those are the best
>> chance to recover because even if their most current data is wrong due
>> to reordered writes not making it to stable storage, one of the
>> available backups in those supers should be good.
>>
> Depending on utilization on the other devices though, they may not point to
> complete roots either.  In this case, they probably will because of the low
> write frequency.  In other cases, they may not though, because we try to
> reuse space in chunks before allocating new chunks.

Based on the superblock posted, I think the *38 generation tree might
be incomplete, but there's a *37 and *36 generation that should be
intact. Chunk generation is the same.

What complicates the rollback is any deletions were happening at the
time. If it's just file additions, I think a rollback has a good
chance of working. It's just tedious.


-- 
Chris Murphy


Re: 4.4.0 - no space left with >1.7 TB free space left

2016-04-08 Thread Tomasz Chmielewski

On 2016-04-08 20:53, Roman Mamedov wrote:


> Do you snapshot the parent subvolume which holds the databases? Can you
> correlate that perhaps ENOSPC occurs at the time of snapshotting? If
> yes, then
> you should try the patch https://patchwork.kernel.org/patch/7967161/
>
> (Too bad this was not included into 4.4.1.)

By the way - was it included in any later kernel? I'm running 4.4.5 on
that server, but still hitting the same issue.


It's not in 4.4.6 either. I don't know why it doesn't get included, or what
we need to do. Last time I asked, it was queued:
http://www.spinics.net/lists/linux-btrfs/msg52478.html
But maybe that meant 4.5 or 4.6 only? While the bug is affecting people on
4.4.x today.


Does it mean 4.5 also doesn't have it yet?


Tomasz Chmielewski
http://wpkg.org



Re: btrfs send/receive using generation number as source

2016-04-08 Thread Martin Steigerwald
On Freitag, 8. April 2016 11:12:54 CEST Hugo Mills wrote:
> On Fri, Apr 08, 2016 at 01:01:03PM +0200, Martin Steigerwald wrote:
> > Hello!
> > 
> > As far as I understood, for differential btrfs send/receive – I didn´t use
> > it yet – I need to keep a snapshot on the source device to then tell
> > btrfs send to send the differences between the snapshot and the current
> > state.
> > 
> > Now the BTRFS filesystems on my SSDs are often quite full, thus I do not
> > keep any snapshots except for one during rsync or borgbackup script
> > run-time.
> > 
> > Is it possible to tell btrfs send to use generation number xyz to
> > calculate
> > the difference? This way, I wouldn´t have to keep a snapshot around, I
> > believe.
> 
>btrfs sub find-new
> 
>BUT that will only tell you which files have been added or updated.
> It won't tell you which files have been deleted. It's also unrelated
> to send/receive, so you'd have to roll your own solution.

I am aware of this one.

> > I bet not, at the time cause -c wants a snapshot. Ah and it wants a
> > snapshot of the same state on the destination as well. Well on the
> > destination I let the script make a snapshot after the backup so…
> > what I would need is to remember the generation number of the source
> > snapshot that the script creates to backup from and then tell btrfs
> > send that generation number + the destination snapshots.
> > 
> > Well, or get larger SSDs or get rid of some data on them.
> 
>Those are the other options, of course.

Hm, I see.

Thanks,
-- 
Martin


Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume

2016-04-08 Thread Holger Hoffstätte
[cc: Mark and Qu]

On 04/08/16 13:51, Holger Hoffstätte wrote:
> On 04/08/16 13:14, Filipe Manana wrote:
>> Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs
>> patches, it didn't reproduce here:
> 
> Great, that's good to know (sort of :). Thanks also to Liu Bo.
> 
>> Are you sure that you are not using some patches not in 4.6?

We have a bingo!

Reverting "qgroup: Fix qgroup accounting when creating snapshot"
from last Wednesday immediately fixes the problem.

Was quite easy to find - the triggered WARN_ON was the second one that
complained about a mismatch between roots. The only patch that even
remotely did something in that area was said qgroup fix.

Looks like something is missing there. Suggestions welcome. :)

Holger



Re: 4.4.0 - no space left with >1.7 TB free space left

2016-04-08 Thread Roman Mamedov
On Fri, 08 Apr 2016 20:36:26 +0900
Tomasz Chmielewski  wrote:

> On 2016-02-08 20:24, Roman Mamedov wrote:
> 
> >> Linux 4.4.0 - btrfs is mainly used to host lots of test containers,
> >> often snapshots, and at times, there is heavy IO in many of them for
> >> extended periods of time. btrfs is on HDDs.
> >> 
> >> 
> >> Every few days I'm getting "no space left" in a container running 
> >> mongo
> >> 3.2.1 database. Interestingly, haven't seen this issue in containers
> >> with MySQL. All databases have chattr +C set on their directories.
> > 
> > Hello,
> > 
> > Do you snapshot the parent subvolume which holds the databases? Can you
> > correlate that perhaps ENOSPC occurs at the time of snapshotting? If 
> > yes, then
> > you should try the patch https://patchwork.kernel.org/patch/7967161/
> > 
> > (Too bad this was not included into 4.4.1.)
> 
> By the way - was it included in any later kernel? I'm running 4.4.5 on 
> that server, but still hitting the same issue.

It's not in 4.4.6 either. I don't know why it doesn't get included, or what
we need to do. Last time I asked, it was queued:
http://www.spinics.net/lists/linux-btrfs/msg52478.html
But maybe that meant 4.5 or 4.6 only? While the bug is affecting people on
4.4.x today.

Thanks

-- 
With respect,
Roman




Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume

2016-04-08 Thread Holger Hoffstätte
On 04/08/16 13:14, Filipe Manana wrote:
> Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs
> patches, it didn't reproduce here:

Great, that's good to know (sort of :). Thanks also to Liu Bo.

> Are you sure that you are not using some patches not in 4.6?

Quite a few, but to offset that I also left out some that have diverged
too much or were not that important (block/sectorsize, device handling).
But those should not have anything to do with this particular bug.

Except for this everything works rock-solid, I use it daily.
Should be easy to track down..

-h



Re: 4.4.0 - no space left with >1.7 TB free space left

2016-04-08 Thread Tomasz Chmielewski

On 2016-02-08 20:24, Roman Mamedov wrote:


Linux 4.4.0 - btrfs is mainly used to host lots of test containers,
often snapshots, and at times, there is heavy IO in many of them for
extended periods of time. btrfs is on HDDs.


Every few days I'm getting "no space left" in a container running mongo
3.2.1 database. Interestingly, haven't seen this issue in containers
with MySQL. All databases have chattr +C set on their directories.


Hello,

Do you snapshot the parent subvolume which holds the databases? Can you
correlate that perhaps ENOSPC occurs at the time of snapshotting? If yes,
then you should try the patch https://patchwork.kernel.org/patch/7967161/

(Too bad this was not included into 4.4.1.)


By the way - was it included in any later kernel? I'm running 4.4.5 on 
that server, but still hitting the same issue.



Tomasz Chmielewski
http://wpkg.org



Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'

2016-04-08 Thread Austin S. Hemmelgarn

On 2016-04-07 15:32, Chris Murphy wrote:

On Thu, Apr 7, 2016 at 5:19 AM, Austin S. Hemmelgarn
 wrote:

On 2016-04-06 19:08, Chris Murphy wrote:


On Wed, Apr 6, 2016 at 9:34 AM, Ank Ular  wrote:



  From the output of 'dmesg', the section:
[   20.998071] BTRFS: device label FSgyroA devid 9 transid 625039 /dev/sdm
[   20.84] BTRFS: device label FSgyroA devid 10 transid 625039 /dev/sdn
[   21.004127] BTRFS: device label FSgyroA devid 11 transid 625039 /dev/sds
[   21.011808] BTRFS: device label FSgyroA devid 12 transid 625039 /dev/sdu

bothers me because the transid value of these four devices doesn't
match the other 16 devices in the pool {should be 625065}. In theory,
I believe these should all have the same transid value. These four
devices are all on a single USB 3.0 port and this is the link I
believe went down and came back up.



This is effectively a 4 disk failure and raid6 only allows for 2.

Now, a valid complaint is that as soon as Btrfs is seeing write
failures for 3 devices, it needs to go read-only. Specifically, it
would go read only upon 3 or more write errors affecting a single full
raid stripe (data and parity strips combined); and that's because such
a write is fully failed.


AFAIUI, currently, BTRFS will fail that stripe, but not retry it, _but_
after that, it will start writing out narrower stripes across the remaining
disks if there are enough for it to maintain the data consistency (so if
there's at least 3 for raid6 (I think, I don't remember if our lower limit
is 3 (which is degenerate), or 4 (which isn't, but most other software won't
let you use it for some stupid reason))).  Based on this, if the FS does get
recovered, make sure to run a balance on it too, otherwise you might have
some sub-optimal striping for some data.


I can see this being happening automatically with up to 2 device
failures, so that all subsequent writes are fully intact stripe
writes. But the instant there's a 3rd device failure, there's a rather
large hole in the file system that can't be reconstructed. It's an
invalid file system. I'm not sure what can be gained by allowing
writes to continue, other than tying off loose ends (so to speak) with
full stripe metadata writes for the purpose of making recovery
possible and easier, but after that metadata is written - poof, go
read only.
I don't mean writing partial stripes, I mean writing full stripes with a 
reduced width (so in an 8 device filesystem, if 3 devices fail, we can 
still technically write a complete stripe across 5 devices, but it will 
result in less total space we can use).  Whether or not this behavior is 
correct is another argument, but that appears to be what we do 
currently.  Ideally, this should be a mount option, as strictly 
speaking, it's policy, which therefore shouldn't be in the kernel.






You literally might have to splice superblocks and write them to 16
drives in exactly 3 locations per drive (well, maybe just one of them,
and then delete the magic from the other two, and then 'btrfs rescue
super-recover' should then use the one good copy to fix the two bad
copies).

Sigh maybe?

In theory it's possible, I just don't know the state of the tools. But
I'm fairly sure the best chance of recovery is going to be on the 4
drives that abruptly vanished.  Their supers will be mostly correct or
close to it: and that's what has all the roots in it: tree, fs, chunk,
extent and csum. And all of those states are better farther in the
past, rather than the 16 drives that have much newer writes.


FWIW, it is actually possible to do this, I've done it before myself on much
smaller raid1 filesystems with single drives disappearing, and once with a
raid6 filesystem with a double drive failure.  It is by no means easy, and
there's not much in the tools that helps with it, but it is possible
(although I sincerely hope I never have to do it again myself).


I think considering the idea of Btrfs is to be more scalable than past
storage and filesystems have been, it needs to be able to deal with
transient failures like this. In theory all available information is
written on all the disks. This was a temporary failure. Once all
devices are made available again, the fs should be able to figure out
what to do, even so far as salvaging the writes that happened after
the 4 devices went missing if those were successful full stripe
writes.
I entirely agree.  If the fix doesn't require any kind of decision to be 
made other than whether to fix it or not, it should be trivially fixable 
with the tools.  TBH though, this particular issue with devices 
disappearing and reappearing could be fixed easier in the block layer 
(at least, there are things that need to be fixed WRT it in the block 
layer).




Of course it is possible there's corruption problems with those four
drives having vanished while writes were incomplete. But if you're
lucky, data writes happen first, then metadata writes second, and only
then is the super updated. So the super should point to valid metadata
and that should point to valid data. If that order is wrong, then it's
bad news and you have to look at backup roots. But *if* you get all
the supers correct and on the same page, you can access the backup
roots by using -o recovery if corruption is found with a normal mount.

Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume

2016-04-08 Thread Filipe Manana
On Thu, Apr 7, 2016 at 5:44 PM, Holger Hoffstätte
 wrote:
> Hi,
>
> Looks like I just found an exciting new corner case.
> kernel 4.4.6 with btrfs ~4.6, so 4.6 should reproduce.

Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs
patches, it didn't reproduce here:

#!/bin/bash

dmesg -C
mkfs.btrfs -f /dev/sdi
mount /dev/sdi /mnt/sdi
cd /mnt/sdi
btrfs subvolume create foo
sync
btrfs subvolume snapshot foo foo-1
sync
mv foo-1 foo.new
btrfs subvolume delete foo.new
cd -
umount /dev/sdi
dmesg

gives:

btrfs-progs v4.5.1-dirty
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (100.00GiB) ...
Label:              (null)
UUID:               76cebc54-0ae1-4f53-91fd-3f9438bdfb50
Node size:          16384
Sector size:        4096
Filesystem size:    100.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP               1.01GiB
  System:           DUP              12.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1   100.00GiB  /dev/sdi

Create subvolume './foo'
Create a snapshot of 'foo' in './foo-1'
Delete subvolume (no-commit): '/mnt/sdi/foo.new'
/mnt
[75015.529626] systemd-journald[578]: Sent WATCHDOG=1 notification.
[75015.756407] BTRFS: device fsid 76cebc54-0ae1-4f53-91fd-3f9438bdfb50
devid 1 transid 3 /dev/sdi
[75015.932527] BTRFS info (device sdi): disk space caching is enabled
[75015.937674] BTRFS: has skinny extents
[75015.938470] BTRFS: flagging fs with big metadata feature
[75015.962601] BTRFS: creating UUID tree

Are you sure that you are not using some patches not in 4.6?
Also tried my own integration branch, and no issue either.

>
> Try on a fresh volume:
>
> $btrfs subvolume create foo
> Create subvolume './foo'
> $sync
> $btrfs subvolume snapshot foo foo-1
> Create a snapshot of 'foo' in './foo-1'
> $sync
> $mv foo-1 foo.new
> $btrfs subvolume delete foo.new
> Delete subvolume (no-commit): '/mnt/test/foo.new'
> $dmesg
> [  226.923316] [ cut here ]
> [  226.923339] WARNING: CPU: 1 PID: 5863 at fs/btrfs/transaction.c:319 
> record_root_in_trans+0xd6/0x100 [btrfs]()
> [  226.923340] Modules linked in: auth_rpcgss oid_registry nfsv4 btrfs xor 
> raid6_pq loop nfs lockd grace sunrpc autofs4 sch_fq_codel radeon 
> snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic coretemp 
> crc32_pclmul crc32c_intel aesni_intel i2c_algo_bit uvcvideo 
> snd_hda_codec_hdmi aes_x86_64 drm_kms_helper videobuf2_vmalloc glue_helper 
> videobuf2_memops syscopyarea lrw sysfillrect gf128mul videobuf2_v4l2 
> sysimgblt snd_usb_audio fb_sys_fops ablk_helper snd_hda_intel videobuf2_core 
> ttm cryptd snd_hwdep v4l2_common usbhid snd_hda_codec snd_usbmidi_lib 
> videodev snd_rawmidi drm snd_hda_core snd_seq_device i2c_i801 snd_pcm 
> i2c_core snd_timer snd r8169 soundcore mii parport_pc parport
> [  226.923365] CPU: 1 PID: 5863 Comm: ls Not tainted 4.4.6 #1
> [  226.923366] Hardware name: Gigabyte Technology Co., Ltd. 
> P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
> [  226.923367]   8800da677d20 813181a8 
> 
> [  226.923368]  a0aacdbf 8800da677d58 810507b2 
> 880601e90800
> [  226.923369]  8800dacf10a0 880601e90800 880601e909f0 
> 0001
> [  226.923371] Call Trace:
> [  226.923374]  [] dump_stack+0x4d/0x65
> [  226.923376]  [] warn_slowpath_common+0x82/0xc0
> [  226.923378]  [] warn_slowpath_null+0x1a/0x20
> [  226.923387]  [] record_root_in_trans+0xd6/0x100 [btrfs]
> [  226.923395]  [] btrfs_record_root_in_trans+0x44/0x70 
> [btrfs]
> [  226.923404]  [] start_transaction+0x9e/0x4c0 [btrfs]
> [  226.923412]  [] btrfs_join_transaction+0x17/0x20 [btrfs]
> [  226.923421]  [] btrfs_dirty_inode+0x35/0xd0 [btrfs]
> [  226.923430]  [] btrfs_update_time+0x7d/0xb0 [btrfs]
> [  226.923432]  [] touch_atime+0x88/0xa0
> [  226.923434]  [] iterate_dir+0xdb/0x120
> [  226.923435]  [] SyS_getdents+0x88/0xf0
> [  226.923437]  [] ? fillonedir+0xd0/0xd0
> [  226.923439]  [] entry_SYSCALL_64_fastpath+0x12/0x6a
> [  226.923440] ---[ end trace 9c78caf253e284fe ]---
>
> Code looks like:
>
> ..
> static int record_root_in_trans(struct btrfs_trans_handle *trans,
>                                 struct btrfs_root *root)
> {
>         if (test_bit(BTRFS_ROOT_REF_COWS, &root->state) &&
>             root->last_trans < trans->transid) {
>                 WARN_ON(root == root->fs_info->extent_root);
>                 WARN_ON(root->commit_root != root->node);
> ..
>
> There have been a few journal/recovery/directory consistency patches recently,
> so maybe it's a corner case or an older problem. I'll try to bisect, but
> meanwhile wanted to report it for discussion.
>
> Holger
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs send/receive using generation number as source

2016-04-08 Thread Hugo Mills
On Fri, Apr 08, 2016 at 01:01:03PM +0200, Martin Steigerwald wrote:
> Hello!
> 
> As far as I understood, for differential btrfs send/receive – I didn´t use it 
> yet – I need to keep a snapshot on the source device to then tell btrfs send 
> to send the differences between the snapshot and the current state.
> 
> Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep 
> any snapshots except for one during rsync or borgbackup script run-time.
> 
> Is it possible to tell btrfs send to use generation number xyz to calculate 
> the difference? This way, I wouldn´t have to keep a snapshot around, I 
> believe.

   btrfs sub find-new

   BUT that will only tell you which files have been added or updated.
It won't tell you which files have been deleted. It's also unrelated
to send/receive, so you'd have to roll your own solution.
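
As a rough sketch of rolling your own on top of that (the subvolume path
and the generation numbers are placeholders):

# list files added or updated since generation 1234
btrfs subvolume find-new /mnt/data 1234

# remember where we are now for the next run; with an absurdly high
# last-gen argument, find-new prints only its "transid marker was N" line
btrfs subvolume find-new /mnt/data 99999999 | tail -n1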

> I bet not, at the time cause -c wants a snapshot. Ah and it wants a
> snapshot of the same state on the destination as well. Well on the
> destination I let the script make a snapshot after the backup so…
> what I would need is to remember the generation number of the source
> snapshot that the script creates to backup from and then tell btrfs
> send that generation number + the destination snapshots.

> Well, or get larger SSDs or get rid of some data on them.

   Those are the other options, of course.

   Hugo.

-- 
Hugo Mills | The trouble with you, Ibid, is you think you know
hugo@... carfax.org.uk | everything.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


btrfs send/receive using generation number as source

2016-04-08 Thread Martin Steigerwald
Hello!

As far as I understood, for differential btrfs send/receive – I didn´t use it 
yet – I need to keep a snapshot on the source device to then tell btrfs send 
to send the differences between the snapshot and the current state.

Now the BTRFS filesystems on my SSDs are often quite full, thus I do not keep 
any snapshots except for one during rsync or borgbackup script run-time.

Is it possible to tell btrfs send to use generation number xyz to calculate 
the difference? This way, I wouldn´t have to keep a snapshot around, I 
believe.

I bet not, at the time cause -c wants a snapshot. Ah and it wants a snapshot 
of the same state on the destination as well. Well on the destination I let 
the script make a snapshot after the backup so… what I would need is to 
remember the generation number of the source snapshot that the script creates 
to backup from and then tell btrfs send that generation number + the 
destination snapshots.
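
(For reference, the snapshot-keeping cycle that -p expects would look
roughly like this – paths and snapshot names made up for illustration:

btrfs subvolume snapshot -r /mnt/data /mnt/snapshots/data.new
btrfs send -p /mnt/snapshots/data.prev /mnt/snapshots/data.new | \
        btrfs receive /backup
btrfs subvolume delete /mnt/snapshots/data.prev
mv /mnt/snapshots/data.new /mnt/snapshots/data.prev

– i.e. it needs exactly the extra snapshot I'd rather not keep on a
nearly full SSD.)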

Well, or get larger SSDs or get rid of some data on them.

Thanks,
-- 
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v1] block: avoid to call .bi_end_io() recursively

2016-04-08 Thread Ming Lei
There have been reports of heavy stack use caused by
recursive calls to .bi_end_io().[1][2][3]

Patches [1][2][3] were posted to address the issue,
and the idea is basically the same in each: serialize
the recursive calls to .bi_end_io() via a per-cpu list.

This patch takes the same idea, but uses a bio_list to
implement it, which turns out simpler and makes the
code more readable.

xfstests (-g auto) was run with this patch and no
regression was found on ext4, but when testing btrfs,
generic/224 and generic/323 cause a kernel oops.
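
For reference, the two failing cases can be re-run on their own with
xfstests' ./check script, e.g. (assuming a configured xfstests tree
with btrfs TEST_DEV/SCRATCH_DEV set up):

cd xfstests
./check -g auto                     # the full run used above
./check generic/224 generic/323     # just the two tests that oops on btrfs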

[1] http://marc.info/?t=12142850204=1=2
[2] http://marc.info/?l=dm-devel=139595190620008=2
[3] http://marc.info/?t=14597464411=1=2

Cc: Shaun Tancheff 
Cc: Christoph Hellwig 
Cc: Mikulas Patocka 
Signed-off-by: Ming Lei 
---
V1:
- change to RFC
- fix when unwind_bio_endio() is called recursively
- run xfstest again: no regression found on ext4,
but generic/323 and generic/224 cause kernel oops 

 block/bio.c | 44 ++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f124a0a..e2d0970 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -68,6 +68,8 @@ static DEFINE_MUTEX(bio_slab_lock);
 static struct bio_slab *bio_slabs;
 static unsigned int bio_slab_nr, bio_slab_max;
 
+static DEFINE_PER_CPU(struct bio_list *, bio_end_list) = { NULL };
+
 static struct kmem_cache *bio_find_or_create_slab(unsigned int extra_size)
 {
unsigned int sz = sizeof(struct bio) + extra_size;
@@ -1737,6 +1739,45 @@ static inline bool bio_remaining_done(struct bio *bio)
return false;
 }
 
+/* disable local irq when manipulating the percpu bio_list */
+static void unwind_bio_endio(struct bio *bio)
+{
+   struct bio_list *bl;
+   unsigned long flags;
+   bool clear_list = false;
+
+   preempt_disable();
+   local_irq_save(flags);
+
+   bl = this_cpu_read(bio_end_list);
+   if (!bl) {
+   struct bio_list bl_in_stack;
+
+   bl = &bl_in_stack;
+   bio_list_init(bl);
+   this_cpu_write(bio_end_list, bl);
+   clear_list = true;
+   } else {
+   bio_list_add(bl, bio);
+   goto out;
+   }
+
+   while (bio) {
+   local_irq_restore(flags);
+
+   if (bio->bi_end_io)
+   bio->bi_end_io(bio);
+
+   local_irq_save(flags);
+   bio = bio_list_pop(bl);
+   }
+   if (clear_list)
+   this_cpu_write(bio_end_list, NULL);
+ out:
+   local_irq_restore(flags);
+   preempt_enable();
+}
+
 /**
  * bio_endio - end I/O on a bio
  * @bio:   bio
@@ -1765,8 +1806,7 @@ again:
goto again;
}
 
-   if (bio->bi_end_io)
-   bio->bi_end_io(bio);
+   unwind_bio_endio(bio);
 }
 EXPORT_SYMBOL(bio_endio);
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html