date:20151208

[PATCH] btrfs-progs: Enhance chunk validation check

2015-12-08 Thread Qu Wenruo

Enhance chunk validation:
1) Num_stripes
   We already have such check but it's only in super block sys chunk
   array.
   Now check all on-disk chunks.

2) Chunk logical
   It should be aligned to sector size.
   This behavior should be *DOUBLE CHECKED* for 64K sector size like
   PPC64 or AArch64.
   Maybe we can find some hidden bugs.

3) Chunk length
   Same as chunk logical, should be aligned to sector size.

4) Stripe length
   It should be power of 2.

5) Chunk type
   Any bit out of TYPE_MASK | PROFILE_MASK is invalid.

With these much stricter rules, several fuzzed images reported on the
mailing list should no longer cause btrfsck errors.

Reported-by: Vegard Nossum 
Signed-off-by: Qu Wenruo 
---
 disk-io.c |  2 --
 utils.h   |  7 +++
 volumes.c | 29 -
 3 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 7a63b91..83bdb27 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -40,8 +40,6 @@
 #define BTRFS_BAD_LEVEL    (-3)
 #define BTRFS_BAD_NRITEMS  (-4)
 
-#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
-
 /* Calculate max possible nritems for a leaf/node */
 static u32 max_nritems(u8 level, u32 nodesize)
 {
diff --git a/utils.h b/utils.h
index 493c2e4..7740fc2 100644
--- a/utils.h
+++ b/utils.h
@@ -24,6 +24,8 @@
 #include 
 #include 
 
+#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
+
 #define BTRFS_MKFS_SYSTEM_GROUP_SIZE (4 * 1024 * 1024)
 #define BTRFS_MKFS_SMALL_VOLUME_SIZE (1024 * 1024 * 1024)
 #define BTRFS_MKFS_DEFAULT_NODE_SIZE 16384
@@ -246,6 +248,11 @@ static inline u64 div_factor(u64 num, int factor)
return num;
 }
 
+static inline int is_power_of_2(unsigned long n)
+{
+   return (n != 0 && ((n & (n - 1)) == 0));
+}
+
 int btrfs_tree_search2_ioctl_supported(int fd);
 int btrfs_check_nodesize(u32 nodesize, u32 sectorsize, u64 features);
 
diff --git a/volumes.c b/volumes.c
index 492dcd2..a94be0e 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1591,6 +1591,7 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
struct cache_extent *ce;
u64 logical;
u64 length;
+   u64 stripe_len;
u64 devid;
u8 uuid[BTRFS_UUID_SIZE];
int num_stripes;
@@ -1599,6 +1600,33 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 
logical = key->offset;
length = btrfs_chunk_length(leaf, chunk);
+   stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
+   num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+   /* Validation check */
+   if (!num_stripes) {
+   error("invalid chunk num_stripes: %u", num_stripes);
+   return -EIO;
+   }
+   if (!IS_ALIGNED(logical, root->sectorsize)) {
+   error("invalid chunk logical %llu", logical);
+   return -EIO;
+   }
+   if (!length || !IS_ALIGNED(length, root->sectorsize)) {
+   error("invalid chunk length %llu", length);
+   return -EIO;
+   }
+   if (!is_power_of_2(stripe_len)) {
+   error("invalid chunk stripe length: %llu", stripe_len);
+   return -EIO;
+   }
+   if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+   btrfs_chunk_type(leaf, chunk)) {
+   error("unrecognized chunk type: %llu",
+ ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
+   BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+ btrfs_chunk_type(leaf, chunk));
+   return -EIO;
+   }
 
ce = search_cache_extent(&map_tree->cache_tree, logical);
 
@@ -1607,7 +1635,6 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
return 0;
}
 
-   num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
map = kmalloc(btrfs_map_lookup_size(num_stripes), GFP_NOFS);
if (!map)
return -ENOMEM;
-- 
2.6.3
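
For readers who want to try the five rules outside the btrfs-progs tree, here is a minimal standalone sketch. It is not the patch itself: the struct `demo_chunk`, the `demo_*` helpers and the two mask constants are hypothetical stand-ins chosen for illustration, and the real mask values come from the btrfs headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed mask values for illustration only; use the btrfs headers in real code. */
#define DEMO_TYPE_MASK    0x7ULL    /* data, system, metadata bits */
#define DEMO_PROFILE_MASK 0x1f8ULL  /* raid/dup profile bits */

/* Hypothetical in-memory view of the chunk fields the patch inspects. */
struct demo_chunk {
	uint64_t logical;
	uint64_t length;
	uint64_t stripe_len;
	uint64_t type;
	uint16_t num_stripes;
};

/* Only meaningful when alignment is a power of two. */
static int demo_is_aligned(uint64_t value, uint64_t alignment)
{
	return (value & (alignment - 1)) == 0;
}

static int demo_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Returns 0 if the chunk passes all five checks, -EIO otherwise. */
static int demo_validate_chunk(const struct demo_chunk *c, uint32_t sectorsize)
{
	if (c->num_stripes == 0)
		return -EIO;                       /* 1) num_stripes */
	if (!demo_is_aligned(c->logical, sectorsize))
		return -EIO;                       /* 2) chunk logical */
	if (c->length == 0 || !demo_is_aligned(c->length, sectorsize))
		return -EIO;                       /* 3) chunk length */
	if (!demo_is_power_of_2(c->stripe_len))
		return -EIO;                       /* 4) stripe length */
	if (c->type & ~(DEMO_TYPE_MASK | DEMO_PROFILE_MASK))
		return -EIO;                       /* 5) chunk type */
	return 0;
}
```

Passing the sector size as a parameter makes it easy to run the same chunk through both 4K and 64K sector sizes.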



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2] Btrfs: Check metadata redundancy on balance

2015-12-08 Thread sam tygier

Resending as previous comments did not need any changes.

Currently BTRFS allows you to make bad choices of data and
metadata levels. For example, -d raid1 -m raid0 means you can
only use half your total disk space, but will lose everything
if 1 disk fails. It should give a warning in these cases.

This patch is a follow up to
[PATCH v2] btrfs-progs: check metadata redundancy
in order to cover the case of using balance to convert to such
a set of raid levels.

A simple example to hit this is to create a single device fs, 
which will default to single:dup, then to add a second device and
attempt to convert to raid1 with the command
btrfs balance start -dconvert=raid1  /mnt
this will result in a filesystem with raid1:dup, which will not
survive the loss of one drive. I personally don't see why the tools
should allow this, but in the previous thread a warning was
considered sufficient.

Changes in v2
Use btrfs_get_num_tolerated_disk_barrier_failures()

Signed-off-by: Sam Tygier 

From: Sam Tygier 
Date: Sat, 3 Oct 2015 16:43:48 +0100
Subject: [PATCH] Btrfs: Check metadata redundancy on balance

When converting a filesystem via balance, check that the metadata mode
is at least as redundant as the data mode. For example, give a warning
when:
-dconvert=raid1 -mconvert=single
---
 fs/btrfs/volumes.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6fc73586..40247e9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3584,6 +3584,12 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
}
} while (read_seqretry(&fs_info->profiles_lock, seq));
 
+   if (btrfs_get_num_tolerated_disk_barrier_failures(bctl->meta.target) <
+   btrfs_get_num_tolerated_disk_barrier_failures(bctl->data.target)) {
+   btrfs_info(fs_info,
+   "Warning: metadata has lower redundancy than data\n");
+   }
+
if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
fs_info->num_tolerated_disk_barrier_failures = min(
btrfs_calc_num_tolerated_disk_barrier_failures(fs_info),
-- 
2.4.3
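
The comparison the patch performs can be sketched in userspace as follows. This is an illustration only: the enum `demo_profile` and both helpers are hypothetical, and the tolerance table is an assumption based on the usual profile semantics (raid1, raid10 and raid5 survive one device loss, raid6 two, everything else none), not a copy of the kernel helper.

```c
#include <assert.h>

/* Hypothetical profile names; the kernel uses block group flag bits instead. */
enum demo_profile {
	DEMO_SINGLE,
	DEMO_DUP,
	DEMO_RAID0,
	DEMO_RAID1,
	DEMO_RAID10,
	DEMO_RAID5,
	DEMO_RAID6
};

/* Assumed number of whole-device failures each profile can survive. */
static int demo_tolerated_failures(enum demo_profile p)
{
	switch (p) {
	case DEMO_RAID1:
	case DEMO_RAID10:
	case DEMO_RAID5:
		return 1;
	case DEMO_RAID6:
		return 2;
	default:
		/* single, dup and raid0 do not survive losing a device */
		return 0;
	}
}

/* Non-zero when the metadata target is less redundant than the data target. */
static int demo_metadata_less_redundant(enum demo_profile data,
					enum demo_profile meta)
{
	return demo_tolerated_failures(meta) < demo_tolerated_failures(data);
}
```

With this table, the raid1:dup case from the description above is flagged, because dup keeps both copies on one device.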




[PATCH v3 2/2] btrfs: Enhance chunk validation check

2015-12-08 Thread Qu Wenruo

Enhance chunk validation:
1) Num_stripes
   We already have such check but it's only in super block sys chunk
   array.
   Now check all on-disk chunks.

2) Chunk logical
   It should be aligned to sector size.
   This behavior should be *DOUBLE CHECKED* for 64K sector size like
   PPC64 or AArch64.
   Maybe we can find some hidden bugs.

3) Chunk length
   Same as chunk logical, should be aligned to sector size.

4) Stripe length
   It should be power of 2.

5) Chunk type
   Any bit out of TYPE_MASK | PROFILE_MASK is invalid.

With these much stricter rules, several fuzzed images reported on the
mailing list should no longer cause kernel panics.

Reported-by: Vegard Nossum 
Signed-off-by: Qu Wenruo 

---
v3:
  Fix a typo which forgot to return -EIO after num_stripes check.
---
 fs/btrfs/volumes.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9ea345f..bda84be 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6199,6 +6199,7 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
struct extent_map *em;
u64 logical;
u64 length;
+   u64 stripe_len;
u64 devid;
u8 uuid[BTRFS_UUID_SIZE];
int num_stripes;
@@ -6207,6 +6208,37 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
 
logical = key->offset;
length = btrfs_chunk_length(leaf, chunk);
+   stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
+   num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+   /* Validation check */
+   if (!num_stripes) {
+   btrfs_err(root->fs_info, "invalid chunk num_stripes: %u",
+ num_stripes);
+   return -EIO;
+   }
+   if (!IS_ALIGNED(logical, root->sectorsize)) {
+   btrfs_err(root->fs_info,
+ "invalid chunk logical %llu", logical);
+   return -EIO;
+   }
+   if (!length || !IS_ALIGNED(length, root->sectorsize)) {
+   btrfs_err(root->fs_info,
+   "invalid chunk length %llu", length);
+   return -EIO;
+   }
+   if (!is_power_of_2(stripe_len)) {
+   btrfs_err(root->fs_info, "invalid chunk stripe length: %llu",
+ stripe_len);
+   return -EIO;
+   }
+   if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+   btrfs_chunk_type(leaf, chunk)) {
+   btrfs_err(root->fs_info, "unrecognized chunk type: %llu",
+ ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
+   BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+ btrfs_chunk_type(leaf, chunk));
+   return -EIO;
+   }
 
read_lock(&map_tree->map_tree.lock);
em = lookup_extent_mapping(&map_tree->map_tree, logical, 1);
@@ -6223,7 +6255,6 @@ static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
em = alloc_extent_map();
if (!em)
return -ENOMEM;
-   num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
if (!map) {
free_extent_map(em);
-- 
2.6.3
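
The request to double-check the alignment rules with 64K sectors can be illustrated with a small sketch of the mask arithmetic. The helper name `demo_misalignment` is hypothetical and the offsets are made-up examples; the point is only that one offset can pass at 4K and fail at 64K.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes by which value overshoots the previous sectorsize boundary.
 * Assumes sectorsize is a power of two, so masking replaces a modulo.
 */
static uint64_t demo_misalignment(uint64_t value, uint64_t sectorsize)
{
	return value & (sectorsize - 1);
}

/* A chunk start is acceptable only when the misalignment is zero. */
static int demo_chunk_start_ok(uint64_t logical, uint64_t sectorsize)
{
	return demo_misalignment(logical, sectorsize) == 0;
}
```

For example, an offset of 20 MiB + 4 KiB is sector-aligned with 4K sectors but leaves a remainder of 4096 bytes with 64K sectors.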




Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay

2015-12-08 Thread Austin S Hemmelgarn


On 2015-12-08 01:08, Qu Wenruo wrote:
> Austin S Hemmelgarn wrote on 2015/12/07 11:36 -0500:
>> On 2015-12-07 01:06, Qu Wenruo wrote:
>>> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
>>> option to get real readonly mount, like "norecovery" in ext* and xfs.
>>>
>>> Since the new parse_options() need to check new flags at remount time,
>>> so add a new parameter for parse_options().
>>
>> Passes xfstests and a handful of other things that I really should just
>> take the time to integrate into xfstests, so:
>> Tested-by: Austin S. Hemmelgarn 
>
> Thanks for the test.
>
> But I'm afraid you may need to test v2 patch again, as the v2 changed
> some behavior.

That's OK, I should have results for you some time later today.





Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay

2015-12-08 Thread Austin S Hemmelgarn


On 2015-12-07 18:06, Eric Sandeen wrote:
> On 12/7/15 2:54 PM, Christoph Anton Mitterer wrote:
>
> ...
>
>> 2) a section that describes "ro" in btrfs-mount(5) which describes that
>> normal "ro" alone may cause changes on the device and which then refers
>> to hard-ro and/or the list of options (currently nologreplay) which are
>> required right now to make it truly ro.
>>
>> I think this is important as an end-user probably expects "ro" to be
>> truly ro,
>
> Yeah, I don't know that this is true.  It hasn't been true for over a
> decade (2?), with the most widely-used filesystem in linux history, i.e.
> ext3.  So if btrfs wants to go on this re-education crusade, more power
> to you, but I don't know that it's really a fight worth fighting.  ;)

Actually, AFAICT, it's been at least 4.5 decades.  Last I checked, this
dates back to the original UNIX filesystems, which still updated atimes
even when mounted RO.

Despite this, it really isn't a widely known or well documented behavior
outside of developers, forensic specialists, and people who have had to
deal with the implications it has on data recovery.  There really isn't
any way that the user would know about it without being explicitly told,
and it's something that can have a serious impact on being able to
recover a broken filesystem.  TBH, I really feel that _every_
filesystem's documentation should have something about how to make it
mount truly read-only, even if it's just a reference to how to mark the
block device read-only.
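
As a rough illustration of that last point, on Linux a block device can be flagged read-only with the BLKROSET ioctl before anything mounts it. This is a minimal sketch, assuming root privileges and a Linux system; the helper name `demo_set_blockdev_readonly` is hypothetical, and it only sets the kernel's read-only flag for the device.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKROSET */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Set (readonly != 0) or clear (readonly == 0) the read-only flag of a
 * block device. Returns 0 on success, -1 on failure with errno set.
 */
static int demo_set_blockdev_readonly(const char *path, int readonly)
{
	int fd = open(path, O_RDONLY);
	int ret, saved;

	if (fd < 0)
		return -1;
	ret = ioctl(fd, BLKROSET, &readonly);
	saved = errno;          /* keep the ioctl error across close() */
	close(fd);
	errno = saved;
	return ret < 0 ? -1 : 0;
}
```

A caller would run this against the device node first and only then mount with "ro", so that any write the filesystem attempts is refused at the block layer.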






Re: [PATCH v2] btrfs: Introduce new mount option to disable tree log replay

2015-12-08 Thread Chandan Rajendra

On Tuesday 08 Dec 2015 14:10:33 Qu Wenruo wrote:
> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
> option to get real readonly mount, like "norecovery" in ext* and xfs.
> 
> Since the new parse_options() need to check new flags at remount time,
> so add a new parameter for parse_options().
> 
> Signed-off-by: Qu Wenruo 
> ---
> v2:
>   Make RO check mandatory for btrfs_parse_options().
>   Add btrfs_show_options() support for nologreplay.
> 
>   Document for btrfs-mount(5) will follow after the patch being merged.
> ---
>  Documentation/filesystems/btrfs.txt |  7 +++
>  fs/btrfs/ctree.h|  4 +++-
>  fs/btrfs/disk-io.c  |  7 ---
>  fs/btrfs/super.c| 29 +
>  4 files changed, 39 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/filesystems/btrfs.txt
> b/Documentation/filesystems/btrfs.txt index c772b47..7ad5b93 100644
> --- a/Documentation/filesystems/btrfs.txt
> +++ b/Documentation/filesystems/btrfs.txt
> @@ -168,6 +168,13 @@ Options with (*) are default options and will not show
> in the mount options. notreelog
>   Enable/disable the tree logging used for fsync and O_SYNC writes.
> 
> +  nologreplay
> + Disable the log tree replay at mount time to prevent devices get
> + modified. Must be use with 'ro' mount option.
> + A filesystem mounted with the 'nologreplay' option cannot
> + transition to a read-write mount via remount,rw - the filesystem
> + must be unmounted and remounted if read-write access is desired.
> +

Maybe the following is slightly better:

Disable the log tree replay at mount time to prevent filesystem from getting
modified. Must be used with 'ro' mount option.  A filesystem mounted with the
'nologreplay' option cannot transition to a read-write mount via remount,rw -
the filesystem must be unmounted and mounted back again if read-write access
is desired.

Aside from above, everything else looks good to me.

Reviewed-by: Chandan Rajendra 

-- 
chandan
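
For context, a program would pass the option under review to mount(2) roughly as sketched below. This assumes the patch is applied so that the kernel recognises "nologreplay"; the wrapper name `demo_mount_without_log_replay` is hypothetical, and the call needs root privileges.

```c
#include <assert.h>
#include <sys/mount.h>

/*
 * Mount a btrfs device read-only with log replay disabled.
 * MS_RDONLY corresponds to "ro"; "nologreplay" goes in the
 * filesystem-specific option string. Returns 0 on success, -1 on
 * failure with errno set.
 */
static int demo_mount_without_log_replay(const char *device,
					 const char *mountpoint)
{
	return mount(device, mountpoint, "btrfs", MS_RDONLY, "nologreplay");
}
```

Per the documentation text quoted above, such a mount could not be switched to read-write with remount,rw; it would have to be unmounted and mounted again.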


Re: [PATCH 1/4] locks: new locks_mandatory_area calling convention

2015-12-08 Thread Christoph Hellwig

On Tue, Dec 08, 2015 at 04:05:04AM +, Al Viro wrote:
> Where the hell would truncate(2) get struct file, anyway?  IOW, the inode
> argument is _not_ pointless; re-added.

Oh, right.  Interestingly, it seems like xfstests has no coverage of this
code path at all.

Re: Fixing recursive fault and parent transid verify failed

2015-12-08 Thread Duncan

Alistair Grant posted on Tue, 08 Dec 2015 06:55:04 +1100 as excerpted:

> On Mon, Dec 07, 2015 at 01:48:47PM +, Duncan wrote:
>> Alistair Grant posted on Mon, 07 Dec 2015 21:02:56 +1100 as excerpted:
>> 
>> > I think I'll try the btrfs restore as a learning exercise, and to
>> > check the contents of my backup (I don't trust my memory, so
>> > something could have changed since the last backup).
>> 
>> Trying btrfs restore is an excellent idea.  It'll make things far
>> easier if you have to use it for real some day.
>> 
>> Note that while I see your kernel is reasonably current (4.2 series), I
>> don't know what btrfs-progs ubuntu ships.  There have been some marked
>> improvements to restore somewhat recently, checking the wiki
>> btrfs-progs release-changelog list says 4.0 brought optional metadata
>> restore, 4.0.1 added --symlinks, and 4.2.3 fixed a symlink path check
>> off-by-one error. (And don't use 4.1.1 as its mkfs.btrfs is broken and
>> produces invalid filesystems.)  So you'll want at least progs 4.0 to
>> get the optional metadata restoration, and 4.2.3 to get full symlinks
>> restoration support.
>> 
>> 
> Ubuntu 15.10 comes with btrfs-progs v4.0.  It looks like it is easy
> enough to compile and install the latest version from
> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git so
> I'll do that.
> 
> Should I stick to 4.2.3 or use the latest 4.3.1?

I generally use the latest myself, but recommend as a general guideline 
that at minimum, a userspace version series matching that of your kernel 
be used, as if the usual kernel recommendations (within two kernel series 
of either current or LTS, so presently 4.2 or 4.3 for current or 3.18 or 
4.1 for LTS) are followed, that will keep userspace reasonably current as 
well, and the userspace of a particular version was being developed 
concurrently with the kernel of the same series, so they're relatively in 
sync.

So with a 4.2 kernel, I'd suggest at least a 4.2 userspace.  If you want 
the latest, as I generally do, and are willing to put up with occasional 
bleeding edge bugs like that broken mkfs.btrfs in 4.1.1, by all means, 
use the latest, but otherwise, the general same series as your kernel 
guideline is quite acceptable.

The exception would be if you're trying to fix or recover from a broken 
filesystem, in which case the very latest tends to have the best chance 
at fixing things, since it has fixes for (or lacking that, at least 
detection of) the latest round of discovered bugs, that older versions 
will lack.

While btrfs restore does fall into the recover from broken category, we 
know from the changelogs that nothing specific has gone into it since the 
mentioned 4.2.3 symlink off-by-one fix, so while I would recommend at 
least that since you are going to be working with restore, there's no 
urgent need for 4.3.0 or 4.3.1 if you're more comfortable with the older 
version.  (In fact, while I knew I was on 4.3.something, I just had to 
run btrfs version, to check whether it was 4.3 or 4.3.1, myself.  FWIW, 
it was 4.3.1.)
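
The "same series as your kernel" guideline can be expressed as a small check. This is only a sketch of the rule of thumb above; the names `demo_series` and `demo_progs_at_least_kernel_series` are hypothetical, and version strings are assumed to look like "4.2", "4.2.3" or "v4.0".

```c
#include <assert.h>
#include <stdio.h>

struct demo_series {
	int major;
	int minor;
};

/* Parse "4.2", "4.2.3" or "v4.0" into major/minor; 0 on success, -1 on error. */
static int demo_parse_series(const char *version, struct demo_series *out)
{
	if (*version == 'v')
		version++;
	return sscanf(version, "%d.%d", &out->major, &out->minor) == 2 ? 0 : -1;
}

/*
 * Returns 1 if the btrfs-progs series is at least the kernel series,
 * 0 if it is older, and -1 if either string cannot be parsed.
 */
static int demo_progs_at_least_kernel_series(const char *kernel,
					     const char *progs)
{
	struct demo_series k, p;

	if (demo_parse_series(kernel, &k) || demo_parse_series(progs, &p))
		return -1;
	if (p.major != k.major)
		return p.major > k.major;
	return p.minor >= k.minor;
}
```

With a 4.2 kernel, progs v4.0 would fall short of the guideline while 4.2.3 and 4.3.1 both satisfy it.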

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Scrub: no spae left on device

2015-12-08 Thread Marc MERLIN

Howdy,

Why would scrub need space and why would it cancel if there isn't enough of
it?
(kernel 4.3)

/etc/cron.daily/btrfs-scrub:
btrfs scrub start -Bd /dev/mapper/cryptroot
scrub device /dev/mapper/cryptroot (id 1) done
scrub started at Mon Dec  7 01:35:08 2015 and finished after 258 seconds
total bytes scrubbed: 130.84GiB with 0 errors
btrfs scrub start -Bd /dev/mapper/pool1
ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on device)
scrub device /dev/mapper/pool1 (id 1) canceled

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-08 Thread Duncan

Jon Panozzo posted on Mon, 07 Dec 2015 08:43:14 -0600 as excerpted:

[On single-device dup data]

> Thanks for the additional feedback.  Two follow-up questions to this is:
> 
> Can the --mixed option only be applied when first creating the fs, or
> can you simply add this to the balance command to take an existing
> filesystem and add this to it?

Mixed-bg mode has to be done at btrfs creation.

It changes the way btrfs handles chunks, and doing that _live_, with a 
non-zero time during which both modes are active, would be... complex and 
an invitation to all sorts of race bugs, to put it mildly.

> So it sounds like there are really three ways to enable scrub to repair
> errors on a btrfs single device (please confirm):

Yes.

> 1) mkfs.btrfs with the --mixed option

This would be my current preference for filesystem sizes of up to a
quarter to perhaps a half terabyte on spinning rust, and some people are
known to use mixed for exactly this reason, tho it's not particularly well
tested at the terabyte scale filesystem level, where as a result you might
uncover some unusual bugs.

> 2) create two partitions on a single phys device,
> then present them as logical devices (maybe a loopback or something)
> and create a btrfs raid1 for both data/metadata

No special loopback, etc, required.  Btrfs deploys just fine on pretty 
much any block device as presented by the kernel, including both 
partitions and LVM volumes, the two ways single physical devices are 
likely to be presented as multiple logical devices.

In fact I use btrfs on partitions here, tho in my case it's two devices 
partitioned up identically, with raid1 across the parallel partitions on 
each device, instead of using multiple partitions on the same physical 
device, which is what we're talking about here.

This option will be rather inefficient on spinning rust as the write head 
will have to write one copy to the one partition, then reposition itself 
to write the second copy to the other partition, and that repositioning 
is non-zero time on spinning rust, but there's no such repositioning 
latency on SSDs, where it might actually be faster than mixed-mode, tho 
I'm unaware of any benchmarking to find out.

Despite the inefficiency, both partitions and btrfs raid1 are separately
well tested and their combined use on a single device should introduce no
race conditions that wouldn't have been found by previous separate usage,
so this would be my current preference at filesystem sizes over a half
terabyte on spinning rust, or on SSDs with their zero seek times.

But writing /will/ be slow on spinning rust, particularly with partition 
sizes of a half-TiB or larger each, as that write-mode seek-time will be 
/nasty/.

That said, again, there are people known to be using this mode, and it's 
a viable choice in deployments such as laptops where physical multi-
device isn't an option, but the additional reliability of pair-copy data 
is highly desirable.

> 3) wait for the patch in process to allow for btrfs single devices to
> support dup mode for data

This should be the preferred mode in the future, tho as with any new 
btrfs feature, it'll probably take a couple kernel versions after initial 
introduction for the most critical bugs in the new feature to be found 
and duly exterminated, so I'd consider anyone using it the first kernel 
cycle or two after introduction to be volunteering as guinea pigs.  That 
said, the individual components of this feature have been in btrfs for 
some time and are well tested by now, so I'd expect the introduction of 
this feature to be rather smoother than many.  For the much more 
disruptive raid56 mode, I suggested a guinea-pig time of a year, five 
kernel cycles, for instance, and that turned out to be about right.

(Interestingly enough, that put raid56 mode feature stability at the soon 
to be released kernel 4.4, which is scheduled to be a long-term-support 
release, so the raid56 mode stability timing worked out rather well, tho 
I had no idea 4.4 would be an LTS when I originally predicted the year's 
settle-time.)

> Is that about right?

=:^)


One further caveat regarding SSDs.

On SSDs, many commonly deployed FTLs do dedup.  Sandforce firmware, where 
dedup is sold as a feature, is known for this.  If the firmware is doing 
dedup, then duplicated data /or/ metadata at the filesystem level is 
simply being deduped at the physical device firmware level, so you end up 
with only one physical copy in any case, and filesystem efforts to 
provide redundancy only end up costing CPU cycles at both the filesystem 
and device-firmware levels, all for naught.  This is a big reason why 
mkfs.btrfs on a single device defaults to single metadata if it detects 
an SSD, despite the normally preferred dup metadata default.

So if you're deploying on SSDs using sandforce firmware or otherwise 
known to do dedup at the FTL, don't bother with any of the above as the 
firmware will be simply defeating your efforts at

[PATCH] btrfs: don't use slab cache for struct btrfs_delalloc_work

2015-12-08 Thread David Sterba

Although we prefer to use separate caches for various structs, it seems
better not to do that for struct btrfs_delalloc_work. Objects of this
type are allocated rarely, when transaction commit calls
btrfs_start_delalloc_roots, requesting delayed iputs.

The objects are temporary (with some IO involved) but still allocated
and freed within __start_delalloc_inodes. Memory allocation failure is
handled.

The slab cache is empty most of the time (observed on several systems),
so if we need to allocate a new slab object, the first one has to
allocate a full page. In a potential case of low memory conditions this
might fail with higher probability compared to using the generic slab
caches.

Signed-off-by: David Sterba 
---
 fs/btrfs/inode.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 994490d5fa64..eeae851427fe 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -77,7 +77,6 @@ static const struct file_operations btrfs_dir_file_operations;
 static struct extent_io_ops btrfs_extent_io_ops;
 
 static struct kmem_cache *btrfs_inode_cachep;
-static struct kmem_cache *btrfs_delalloc_work_cachep;
 struct kmem_cache *btrfs_trans_handle_cachep;
 struct kmem_cache *btrfs_transaction_cachep;
 struct kmem_cache *btrfs_path_cachep;
@@ -9174,8 +9173,6 @@ void btrfs_destroy_cachep(void)
kmem_cache_destroy(btrfs_path_cachep);
if (btrfs_free_space_cachep)
kmem_cache_destroy(btrfs_free_space_cachep);
-   if (btrfs_delalloc_work_cachep)
-   kmem_cache_destroy(btrfs_delalloc_work_cachep);
 }
 
 int btrfs_init_cachep(void)
@@ -9210,13 +9207,6 @@ int btrfs_init_cachep(void)
if (!btrfs_free_space_cachep)
goto fail;
 
-   btrfs_delalloc_work_cachep = kmem_cache_create("btrfs_delalloc_work",
-   sizeof(struct btrfs_delalloc_work), 0,
-   SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
-   NULL);
-   if (!btrfs_delalloc_work_cachep)
-   goto fail;
-
return 0;
 fail:
btrfs_destroy_cachep();
@@ -9461,7 +9451,7 @@ struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode,
 {
struct btrfs_delalloc_work *work;
 
-   work = kmem_cache_zalloc(btrfs_delalloc_work_cachep, GFP_NOFS);
+   work = kmalloc(sizeof(*work), GFP_NOFS);
if (!work)
return NULL;
 
@@ -9480,7 +9470,7 @@ struct btrfs_delalloc_work *btrfs_alloc_delalloc_work(struct inode *inode,
 void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work)
 {
wait_for_completion(&work->completion);
-   kmem_cache_free(btrfs_delalloc_work_cachep, work);
+   kfree(work);
 }
 
 /*
-- 
1.8.4.5


Re: Scrub on btrfs single device only to detect errors, not correct them?

2015-12-08 Thread Duncan

Austin S Hemmelgarn posted on Mon, 07 Dec 2015 10:39:05 -0500 as
excerpted:

> On 2015-12-07 10:12, Jon Panozzo wrote:
>> This is what I was thinking as well.  In my particular use-case, parity
>> is only really used today to reconstruct an entire device due to a
>> device failure.  I think if btrfs scrub detected errors on a single
>> device, I could do a "reverse reconstruct" where instead of syncing TO
>> the parity disk, I sync FROM the parity disk TO the btrfs single device
>> with the error, replacing physical blocks that are out of sync with
>> parity (thus repairing the scrub-found errors).  The downside to this
>> approach is I would have to perform the reverse-sync against the entire
>> btrfs block device, which could be much more time-consuming than if I
>> could single out the specific block addresses and just sync those. 
>> That said, I guess option A is better than no option at all.
>>
>> I would be curious if any of the devs or other members of this mailing
>> list have tried to correlate btrfs internal block addresses to a true
>> block-address on the device being used.  Any interesting articles /
>> links that show how to do this?  Not expecting much, but if someone
>> does know, I'd be very grateful.

> I think there is a tool in btrfs-progs to do it, but I've never used it,
> and you would still need to get scrub to spit out actual error addresses
> for you.

btrfs-debug-tree is what you're looking for. =:^)

As I understand things, the complexity is due to btrfs' chunk 
abstraction, along with the multi-device feature.

On a normal filesystem, byte or block addresses are mapped linearly to 
absolute filesystem byte address and there's just the one device to worry 
about, so there's effectively little or no translation to be done.

On btrfs by contrast, block addresses map into chunks, also known as
block groups, which are designed to be more or less arbitrarily
relocatable within the filesystem using balance (originally called the
restriper).  Further, these block groups can be single, striped across
multiple devices (raid0 and the 0 side of raid10), duplicated on the same
device (dup) or across multiple devices (only two devices currently,
N-way-mirroring is on the roadmap; raid1 and the 1 side of raid10), or
striped with parity (raid5 and 6).

So while block addresses can map more or less linearly into block groups, 
btrfs has to maintain an entirely new layer of abstraction mapping in 
addition, that tells the filesystem where to look for that block group, 
that is, on what device (or across what devices if striped), and at what 
absolute bytenr offset into the device.

And again, keep in mind that even with a constant single/dup/raid mapping 
and even in the simplest single mode on single device, balance can and 
does more or less arbitrarily dynamically relocate block groups within 
the filesystem, so the mapping you see today may or may not be the 
mapping you see tomorrow, depending on whether a balance was run in the 
mean time.
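
To make the extra mapping layer concrete, here is a deliberately simplified sketch of a logical-to-physical lookup for the simplest case (single profile, one stripe per chunk). The struct `demo_chunk_map` and the function are hypothetical; real btrfs keeps this information in the chunk tree and handles striping, mirroring and parity, none of which is modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-stripe chunk record: where a logical range lives on disk. */
struct demo_chunk_map {
	uint64_t logical_start;   /* first logical byte covered by the chunk */
	uint64_t length;          /* number of bytes covered */
	int devid;                /* device holding the single stripe */
	uint64_t physical_start;  /* byte offset of the stripe on that device */
};

/*
 * Translate a logical address into (devid, physical offset).
 * Returns 0 on success, -1 if no chunk covers the address.
 */
static int demo_logical_to_physical(const struct demo_chunk_map *maps,
				    size_t count, uint64_t logical,
				    int *devid, uint64_t *physical)
{
	for (size_t i = 0; i < count; i++) {
		const struct demo_chunk_map *m = &maps[i];

		if (logical >= m->logical_start &&
		    logical - m->logical_start < m->length) {
			*devid = m->devid;
			*physical = m->physical_start +
				    (logical - m->logical_start);
			return 0;
		}
	}
	return -1;
}
```

After a balance, the `physical_start` (and possibly `devid`) of a record would change while the data stays the same, which is why a mapping read today may not hold tomorrow.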

Obviously the devs are going to need a tool to help them debug this 
additional complexity, and that's where btrfs-debug-tree comes in. =:^)

But for "ordinary mortal admins", yes, btrfs is open source and
btrfs-debug-tree is available for those that want to use it, but once 
they realize the complexity, most (including me) are going to simply be 
content to treat it as a black box and not worry too much about 
investigating its innards.

So while specific block and/or byte mapping can be done and there's tools 
available for and appropriate to the task, it's the type of thing most 
admins are very content to treat as a black box and leave well enough 
alone, once they understand the complexities involved.

"Btrfs, while he might use it, it ain't your grandfather's 
filesystem!" (TM) =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Re: Scrub: no spae left on device

2015-12-08 Thread Lionel Bouton

Le 08/12/2015 16:06, Marc MERLIN a écrit :
> Howdy,
>
> Why would scrub need space and why would it cancel if there isn't enough of
> it?
> (kernel 4.3)
>
> /etc/cron.daily/btrfs-scrub:
> btrfs scrub start -Bd /dev/mapper/cryptroot
> scrub device /dev/mapper/cryptroot (id 1) done
>   scrub started at Mon Dec  7 01:35:08 2015 and finished after 258 seconds
>   total bytes scrubbed: 130.84GiB with 0 errors
> btrfs scrub start -Bd /dev/mapper/pool1
> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on device)
> scrub device /dev/mapper/pool1 (id 1) canceled

I can't be sure (not-a-dev), but one possibility that comes to mind is
that if an error is detected, writes must be done on the device. The
repair might not be done in-place but with CoW, and even if the error is
not repaired for lack of redundancy, IIRC each device tracks the number of
errors detected, so I assume this is written somewhere (system or
metadata chunks most probably).

Best regards,

Lionel

Re: Scrub: no space left on device

2015-12-08 Thread Lionel Bouton

On 08/12/2015 16:37, Holger Hoffstätte wrote:
> On 12/08/15 16:06, Marc MERLIN wrote:
>> Howdy,
>>
>> Why would scrub need space and why would it cancel if there isn't enough of
>> it?
>> (kernel 4.3)
>>
>> /etc/cron.daily/btrfs-scrub:
>> btrfs scrub start -Bd /dev/mapper/cryptroot
>> scrub device /dev/mapper/cryptroot (id 1) done
>>  scrub started at Mon Dec  7 01:35:08 2015 and finished after 258 seconds
>>  total bytes scrubbed: 130.84GiB with 0 errors
>> btrfs scrub start -Bd /dev/mapper/pool1
>> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on 
>> device)
>> scrub device /dev/mapper/pool1 (id 1) canceled
> Scrub rewrites metadata (apparently even in -r aka readonly mode), and that
> can lead to temporary metadata expansion (stuff gets COWed around); it's
> a bit surprising but makes sense if you think about it.

How long must I think about it until it makes sense? :-)

Sorry, I'm not sure why metadata is rewritten if no error is detected.
I have several theories but lack information: is the fact that no error
has been detected stored somewhere? Is scrub using some kind of internal
temporary snapshot(s) to avoid interfering with other operations? Is there
some other reason I didn't think of?

Lionel

Re: Scrub: no space left on device

2015-12-08 Thread Marc MERLIN

On Tue, Dec 08, 2015 at 04:46:32PM +0100, Lionel Bouton wrote:
> On 08/12/2015 16:37, Holger Hoffstätte wrote:
> > On 12/08/15 16:06, Marc MERLIN wrote:
> >> Howdy,
> >>
> >> Why would scrub need space and why would it cancel if there isn't enough of
> >> it?
> >> (kernel 4.3)
> >>
> >> /etc/cron.daily/btrfs-scrub:
> >> btrfs scrub start -Bd /dev/mapper/cryptroot
> >> scrub device /dev/mapper/cryptroot (id 1) done
> >>scrub started at Mon Dec  7 01:35:08 2015 and finished after 258 seconds
> >>total bytes scrubbed: 130.84GiB with 0 errors
> >> btrfs scrub start -Bd /dev/mapper/pool1
> >> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left 
> >> on device)
> >> scrub device /dev/mapper/pool1 (id 1) canceled
> > Scrub rewrites metadata (apparently even in -r aka readonly mode), and that
> > can lead to temporary metadata expansion (stuff gets COWed around); it's
> > a bit surprising but makes sense if you think about it.
> 
> How long must I think about it until it makes sense? :-)
> 
> Sorry I'm not sure why metadata is rewritten if no error is detected.
> I've several theories but lack information: is the fact that no error
> has been detected stored somewhere? is scrub using some kind of internal
> temporary snapshot(s) to avoid interfering with other operations? other
> reason I didn't think about?

Yeah, I was also wondering why metadata should be rewritten on a single
device scrub.
Does not make sense to me.

And this is what I got:
legolas:~# btrfs balance start -musage=10 -v /mnt/btrfs_pool1/ 
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=10
  SYSTEM (flags 0x2): balancing, usage=10
ERROR: error during balancing '/mnt/btrfs_pool1/' - No space left on device
There may be more info in syslog - try dmesg | tail

Ok, that sucks.

legolas:~# btrfs balance start -musage=0 -v /mnt/btrfs_pool1/
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=0
  SYSTEM (flags 0x2): balancing, usage=0
Done, had to relocate 0 out of 618 chunks

This worked. Mmmh, I thought this wouldn't be necessary anymore in 4.3 kernels?

legolas:~# btrfs balance start -musage=10 -v /mnt/btrfs_pool1
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=10
  SYSTEM (flags 0x2): balancing, usage=10
Done, had to relocate 1 out of 618 chunks

And now I'm back in business...

Still, this is a bit disappointing and at the very least very unexpected in 4.3.

legolas:~# btrfs fi df /mnt/btrfs_pool1
Data, single: total=604.88GiB, used=520.09GiB
System, DUP: total=32.00MiB, used=96.00KiB
Metadata, DUP: total=5.00GiB, used=4.17GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
legolas:~# btrfs fi show /mnt/btrfs_pool1
Label: 'btrfs_pool1'  uuid: 5ee24229-2431-448a-868e-2c325d10bfa7
	Total devices 1 FS bytes used 524.26GiB
	devid    1 size 615.01GiB used 614.94GiB path /dev/mapper/pool1


Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

Re: Scrub: no space left on device

2015-12-08 Thread Marc MERLIN

On Tue, Dec 08, 2015 at 05:24:16PM +0100, Holger Hoffstätte wrote:
> On 12/08/15 17:06, Marc MERLIN wrote:
> > Label: 'btrfs_pool1'  uuid: 5ee24229-2431-448a-868e-2c325d10bfa7
> > 	Total devices 1 FS bytes used 524.26GiB
> > 	devid    1 size 615.01GiB used 614.94GiB path /dev/mapper/pool1
> 
> This is what I was alluding to. You could have started a -dusage balance
> *before* the scrub so that one or several data chunks get freed.
> Balancing metadata when you're out of space accomplishes nothing and only
> will very likely fail, just as you saw. You have ~90GB usable space, but
> that space is spread over chunks with low utilisation.

Yes, my partition got a bit full, I freed up space, and unfortunately we
still don't have a background rebalance to fix this, so I did run a manual
one.
But my filesystem was usable, I was writing to it just fine. I was just very
surprised that scrub needed to rewrite blocks on a single disk device.

You could make the case that scrub and a usage=0 balance should be run together.
In the meantime, I upgraded my script:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair
http://marc.merlins.org/linux/scripts/btrfs-scrub

I figured there is no good reason not to run a usage=20 balance on metadata
and data every night.
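
For readers who want to try the same idea, here is a minimal sketch of a
"balance first, then scrub" nightly job. It is not the script linked above;
the function names and the dry-run default are assumptions made for
illustration. The only btrfs commands used are the ones already shown in
this thread (`btrfs balance start` with usage filters and
`btrfs scrub start -Bd`).

```python
import subprocess


def maintenance_commands(mountpoint, usage=20):
    """Build the balance and scrub command lines for one mounted btrfs filesystem."""
    return [
        # Compact data and metadata chunks that are at most `usage` percent full.
        ["btrfs", "balance", "start", f"-dusage={usage}", f"-musage={usage}", mountpoint],
        # -B: stay in the foreground, -d: print per-device statistics.
        ["btrfs", "scrub", "start", "-Bd", mountpoint],
    ]


def run_maintenance(mountpoint, usage=20, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in maintenance_commands(mountpoint, usage):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the job at the first failing step.
            subprocess.run(cmd, check=True)


# Dry run: only prints the two commands, nothing is executed.
run_maintenance("/mnt/btrfs_pool1")
```

Running the balance before the scrub follows Holger's suggestion elsewhere
in this thread: freeing a chunk or two first gives scrub room to work.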

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

Re: Scrub: no space left on device

2015-12-08 Thread Holger Hoffstätte

On 12/08/15 16:06, Marc MERLIN wrote:
> Howdy,
> 
> Why would scrub need space and why would it cancel if there isn't enough of
> it?
> (kernel 4.3)
> 
> /etc/cron.daily/btrfs-scrub:
> btrfs scrub start -Bd /dev/mapper/cryptroot
> scrub device /dev/mapper/cryptroot (id 1) done
>   scrub started at Mon Dec  7 01:35:08 2015 and finished after 258 seconds
>   total bytes scrubbed: 130.84GiB with 0 errors
> btrfs scrub start -Bd /dev/mapper/pool1
> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on 
> device)
> scrub device /dev/mapper/pool1 (id 1) canceled

Scrub rewrites metadata (apparently even in -r aka readonly mode), and that
can lead to temporary metadata expansion (stuff gets COWed around); it's
a bit surprising but makes sense if you think about it. The fact that you
ENOSPCed means that the fs was probably already fully allocated.

If it bothers you, a subsequent balance with -musage=10 should vacuum things
up. Alternatively just keep using the filesystem; eventually the empty metadata
chunks should be collected, on the next remount at the latest.

tl;dr: Never allocate all the chunks. Yes, this needs more graceful handling.

-h


Re: Scrub: no space left on device

2015-12-08 Thread Austin S Hemmelgarn


On 2015-12-08 10:06, Marc MERLIN wrote:
> Howdy,
>
> Why would scrub need space and why would it cancel if there isn't enough of
> it?
> (kernel 4.3)

Wild guess here, but maybe scrub unconditionally updates the error 
counters, regardless of whether any errors were found or not?






Re: Scrub: no space left on device

2015-12-08 Thread Holger Hoffstätte

On 12/08/15 16:46, Lionel Bouton wrote:
> On 08/12/2015 16:37, Holger Hoffstätte wrote:
>> On 12/08/15 16:06, Marc MERLIN wrote:
>>> Howdy,
>>>
>>> Why would scrub need space and why would it cancel if there isn't enough of
>>> it?
>>> (kernel 4.3)
>>>
>>> /etc/cron.daily/btrfs-scrub:
>>> btrfs scrub start -Bd /dev/mapper/cryptroot
>>> scrub device /dev/mapper/cryptroot (id 1) done
>>> scrub started at Mon Dec  7 01:35:08 2015 and finished after 258 seconds
>>> total bytes scrubbed: 130.84GiB with 0 errors
>>> btrfs scrub start -Bd /dev/mapper/pool1
>>> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1 (No space left on 
>>> device)
>>> scrub device /dev/mapper/pool1 (id 1) canceled
>> Scrub rewrites metadata (apparently even in -r aka readonly mode), and that
>> can lead to temporary metadata expansion (stuff gets COWed around); it's
>> a bit surprising but makes sense if you think about it.
> 
> How long must I think about it until it makes sense? :-)
> 
> Sorry I'm not sure why metadata is rewritten if no error is detected.
> I've several theories but lack information: is the fact that no error
> has been detected stored somewhere? is scrub using some kind of internal
> temporary snapshot(s) to avoid interfering with other operations? other
> reason I didn't think about?

Well..I have no idea what the historical motivation for this behaviour was,
even though I can make up at least two: rewriting known-good checksums
generally (since you know they are good this very moment), and in case of
error avoiding the area where the block error occurred (read errors on rust
are often clustered and affect entire tracks).

That's really all I know. I agree it's surprising, especially since it
happens by default and also in -r mode, which might be considered a bug.

-h


Re: [PATCH 1/4] locks: new locks_mandatory_area calling convention

2015-12-08 Thread Al Viro

On Tue, Dec 08, 2015 at 03:54:53PM +0100, Christoph Hellwig wrote:
> On Tue, Dec 08, 2015 at 04:05:04AM +0000, Al Viro wrote:
> > Where the hell would truncate(2) get struct file, anyway?  IOW, the inode
> > argument is _not_ pointless; re-added.
> 
> Oh, right.  Interestingly it seems like xfstests has no coverage of this
> code path at all.

LTP does (ftruncate04)...

Re: Scrub: no space left on device

2015-12-08 Thread Holger Hoffstätte

On 12/08/15 17:06, Marc MERLIN wrote:
> Label: 'btrfs_pool1'  uuid: 5ee24229-2431-448a-868e-2c325d10bfa7
> 	Total devices 1 FS bytes used 524.26GiB
> 	devid    1 size 615.01GiB used 614.94GiB path /dev/mapper/pool1

This is what I was alluding to. You could have started a -dusage balance
*before* the scrub so that one or several data chunks get freed.
Balancing metadata when you're out of space accomplishes nothing and
will very likely fail, just as you saw. You have ~90GB usable space, but
that space is spread over chunks with low utilisation.
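
The arithmetic behind that statement can be sketched as follows. The figures
are copied from the `btrfs fi df` and `btrfs fi show` output Marc posted in
this thread; the function and variable names are made up for illustration.

```python
# Figures (in GiB) copied from the output quoted in this thread.
device_size = 615.01       # "devid 1 size"
device_allocated = 614.94  # "devid 1 used": space already handed out to chunks
data_total = 604.88        # "Data, single: total": allocated to data chunks
data_used = 520.09         # "Data, single: used"


def summarize(device_size, device_allocated, data_total, data_used):
    """Return (unallocated, free_inside_data_chunks) in GiB, rounded to 2 places."""
    unallocated = round(device_size - device_allocated, 2)
    free_in_data_chunks = round(data_total - data_used, 2)
    return unallocated, free_in_data_chunks


unallocated, free_in_data_chunks = summarize(
    device_size, device_allocated, data_total, data_used
)
print(f"unallocated on device:   {unallocated:.2f} GiB")
print(f"free inside data chunks: {free_in_data_chunks:.2f} GiB")
```

Only 0.07 GiB of the device is unallocated, so no new chunk can be created,
while 84.79 GiB (about 91 GB) sits unused inside data chunks that are
already allocated. A -dusage balance is what gives that space back.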

-h


Re: [PATCH v2] btrfs: Introduce new mount option to disable tree log replay

2015-12-08 Thread Austin S Hemmelgarn


On 2015-12-08 01:10, Qu Wenruo wrote:
> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
> option to get real readonly mount, like "norecovery" in ext* and xfs.
>
> Since the new parse_options() need to check new flags at remount time,
> so add a new parameter for parse_options().
>
> Signed-off-by: Qu Wenruo
> ---
> v2:
>    Make RO check mandatory for btrfs_parse_options().
>    Add btrfs_show_options() support for nologreplay.
>
>    Document for btrfs-mount(5) will follow after the patch being merged.


Same set of tests I ran against the last version, still no issues, so:
Tested-by: Austin S. Hemmelgarn





Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay

2015-12-08 Thread Christoph Anton Mitterer

On Tue, 2015-12-08 at 07:15 -0500, Austin S Hemmelgarn wrote:
> Despite this, it really isn't a widely known or well documented behavior
> outside of developers, forensic specialists, and people who have had to
> deal with the implications it has on data recovery.  There really isn't
> any way that the user would know about it without being explicitly told,
> and it's something that can have a serious impact on being able to
> recover a broken filesystem.  TBH, I really feel that _every_
> filesystem's documentation should have something about how to make it
> mount truly read-only, even if it's just a reference to how to mark the
> block device read-only.
Exactly what I meant.

And the developers here should definitely consider that every normal
end-user may easily assume the role of e.g. a forensics specialist
(especially with btrfs ;-) ) when recovery is attempted after
corruption.

I don't think that "it has always been improperly documented" (i.e. the
"ro" option) is a good excuse to continue doing it that way =)

Cheers,
Chris.


Re: [PATCH] btrfs: Introduce new mount option to disable tree log replay

2015-12-08 Thread Austin S Hemmelgarn


On 2015-12-08 14:20, Christoph Anton Mitterer wrote:

> On Tue, 2015-12-08 at 07:15 -0500, Austin S Hemmelgarn wrote:
>> Despite this, it really isn't a widely known or well documented behavior
>> outside of developers, forensic specialists, and people who have had to
>> deal with the implications it has on data recovery.  There really isn't
>> any way that the user would know about it without being explicitly told,
>> and it's something that can have a serious impact on being able to
>> recover a broken filesystem.  TBH, I really feel that _every_
>> filesystem's documentation should have something about how to make it
>> mount truly read-only, even if it's just a reference to how to mark the
>> block device read-only.
>
> Exactly what I've meant.
>
> And the developers here, should definitely consider that every normal
> end-user, may easily assume the role of e.g. a forensics specialist
> (especially with btrfs ;-) ), when recovery in case of corruptions is
> tried.
>
> I don't think that "it has always been improperly documented" (i.e. the
> "ro" option) is a good excuse to continue doing it that way =)

Agreed, 'but it's always been that way' is never a valid argument, and 
the fact that people who have been working on UNIX for decades know it 
doesn't mean that it's something that people will just inherently know. 
 The only reason it was that way to begin with is because it was 
assumed that everyone dealing with computers had a huge amount of domain 
specific knowledge of them (this was a valid assumption back in 1970, it 
hasn't been a valid assumption since at least 1990).


Stuff that seems obvious to people who have been working on it for years 
isn't necessarily obvious to people who have limited experience with it 
(I recently had to explain to a friend who had almost no networking 
background how IP addresses are just an abstraction for MAC addresses, 
and how it's not possible to block WiFi access based on an IP address; 
it took me three tries and eventually making the analogy of street 
addresses being an abstraction for geographical coordinates before he 
finally got it).


TBH, the only reason I knew about this rather annoying detail of 
filesystem implementation before using BTRFS is because of dealing with 
shared storage on VM's (that was an interesting week of debugging and 
restoring backups before I finally figured out what was going on).






btrfs scrub can neither start nor cancel

2015-12-08 Thread Wolfgang Rohdewald

I just tried this script:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair

but I did not pass the directory where the filesystem is mounted.

Next I called it correctly: btrfs-scrub /t4
I also tried btrfs scrub start / cancel directly, but 
I am not really sure what I did in which order.

Anyway now I can neither cancel nor start btrfs scrub. Rebooting did not help.
Running unmodified Linux 4.3

It seems like scrub stopped and did not clean up. Maybe because:
Dec  8 21:07:41 s5 kernel: [17833.840868] btrfs[23746]: segfault at 
ff98 ip 004079e1 sp 7fffafa27510 error 5 in 
btrfs[40+53000]

How can I now clean this up?

root@s5:~# btrfs --version
Btrfs v3.12

root@s5:~# btrfs scrub status /t4
scrub status for 700900de-e35f-4264-8f5d-1b2b249a5c3a
scrub started at Tue Dec  8 21:05:31 2015, running for 20 seconds
total bytes scrubbed: 3.09GiB with 0 errors

root@s5:~# btrfs scrub cancel /t4
ERROR: scrub cancel failed on /t4: not running

root@s5:~# btrfs scrub start /t4
ERROR: scrub is already running.
To cancel use 'btrfs scrub cancel /t4'.
To see the status use 'btrfs scrub status [-d] /t4'.


-- 
Wolfgang

Re: btrfs scrub can neither start nor cancel

2015-12-08 Thread Hugo Mills

On Tue, Dec 08, 2015 at 09:46:48PM +0100, Wolfgang Rohdewald wrote:
> I just tried this script:
> http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair
> 
> but I did not pass the directory where the filesystem is mounted.
> 
> Next I called it correctly: btrfs-scrub /t4
> I also tried btrfs scrub start / cancel directly, but 
> I am not really sure what I did in which order.
> 
> Anyway now I can neither cancel nor start btrfs scrub. Rebooting did not help.

   It might be that the userspace tools have got confused and left
behind a lock/pid/progress file in /var/lib/btrfs/

   Take a look in there and see if there's anything that you can
delete to good effect?

   Hugo.

> Running unmodified Linux 4.3
> 
> It seems like scrub stopped and did not clean up. Maybe because:
> Dec  8 21:07:41 s5 kernel: [17833.840868] btrfs[23746]: segfault at 
> ff98 ip 004079e1 sp 7fffafa27510 error 5 in 
> btrfs[40+53000]
> 
> How can I now clean this up?
> 
> root@s5:~# btrfs --version
> Btrfs v3.12
> 
> root@s5:~# btrfs scrub status /t4
> scrub status for 700900de-e35f-4264-8f5d-1b2b249a5c3a
> scrub started at Tue Dec  8 21:05:31 2015, running for 20 seconds
> total bytes scrubbed: 3.09GiB with 0 errors
> 
> root@s5:~# btrfs scrub cancel /t4
> ERROR: scrub cancel failed on /t4: not running
> 
> root@s5:~# btrfs scrub start /t4
> ERROR: scrub is already running.
> To cancel use 'btrfs scrub cancel /t4'.
> To see the status use 'btrfs scrub status [-d] /t4'.
> 
> 

-- 
Hugo Mills | Go not to the elves for counsel, for they will say
hugo@... carfax.org.uk | both no and yes.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |



Re: Fixing recursive fault and parent transid verify failed

2015-12-08 Thread Alistair Grant

On Tue, Dec 08, 2015 at 03:25:14PM +0000, Duncan wrote:
> Alistair Grant posted on Tue, 08 Dec 2015 06:55:04 +1100 as excerpted:
> 
> > On Mon, Dec 07, 2015 at 01:48:47PM +0000, Duncan wrote:
> >> Alistair Grant posted on Mon, 07 Dec 2015 21:02:56 +1100 as excerpted:
> >> 
> >> > I think I'll try the btrfs restore as a learning exercise, and to
> >> > check the contents of my backup (I don't trust my memory, so
> >> > something could have changed since the last backup).
> >> 
> >> Trying btrfs restore is an excellent idea.  It'll make things far
> >> easier if you have to use it for real some day.
> >> 
> >> Note that while I see your kernel is reasonably current (4.2 series), I
> >> don't know what btrfs-progs ubuntu ships.  There have been some marked
> >> improvements to restore somewhat recently, checking the wiki
> >> btrfs-progs release-changelog list says 4.0 brought optional metadata
> >> restore, 4.0.1 added --symlinks, and 4.2.3 fixed a symlink path check
> >> off-by-one error. (And don't use 4.1.1 as its mkfs.btrfs is broken and
> >> produces invalid filesystems.)  So you'll want at least progs 4.0 to
> >> get the optional metadata restoration, and 4.2.3 to get full symlinks
> >> restoration support.
> >> 
> >> ...

Thanks again Duncan for your assistance.

I plugged the ext4 drive I planned to use for the recovery into the
machine and immediately got a couple of errors, which makes me wonder
whether there is a hardware problem with the machine somewhere.  So
I decided to move to another machine to do the recovery.

So I'm now recovering on Arch Linux 4.1.13-1 with btrfs-progs v4.3.1
(the latest version from archlinuxarm.org).

Attempting:

sudo btrfs restore -S -m -v /dev/sdb /mnt/btrfs-recover/ 2>&1 | tee btrfs-recover.log

only recovered 53 of the more than 106,000 files that should be available.

The log is available at: 

https://www.dropbox.com/s/p8bi6b8b27s9mhv/btrfs-recover.log?dl=0

I did attempt btrfs-find-root, but couldn't make sense of the output:

https://www.dropbox.com/s/qm3h2f7c6puvd4j/btrfs-find-root.log?dl=0

Simply mounting the drive, then re-mounting it read only, and rsync'ing
the files to the backup drive recovered 97,974 files before crashing.
If anyone is interested, I've uploaded a photo of the console to:

https://www.dropbox.com/s/xbrp6hiah9y6i7s/rsync%20crash.jpg?dl=0

I'm currently running a hashdeep audit between the recovered files and
the backup to see how the recovery went.

If you'd like me to try any other tests, I'll keep the damaged file
system for at least the next day or so.

Thanks again for all your assistance,
Alistair


Re: btrfs scrub can neither start nor cancel

2015-12-08 Thread Wolfgang Rohdewald

On Tuesday, 8 December 2015, 20:51:08, Hugo Mills wrote:
> On Tue, Dec 08, 2015 at 09:46:48PM +0100, Wolfgang Rohdewald wrote:
> > Anyway now I can neither cancel nor start btrfs scrub. Rebooting did not 
> > help.
> 
>It might be that the userspace tools has got confused and left
> behind a lock/pid/progress file in /var/lib/btrfs/
> 
>Take a look in there and see if there's anything that you can
> delete to good effect?

root@s5:/var/lib/btrfs# ls -l
insgesamt 4
srwxr-xr-x 1 root root   0 Dez  8 21:05 scrub.progress.700900de-e35f-4264-8f5d-1b2b249a5c3a
-rw------- 1 root root 394 Dez  8 21:05 scrub.status.700900de-e35f-4264-8f5d-1b2b249a5c3a

that fixed it, thanks!

I would have expected that such temporary files are deleted at reboot, so
to me this looks like a bug in user-space.
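
For anyone hitting the same state, a small sketch for locating the leftover
files before removing anything. It assumes the naming seen in the listing
above (`scrub.progress.<uuid>` and `scrub.status.<uuid>` under
/var/lib/btrfs); the helper name is made up, and it only lists, it does
not delete.

```python
import os


def scrub_state_files(fs_uuid, state_dir="/var/lib/btrfs"):
    """List existing scrub.progress.<uuid> / scrub.status.<uuid> entries for one filesystem."""
    candidates = [
        os.path.join(state_dir, f"scrub.{kind}.{fs_uuid}")
        for kind in ("progress", "status")
    ]
    # lexists() also reports sockets and dangling symlinks, and returns
    # False instead of raising when the directory is missing or unreadable.
    return [path for path in candidates if os.path.lexists(path)]


# UUID taken from the `btrfs scrub status` output earlier in this thread.
for path in scrub_state_files("700900de-e35f-4264-8f5d-1b2b249a5c3a"):
    print(path)
```

Whether it is safe to delete what this prints depends on no scrub actually
running, so check `btrfs scrub status` first.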


-- 
Wolfgang

Re: [RFC] Btrfs device and pool management (wip)

2015-12-08 Thread Christoph Anton Mitterer

On Mon, 2015-11-30 at 13:17 -0700, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn wrote:
>
> > General thoughts on this:
> > 1. If there's a write error, we fail unconditionally right now.  It would be
> > nice to have a configurable number of retries before failing.
>
> I'm unconvinced. I pretty much immediately do not trust a block device
> that fails even a single write, and I'd expect the file system to
> quickly get confused if it can't rely on flushing pending writes to
> that device.
From my large-amounts-of-storage-admin PoV,... I'd say it would be nice
to have more knobs to control when exactly a device is considered no
longer perfectly fine, which can include several different stages like:
- perhaps unreliable
  e.g. maybe the device shows SMART problems or there were correctable
  read and/or write errors under a certain threshold (either in total,
  or per time period)
  Then I could imagine that one can control whether the device is:
  - continued to be used normally until certain error thresholds are
    exceeded.
  - placed in a mode where data is still written to it, but only when
    there's a duplicate on at least one other good device,... so the
    device would be used as read pool
    maybe optionally, data already on the device is auto-replicated to
    good devices
  - put offline (perhaps only to be automatically reused in case of
    emergency (as a hot spare) when the fs knows that otherwise it's
    even more likely that data would be lost soon)
- failed
  the threshold from above has been reached, the fs suspects the
  device to completely fail soon
  Possible knobs would include how aggressively the fs tries to move
  data off the device.
  How often should retries be made? In case the other devices are
  under high IO load, what percentage should be used to get the
  still working data off the bad device (i.e. up to 100%, meaning
  "rather stop any other IO, just to move the data to good devices
  ASAP")?
- dead
  accesses don't work anymore at all and the fs shouldn't even waste
  time trying to read/recover data from it.

It would also make sense to allow tuning what conditions need be met to
e.g. consider a drive unreliable (e.g. which SMART errors?) and to
allow an admin to manually place a drive in a certain state (e.g. SMART
would still be good, no IO errors so far, but the drive is 5 years old
and I'd rather consider it unreliable).
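
To make the proposed stages a bit more concrete, here is a rough sketch of
how such a classification might look. Everything in it is hypothetical:
none of these names, states or knobs exist in btrfs today, and the single
threshold is a stand-in for the richer set of knobs described above.

```python
from dataclasses import dataclass
from enum import Enum


class DeviceState(Enum):
    GOOD = "good"
    PERHAPS_UNRELIABLE = "perhaps unreliable"
    FAILED = "failed"
    DEAD = "dead"


@dataclass(frozen=True)
class Policy:
    # Hypothetical knob: number of corrected errors at which a device
    # counts as failed. The default of 1 is the most conservative
    # setting: the very first error already marks the device as failed.
    fail_threshold: int = 1


def classify(responds, corrected_errors, smart_warning,
             policy=Policy(), admin_override=None):
    """Map observed device health to one of the proposed stages."""
    if not responds:
        return DeviceState.DEAD          # no access works at all any more
    if admin_override is not None:
        return admin_override            # e.g. "old drive, treat as unreliable"
    if corrected_errors >= policy.fail_threshold:
        return DeviceState.FAILED
    if smart_warning or corrected_errors > 0:
        return DeviceState.PERHAPS_UNRELIABLE
    return DeviceState.GOOD


# A relaxed policy tolerates a few corrected errors before giving up.
print(classify(True, 2, False, Policy(fail_threshold=5)))
```

With the default policy a single corrected error is enough to mark the
device as failed; an admin would have to raise the threshold on purpose to
get the more tolerant behaviour.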

That's - to some extent - what we at our LHC Tier-2 do at higher levels
(partly simply by human management, partly via the storage management
system we use (dCache), partly by RAID and other tools and scripting).

In any case, though,... any of these knobs should IMHO default to the
most conservative settings.
In other words: If a device shows the slightest hint of being
unstable/unreliable/failed... it should be considered bad, no new data
should go on it (if necessary, because not enough other devices are
left, the fs should get ro).
The only thing I don't have an opinion on is: should the fs go ro and do
nothing, waiting for a human to decide what's next, or should it go ro
and (if possible) try to move data off the bad device (per default).

Generally, a filesystem should be safe per default (which is why I see
the issue in the other thread with the corruption/security leaks in
case of UUID collisions quite a showstopper).
From the admin side, I don't want to be required to make it safe,.. my
interaction should rather only be needed to tune things.

Of course I'm aware that btrfs brings several techniques which make it
unavoidable that more maintenance is put into the filesystem, but, per
default, this should be minimised as far as possible.

Cheers,
Chris.


Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

2015-12-08 Thread Christoph Anton Mitterer

Hey Hugo,


On Thu, 2015-11-26 at 00:33 +0000, Hugo Mills wrote:
>    Answering the second part first, no, it can't.
Thanks so far :)


>    The issue is that nodatacow bypasses the transactional nature of
> the FS, making changes to live data immediately. This then means that
> if you modify a nodatacow file, the csum for that modified section is
> out of date, and won't be back in sync again until the latest
> transaction is committed. So you can end up with an inconsistent
> filesystem if there's a crash between the two events.
Sure,... (and btw: is there some kind of journal planned for
nodatacow'ed files?),... but why not simply try to write an updated
checksum after the modified section has been flushed to disk? Of
course there's no guarantee that both are consistent in case of a crash
(but that's also the case without any checksum)... but at least one
would have the csum protection against everything else (block errors and
the like) in case no crash occurs?
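
The window Hugo describes can be illustrated with a toy model. This is not
btrfs code: the `Block` class is invented for the example, and `zlib.crc32`
merely stands in for the filesystem's real checksum.

```python
import zlib


def csum(block: bytes) -> int:
    # Stand-in checksum; only the principle matters here.
    return zlib.crc32(block)


class Block:
    """One data block plus its separately stored checksum."""

    def __init__(self, data: bytes):
        self.data = data
        self.stored_csum = csum(data)

    def write_in_place(self, new_data: bytes):
        # Step 1: overwrite the live data immediately (nodatacow-style).
        self.data = new_data

    def update_csum(self):
        # Step 2: the checksum only catches up later.
        self.stored_csum = csum(self.data)

    def verifies(self) -> bool:
        return csum(self.data) == self.stored_csum


blk = Block(b"old contents")
blk.write_in_place(b"new contents")
in_window = blk.verifies()    # a crash here leaves data and csum out of step
blk.update_csum()
after_update = blk.verifies()
print(in_window, after_update)
```

Between the two steps the block fails verification even though nothing is
wrong with the data, which is the inconsistency a crash would freeze in
place. After the second step the checksum protects the block again, which
is the benefit being asked about.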



> > For me the checksumming is actually the most important part of btrfs
> > (not that I wouldn't like its other features as well)... so turning it
> > off is something I really would want to avoid.
> >
> > Plus it opens questions like: When there are no checksums, how can it
> > (in the RAID cases) decide which block is the good one in case of
> > corruptions?
>    It doesn't decide -- both copies look equally good, because there's
> no checksum, so if you read the data, the FS will return whatever data
> was on the copy it happened to pick.
Hmm I see... so one gets basically the behaviour of conventional RAID.
Isn't that kind of a big loss? I always considered the guarantee
against block errors and the like one of the big and basic features of
btrfs.
It seems that for certain (not too unimportant) cases (DBs, VMs) one has
to decide between two evils: losing the guaranteed consistency via
checksums... or basically running into severe troubles (like Mitch's
reported fragmentation issues).
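
A toy illustration of the difference, assuming two mirrored copies of one
block. The function is invented for this example and `zlib.crc32` again
only stands in for the real checksum.

```python
import zlib


def read_mirrored(copies, stored_csum=None):
    """Return a copy of the block, verified against stored_csum when one exists."""
    if stored_csum is None:
        # No checksum: nothing to verify against, so whichever copy
        # happens to be picked is returned as-is.
        return copies[0]
    for copy in copies:
        if zlib.crc32(copy) == stored_csum:
            return copy
    raise IOError("no copy matches the stored checksum")


good = b"database page"
bad = b"databaXe page"           # one corrupted byte on the first mirror
expected = zlib.crc32(good)

print(read_mirrored([bad, good], expected))  # checksum selects the intact copy
print(read_mirrored([bad, good]))            # without it, the corrupted copy is returned
```

With a checksum the reader can skip the damaged mirror; without one it has
no basis for preferring either copy.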


> > 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
> 
>    After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.
I see... something that should possibly go to some advanced admin
documentation (if not already in).
It means basically that one must ensure that any such files (VM
images, DB data dirs) are already created with nodatacow (perhaps on a
subvolume which is mounted as such).


> > 4) Duncan mentioned that defrag (and I guess that's also for auto-
> > defrag) isn't ref-link aware...
> > Isn't that somehow a complete showstopper?
>    It is, but the one attempt at dealing with it caused massive data
> corruption, and it was turned off again.
So... does this mean that it's still planned to be implemented some day
or has it been given up forever?
And is it (hopefully) also planned to be implemented for reflinks when
compression is added/changed/removed?


Given that you (or Duncan?,... sorry I sometimes mix up which of you said
exactly what, since both of you are notoriously helpful :-) ) mentioned
that autodefrag basically fails with larger files,... and given that it
seems to be quite important for btrfs to not be fragmented too heavily,
it sounds a bit as if anything that uses (multiple) reflinks (e.g.
snapshots) cannot really be used very well.


>  autodefrag, however, has
> always been snapshot aware and snapshot safe, and would be the
> recommended approach here.
Ahhh... so autodefrag *is* snapshot aware, and that's basically why the
suggestion is (AFAIU) that it's turned on, right?
So, I'm afraid O:-), that triggers a follow-up question:
Why isn't it the default? Or in other words what are its drawbacks
(e.g. other cases where ref-links would be broken up,... or issues with
compression)?

And also, when I now activate it on an already populated fs, will it
defrag also any old files (even if they're not rewritten or so)?
I tried to have a look for some general (rather "for dummies" than for
core developers) description of how defrag and autodefrag work... but
couldn't find anything in the usual places... :-(

btw: The wiki
(https://btrfs.wiki.kernel.org/index.php/UseCases#How_do_I_defragment_many_files.3F)
doesn't mention that auto-defrag doesn't suffer from that problem.


>  (Actually, it was broken in the same
> incident I just described -- but fixed again when the broken patches
> were reverted).
So it just couldn't be fixed (hopefully: yet) for the (manual) online
defragmentation?!


> > 5) Especially keeping (4) in mind but also the other comments from
> > Duncan and Austin...
> > Is auto-defrag now recommended to be generally used?
>
>    Absolutely, yes.
I see... well, I'll probably wait

Re: Scrub: no space left on device

2015-12-08 Thread Duncan

Marc MERLIN posted on Tue, 08 Dec 2015 08:06:15 -0800 as excerpted:

> On Tue, Dec 08, 2015 at 04:46:32PM +0100, Lionel Bouton wrote:
>> Le 08/12/2015 16:37, Holger Hoffstätte a écrit :
>> > On 12/08/15 16:06, Marc MERLIN wrote:
>> >>
>> >> Why would scrub need space and why would it cancel if there isn't
>> >> enough of it? (kernel 4.3)
>> >>
>> >> btrfs scrub start -Bd /dev/mapper/pool1
>> >> ERROR: scrubbing /dev/mapper/pool1 failed for device id 1
>> >> (No space left on device)
>> >> scrub device /dev/mapper/pool1 (id 1) canceled
>> > Scrub rewrites metadata (apparently even in -r aka readonly mode),
>> > and that can lead to temporary metadata expansion (stuff gets COWed
>> > around); it's a bit surprising but makes sense if you think about it.

Are you sure about that?

My / is mounted ro by default, and if I try to scrub it in normal mode, 
it'll error out due to read-only.  But I can run a read-only scrub just 
fine, and if I find errors, I simply mount it writable and redo the scrub 
without the -r.  (My / is only 8 GiB, under half used including metadata 
on a fast SSD, so scrubs complete in under 30 seconds, and doing a read-
only scrub followed by a mount-writable and a second fixing scrub if 
necessary, is trivial.)

>> Sorry I'm not sure why metadata is rewritten if no error is detected.

But scrub will of course do copy-on-write if there's an error, and it's 
possible that on initialization it checks for space to do a few cows if 
necessary, before it actually checks for the -r read-only flag.  I try to 
leave at least enough unallocated space to do a balance, which of course 
except for -dusage=0 (or -musage=0) writes a new chunk to rewrite 
existing chunks into, so I'd be unlikely to ever get that close to out of 
space to trigger the possible initialization-time space-warning, and thus 
wouldn't know whether it has one or whether it comes before the -r check, 
or not.

> And this is what I got:
> legolas:~# btrfs balance start -musage=10 -v /mnt/btrfs_pool1/
> Dumping filters: flags 0x6, state 0x0, force is off
>   METADATA (flags 0x2): balancing, usage=10
>   SYSTEM (flags 0x2): balancing, usage=10
> ERROR: error during balancing '/mnt/btrfs_pool1/' - No space left on device
> There may be more info in syslog - try dmesg | tail
> 
> Ok, that sucks.
> 
> legolas:~# btrfs balance start -musage=0 -v /mnt/btrfs_pool1/
> Dumping filters: flags 0x6, state 0x0, force is off
>   METADATA (flags 0x2): balancing, usage=0
>   SYSTEM (flags 0x2): balancing, usage=0
> Done, had to relocate 0 out of 618 chunks
> 
> This worked. Mmmh, I thought this wouldn't be necessary anymore in 4.3
> kernels?

Well, it said it had to relocate zero chunks, so it _appears_ that it 
didn't do anything, which would be expected on reasonably current kernels 
as they already clean up zero-usage chunks, automatically.  *BUT*...

> legolas:~# btrfs balance start -musage=10 -v /mnt/btrfs_pool1
> Dumping filters: flags 0x6, state 0x0, force is off
>   METADATA (flags 0x2): balancing, usage=10
>   SYSTEM (flags 0x2):  balancing, usage=10
> Done, had to relocate 1 out of 618 chunks

... if it did nothing in the -musage=0 case above, why did the -musage=10 
case fail before, but succeed after?

That's a very good question I don't have an answer to.  Good question for 
the devs and others that actually read code.

Meanwhile, note that if it relocates only a single chunk (of non-zero 
usage), under normal circumstances, it'll take exactly the same amount of 
space as before, because it'd allocate a new chunk of exactly the same 
size as the one it was rewriting.

However, once remaining unallocated space gets tight enough, it starts 
allocating smaller than normal chunks, which may be what happened this 
time.  Presumably that chunk was originally allocated when the filesystem 
still had much more unallocated free space, so it was a standard size 
chunk.  When it was rewritten, unallocated space was much tighter, so a 
smaller chunk would likely be written, which would then be rather fuller 
than it was previously, as it would have the same amount of metadata in 
it, but be a smaller chunk.
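
As a rough illustration of that last point, here is a small Python
sketch with made-up numbers. The 1 GiB and 256 MiB chunk sizes and the
90 MiB of metadata are assumptions for the example only, and the filter
is modelled simply as "usage below the given percentage" (the special
handling of usage=0 is not modelled).

```python
def usage_percent(used_mib, chunk_mib):
    """Percentage of a chunk that is occupied."""
    return 100.0 * used_mib / chunk_mib


def matches_usage_filter(used_mib, chunk_mib, limit):
    # Assumption: a chunk is selected when its usage is below the limit.
    return usage_percent(used_mib, chunk_mib) < limit


used = 90                              # MiB of metadata (made-up figure)
before = usage_percent(used, 1024)     # standard-size chunk, assumed 1 GiB
after = usage_percent(used, 256)       # smaller chunk, assumed 256 MiB

print(round(before, 1), matches_usage_filter(used, 1024, 10))
print(round(after, 1), matches_usage_filter(used, 256, 10))
```

With these numbers the same metadata goes from under 10% of its chunk to
roughly a third of it, so a later -musage=10 run would no longer select
that chunk.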

And, perhaps partially answering my own question above, the balance with 
-musage=0 somehow triggered a space reevaluation, thus allowing the 
-musage=10 balance to run afterward when it wouldn't before, even tho the 
-musage=0 didn't actually relocate (to /dev/null as they'd be empty, IOW, 
delete) any empty chunks.

But... it still shouldn't happen, as if -musage=0 didn't relocate 
anything, it shouldn't trigger a space reevaluation that -musage=10 
wouldn't trigger on its own, so while this might partially answer what 
happened, it does nothing to explain /why/ it happened.  I'd call it a 
bug in the balance code, as the result of the -musage=10 should be 
exactly the same before and after, because the -musage=0 didn't actually 
relocate/delete anything.

> And now I'm back in business...
> 
> Still, this is a bit disappointing and at the

Missing half of available space (resend)

2015-12-08 Thread David Hampton

Hi all.  I'm trying to figure out why my btrfs file system doesn't
show all the available space.  I currently have four 4TB drives set up
as a raid6 array, so I would expect to see a total available data size
slightly under 8TB (two drives for data + two drives for parity).  The
'btrfs fi df' command consistently shows a total size of around 3TB, and
says that space is almost completely full.  Here's my current system
information...

===
root@selene:~# uname -a
Linux selene.dhampton.net 3.19.0-32-generic #37~14.04.1-Ubuntu SMP Thu
Oct 22 09:41:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
root@selene:~# btrfs --version
Btrfs v3.12
root@selene:~# btrfs fi show /video
Label: none  uuid: 74a4c4fa-9e83-465a-850d-cc089ecd00f6
Total devices 4 FS bytes used 3.12TiB
devid 1 size 3.64TiB used 1.58TiB path /dev/vdb
devid 2 size 3.64TiB used 1.58TiB path /dev/vda
devid 3 size 3.64TiB used 1.58TiB path /dev/vdc
devid 4 size 3.64TiB used 1.58TiB path /dev/vdd

Btrfs v3.12
root@selene:~# btrfs fi df /video
Data, RAID6: total=3.15TiB, used=3.11TiB
System, RAID6: total=64.00MiB, used=352.00KiB
Metadata, RAID6: total=5.00GiB, used=3.73GiB
unknown, single: total=512.00MiB, used=1.07MiB
root@selene:~# df -h /video
Filesystem  Size  Used Avail Use% Mounted on
/dev/vda 15T  3.2T  8.3T  28% /video
===

I have tried issuing the command "btrfs filesystem resize
<devid>:max /video" on each devid in the array, and also tried balancing
the array.  None of these commands changed the indication that the file
system is almost full.  I'm wondering if the problem is because this
file system began as a two drive raid1 array, and I later added the
other two drives and used the 'btrfs balance' command to convert to
raid6.  Any suggestions on what I can try to get the 'btrfs fi df'
command to show me more available space?  Did I forget a command when I
converted the raid1 array to raid6?

Alternatively, can I trust the numbers in the standard df command?  The
'used' number seems right but the 'avail' number seems high.

If I can provide any more information to help figure out what's
happening, please ask.

Thanks.

David


[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Initializing cgroup subsys cpuacct
[0.00] Linux version 3.19.0-32-generic (buildd@lgw01-43) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #37~14.04.1-Ubuntu SMP Thu Oct 22 09:41:40 UTC 2015 (Ubuntu 3.19.0-32.37~14.04.1-generic 3.19.8-ckt7)
[0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.19.0-32-generic root=UUID=b9fb1104-f681-4664-b0c3-b17db28d9d68 ro quiet splash vt.handoff=7
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Centaur CentaurHauls
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000f-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x3fffdfff] usable
[0.00] BIOS-e820: [mem 0x3fffe000-0x3fff] reserved
[0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved
[0.00] BIOS-e820: [mem 0xfffc-0x] reserved
[0.00] NX (Execute Disable) protection: active
[0.00] SMBIOS 2.4 present.
[0.00] DMI: Red Hat KVM, BIOS 0.5.1 01/01/2011
[0.00] Hypervisor detected: KVM
[0.00] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] AGP: No AGP bridge found
[0.00] e820: last_pfn = 0x3fffe max_arch_pfn = 0x4
[0.00] MTRR default type: write-back
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 8000 mask 3FFF8000 uncachable
[0.00]   1 disabled
[0.00]   2 disabled
[0.00]   3 disabled
[0.00]   4 disabled
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] PAT configuration [0-7]: WB  WC  UC- UC  WB  WC  UC- UC  
[0.00] found SMP MP-table at [mem 0x000f1ff0-0x000f1fff] mapped at [880f1ff0]
[0.00] Scanning 1 areas for low memory corruption
[0.00] Base memory trampoline at [88099000] 99000 size 24576
[0.00] init_memory_mapping: [mem 0x-0x000f]
[0.00]  [mem 0x-0x000f] page 4k
[0.00] BRK [0x01fd4000, 0x01fd4fff] PGTABLE
[0.00] BRK [0x01fd5000, 0x01fd5fff] PGTABLE
[0.00] BRK [0x01fd6000, 0x01fd6fff] PGTABLE
[0.00] init_memory_mapping: [mem

Re: Missing half of available space (resend)

2015-12-08 Thread Chris Murphy

On Tue, Dec 8, 2015 at 10:02 PM, David Hampton wrote:
> The
> 'btrfs fi df' command consistently shows a total size of around 3TB, and
> says that space is almost completely full.

and

> root@selene:~# btrfs fi df /video
> Data, RAID6: total=3.15TiB, used=3.11TiB

The "total=3.15TiB" means "there's a total of 3.15TiB allocated for
data chunks using raid6 profile" and of that 3.11TiB is used.

btrfs fi df doesn't ever show how much is free or available. You can
get an estimate of that by using 'btrfs fi usage' instead.

> root@selene:~# df -h /video
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/vda 15T  3.2T  8.3T  28% /video

That's about right although it seems it's slightly overestimating the
available free space.
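
For a rough cross-check of those numbers, here is a small Python sketch.
It is only an estimate under stated assumptions: it takes the sizes from
the 'btrfs fi show' output quoted in the original mail, and it assumes
every raid6 stripe spans all four devices, so that two devices' worth of
each stripe holds data and two hold parity.

```python
def raid6_data_fraction(num_devices):
    """Share of raw space holding data when stripes span all devices."""
    if num_devices < 3:
        raise ValueError("raid6 needs more devices than its two parity stripes")
    return (num_devices - 2) / num_devices


devices = 4
size_tib = 3.64        # per-device size from 'btrfs fi show'
allocated_tib = 1.58   # per-device 'used' (allocated) from 'btrfs fi show'

fraction = raid6_data_fraction(devices)
raw_total = devices * size_tib
raw_unallocated = devices * (size_tib - allocated_tib)

print("usable total (TiB):      ", round(raw_total * fraction, 2))
print("raw unallocated (TiB):   ", round(raw_unallocated, 2))
print("usable unallocated (TiB):", round(raw_unallocated * fraction, 2))
```

Under these assumptions the array offers about 7.28 TiB for data in
total, of which about 4.12 TiB is still unallocated. The raw unallocated
figure comes out at about 8.24 TiB, which is close to the 8.3T that df
reports as available; that would be consistent with df not fully
accounting for parity here, but this is a guess from the numbers, not
something checked against the code.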

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

2015-12-08 Thread Christoph Anton Mitterer

On 2015-11-27 00:08, Duncan wrote:
> Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as
> excerpted:
>> 1) AFAIU, the fragmentation problem exists especially for those files
>> that see many random writes, especially, but not limited to, big files.
>> Now that databases and VMs are affected by this, is probably broadly
>> known in the meantime (well at least by people on that list).
>> But I'd guess there are n other cases where such IO patterns can happen
>> which one simply never notices, while the btrfs continues to degrade.
> 
> The two other known cases are:
> 
> 1) Bittorrent download files, where the full file size is preallocated 
> (and I think fsynced), then the torrent client downloads into it a chunk 
> at a time.
Okay, sounds obvious.


> The more general case would be any time a file of some size is 
> preallocated and then written into more or less randomly, the problem 
> being the preallocation, which on traditional rewrite-in-place 
> filesystems helps avoid fragmentation (as well as ensuring space to save 
> the full file), but on COW-based filesystems like btrfs, triggers exactly 
> the fragmentation it was trying to avoid.
Is it really just the case when the file storage *is* actually fully
pre-allocated?
Cause that wouldn't (necessarily) be the case for e.g. VM images (e.g.
qcow2, or raw images when these are sparse files).
Or is it rather any case where, in a larger file, many random (file
internal) writes occur?
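
The effect described in the quoted paragraph can be sketched with a toy
simulation in Python. It is deliberately simplistic and not a model of
the real allocator: it assumes a fully preallocated, contiguous file,
that every COW write puts the rewritten block at the next free physical
location, and that nothing (autodefrag, extent merging) ever tidies up.
All names are made up for the example.

```python
import random


def count_extents(mapping):
    """Count runs of physically contiguous blocks in a logical->physical map."""
    if not mapping:
        return 0
    extents = 1
    for prev, cur in zip(mapping, mapping[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate(num_blocks, num_writes, cow, seed=0):
    """Return the extent count after random single-block rewrites."""
    rng = random.Random(seed)
    mapping = list(range(num_blocks))   # preallocated and fully contiguous
    next_free = num_blocks              # first unused physical block
    for _ in range(num_writes):
        block = rng.randrange(num_blocks)
        if cow:
            mapping[block] = next_free  # new data lands somewhere else
            next_free += 1
        # rewrite-in-place: the mapping does not change
    return count_extents(mapping)


print("in place:", simulate(1024, 200, cow=False))
print("cow:     ", simulate(1024, 200, cow=True))
```

In this model the in-place file stays a single extent no matter how many
random writes hit it, while the COW file gains extents with nearly every
write, whether or not the file was preallocated to begin with.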


> arranging to 
> have the client write into a dir with the nocow attribute set, so newly 
> created torrent files inherit it and do rewrite-in-place, is highly 
> recommended.
At the IMHO pretty high expense of losing the checksumming :-(
Basically losing half of the main functionalities that make btrfs
interesting for me.


> It's also worth noting that once the download is complete, the files 
> aren't going to be rewritten any further, and thus can be moved out of 
> the nocow-set download dir and treated normally.
Sure... but this requires manual intervention.

For databases, will e.g. the vacuuming maintenance tasks solve the
fragmentation issues (cause I guess, at least when doing full vacuuming,
it will rewrite the files)?


> The problem is much reduced in newer systemd, which is btrfs aware and in 
> fact uses btrfs-specific features such as subvolumes in a number of cases 
> (creating subvolumes rather than directories where it makes sense in some 
> shipped tmpfiles.d config files, for instance), if it's running on 
> btrfs.
Hmm, it doesn't seem really good to me if systemd did that, cause it
then excludes any such files from being snapshotted.


> For the journal, I /think/ (see the next paragraph) that it now 
> sets the journal files nocow, and puts them in a dedicated subvolume so 
> snapshots of the parent won't snapshot the journals, thereby helping to 
> avoid the snapshot-triggered cow1 issue.
The same here, kinda disturbing if systemd decides that on its
own, i.e. excluding files from being checksum protected...


>> So is there any general approach towards this?
> The general case is that for normal desktop users, it doesn't tend to be 
> a problem, as they don't do either large VMs or large databases,
Well depends a bit on how one defines the "normal desktop user",... for
e.g. developers or more "power users" it's probably not so unlikely that
they do run local VMs for testing or whatever.

> and 
> small ones such as the sqlite files generated by firefox and various 
> email clients are handled quite well by autodefrag, with that general 
> desktop usage being its primary target.
Which is however not yet the default...


> For server usage and the more technically inclined workstation users who 
> are running VMs and larger databases, the general feeling seems to be 
> that those adminning such systems are, or should be, technically inclined 
> enough to do their research and know when measures such as nocow and 
> limited snapshotting along with manual defrags where necessary, are 
> called for.
mhh... well it's perhaps easy to expect that knowledge for a few things
like VMs, DBs and the like... but there are countless software
systems, many of them being more or less a black box, at least with
respect to their internals.

It feels a bit as if there should be some tools provided by btrfs which
tell the users which files are likely problematic and should be
nodatacow'ed.


> And if they don't originally, they find out when they start 
> researching why performance isn't what they expected and what to do about 
> it. =:^)
Which can take quite a while to be found out...


>> And what are the actual possible consequences? Is it just that fs gets
>> slower (due to the fragmentation) or may I even run into other issues to
>> the point the space is eaten up or the fs becomes basically unusable?
> It's primarily a performance issue, tho in severe cases it can also be a 
> scaling issue, to the point that maintenance tasks such

!PageLocked BUG_ON hit in clear_page_dirty_for_io

2015-12-08 Thread Dave Jones

Not sure if I've already reported this one, but I've been seeing this
a lot this last couple days.

kernel BUG at mm/page-writeback.c:2654!
invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
CPU: 1 PID: 2566 Comm: trinity-c1 Tainted: GW   4.4.0-rc4-think+ #14
task: 880462811b80 ti: 8800cd808000 task.ti: 8800cd808000
RIP: 0010:[]  [] 
clear_page_dirty_for_io+0x180/0x1d0
RSP: 0018:8800cd80fa00  EFLAGS: 00010246
RAX: 880c RBX: ea0011098a00 RCX: 8800cd80fbb7
RDX: dc00 RSI: 110019b01f76 RDI: ea0011098a00
RBP: 8800cd80fa20 R08: 880453272000 R09: 
R10:  R11:  R12: 88045326f2c0
R13: 88046272a310 R14:  R15: 0001
FS:  7f186573d700() GS:880468a0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 010dd580 CR3: 00046261c000 CR4: 001406e0
Stack:
 0001 88046272a310 88046272a310 
 8800cd80fa90 c03891b5 8800cd80fb30 880402400040
 88045326f0e8 1000 88045326ed88 8800cd80fbb0
Call Trace:
 [] lock_and_cleanup_extent_if_need+0xa5/0x260 [btrfs]
 [] __btrfs_buffered_write+0x324/0x8a0 [btrfs]
 [] ? btrfs_dirty_pages+0xf0/0xf0 [btrfs]
 [] ? generic_file_direct_write+0x2ac/0x2c0
 [] ? generic_file_read_iter+0xa00/0xa00
 [] btrfs_file_write_iter+0x6dd/0x800 [btrfs]
 [] __vfs_write+0x21d/0x260
 [] ? __vfs_read+0x260/0x260
 [] ? __lock_is_held+0x92/0xd0
 [] ? preempt_count_sub+0xc1/0x120
 [] ? percpu_down_read+0x57/0xa0
 [] ? __sb_start_write+0xb4/0xf0
 [] vfs_write+0xf6/0x260
 [] SyS_write+0xbf/0x160
 [] ? SyS_read+0x160/0x160
 [] ? trace_hardirqs_on_thunk+0x17/0x19
 [] entry_SYSCALL_64_fastpath+0x12/0x6b
Code: 61 01 49 8d bd f0 00 00 00 8d 14 c5 08 00 00 00 e8 b6 cd 31 00 f6 c7 02 
74 20 e8 8c 41 ec ff 53 9d b8 01 00 00 00 e9 1d ff ff ff <0f> 0b 48 89 df e8 b6 
f5 ff ff e9 41 ff ff ff 53 9d e8 0a e7 eb 


That BUG is..

2653 
2654 BUG_ON(!PageLocked(page));
2655 


Re: subvols and parents - how?

2015-12-08 Thread Christoph Anton Mitterer

On Fri, 2015-11-27 at 02:02 +, Duncan wrote:
> > Uhm, I don't get the big security advantage here... whether nested or
> > manually mounted to a subdir,... if the permissions are insecure I'll
> > have a problem... if they're secure, then not.
> Consider a setuid-root binary with a recently publicized but patched on
> your system vuln.  But if you have root snapshots from before the patch
> and those snapshots are nested below root, then they're always
> accessible.  If the path to the vulnerable setuid is as user accessible
> as it likely was in its original location, then anyone with login access
> to the system is likely to be able to run it from the snapshot... and
> will be able to get root due to the vuln.

Hmm good point... I think it would be great if you could add that
scenario somewhere to the documentation. :-)
Based on that one can easily think of more/similar examples...
device files that had too permissive modes set, and were snapshotted
like that... and so on.

I think that's another example why it would be nice if btrfs had
something (per subvolume) like ext4's default mount options (I mean the
ones stored in the superblock).

Not only would it allow the userland tools to do things like adding
"noatime" by default on snapshots (at least ro snapshots), so that one
can have them nested and still not suffer from the previously
discussed writes-on-read amplification... it would also allow setting
things like nodev, noexec, nosuid and the like on subvols... and again
it would make the whole thing practically usable with nested subvols.

Where would be the appropriate place to record that as a feature
request?
Simply here on the list?

Cheers,
Chris.


Re: attacking btrfs filesystems via UUID collisions?

2015-12-08 Thread Christoph Anton Mitterer

On Sun, 2015-12-06 at 22:34 +0800, Qu Wenruo wrote:
> Not sure about LVM/MD, but they should suffer the same UUID conflict
> problem.
Well I actually had that quite often in LVM (i.e. same UUIDs visible on
the same system), basically because we made clones from one template VM
image and when that is normally booted, LVM doesn't allow changing the
UUIDs of already active PV/VG/LVs (or maybe just some of these three,
I forgot the details).

But there was never any issue, LVM on the host system, when one set was
already used, continues to use that just fine and the toolset reports
which it would use (more below).

> The only idea I have can only enhance the behavior, but never fix it.
> For example, if found multiple btrfs devices with same devid, just 
> refuse to mount.
> And for already mounted btrfs, ignore any duplicated fsid/devid.
Well I think that's already a perfectly valid solution... basically the
idea that I had before.
I'd call that a 100% fix, not just a workaround.

If the tools (i.e. btrfstune) then allow changing the UUID of the
duplicate set of devices (perhaps again with the necessity to specify each
of them via device=/dev/sda, etc.) I'd be completely happy again... and
the show could go on ;)
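
To make the proposed behaviour concrete, here is a small Python sketch of
such a policy. It is purely illustrative: nothing in it is an existing
btrfs interface, the function name and data layout are made up, and the
fsid strings and device paths are placeholders.

```python
from collections import defaultdict


def classify(scanned, active):
    """Sort scanned devices into usable, ignored and refused.

    scanned: list of (path, fsid, devid) tuples found by a device scan.
    active:  dict {(fsid, devid): path} for devices of mounted filesystems.
    Returns (usable, ignored, refused): usable maps (fsid, devid) to a
    path, the other two are sorted lists of paths.
    """
    by_key = defaultdict(list)
    for path, fsid, devid in scanned:
        by_key[(fsid, devid)].append(path)

    usable, ignored, refused = {}, [], []
    for key, paths in by_key.items():
        if key in active:
            # Already mounted: keep the device in use, ignore duplicates.
            usable[key] = active[key]
            ignored.extend(p for p in paths if p != active[key])
        elif len(paths) > 1:
            # Not mounted and ambiguous: refuse rather than guess.
            refused.extend(paths)
        else:
            usable[key] = paths[0]
    return usable, sorted(ignored), sorted(refused)


scanned = [
    ("/dev/sda", "fs-A", 1),
    ("/dev/sdc", "fs-A", 1),    # clone appearing while fs-A is mounted
    ("/dev/loop0", "fs-B", 1),
    ("/dev/loop1", "fs-B", 1),  # two unmounted copies of the same device
    ("/dev/sdd", "fs-C", 1),
]
active = {("fs-A", 1): "/dev/sda"}
usable, ignored, refused = classify(scanned, active)
print(usable, ignored, refused)
```

The point of the sketch is only that the decision never depends on scan
order: a mounted filesystem keeps its devices, and an ambiguous unmounted
set is left alone until the admin picks the devices explicitly.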

> The problem can get even trickier for cases like a device going missing
> for a while and then appearing again.
I had thought about that too:
a) In the non-malicious case, this could e.g. mean that a device from a
btrfs RAID was missing and a clone with the same UUID / dev ID gets
added to the system.
Possible consequences, AFAICS:
- The data is simply auto-rebuilt on the clone.
- Some corruptions occur when the clone is older, and data that was
only on the newer device is now missing (not sure if this can happen at
all or whether generation IDs prevent it).

b) In the malicious/attack case, one possible scenario could be:
A device is missing from a btrfs RAID... the machine is left
unattended. An attacker comes and plugs in a USB stick with the missing
UUID. Is the rebuild (and thus data leakage) now happening
automatically?

In any case though, a simple solution could be that no automatic
assemblies happen by default, and that the people who still want to do
that are properly warned about the possible implications in the docs.

> But just as you mentioned, it *IS* a real problem, and we should need to
> enhance it.
Should one (or I) add this as a ticket to the kernel bugzilla, or as an
entry to the btrfs wiki?

> I'd like to see how LVM/DM behaves first, at least as a reference if 
> they are really so safe.
Well that's very simple to check, I did it here for the LV case only:
root@lcg-lrz-admin:~# truncate -s 1G image1
root@lcg-lrz-admin:~# losetup -f image1 
root@lcg-lrz-admin:~# pvcreate /dev/loop0
  Physical volume "/dev/loop0" successfully created
root@lcg-lrz-admin:~# losetup -d /dev/loop0 
root@lcg-lrz-admin:~# cp image1 image2
root@lcg-lrz-admin:~# losetup -f image1 
root@lcg-lrz-admin:~# pvscan 
  PV /dev/sdb VG vg_data lvm2 [50,00 GiB / 0free]
  PV /dev/sda1VG vg_system   lvm2 [9,99 GiB / 0free]
  PV /dev/loop0  lvm2 [1,00 GiB]
  Total: 3 [60,99 GiB] / in use: 2 [59,99 GiB] / in no VG: 1 [1,00 GiB]
root@lcg-lrz-admin:~# losetup -f image2 
root@lcg-lrz-admin:~# pvscan 
  Found duplicate PV tSK9Cdpw6bcmocZnxFPD6ThNz1opRXsB: using /dev/loop1 not 
/dev/loop0
  PV /dev/sdb VG vg_data lvm2 [50,00 GiB / 0free]
  PV /dev/sda1VG vg_system   lvm2 [9,99 GiB / 0free]
  PV /dev/loop1  lvm2 [1,00 GiB]
  Total: 3 [60,99 GiB] / in use: 2 [59,99 GiB] / in no VG: 1 [1,00 GiB]

Obviously, with PVs alone, there is no "x is already used" case. As one
can see it just says it would ignore one of them, which I think is
rather stupid in that particular case (i.e. none of the devices is already
used somehow), because it probably just "randomly" decides which is to
be used, which is ambiguous.

> And what will rescan show if they are not active?
My experience was always (it's just quite late and I don't want to
simulate everything right now, which is trivial anyway):
- It shows warnings about the duplicates in the tools
- It continues to use the already active devices (if any)
- Unfortunately, while the kernel continues to use the already used
devices, the toolset may use another device (kinda stupid, but at least
it warns and the already used devices seem to be still properly used):

continuation from the setup above:
root@lcg-lrz-admin:~# losetup -d /dev/loop1 
(now only image1 is seen as loop0)
root@lcg-lrz-admin:~# vgcreate vg_test /dev/loop0
  Volume group "vg_test" successfully created
root@lcg-lrz-admin:~# lvcreate -n test vg_test -l 100
  Logical volume "test" created
root@lcg-lrz-admin:~# mkfs.ext4 /dev/vg_test/test 
mke2fs 1.42.12 (29-Aug-2014)
...
root@lcg-lrz-admin:~# mount /dev/vg_test/test /mnt/
root@lcg-lrz-admin:~# losetup -a
/dev/loop0: [64768]:518297 (/root/image1)
root@lcg-lrz-admin:~# losetup -f image2 
root@lcg-lrz-admin:~# vgs

Re: kernel call trace during send/receive

2015-12-08 Thread Christoph Anton Mitterer

Hey.

Hmm I guess no one has any clue about that error?

Well it seems at least that an fsck over the receiving fs passes
through without any error.

Cheers,
Chris.

On Fri, 2015-11-27 at 02:49 +0100, Christoph Anton Mitterer wrote:
> Hey.
> 
> Just got the following during send/receiving a big snapshot from one
> btrfs to another fresh one.
> 
> Both under kernel 4.2.6, tools 4.3
> 
> The send/receive seems to continue however...
> 
> Any ideas what that means?
> 
> Cheers,
> Chris.
> 
> Nov 27 01:52:36 heisenberg kernel: [ cut here ]
> 
> Nov 27 01:52:36 heisenberg kernel: WARNING: CPU: 7 PID: 18086 at
> /build/linux-CrHvZ_/linux-4.2.6/fs/btrfs/send.c:5794
> btrfs_ioctl_send+0x661/0x1120 [btrfs]()
> Nov 27 01:52:36 heisenberg kernel: Modules linked in: ext4 mbcache
> jbd2 nls_utf8 nls_cp437 vfat fat uas vhost_net vhost macvtap macvlan
> xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
> iptable_nat nf_nat_ipv4 nf_nat xt_tcpudp tun bridge stp llc fuse ccm
> ebtable_filter ebtables seqiv ecb drbg ansi_cprng algif_skcipher md4
> algif_hash af_alg binfmt_misc xfrm_user xfrm4_tunnel tunnel4 ipcomp
> xfrm_ipcomp esp4 ah4 cpufreq_userspace cpufreq_powersave
> cpufreq_stats cpufreq_conservative ip6t_REJECT nf_reject_ipv6
> nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy
> ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 nf_defrag_ipv4
> xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables
> x_tables joydev rtsx_pci_ms rtsx_pci_sdmmc mmc_core memstick iTCO_wdt
> iTCO_vendor_support x86_pkg_temp_thermal
> Nov 27 01:52:36 heisenberg kernel:  intel_powerclamp intel_rapl
> iosf_mbi coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul evdev
> deflate ctr psmouse serio_raw twofish_generic pcspkr btusb btrtl
> btbcm btintel bluetooth crc16 uvcvideo videobuf2_vmalloc
> videobuf2_memops videobuf2_core v4l2_common videodev media
> twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common
> sg arc4 camellia_generic iwldvm mac80211 iwlwifi cfg80211 rtsx_pci
> rfkill camellia_aesni_avx_x86_64 snd_hda_codec_hdmi tpm_tis tpm
> 8250_fintek camellia_x86_64 snd_hda_codec_realtek
> snd_hda_codec_generic processor battery fujitsu_laptop i2c_i801 ac
> lpc_ich serpent_avx_x86_64 mfd_core snd_hda_intel snd_hda_codec
> snd_hda_core snd_hwdep snd_pcm shpchp snd_timer e1000e snd soundcore
> i915 ptp pps_core video button drm_kms_helper drm thermal_sys mei_me
> Nov 27 01:52:36 heisenberg kernel:  i2c_algo_bit mei
> serpent_sse2_x86_64 xts serpent_generic blowfish_generic
> blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic
> cast_common des_generic cbc cmac xcbc rmd160 sha512_ssse3
> sha512_generic sha256_ssse3 sha256_generic hmac crypto_null af_key
> xfrm_algo loop parport_pc ppdev lp parport autofs4 dm_crypt dm_mod
> md_mod btrfs xor raid6_pq uhci_hcd usb_storage sd_mod crc32c_intel
> aesni_intel aes_x86_64 glue_helper ahci lrw gf128mul ablk_helper
> libahci cryptd libata ehci_pci xhci_pci ehci_hcd scsi_mod xhci_hcd
> usbcore usb_common
> Nov 27 01:52:36 heisenberg kernel: CPU: 7 PID: 18086 Comm: btrfs Not
> tainted 4.2.0-1-amd64 #1 Debian 4.2.6-1
> Nov 27 01:52:36 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK
> E782/FJNB23E, BIOS Version 1.11 05/24/2012
> Nov 27 01:52:36 heisenberg kernel:   a02e6260
> 8154e2f6 
> Nov 27 01:52:36 heisenberg kernel:  8106e5b1 880235a3c42c
> 7ffd3d3796c0 8802f0e5c000
> Nov 27 01:52:36 heisenberg kernel:  0004 88010543c500
> a02d2d81 88041e5ebb00
> Nov 27 01:52:36 heisenberg kernel: Call Trace:
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> dump_stack+0x40/0x50
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> warn_slowpath_common+0x81/0xb0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> btrfs_ioctl_send+0x661/0x1120 [btrfs]
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> __alloc_pages_nodemask+0x194/0x9e0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> btrfs_ioctl+0x26c/0x2a10 [btrfs]
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> sched_move_task+0xca/0x1d0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> cpumask_next_and+0x2e/0x50
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> select_task_rq_fair+0x23f/0x5c0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> enqueue_task_fair+0x387/0x1120
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> native_sched_clock+0x24/0x80
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> sched_clock+0x5/0x10
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> do_vfs_ioctl+0x2c3/0x4a0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> _do_fork+0x146/0x3a0
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> SyS_ioctl+0x76/0x90
> Nov 27 01:52:36 heisenberg kernel:  [] ?
> system_call_fast_compare_end+0xc/0x6b
> Nov 27 01:52:36 heisenberg kernel: ---[ end trace f5fa91e2672eead0 ]-
> --


Re: attacking btrfs filesystems via UUID collisions? (was: Subvolume UUID, data corruption?)

2015-12-08 Thread Christoph Anton Mitterer

On Sun, 2015-12-06 at 04:06 +, Duncan wrote:
> There's actually a number of USB-based hardware and software vulns out
> there, from the under $10 common-component-capacitor-based charge-and-zap
> (charges off the 5V USB line, zaps the port with several hundred volts
> reverse-polarity, if the machine survives the first pulse and continues
> supplying 5V power, repeat...), to the ones that act like USB-based input
> devices and "type" in whatever commands, to simple USB-boot to a forensic
> distro and let you inspect attached hardware (which is where the encrypted
> storage comes in, they've got everything that's not encrypted),
> to the plain old fashioned boot-sector viruses that quickly jump to
> everything else on the system that's not boot-sector protected and/or
> secure-boot locked, to...
Well this is all well known - at least to security folks ;) - but to be
quite honest:
Not an excuse for allowing even more attack surface, in this case via
the filesystem.
One will *always* find a weaker element in the security chain, and
could always use that as an argument not to fix one's own issues.

"Well, there's no need to fix that possible collision-data-leakage-
issue in btrfs[0]! Why? Well an attacker could still simply abduct the
bank manager, torture him for hours until he gives any secret with
pleasure"
;-)

> Which is why most people in the know say if you have unsupervised
> physical
> access, you effectively own the machine and everything on it, at
> least
> that's not encrypted.
Sorry, I wouldn't say so. Ultimately you're of course right, which is
why my fully-dm-crypted notebook is never left alone when it runs (cold
boot or USB firmware attacks)... but in practise things are a bit
different I think.
Take the ATM example.

Or take real world life in big computing centres.
Fact is, many people usually have access, from the actual main
personnel, to electricians, to the cleaning personnel.
Whacking a device or attacking it via USB firmware tricks is of course
possible for them, but it's much more likely to be noticed (making noise,
taking time and so on)... so there is no need to provide another attack
surface like this.

> If you haven't been keeping up, you really have some reading to
> do.  If
> you're plugging in untrusted USB devices, seriously, a thumb drive
> with a
> few duplicated btrfs UUIDs is the least of your worries!
Well as I've said, getting that in via USB may be only one way.
We're already so far that GNOME automounts devices when plugged in...
who says the next step isn't that this happens remotely in some
form, e.g. a btrfs image on Dropbox, automounted by Nautilus?
Okay, that may be a bit contrived, but it should demonstrate that
there could be plenty of ways for that to happen, which we don't even
think of (and usually these are the worst in security).

You said it's basically not fixable in btrfs:
It's absolutely clear that I'm no btrfs expert (or even developer), but
my poor man's approach, which I think I've described before, doesn't seem so
impossible, does it?
1) Don't simply "activate" btrfs devices that are found but rather:
2) Check if there are other devices of the same fs UUID + device ID, or
more generally said: check if there are any collisions
3) If there are, and some of them are already active, continue to use
them, don't activate the newly appeared ones
4) If there are, and none of them are already active, refuse to
activate *any* of them unless the user manually instructs to do so via
device= like options.
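
A minimal sketch of the four steps above as a single decision function,
in C. Everything here is hypothetical and only illustrates the proposed
policy: `struct scanned_dev`, `decide_activation()` and the
`user_selected` flag (standing in for an explicit device= style option)
are made-up names, not existing btrfs or btrfs-progs code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSID_SIZE 16

/* Hypothetical record for a device found by a scan; not a btrfs struct. */
struct scanned_dev {
	uint8_t fsid[FSID_SIZE];	/* filesystem UUID */
	uint64_t devid;			/* device ID within that filesystem */
	bool active;			/* already in use by a mounted fs? */
};

enum activation_decision {
	ACTIVATE,	/* no collision, or the user explicitly chose it */
	KEEP_EXISTING,	/* step 3: a colliding device is already active */
	REFUSE_ALL,	/* step 4: collision, nothing active, no user choice */
};

/* Two devices collide when both fs UUID and device ID match (step 2). */
static bool same_identity(const struct scanned_dev *a,
			  const struct scanned_dev *b)
{
	return a->devid == b->devid &&
	       memcmp(a->fsid, b->fsid, FSID_SIZE) == 0;
}

/*
 * Decide what to do with a newly appeared candidate device, given the
 * devices already known. Nothing is activated implicitly (step 1).
 */
enum activation_decision decide_activation(const struct scanned_dev *known,
					   size_t n_known,
					   const struct scanned_dev *candidate,
					   bool user_selected)
{
	bool collision = false;
	bool collision_active = false;

	assert(candidate != NULL);

	for (size_t i = 0; i < n_known; i++) {
		if (!same_identity(&known[i], candidate))
			continue;
		collision = true;
		if (known[i].active)
			collision_active = true;
	}

	if (!collision)
		return ACTIVATE;
	if (collision_active)
		return KEEP_EXISTING;
	return user_selected ? ACTIVATE : REFUSE_ALL;
}
```

In this sketch an already-active device always wins over a newcomer,
even if the user names the newcomer explicitly; whether that is the
right trade-off is exactly the kind of thing the developers would have
to decide.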

> BTW, this is documented (in somewhat simpler "do not do XX" form) on
> the wiki, gotchas page.
> 
> https://btrfs.wiki.kernel.org/index.php/Gotchas#Block-level_copies_of_devices
I know, but it doesn't really spell out all possible consequences, and
again, it's unlikely that the end-user (even if possibly heavily
affected by it) will stumble over that.

Cheers,
Chris.

[0] Assuming there is actually one; I haven't really verified that and
base it solely on what people have said, namely that basically arbitrary
corruptions may happen on both devices.


Re: Missing half of available space (resend)

2015-12-08 Thread David Hampton

On Tue, 2015-12-08 at 22:27 -0700, Chris Murphy wrote:
> On Tue, Dec 8, 2015 at 10:02 PM, David Hampton
>  wrote:
> > The
> > 'btrfs fi df' command consistently shows a total size of around
> > 3TB, and says that space is almost completely full.
> 
> and
> 
> 
> > root@selene:~# btrfs fi df /video
> > Data, RAID6: total=3.15TiB, used=3.11TiB
> 
> The "total=3.15TiB" means "there's a total of 3.15TiB allocated for
> data chunks using raid6 profile" and of that 3.11TiB is used.
> 
> btrfs fi df doesn't ever show how much is free or available.

I think I get it.  The numbers in the 'df' command don't show the total
number of chunks that exist, only the subset of those chunks that have
been allocated to something.

> You can get an estimate of that by using 'btrfs fi usage' instead.

Seems I need to upgrade my tools.  That command was added in 3.18 and I
only have the 3.12 tools.

> > root@selene:~# df -h /video
> > Filesystem  Size  Used Avail Use% Mounted on
> > /dev/vda 15T  3.2T  8.3T  28% /video
> 
> That's about right although it seems it's slightly overestimating the
> available free space.

Thanks.  Makes me feel a lot better.

David




Re: subvols and parents - how?

2015-12-08 Thread Christoph Anton Mitterer

On Fri, 2015-11-27 at 01:02 +, Duncan wrote:
[snip snap]
> #1 could be a pain to setup if you weren't actually mounting it
> previously, just relying on the nested tree, AND...
> 
> #2 The point I was trying to make, now, to mount it you'll mount not
> a 
> native nested subvol, and not a directly available sibling
> 5/subvols/home, but you'll actually be reaching into an entirely 
> different nesting structure to grab something down inside, mounting
> 5/subvols/root/home subvolume nesting down inside the direct
> 5/subvols/root sibling subvol.

Okay so your main point was basically "keeping things administrable"...

> one of which was that everything 
> that the package manager installs should be on the same partition
> with 
> the installed-package database, so if it has to be restored from
> backup, 
> at least if it's all old, at least it's all equally old, and the
> package 
> database actually matches what's on the system because it's in the
> same 
> backup!
I basically agree, though I'd allow a few exceptions, like database-like
data that is sometimes stored in /var/ and that doesn't need to be
consistent with anything but itself... e.g. static web pages
(/var/www)... a PostgreSQL DB, or an sks keyserver DB... and so on.

btw: What's the proper way of merging / splitting into subvols?
E.g. consider I have:
5
|
+--root (subvol)
   |
   +-- var (no subvol)

And say I wanted to split off var/www into a subvol.
Well, one obvious way would be with mv (and AFAIU that would keep my
reflinks with clones, if any), but that also means that anything that
accesses /var/www probably needs downtime.
Is it planned to have a special function that basically says:
"make dir foo and anything below (except nested subvols) a subvol named
foo, immediately and atomically"?

And similar vice-versa... a special function that says:
"make subvol foo and anything below (except nested subvols) a dir of
the parent subvol named foo, immediately and atomically"?

Could be handy for real world administration, especially when one wants
to avoid downtimes.

btw: A few days ago, either Hugo or you thought that mv'ing a subvol
would change its UUID, but my test (which was with coreutils 8.3 -> no
reflinked mv) seemed to show it wouldn't, but there was no further reply
then... so am I right that the UUID wouldn't change?

> The same idea applies here.  Once you start reaching into nested
> subvols 
> to get the deeper nested subvols you're trying to mount, it's too
> much 
> and you're just begging to get it wrong under the extreme pressures
> of a 
> disaster recovery.
Well, apparently you overlooked the extremely simple and reliable solution:
leaving a tiny little note on your desk saying something like: "dear
boss, things are screwed up, I'm on vacation now..." ;-)

Thanks,
Chris.

