Re: attacking btrfs filesystems via UUID collisions?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 14:26 -0700, Chris Murphy wrote:
> The automobile is invented and due to the ensuing chaos, common
> practice of doing whatever the F you wanted came to an end in favor
> of
> rules of the road and traffic lights. I'm sure some people went
> ballistic, but for the most part things were much better without the
> brokenness or prior common practice.
Okay, then take your road traffic example and apply it to filesystems.

In road traffic you have rules, e.g. pedestrians may cross the road
when their light shows green and that of the cars is red.
That could be the rule, similar to "don't have duplicate UUIDs with
btrfs".

Even though we have the rule that cars stop at red and pedestrians walk
at green, we still teach our kids: "look both ways, and only cross if
there's no car (or tank or whatever ;) ) coming."
Applying that to filesystems would be: "hope that everyone plays by the
rules, but don't kill yourself if someone doesn't and there are
duplicate IDs."

 
> So the fact we're going to have this problem with all file systems
> that incorporate the volume UUID into the metadata stream, tells me
> that the very rudimentary common practice of using dd needs to go
> away, in general practice.
Sure, for those that use multiple devices (LVM, MD, etc.), or for those
that actually use the UUID to select the block device for each
write/read (rather than using it only "once" to get the right
major/minor dev id, or whatever the kernel uses internally for
path-based addressing).


> http://www.ietf.org/rfc/rfc4122.txt
> "A UUID is 128 bits long, and can guarantee uniqueness across space
> and time."
But of course not in terms of the problems we're talking about here,
where UUIDs may be accidentally or maliciously duplicated.

> Also see security considerations in section 6.
Doesn't section 6 basically imply that you cannot 100% guarantee
they're unique? E.g. a bad random seed on multiple systems?

Also, IIRC, one of the UUID algorithms just uses some combination of
MAC address, time and PID... which, especially in VMs, may even lead to
duplicates.
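For illustration, here is a minimal sketch using libuuid (util-linux,
link with -luuid); uuid_generate_time() produces the version-1
MAC+timestamp style UUIDs mentioned above, uuid_generate_random() the
purely random version-4 ones. The library calls are real libuuid API,
the program itself is just my example:

#include <stdio.h>
#include <uuid/uuid.h>		/* libuuid, build with -luuid */

int main(void)
{
	uuid_t u;
	char s[37];

	uuid_generate_time(u);		/* version 1: MAC + timestamp + clock seq */
	uuid_unparse(u, s);
	printf("time-based:   %s\n", s);

	uuid_generate_random(u);	/* version 4: purely (pseudo-)random */
	uuid_unparse(u, s);
	printf("random-based: %s\n", s);
	return 0;
}

Run the time-based variant on two VMs cloned from the same image (same
MAC, synchronized clocks) and a collision is at least conceivable; the
random variant only collides if the random source is bad, which is
exactly the section 6 caveat.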



Cheers,
Chris.



Re: [PATCH v2 1/2] btrfs: Enhance super validation check

2015-12-14 Thread Qu Wenruo



David Sterba wrote on 2015/12/14 17:24 +0100:

On Tue, Dec 08, 2015 at 03:35:57PM +0800, Qu Wenruo wrote:

@@ -4005,31 +3989,47 @@ static int btrfs_check_super_valid(struct btrfs_fs_info 
*fs_info,
}

/*
-* The common minimum, we don't know if we can trust the 
nodesize/sectorsize
-* items yet, they'll be verified later. Issue just a warning.
+* Check sectorsize and nodesize first, some other check will need it.
+* XXX: Just do a favor for later subpage size check. Check all


Same as in v1: Please do not add new XXX or TODO markers to the sources.
The comment would be fine with just:

 "Check all possible sectorsizes (4K, 8K, 16K, 32K, 64K) here."

With that fixed,

Reviewed-by: David Sterba 



Oh, sorry, forgot to modify this, I'll update it soon.

Thanks,
Qu




Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 17:42 +1100, Russell Coker wrote:
> My understanding of BTRFS is that the metadata referencing data
> blocks has the 
> checksums for those blocks, then the blocks which link to that
> metadata (EG 
> directory entries referencing file metadata) has checksums of those.
You mean basically, that all metadata is chained, right?

> For each 
> metadata block there is a new version that is eventually linked from
> a new 
> version of the tree root.
> 
> This means that the regular checksum mechanisms can't work with nocow
> data.  A 
> filesystem can have checksums just pointing to data blocks but you
> need to 
> cater for the case where a corrupt metadata block points to an old
> version of 
> a data block and matching checksum.  The way that BTRFS works with an
> entire 
> checksumed tree means that there's no possibility of pointing to an
> old 
> version of a data block.
Hmm I'm not sure whether I understand that (or better said, I'm
probably sure I don't :D).

AFAIU, the metadata is always CoWed, right? So when a nodatacow file is
written, I'd assume its mtime is updated, which already leads to CoWing
of metadata... just that now, the checksums should be written as well.

If the metadata block is corrupt, then shouldn't that be noticed via
the csums on it?

And you said "The way that BTRFS works with an entire checksumed tree
means that there's no possibility of pointing to an old version of a
data block."... how would that work for nodatacow'ed blocks? If there
is a crash, it cannot know whether the block still holds the old data,
the new data, or any garbage in between?!
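To make the crash case concrete, here's a tiny toy model (my own
illustration, not btrfs code) of why an in-place (nocow) data write
can't be covered by a checksum that only catches up at the next
metadata commit:

#include <stdio.h>
#include <string.h>

/* Toy model, not btrfs code: one data block plus one stored checksum.
 * With CoW, data and checksum flip together at commit time; with
 * nocow, the data is overwritten in place and the stored checksum only
 * catches up at the next commit. */
static unsigned toy_csum(const char *s)
{
	unsigned c = 0;
	while (*s)
		c = c * 131 + (unsigned char)*s++;
	return c;
}

int main(void)
{
	char data[16] = "old";
	unsigned stored_csum = toy_csum(data);	/* consistent state */

	/* nocow write of new content, crash before the metadata commit: */
	strcpy(data, "new");			/* data already changed in place */
	/* stored_csum still describes "old" */

	printf("csum matches after crash: %s\n",
	       toy_csum(data) == stored_csum
	       ? "yes" : "no (indistinguishable from corruption)");
	return 0;
}

With CoW, the superblock keeps pointing at the old (data, csum) pair
until both have been written elsewhere and committed, so this mixed
state never becomes visible; with nocow there is no old copy to fall
back to.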


> The NetApp published research into hard drive errors indicates that
> they are 
> usually in small numbers and located in small areas of the disk.  So
> if BTRFS 
> had a nocow file with any storage method other than dup you would
> have metadata 
> and file data far enough apart that they are not likely to be hit by
> the same 
> corruption (and the same thing would apply with most Ext4 Inode
> tables and 
> data blocks).
Well, put aside any such research (whose results aren't guaranteed to
always hold)... that's just one of the reasons why I've said checksums
for no-CoWed files would be great (I used the multi-device example
though, not DUP).


> I think that a file mode where there were checksums on data 
> blocks with no checksums on the metadata tree would be useful.  But
> it would 
> require a moderate amount of coding
Do you mean in general, or having this as a mode for nodatacow'ed
files?
Losing the metadata checksumming doesn't really seem much more
appealing than not having data checksumming :-(


> and there's lots of other things that the 
> developers are working on.
Sure, I just wanted to bring this to their attention... I already
imagined that they wouldn't drop their current work to do that just
because of me whining for it ;-)


Thanks,
Chris.



Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Lionel Bouton
Le 15/12/2015 02:49, Duncan a écrit :
> Christoph Anton Mitterer posted on Tue, 15 Dec 2015 00:25:05 +0100 as
> excerpted:
>
>> On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote:
>>
>>> I use noatime and nodiratime
>> FYI: noatime implies nodiratime :-)
> Was going to post that myself.  Is there some reason you:
>
> a) use nodiratime when noatime is already enabled, despite the fact that 
> the latter already includes the former, or

I don't (and haven't for some time). I didn't check for nodiratime on
all the systems I admin, so there could be some left around, but as
they are harmless I only remove them when I happen to stumble upon
them.

>
> b) didn't sufficiently research the option (at least the current mount 
> manpage documents that noatime includes nodiratime under both the noatime 
> and nodiratime options,

I just checked: this has only been made crystal-clear in the latest
man-pages version 4.03, released 10 days ago.

The mount(8) page of Gentoo's current stable man-pages (4.02 release in
August) which is installed on my systems states for noatime:
"Do not update inode access times on this filesystem (e.g., for faster
access on the news spool to speed up news servers)."

This is prone to misinterpretation: directories are inodes, but that
may not be self-explanatory for everyone. At least it could leave me
with a doubt if I weren't absolutely certain of the behavior (see
below): I'm not sure myself that there isn't a difference between a VFS
inode (the in-memory structure) and an on-disk structure called an
inode, which some filesystems may not have (I may be mistaken, but IIRC
ReiserFS left me with the impression that it wasn't storing directory
entries in inodes, or didn't call them that).

In fact I remember that, when I read statements about noatime implying
nodiratime, I had to check fs/inode.c to make sure of the behavior,
after finding a random discussion on the subject saying that the proof
was in the code.
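For reference, the reason the code settles the question is that the VFS
checks the mount flags in a fixed order before updating atime. Below is
a paraphrased, self-contained sketch of that decision (not the kernel
source verbatim; the flag values are my own stand-ins):

#include <stdio.h>
#include <stdbool.h>

/* Paraphrase of the decision in fs/inode.c (touch_atime and friends),
 * not the kernel code itself.  The point: a noatime mount suppresses
 * atime updates for every inode, while nodiratime only kicks in for
 * directories, so noatime implies nodiratime. */
#define MNT_NOATIME	0x1
#define MNT_NODIRATIME	0x2

static bool atime_needs_update(unsigned mnt_flags, bool is_dir)
{
	if (mnt_flags & MNT_NOATIME)
		return false;			/* files and directories alike */
	if ((mnt_flags & MNT_NODIRATIME) && is_dir)
		return false;			/* directories only */
	return true;				/* relatime etc. omitted here */
}

int main(void)
{
	printf("noatime, regular file: %d\n", atime_needs_update(MNT_NOATIME, false));
	printf("noatime, directory:    %d\n", atime_needs_update(MNT_NOATIME, true));
	printf("nodiratime, file:      %d\n", atime_needs_update(MNT_NODIRATIME, false));
	return 0;
}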


>  and at least some hint of that has been in the 
> manpage for years as I recall reading it when I first read of nodiratime 
> and checked whether my noatime options included it) before standardizing 
> on it, or
>
> c) might have actually been talking in general, and there's some mounts 
> you don't actually choose to make noatime, but still want nodiratime, or

I probably used this case for testing purposes (but don't remember a
case where it was useful to me).
The expression I used was not meant to describe the exact flags in
fstab on my systems, but the general idea of avoiding atime updates for
files and directories; by using noatime I'm implicitly using nodiratime
too.
Sorry for the confusion (I've been confused about the subject for a
long time, which probably didn't help me express myself clearly).

Best regards,

Lionel


Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-12-14 Thread Qu Wenruo



Laurent Bonnaud wrote on 2015/12/14 13:47 +0100:

On 11/12/2015 15:21, Laurent Bonnaud wrote:


The next step will be to run a "btrfs scrub" to check if data loss did happen...


Scrubbing is now finished and it detected no errors.


Glad to hear that.

Your fs should be OK now.

Thanks,
Qu




Re: !PageLocked BUG_ON hit in clear_page_dirty_for_io

2015-12-14 Thread Liu Bo
On Mon, Dec 14, 2015 at 07:03:24PM -0500, Chris Mason wrote:
> On Tue, Dec 08, 2015 at 11:25:28PM -0500, Dave Jones wrote:
> > Not sure if I've already reported this one, but I've been seeing this
> > a lot this last couple days.
> > 
> > kernel BUG at mm/page-writeback.c:2654!
> > invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> 
> We ended up discussing this in more detail on lkml, but I'll summarize
> here.
> 
> There were two problems.  First lock_page() might not actually lock the
> page in v4.4-rc4, it can bail out if a signal is pending.  This got
> fixed just before v4.4-rc5, so if you were on rc4, upgrade asap.
> 
> Second, prepare_pages had a bug for single page writes:
> 
> From f0be89af049857bcc537a53fe2a2fae080e7a5bd Mon Sep 17 00:00:00 2001
> From: Chris Mason 
> Date: Mon, 14 Dec 2015 15:40:44 -0800
> Subject: [PATCH] Btrfs: check prepare_uptodate_page() error code earlier
> 
> prepare_pages() may end up calling prepare_uptodate_page() twice if our
> write only spans a single page.  But if the first call returns an error,
> > our page will be unlocked and it's not safe to call it again.
> 
> This bug goes all the way back to 2011, and it's not something commonly
> hit.
> 
> While we're here, add a more explicit check for the page being truncated
> away.  The bare lock_page() alone is protected only by good thoughts and
> i_mutex, which we're sure to regret eventually.

Reviewed-by: Liu Bo 

Thanks,

-liubo
> 
> Reported-by: Dave Jones 
> Signed-off-by: Chris Mason 
> ---
>  fs/btrfs/file.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 72e7346..0f09526 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1291,7 +1291,8 @@ out:
>   * on error we return an unlocked page and the error value
>   * on success we return a locked page and 0
>   */
> -static int prepare_uptodate_page(struct page *page, u64 pos,
> +static int prepare_uptodate_page(struct inode *inode,
> +  struct page *page, u64 pos,
>bool force_uptodate)
>  {
>   int ret = 0;
> @@ -1306,6 +1307,10 @@ static int prepare_uptodate_page(struct page *page, 
> u64 pos,
>   unlock_page(page);
>   return -EIO;
>   }
> + if (page->mapping != inode->i_mapping) {
> + unlock_page(page);
> + return -EAGAIN;
> + }
>   }
>   return 0;
>  }
> @@ -1324,6 +1329,7 @@ static noinline int prepare_pages(struct inode *inode, 
> struct page **pages,
>   int faili;
>  
>   for (i = 0; i < num_pages; i++) {
> +again:
>   pages[i] = find_or_create_page(inode->i_mapping, index + i,
>  mask | __GFP_WRITE);
>   if (!pages[i]) {
> @@ -1333,13 +1339,17 @@ static noinline int prepare_pages(struct inode 
> *inode, struct page **pages,
>   }
>  
>   if (i == 0)
> - err = prepare_uptodate_page(pages[i], pos,
> + err = prepare_uptodate_page(inode, pages[i], pos,
>   force_uptodate);
> - if (i == num_pages - 1)
> - err = prepare_uptodate_page(pages[i],
> + if (!err && i == num_pages - 1)
> + err = prepare_uptodate_page(inode, pages[i],
>   pos + write_bytes, false);
>   if (err) {
>   page_cache_release(pages[i]);
> + if (err == -EAGAIN) {
> + err = 0;
> + goto again;
> + }
>   faili = i - 1;
>   goto fail;
>   }
> -- 
> 2.4.6
> 


Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 15:20 -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 14:44, Christoph Anton Mitterer wrote:
> > On Mon, 2015-12-14 at 14:33 -0500, Austin S. Hemmelgarn wrote:
> > > The traditional reasoning was that read-only meant that users
> > > couldn't
> > > change anything
> > Where I'd however count the atime changes to.
> > The atimes wouldn't change magically, but only because the user
> > started
> > some program, configured some daemon, etc. ... which
> > reads/writes/etc.
> > the file.
> But reading the file is allowed, which is where this starts to get 
> ambiguous.
Why?

> Reading a file updates the atime (and in fact, this is the 
> way that most stuff that uses them cares about them), but even a ro 
> mount allows reading the file.
As I just wrote in the other post, at least for btrfs (haven't checked
ext/xfs due to being... well... lazy O:-) ) the ro mount option or an
ro snapshot seems to mean: no atime updates even if mounted with
strictatime (or maybe I just did something stupid when checking, so
better double-check).


> The traditional meaning of ro on UNIX 
> was (AFAIUI) that directory structure couldn't change, new files 
> couldn't be created, existing files couldn't be deleted, flags on the
> inodes couldn't be changed, and file data couldn't be changed.  TBH,
> I'm 
> not even certain that atime updates on ro filesystems was even an 
> intentional thing in the first place, it really sounds to me like the
> type of thing that somebody forgot to put in a permissions check for,
> and then people thought it was a feature.
Well, in the end it probably doesn't matter how it came into
existence,... rather what it should be and what it actually is.
As said, I, personally, from the user PoV, would say soft-ro already
includes that no dates on files are modifiable (including atime), as
I'd consider these a property of the file.
However, anyone else may of course see that differently and at the same
time be smarter than I am.



> Also,
 
> even with noatime, I'm pretty sure the VFS updates the atime every
> time 
> the mtime changes
I've just checked and no, it doesn't:
  File: ‘subvol/FILE’
  Size: 8   Blocks: 16 IO Block: 4096   regular file
Device: 30h/48d Inode: 257 Links: 1
Access: (0644/-rw-r--r--)  Uid: (0/root)   Gid: (0/root)
Access: 2015-12-15 00:01:46.452007798 +0100
Modify: 2015-12-15 00:31:26.579511816 +0100
Change: 2015-12-15 00:31:26.579511816 +0100

(mounted rw,noatime... mtime is more recent than atime)
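(For anyone who wants to reproduce that without staring at stat output,
here's a minimal C check: append to a file, then compare the
timestamps. The default path is just a placeholder, pass your own as
argv[1].)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* Append one byte to a file, then print atime and mtime.  On a
 * noatime mount only mtime/ctime should move. */
int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "subvol/FILE";
	struct stat st;
	int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "x", 1) != 1)
		perror("write");
	if (fstat(fd, &st) == 0)
		printf("atime %ld  mtime %ld (mtime newer => the write did not touch atime)\n",
		       (long)st.st_atime, (long)st.st_mtime);
	close(fd);
	return 0;
}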


>  (because not doing so would be somewhat stupid, and 
> you're writing the inode anyway), which technically means that stuff 
> could work around this by opening the file, truncating it to the size
> it 
> already is, and then closing it.
Hmm, I don't have a strong opinion here... it sounds "stupid" from the
technical point of view, in that it *could* write the atime and that
wouldn't cost much.
OTOH, that would make things more ambiguous as to when atimes change
and when not... (they'd only change on writes, never on reads,...)
So I think it's good as it is... and it matches the name, which is
noatime - and not noatime-unless-on-writes ;-)



Cheers,
Chris.



[PATCH v2] btrfs-progs: Enhance chunk validation check

2015-12-14 Thread Qu Wenruo
Enhance chunk validation:
1) Num_stripes
   We already have such a check, but only in the super block sys chunk
   array.
   Now check all on-disk chunks.

2) Chunk logical
   It should be aligned to sector size.
   This behavior should be *DOUBLE CHECKED* for 64K sector size
   systems like PPC64 or AArch64.
   Maybe we can find some hidden bugs.

3) Chunk length
   Same as chunk logical, should be aligned to sector size.

4) Stripe length
   It should be a power of 2.

5) Chunk type
   Any bit outside of TYPE_MASK | PROFILE_MASK is invalid.

With all these stricter rules, several fuzzed images reported on the
mailing list should no longer cause btrfsck errors.

Reported-by: Vegard Nossum 
Signed-off-by: Qu Wenruo 
---
v2:
  Move some macros to kerncompat.h
---
 disk-io.c|  2 --
 kerncompat.h |  8 
 volumes.c| 29 -
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index 7a63b91..83bdb27 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -40,8 +40,6 @@
 #define BTRFS_BAD_LEVEL(-3)
 #define BTRFS_BAD_NRITEMS  (-4)
 
-#define IS_ALIGNED(x, a)(((x) & ((typeof(x))(a) - 1)) == 0)
-
 /* Calculate max possible nritems for a leaf/node */
 static u32 max_nritems(u8 level, u32 nodesize)
 {
diff --git a/kerncompat.h b/kerncompat.h
index 7c627ba..0f207b7 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -310,6 +310,14 @@ static inline long IS_ERR(const void *ptr)
 #define __bitwise
 #endif
 
+/* Alignment check */
+#define IS_ALIGNED(x, a)(((x) & ((typeof(x))(a) - 1)) == 0)
+
+static inline int is_power_of_2(unsigned long n)
+{
+   return (n != 0 && ((n & (n - 1)) == 0));
+}
+
 typedef u16 __bitwise __le16;
 typedef u16 __bitwise __be16;
 typedef u32 __bitwise __le32;
diff --git a/volumes.c b/volumes.c
index 492dcd2..a94be0e 100644
--- a/volumes.c
+++ b/volumes.c
@@ -1591,6 +1591,7 @@ static int read_one_chunk(struct btrfs_root *root, struct 
btrfs_key *key,
struct cache_extent *ce;
u64 logical;
u64 length;
+   u64 stripe_len;
u64 devid;
u8 uuid[BTRFS_UUID_SIZE];
int num_stripes;
@@ -1599,6 +1600,33 @@ static int read_one_chunk(struct btrfs_root *root, 
struct btrfs_key *key,
 
logical = key->offset;
length = btrfs_chunk_length(leaf, chunk);
+   stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
+   num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+   /* Validation check */
+   if (!num_stripes) {
+   error("invalid chunk num_stripes: %u", num_stripes);
+   return -EIO;
+   }
+   if (!IS_ALIGNED(logical, root->sectorsize)) {
+   error("invalid chunk logical %llu", logical);
+   return -EIO;
+   }
+   if (!length || !IS_ALIGNED(length, root->sectorsize)) {
+   error("invalid chunk length %llu", length);
+   return -EIO;
+   }
+   if (!is_power_of_2(stripe_len)) {
+   error("invalid chunk stripe length: %llu", stripe_len);
+   return -EIO;
+   }
+   if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+   btrfs_chunk_type(leaf, chunk)) {
+   error("unrecognized chunk type: %llu",
+ ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
+   BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+ btrfs_chunk_type(leaf, chunk));
+   return -EIO;
+   }
 
	ce = search_cache_extent(&map_tree->cache_tree, logical);
 
@@ -1607,7 +1635,6 @@ static int read_one_chunk(struct btrfs_root *root, struct 
btrfs_key *key,
return 0;
}
 
-   num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
map = kmalloc(btrfs_map_lookup_size(num_stripes), GFP_NOFS);
if (!map)
return -ENOMEM;
-- 
2.6.4
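As a quick standalone illustration of the alignment and power-of-two
rules above (my own example values, assuming a 4K sectorsize; the two
helpers are copied from the kerncompat.h hunk so this builds on its
own with gcc):

#include <stdio.h>

/* Same helpers as added to kerncompat.h above. */
#define IS_ALIGNED(x, a)	(((x) & ((typeof(x))(a) - 1)) == 0)

static inline int is_power_of_2(unsigned long n)
{
	return (n != 0 && ((n & (n - 1)) == 0));
}

int main(void)
{
	unsigned long sectorsize = 4096;

	/* chunk logical/length must be sector aligned */
	printf("logical 1064304640 aligned: %d\n",
	       IS_ALIGNED(1064304640UL, sectorsize));	/* 1 -> accepted */
	printf("logical 1064304641 aligned: %d\n",
	       IS_ALIGNED(1064304641UL, sectorsize));	/* 0 -> rejected */

	/* stripe_len must be a power of two */
	printf("stripe_len 65536 ok: %d\n", is_power_of_2(65536));	/* 1 */
	printf("stripe_len 65537 ok: %d\n", is_power_of_2(65537));	/* 0 */
	return 0;
}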





Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Duncan
Christoph Anton Mitterer posted on Tue, 15 Dec 2015 00:25:05 +0100 as
excerpted:

> On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote:
> 
>> I use noatime and nodiratime

> FYI: noatime implies nodiratime :-)

Was going to post that myself.  Is there some reason you:

a) use nodiratime when noatime is already enabled, despite the fact that 
the latter already includes the former, or

b) didn't sufficiently research the option (at least the current mount 
manpage documents that noatime includes nodiratime under both the noatime 
and nodiratime options, and at least some hint of that has been in the 
manpage for years as I recall reading it when I first read of nodiratime 
and checked whether my noatime options included it) before standardizing 
on it, or

c) might have actually been talking in general, and there's some mounts 
you don't actually choose to make noatime, but still want nodiratime, or

d) made a choice that isn't otherwise reflected in the above?  If so,
please describe, as it could be a learning experience for me, and
possibly others as well.

>> Finally Linus Torvalds has been quite vocal and consistent on the
>> general subject of the kernel not breaking user-space APIs no matter
>> what so I wouldn't have much hope for default kernel mount options
>> changes...

> He surely is right in general,... but when the point has been reached,
> where only a minority actually requires the feature... and the minority
> actually starts to suffer from that... it may change.

Generally speaking, the practical rule is that you don't break userspace, 
but that a break that isn't noticed and reported by someone within a few 
release cycles is considered OK, as obviously nobody who actually cares 
enough about the possibility of old userspace breaking on new kernels 
enough to test for it was (still) using that functionality anyway.  (This 
is sometimes known as the "if a tree falls in the forest and there's 
nobody around to hear it, did it actually fall", rule. =:^)

But if it's noticed and reported before the new behavior itself is locked 
into place by other userspace relying on it, the change in behavior must 
be reverted.  (There have actually been a few cases over the years where 
they went to rather exceptional lengths to make two otherwise 
incompatible userspace-exposed behaviors both continue to work for the 
userspace that expected that behavior, without actually coding in such 
obvious hacks as executable name conditionals or the like, as others have 
been known to do at times.  Sometimes these fixes do end up bending the 
rules a bit, particularly the no-policy-in-the-kernel rule, but they do 
reinforce the no-userspace-breakage rule.)

The possible workarounds include the handful of kernel compatibility 
options that, when enabled, preserve behavior that would otherwise 
break userspace, such as keeping old kernel-API procfs files and the 
like.

That practical rule does in effect make it possible to do userspace-
breaking changes, if you wait around long enough that nobody who would 
complain is still actually using the old behavior.


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: !PageLocked BUG_ON hit in clear_page_dirty_for_io

2015-12-14 Thread Chris Mason
On Tue, Dec 08, 2015 at 11:25:28PM -0500, Dave Jones wrote:
> Not sure if I've already reported this one, but I've been seeing this
> a lot this last couple days.
> 
> kernel BUG at mm/page-writeback.c:2654!
> invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN

We ended up discussing this in more detail on lkml, but I'll summarize
here.

There were two problems.  First lock_page() might not actually lock the
page in v4.4-rc4, it can bail out if a signal is pending.  This got
fixed just before v4.4-rc5, so if you were on rc4, upgrade asap.

Second, prepare_pages had a bug for single page writes:

From f0be89af049857bcc537a53fe2a2fae080e7a5bd Mon Sep 17 00:00:00 2001
From: Chris Mason 
Date: Mon, 14 Dec 2015 15:40:44 -0800
Subject: [PATCH] Btrfs: check prepare_uptodate_page() error code earlier

prepare_pages() may end up calling prepare_uptodate_page() twice if our
write only spans a single page.  But if the first call returns an error,
our page will be unlocked and it's not safe to call it again.

This bug goes all the way back to 2011, and it's not something commonly
hit.

While we're here, add a more explicit check for the page being truncated
away.  The bare lock_page() alone is protected only by good thoughts and
i_mutex, which we're sure to regret eventually.

Reported-by: Dave Jones 
Signed-off-by: Chris Mason 
---
 fs/btrfs/file.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 72e7346..0f09526 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1291,7 +1291,8 @@ out:
  * on error we return an unlocked page and the error value
  * on success we return a locked page and 0
  */
-static int prepare_uptodate_page(struct page *page, u64 pos,
+static int prepare_uptodate_page(struct inode *inode,
+struct page *page, u64 pos,
 bool force_uptodate)
 {
int ret = 0;
@@ -1306,6 +1307,10 @@ static int prepare_uptodate_page(struct page *page, u64 
pos,
unlock_page(page);
return -EIO;
}
+   if (page->mapping != inode->i_mapping) {
+   unlock_page(page);
+   return -EAGAIN;
+   }
}
return 0;
 }
@@ -1324,6 +1329,7 @@ static noinline int prepare_pages(struct inode *inode, 
struct page **pages,
int faili;
 
for (i = 0; i < num_pages; i++) {
+again:
pages[i] = find_or_create_page(inode->i_mapping, index + i,
   mask | __GFP_WRITE);
if (!pages[i]) {
@@ -1333,13 +1339,17 @@ static noinline int prepare_pages(struct inode *inode, 
struct page **pages,
}
 
if (i == 0)
-   err = prepare_uptodate_page(pages[i], pos,
+   err = prepare_uptodate_page(inode, pages[i], pos,
force_uptodate);
-   if (i == num_pages - 1)
-   err = prepare_uptodate_page(pages[i],
+   if (!err && i == num_pages - 1)
+   err = prepare_uptodate_page(inode, pages[i],
pos + write_bytes, false);
if (err) {
page_cache_release(pages[i]);
+   if (err == -EAGAIN) {
+   err = 0;
+   goto again;
+   }
faili = i - 1;
goto fail;
}
-- 
2.4.6



Re: attacking btrfs filesystems via UUID collisions?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 13:55 -0700, Chris Murphy wrote:
> I'm aware of this proof of concept. I'd put it, and this one, in the
> realm of a targeted attack, so it's not nearly as likely as other
> problems needing fixing. That doesn't mean don't understand it better
> so it can be fixed. It means understand before arriving at risk
> assessment let alone conclusions.
Assessing the actual risk of any such attack vector is IMHO quite
difficult... but past experience has shown countless times, over and
over again, that any system where people already saw that it would have
issues was sooner or later actively attacked.

Take all the things from online banking... TAN, iTAN... at some point,
with the two-factor auth via mobileTAN, some people already warned that
this would be rather easy to attack... banks and proponents of the
system said that this is rather not realistic in practice.
I think in Germany alone some 8 million Euros were stolen by hacking
mTANs last year.


> I didn't. I did state there are edge cases, not normal use. My
> criticism of dd for copying a volume is for general purpose copying,
> not edge cases.
Sure... but I guess we've never needed to argue about that.
If a howto were to be written on "how to best copy a btrfs filesystem"
and someone were to say "me! take dd"... I'd surely be on your side,
saying "Naaahh... stupid... you copy empty blocks and the like".

But here we're talking about something completely different... namely
all those cases where UUID collisions could happen, including those
where a bit-identical copy is, for whichever reason, the best solution.



> I already have, as have others.
So far you've only said it would be bad practice as it wouldn't work
well with filesystems that do use UUIDs.
I agree with what Austin gave you as an answer to that.


> Does the user want cake or pie? The computer doesn't have that level
> of granular information when there are two apparently bitwise
> identical devices.
I'm quite sure the computer has some concept of device path, and UUID
isn't the only way to identify a device. If that were so, then any
cloned ext4 would suffer from corruption as well, as the fs would
choose the device based on UUID.

btrfs does of course do more, especially in the multi-device case,...
where it needs to distinguish devices based on their content, not on
their path (which may be unstable).
But such a case can surely be detected, and as you said yourself below:

> So option a is to simply fail and let the user
> resolve the ambiguity.
... one could e.g. simply require the user to resolve the situation
manually.
And I guess that's exactly what I've written here several times in this
thread, for mounting situations, for rebuild/fsck/repair/etc.
situations.


>  Option b is maybe to leverage btrfs check code
> and find out if there's more to the story, some indication that one
> of
> the apparently identical copies isn't really identical.
Can't believe that this would be possible... if they're bitwise
identical, they're bitwise identical; the only thing that distinguishes
them is how they're connected, e.g. USB port 1, SATA port 2, etc.
But as this is unstable (just swap two SATA disks) it cannot be used.


> That's not something btrfs can resolve alone.
Sure, I've never demanded that.
I always said "handle it gracefully" (i.e. no corruptions, no new
mounts, fscks, etc.) and require the user to manually sort things out.
Not automagically determine which of the devices is actually the right
one and use it.
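Just to sketch in code what "handle it gracefully" could mean (purely
hypothetical, not the actual btrfs device-scan code; all names and
structures here are made up): when a scan sees a second device claiming
an already-registered (fsid, devid) pair at a different path, refuse to
auto-assemble and leave the decision to the admin.

#include <stdio.h>
#include <string.h>

/* Hypothetical sketch only -- not btrfs code.  A registry of scanned
 * devices keyed by (fsid, devid); a duplicate arriving from a different
 * path is flagged instead of silently replacing the registered one. */
struct scanned_dev {
	char fsid[37];
	unsigned long long devid;
	char path[64];
};

static struct scanned_dev known[16];
static int nknown;

static int register_device(const char *fsid, unsigned long long devid,
			   const char *path)
{
	for (int i = 0; i < nknown; i++) {
		if (strcmp(known[i].fsid, fsid) == 0 &&
		    known[i].devid == devid &&
		    strcmp(known[i].path, path) != 0) {
			fprintf(stderr,
				"fsid %s devid %llu seen at both %s and %s: "
				"refusing to auto-assemble, please resolve manually\n",
				fsid, devid, known[i].path, path);
			return -1;	/* ambiguity: fail loudly, do nothing */
		}
	}
	if (nknown >= 16)
		return -1;		/* registry full (toy limit) */
	snprintf(known[nknown].fsid, sizeof(known[nknown].fsid), "%s", fsid);
	known[nknown].devid = devid;
	snprintf(known[nknown].path, sizeof(known[nknown].path), "%s", path);
	nknown++;
	return 0;
}

int main(void)
{
	register_device("c0ffee00-1111-2222-3333-444455556666", 1, "/dev/sda1");
	register_device("c0ffee00-1111-2222-3333-444455556666", 1, "/dev/sdb1");
	return 0;
}

The only point is the policy: on ambiguity, fail loudly and do nothing,
rather than letting the kernel pick one of the two devices on its own.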


Cheers,
Chris.



[PATCH v3 2/2] btrfs: Enhance chunk validation check

2015-12-14 Thread Qu Wenruo
Enhance chunk validation:
1) Num_stripes
   We already have such a check, but only in the super block sys chunk
   array.
   Now check all on-disk chunks.

2) Chunk logical
   It should be aligned to sector size.
   This behavior should be *DOUBLE CHECKED* for 64K sector size
   systems like PPC64 or AArch64.
   Maybe we can find some hidden bugs.

3) Chunk length
   Same as chunk logical, should be aligned to sector size.

4) Stripe length
   It should be a power of 2.

5) Chunk type
   Any bit outside of TYPE_MASK | PROFILE_MASK is invalid.

With all these stricter rules, several fuzzed images reported on the
mailing list should no longer cause a kernel panic.

Reported-by: Vegard Nossum 
Signed-off-by: Qu Wenruo 

---
v2:
  Fix a typo where we forgot to return -EIO after the num_stripes check.
---
 fs/btrfs/volumes.c | 33 -
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9ea345f..bda84be 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6199,6 +6199,7 @@ static int read_one_chunk(struct btrfs_root *root, struct 
btrfs_key *key,
struct extent_map *em;
u64 logical;
u64 length;
+   u64 stripe_len;
u64 devid;
u8 uuid[BTRFS_UUID_SIZE];
int num_stripes;
@@ -6207,6 +6208,37 @@ static int read_one_chunk(struct btrfs_root *root, 
struct btrfs_key *key,
 
logical = key->offset;
length = btrfs_chunk_length(leaf, chunk);
+   stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
+   num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+   /* Validation check */
+   if (!num_stripes) {
+   btrfs_err(root->fs_info, "invalid chunk num_stripes: %u",
+ num_stripes);
+   return -EIO;
+   }
+   if (!IS_ALIGNED(logical, root->sectorsize)) {
+   btrfs_err(root->fs_info,
+ "invalid chunk logical %llu", logical);
+   return -EIO;
+   }
+   if (!length || !IS_ALIGNED(length, root->sectorsize)) {
+   btrfs_err(root->fs_info,
+   "invalid chunk length %llu", length);
+   return -EIO;
+   }
+   if (!is_power_of_2(stripe_len)) {
+   btrfs_err(root->fs_info, "invalid chunk stripe length: %llu",
+ stripe_len);
+   return -EIO;
+   }
+   if (~(BTRFS_BLOCK_GROUP_TYPE_MASK | BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+   btrfs_chunk_type(leaf, chunk)) {
+   btrfs_err(root->fs_info, "unrecognized chunk type: %llu",
+ ~(BTRFS_BLOCK_GROUP_TYPE_MASK |
+   BTRFS_BLOCK_GROUP_PROFILE_MASK) &
+ btrfs_chunk_type(leaf, chunk));
+   return -EIO;
+   }
 
	read_lock(&map_tree->map_tree.lock);
	em = lookup_extent_mapping(&map_tree->map_tree, logical, 1);
@@ -6223,7 +6255,6 @@ static int read_one_chunk(struct btrfs_root *root, struct 
btrfs_key *key,
em = alloc_extent_map();
if (!em)
return -ENOMEM;
-   num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
if (!map) {
free_extent_map(em);
-- 
2.6.3





[PATCH v3 1/2] btrfs: Enhance super validation check

2015-12-14 Thread Qu Wenruo
Enhance btrfs_check_super_valid() function by the following points:
1) Restrict sector/node size check
   Not only the old max/min validity check, but also check whether it's
   a power of 2. So a bogus number like a 12K node size won't pass now.

2) Super flag check
   For now, there is still some inconsistency between kernel and
   btrfs-progs super flags.
   And considering btrfs-progs may add new flags to the super block,
   this check will only output a warning.

3) Better root alignment check
   Now root bytenr is checked against sector size.

4) Move some checks into btrfs_check_super_valid(),
   like the node size vs leaf size check, the PAGESIZE vs sectorsize
   check, and the magic number check.

Reported-by: Vegard Nossum 
Signed-off-by: Qu Wenruo 
Reviewed-by: David Sterba 
---
v2:
  Make super flag check optional and won't cause mount failure.
v3:
  Remove an XXX in comment
---
 fs/btrfs/disk-io.c | 97 +++---
 1 file changed, 48 insertions(+), 49 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 617bf4f..ffa3ac6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -54,6 +54,12 @@
 #include 
 #endif
 
+#define BTRFS_SUPER_FLAG_SUPP  (BTRFS_HEADER_FLAG_WRITTEN |\
+BTRFS_HEADER_FLAG_RELOC |\
+BTRFS_SUPER_FLAG_ERROR |\
+BTRFS_SUPER_FLAG_SEEDING |\
+BTRFS_SUPER_FLAG_METADUMP)
+
 static const struct extent_io_ops btree_extent_io_ops;
 static void end_workqueue_fn(struct btrfs_work *work);
 static void free_fs_root(struct btrfs_root *root);
@@ -2727,26 +2733,6 @@ int open_ctree(struct super_block *sb,
goto fail_alloc;
}
 
-   /*
-* Leafsize and nodesize were always equal, this is only a sanity check.
-*/
-   if (le32_to_cpu(disk_super->__unused_leafsize) !=
-   btrfs_super_nodesize(disk_super)) {
-   printk(KERN_ERR "BTRFS: couldn't mount because metadata "
-  "blocksizes don't match.  node %d leaf %d\n",
-  btrfs_super_nodesize(disk_super),
-  le32_to_cpu(disk_super->__unused_leafsize));
-   err = -EINVAL;
-   goto fail_alloc;
-   }
-   if (btrfs_super_nodesize(disk_super) > BTRFS_MAX_METADATA_BLOCKSIZE) {
-   printk(KERN_ERR "BTRFS: couldn't mount because metadata "
-  "blocksize (%d) was too large\n",
-  btrfs_super_nodesize(disk_super));
-   err = -EINVAL;
-   goto fail_alloc;
-   }
-
features = btrfs_super_incompat_flags(disk_super);
features |= BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF;
if (tree_root->fs_info->compress_type == BTRFS_COMPRESS_LZO)
@@ -2818,17 +2804,6 @@ int open_ctree(struct super_block *sb,
sb->s_blocksize = sectorsize;
sb->s_blocksize_bits = blksize_bits(sectorsize);
 
-   if (btrfs_super_magic(disk_super) != BTRFS_MAGIC) {
-   printk(KERN_ERR "BTRFS: valid FS not found on %s\n", sb->s_id);
-   goto fail_sb_buffer;
-   }
-
-   if (sectorsize != PAGE_SIZE) {
-   printk(KERN_ERR "BTRFS: incompatible sector size (%lu) "
-  "found on %s\n", (unsigned long)sectorsize, sb->s_id);
-   goto fail_sb_buffer;
-   }
-
	mutex_lock(&fs_info->chunk_mutex);
	ret = btrfs_read_sys_array(tree_root);
	mutex_unlock(&fs_info->chunk_mutex);
@@ -3986,8 +3961,17 @@ static int btrfs_check_super_valid(struct btrfs_fs_info 
*fs_info,
  int read_only)
 {
struct btrfs_super_block *sb = fs_info->super_copy;
+   u64 nodesize = btrfs_super_nodesize(sb);
+   u64 sectorsize = btrfs_super_sectorsize(sb);
int ret = 0;
 
+   if (btrfs_super_magic(sb) != BTRFS_MAGIC) {
+   printk(KERN_ERR "BTRFS: no valid FS found\n");
+   ret = -EINVAL;
+   }
+   if (btrfs_super_flags(sb) & ~BTRFS_SUPER_FLAG_SUPP)
+   printk(KERN_WARNING "BTRFS: unrecognized super flag: %llu\n",
+   btrfs_super_flags(sb) & ~BTRFS_SUPER_FLAG_SUPP);
if (btrfs_super_root_level(sb) >= BTRFS_MAX_LEVEL) {
printk(KERN_ERR "BTRFS: tree_root level too big: %d >= %d\n",
btrfs_super_root_level(sb), BTRFS_MAX_LEVEL);
@@ -4005,31 +3989,46 @@ static int btrfs_check_super_valid(struct btrfs_fs_info 
*fs_info,
}
 
/*
-* The common minimum, we don't know if we can trust the 
nodesize/sectorsize
-* items yet, they'll be verified later. Issue just a warning.
+* Check sectorsize and nodesize first, other check will need it.
+* Check all possible sectorsize(4K, 8K, 16K, 32K, 64K) here.
 */
-   if 

Re: attacking btrfs filesystems via UUID collisions?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 08:23 -0500, Austin S. Hemmelgarn wrote:
> The reason that this isn't quite as high of a concern is because
> performing this attack requires either root access, or direct
> physical 
> access to the hardware, and in either case, your system is already 
> compromised.
Not necessarily.
Apart from the ATM image (where most people wouldn't call it
compromised just because it's openly accessible on the street), imagine
you're running a VM hosting service where you allow users to upload
images and have them deployed.
In the "cheap" case these will end up as regular files, where they
couldn't do any harm (even with colliding UUIDs)... but even there one
would have to expect that the hypervisor admin may losetup them for
whatever reason.
But if you offer more professional services, you may give your clients
e.g. direct access to some storage backend, which is then probably also
seen on the host by its kernel.
And here we already have the case that a client could remotely trigger
such a collision.

And remember, things only sound far-fetched until they actually happen
for the first time ;)


> I still think that that isn't a sufficient excuse for not fixing the 
> issue, as there are a number of non-security related issues that can 
> result from this (there are some things that are common practice with
> LVM or mdraid that can't be done with BTRFS because of this).
Sure, I guess we agree on that,...


> > Apart from that, btrfs should be a general purpose fs, and not just
> > a
> > desktop or server fs.
> > So edge cases like forensics (where it's common that you create
> > bitwise
> > identical images) shouln't be forgotten either.
> While I would normally agree, there are ways to work around this in
> the 
> forensics case that don't work for any other case (namely, if BTRFS
> is 
> built as a module, you can unmount everything, unload the module,
> reload 
> it, and only scan the devices you want).
see below (*)


> On that note, why exactly is it better to make the filesystem UUID
> such 
> an integral part of the filesystem?
Well, I think it's a proper way to e.g. handle the multi-device case.
You have n devices and you want to tell them apart,... using a
pseudo-random UUID is surely better than giving them numbers.
Same for the fs UUID, e.g. when used for mounting devices whose paths
aren't stable.

As said before, using the UUID isn't the problem - not protecting
against collisions is.


> The other thing I'm reading out of 
> this all, is that by writing a total of 64 bytes to a specific
> location 
> in a single disk in a multi-device BTRFS filesystem, you can make the
> whole filesystem fall apart, which is absolutely absurd.
Well,... I don't think that writing *into* the filesystem is covered
by common practice anymore.

In UNIX, a device (which holds the filesystem) is a file. Therefore one
can argue: if one copies/duplicates one file (i.e. the fs), neither of
the two's contents should get corrupted.
But if you actively write *into* the file yourself,... then you're
simply on your own; either you know what you're doing, or you may just
corrupt *that* specific file. Of course it should again not lead to any
of its clones becoming corrupted as well.



> And some recovery situations (think along the lines of no recovery
> disk, 
> and you only have busybox or something similar to work with).
(*) which is, however, also why you may not be able to unmount the
device anymore or unload btrfs.
Maybe you have reasons why you must/want to do the forensics on the
running system.


> > AFAIK, there's not even a solution right now, that copies a
> > complete
> > btrfs, with snapshots, etc. preserving all ref-links. At least
> > nothing
> > official that works in one command.
> Send-receive kind of works for that
I've added the "in one command" for that... O:-)
In case the btrfs has subvols/snapshots... the user would need to
handle the recursion himself...


Cheers,
Chris.



Re: Will "btrfs check --repair" fix the mounting problem?

2015-12-14 Thread Qu Wenruo



Ivan Sizov wrote on 2015/12/14 20:55 +0300:

2015-12-14 5:28 GMT+03:00 Qu Wenruo :

Not completely sure, but it may be related to a regression in 4.2.
The regression itself is already fixed, but not backported to 4.2 as far
as I know.

So, I'd recommend reverting to 4.1 and seeing if things get better.
Fortunately, btrfs already aborted the transaction before things got worse.


Nothing changed, mount also fails on 4.1.3.



I checked the filesystem extents:

$ sudo btrfs check --subvol-extents 5 /dev/sda1
Print extent state for subvolume 5 on /dev/sda1
UUID: 6de5c663-bc65-4120-8cf6-5309fd25aa7e
checksum verify failed on 159708168192 found 3659C180 wanted 8EE67C14
checksum verify failed on 159708168192 found 3659C180 wanted 8EE67C14
bytenr mismatch, want=159708168192, have=16968404070778227820
ERROR: while mapping refs: -5
extent_io.c:582: free_extent_buffer: Assertion `eb->refs < 0` failed.
btrfs(+0x51e9e)[0x56283f4bde9e]
btrfs(free_extent_buffer+0xc0)[0x56283f4be9b0]
btrfs(btrfs_free_fs_root+0x11)[0x56283f4aef11]
btrfs(rb_free_nodes+0x21)[0x56283f4d7cc1]
btrfs(close_ctree+0x194)[0x56283f4b0214]
btrfs(cmd_check+0x486)[0x56283f49ace6]
btrfs(main+0x82)[0x56283f47fad2]
/lib64/libc.so.6(__libc_start_main+0xf0)[0x7f8cbea98580]
btrfs(_start+0x29)[0x56283f47fbd9]
$



Did you try it without the '--subvol-extents 5' option?
And what's the output?


Yes, I tried it. The output is normal, no problems found (it shows the
UUID, then "checking extents", and that's all)!


Then it means it hit an assertion and no backtrace support was compiled
in. So I consider it the same as your backtrace.




And it may be a good idea to run btrfs-find-root -a, trying to find a
good copy of an old btrfs root tree.
It may work a miracle and make it RW again.


Thanks for the advice. "btrfs-find-root -a" is running at the moment.
What should I do after it completes? Should I just try RW mounting with
the found root, or isn't that safe?


You'll see output like the following:
Well block 29491200(gen: 5 level: 0) seems good, and it matches superblock
Well block 29376512(gen: 4 level: 0) seems good, but generation/level 
doesn't match, want gen: 5 level: 0


The matching one is not what you're looking for.
Try one whose generation is a little smaller than the matching one.

Then use btrfsck to test if it's OK:
$ btrfsck -r  /dev/sda1

Try 2~5 times with bytenrs whose generation is near the matching one.
If you're lucky, you will find one that doesn't crash btrfsck.

And if that doesn't produce many errors, then you can try btrfsck 
--repair -r  to fix it and try mounting.





+1 for the advice if you just want to use the backed-up things and get
back to normal life.


I already backed up the most important data (the whole disk space is
1,82 TB). But I want to solve this strange problem.


At least the direct cause is quite straightforward:

 checksum verify failed on 159708168192 found 3659C180 wanted 8EE67C14
 checksum verify failed on 159708168192 found 3659C180 wanted 8EE67C14
 bytenr mismatch, want=159708168192, have=16968404070778227820

The tree block at bytenr 159708168192 is damaged.
Its csum mismatched, and its bytenr doesn't match either.
Maybe the tree is damaged.
(And apparently, the btrfs transaction abort didn't do its job well.)

But it's hard to find the root cause, though.


If you still want to try a btrfs converted from ext*, I'd recommend
using the next release of btrfs-progs, and kernel 4.4 or 4.1.


Thanks,
Qu





Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 14 Dec 2015 15:27:11 -0500 as
excerpted:

> FWIW, both Duncan and I have our own copy of the sources patched to
> default to noatime, and I know a number of embedded Linux developers who
> do likewise, and I've even heard talk in the past of some distributions
> possibly using such patches themselves (although it always ends up not
> happening, because of Mutt).

And FWIW, while I was reasonably conservative with my original patch and 
simply defaulted to noatime, turning it off if any of the atime-enabling 
options were found, I'm beginning to think I might as well simply hard-
code noatime, removing the conditions.  This is due to initr* behavior 
that ends up not disabling atime for early, mostly virtual/memory-based 
filesystems like procfs, sysfs, devfs, tmp-on-tmpfs, etc, but could 
extend to initial initr* mount of the root filesystem as well, if I 
decide to make it rw on the kernel commandline or some such.

Of course atime on a memory-based-fs isn't normally a huge problem since 
its all memory-based anyway, and it would enable stuff like atime based 
tmpwatch since I do a tmpfs based tmp, so I've not worried about it 
much.  But at the same time, I'm now assuming noatime on my systems, and 
anything that breaks that assumption could trigger hard to trace down 
bugs, and hardcoding the noatime assumption would bring a consistency 
that I don't have ATM.

If/when I change my patch in that regard, I may look into adding other 
conditional options, perhaps defaulting to autodefrag if it's btrfs, for 
instance, if my limited sysadmin-not-developer-level patching/coding 
skills allow it.  I'd have to see...  But I'd certainly start with making 
autodefrag a default, not hard-coded, if I did patch in autodefrag, 
because while I don't have large VM images and the like, where autodefrag 
can be a performance bottleneck, to worry about now, I'd like to keep 
that option available for me in the future, and would thus make 
autodefrag the default, not hard-coded.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH] Btrfs: do not create empty block group if we have allocated data

2015-12-14 Thread Liu Bo
Currently we force creation of an empty block group to keep the data
profile alive; however, in the example below, we eventually get an
empty block group while we're trying to get more space for other types
(metadata/system):
- Before,
block group "A": size=2G, used=1.2G
block group "B": size=2G, used=512M

- After "btrfs balance start -dusage=50 mount_point",
block group "A": size=2G, used=(1.2+0.5)G
block group "C": size=2G, used=0

Since there is no data in block group C, it won't be deleted
automatically, and we don't get the unused 2G back until the next mount.

Balance itself just moves data and doesn't remove data, so it's safe
not to create such an empty block group if we already have data
allocated in other block groups.

Signed-off-by: Liu Bo 
---
 fs/btrfs/volumes.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4564522..14139c9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3400,6 +3400,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
u32 count_meta = 0;
u32 count_sys = 0;
int chunk_reserved = 0;
+   u64 bytes_used = 0;
 
/* step one make some room on all the devices */
	devices = &fs_info->fs_devices->devices;
@@ -3538,7 +3539,13 @@ again:
goto loop;
}
 
-   if ((chunk_type & BTRFS_BLOCK_GROUP_DATA) && !chunk_reserved) {
+   ASSERT(fs_info->data_sinfo);
+   spin_lock(&fs_info->data_sinfo->lock);
+   bytes_used = fs_info->data_sinfo->bytes_used;
+   spin_unlock(&fs_info->data_sinfo->lock);
+
+   if ((chunk_type & BTRFS_BLOCK_GROUP_DATA) &&
+   !chunk_reserved && !bytes_used) {
trans = btrfs_start_transaction(chunk_root, 0);
if (IS_ERR(trans)) {
			mutex_unlock(&fs_info->delete_unused_bgs_mutex);
-- 
1.7.1



Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:
> > When one starts to get a bit deeper into btrfs (from the admin/end-
> > user
> > side) one sooner or later stumbles across the recommendation/need
> > to
> > use nodatacow for certain types of data (DBs, VM images, etc.) and
> > the
> > reason, AFAIU, being the inherent fragmentation that comes along
> > with
> > the CoW, which is especially noticeable for those types of files
> > with
> > lots of random internal writes.
> It is worth pointing out that in the case of DB's at least, this is 
> because at least some of the do COW internally to provide the 
> transactional semantics that are required for many workloads.
Guess that also applies to some VM images then, IIRC qcow2 does CoW.



> > a) for performance reasons (when I consider our research software
> > which
> > often has IO as the limiting factor and where we want as much IO
> > being
> > used by actual programs as possible)...
> There are other things that can be done to improve this.  I would
> assume 
> of course that you're already doing some of them (stuff like using 
> dedicated storage controller cards instead of the stuff on the 
> motherboard), but some things often get overlooked, like actually
> taking 
> the time to fine-tune the I/O scheduler for the workload (Linux has 
> particularly brain-dead default settings for CFQ, and the deadline
> I/O 
> scheduler is only good in hard-real-time usage or on small hard
> drives 
> that actually use spinning disks).
Well sure, I think we've done most of this and have dedicated
controllers, at least of a quality that our funding allows ;-)
But regardless of how much one tunes and how good the hardware is: if
you then always lose a fraction of your overall IO, be it just 5%, to
defragging these types of files, one may actually want to avoid this
altogether, for which nodatacow seems *the* solution.


> The big argument for defragmenting a SSD is that it makes it such
> that 
> you require fewer I/O requests to the device to read a file
I've read about that too, but since I haven't had much personal
experience or many measurements in that respect, I didn't list it :)


> The problem is not entirely the lack of COW semantics, it's also the
> fact that it's impossible to implement an atomic write on a hard
> disk. 
Sure... but that's just the same for the nodatacow writes of data.
(And the same, AFAIU, for CoW itself, just that we'd notice any
corruption in case of a crash due to the CoWed nature of the fs and
could go back to the last generation).


> > but I wouldn't know that relational DBs really do cheksuming of the
> > data.
> All the ones I know of except GDBM and BerkDB do in fact provide the 
> option of checksumming.  It's pretty much mandatory if you want to be
> considered for usage in financial, military, or medical applications.
Hmm, I see... PostgreSQL seems to have it since 9.3... didn't know
that... only crc16, but at least something.


> > Long story short, it does happen every now and then, that a scrub
> > shows
> > file errors, for neither the RAID was broken, nor there were any
> > block
> > errors reported by the disks, or anything suspicious in SMART.
> > In other words, silent block corruption.
> Or a transient error in system RAM that ECC didn't catch, or a 
> undetected error in the physical link layer to the disks, or an error
> in 
> the disk cache or controller, or any number of other things.
Well sure,... I was referring to these particular cases, where silent
block corruption was the most likely reason.
The data was reproducibly read back identically, which probably rules
out bad RAM or a bad controller, etc.


>   BTRFS 
> could only protect against some cases, not all (for example, if you
> have 
> a big enough error in RAM that ECC doesn't catch it, you've got
> serious 
> issues that just about nothing short of a cold reboot can save you
> from).
Sure, I haven't claimed that checksumming for no-CoWed data is a
solution for everything.


> > But, AFAIU, not doing CoW, while not having a journal (or does it
> > have
> > one for these cases???) almost certainly means that the data (not
> > necessarily the fs) will be inconsistent in case of a crash during
> > a
> > no-CoWed write anyway, right?
> > Wouldn't it be basically like ext2?
> Kind of, but not quite.  Even with nodatacow, metadata is still COW, 
> which is functionally as safe as a traditional journaling filesystem 
> like XFS or ext4.
Sure, I was referring to the data part only, should have made that more
clear.


> Absolute worst case scenario for both nodatacow on 
> BTRFS, and a traditional journaling filesystem, the contents of the
> file 
> are inconsistent.  However, almost all of the things that are 
> recommended use cases for nodatacow (primarily database files and VM 
> images) have some internal method of detecting and dealing with 
> corruption (because of the traditional filesystem semantics ensuring 
> metadata consistency, but not data 

Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

2015-12-14 Thread Duncan
Christoph Anton Mitterer posted on Mon, 14 Dec 2015 02:44:55 +0100 as
excerpted:

> Two more on these:
> 
> On Thu, 2015-11-26 at 00:33 +, Hugo Mills wrote:
>> 3) When I would actually disable datacow for e.g. a subvolume that
>> > holds VMs or DBs... what are all the implications?

>> After snapshotting, modifications are CoWed precisely once, and
>> then it reverts to nodatacow again. This means that making a snapshot
>> of a nodatacow object will cause it to fragment as writes are made to
>> it.

> AFAIU, the one that gets fragmented then is the snapshot, right, and the
> "original" will stay in place where it was? (Which is of course good,
> because one probably marked it nodatacow, to avoid that fragmentation
> problem on internal writes).

No.  Or more precisely, keep in mind that from btrfs' perspective, in 
terms of reflinks, once made, there's no "original" in terms of special 
treatment, all references to the extent are treated the same.

What a snapshot actually does is create another reference (reflink) to an 
extent.  What btrfs normally does on change as a cow-based filesystem is 
of course copy-on-write the change.  What nocow does, in the absence of 
other references to that extent, is rewrite the change in-place.

But if there's another reference to that extent, the change can't be in-
place because that would change the file reached by that other reference 
as well, and the change was only to be made to one of them.  So in the 
case of nocow, a cow1 (one-time-cow) exception must be made, rewriting 
the changed data to a new location, as the old location continues to be 
referenced by at least one other reflink.

So (with the fact that writable snapshots are available and thus it can 
be the snapshot that changed if it's what was written to) the one that 
gets the changed fragment written elsewhere, thus getting fragmented, is 
the one that changed, whether that's the working copy or the snapshot of 
that working copy.

> I'd assume the same happens when I do a reflink cp.

Yes.  It's the same reflinking mechanism, after all.  If there's other 
reflinks to the extent, snapshot or otherwise, changes must be written 
elsewhere, even if they'd otherwise be nocow.
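
A rough way to watch that happen (an untested sketch, paths made up; assumes 
/mnt is a btrfs mount and filefrag is available):

  btrfs subvolume create /mnt/vm
  chattr +C /mnt/vm                      # new files in here inherit nodatacow
  dd if=/dev/zero of=/mnt/vm/disk.img bs=1M count=1024
  filefrag /mnt/vm/disk.img              # typically very few extents
  btrfs subvolume snapshot /mnt/vm /mnt/vm-snap
  dd if=/dev/urandom of=/mnt/vm/disk.img bs=4K count=100 seek=1000 conv=notrunc
  filefrag /mnt/vm/disk.img              # the rewritten blocks were cow1'd elsewhere

The extent count of whichever copy you actually write to is what grows.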

> Can one make a copy, where one still has atomicity (which I guess
> implies CoW) but where the destination file isn't heavily fragmented
> afterwards,... i.e. there's some pre-allocation, and then cp really does
> copy each block (just everything's at the state of time where I stared
> cp, not including any other internal changes made on the source in
> between).

The way that's handled is via ro snapshots which are then copied, which 
of course is what btrfs send does (at least in non-incremental mode, and 
incremental mode still uses the ro snapshot part to get atomicity), in 
effect.
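
In practice that's roughly (a sketch, paths are placeholders):

  btrfs subvolume snapshot -r /mnt/data /mnt/data.ro   # atomic point-in-time view
  cp -a --reflink=never /mnt/data.ro/. /backup/data/   # plain copy, no shared extents
  # or: btrfs send /mnt/data.ro | btrfs receive /backup/
  btrfs subvolume delete /mnt/data.ro

The destination is written as a fresh, ordinary copy, so it doesn't inherit 
the fragmentation of the source.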

> And one more:
> You both said, auto-defrag is generally recommended.
> Does that also apply for SSDs (where we want to avoid unnecessary
> writes)?
> It does seem to get enabled, when SSD mode is detected.
> What would it actually do on an SSD?

Did you mean it does _not_ seem to get (automatically) enabled, when SSD 
mode is detected, or that it _does_ seem to get enabled, when 
specifically included in the mount options, even on SSDs?

Or did you actually mean it the way you wrote it, that it seems to be 
enabled (implying automatically, along with ssd), when ssd mode is 
detected?

Because the latter would be a shock to me, as that behavior hasn't been 
documented anywhere, but I can't imagine it's actually doing it and that 
you actually meant what you actually wrote.


If you look waaayyy back to shortly before I did my first more or less 
permanent deployment (I had initially posted some questions and did an 
initial experimental deployment several months earlier, but it didn't 
last long, because $reasons), you'll see a post I made to the list with 
pretty much the same general question, autodefrag on ssd, or not.

I believe the most accurate short answer is that the benefit of 
autodefrag on SSD is fuzzy, and thus left to local choice/policy, without 
an official recommendation either way.
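
(Either way, as far as I know it's an explicit mount option; a minimal 
/etc/fstab sketch, with the UUID as a placeholder:

  UUID=<fs-uuid>  /data  btrfs  defaults,noatime,autodefrag  0  0

ssd itself is normally auto-detected and doesn't need to be listed.)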

There are two points that we know for certain: (1) the zero-seek-time of 
SSD effectively nullifies the biggest and most direct cost associated 
with fragmentation on spinning rust, thereby lessening the advantage of 
autodefrag as seen on spinning rust by an equally large degree, and (2) 
autodefrag will without question lead to a relatively limited number of 
near-time additional writes, as the rewrite is queued and eventually 
processed.

To the extent that an admin considers these undisputed factors alone, or 
weighs them less heavily than the more controversial factors below, 
they're likely to consider autodefrag on ssd a net negative and leave it 
off.

But I was persuaded by the discussion when I asked the question, to 
enable autodefrag on my all-ssd btrfs deployment here.  Why?  Those 
other, less direct and arguably less directly 

Re: [auto-]defrag, nodatacow - general suggestions? (was: btrfs: poor performance on deleting many large files?)

2015-12-14 Thread Duncan
Christoph Anton Mitterer posted on Mon, 14 Dec 2015 03:46:01 +0100 as
excerpted:

>> Same here.  In fact, my most anticipated feature is N-way-mirroring,
> Hmm ... not totally sure about that...
> AFAIU, N-way-mirroring is what the currently wrongly called
> RAID1 is in btrfs, i.e. having N replicas of everything on M devices,
> right?
> In other words, not being a N-parity-RAID and not guaranteeing that
> *any* N disks could fail, right?

No.  N-way-mirroring, at least in simplest form (as in md/raid1) is N 
replicas on N devices, so loss of N-1 devices is permitted without loss 
of data.

Normally the best thing about this is that unlike parity, once the 
general support is in, you can increase redundancy at will, with 
guaranteed device-loss protection of as many devices as you care to 
insure against.

At one point with somewhat old devices that I didn't particularly trust 
any more and because I had them from a previous raid6 setup, I was 
running 4-way-md/raid1.
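
(For reference, with md that was simply something like this, devices being 
placeholders:

  mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[bcde]1

i.e. four full replicas, any three of which could be lost.)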

Of course with md/raid1, the problem is lack of any sort of data 
integrity assurance, even scrubbing just arbitrarily chooses one and in 
the case of a difference, simply copies that to the others, without even 
a plurality vote for the most authoritative version.

With btrfs checksumming, the value of N-way-mirroring is increased 
dramatically, since it allows individual block verification and fallback, 
as opposed to whole-device-loss.

While my own sweet-spot balance will tend to be three-way, avoiding the 
"if one copy is bad (perhaps because of a device that's known failing/
failed), you better /hope/ your only remaining copy is good" problem of 
the present two-way-only solution, I could easily see people finding 
value in 4/5/6-way mirroring as well.

And of course if that is extended to raid10, three-way-mirroring, two-way-
striping, on six total devices, would be my preferred, over the three-way-
striped, two-way-mirrored, that's the only current choice for six-device 
btrfs raid10.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Kernel lockup, might be helpful log.

2015-12-14 Thread Duncan
Hugo Mills posted on Mon, 14 Dec 2015 08:35:24 + as excerpted:

> It's not just btrfs. Invalid opcode is the way that the kernel's BUG and
> BUG_ON macro is implemented.

Thanks.  I indicated that I suspected broader kernel use further down the 
reply, but it's very nice to have confirmation, both of invalid opcode 
use elsewhere, and of it being the kernel's general implementation for 
BUG and BUG_ON.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: freeze_bdev and scrub/re-balance

2015-12-14 Thread Wang, Zhiye
Thank you liubo for your reply.

But I am not very clear about your meaning of "It should be like that with COW 
enabled".

I'd like to confirm: if defragment/scrub/rebalance is in progress and my code 
calls "freeze_bdev" (in kernel code, or from user space via ioctl), can I get a 
consistent file system state? "Consistent file system state" means that if I 
take an LVM snapshot (or a hardware snapshot, or even "dd" if it can be done 
quickly enough) after calling freeze_bdev, the snapshot is file-system 
consistent.
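
For reference, what I have in mind is roughly the following (a sketch; paths 
and LVM names are placeholders, fsfreeze being the userspace counterpart of 
the FIFREEZE/FITHAW ioctls):

  fsfreeze -f /mnt/btrfs                               # flush and block new writes
  lvcreate -s -n btrfs_snap -L 10G /dev/vg0/btrfs_lv   # or hardware snapshot / dd
  fsfreeze -u /mnt/btrfs                               # thaw, resume writes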


Thanks
Mike


-Original Message-
From: Liu Bo [mailto:bo.li@oracle.com] 
Sent: Thursday, December 10, 2015 1:22 AM
To: Wang, Zhiye
Cc: linux-btrfs@vger.kernel.org
Subject: Re: freeze_bdev and scrub/re-balance

On Sat, Dec 05, 2015 at 09:57:18AM +, Wang, Zhiye wrote:
> Hi all,
> 
> 
> If I understand it correctly, defragment operation is done in user space 
> tools, while scrub/re-balance is done in kernel thread.

Defragment is done via an IOCTL, so it also works in the kernel.

> 
> 
> So, if my kernel module calls freeze_bdev when scrub/re-balance is in 
> progress, will I still be able to get a consistent file system state?

It should be like that with COW enabled.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Still not production ready

2015-12-14 Thread Duncan
Qu Wenruo posted on Mon, 14 Dec 2015 15:32:02 +0800 as excerpted:

> Oh, my poor English... :(

Well, as I said, native English speakers commonly enough mis-negate...

The real issue seems to be that English simply lacks proper support for 
the double-negatives feature that people keep wanting to use, despite the 
fact that it yields an officially undefined result that compilers (people 
reading/hearing) don't quite know what to do with, with actual results 
often throwing warnings and generally changing from compiler to 
compiler . =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs check inconsistency with raid1, part 1

2015-12-14 Thread Duncan
Chris Murphy posted on Mon, 14 Dec 2015 00:24:21 -0700 as excerpted:

>> Personally speaking, it may be a false alert from btrfsck.
>> So in this case, I can't provide much help.
>>
>> If you're brave enough, mount it rw to see what will happen(although it
>> may mount just OK).
> 
> I'm brave enough. I'll give it a try tomorrow unless there's another
> request for more info before then.

Given the off-by-one generations and my own btrfs raid1 experience, I'm 
guessing the likely result is a good mount and either no problems or a 
good initial mount but lockup once you try actually doing too much (like 
actually reading the affected blocks) with the filesystem.

Looks like a normal generation-out-of-sync condition, common with forced 
unsynced/not-remounted-ro shutdowns.  If so, btrfs should redirect reads 
to the updated current generation device, but you'll need to do a scrub 
to get everything 100% back in sync.

The catch I found, at least while I still had the then-failing (but not 
failed, it was just finding more and more sectors that needed to be 
redirected to spares) ssd in my raid1, together with an on-boot service 
that read a rather large dir into cache, was that after enough errors from 
the failing device, instead of continuing to redirect reads to the good 
device, btrfs just gave up, which resulted in a system crash here.

But when there weren't that many errors on the failing device, or when I 
intercepted the boot process and mounted everything but didn't run the 
normal post-mount services (systemd emergency target instead of my usual 
default multi-user), so the service that cached that dir didn't get a 
chance to run and all those errors weren't triggered, I could still mount 
normally.  From there, I could run scrub, which took care of the problem 
without triggering the usual too-many-errors crash, and after the scrub I 
could invoke normal multi-user mode, start all services including the 
caching service, and go about my usual business.

So if I'm correct, mount normally and scrub, and you should be fine, tho 
you may have to abort a normal boot if it accesses too many bad files, in 
order to be able to finish the scrub before a crash.
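
In other words, something along these lines (device and mountpoint are 
placeholders):

  mount /dev/sdX /mnt
  btrfs scrub start -Bd /mnt    # -B stays in the foreground, -d gives per-device stats
  btrfs scrub status /mnt

possibly from the emergency target, before the normal services get a chance 
to hammer the affected files.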

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk

2015-12-14 Thread Laurent Bonnaud
On 11/12/2015 15:21, Laurent Bonnaud wrote:

> The next step will be to run a "btrfs scrub" to check if data loss did 
> happen...

Scrubbing is now finished and it detected no errors.

-- 
Laurent.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


safety of journal based fs (was: Re: still kworker at 100% cpu…)

2015-12-14 Thread Martin Steigerwald
Hi!

Using a different subject for the journal-fs-related things, which are off 
topic but still interesting. It might make sense to move this to fsdevel-ml or 
the ext4/XFS mailing lists? Otherwise, I suggest we focus on BTRFS here. Still 
wanted to reply.

Am Montag, 14. Dezember 2015, 16:48:58 CET schrieb Qu Wenruo:
> Martin Steigerwald wrote on 2015/12/14 09:18 +0100:
> > Am Montag, 14. Dezember 2015, 10:08:16 CET schrieb Qu Wenruo:
> >> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
[…]
> >>> I am seriously consider to switch to XFS for my production laptop again.
> >>> Cause I never saw any of these free space issues with any of the XFS or
> >>> Ext4 filesystems I used in the last 10 years.
> >> 
> >> Yes, xfs and ext4 is very stable for normal use case.
> >> 
> >> But at least, I won't recommend xfs yet, and considering the nature or
> >> journal based fs, I'll recommend backup power supply in crash recovery
> >> for both of them.
> >> 
> >> Xfs already messed up several test environment of mine, and an
> >> unfortunate double power loss has destroyed my whole /home ext4
> >> partition years ago.
> > 
> > Wow. I have never seen this. Actual I teach journal filesystems being
> > quite
> > safe on power losses as long as cache flushes (former barrier)
> > functionality is active and working. With one caveat: It relies on one
> > sector being either completely written or not. I never seen any
> > scientific proof for that on usual storage devices.
> 
> The journal is used to be safe against power loss.
> That's OK.
> 
> But the problem is, when recovering journal, there is no journal of
> journal, to keep journal recovering safe from power loss.

But the journal itself should be safe, due to a journal commit being one sector? 
Of course, for the last changes without a journal commit it's: the stuff is gone.

> And that's the advantage of COW file system, no need of journal completely.
> Although Btrfs is less safe than stable journal based fs yet.
> 
> >> [xfs story]
> >> After several crash, xfs makes several corrupted file just to 0 size.
> >> Including my kernel .git directory. Then I won't trust it any longer.
> >> No to mention that grub2 support for xfs v5 is not here yet.
> > 
> > That is no filesystem metadata structure crash. It is a known issue with
> > delayed allocation. Same with Ext4. I teach this as well in my performance
> > analysis & tuning course.
> 
> Unfortunately, it's not about delayed allocation, as it's not a new
> file, it's file already here with contents in previous transaction.
> The workload should only rewrite the files.(Not sure though)

From what I know, the overwrite-after-truncate case is also related to the 
delayed allocation / deferred write thing: the file has been truncated to zero 
bytes in the journal, while no data has been written yet.

But then, for Ext4 / XFS it doesn't need to reallocate in this case.

> And for ext4 case, I'll see corrupted files, but not truncated to 0 size.
> So IMHO it may be related to xfs recovery behavior.
> But not sure as I never read xfs codes.

Journals only provide *metadata* consistency. Unless you use Ext4 with 
data=journal, which is supposed to be much slower, but in some workloads is 
actually faster. Even Andrew Morton had no explanation for that, however I do 
have an idea about it. Also, data=journal is interesting if you put the journal 
for a harddisk-based Ext4 onto an SSD or an SSD RAID 1 or so.

> > Also BTRFS in principle has this issue I believe.  As far as I am aware it
> > has a fix for the rename case, not using delayed allocation in the case.
> > Due to its COW nature it may not be affected at all however, I don´t
> > know.
> Anyway for rewrite case, none of these fs should truncate fs size to 0.
> However, it seems xfs doesn't follow the way though.
> Although I'm not 100% sure, as after that disaster I reinstall my test
> box using ext4.
> 
> (Maybe next time I should try btrfs, at least when it fails, I have my
> chance to submit new patches to kernel or btrfsck)

I do think it's the applications doing that when overwriting a file, rewriting a 
config file for example. It's either: write a new file and rename it over the old 
one, or truncate to zero bytes and rewrite.
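
The first pattern, roughly (a sketch; a real application would fsync() the new 
file, and ideally its directory, instead of a global sync):

  printf '%s\n' "...new contents..." > config.tmp
  sync                    # make sure the new data is on disk first
  mv config.tmp config    # rename() atomically replaces the old file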

Of course, it's different for databases or other files written into without 
rewriting them completely. But there you need data=journal on Ext4. XFS doesn't 
guarantee file data consistency at all in that case, unless the application 
serializes changes properly with fsync(), by using an in-application journal for 
the data to write.

> >> [ext4 story]
> >> For ext4, when recovering my /home partition after a power loss, a new
> >> power loss happened, and my home partition is doomed.
> >> Only several non-sense files are savaged.
> > 
> > During a fsck? Well that is quite a special condition I´d say. Of course I
> > think aborting an fsck should be safe at all time, but I wouldn´t be
> > surprised if it wasn´t.
> 
> Not only a fsck, anything doing journal replay will be affected, like
> mounting a dirty 

Re: Kernel lockup, might be helpful log.

2015-12-14 Thread Filipe Manana
On Sun, Dec 13, 2015 at 10:55 PM, Birdsarenice  wrote:
> I've finally finished deleting all those nasty unreliable Seagate drives
> from my array. During the process I crashed my server - over, and over, and
> over. Completely gone - screen blank, controls unresponsive, no network
> activity (no, I don't have root on btrfs - data only). Most annoying, but I
> think btrfs survived it all somehow - it's scrubbing now.
>
> Meanwhile, I did get lucky: At one crash I happened to be logged in and was
> able to hit dmesg seconds before it went completely. So what I have here is
> information that looks like it'll help you track down a rarely-encountered
> and hard-to-reproduce bug which can cause the system to lock up completely
> in event of certain types of hard drive failure. It might be nothing, but
> perhaps someone will find it of use - because it'd be a tricky one to both
> reproduce and get a good error report if it did occur.
>
> I see an 'invalid opcode' error in here, that's pretty unusual - and again
> it even gives a file name and line number to look at. The root cause of all
> my issues is the NCQ issue with Seagate 8TB archive drives, which is Someone
> Else's Problem - but I think some good can come of this, as these exotic
> forms of corruption and weird drive semi-failures have revealed ways in
> which btrfs's error handling could be made more graceful.
>
> Meanwhile I remain impressed that btrfs appears to have kept all my data
> intact even through all these issues.

Regarding the trace you got, from a BUG_ON, it's due to a regression
present in 4.2 and 4.3 kernels that got fixed in 4.4-rc. The fixes are
scheduled for the next stable releases of 4.2.x and 4.3.x. A ton of
people have hit this (one example report
http://www.spinics.net/lists/linux-btrfs/msg49766.html).



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Chris Murphy
On Mon, Dec 14, 2015 at 7:24 AM, Austin S. Hemmelgarn
 wrote:

>
> If you have software that actually depends on atimes, then that software is
> broken (and yes, I even feel this way about Mutt).  The way atimes are
> implemented on most systems breaks the semantics that almost everyone
> expects from them, because they get updated for anything that even looks
> sideways at the inode from across the room.  Most software that uses them
> expects them to answer the question 'When were the contents of this file
> last read?', but they can get updated even for stuff like calculating file
> sizes, listing directory contents, or modifying the file's metadata.

This Jonathan Corbet article still applies:
http://lwn.net/Articles/397442/

What a mess!

Hey. The 5 year anniversary was in July. Wanna bring it up again, Austin? Haha.
http://thread.gmane.org/gmane.linux.kernel.cifs/294

Users want file creation time. Specifically, an immutable time for
that file that persists across file system copies. The time of its
first occurrence on a particular volume is not useful information.
Getting that requires what seems to be an unlikely consensus.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: attacking btrfs filesystems via UUID collisions?

2015-12-14 Thread Austin S. Hemmelgarn

On 2015-12-13 19:27, Christoph Anton Mitterer wrote:

On Fri, 2015-12-11 at 16:06 -0700, Chris Murphy wrote:

For anything but a new and empty Btrfs volume

What's the influence of the fs being new/empty?


this hypothetical
attack would be a ton easier to do on LVM and mdadm raid because they
have a tiny amount of metadata to spoof compared to a Btrfs volume
with even a little bit of data on it.

Uhm I haven't said that other systems properly handle this kind of
attack. ;-)
Guess that would need to be evaluated...



  I think this concern is overblown.

I don't think so. Let me give you an example: There is an attack[0]
against crypto, where the attacker listens via a smartphone's
microphone and recovers key material from the acoustics of a computer where gnupg runs.
This is surely not an attack many people would have considered even
remotely possible, but in fact it works, at least under lab conditions.

I guess the same applies for possible attack vectors like this here.
The stronger actual crypto and the strong software gets in terms of
classical security holes (buffer overruns and so), the more attackers
will try to go alternative ways.
The reason that this isn't quite as high of a concern is because 
performing this attack requires either root access, or direct physical 
access to the hardware, and in either case, your system is already 
compromised.


I still think that that isn't a sufficient excuse for not fixing the 
issue, as there are a number of non-security related issues that can 
result from this (there are some things that are common practice with 
LVM or mdraid that can't be done with BTRFS because of this).



I'm suggesting bitwise identical copies being created is not what is
wanted most of the time, except in edge cases.

mhh... well, there's the VM case, e.g. duplicating a template VM,
booting it, deploying software. Guess that's already common enough.
There are people who want to use btrfs on top of LVM and using the
snapshot functionality of that... another use case.
Some people may want to use it on top of MD (for whatever reason)... at
least in the mirroring RAID case, the kernel would see the same btrfs
twice.
Also, using flat DM-RAID (and yes, people do use DM-RAID without LVM), 
using the DM-cache target, some multi-path setups, some shared storage 
setups, a couple of other DM targets, and probably a number of other 
things I haven't thought of yet.


Apart from that, btrfs should be a general purpose fs, and not just a
desktop or server fs.
So edge cases like forensics (where it's common that you create bitwise
identical images) shouln't be forgotten either.
While I would normally agree, there are ways to work around this in the 
forensics case that don't work for any other case (namely, if BTRFS is 
built as a module, you can unmount everything, unload the module, reload 
it, and only scan the devices you want).




If your workflow requires making an exact copy (for the shelf or
for
an emergency) then dd might be OK. But most often it's used
because
it's been easy, not because it's a good practice.

Ufff.. I wouldn't got that far to call something here bad or good
practice.


It's not just bad practice, it's sufficiently sloppy that it's very
nearly user sabotage. That this is due to innocent ignorance, and a
long standing practice that's bad advice being handed down from
previous generations doesn't absolve the practice and mean we should
invent esoteric work arounds for what is not a good practice. We have
all sorts of exhibits why it's not a good idea.

Well if you don't give any real arguments or technical reasons (apart
from "working around software that doesn't handle this well") I
consider this just repetition of the baseless claim that long standing
practise would be bad.
Agreed, if you can't substantiate _why_ it's bad practice, then you 
aren't making a valid argument.  The fact that there is software that 
doesn't handle it well would say to me based on established practice 
that that software is what's broken, not common practice.


The assumption that a UUID is actually unique is an inherently flawed 
one, because it depends both on the method of generation guaranteeing 
it's unique (and none of the defined methods guarantee that), and a 
distinct absence of malicious intent.



I disagree. It was due to the rudimentary nature of earlier
filesystems' metadata paradigm that it worked. That's no longer the
case.

Well in the end it's of course up to the developers to decide whether
this is acceptable or not, but being on the admin/end-user side, I can
at least say that not everyone on there would accept "this is no longer
the case" as valid explanation when their fs was corrupted or attacked.
On that note, why exactly is it better to make the filesystem UUID such 
an integral part of the filesystem?  The other thing I'm reading out of 
this all, is that by writing a total of 64 bytes to a specific location 
in a single disk in a multi-device BTRFS filesystem, you can make the 

Re: dear developers, can we have notdatacow + checksumming, plz?

2015-12-14 Thread Austin S. Hemmelgarn

On 2015-12-13 23:59, Christoph Anton Mitterer wrote:

(consider that question being asked with that face on: http://goo.gl/LQaOuA)

Hey.

I've had some discussions on the list these days about not having
checksumming with nodatacow (mostly with Hugo and Duncan).

They both basically told me it wouldn't be straightforwardly possible with CoW,
and Duncan thinks it may not be so much necessary, but none of them
could give me really hard arguments why it cannot work (or perhaps I
was just too stupid to understand them ^^)... while at the same time I
think that it would generally be of utmost importance to have checksumming
(real world examples below).

Also, I remember that in 2014, Ted Ts'o told me that there are some
plans ongoing to get data checksumming into ext4, with possibly even
some guy at RH actually doing it sooner or later.

Since these threads were rather admin-work-centric, developers may have
skipped them; therefore, I decided to write down some thoughts,
label them with a more attractive subject and give the topic some bigger
attention.
O:-)




1) Motivation why, it makes sense to have checksumming (especially also
in the nodatacow case)


I think of all major btrfs features I know of (apart from the CoW
itself and having things like reflinks), checksumming is perhaps the
one that distinguishes it the most from traditional filesystems.

Sure we have snapshots, multi-device support and compression - but we
could have had that as well with LVM and software/hardware RAID... (and
ntfs supported compression IIRC ;) ).
Of course, btrfs does all that in a much smarter way, I know, but it's
nothing generally new.
The *data* checksumming at filesystem level, to my knowledge, is new,
however. Especially that it's always verified. Awesome. :-)


When one starts to get a bit deeper into btrfs (from the admin/end-user
side) one sooner or later stumbles across the recommendation/need to
use nodatacow for certain types of data (DBs, VM images, etc.) and the
reason, AFAIU, being the inherent fragmentation that comes along with
the CoW, which is especially noticeable for those types of files with
lots of random internal writes.
It is worth pointing out that in the case of DBs at least, this is 
because at least some of them do COW internally to provide the 
transactional semantics that are required for many workloads.


Now Duncan implied that this could improve in the future, with the
auto-defragmentation getting (even) better, defrag becoming usable
again for those that do snapshots or reflinked copies, and btrfs itself
generally maturing more and more.
But I kinda wonder to what extent one will really be able to solve
what seems to me a CoW-inherent "problem"...
Even *if* one can make the auto-defrag much smarter, it would still
mean that such files, like big DBs, VMs, or scientific datasets that
are internally rewritten, may get more or less constantly defragmented.
That may be quite undesired...
a) for performance reasons (when I consider our research software which
often has IO as the limiting factor and where we want as much IO being
used by actual programs as possible)...
There are other things that can be done to improve this.  I would assume 
of course that you're already doing some of them (stuff like using 
dedicated storage controller cards instead of the stuff on the 
motherboard), but some things often get overlooked, like actually taking 
the time to fine-tune the I/O scheduler for the workload (Linux has 
particularly brain-dead default settings for CFQ, and the deadline I/O 
scheduler is only good in hard-real-time usage or on small hard drives 
that actually use spinning disks).
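For example (sda is a placeholder; the right values depend entirely on the 
workload, so treat this as a sketch, not a recommendation):

  cat /sys/block/sda/queue/scheduler               # e.g. "noop deadline [cfq]"
  echo deadline > /sys/block/sda/queue/scheduler   # switch this one device
  # or stay on cfq and tune it, e.g.:
  echo 0 > /sys/block/sda/queue/iosched/slice_idle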

b) SSDs...
Not really sure about that; btrfs seems to enable the autodefrag even
when an SSD is detected,... what is it doing? Placing the block in a
smart way on different chips so that accesses can be better
parallelised by the controller?
This really isn't possible with an SSD.  Except for NVMe and Open 
Channel SSD's, they use the same interfaces as a regular hard drive, 
which means you get absolutely no information about the data layout on 
the device.


The big argument for defragmenting an SSD is that it means 
you require fewer I/O requests to the device to read a file, and in most 
cases, the device will outlive its usefulness because of performance 
long before it dies due to wearing out the flash storage.

Anyway, (a) alone could already be argument enough not to solve the
problem by a smart [auto-]defrag, should that actually be implemented.

So I think having notdatacow is great and not just a workaround till
everything else gets better to handle these cases.
Thus, checksumming, which is such a vital feature, should also be
possible for that.
The problem is not entirely the lack of COW semantics, it's also the 
fact that it's impossible to implement an atomic write on a hard disk. 
If we could tell the disk 'ensure that this set of writes either all 
happen, or none of them happen', then we could do 

Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Austin S. Hemmelgarn

On 2015-12-12 17:15, Christoph Anton Mitterer wrote:

On Sat, 2015-11-28 at 06:49 +, Duncan wrote:

Christoph Anton Mitterer posted on Sat, 28 Nov 2015 04:57:05 +0100 as
excerpted:

Still, specifically for snapshots that's a bit unhandy, as one
typically
doesn't mount each of them... one rather mount e.g. the top level
subvol
and has a subdir snapshots there...
So perhaps the idea of having snapshots that are per se noatime is
still
not too bad.

Read-only snapshots?

So you basically mean that ro snapshots won't have their atime updated
even without noatime?
Well I guess that was anyway the recent behaviour of Linux filesystems,
and only very old UNIX systems updated the atime even when the fs was
set ro.
Unless things have changed very recently, even many modern systems 
update atime on read-only filesystems, unless the media itself is 
read-only.  This is part of the reason for some of the forensics tools 
out there that drop write commands to the block devices connected to them.



That'd do it, and of course you can toggle the read-
only property (see btrfs property and its btrfs-property manpage).

Sure, but then it would still be nice for rw snapshots.

I guess what I probably actually want is the ability to set noatime as
a property.
I'll add that in a "feature request" on the project ideas wiki.


Alternatively, mount the toplevel subvol read-only or noatime on one
mountpoint, and bind-mount it read-write or whatever other
appropriate

Well it's of course somehow possible... but that seems a bit ugly to
me... the best IMHO, would really be if one could set a property on
snapshots that marks them noatime.
If you have software that actually depends on atimes, then that software 
is broken (and yes, I even feel this way about Mutt).  The way atimes 
are implemented on most systems breaks the semantics that almost 
everyone expects from them, because they get updated for anything that 
even looks sideways at the inode from across the room.  Most software 
that uses them expects them to answer the question 'When were the 
contents of this file last read?', but they can get updated even for 
stuff like calculating file sizes, listing directory contents, or 
modifying the file's metadata.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfs: pull btrfs clone API to vfs layer

2015-12-14 Thread Christoph Hellwig
On Wed, Dec 09, 2015 at 12:40:33PM -0800, Darrick J. Wong wrote:
> I tried this patch series on ppc64 (w/ 32-bit powerpc userland) and I think
> it needs to fix up the compat ioctl to make the vfs call...

Might need a proper signoff for Al, unless he wants to directly fold it..
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: Enhance chunk validation check

2015-12-14 Thread David Sterba
On Tue, Dec 08, 2015 at 05:05:22PM +0800, Qu Wenruo wrote:
> +#define IS_ALIGNED(x, a)(((x) & ((typeof(x))(a) - 1)) == 0)
> +

> +static inline int is_power_of_2(unsigned long n)
> +{
> + return (n != 0 && ((n & (n - 1)) == 0));
> +}

Please move them to kerncompat.h
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: use linux/sizes.h to represent constants

2015-12-14 Thread Byongho Lee
We use many constants to represent size and offset values.  And to make
code readable we use '256 * 1024 * 1024' instead of '268435456' to
represent '256MB'.  However we can make it far more readable with 'SZ_256M',
which is defined in 'linux/sizes.h'.

So this patch replaces the 'xxx * 1024 * 1024' kind of expression with a
single 'SZ_xxxM' if 'xxx' is a power of 2, or with 'xxx * SZ_1M' if 'xxx' is
not a power of 2.  And I haven't touched '4096' & '8192' because they are
more intuitive than 'SZ_4K' & 'SZ_8K'.

Signed-off-by: Byongho Lee 
---
 fs/btrfs/ctree.c  |   2 +-
 fs/btrfs/ctree.h  |   5 +-
 fs/btrfs/disk-io.c|   2 +-
 fs/btrfs/disk-io.h|   4 +-
 fs/btrfs/extent-tree.c|  29 +++---
 fs/btrfs/extent_io.c  |   2 +-
 fs/btrfs/free-space-cache.c   |  10 +-
 fs/btrfs/inode-map.c  |   2 +-
 fs/btrfs/inode.c  |  22 ++---
 fs/btrfs/ioctl.c  |  23 +++--
 fs/btrfs/send.h   |   4 +-
 fs/btrfs/super.c  |   2 +-
 fs/btrfs/tests/extent-io-tests.c  |  11 ++-
 fs/btrfs/tests/free-space-tests.c | 186 --
 fs/btrfs/tests/inode-tests.c  |   2 +-
 fs/btrfs/volumes.c|  16 ++--
 fs/btrfs/volumes.h|   2 +-
 17 files changed, 147 insertions(+), 177 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 5b8e235c4b6d..cb7720f91a4a 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1555,7 +1555,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
*trans,
return 0;
}
 
-   search_start = buf->start & ~((u64)(1024 * 1024 * 1024) - 1);
+   search_start = buf->start & ~((u64)SZ_1G - 1);
 
if (parent)
btrfs_set_lock_blocking(parent);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a0165c6e6243..f8fd2a761d52 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -196,9 +197,9 @@ static int btrfs_csum_sizes[] = { 4 };
 /* ioprio of readahead is set to idle */
 #define BTRFS_IOPRIO_READA (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0))
 
-#define BTRFS_DIRTY_METADATA_THRESH(32 * 1024 * 1024)
+#define BTRFS_DIRTY_METADATA_THRESHSZ_32M
 
-#define BTRFS_MAX_EXTENT_SIZE (128 * 1024 * 1024)
+#define BTRFS_MAX_EXTENT_SIZE SZ_128M
 
 /*
  * The key defines the order in the tree, and so it also defines (optimal)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1eb08393bff0..79e8cff82212 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2809,7 +2809,7 @@ int open_ctree(struct super_block *sb,
 
fs_info->bdi.ra_pages *= btrfs_super_num_devices(disk_super);
fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
-   4 * 1024 * 1024 / PAGE_CACHE_SIZE);
+   SZ_4M / PAGE_CACHE_SIZE);
 
tree_root->nodesize = nodesize;
tree_root->sectorsize = sectorsize;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index adeb31830b9c..a407d1bcf821 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -19,7 +19,7 @@
 #ifndef __DISKIO__
 #define __DISKIO__
 
-#define BTRFS_SUPER_INFO_OFFSET (64 * 1024)
+#define BTRFS_SUPER_INFO_OFFSET SZ_64K
 #define BTRFS_SUPER_INFO_SIZE 4096
 
 #define BTRFS_SUPER_MIRROR_MAX  3
@@ -35,7 +35,7 @@ enum btrfs_wq_endio_type {
 
 static inline u64 btrfs_sb_offset(int mirror)
 {
-   u64 start = 16 * 1024;
+   u64 start = SZ_16K;
if (mirror)
return start << (BTRFS_SUPER_MIRROR_SHIFT * mirror);
return BTRFS_SUPER_INFO_OFFSET;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4b89680a1923..8cefe2c1a936 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -521,7 +521,7 @@ next:
else
last = key.objectid + key.offset;
 
-   if (total_found > (1024 * 1024 * 2)) {
+   if (total_found > SZ_2M) {
total_found = 0;
if (wakeup)
wake_up(_ctl->wait);
@@ -3328,7 +3328,7 @@ static int cache_save_setup(struct 
btrfs_block_group_cache *block_group,
 * If this block group is smaller than 100 megs don't bother caching the
 * block group.
 */
-   if (block_group->key.offset < (100 * 1024 * 1024)) {
+   if (block_group->key.offset < (100 * SZ_1M)) {
spin_lock(_group->lock);
block_group->disk_cache_state = BTRFS_DC_WRITTEN;
spin_unlock(_group->lock);
@@ -3428,7 +3428,7 @@ again:
 * taking up quite a bit since it's not folded into the other space
 * cache.
 */
-   num_pages = 

Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread David Sterba
On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote:
> Introduce a new mount option "nologreplay" to co-operate with "ro" mount
> option to get real readonly mount, like "norecovery" in ext* and xfs.
> 
> Since the new parse_options() need to check new flags at remount time,
> so add a new parameter for parse_options().
> 
> Signed-off-by: Qu Wenruo 
> Reviewed-by: Chandan Rajendra 
> Tested-by: Austin S. Hemmelgarn 

I've read the discussions around the change and from the user's POV I'd
suggest to add another mount option that would be just an alias for any
mount options that would implement the 'hard-ro' semantics.

Say it's called 'nowr'. Now it would imply 'nologreplay', but may cover
more options in the future.

 mount -o ro,nowr /dev/sdx /mnt

would work when switching kernels.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] btrfs: Enhance super validation check

2015-12-14 Thread David Sterba
On Tue, Dec 08, 2015 at 03:35:57PM +0800, Qu Wenruo wrote:
> @@ -4005,31 +3989,47 @@ static int btrfs_check_super_valid(struct 
> btrfs_fs_info *fs_info,
>   }
>  
>   /*
> -  * The common minimum, we don't know if we can trust the 
> nodesize/sectorsize
> -  * items yet, they'll be verified later. Issue just a warning.
> +  * Check sectorsize and nodesize first, some other check will need it.
> +  * XXX: Just do a favor for later subpage size check. Check all

Same as in v1: Please do not add new XXX or TODO markers to the sources.
The comment would be fine with just:

 "Check all possible sectorsizes (4K, 8K, 16K, 32K, 64K) here."

With that fixed,

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/4] vfs: return EINVAL for unsupported file types in clone

2015-12-14 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 

diff --git a/fs/read_write.c b/fs/read_write.c
index 1f0d3f1..6268ebc 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1528,7 +1528,7 @@ int vfs_clone_file_range(struct file *file_in, loff_t 
pos_in,
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
-   return -EOPNOTSUPP;
+   return -EINVAL;
 
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: use linux/sizes.h to represent constants

2015-12-14 Thread David Sterba
On Tue, Dec 15, 2015 at 01:42:10AM +0900, Byongho Lee wrote:
> We use many constants to represent size and offset values.  And to make
> code readable we use '256 * 1024 * 1024' instead of '268435456' to
> represent '256MB'.  However we can make it far more readable with 'SZ_256M',
> which is defined in 'linux/sizes.h'.
> 
> So this patch replaces the 'xxx * 1024 * 1024' kind of expression with a
> single 'SZ_xxxM' if 'xxx' is a power of 2, or with 'xxx * SZ_1M' if 'xxx' is
> not a power of 2.  And I haven't touched '4096' & '8192' because they are
> more intuitive than 'SZ_4K' & 'SZ_8K'.
> 
> Signed-off-by: Byongho Lee 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfs: pull btrfs clone API to vfs layer

2015-12-14 Thread Darrick J. Wong
On Wed, Dec 09, 2015 at 12:40:33PM -0800, Darrick J. Wong wrote:
> On Thu, Dec 03, 2015 at 12:59:50PM +0100, Christoph Hellwig wrote:
> > The btrfs clone ioctls are now adopted by other file systems, with NFS
> > and CIFS already having support for them, and XFS being under active
> > development.  To avoid growth of various slightly incompatible
> > implementations, add one to the VFS.  Note that clones are different from
> > file copies in several ways:
> > 
> >  - they are atomic vs other writers
> >  - they support whole file clones
> >  - they support 64-bit legth clones
> >  - they do not allow partial success (aka short writes)
> >  - clones are expected to be a fast metadata operation
> > 
> > Because of that it would be rather cumbersome to try to piggyback them on
> > top of the recent clone_file_range infrastructure.  The converse isn't
> > true and the clone_file_range system call could try clone file range as
> > a first attempt to copy, something that further patches will enable.
> > 
> > Based on earlier work from Peng Tao.
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  fs/btrfs/ctree.h|   3 +-
> >  fs/btrfs/file.c |   1 +
> >  fs/btrfs/ioctl.c|  49 ++-
> >  fs/cifs/cifsfs.c|  63 
> >  fs/cifs/cifsfs.h|   1 -
> >  fs/cifs/ioctl.c | 126 
> > +++-
> >  fs/ioctl.c  |  29 +++
> 
> I tried this patch series on ppc64 (w/ 32-bit powerpc userland) and I think
> it needs to fix up the compat ioctl to make the vfs call...

Bah, forgot to add:
Signed-off-by: Darrick J. Wong 

(Feel free to fold this three line chunk into the original patch...)

--D

> diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
> index dcf2653..70d4b10 100644
> --- a/fs/compat_ioctl.c
> +++ b/fs/compat_ioctl.c
> @@ -1580,6 +1580,10 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, 
> unsigned int, cmd,
> goto out_fput;
>  #endif
>  
> +   case FICLONE:
> +   case FICLONERANGE:
> +   goto do_ioctl;
> +
> case FIBMAP:
> case FIGETBSZ:
> case FIONREAD:
> 
> --D
> 
> >  fs/nfs/nfs4file.c   |  87 -
> >  fs/read_write.c |  72 +++
> >  include/linux/fs.h  |   7 ++-
> >  include/uapi/linux/fs.h |   9 
> >  11 files changed, 254 insertions(+), 193 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index ede7277..dd4733f 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -4025,7 +4025,6 @@ void btrfs_get_block_group_info(struct list_head 
> > *groups_list,
> >  void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
> >struct btrfs_ioctl_balance_args *bargs);
> >  
> > -
> >  /* file.c */
> >  int btrfs_auto_defrag_init(void);
> >  void btrfs_auto_defrag_exit(void);
> > @@ -4058,6 +4057,8 @@ int btrfs_fdatawrite_range(struct inode *inode, 
> > loff_t start, loff_t end);
> >  ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
> >   struct file *file_out, loff_t pos_out,
> >   size_t len, unsigned int flags);
> > +int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
> > +  struct file *file_out, loff_t pos_out, u64 len);
> >  
> >  /* tree-defrag.c */
> >  int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index e67fe6a..232e300 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -2925,6 +2925,7 @@ const struct file_operations btrfs_file_operations = {
> > .compat_ioctl   = btrfs_ioctl,
> >  #endif
> > .copy_file_range = btrfs_copy_file_range,
> > +   .clone_file_range = btrfs_clone_file_range,
> >  };
> >  
> >  void btrfs_auto_defrag_exit(void)
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 0f92735..85b1cae 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -3906,49 +3906,10 @@ ssize_t btrfs_copy_file_range(struct file *file_in, 
> > loff_t pos_in,
> > return ret;
> >  }
> >  
> > -static noinline long btrfs_ioctl_clone(struct file *file, unsigned long 
> > srcfd,
> > -  u64 off, u64 olen, u64 destoff)
> > +int btrfs_clone_file_range(struct file *src_file, loff_t off,
> > +   struct file *dst_file, loff_t destoff, u64 len)
> >  {
> > -   struct fd src_file;
> > -   int ret;
> > -
> > -   /* the destination must be opened for writing */
> > -   if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
> > -   return -EINVAL;
> > -
> > -   ret = mnt_want_write_file(file);
> > -   if (ret)
> > -   return ret;
> > -
> > -   src_file = fdget(srcfd);
> > -   if (!src_file.file) {
> > -   ret = -EBADF;
> > -   goto out_drop_write;
> > -   }
> 

Re: [4.3-rc4] scrubbing aborts before finishing

2015-12-14 Thread Henk Slager
[...]
>> > merkaba:~> btrfs fi sh /daten
>> > Label: 'daten'  uuid: […]
>> >
>> > Total devices 1 FS bytes used 227.23GiB
>> > devid1 size 230.00GiB used 230.00GiB path
[...]
>> > merkaba:~> btrfs fi df /daten
>> > Data, single: total=228.99GiB, used=226.79GiB
>> > System, single: total=4.00MiB, used=48.00KiB
>> > Metadata, single: total=1.01GiB, used=449.50MiB
>> > GlobalReserve, single: total=160.00MiB, used=0.00B

If this is still the fill-level of the storage device, then I think it
will also fail with 4.4-rcX and new enough tools.
AFAIK, scrub does writes (in metadata?), so I think a non-read-only
scrub command will fail when it can't allocate space. See all other
comments/threads w.r.t. allocated / free space.
Especially for an fs of this size, I would keep ~10% free at the
'device level' ( 227.23GiB would need to be 207.00GiB ) and also ~10%
at the 'chunk level' ( 226.79GiB would need to be 186.30GiB ).

Assuming you don't have snapshots, a 'btrfs fi defrag -r /daten'
might give some more room short-term, after you have first (re)moved
some files off the fs.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: attacking btrfs filesystems via UUID collisions?

2015-12-14 Thread Chris Murphy
On Mon, Dec 14, 2015 at 6:23 AM, Austin S. Hemmelgarn
 wrote:
>
> Agreed, if you can't substantiate _why_ it's bad practice, then you aren't
> making a valid argument.  The fact that there is software that doesn't
> handle it well would say to me based on established practice that that
> software is what's broken, not common practice.

The automobile is invented and due to the ensuing chaos, common
practice of doing whatever the F you wanted came to an end in favor of
rules of the road and traffic lights. I'm sure some people went
ballistic, but for the most part things were much better without the
brokenness or prior common practice.

So the fact we're going to have this problem with all file systems
that incorporate the volume UUID into the metadata stream, tells me
that the very rudimentary common practice of using dd needs to go
away, in general practice. I've already said data recovery (including
forensics) and sticking drives away on a shelf could be reasonable.
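(And for the clone-and-actually-use-it case, recent btrfs-progs (4.1 or
later, if I remember correctly) can at least give the copy a fresh UUID
before both devices are ever visible together:

  btrfstune -u /dev/sdY    # rewrites the fsid in the copy's metadata; unmounted only

so there's a cleaner alternative to just dd'ing and hoping.)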

> The assumption that a UUID is actually unique is an inherently flawed one,
> because it depends both on the method of generation guaranteeing it's unique
> (and none of the defined methods guarantee that), and a distinct absence of
> malicious intent.

http://www.ietf.org/rfc/rfc4122.txt
"A UUID is 128 bits long, and can guarantee uniqueness across space and time."

Also see security considerations in section 6.


> On that note, why exactly is it better to make the filesystem UUID such an
> integral part of the filesystem?  The other thing I'm reading out of this
> all, is that by writing a total of 64 bytes to a specific location in a
> single disk in a multi-device BTRFS filesystem, you can make the whole
> filesystem fall apart, which is absolutely absurd.


OK maybe I'm  missing something.

1. UUID is 128 bits. So where are you getting the additional 48 bytes from?
2. The volume UUID is in every superblock, which for all practical
purposes means at least two instances of that UUID per device.

Are you saying the file system falls apart when changing just one of
those volume UUIDs in one superblock? And how does it fall apart? I'd
say all volume UUID instances (each superblock, on every device)
should be checked and if any of them mismatch then fail to mount.

There could be some leveraging of the device WWN, or absent that its
serial number, propagated into all of the volume's devices (cross
referencing each other's devid to WWN or serial). That way there's a
means to differentiate. In the dd case, there would be a mismatch
between the real device WWN/serial number and the one written in
metadata on all drives, including the copy. This doesn't say what
policy should happen next, just that at least it's known there's a
mismatch.


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Lionel Bouton
Le 14/12/2015 21:27, Austin S. Hemmelgarn a écrit :
> AFAIUI, the _only_ reason that that is still the default is because of
> Mutt, and that won't change as long as some of the kernel developers
> are using Mutt for e-mail and the Mutt developers don't realize that
> what they are doing is absolutely stupid.
>

Mutt is often used as an example but tmpwatch uses atime by default too
and it's quite useful.

If you have a local cache of remote files for which you want a good hit
ratio and don't care too much about its exact size (you should have
Nagios/Zabbix/... alerting you when a filesystem reaches a %free limit
if you value your system's availability anyway), using tmpwatch with
cron to maintain it is only one single line away and does the job. For
an example of this particular case, on Gentoo the /usr/portage/distfiles
directory is used in one of the tasks you can uncomment to activate in
the cron.daily file provided when installing tmpwatch.
Using tmpwatch/cron is far more convenient than using a dedicated cache
(which might get tricky if the remote isn't HTTP-based, like an
rsync/ftp/nfs/... server or doesn't support HTTP IMS requests for example).
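For example, the whole job can be a one-liner in cron.daily (exact path and
flags vary by distro and tmpwatch version, so take this as a sketch):

  /usr/sbin/tmpwatch --atime 720 /usr/portage/distfiles   # drop anything not read for 30 days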
Some http frameworks put sessions in /tmp: in this case if you want
sessions to expire based on usage and not creation time, using tmpwatch
or similar with atime is the only way to clean these files. This can
even become a performance requirement: I've seen some servers slowing
down with tens/hundreds of thousands of session files in /tmp because it
was only cleaned at boot and the systems were almost never rebooted...

I use noatime and nodiratime on some BTRFS filesystems for performance
reasons: Ceph OSDs, heavily snapshotted first-level backup servers and
filesystems dedicated to database server files (in addition to
nodatacow) come to mind, but the cases where these options are really
useful even with BTRFS don't seem to be the common ones.

Finally Linus Torvalds has been quite vocal and consistent on the
general subject of the kernel not breaking user-space APIs no matter
what so I wouldn't have much hope for default kernel mount options
changes...

Lionel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Austin S. Hemmelgarn

On 2015-12-14 14:44, Christoph Anton Mitterer wrote:

On Mon, 2015-12-14 at 14:33 -0500, Austin S. Hemmelgarn wrote:

The traditional reasoning was that read-only meant that users
couldn't
change anything

And I'd count the atime changes as part of that, though.
The atimes wouldn't change magically, but only because the user started
some program, configured some daemon, etc. ... which reads/writes/etc.
the file.
But reading the file is allowed, which is where this starts to get 
ambiguous.  Reading a file updates the atime (and in fact, this is the 
way that most stuff that uses them cares about them), but even a ro 
mount allows reading the file.  The traditional meaning of ro on UNIX 
was (AFAIUI) that directory structure couldn't change, new files 
couldn't be created, existing files couldn't be deleted, flags on the 
inodes couldn't be changed, and file data couldn't be changed.  TBH, I'm 
not even certain that atime updates on ro filesystems was even an 
intentional thing in the first place, it really sounds to me like the 
type of thing that somebody forgot to put in a permissions check for, 
and then people thought it was a feature.




, not that the actual data on disk wouldn't change.
That, and there's been some really brain-dead software over the years
that depended on atimes being right (now, the only remaining software
I
know of that even uses them at all is Mutt).

Wasn't tmpwatch another candidate?
Most such software can use it, but doesn't depend on it.  TBH, many 
people these days run /tmp (and even /var/tmp) on an in memory 
filesystem, so atime updates aren't as much of an issue there.  Also, 
even with noatime, I'm pretty sure the VFS updates the atime every time 
the mtime changes (because not doing so would be somewhat stupid, and 
you're writing the inode anyway), which technically means that stuff 
could work around this by opening the file, truncating it to the size it 
already is, and then closing it.




This should be 'Nothing on the backing device may change as a result
of
the FS', nitpicking I know, but we should be specific so that we
hopefully avoid ending up in the same situation again.

Of course, you're right! :-)

(especially when btrfs should ever be formalised in a standards
document, this should read like:

hard-ro: Nothing on the backing device may change as a result of the
FS; however, e.g. malware may directly destroy the data on the
blockdevice ;-)



Chris.





Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Austin S. Hemmelgarn

On 2015-12-14 14:39, Christoph Anton Mitterer wrote:

On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:

Unless things have changed very recently, even many modern systems
update atime on read-only filesystems, unless the media itself is
read-only.

Seriously? Oh... *sigh*...
You mean as in Linux, ext*, xfs?
Possibly, I know that Windows 7 does it, and I think OS X and OpenBSD do 
it, but I'm not sure about Linux.



If you have software that actually depends on atimes, then that
software
is broken (and yes, I even feel this way about Mutt).

I don't disagree here :D


The way atimes
are implemented on most systems breaks the semantics that almost
everyone expects from them, because they get updated for anything
that
even looks sideways at the inode from across the room.  Most software
that uses them expects them to answer the question 'When were the
contents of this file last read?', but they can get updated even for
stuff like calculating file sizes, listing directory contents, or
modifying the file's metadata.

Sure... my point here again was, that I try to look every now and then
at the whole thing from the pure-end-user side:
For them, the default is relatime, and they likely may not want to
change that because they have no clue on how much further effects this
may have (or not).
So as long as Linux doesn't change its defaults to noatime, leaving
things up to broken software (i.e. to get fixed), I think it would be
nice for the end-user to have e.g. snapshots be "safe" (from the
write-amplification on read) out of the box.
AFAIUI, the _only_ reason that that is still the default is because of 
Mutt, and that won't change as long as some of the kernel developers are 
using Mutt for e-mail and the Mutt developers don't realize that what 
they are doing is absolutely stupid.


FWIW, both Duncan and I have our own copy of the sources patched to 
default to noatime, and I know a number of embedded Linux developers who 
do likewise, and I've even heard talk in the past of some distributions 
possibly using such patches themselves (although it always ends up not 
happening, because of Mutt).


My idea would be basically, that having a noatime btrfs-property, which
is perhaps even set automatically, would be an elegant way of doing
that.
I just haven't had time to properly write that up and add it as a
"feature request" to the projects idea wiki page.

I like this idea.



Cheers,
Chris.





Re: Still not production ready

2015-12-14 Thread Austin S. Hemmelgarn

On 2015-12-14 14:08, Chris Murphy wrote:

On Mon, Dec 14, 2015 at 5:10 AM, Duncan <1i5t5.dun...@cox.net> wrote:

Qu Wenruo posted on Mon, 14 Dec 2015 15:32:02 +0800 as excerpted:


Oh, my poor English... :(


Well, as I said, native English speakers commonly enough mis-negate...

The real issue seems to be that English simply lacks proper support for
the double-negatives feature that people keep wanting to use, despite the
fact that it yields an officially undefined result that compilers (people
reading/hearing) don't quite know what to do with, with actual results
often throwing warnings and generally changing from compiler to
compiler . =:^)


It's a trap! Haha. Yeah like you say, it's not a matter of poor
English. Qu writes very understandable English. Officially in English
the negatives should cancel, which is different in many other
languages where additional negatives amplify. But even native English
speakers have dialects where it amplifies, rather than cancels. So I'd
consider the double or multiple negative in English as a
colloquialism. And a trap!


Some days I really wish Esperanto or Interlingua had actually caught on...

Or even Lojban, at least then the language would be more like the 
systems being discussed, even if it would be a serious pain to learn and 
use.



Re: [4.3-rc4] scrubbing aborts before finishing

2015-12-14 Thread Henk Slager
On Mon, Dec 14, 2015 at 6:31 PM, Henk Slager  wrote:
> [...]
>>> > merkaba:~> btrfs fi sh /daten
>>> > Label: 'daten'  uuid: […]
>>> >
>>> > Total devices 1 FS bytes used 227.23GiB
>>> > devid1 size 230.00GiB used 230.00GiB path
> [...]
>>> > merkaba:~> btrfs fi df /daten
>>> > Data, single: total=228.99GiB, used=226.79GiB
>>> > System, single: total=4.00MiB, used=48.00KiB
>>> > Metadata, single: total=1.01GiB, used=449.50MiB
>>> > GlobalReserve, single: total=160.00MiB, used=0.00B
>
> If this is still the fill-level of the storage device, then also with
> 4.4-rcX and new enough tools it will fail I think.
> AFAIK, scrub does writes (in metadata?), so I think a non-read-only
> scrub command fails when it cannot allocate space. See all other comments/threads
> w.r.t. allocated / free space.
> Especially an fs of this size, I would keep ~10% free on
> 'device-level' ( 227.23GiB would need to be 207.00GiB ) and also ~10%
> on 'chunk-level' ( 226.79GiB would need to be 186.30GiB ).
>
> Assuming you don't have snapshots, a   btrfs fi defrag -r /daten
> might give some more room short-term, after you just (re)moved files
> off the fs first.
# btrfs fi defrag -r -clzo /daten
I meant.


Re: Still not production ready

2015-12-14 Thread Chris Murphy
On Mon, Dec 14, 2015 at 5:10 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Qu Wenruo posted on Mon, 14 Dec 2015 15:32:02 +0800 as excerpted:
>
>> Oh, my poor English... :(
>
> Well, as I said, native English speakers commonly enough mis-negate...
>
> The real issue seems to be that English simply lacks proper support for
> the double-negatives feature that people keep wanting to use, despite the
> fact that it yields an officially undefined result that compilers (people
> reading/hearing) don't quite know what to do with, with actual results
> often throwing warnings and generally changing from compiler to
> compiler . =:^)

It's a trap! Haha. Yeah like you say, it's not a matter of poor
English. Qu writes very understandable English. Officially in English
the negatives should cancel, which is different in many other
languages where additional negatives amplify. But even native English
speakers have dialects where it amplifies, rather than cancels. So I'd
consider the double or multiple negative in English as a
colloquialism. And a trap!


-- 
Chris Murphy


btrfs-progs: enhanced btrfsck progress patch proposition

2015-12-14 Thread Stéphane Lesimple

Hi,

I've been forking btrfs-progs locally to add an enhanced progress 
indicator, based on the work from Silvio Fricke posted here in 
September. I'm using it routinely, it has been of help when I was 
debugging my multi-Tb btrfs system, where I often had to use btrfs 
check. So I thought it might be of interest to others.


The patch replaces the .oOo. progress indicator with a count of the 
walked items (depending on the "thing" being inspected, this way if the 
walks stops or there's an infinite loop somewhere, you'll notice), adds 
an elapsed time indicator, and a step counter (currently 6).


Once you're used to your FS, you'll know roughly how much remaining time 
to expect. Here's what it looks like:


Opening filesystem to check...
Checking filesystem on /dev/mapper/luks-WD30EZRX-WCAWZ3013164
UUID: 428b20da-dcb1-403e-b407-ba984fd07ebd
[1/6] checking extents (0:00:59 elapsed, 199559 chunk items checked)
[2/6] checking free space cache (0:00:30 elapsed, 2547 cache objects 
checked)

[3/6] checking fs roots (0:02:33 elapsed, 19731 tree blocks checked)
[4/6] checking csums (0:01:37 elapsed, 391340 csums checked)
[5/6] checking root refs (0:00:00 elapsed, 7 root refs checked)
[6/6] checking quota groups skipped (not enabled on this FS)
found 2786490501673 bytes used err is 0
total csum bytes: 2717987856
total tree bytes: 3270934528
total fs tree bytes: 324255744
total extent tree bytes: 35897344
btree space waste bytes: 208202162
file data blocks allocated: 2783900614656
 referenced 2783900614656
btrfs-progs v4.2.3-dirty

It's available here https://github.com/speed47/btrfs-progs/tree/progress

Or, alternatively:

$ git clone https://github.com/speed47/btrfs-progs.git
$ cd btrfs-progs
$ git checkout origin/progress

The master branch is the kdave's one.
The patch is kind of hacky (especially for the qgroup step, which I'm 
not proud of), but if it builds interest here, I'll clean it up and post 
it the right way to the mailing-list.


Regards,

--
Stéphane.


Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
> Unless things have changed very recently, even many modern systems
> update atime on read-only filesystems, unless the media itself is 
> read-only.
Seriously? Oh... *sigh*...
You mean as in Linux, ext*, xfs?

> If you have software that actually depends on atimes, then that
> software 
> is broken (and yes, I even feel this way about Mutt).
I don't disagree here :D

> The way atimes 
> are implemented on most systems breaks the semantics that almost 
> everyone expects from them, because they get updated for anything
> that 
> even looks sideways at the inode from across the room.  Most software
> that uses them expects them to answer the question 'When were the 
> contents of this file last read?', but they can get updated even for 
> stuff like calculating file sizes, listing directory contents, or 
> modifying the file's metadata.
Sure... my point here again was, that I try to look every now and then
at the whole thing from the pure-end-user side:
For them, the default is relatime, and they likely may not want to
change that because they have no clue on how much further effects this
may have (or not).
So as long as Linux doesn't change its defaults to noatime, leaving
things up to broken software (i.e. to get fixed), I think it would be
nice for the end-user to have e.g. snapshots be "safe" (from the
write-amplification on read) out of the box.

My idea would be basically, that having a noatime btrfs-property, which
is perhaps even set automatically, would be an elegant way of doing
that.
I just haven't had time to properly write that up and add it as a
"feature request" to the projects idea wiki page.


Cheers,
Chris.



Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 18:32 +0100, David Sterba wrote:
> I've read the discussions around the change and from the user's POV
> I'd
> suggest to add another mount option that would be just an alias for
> any
> mount options that would implement the 'hard-ro' semantics.
Nice to hear... 


> Say it's called 'nowr'
though I'm deeply saddened that you don't like my proposed "hard-ro",
which I thought about for nearly 1s ;-)

>  mount -o ro,nowr /dev/sdx /mnt
Sounds reasonable... I mean that, as long as ro's documentation
points to "nowr" and clearly states whether both (ro+nowr) are required
to get the desired behaviour, I have no very strong opinion on whether
both (ro+nowr) should be required, or whether nowr should imply ro.
Though I think the latter may be better.

Thanks,
Chris.



Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 12:50 -0500, Austin S. Hemmelgarn wrote:
> It should also imply noatime.  I'm not sure how BTRFS handles atime
> when 
> mounted RO, but I know a lot of old UNIX systems updated atime even
> on 
> filesystems mounted RO, and I know that at least at one point Linux
> did too.
I stumbled over that recently myself, and haven't bothered to try it
out, yet.
But Duncan's argument, why at least ro-snapshots (yes I know, this may
not be exactly the same as RO mount option) would need to imply
noatime, is pretty convincing. :)

Anyway, if it "ro" wouldn't imply noatime, I would ask why, because the
atime is definitely something the fs exports normally to userland,...
and that's how I'd basically consider hard-ro vs. (soft-)ro:

soft-ro: data as visible by the mounted fs must not change (unless
         perhaps for necessary repair/replay operations to get the 
         filesystem back in a consistent state)
hard-ro: soft-ro + nothing on the backing devices may change (bitwise)


Cheers,
Chris.



Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Austin S. Hemmelgarn

On 2015-12-14 14:16, Christoph Anton Mitterer wrote:

On Mon, 2015-12-14 at 12:50 -0500, Austin S. Hemmelgarn wrote:

It should also imply noatime.  I'm not sure how BTRFS handles atime
when
mounted RO, but I know a lot of old UNIX systems updated atime even
on
filesystems mounted RO, and I know that at least at one point Linux
did too.

I stumbled over that recently myself, and haven't bothered to try it
out, yet.
But Duncan's argument, why at least ro-snapshots (yes I know, this may
not be exactly the same as RO mount option) would need to imply
noatime, is pretty convincing. :)
The traditional reasoning was that read-only meant that users couldn't 
change anything, not that the actual data on disk wouldn't change. 
That, and there's been some really brain-dead software over the years 
that depended on atimes being right (now, the only remaining software I 
know of that even uses them at all is Mutt).


Anyway, if it "ro" wouldn't imply noatime, I would ask why, because the
atime is definitely something the fs exports normally to userland,...
and that's how I'd basically consider hard-ro vs. (soft-)ro:

soft-ro: data as visible by the mounted fs must not change (unless
  perhaps for necessary repair/replay operations to get the
  filesystem back in a consistent state)
hard-ro: soft-ro + nothing on the backing devices may change (bitwise)
This should be 'Nothing on the backing device may change as a result of 
the FS', nitpicking I know, but we should be specific so that we 
hopefully avoid ending up in the same situation again.




Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 14:33 -0500, Austin S. Hemmelgarn wrote:
> The traditional reasoning was that read-only meant that users
> couldn't 
> change anything
Which is where I'd count the atime changes, however.
The atimes wouldn't change magically, but only because the user started
some program, configured some daemon, etc. ... which reads/writes/etc.
the file.


> , not that the actual data on disk wouldn't change. 
> That, and there's been some really brain-dead software over the years
> that depended on atimes being right (now, the only remaining software
> I 
> know of that even uses them at all is Mutt).
Wasn't tmpwatch another candidate?


> This should be 'Nothing on the backing device may change as a result
> of 
> the FS', nitpicking I know, but we should be specific so that we 
> hopefully avoid ending up in the same situation again.
Of course, you're right! :-)

(especially when btrfs should ever be formalised in a standards
document, this should read like:
>hard-ro: Nothing on the backing device may change as a result of the
>FS; however, e.g. malware may directly destroy the data on the
>blockdevice ;-)


Chris.



Re: [PATCH v3] btrfs: Introduce new mount option to disable tree log replay

2015-12-14 Thread Austin S. Hemmelgarn

On 2015-12-14 12:32, David Sterba wrote:

On Thu, Dec 10, 2015 at 10:34:06AM +0800, Qu Wenruo wrote:

Introduce a new mount option "nologreplay" to co-operate with "ro" mount
option to get real readonly mount, like "norecovery" in ext* and xfs.

Since the new parse_options() need to check new flags at remount time,
so add a new parameter for parse_options().

Signed-off-by: Qu Wenruo 
Reviewed-by: Chandan Rajendra 
Tested-by: Austin S. Hemmelgarn 


I've read the discussions around the change and from the user's POV I'd
suggest to add another mount option that would be just an alias for any
mount options that would implement the 'hard-ro' semantics.

Say it's called 'nowr'. Now it would imply 'nologreplay', but may cover
more options in the future.
It should also imply noatime.  I'm not sure how BTRFS handles atime when 
mounted RO, but I know a lot of old UNIX systems updated atime even on 
filesystems mounted RO, and I know that at least at one point Linux did too.


  mount -o ro,nowr /dev/sdx /mnt

would work when switching kernels.


I like this idea, but I think that having a name like true-ro or hard-ro 
and making it imply ro (and noatime) would probably be better (or at 
least, simpler to use from a user perspective).




Re: Will "btrfs check --repair" fix the mounting problem?

2015-12-14 Thread Ivan Sizov
2015-12-14 5:28 GMT+03:00 Qu Wenruo :
> Not completely sure, but it may be related to a regression in 4.2.
> The regression itself is already fixed, but is not backported to 4.2 as far
> as I know.
>
> So, I'd recommend to revert to 4.1 and see if things get better.
> Fortunately, btrfs already aborted the transaction before things get worse.

Nothing changed, mount also fails on 4.1.3.


>>> I checked the filesystem extents:
>>>
>>> $ sudo btrfs check --subvol-extents 5 /dev/sda1
>>> Print extent state for subvolume 5 on /dev/sda1
>>> UUID: 6de5c663-bc65-4120-8cf6-5309fd25aa7e
>>> checksum verify failed on 159708168192 found 3659C180 wanted 8EE67C14
>>> checksum verify failed on 159708168192 found 3659C180 wanted 8EE67C14
>>> bytenr mismatch, want=159708168192, have=16968404070778227820
>>> ERROR: while mapping refs: -5
>>> extent_io.c:582: free_extent_buffer: Assertion `eb->refs < 0` failed.
>>> btrfs(+0x51e9e)[0x56283f4bde9e]
>>> btrfs(free_extent_buffer+0xc0)[0x56283f4be9b0]
>>> btrfs(btrfs_free_fs_root+0x11)[0x56283f4aef11]
>>> btrfs(rb_free_nodes+0x21)[0x56283f4d7cc1]
>>> btrfs(close_ctree+0x194)[0x56283f4b0214]
>>> btrfs(cmd_check+0x486)[0x56283f49ace6]
>>> btrfs(main+0x82)[0x56283f47fad2]
>>> /lib64/libc.so.6(__libc_start_main+0xf0)[0x7f8cbea98580]
>>> btrfs(_start+0x29)[0x56283f47fbd9]
>>> $
>
>
> Did you tried it without the '--subvol-extents 5' options?
> And what's the output?

Yes, I tried it. The output is normal, no problems found (it shows the
UUID, then "checking extents", and that's all)!

> And it may be a good idea to run btrfs-find-root -a, trying to find a good
> copy of old btrfs root tree.
> It may cause miracle to make it RW again.

Thanks for the advice. "btrfs-find-root -a" is running at the moment. What
should I do after its completion? Should I just try RW mounting with the
found root, or is that not safe?
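
For reference, a tree root bytenr reported by btrfs-find-root can first be
exercised read-only, e.g. with btrfs restore, before risking any rw mount.
A rough sketch only, with a made-up bytenr and target directory:

  mkdir -p /mnt/recovery
  # dry run: only list what would be restored using that tree root
  btrfs restore -D -t 123456789 /dev/sda1 /mnt/recovery
  # actually copy the files out; the source device is only read
  btrfs restore -t 123456789 /dev/sda1 /mnt/recovery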

> +1 for the advice if you just want to use back up things and get back to
> normal life.

I already backed up the most important data (the whole disk space is
1,82 TB). But I want to solve this strange problem.


-- 
Ivan Sizov


Re: btrfs check inconsistency with raid1, part 1

2015-12-14 Thread Chris Murphy
On Mon, Dec 14, 2015 at 1:04 AM, Qu Wenruo  wrote:
>
>
> Chris Murphy wrote on 2015/12/14 00:24 -0700:
>> What is a full disk dump? I can try to see if it's possible.
>
>
> Just a dd dump.

OK, yeah. That's 750GB per drive.

> It won't be an easy
> thing to find a place to upload them.

Right. I have no ideas. I'll give you the rest of what you asked for,
and won't do the rw mount yet in case you need more.


> Got the result, and things is very interesting.
>
> It seems all these tree blocks (search by the bytenr) shares the same crc32
> by coincidence.
> Or we won't be able to read them all (and their contents all seems valid).
>
>
> I hope I can have some raw block dumps of those bytenrs.
> Here is the procedure:
> $ btrfs-map-logical -l <bytenr> -n 16384 -c 2 <device>
> mirror 1 logical <bytenr> physical <XXX> device <dev1>
> mirror 2 logical <bytenr> physical <YYY> device <dev2>

Option -n is invalid, I'll use option -b.

##btrfs fi show has this mapping, seems opposite from
btrfs-map-logical (although it uses the term mirror rather than
devid). So I will use devid and ignore mirror number.
/dev/sdb = devid1
/dev/sdc = devid2


# btrfs-map-logical -l 714189357056 -b 16384 -c 2 /dev/sdb
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
mirror 1 logical 714189357056 physical 356605018112 device /dev/sdc
mirror 2 logical 714189357056 physical 3380658176 device /dev/sdb



# btrfs-map-logical -l 714189471744 -b 16384 -c 2 /dev/sdb
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
mirror 1 logical 714189471744 physical 356605132800 device /dev/sdc
mirror 2 logical 714189471744 physical 3380772864 device /dev/sdb


>
> $ dd if=<dev1> of=dev1_<bytenr>.img bs=1 count=16384 skip=XXX
> $ dd if=<dev2> of=dev2_<bytenr>.img bs=1 count=16384 skip=YYY
>
> In your output, there are 12 different bytenr, but the most interesting ones
> are *714189357056* and *714189471744*.


dd if=/dev/sdb of=dev1_714189357056.img bs=1 count=16384 skip=3380658176
dd if=/dev/sdc of=dev2_714189357056.img bs=1 count=16384 skip=356605018112

dd if=/dev/sdb of=dev1_714189471744.img bs=1 count=16384 skip=3380772864
dd if=/dev/sdc of=dev2_714189471744.img bs=1 count=16384 skip=356605132800

Files are attached to this email.


-- 
Chris Murphy


dev2_714189471744.img
Description: application/raw-disk-image


dev2_714189357056.img
Description: application/raw-disk-image


dev1_714189471744.img
Description: application/raw-disk-image


dev1_714189357056.img
Description: application/raw-disk-image


still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready)

2015-12-14 Thread Martin Steigerwald
On Sunday, 13 December 2015, 15:19:14 CET, Marc MERLIN wrote:
> On Sun, Dec 13, 2015 at 11:35:08PM +0100, Martin Steigerwald wrote:
> > Hi!
> > 
> > For me it is still not production ready. Again I ran into:
> > 
> > btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> > random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> Sorry you're having issues. I haven't seen this before myself.
> I couldn't find the kernel version you're using in your Email or the bug
> you filed (quick scan).
> 
> That's kind of important :)

I definitely know this much. :) It happened with 4.3 yesterday. The other 
kernel version was 3.18. Information should be in the bug report. Yeah, 3.18 
as mentioned in the Kernel Version field. And 4.3 as I mentioned in the last 
comment of the bug report.

The scrubbing issue has, I think, been there since 4.3; I also saw it with 
4.4-rc2/rc4 I believe, but I didn't go back then to check more thoroughly. I 
didn't report the scrubbing issue on bugzilla yet as I got no feedback on my 
mailing list posts so far. I will bump the thread in a moment and suggest we 
discuss the free space issue here and the scrubbing issue in the other thread. 
I went back to 4.3 because 4.4-rc2/4 does not even boot on my machine most of 
the time. I also reported this (a BTRFS-unrelated issue).

Thanks,
-- 
Martin


Re: [4.3-rc4] scrubbing aborts before finishing

2015-12-14 Thread Martin Steigerwald
On Wednesday, 25 November 2015, 16:35:39 CET, you wrote:
> On Saturday, 31 October 2015, 12:10:37 CET, Martin Steigerwald wrote:
> > On Thursday, 22 October 2015, 10:41:15 CET, Martin Steigerwald wrote:
> > > I get this:
> > > 
> > > merkaba:~> btrfs scrub status -d /
> > > scrub status for […]
> > > scrub device /dev/mapper/sata-debian (id 1) history
> > > 
> > > scrub started at Thu Oct 22 10:05:49 2015 and was aborted after
> > > 00:00:00
> > > total bytes scrubbed: 0.00B with 0 errors
> > > 
> > > scrub device /dev/dm-2 (id 2) history
> > > 
> > > scrub started at Thu Oct 22 10:05:49 2015 and was aborted after
> > > 00:01:30
> > > total bytes scrubbed: 23.81GiB with 0 errors
> > > 
> > > For / scrub aborts for sata SSD immediately.
> > > 
> > > For /home scrub aborts for both SSDs at some time.
> > > 
> > > merkaba:~> btrfs scrub status -d /home
> > > scrub status for […]
> > > scrub device /dev/mapper/msata-home (id 1) history
> > > 
> > > scrub started at Thu Oct 22 10:09:37 2015 and was aborted after
> > > 00:01:31
> > > total bytes scrubbed: 22.03GiB with 0 errors
> > > 
> > > scrub device /dev/dm-3 (id 2) history
> > > 
> > > scrub started at Thu Oct 22 10:09:37 2015 and was aborted after
> > > 00:03:34
> > > total bytes scrubbed: 53.30GiB with 0 errors
> > > 
> > > Also single volume BTRFS is affected:
> > > 
> > > merkaba:~> btrfs scrub status /daten
> > > scrub status for […]
> > > 
> > > scrub started at Thu Oct 22 10:36:38 2015 and was aborted after
> > > 00:00:00
> > > total bytes scrubbed: 0.00B with 0 errors
> > > 
> > > No errors in dmesg, btrfs device stat or smartctl -a.
> > > 
> > > Any known issue?
> > 
> > I am still seeing this in 4.3-rc7. It happens so that on one SSD BTRFS
> > doesn´t even start scrubbing. But in the end it aborts it scrubbing
> > anyway.
> > 
> > I do not see any other issue so far. But I would really like to be able to
> > scrub my BTRFS filesystems completely again. Any hints? Any further
> > information needed?
> > 
> > merkaba:~> btrfs scrub status -d /
> > scrub status for […]
> > scrub device /dev/dm-5 (id 1) history
> > 
> > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:00
> > total bytes scrubbed: 0.00B with 0 errors
> > 
> > scrub device /dev/mapper/msata-debian (id 2) status
> > 
> > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:20
> > total bytes scrubbed: 5.27GiB with 0 errors
> > 
> > merkaba:~> btrfs scrub status -d /
> > scrub status for […]
> > scrub device /dev/dm-5 (id 1) history
> > 
> > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:00
> > total bytes scrubbed: 0.00B with 0 errors
> > 
> > scrub device /dev/mapper/msata-debian (id 2) status
> > 
> > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:25
> > total bytes scrubbed: 6.59GiB with 0 errors
> > 
> > merkaba:~> btrfs scrub status -d /
> > scrub status for […]
> > scrub device /dev/dm-5 (id 1) history
> > 
> > scrub started at Sat Oct 31 11:58:45 2015, running for 00:00:00
> > total bytes scrubbed: 0.00B with 0 errors
> > 
> > scrub device /dev/mapper/msata-debian (id 2) status
> > 
> > scrub started at Sat Oct 31 11:58:45 2015, running for 00:01:25
> > total bytes scrubbed: 21.97GiB with 0 errors
> > 
> > merkaba:~> btrfs scrub status -d /
> > scrub status for […]
> > scrub device /dev/dm-5 (id 1) history
> > 
> > scrub started at Sat Oct 31 11:58:45 2015 and was aborted after
> > 
> > 00:00:00 total bytes scrubbed: 0.00B with 0 errors
> > scrub device /dev/mapper/msata-debian (id 2) history
> > 
> > scrub started at Sat Oct 31 11:58:45 2015 and was aborted after
> > 
> > 00:01:32 total bytes scrubbed: 23.63GiB with 0 errors
> > 
> > 
> > For the sake of it I am going to btrfs check one of the filesystem where
> > BTRFS aborts scrubbing (which is all of the laptop filesystems, not only
> > the RAID 1 one).
> > 
> > I will use the /daten filesystem as I can unmount it during laptop runtime
> > easily. There scrubbing aborts immediately:
> > 
> > merkaba:~> btrfs scrub start /daten
> > scrub started on /daten, fsid […] (pid=13861)
> > merkaba:~> btrfs scrub status /daten
> > scrub status for […]
> > 
> > scrub started at Sat Oct 31 12:04:25 2015 and was aborted after
> > 
> > 00:00:00 total bytes scrubbed: 0.00B with 0 errors
> > 
> > It is single device:
> > 
> > merkaba:~> btrfs fi sh /daten
> > Label: 'daten'  uuid: […]
> > 
> > Total devices 1 FS bytes used 227.23GiB
> > devid1 size 230.00GiB used 230.00GiB path
> > 
> > /dev/mapper/msata-daten
> > 
> > btrfs-progs v4.2.2
> > merkaba:~> btrfs fi df /daten
> > Data, single: total=228.99GiB, used=226.79GiB
> > System, single: total=4.00MiB, used=48.00KiB
> > Metadata, single: total=1.01GiB, used=449.50MiB
> > 

Re: btrfs check inconsistency with raid1, part 1

2015-12-14 Thread Qu Wenruo



Chris Murphy wrote on 2015/12/14 00:24 -0700:

Thanks for the reply.


On Sun, Dec 13, 2015 at 10:48 PM, Qu Wenruo  wrote:



Chris Murphy wrote on 2015/12/13 21:16 -0700:

btrfs check with devid 1 and 2 present produces thousands of scary
messages, e.g.
checksum verify failed on 714189357056 found E4E3BDB6 wanted 



Checked the full output.
The interesting part is, the calculated result is always E4E3BDB6, and
wanted is always all 0.

I assume E4E3BDB6 is crc32 of all 0 data.


If there is a full disk dump, it will be much easier to find where the
problem is.
But I'm afraid it won't be possible.


What is a full disk dump? I can try to see if it's possible.


Just a dd dump.

dd if=<device> of=disk1.img bs=1M


Main
thing though is only if it can make Btrfs overall better, because I
don't need this volume repaired, there's no data loss (backups!) so
this volume's purpose now is for study.


But please also consider your privacy before doing this.

And more important thing is the size...

Considering how large your -t 2 dump is, I won't ever try to do the dump 
even if I have enough spare space to contain the image; it won't be an easy 
thing to find a place to upload them.






At least, 'btrfs-debug-tree -t 2' should help to locate what's wrong with
the bytenr in the warning.


Both devs attached (not mounted).

[root@f23a ~]# btrfs-debug-tree -t 2 /dev/sdb > btrfsdebugtreet2_verb.txt
checksum verify failed on 714189570048 found E4E3BDB6 wanted 
checksum verify failed on 714189570048 found E4E3BDB6 wanted 
checksum verify failed on 714189471744 found E4E3BDB6 wanted 
checksum verify failed on 714189471744 found E4E3BDB6 wanted 
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
checksum verify failed on 714189357056 found E4E3BDB6 wanted 
checksum verify failed on 714189750272 found E4E3BDB6 wanted 
checksum verify failed on 714189750272 found E4E3BDB6 wanted 

https://drive.google.com/open?id=0B_2Asp8DGjJ9NUdmdXZFQ1Myek0



Got the result, and things is very interesting.

It seems all these tree blocks (searched by bytenr) share the same 
crc32 by coincidence.

Otherwise we wouldn't be able to read them all (and their contents all seem valid).


I hope I can have some raw block dumps of those bytenrs.
Here is the procedure:
$ btrfs-map-logical -l <bytenr> -n 16384 -c 2 <device>
mirror 1 logical <bytenr> physical <XXX> device <dev1>
mirror 2 logical <bytenr> physical <YYY> device <dev2>

$ dd if=<dev1> of=dev1_<bytenr>.img bs=1 count=16384 skip=XXX
$ dd if=<dev2> of=dev2_<bytenr>.img bs=1 count=16384 skip=YYY

In your output, there are 12 different bytenr, but the most interesting 
ones are *714189357056* and *714189471744*.
They are extent tree blocks. If they are really broken, btrfsck should 
complain about it.


Others are mostly csum tree block, less interesting.

And unlike the super large disk dump, it's very small, exactly 16K each.
64K in total.
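
If it helps, the same tree blocks can also be printed in decoded form with
something like the following (bytenr taken from the output above; this goes
through the normal read path, so it complements rather than replaces the raw
per-mirror dd dumps):

  btrfs-debug-tree -b 714189357056 /dev/sdb
  btrfs-debug-tree -b 714189471744 /dev/sdb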






The good news is, the fs seems to be OK without major problems.
Apart from the csum errors, btrfsck doesn't give any other error/warning.


Yes, I think so. Main issue here seems to be the scary warnings and
uncertainty what the user should do next, if anything at all.


I guess btrfsck assembled the devices wrongly, but that's just my personal
guess.
And since I can't reproduce it in my test environment, it won't be easy to find
the root cause.


It might be reproducible. More on that in the next email. Easy to get
you remote access if useful.



So. What's the theory in this case? And then does it differ from reality?



Personally speaking, it may be a false alert from btrfsck.
So in this case, I can't provide much help.

If you're brave enough, mount it rw to see what will happen (although it may
mount just OK).


I'm brave enough. I'll give it a try tomorrow unless there's another
request for more info before then.



Great!

Thanks,
Qu




still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready)

2015-12-14 Thread Martin Steigerwald
On Monday, 14 December 2015, 10:08:16 CET, Qu Wenruo wrote:
> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> > Hi!
> > 
> > For me it is still not production ready.
> 
> Yes, this is the *FACT* and not everyone has a good reason to deny it.
> 
> > Again I ran into:
> > 
> > btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> > random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> Not sure about guideline for other fs, but it will attract more dev's
> attention if it can be posted to maillist.

I did, as mentioned in the bug report:

BTRFS free space handling still needs more work: Hangs again
Martin Steigerwald | 26 Dec 14:37 2014
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/41790

> > No matter whether SLES 12 uses it as default for root, no matter whether
> > Fujitsu and Facebook use it: I will not let this onto any customer machine
> > without lots and lots of underprovisioning and rigorous free space
> > monitoring. Actually I will renew my recommendations in my trainings to
> > be careful with BTRFS.
> > 
> >  From my experience the monitoring would check for:
> > merkaba:~> btrfs fi show /home
> > Label: 'home'  uuid: […]
> > 
> >  Total devices 2 FS bytes used 156.31GiB
> >  devid1 size 170.00GiB used 164.13GiB path
> >  /dev/mapper/msata-home
> >  devid2 size 170.00GiB used 164.13GiB path
> >  /dev/mapper/sata-home
> > 
> > If "used" is same as "size" then make big fat alarm. It is not sufficient
> > for it to happen. It can run for quite some time just fine without any
> > issues, but I never have seen a kworker thread using 100% of one core for
> > extended period of time blocking everything else on the fs without this
> > condition being met.
> And specially advice on the device size from myself:
> Don't use devices over 100G but less than 500G.
> Over 100G will lead btrfs to use big chunks, where data chunks can be
> at most 10G and metadata at most 1G.
> 
> I have seen a lot of users with about 100~200G devices, and hit
> unbalanced chunk allocation (a 10G data chunk easily takes the last
> available space and leaves later metadata nowhere to be stored)

Interesting, but in my case there is still quite some free space in already 
allocated metadata chunks. Anyway, I did have ENOSPC issues when trying to 
balance the chunks.
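
As an aside, the 'if "used" is same as "size" then make big fat alarm' check
quoted above is simple to script; a rough sketch, with the mount point and
the warning text being just placeholders (and assuming the usual btrfs fi
show output layout):

  # warn when every device of the fs has all of its space allocated to chunks
  btrfs fi show /home | awk '/devid/ && $4 == $6 { hit=1 } END { exit !hit }' \
      && echo "WARNING: all device space on /home is allocated to chunks"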

> And unfortunately, your fs is already in the dangerous zone.
> (And you are using RAID1, which means it's the same as one 170G btrfs
> with SINGLE data/meta)

Well, I know for any FS it's not recommended to let it run full and to leave 
at least about 10-15% free, but while it is not 10-15% anymore, it's still a 
whopping 11-12 GiB of free space. I would accept a somewhat slower operation 
in this case, but no kworker at 100% for about 10-30 seconds blocking 
everything else going on on the filesystem. For whatever reason Plasma 
seems to access the fs on almost every action I do with it, so not even panels 
slide out anymore or the activity switcher works during that time.

> > In addition to that, last time I tried, it aborts scrub on any of my BTRFS
> > filesystems. Reported in another thread here that got completely ignored so
> > far. I think I could go back to 4.2 kernel to make this work.
> 
> Unfortunately, this happens a lot of times, even when you post it to the mailing list.
> Devs here are always busy locating bugs or adding new features or
> enhancing current behavior.
> 
> So *PLEASE* be patient about such slow response.

Okay, thanks at least for the acknowledgement of this. I try to be even more 
patient.
 
> BTW, you may not want to revert to 4.2 until some bug fix is backported
> to 4.2.
> As qgroup rework in 4.2 has broken delayed ref and caused some scrub
> bugs. (My fault)

Hm, well scrubbing does not work for me either. But since 4.3/4.4rc2/4. I just 
bumped the thread:

Re: [4.3-rc4] scrubbing aborts before finishing

by replying, well, by replying a third time to it (not fourth, miscounted :). 

> > I am not going to bother to go into more detail on any of this, as I get
> > the impression that my bug reports and feedback get ignored. So I spare
> > myself the time to do this work for now.
> > 
> > 
> > Only thing I wonder now whether this all could be cause my /home is
> > already
> > more than one and a half year old. Maybe newly created filesystems are
> > created in a way that prevents these issues? But it already has a nice
> > global reserve:
> > 
> > merkaba:~> btrfs fi df /
> > Data, RAID1: total=27.98GiB, used=24.07GiB
> > System, RAID1: total=19.00MiB, used=16.00KiB
> > Metadata, RAID1: total=2.00GiB, used=536.80MiB
> > GlobalReserve, single: total=192.00MiB, used=0.00B
> > 
> > 
> > Actually when I see that this free space thing is still not fixed for good
> > I wonder whether it is fixable at all. Is this an inherent issue of BTRFS
> > or more generally COW filesystem design?
> 
> GlobalReserve is just a reserved space *INSIDE* metadata for some 

Re: Kernel lockup, might be helpful log.

2015-12-14 Thread Birdsarenice
I've no need for a fix. I know exactly what the underlying cause is: 
Those Seagate 8TB Archive drives and their known compatibility issues 
with some kernel versions. I just shared the log because it's a 
situation that btrfs handles very, very poorly, and the error handling 
could be improved. If a drive is unresponsive, btrfs really should be 
able to just cease using it and treat it as failed, or even unmount the 
entire filesystem - either would be preferable to what actually happens 
(at least for me), a system hang that leaves nothing functional whatsoever.


I've 'solved' it by removing all drives of that model. It's been running 
without issue since I did that.


On 14/12/15 07:36, Chris Murphy wrote:

I can't help with the call traces. But several (not all) of the hard
resetting link messages are hallmark cases where the SCSI command
timer default of 30 seconds looks like it's being hit while the drive
itself is hung up doing a sector read recovery (multiple attempts).
It's worth seeing if 'smartctl -l scterc ' will report back that
SCT is supported and that it's just disabled, meaning you can change
this to something sane like with 'smartctl -l 70,70 ' which will
make the drive time out before the linux kernel command timer. That'll
let Btrfs do the right thing, rather than constantly getting poked in
both eyes by link resets.


Chris Murphy
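
For reference, roughly what that looks like in practice (sdX is just a
placeholder for the drive in question):

  # does the drive support SCT ERC, and what is it currently set to?
  smartctl -l scterc /dev/sdX
  # cap the drive's internal error recovery at 7 seconds (values are in deciseconds)
  smartctl -l scterc,70,70 /dev/sdX
  # if the drive cannot do SCT ERC at all, raise the kernel's SCSI command
  # timer instead so that it exceeds the drive's worst-case recovery time
  echo 180 > /sys/block/sdX/device/timeout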





Re: attacking btrfs filesystems via UUID collisions?

2015-12-14 Thread Chris Murphy
On Sun, Dec 13, 2015 at 5:27 PM, Christoph Anton Mitterer
 wrote:
> On Fri, 2015-12-11 at 16:06 -0700, Chris Murphy wrote:
>> For anything but a new and empty Btrfs volume
> What's the influence of the fs being new/empty?
>
>> this hypothetical
>> attack would be a ton easier to do on LVM and mdadm raid because they
>> have a tiny amount of metadata to spoof compared to a Btrfs volume
>> with even a little bit of data on it.
> Uhm I haven't said that other systems properly handle this kind of
> attack. ;-)
> Guess that would need to be evaluated...
>
>
>>  I think this concern is overblown.
> I don't think so. Let me give you an example: There is an attack[0]
> against crypto, where the attacker listens via a smartphone's
> microphone, and recovers key material based on the acoustics of a computer where gnupg runs.
> This is surely not an attack many people would have considered even
> remotely possible, but in fact it works, at least under lab conditions.

I'm aware of this proof of concept. I'd put it, and this one, in the
realm of a targeted attack, so it's not nearly as likely as other
problems needing fixing. That doesn't mean don't understand it better
so it can be fixed. It means understand before arriving at risk
assessment let alone conclusions.



> Apart from that, btrfs should be a general purpose fs, and not just a
> desktop or server fs.
> So edge cases like forensics (where it's common that you create bitwise
> identical images) shouln't be forgotten either.

I didn't. I did state there are edge cases, not normal use. My
criticism of dd for copying a volume is for general purpose copying,
not edge cases.



>
>
>> > >If your workflow requires making an exact copy (for the shelf or
>> > > for
>> > > an emergency) then dd might be OK. But most often it's used
>> > > because
>> > > it's been easy, not because it's a good practice.
>> > Ufff.. I wouldn't got that far to call something here bad or good
>> > practice.
>>
>> It's not just bad practice, it's sufficiently sloppy that it's very
>> nearly user sabotage. That this is due to innocent ignorance, and a
>> long standing practice that's bad advice being handed down from
>> previous generations doesn't absolve the practice and mean we should
>> invent esoteric work arounds for what is not a good practice. We have
>> all sorts of exhibits why it's not a good idea.
> Well if you don't give any real arguments or technical reasons (apart
> from "working around software that doesn't handle this well") I
> consider this just repetition of the baseless claim that long standing
> practise would be bad.

I already have, as have others.

Does the user want cake or pie? The computer doesn't have that level
of granular information when there are two apparently bitwise
identical devices. The file system sees them both as dessert, without
other distinction. So option a is to simply fail and let the user
resolve the ambiguity. Option b is maybe to leverage btrfs check code
and find out if there's more to the story, some indication that one of
the apparently identical copies isn't really identical. But that's a
lot of work for something that probably won't happen. What's more
likely is they aren't just apparently identical, they are in fact
identical because it's an LVM snapshot or a dd copy that's making them
appear identical. That's not something btrfs can resolve alone.

To automate the distinction, requires more information. If it's LVM,
possibly LVM and Btrfs could work together where LVM LV UUID * Btrfs
volume UUID = Btrfs volume UUID'  (as in a derivative) and to treat it
internally with a new temp UUID that's thrown away.

If it's a raw device, I still see this as the user's problem. They
created it, they'll have to help resolve the ambiguity by yanking one
of the drives.
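
As an aside, for the raw-device case: if a dd copy really is needed, recent
btrfs-progs can at least give the copy its own filesystem UUID before both
devices are ever visible to the same kernel. A hedged sketch -- the device
names are made up, and it assumes a btrfs-progs new enough to have
btrfstune -u:

  # raw copy; keep the target unmounted/unscanned until the UUID is changed
  dd if=/dev/sdb of=/dev/sdc bs=1M conv=fsync
  # rewrite the fsid in all metadata of the copy (touches every tree block,
  # so it can take a while on a large filesystem)
  btrfstune -u /dev/sdc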



> Long story, short, I think we can agree, that - dd or not - corruptions
> or attack vectors shouldn't be possible.

Yes.


-- 
Chris Murphy


Re: Kernel lockup, might be helpful log.

2015-12-14 Thread Hugo Mills
On Mon, Dec 14, 2015 at 06:51:41AM +, Duncan wrote:
> Birdsarenice posted on Sun, 13 Dec 2015 22:55:19 + as excerpted:
> 
> > Meanwhile, I did get lucky: At one crash I happened to be logged in and
> > was able to hit dmesg seconds before it went completely. So what I have
> > here is information that looks like it'll help you track down a
> > rarely-encountered and hard-to-reproduce bug which can cause the system
> > to lock up completely in event of certain types of hard drive failure.
> > It might be nothing, but perhaps someone will find it of use - because
> > it'd be a tricky one to both reproduce and get a good error report if it
> > did occur.
> > 
> > I see an 'invalid opcode' error in here, that's pretty unusual
> 
> Disclaimer:  I'm a list regular and (small-scale) sysadmin, not a dev, 
> and most certainly not a btrfs dev.  Take what I say with that in mind, 
> tho I've been active on-list for over a year and thus now have a 
> reasonable level of practical sysadmin configuration and crisis recovery 
> level btrfs experience.
> 
> You could well be quite correct with the unusual crash log and its value, 
> I'll leave that up to the devs to decide, but that "invalid opcode: " 
> bit is in fact not at all unusual on btrfs.  Tho I can say it fooled me 
> originally as well, because it certainly /looks/ both suspicious and in 
> general unusual.
> 
> Based on how a dev explained it to me, I believe btrfs actually 
> deliberately uses opcode  to trigger a semi-controlled crash in 
> instances where code that "should never happen" actually gets executed 
> for some reason, leaving the kernel is an unknown and thus not 
> trustworthy enough to reliably write to storage devices and do a 
> controlled shutdown.  That's of course why the tracebacks are there, to 
> help the devs figure out where it was and what triggered it, but the  
> opcode itself is actually quite frequently found in these tracebacks, 
> because it's the method chosen to deliberately trigger them.

   It's not just btrfs. Invalid opcode is the way that the kernel's
BUG and BUG_ON macros are implemented.

   Hugo.

-- 
Hugo Mills | Great oxymorons of the world, no. 10:
hugo@... carfax.org.uk | Business Ethics
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: still kworker at 100% cpu in all of device size allocated with chunks situations with write load

2015-12-14 Thread Qu Wenruo



Martin Steigerwald wrote on 2015/12/14 09:18 +0100:

On Monday, 14 December 2015, 10:08:16 CET, Qu Wenruo wrote:

Martin Steigerwald wrote on 2015/12/13 23:35 +0100:

Hi!

For me it is still not production ready.


Yes, this is the *FACT* and not everyone has a good reason to deny it.


Again I ran into:

btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401


Not sure about guideline for other fs, but it will attract more dev's
attention if it can be posted to maillist.


I did, as mentioned in the bug report:

BTRFS free space handling still needs more work: Hangs again
Martin Steigerwald | 26 Dec 14:37 2014
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/41790


No matter whether SLES 12 uses it as default for root, no matter whether
Fujitsu and Facebook use it: I will not let this onto any customer machine
without lots and lots of underprovisioning and rigorous free space
monitoring. Actually I will renew my recommendations in my trainings to
be careful with BTRFS.

  From my experience the monitoring would check for:
merkaba:~> btrfs fi show /home
Label: 'home'  uuid: […]

  Total devices 2 FS bytes used 156.31GiB
  devid1 size 170.00GiB used 164.13GiB path
  /dev/mapper/msata-home
  devid2 size 170.00GiB used 164.13GiB path
  /dev/mapper/sata-home

If "used" is same as "size" then make big fat alarm. It is not sufficient
for it to happen. It can run for quite some time just fine without any
issues, but I never have seen a kworker thread using 100% of one core for
extended period of time blocking everything else on the fs without this
condition being met.

And specially advice on the device size from myself:
Don't use devices over 100G but less than 500G.
Over 100G will lead btrfs to use big chunks, where data chunks can be
at most 10G and metadata at most 1G.

I have seen a lot of users with about 100~200G devices, and hit
unbalanced chunk allocation (a 10G data chunk easily takes the last
available space and leaves later metadata nowhere to be stored)


Interesting, but in my case there is still quite some free space in already
allocated metadata chunks. Anyway, I did have ENOSPC issues when trying to
balance the chunks.


And unfortunately, your fs is already in the dangerous zone.
(And you are using RAID1, which means it's the same as one 170G btrfs
with SINGLE data/meta)


Well, I know for any FS it's not recommended to let it run full and to leave
at least about 10-15% free, but while it is not 10-15% anymore, it's still a
whopping 11-12 GiB of free space. I would accept a somewhat slower operation
in this case, but no kworker at 100% for about 10-30 seconds blocking
everything else going on on the filesystem. For whatever reason Plasma
seems to access the fs on almost every action I do with it, so not even panels
slide out anymore or the activity switcher works during that time.


In addition to that, last time I tried, it aborts scrub on any of my BTRFS
filesystems. Reported in another thread here that got completely ignored so
far. I think I could go back to 4.2 kernel to make this work.


Unfortunately, this happens a lot of times, even when you post it to the mailing list.
Devs here are always busy locating bugs or adding new features or
enhancing current behavior.

So *PLEASE* be patient about such slow response.


Okay, thanks at least for the acknowledgement of this. I try to be even more
patient.


BTW, you may not want to revert to 4.2 until some bug fix is backported
to 4.2.
As qgroup rework in 4.2 has broken delayed ref and caused some scrub
bugs. (My fault)


Hm, well scrubbing does not work for me either. But since 4.3/4.4rc2/4. I just
bumped the thread:

Re: [4.3-rc4] scrubbing aborts before finishing

by replying, well, by replying a third time to it (not fourth, miscounted :).


I am not going to bother to go into more detail on any of this, as I get
the impression that my bug reports and feedback get ignored. So I spare
myself the time to do this work for now.


Only thing I wonder now whether this all could be cause my /home is
already
more than one and a half year old. Maybe newly created filesystems are
created in a way that prevents these issues? But it already has a nice
global reserve:

merkaba:~> btrfs fi df /
Data, RAID1: total=27.98GiB, used=24.07GiB
System, RAID1: total=19.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=536.80MiB
GlobalReserve, single: total=192.00MiB, used=0.00B


Actually when I see that this free space thing is still not fixed for good
I wonder whether it is fixable at all. Is this an inherent issue of BTRFS
or more generally COW filesystem design?


GlobalReserve is just a reserved space *INSIDE* metadata for some corner
case. So its profile is always single.

The real problem is, how we represent it in btrfs-progs.

If it output like below, I think you won't complain about it more:
  > merkaba:~> btrfs fi df /
  

Re: still kworker at 100% cpu in all of device size allocated with chunks situations with write load

2015-12-14 Thread Martin Steigerwald
Hi Qu.

I reply to the journal fs things in a mail with a different subject.

On Monday, 14 December 2015, 16:48:58 CET, Qu Wenruo wrote:
> Martin Steigerwald wrote on 2015/12/14 09:18 +0100:
> > On Monday, 14 December 2015, 10:08:16 CET, Qu Wenruo wrote:
> >> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
[…]
> >> GlobalReserve is just a reserved space *INSIDE* metadata for some corner
> >> case. So its profile is always single.
> >> 
> >> The real problem is, how we represent it in btrfs-progs.
> >> 
> >> If it output like below, I think you won't complain about it more:
> >>   > merkaba:~> btrfs fi df /
> >>   > Data, RAID1: total=27.98GiB, used=24.07GiB
> >>   > System, RAID1: total=19.00MiB, used=16.00KiB
> >>   > Metadata, RAID1: total=2.00GiB, used=728.80MiB
> >> 
> >> Or
> >> 
> >>   > merkaba:~> btrfs fi df /
> >>   > Data, RAID1: total=27.98GiB, used=24.07GiB
> >>   > System, RAID1: total=19.00MiB, used=16.00KiB
> >>   > Metadata, RAID1: total=2.00GiB, used=(536.80 + 192.00)MiB
> >>   > 
> >>   >  \ GlobalReserve: total=192.00MiB, used=0.00B
> > 
> > Oh, the global reserve is *inside* the existing metadata chunks? That's
> > interesting. I didn't know that.
> 
> And I have already submit btrfs-progs patch to change the default output
> of 'fi df'.
> 
> Hopes to solve the problem.

Nice. Thank you. It clarifies it quite a bit. I always wondered why it's 
single. On which device does it allocate it in a RAID 1? Also, can the data 
stored in there temporarily be recreated in case of losing a device? If 
not, BTRFS would not guarantee that one device of a RAID 1 can fail at 
all times.

Ciao,
-- 
Martin


Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 15:27 -0500, Austin S. Hemmelgarn wrote:
> On 2015-12-14 14:39, Christoph Anton Mitterer wrote:
> > On Mon, 2015-12-14 at 09:24 -0500, Austin S. Hemmelgarn wrote:
> > > Unless things have changed very recently, even many modern
> > > systems
> > > update atime on read-only filesystems, unless the media itself is
> > > read-only.
> > Seriously? Oh... *sigh*...
> > You mean as in Linux, ext*, xfs?
> Possibly, I know that Windows 7 does it, and I think OS X and OpenBSD
> do 
> it, but I'm not sure about Linux.
I've just checked it via loopback image and strictatime:

- ro snapshot doesn't get atime updated
- rw snapshot does get atime updated
- ro mounted fs (top level subvol) doesn't get atimes updated (neither
  in subvols)
- rw mounted fs (top level subvol) does get atimes updated
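
For anyone who wants to repeat it, a rough sketch of the kind of loopback
test I mean (paths and sizes are arbitrary, run as root):

  truncate -s 1G /tmp/btrfs-atime.img
  mkfs.btrfs /tmp/btrfs-atime.img
  mkdir -p /mnt/atime-test
  mount -o loop,strictatime /tmp/btrfs-atime.img /mnt/atime-test
  echo test > /mnt/atime-test/file
  btrfs subvolume snapshot -r /mnt/atime-test /mnt/atime-test/snap-ro
  btrfs subvolume snapshot /mnt/atime-test /mnt/atime-test/snap-rw
  stat -c '%x %n' /mnt/atime-test/snap-*/file
  cat /mnt/atime-test/snap-*/file > /dev/null
  stat -c '%x %n' /mnt/atime-test/snap-*/file   # only the rw snapshot's atime moved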

Cheers,
Chris.



project idea: per-object default mount-options / more btrfs-properties / chattr attributes (was: btrfs: poor performance on deleting many large files)

2015-12-14 Thread Christoph Anton Mitterer
Just FYI:

On Mon, 2015-12-14 at 15:27 -0500, Austin S. Hemmelgarn wrote:
> > My idea would be basically, that having a noatime btrfs-property,
> > which
> > is perhaps even set automatically, would be an elegant way of doing
> > that.
> > I just haven't had time to properly write that up and add is as a
> > "feature request" to the projects idea wiki page.
> I like this idea.

I've just compiled some thoughts and ideas into:
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Per-object_default_mount-options_.2F_btrfs-properties_.2F_chattr.281.29_attributes_and_reasonable_userland_defaults

As usual, this is mostly from my admin/end-user side, i.e. what I could
imagine would ease the maintenance of large/complex (in terms of
subvols, nesting, snapshots) btrfs filesystems...
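
For context, the existing per-subvolume property interface that such a
noatime property would presumably hang off already looks like this (the
subvolume path is made up):

  btrfs property list -ts /mnt/data/subvol        # currently e.g. ro, compression
  btrfs property set -ts /mnt/data/subvol compression lzo
  btrfs property get -ts /mnt/data/subvol ro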

And of course, any developer or more expert user than me is happily
invited to comment/remove any (possibly stupid) ideas of mine therein,
or summon the inquisition for my heresy ;)


Cheers,
Chris.



Re: btrfs: poor performance on deleting many large files

2015-12-14 Thread Christoph Anton Mitterer
On Mon, 2015-12-14 at 22:30 +0100, Lionel Bouton wrote:
> Mutt is often used as an example but tmpwatch uses atime by default
> too
> and it's quite useful.
Hmm one could probably argue that these few cases justify the use of
separate filesystems (or btrfs subvols ;) ), so that the majority could
benefit from noatime.

> If you have a local cache of remote files for which you want a good
> hit
> ratio and don't care too much about its exact size (you should have
> Nagios/Zabbix/... alerting you when a filesystem reaches a %free
> limit
> if you value your system's availability anyway), using tmpwatch with
> cron to maintain it is only one single line away and does the job.
> For
> an example of this particular case, on Gentoo the
> /usr/portage/distfiles
> directory is used in one of the tasks you can uncomment to activate
> in
> the cron.daily file provided when installing tmpwatch.
> Using tmpwatch/cron is far more convenient than using a dedicated
> cache
> (which might get tricky if the remote isn't HTTP-based, like an
> rsync/ftp/nfs/... server or doesn't support HTTP IMS requests for
> example).
> Some http frameworks put sessions in /tmp: in this case if you want
> sessions to expire based on usage and not creation time, using
> tmpwatch
> or similar with atime is the only way to clean these files. This can
> even become a performance requirement: I've seen some servers slowing
> down with tens/hundreds of thousands of session files in /tmp because
> it
> was only cleaned at boot and the systems were almost never
> rebooted...
Okay there are probably some usecases, ... the session cleaning I'd
however rather consider a bug in the respective software, especially if
it really depends on it to expire the session (what if for some reason
tmpwatch gets broken, uninstalled, etc.)


> I use noatime and nodiratime
FYI: noatime implies nodiratime :-)


> Finally Linus Torvalds has been quite vocal and consistent on the
> general subject of the kernel not breaking user-space APIs no matter
> what so I wouldn't have much hope for default kernel mount options
> changes...
He surely is right in general,... but when the point has been reached,
where only a minority actually requires the feature... and the minority
actually starts to suffer from that... it may change.


Cheers,
Chris.
