Re: mount option nodatacow for VMs on SSD?

2016-11-28 Thread Kai Krakow
On Mon, 28 Nov 2016 01:38:29 +0100, Ulli Horlacher wrote:

> On Sat 2016-11-26 (11:27), Kai Krakow wrote:
> 
> > > I have vmware and virtualbox VMs on btrfs SSD.  
> 
> > As a side note: I don't think you can use "nodatacow" just for one
> > subvolume while the other subvolumes of the same btrfs are mounted
> > differently. The wiki is just wrong here.
> > 
> > The list of possible mount options in the wiki explicitly lists
> > "nodatacow" as not working per subvolume - just globally for the
> > whole fs.  
> 
> Thanks for pointing this out!
> I had misunderstood this at first.

You can, however, use chattr to make the subvolume root directory (the
directory where it is mounted) nodatacow (chattr +C) _before_ placing any
files or directories in there. That way, newly created files and
directories will inherit the flag. Note that this flag can only be
applied to directories and empty (zero-sized) files.
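
What chattr +C does boils down to a single flag ioctl; a minimal sketch
(the /mnt/vms path is only an example, error handling kept minimal):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Equivalent of "chattr +C <dir>": set FS_NOCOW_FL on a directory so
 * that files created inside it inherit nodatacow.  Only effective on
 * directories and empty files, as noted above. */
int main(void)
{
    int fd = open("/mnt/vms", O_RDONLY);    /* example path */
    int flags = 0;

    if (fd < 0)
        return perror("open"), 1;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return perror("FS_IOC_GETFLAGS"), 1;
    flags |= FS_NOCOW_FL;                   /* the +C bit */
    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
        return perror("FS_IOC_SETFLAGS"), 1;
    return 0;
}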

That way, you get the intended benefit, and your next question matters a
little less:

> Ok, then next question :-)
> 
> What is better (for a single user workstation): using the mount option
> "autodefrag" or calling "btrfs filesystem defragment -r" (-t ?) via a
> nightly cronjob?
> 
> So far, I use neither.

When using the above method to make your VM images nodatacow, the only
fragmentation issue you need to handle is snapshots. Snapshots are
subject to copy-on-write: if you do heavy snapshotting, you'll get heavy
fragmentation depending on the write patterns. I don't know whether
autodefrag handles nodatacow files. You may want to run a dedupe utility
after defragmentation, such as duperemove (run manually) or bees (a
background daemon that also tries to keep fragmentation low).

If you are doing no or only infrequent snapshots, I wouldn't bother with
manual defragging of your VM images at all, since you're on SSD. If you
aren't going to use snapshots at all, you won't even have to think about
autodefrag, though I still recommend enabling it (see the post from
Duncan).

Manual defrag is a highly write-intensive operation, rewriting multiple
gigabytes of data. I strongly recommend against running it on a daily
basis on an SSD.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [Not TLS] Re: mount option nodatacow for VMs on SSD?

2016-11-28 Thread Graham Cobb
On 28/11/16 02:56, Duncan wrote:
> It should still be worth turning on autodefrag on an existing somewhat 
> fragmented filesystem.  It just might take some time to defrag files you 
> do modify, and won't touch those you don't, which in some cases might 
> make it worth defragging those manually.  Or simply create new 
> filesystems, mount them with autodefrag, and copy everything over so 
> you're starting fresh, as I do.

Could that "copy" be (a series of) send/receive, so that snapshots and
reflinks are preserved?  Does autodefrag work in that case or does the
send/receive somehow override that and end up preserving the original
(fragmented) extent structure?

Graham



Re: mount option nodatacow for VMs on SSD?

2016-11-28 Thread Niccolò Belli

On Monday, 28 November 2016 09:20:15 CET, Kai Krakow wrote:

You can, however, use chattr to make the subvolume root directory (the
directory where it is mounted) nodatacow (chattr +C) _before_ placing any
files or directories in there. That way, newly created files and
directories will inherit the flag. Note that this flag can only be
applied to directories and empty (zero-sized) files.


Do I keep checksumming for this directory that way?

Niccolò Belli


Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents

2016-11-28 Thread Roman Mamedov
On Mon, 28 Nov 2016 00:03:12 -0500
Zygo Blaxell  wrote:

> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8e3a5a2..b1314d6 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
>   max_size = min_t(unsigned long, PAGE_SIZE, max_size);
>   ret = btrfs_decompress(compress_type, tmp, page,
>  extent_offset, inline_size, max_size);
> + WARN_ON(max_size > PAGE_SIZE);
> + if (max_size < PAGE_SIZE) {
> + char *map = kmap(page);
> + memset(map + max_size, 0, PAGE_SIZE - max_size);
> + kunmap(page);
> + }
>   kfree(tmp);
>   return ret;
>  }

Wasn't this already posted as:

btrfs: fix silent data corruption while reading compressed inline extents
https://patchwork.kernel.org/patch/9371971/

but you don't indicate that's a V2 or something, and in fact the patch seems
exactly the same, just the subject and commit message are entirely different.
Quite confusing.

-- 
With respect,
Roman


Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe

2016-11-28 Thread Darrick J. Wong
On Thu, Nov 24, 2016 at 11:20:39PM -0500, Zygo Blaxell wrote:
> On Wed, Nov 23, 2016 at 05:26:18PM -0800, Darrick J. Wong wrote:
> [...]
> > Keep in mind that the number of bytes deduped is returned to userspace
> > via file_dedupe_range.info[x].bytes_deduped, so a properly functioning
> > userspace program actually /can/ detect that its 128MB request got cut
> > down to only 16MB and re-issue the request with the offsets moved up by
> > 16MB.  The dedupe client in xfs_io (see dedupe_ioctl() in io/reflink.c)
> > implements this strategy.  duperemove (the only other user I know of)
> > also does this.
> > 
> > So it's really no big deal to increase the limit beyond 16MB, eliminate
> > it entirely, or even change it to cap the total request size while
> > dropping the per-item IO limit.
> > 
> > As I mentioned in my other reply, the only hesitation I have for not
> > killing XFS_MAX_DEDUPE_LEN is that I feel that 2GB is enough IO for a
> > single ioctl call.
> 
> Everything's relative.  btrfs has ioctls that will do hundreds of
> terabytes of IO and take months to run.  2GB of data is nothing.
> 
> Deduping entire 100TB files with a single ioctl call makes as much
> sense to me as reflink copying them with a single ioctl call.  The only
> reason I see to keep the limit is to work around something wrong with
> the implementation.
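
The re-issue strategy described above, in rough userspace form; a sketch
using the FIDEDUPERANGE ioctl from linux/fs.h (the helper name and fd/offset
parameters are illustrative, error handling trimmed):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Dedupe [src_off, src_off+len) of src_fd against dst_fd at dst_off,
 * re-issuing the request whenever the kernel processes less than asked
 * (e.g. when a 128MB request gets cut down to 16MB). */
static int dedupe_whole_range(int src_fd, __u64 src_off,
                              int dst_fd, __u64 dst_off, __u64 len)
{
    struct file_dedupe_range *req;

    req = calloc(1, sizeof(*req) + sizeof(req->info[0]));
    if (!req)
        return -1;
    while (len > 0) {
        req->src_offset = src_off;
        req->src_length = len;
        req->dest_count = 1;
        req->info[0].dest_fd = dst_fd;
        req->info[0].dest_offset = dst_off;
        if (ioctl(src_fd, FIDEDUPERANGE, req) < 0 ||
            req->info[0].status != FILE_DEDUPE_RANGE_SAME ||
            req->info[0].bytes_deduped == 0)
            break;  /* error, data differed, or no forward progress */
        /* move the window up by what was actually deduped */
        src_off += req->info[0].bytes_deduped;
        dst_off += req->info[0].bytes_deduped;
        len -= req->info[0].bytes_deduped;
    }
    free(req);
    return len ? -1 : 0;
}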

Ok.  I'll post patches removing the 16MB limitation for XFS and ocfs2 in 4.10
if nobody objects.

--D


Re: [PATCH] btrfs: fix hole read corruption for compressed inline extents

2016-11-28 Thread Zygo Blaxell
On Mon, Nov 28, 2016 at 05:27:10PM +0500, Roman Mamedov wrote:
> On Mon, 28 Nov 2016 00:03:12 -0500
> Zygo Blaxell  wrote:
> 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 8e3a5a2..b1314d6 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -6803,6 +6803,12 @@ static noinline int uncompress_inline(struct btrfs_path *path,
> > max_size = min_t(unsigned long, PAGE_SIZE, max_size);
> > ret = btrfs_decompress(compress_type, tmp, page,
> >extent_offset, inline_size, max_size);
> > +   WARN_ON(max_size > PAGE_SIZE);
> > +   if (max_size < PAGE_SIZE) {
> > +   char *map = kmap(page);
> > +   memset(map + max_size, 0, PAGE_SIZE - max_size);
> > +   kunmap(page);
> > +   }
> > kfree(tmp);
> > return ret;
> >  }
> 
> Wasn't this already posted as:
> 
> btrfs: fix silent data corruption while reading compressed inline extents
> https://patchwork.kernel.org/patch/9371971/
> 
> but you don't indicate that's a V2 or something, and in fact the patch seems
> exactly the same, just the subject and commit message are entirely different.
> Quite confusing.

The previous commit message discussed the related hole-creation bug,
including a reproducer; however, this patch does not fix the hole-creation
bug and was never intended to.  Despite my follow-up clarification,
reviewers got distracted by the hole-creation bug discussion and didn't
recover, so the patch didn't go anywhere.

This patch only fixes _reading_ the holes after they are created, and
the new commit message and subject line state that much more clearly.

The patch didn't change, so I didn't add 'v2'.  There's no 'v1' with
the same title, so I thought a 'v2' tag would be more confusing than
just starting over.

The hole-creation bug is a very old, low-urgency issue.  btrfs filesystems
in the field already have the buggy holes, and have been creating new
ones from 2009(*) to the present.  I had to ask a few people before I found
one who knew whether it was even a bug, or intentional behavior from
the beginning.


(*) 2009 is the oldest commit date I can find that introduces a change
which would only be necessary in the presence of the hole-creation bug.
I have not been able to test kernels before 2012 because they crash
while running my reproducer.

> -- 
> With respect,
> Roman
> 




Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Goffredo Baroncelli
On 2016-11-28 04:37, Christoph Anton Mitterer wrote:
> I think for safety it's best to repair as early as possible (and thus
> on read when damage is detected), as further blocks/devices may fail
> till eventually a scrub (with repair) would be run manually.
> 
> However, there may be some workloads under which such auto-repair is
> undesirable, as it may cost performance and safety may be less important
> than that.

I am assuming that corruption is a quite rare event. So occasionally it
could happen that a page is corrupted and the system corrects it. This
shouldn't have an impact on the workloads.

BR
G.Baroncelli
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Goffredo Baroncelli
On 2016-11-28 01:40, Qu Wenruo wrote:
> 
> At 11/27/2016 07:16 AM, Goffredo Baroncelli wrote:
>> On 2016-11-26 19:54, Zygo Blaxell wrote:
>>> On Sat, Nov 26, 2016 at 02:12:56PM +0100, Goffredo Baroncelli wrote:
 On 2016-11-25 05:31, Zygo Blaxell wrote:
>> [...]

 BTW Btrfs in RAID1 mode corrects the data even in the read case. So
>>>
>>> Have you tested this?  I think you'll find that it doesn't.
>>
>> Yes, I tested it, and it does the rebuild automatically.
>> I corrupted a disk of the mirror, then I read the related file. The log says:
>>
>> [   59.287748] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 
>> 12813760 expected csum 3114703128
>> [   59.291542] BTRFS warning (device vdb): csum failed ino 257 off 0 csum 
>> 12813760 expected csum 3114703128
>> [   59.294950] BTRFS info (device vdb): read error corrected: ino 257 off 0 
>> (dev /dev/vdb sector 2154496)
>> ^
>>
>> IIRC in case of RAID5/6 the last line is missing. However, in both cases
>> the data returned is good; but in RAID1 the data is also corrected on
>> the disk.
>>
>> Where did you read that the data is not rebuilt automatically?
>>
>> In fact I was surprised that RAID5/6 behaves differently
>>
> 
> Yes, I also tried that and realized that RAID1 is recovering corrupted data 
> at *READ* time.
> 
> The main difference between RAID1 and RAID56 seems to be the complexity.
> 
> For RAID56, we have different read/write behavior; for read, we use the flag
> BTRFS_RBIO_READ_REBUILD, which will only rebuild the data but not write it
> back to disk.
> And I'm a little concerned about the race between read-time fix and write.
> 
> I assume it's possible to change the behavior to follow RAID1, but I'd like 
> to do it in the following steps:
> 1) Fix known RAID56 bugs
>With the v3 patch and previous 2 patches, it seems OK now.
> 2) Full fstests test case, with all possible corruption combination
>(WIP)
> 3) Rework current RAID56 code to a cleaner and more readable status
>(long term)
> 4) Add the support to fix things at read time.
> 
> So the behavior change is not something we will see in short time.

+1
My understanding is that the state of the RAID5/6 code is so bad that we need
to correct the most critical bugs first and then expand the tests to prevent
regressions.
On point 3, I don't know the code well enough to say anything; the code is
very complex.
I see point 4 as the least urgent.

Let me make a request: I would like to know your opinion about my email
"RFC: raid with a variable stripe size", which started a little thread. I am
asking this because you now have your hands on this code: are my suggestion
(use different BGs, with different stripe sizes, to avoid RMW cycles) and
Zygo's (don't fill a stripe if you don't need to, to avoid RMW cycles)
difficult to implement?

BR
G.Baroncelli


> 
> Thanks,
> Qu


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


[PATCH] btrfs-progs: mkfs, balance convert: warn about RAID5/6 in fiery letters

2016-11-28 Thread Adam Borowski
People who frequent neither IRC nor the mailing list tend to believe RAID 5/6
are stable; this leads to data loss.  Thus, let's do warn them.

At this point, I think fiery letters that won't be missed are warranted.

Kernel 4.9 and its -progs will be a part of LTS of multiple distributions,
so leaving experimental features without a warning is inappropriate.

Signed-off-by: Adam Borowski 
---
 cmds-balance.c |  2 ++
 mkfs.c |  2 ++
 utils.c| 46 ++
 utils.h|  2 ++
 4 files changed, 52 insertions(+)

diff --git a/cmds-balance.c b/cmds-balance.c
index f17345e..0967162 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -48,8 +48,10 @@ static int parse_one_profile(const char *profile, u64 *flags)
} else if (!strcmp(profile, "raid10")) {
*flags |= BTRFS_BLOCK_GROUP_RAID10;
} else if (!strcmp(profile, "raid5")) {
+   print_raid56_warning();
*flags |= BTRFS_BLOCK_GROUP_RAID5;
} else if (!strcmp(profile, "raid6")) {
+   print_raid56_warning();
*flags |= BTRFS_BLOCK_GROUP_RAID6;
} else if (!strcmp(profile, "dup")) {
*flags |= BTRFS_BLOCK_GROUP_DUP;
diff --git a/mkfs.c b/mkfs.c
index e501a93..8d460b7 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -378,8 +378,10 @@ static u64 parse_profile(const char *s)
} else if (strcasecmp(s, "raid1") == 0) {
return BTRFS_BLOCK_GROUP_RAID1;
} else if (strcasecmp(s, "raid5") == 0) {
+   print_raid56_warning();
return BTRFS_BLOCK_GROUP_RAID5;
} else if (strcasecmp(s, "raid6") == 0) {
+   print_raid56_warning();
return BTRFS_BLOCK_GROUP_RAID6;
} else if (strcasecmp(s, "raid10") == 0) {
return BTRFS_BLOCK_GROUP_RAID10;
diff --git a/utils.c b/utils.c
index 69b580a..b9e44c2 100644
--- a/utils.c
+++ b/utils.c
@@ -4204,6 +4204,52 @@ out:
return ret;
 }
 
+#define NFL 5
+#define CYCLE (NFL * 2 - 2)
+/* force text-on-black even if reversed */
+#define BG "\e[0;40;"
+static const char *fire_levels[NFL] =
+{ BG"1;30m", BG"31m", BG"1;31m", BG"1;33m", BG"1;37m", };
+#undef BG
+
+/* If this ever gets localized, we'll need to split by characters not bytes. */
+static void text_on_fire(const char *str)
+{
+   int s = 0, sl, chunks, len = strlen(str);
+
+   /* no colors if our output is redirected */
+   if (!isatty(fileno(stderr)))
+   return (void)fputs(str, stderr);
+
+   /* repeat the scheme if too long, try 2-3 letters per level */
+   chunks = (len / CYCLE / 3 + 1) * CYCLE + 1;
+   for (int ch = 0; ch < chunks; ++ch) {
+   int lev = ch % CYCLE;
+   fputs(fire_levels[(lev < NFL) ? lev : CYCLE - lev],
+ stderr);
+   sl = len * (ch + 1) / chunks;
+   for (; s < sl; ++s)
+   fputc(*str++, stderr);
+   }
+   fputs("\e[0m", stderr); /* reset color */
+}
+#undef NFL
+#undef CYCLE
+
+void print_raid56_warning(void)
+{
+   static int given = 0;
+   if (given)
+   return;
+   given = 1;
+
+   fprintf(stderr,
+"WARNING: btrfs RAID5 and 6 are still experimental, suffering from serious\n"
+"problems.  Thus, you are advised not to continue unless you know what you're\n"
+"doing.  In all cases, do keep up-to-date backups!\n");
+   text_on_fire("Don't use RAID5/6 for anything important!\n");
+}
+
 void init_rand_seed(u64 seed)
 {
int i;
diff --git a/utils.h b/utils.h
index 366ca29..00a253a 100644
--- a/utils.h
+++ b/utils.h
@@ -423,6 +423,8 @@ static inline int __error_on(int condition, const char *fmt, ...)
return 1;
 }
 
+void print_raid56_warning(void);
+
 /* Pseudo random number generator wrappers */
 u32 rand_u32(void);
 
-- 
2.10.2



[PATCH] btrfs-progs: man mkfs: warn about RAID5/6 being experimental

2016-11-28 Thread Adam Borowski
Signed-off-by: Adam Borowski 
---
 Documentation/mkfs.btrfs.asciidoc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/Documentation/mkfs.btrfs.asciidoc b/Documentation/mkfs.btrfs.asciidoc
index 9b1d45a..c92d730 100644
--- a/Documentation/mkfs.btrfs.asciidoc
+++ b/Documentation/mkfs.btrfs.asciidoc
@@ -164,6 +164,9 @@ root partition created with RAID1/10/5/6 profiles. The mount action can happen
 before all block devices are discovered. The waiting is usually done on the
 initramfs/initrd systems.
 
+As of kernel 4.9, RAID5/6 is still considered experimental and shouldn't be
+employed for production use.
+
 FILESYSTEM FEATURES
 ---
 
-- 
2.10.2



[no subject]

2016-11-28 Thread foss
Hello,

I have a multi-device btrfs (with problems, more on that later). I looked into 
btrfs-image and was surprised to find that "for i in 5 6 7 8 ; do sudo 
btrfs-image -t2 /dev/sda$i - | md5sum;done" returns a different hash for sda7.  
The other three hashes are the same, as I believe they all should be, or am I 
mistaken?

$ for i in 5 6 7 8 ; do echo -n sda$i:" " ; sudo btrfs-image -t2 -c9 /dev/sda$i - | tee /tmp/sda$i.img | md5sum;done
sda5: 047a1da23edeb9661e79d1006f17eab0  -
sda6: 047a1da23edeb9661e79d1006f17eab0  -
sda7: 9cc7afdc1476daa6f671db8d3c855164  -
sda8: 047a1da23edeb9661e79d1006f17eab0  -

$ sudo btrfs fi show
Label: 'SSD128'  uuid: f110a925-6ad9-4b40-9207-6bf09ce0cb23
	Total devices 4 FS bytes used 105.96GiB
	devid    1 size 100.00GiB used 100.00GiB path /dev/sda5
	devid    2 size 10.00GiB used 10.00GiB path /dev/sda6
	devid    3 size 5.00GiB used 5.00GiB path /dev/sda7
	devid    4 size 1.99GiB used 1.00GiB path /dev/sda8

The difference is somewhere very close to the end of the file:
$ wc -c /tmp/sda?.img
223763456 /tmp/sda5.img
223763456 /tmp/sda6.img
223763456 /tmp/sda7.img
223763456 /tmp/sda8.img
895053824 total
$ for file in /tmp/sda?.img;do echo -n $file: " "; tail -c 22295 $file | md5sum;done
/tmp/sda5.img:  afe8010fa089415aaa55f2d549c3b5d4  -
/tmp/sda6.img:  afe8010fa089415aaa55f2d549c3b5d4  -
/tmp/sda7.img:  afe8010fa089415aaa55f2d549c3b5d4  -
/tmp/sda8.img:  afe8010fa089415aaa55f2d549c3b5d4  -
$ for file in /tmp/sda?.img;do echo -n $file: " "; tail -c 22296 $file | md5sum;done
/tmp/sda5.img:  2bc5fe13e5f2dd8592a24eecb322482e  -
/tmp/sda6.img:  2bc5fe13e5f2dd8592a24eecb322482e  -
/tmp/sda7.img:  ee26d34afa50ccb33e95e869b3e18af3  -
/tmp/sda8.img:  2bc5fe13e5f2dd8592a24eecb322482e  -
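
The first differing offset can also be located directly rather than by
bisecting with tail; a minimal sketch, equivalent to what cmp(1) reports,
using the image paths from the session above:

#include <stdio.h>

int main(void)
{
    FILE *a = fopen("/tmp/sda5.img", "rb");
    FILE *b = fopen("/tmp/sda7.img", "rb");
    long off = 0;
    int ca, cb;

    if (!a || !b)
        return perror("fopen"), 1;
    /* walk both streams until the first mismatching byte */
    while ((ca = getc(a)) == (cb = getc(b)) && ca != EOF)
        off++;
    if (ca == EOF && cb == EOF)
        puts("files are identical");
    else
        printf("first difference at byte offset %ld\n", off);
    return 0;
}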

How could this be?

Regards

Rolf


PS: While I will check back for replies in the archive, a courtesy cc will be 
appreciated since I am not subscribed to the list.


Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Christoph Anton Mitterer
On Mon, 2016-11-28 at 19:45 +0100, Goffredo Baroncelli wrote:
> I am understanding that the status of RAID5/6 code is so badly
Just some random thought:

If the code for raid56 is really as bad as it's often claimed (I
haven't read it, to be honest), could it perhaps make sense to
consider starting it from scratch? And/or merging it with a more generic
approach that allows n-way-parity RAIDs (I think such a patch was posted
here some year(s) ago).


Cheers,
Chris.



Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Christoph Anton Mitterer
On Mon, 2016-11-28 at 19:32 +0100, Goffredo Baroncelli wrote:
> I am assuming that corruption is a quite rare event. So
> occasionally it could happen that a page is corrupted and the system
> corrects it. This shouldn't have an impact on the workloads.

Probably, but it still makes sense to make it configurable, especially
as an option like this would be needed to make a btrfs truly never
write to the device again...

The ro mount option just says that the files/permissions/etc. don't
change - not that no internals change - so it would IMO be an abuse if
ro were used to disable repairs.

And some people may simply want that - if only to rule out the
possibility of corruptions on a ro fs in case of bad memory ;)

Cheers,
Chris.



Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Austin S. Hemmelgarn

On 2016-11-28 14:01, Christoph Anton Mitterer wrote:

On Mon, 2016-11-28 at 19:45 +0100, Goffredo Baroncelli wrote:

I am understanding that the status of RAID5/6 code is so badly

Just some random thought:

If the code for raid56 is really as bad as it's often claimed (I
haven't read it, to be honest)  could it perhaps make sense to
consider to start it from scratch? And/or merge it with a more generic
approach that allows n-way-parity RAIDs (I think such patch was posted
hear some year(s) ago).
Such a suggestion for higher order parity support was made some time ago 
(at least a year and a half I believe, probably more).  It was at the 
time stated that it was planned after n-way replication and raid5/6.


Personally I feel that sort of road-map is misguided, as all three are 
functionally interdependent as the original proposal had suggested, and 
I'm also of the opinion that the raid5/6 code probably should be redone 
from scratch (I wasn't the one who wrote it, and can't contribute much 
more than some attempts at testing as a third-party, so I obviously 
can't make that decision myself).  Doing so would likely make it much 
easier to implement higher order parity (and arbitrary 
striping/replication, which is something I'm _very_ interested in).  The 
existing code isn't bad in a stylistic or even traditional coding sense, 
but makes a significant number of poor choices in the high-level side of 
things (all the issues with parity for example).  If we just want 
working raid5/6, then working on the existing implementation is fine. 
If we want support for arbitrary combinations of 
striping/replication/parity (which would be a killer feature and 
something that BTRFS could actually say nothing else has), then 
rewriting from scratch is probably easier because of some of the 
low-level design choices made in the raid5/6 code.


Part of the problem though is that there are more admins and support 
focused people than coders involved.  In my case I very much do not have 
the expertise to work on kernel code beyond tweaking constants and 
twiddling default values for things (I'm technically a programmer, but 
90% of the work I do is sysops type stuff written in a mix of sh, 
Python, and about 7 different data serialization formats).



Re: [PATCH v2 1/2] Btrfs: add more valid checks for superblock

2016-11-28 Thread Liu Bo
On Fri, Nov 25, 2016 at 05:50:19PM +0100, David Sterba wrote:
> On Fri, Jun 03, 2016 at 12:05:14PM -0700, Liu Bo wrote:
> > @@ -6648,6 +6648,7 @@ int btrfs_read_chunk_tree(struct btrfs_root *root)
> > struct btrfs_key found_key;
> > int ret;
> > int slot;
> > +   u64 total_dev = 0;
> >  
> > root = root->fs_info->chunk_root;
> >  
> > @@ -6689,6 +6690,7 @@ int btrfs_read_chunk_tree(struct btrfs_root *root)
> > ret = read_one_dev(root, leaf, dev_item);
> > if (ret)
> > goto error;
> > +   total_dev++;
> > } else if (found_key.type == BTRFS_CHUNK_ITEM_KEY) {
> > struct btrfs_chunk *chunk;
> > chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
> > @@ -6698,6 +6700,28 @@ int btrfs_read_chunk_tree(struct btrfs_root *root)
> > }
> > path->slots[0]++;
> > }
> > +
> > +   /*
> > +* After loading chunk tree, we've got all device information,
> > +* do another round of validation check.
> > +*/
> > +   if (total_dev != root->fs_info->fs_devices->total_devices) {
> > +   btrfs_err(root->fs_info,
> > +  "super_num_devices(%llu) mismatch with num_devices(%llu) found here",
> > + btrfs_super_num_devices(root->fs_info->super_copy),
> > + total_dev);
> > +   ret = -EINVAL;
> 
> Turns out this check is too strict in combination with an intermediate
> state and a bug in device deletion.
> 
> We've had several reports where a filesystem becomes unmountable due to
> the above check, where the wrong value is just stored in the superblock
> and is fixable by simply setting it to the right value. The right value
> is number of dev items found in the dev tree, which is exactly the
> total_dev that we read here.
> 
> The intermediate state can be caused by removing the device item in
> btrfs_rm_device (see comment before btrfs_rm_dev_item call). This
> apparently happens, when the device deletion is not finished, ie. the
> superblock is not overwritten with a new copy.

I see.

Looks like it currently does commit_transaction in btrfs_rm_dev_item() while it
doesn't in btrfs_add_device, so it turns out that the logic in adding a device
and removing a device is a little bit different:

btrfs_rm_device                        btrfs_init_new_devices
  ...                                    ...
  ->btrfs_rm_dev_item                    ->btrfs_start_transaction()
    ->btrfs_start_transaction            ->mutex_lock(&device_list_mutex)
    ->#rm dev item in chunk tree         ->list_add
    ->btrfs_commit_transaction           ->btrfs_set_super_num_device
  ->mutex_lock(&device_list_mutex)       ->mutex_unlock(&device_list_mutex)
  ->list_del()                           ->btrfs_add_device
  ->btrfs_set_super_num_device           ->btrfs_commit_transaction()
  ->mutex_unlock(&device_list_mutex)


Not sure why we made it different around commit_transaction, but it seems
that we can avoid the situation described above by changing how we play with
the transaction in rm_device.

> 
> So: do you agree to make it just a warning, and fixup the superblock
> num_devices here? A userspace fix is also possible, but when this
> happens and the filesystem is not mountable, a fix from outside of the
> filesystem might be hard.

I'm OK with setting up a warning.

Thanks,

-liubo

> 
> > +   goto error;
> > +   }
> > +   if (btrfs_super_total_bytes(root->fs_info->super_copy) <
> > +   root->fs_info->fs_devices->total_rw_bytes) {
> > +   btrfs_err(root->fs_info,
> > +   "super_total_bytes(%llu) mismatch with fs_devices total_rw_bytes(%llu)",
> > + btrfs_super_total_bytes(root->fs_info->super_copy),
> > + root->fs_info->fs_devices->total_rw_bytes);
> > +   ret = -EINVAL;
> > +   goto error;
> > +   }
> > ret = 0;
> >  error:
> > unlock_chunks(root);


Re: btrfs: still lockdep splat for 4.9-rc5+ (btrfs_log_inode)

2016-11-28 Thread Liu Bo
On Sat, Nov 26, 2016 at 08:46:38AM -0500, Chris Mason wrote:
> On Fri, Nov 25, 2016 at 10:03:25AM +0100, Christian Borntraeger wrote:
> > FWIW, I still see the lockdep splat in btrfs in 4.9-rc5+
> 
> Filipe reworked the code to avoid taking the same lock twice.  As far as I
> can tell, this just needs some annotation.

Besides this log_mutex, we have another false lockdep warning around
block_group_cache->data_rwsem; we can use down_read_nested() to get rid
of it.

Thanks,

-liubo


[PATCH] btrfs: fix uninitialized variable access after ASSERT

2016-11-28 Thread Arnd Bergmann
In btrfs, ASSERT() has no effect if CONFIG_BTRFS_ASSERT is disabled,
and gcc notices that this can lead to using an uninitialized variable:

fs/btrfs/inode.c: In function 'run_delalloc_range':
fs/btrfs/inode.c:1190:18: error: 'cur_end' may be used uninitialized in this 
function [-Werror=maybe-uninitialized]

I assume the condition that the ASSERT checks for is actually
correct and we won't get there in practice, but it's easy to
modify the function to make it simpler and avoid the condition
that the warning is for.
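
For illustration, a standalone sketch of the pattern that triggers the
warning (not the btrfs code itself, just its shape):

#include <stdlib.h>

/* With assertions compiled out, ASSERT(0) is a no-op and the compiler
 * can no longer prove cur_end is always initialized. */
#ifdef CONFIG_BTRFS_ASSERT
#define ASSERT(expr) ((expr) ? (void)0 : abort())
#else
#define ASSERT(expr) ((void)0)
#endif

static int pick_end(int known_case)
{
    int cur_end;

    if (known_case)
        cur_end = 42;
    else
        ASSERT(0);  /* "can't happen" -- but gcc doesn't know that */

    return cur_end; /* -Wmaybe-uninitialized fires here */
}

int main(void)
{
    return pick_end(1);
}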

Fixes: e5249f75cfd0 ("btrfs: Introduce COMPRESS reserve type to fix false 
enospc for compression")
Signed-off-by: Arnd Bergmann 
---
 fs/btrfs/inode.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6f35d923d67..b1d2b38d29aa 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1175,18 +1175,14 @@ static int cow_file_range_async(struct inode *inode, 
struct page *locked_page,
clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end, EXTENT_LOCKED,
 1, 0, NULL, GFP_NOFS);
while (start < end) {
+   ASSERT(reserve_type == BTRFS_RESERVE_COMPRESS);
async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS);
BUG_ON(!async_cow); /* -ENOMEM */
async_cow->inode = igrab(inode);
async_cow->root = root;
async_cow->locked_page = locked_page;
async_cow->start = start;
-
-   if (reserve_type == BTRFS_RESERVE_COMPRESS)
-   cur_end = min(end, start + SZ_512K - 1);
-   else
-   ASSERT(0);
-
+   cur_end = min(end, start + SZ_512K - 1);
async_cow->end = cur_end;
INIT_LIST_HEAD(&async_cow->extents);
 
-- 
2.9.0



Re: [PATCH 6/8] btrfs: calculate end of bio offset properly

2016-11-28 Thread Omar Sandoval
On Fri, Nov 25, 2016 at 09:07:51AM +0100, Christoph Hellwig wrote:
> Use the bvec offset and len members to prepare for multipage bvecs.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/compression.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 77042a8..3c1c25c 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -445,6 +445,13 @@ int btrfs_submit_compressed_write(struct inode *inode, 
> u64 start,
>   return 0;
>  }
>  
> +static u64 bio_end_offset(struct bio *bio)
> +{
> + struct bio_vec *last = &bio->bi_io_vec[bio->bi_vcnt - 1];
> +
> + return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
> +}
> +
>  static noinline int add_ra_bio_pages(struct inode *inode,
>u64 compressed_end,
>struct compressed_bio *cb)
> @@ -463,8 +470,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>   u64 end;
>   int misses = 0;
>  
> - page = cb->orig_bio->bi_io_vec[cb->orig_bio->bi_vcnt - 1].bv_page;
> - last_offset = (page_offset(page) + PAGE_SIZE);
> + last_offset = bio_end_offset(cb->orig_bio);
>   em_tree = &BTRFS_I(inode)->extent_tree;
>   tree = &BTRFS_I(inode)->io_tree;
>  
> -- 
> 2.1.4
> 


Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Zygo Blaxell
On Mon, Nov 28, 2016 at 07:32:38PM +0100, Goffredo Baroncelli wrote:
> On 2016-11-28 04:37, Christoph Anton Mitterer wrote:
> > I think for safety it's best to repair as early as possible (and thus
> > on read when damage is detected), as further blocks/devices may fail
> > till eventually a scrub (with repair) would be run manually.
> > 
> > However, there may be some workloads under which such auto-repair is
> > undesirable, as it may cost performance and safety may be less important
> > than that.
> 
> I am assuming that corruption is a quite rare event. So occasionally
> it could happen that a page is corrupted and the system corrects
> it. This shouldn't have an impact on the workloads.

Depends heavily on the specifics of the failure case.  If a drive's
embedded controller RAM fails, you get corruption on the majority of
reads from a single disk, and most writes will be corrupted (even if they
were not before).  If there's a transient failure due to environmental
issues (e.g. short-term high-amplitude vibration or overheating) then
writes may pause for mechanical retry loops.  If there is bitrot in SSDs
(particularly in the address translation tables) it looks like a wall
of random noise that only ends when the disk goes offline.  You can get
combinations of these (e.g. RAM failures caused by transient overheating)
where the drive's behavior changes over time.

When in doubt, don't write.

> BR
> G.Baroncelli
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
> 




Re: RFC: raid with a variable stripe size

2016-11-28 Thread Qu Wenruo



At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:

Hello,

these are only my thoughts; no code here, but I would like to share it hoping 
that it could be useful.

As reported several times by Zygo (and others), one of the problems of raid5/6
is the write hole. Today BTRFS is not capable of addressing it.


I'd say, no need to address yet, since current soft RAID5/6 can't handle 
it yet.


Personally speaking, Btrfs should implement RAID56 support just like
Btrfs on mdadm.

See how badly the current RAID56 works?

The marginal benefit of btrfs RAID56 scrubbing data better than
traditional RAID56 is just a joke in the current code base.




The problem is that the stripe size is bigger than the "sector size" (ok sector 
is not the correct word, but I am referring to the basic unit of writing on the disk, 
which is 4k or 16K in btrfs).
So when btrfs writes less data than the stripe, the stripe is not filled; when 
it is filled by a subsequent write, a RMW of the parity is required.
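
To make the RMW concrete: updating one data strip of an already-full stripe
recomputes parity from the old contents, two reads plus two writes; a crash
between those two writes is the write hole. A toy sketch of the arithmetic,
with single bytes standing in for whole strips:

/* RAID5 parity update: P' = P ^ D_old ^ D_new.  Requires reading the
 * old data and parity first (the "read" in read-modify-write). */
static unsigned char raid5_new_parity(unsigned char old_parity,
                                      unsigned char old_data,
                                      unsigned char new_data)
{
    return old_parity ^ old_data ^ new_data;
}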

To the best of my understanding (which could be very wrong), ZFS tries to solve
this issue using a variable length stripe.


Did you mean ZFS record size?
IIRC that's the minimum file extent size, and I don't see how that can
handle the write hole problem.


Or did ZFS handle the problem?

Anyway, it should be a low-priority thing, and personally speaking,
any large behavior modification involving both the extent allocator and
the bg allocator will be bug prone.




On BTRFS this could be achieved using several BGs (== block group or chunk), 
one for each stripe size.

For example, if a RAID5 filesystem is composed of 4 disks, the filesystem
should have three BGs:
BG #1,composed by two disks (1 data+ 1 parity)
BG #2 composed by three disks (2 data + 1 parity)
BG #3 composed by four disks (3 data + 1 parity).


Too complicated bg layout and further extent allocator modification.

More code means more bugs, and I'm pretty sure it will be bug prone.


Although the idea of a variable stripe size can somewhat reduce the
problem under certain situations.

For example, if the sectorsize is 64K and we make the stripe length 32K
on a 3-disk RAID5, we can avoid the write hole problem: every 64K sector
write then covers a full stripe (two 32K data strips plus parity), so no
RMW is ever needed.

Without modification to the extent/chunk allocator.

And I'd prefer to make the stripe length a mkfs-time parameter, not
modifiable after mkfs. To keep things easy.


Thanks,
Qu



If the data to be written has a size of 4k, it will be allocated to the BG #1.
If the data to be written has a size of 8k, it will be allocated to the BG #2
If the data to be written has a size of 12k, it will be allocated to the BG #3
If the data to be written has a size greater than 12k, it will be allocated to
BG #3 until the data fills full stripes; then the remainder will be stored
in BG #1 or BG #2.


To avoid unbalancing the disk usage, each BG could use all the disks, even
if a stripe uses fewer disks, e.g.:

DISK1 DISK2 DISK3 DISK4
 S1    S1    S1    S2
 S2    S2    S3    S3
 S3    S4    S4    S4
[...]

Above is shown a BG which uses all four disks, but has a stripe which spans
only 3 disks.


Pros:
- btrfs is already capable of handling different BGs in the filesystem; only
the allocator has to change
- no more RMWs are required (== higher performance)

Cons:
- the data will be more fragmented
- the filesystem will have more BGs; this will require a re-balance from
time to time. But this is an issue we already know about (even if it may
not be 100% addressed).


Thoughts ?

BR
G.Baroncelli








Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Christoph Anton Mitterer
On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote:
> If a drive's
> embedded controller RAM fails, you get corruption on the majority of
> reads from a single disk, and most writes will be corrupted (even if
> they
> were not before).

Administrating a multi-PiB Tier-2 for the LHC Computing Grid with quite
a number of disks for nearly 10 years now, I have never stumbled on
such a case of breakage so far...

Actually most cases are as simple as an HDD failing to work, and this is
properly signalled to the controller.



> If there's a transient failure due to environmental
> issues (e.g. short-term high-amplitude vibration or overheating) then
> writes may pause for mechanical retry loops.  If there is bitrot in
> SSDs
> (particularly in the address translation tables) it looks like a wall
> of random noise that only ends when the disk goes offline.  You can
> get
> combinations of these (e.g. RAM failures caused by transient
> overheating)
> where the drive's behavior changes over time.
> 
> When in doubt, don't write.

Sorry, but these cases, like any memory issues (be it main memory
or the HDD controller), would also kick in on any normal writes.

So there's no point in protecting against this on the storage side...

Either never write at all... or have good backups for these rare cases.



Cheers,
Chris.



Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Zygo Blaxell
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote:
> > If a drive's
> > embedded controller RAM fails, you get corruption on the majority of
> > reads from a single disk, and most writes will be corrupted (even if
> > they
> > were not before).
> 
> Administrating a multi-PiB Tier-2 for the LHC Computing Grid with quite
> a number of disks for nearly 10 years now, I have never stumbled on
> such a case of breakage so far...

In data centers you won't see breakages that are common on desktop and
laptop drives.  Laptops in particular sometimes (often?) go to places
that are much less friendly to hardware.

All my NAS and enterprise drives in server racks and data centers just
wake up one morning stone dead or with a few well-behaved bad sectors,
with none of this drama.  Boring!

> Actually most cases are as simple as an HDD failing to work, and this is
> properly signalled to the controller.
> 
> 
> 
> > If there's a transient failure due to environmental
> > issues (e.g. short-term high-amplitude vibration or overheating) then
> > writes may pause for mechanical retry loops.  If there is bitrot in
> > SSDs
> > (particularly in the address translation tables) it looks like a wall
> > of random noise that only ends when the disk goes offline.  You can
> > get
> > combinations of these (e.g. RAM failures caused by transient
> > overheating)
> > where the drive's behavior changes over time.
> > 
> > When in doubt, don't write.
> 
> Sorry, but these cases, like any memory issues (be it main memory
> or the HDD controller), would also kick in on any normal writes.

Yes, but in a RAID1 context there will be another disk with a good copy
(or if main RAM is failing, the entire filesystem will be toast no matter
what you do).

> So there's no point in protecting against this on the storage side...
> 
> Either never write at all... or have good backups for these rare cases.
> 
> 
> 
> Cheers,
> Chris.






Re: RFC: raid with a variable stripe size

2016-11-28 Thread Zygo Blaxell
On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >Hello,
> >
> >these are only my thoughts; no code here, but I would like to share it 
> >hoping that it could be useful.
> >
> >As reported several times by Zygo (and others), one of the problems
> >of raid5/6 is the write hole. Today BTRFS is not capable of addressing it.
> 
> I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> yet.
> 
> Personally speaking, Btrfs should implement RAID56 support just like
> Btrfs on mdadm.

Even mdadm doesn't implement it the way btrfs does (assuming all bugs
are fixed) any more.

> See how badly the current RAID56 works?

> The marginal benefit of btrfs RAID56 scrubbing data better than traditional
> RAID56 is just a joke in the current code base.

> >The problem is that the stripe size is bigger than the "sector size"
> >(ok sector is not the correct word, but I am referring to the basic
> >unit of writing on the disk, which is 4k or 16K in btrfs).  So when
> >btrfs writes less data than the stripe, the stripe is not filled; when
> >it is filled by a subsequent write, a RMW of the parity is required.
> >
> >To the best of my understanding (which could be very wrong), ZFS tries
> >to solve this issue using a variable length stripe.
>
> Did you mean ZFS record size?
> IIRC that's file extent minimum size, and I didn't see how that can handle
> the write hole problem.
> 
> Or did ZFS handle the problem?

ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
parity blocks within extents, so it behaves more like btrfs compression
in the sense that the data in a RAID-Z extent is encoded differently
from the data in the file, and the kernel has to transform it on reads
and writes.

No ZFS stripe can contain blocks from multiple different
transactions because the RAID-Z stripes begin and end on extent
(single-transaction-write) boundaries, so there is no write hole on ZFS.

There is some space waste in ZFS because the minimum allocation unit
is two blocks (one data one parity) so any free space that is less
than two blocks long is unusable.  Also the maximum usable stripe width
(number of disks) is the size of the data in the extent plus one parity
block.  It means if you write a lot of discontiguous 4K blocks, you
effectively get 2-disk RAID1 and that may result in disappointing
storage efficiency.

(the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
for additional parity blocks).
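
To make the space math concrete, a back-of-envelope sketch (my own reading
of the allocation rules just described, not ZFS code):

#include <stdio.h>

/* Rough RAID-Z1 cost of an extent of `data` blocks on an `ndisks`-wide
 * vdev: one parity block per row of up to ndisks-1 data blocks, rounded
 * up to the two-block minimum allocation unit mentioned above. */
static unsigned raidz1_blocks(unsigned data, unsigned ndisks)
{
    unsigned rows = (data + ndisks - 2) / (ndisks - 1);
    unsigned total = data + rows;   /* data + parity */

    return (total + 1) & ~1u;       /* pad to a multiple of 2 */
}

int main(void)
{
    /* one 4K block on 5 disks costs 2 blocks, i.e. the "effectively
     * 2-disk RAID1" case from the text */
    printf("1 data block, 5 disks: %u blocks\n", raidz1_blocks(1, 5));
    printf("8 data blocks, 5 disks: %u blocks\n", raidz1_blocks(8, 5));
    return 0;
}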

One could implement RAID-Z on btrfs, but it's by far the most invasive
proposal for fixing btrfs's write hole so far (and doesn't actually fix
anything, since the existing raid56 format would still be required to
read old data, and it would still be broken).

> Anyway, it should be a low-priority thing, and personally speaking,
> any large behavior modification involving both the extent allocator and
> the bg allocator will be bug prone.

My proposal requires only a modification to the extent allocator.
The behavior at the block group layer and scrub remains exactly the same.
We just need to adjust the allocator slightly to take the RAID5 CoW
constraints into account.

It's not as efficient as the ZFS approach, but it doesn't require an
incompatible disk format change either.

> >On BTRFS this could be achieved using several BGs (== block group or chunk), 
> >one for each stripe size.
> >
> >For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem 
> >should have three BGs:
> >BG #1,composed by two disks (1 data+ 1 parity)
> >BG #2 composed by three disks (2 data + 1 parity)
> >BG #3 composed by four disks (3 data + 1 parity).
> 
> Too complicated bg layout and further extent allocator modification.
> 
> More code means more bugs, and I'm pretty sure it will be bug prone.
> 
> 
> Although the idea of a variable stripe size can somewhat reduce the problem
> under certain situations.
> 
> For example, if the sectorsize is 64K and we make the stripe length 32K on a
> 3-disk RAID5, we can avoid the write hole problem,
> without modification to the extent/chunk allocator.
> 
> And I'd prefer to make the stripe length a mkfs-time parameter, not possible
> to modify after mkfs. To keep things easy.
> 
> Thanks,
> Qu
> 
> >
> >If the data to be written has a size of 4k, it will be allocated to the BG 
> >#1.
> >If the data to be written has a size of 8k, it will be allocated to the BG #2
> >If the data to be written has a size of 12k, it will be allocated to the BG 
> >#3
> >If the data to be written has a size greater than 12k, it will be allocated 
> >to the BG3, until the data fills a full stripes; then the remainder will be 
> >stored in BG #1 or BG #2.
> >
> >
> >To avoid unbalancing of the disk usage, each BG could use all the disks, 
> >even if a stripe uses less disks: i.e
> >
> >DISK1 DISK2 DISK3 DISK4
> > S1    S1    S1    S2
> > S2    S2    S3    S3
> > S3    S4    S4    S4
> >[...]
> >
> >Above is shown a BG which uses all four disks, but has a stripe which
> >spans only 3 disks.

Re: RFC: raid with a variable stripe size

2016-11-28 Thread Qu Wenruo



At 11/29/2016 11:53 AM, Zygo Blaxell wrote:

On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:

At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:

Hello,

these are only my thoughts; no code here, but I would like to share it hoping 
that it could be useful.

As reported several times by Zygo (and others), one of the problem

of raid5/6 is the write hole. Today BTRFS is not capable to address it.

I'd say, no need to address yet, since current soft RAID5/6 can't handle it
yet.

Personally speaking, Btrfs should implement RAID56 support just like
Btrfs on mdadm.


Even mdadm doesn't implement it the way btrfs does (assuming all bugs
are fixed) any more.


See how badly the current RAID56 works?



The marginal benefit of btrfs RAID56 scrubbing data better than traditional
RAID56 is just a joke in the current code base.



The problem is that the stripe size is bigger than the "sector size"
(ok sector is not the correct word, but I am referring to the basic
unit of writing on the disk, which is 4k or 16K in btrfs).  So when
btrfs writes less data than the stripe, the stripe is not filled; when
it is filled by a subsequent write, a RMW of the parity is required.


To the best of my understanding (which could be very wrong), ZFS tries
to solve this issue using a variable length stripe.

Did you mean ZFS record size?
IIRC that's the minimum file extent size, and I don't see how that can handle
the write hole problem.

Or did ZFS handle the problem?


ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
parity blocks within extents, so it behaves more like btrfs compression
in the sense that the data in a RAID-Z extent is encoded differently
from the data in the file, and the kernel has to transform it on reads
and writes.

No ZFS stripe can contain blocks from multiple different
transactions because the RAID-Z stripes begin and end on extent
(single-transaction-write) boundaries, so there is no write hole on ZFS.

There is some space waste in ZFS because the minimum allocation unit
is two blocks (one data one parity) so any free space that is less
than two blocks long is unusable.  Also the maximum usable stripe width
(number of disks) is the size of the data in the extent plus one parity
block.  It means if you write a lot of discontiguous 4K blocks, you
effectively get 2-disk RAID1 and that may result in disappointing
storage efficiency.

(the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
for additional parity blocks).

One could implement RAID-Z on btrfs, but it's by far the most invasive
proposal for fixing btrfs's write hole so far (and doesn't actually fix
anything, since the existing raid56 format would still be required to
read old data, and it would still be broken).


Anyway, it should be a low-priority thing, and personally speaking,
any large behavior modification involving both the extent allocator and
the bg allocator will be bug prone.


My proposal requires only a modification to the extent allocator.
The behavior at the block group layer and scrub remains exactly the same.
We just need to adjust the allocator slightly to take the RAID5 CoW
constraints into account.


Then, you'd need to allow btrfs to split large buffered/direct writes
into small extents (not 128M anymore).

Not sure if we need to do extra work for DirectIO.

And in fact, you're going to support a variable max file extent size.

This makes delalloc more complex (Wang enhanced delalloc support for
variable file extent sizes, to fix the ENOSPC problem for dedupe and
compression).


This is already much more complex than you expected.

And this is the *BIGGEST* problem of current btrfs:
No good enough (if there is any) *ISOLATION* for such a complex fs.

So even "small" modifications can lead to unexpected bugs.

That's why I want to isolate the fix in the RAID56 layer, not any layer
upwards.
If that's not possible, I prefer not to do anything yet, until we are sure
the very basic part of RAID56 is stable.


Thanks,
Qu



It's not as efficient as the ZFS approach, but it doesn't require an
incompatible disk format change either.


On BTRFS this could be achieved using several BGs (== block group or chunk), 
one for each stripe size.

For example, if a filesystem - RAID5 is composed by 4 DISK, the filesystem 
should have three BGs:
BG #1,composed by two disks (1 data+ 1 parity)
BG #2 composed by three disks (2 data + 1 parity)
BG #3 composed by four disks (3 data + 1 parity).


Too complicated bg layout and further extent allocator modification.

More code means more bugs, and I'm pretty sure it will be bug prone.


Although the idea of a variable stripe size can somewhat reduce the problem
under certain situations.

For example, if the sectorsize is 64K and we make the stripe length 32K on a
3-disk RAID5, we can avoid the write hole problem,
without modification to the extent/chunk allocator.

And I'd prefer to make the stripe length a mkfs-time parameter, not possible
to modify after mkfs. To keep things easy.

Thanks,
Qu



If 

[PATCH 0/3] Qgroup and inode_cache fix, with small cleanups

2016-11-28 Thread Qu Wenruo
This patchset fixes a qgroup bug when using the inode_cache mount option.

The bug reminds us that the design of a separate
btrfs_qgroup_prepare_account_extents() is bad practice, especially since the
safest timing is to call prepare just before btrfs_qgroup_account_extents().

So the patchset cleans up btrfs_qgroup_prepare_account_extents() to
avoid further similar bugs.

And it adds a quick exit for non-fs extents in qgroup, mainly to reduce the
noise of qgroup trace events.

Qu Wenruo (3):
  btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount
option
  btrfs: qgroup: Add quick exit for non-fs extents
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function

 fs/btrfs/qgroup.c  | 82 +-
 fs/btrfs/qgroup.h  |  2 --
 fs/btrfs/transaction.c | 20 ++--
 3 files changed, 58 insertions(+), 46 deletions(-)

-- 
2.10.2





[PATCH 3/3] btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function

2016-11-28 Thread Qu Wenruo
Quite a lot of qgroup corruption happens due to the wrong timing of calling
btrfs_qgroup_prepare_account_extents().

Since the safest timing is calling it just before
btrfs_qgroup_account_extents(), there is no need to keep these two
functions separate.

Merging them will make the code cleaner and less bug prone.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c  | 41 +
 fs/btrfs/qgroup.h  |  2 --
 fs/btrfs/transaction.c |  9 -
 3 files changed, 9 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b9a2fd1..00e3c16 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1419,37 +1419,6 @@ out:
return ret;
 }
 
-int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info)
-{
-   struct btrfs_qgroup_extent_record *record;
-   struct btrfs_delayed_ref_root *delayed_refs;
-   struct rb_node *node;
-   u64 qgroup_to_skip;
-   int ret = 0;
-
-   delayed_refs = &trans->transaction->delayed_refs;
-   qgroup_to_skip = delayed_refs->qgroup_to_skip;
-
-   /*
-* No need to do lock, since this function will only be called in
-* btrfs_commit_transaction().
-*/
-   node = rb_first(&delayed_refs->dirty_extent_root);
-   while (node) {
-   record = rb_entry(node, struct btrfs_qgroup_extent_record,
- node);
-   ret = btrfs_find_all_roots(NULL, fs_info, record->bytenr, 0,
-  &record->old_roots);
-   if (ret < 0)
-   break;
-   if (qgroup_to_skip)
-   ulist_del(record->old_roots, qgroup_to_skip, 0);
-   node = rb_next(node);
-   }
-   return ret;
-}
-
 int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_root *delayed_refs,
struct btrfs_qgroup_extent_record *record)
@@ -2038,6 +2007,11 @@ int btrfs_qgroup_account_extents(struct 
btrfs_trans_handle *trans,
trace_btrfs_qgroup_account_extents(fs_info, record);
 
if (!ret) {
+   /* Search commit root to find old_roots */
+   ret = btrfs_find_all_roots(NULL, fs_info,
+   record->bytenr, 0, &record->old_roots);
+   if (ret < 0)
+   goto cleanup;
/*
 * Use (u64)-1 as time_seq to do special search, which
 * doesn't lock tree or delayed_refs and search current
@@ -2047,8 +2021,11 @@ int btrfs_qgroup_account_extents(struct 
btrfs_trans_handle *trans,
record->bytenr, (u64)-1, &new_roots);
if (ret < 0)
goto cleanup;
-   if (qgroup_to_skip)
+   if (qgroup_to_skip) {
ulist_del(new_roots, qgroup_to_skip, 0);
+   ulist_del(record->old_roots, qgroup_to_skip,
+ 0);
+   }
ret = btrfs_qgroup_account_extent(trans, fs_info,
record->bytenr, record->num_bytes,
record->old_roots, new_roots);
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 99c879d..2c7f701 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -90,8 +90,6 @@ int btrfs_limit_qgroup(struct btrfs_trans_handle *trans,
 int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
 void btrfs_free_qgroup_config(struct btrfs_fs_info *fs_info);
 struct btrfs_delayed_extent_op;
-int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info);
 /*
  * Inform qgroup to trace one dirty extent, its info is recorded in @record.
  * So qgroup can account it at commit trans time.
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f65659b..47a8f489 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1346,9 +1346,6 @@ static int qgroup_account_snapshot(struct 
btrfs_trans_handle *trans,
ret = commit_fs_roots(trans, src);
if (ret)
goto out;
-   ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-   if (ret < 0)
-   goto out;
ret = btrfs_qgroup_account_extents(trans, fs_info);
if (ret < 0)
goto out;
@@ -2153,12 +2150,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
goto scrub_continue;
}
 
-   ret = btrfs_qgroup_prepare_account_extents(trans, root->fs_info);
-   if (ret) {
-   mutex_unl

[PATCH 1/3] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option

2016-11-28 Thread Qu Wenruo
[BUG]
The easiest way to reproduce the bug is:
--
 # mkfs.btrfs -f $dev -n 16K
 # mount $dev $mnt -o inode_cache
 # btrfs quota enable $mnt
 # btrfs quota rescan -w $mnt
 # btrfs qgroup show $mnt
qgroupid rfer excl
  
0/5  32.00KiB 32.00KiB
 ^^ Twice the correct value
--

The fstests btrfs qgroup test group can also easily detect this with
the inode_cache mount option, although some of the failures are false
alerts, since the old test cases use fixed golden output.
New test cases will use "btrfs check" to detect qgroup mismatches
instead.

[CAUSE]
The inode_cache mount option makes commit_fs_roots() call
btrfs_save_ino_cache() to update the fs/subvol trees, which generates
new delayed refs.

However we call btrfs_qgroup_prepare_account_extents() too early,
before commit_fs_roots().
This makes the "old_roots" for newly generated extents always NULL.
For the extent-freeing case, both new_roots and old_roots then end up
empty, while the correct old_roots should not be empty.
This causes the qgroup numbers to not be decreased correctly.

[FIX]
Move the call to btrfs_qgroup_prepare_account_extents() to just before
btrfs_qgroup_account_extents(), and add the needed delayed_refs
handling, so qgroup can handle the inode_cache mount option correctly.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/transaction.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e66a18e..f65659b 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2102,13 +2102,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
goto scrub_continue;
}
 
-   /* Reocrd old roots for later qgroup accounting */
-   ret = btrfs_qgroup_prepare_account_extents(trans, root->fs_info);
-   if (ret) {
-   mutex_unlock(&root->fs_info->reloc_mutex);
-   goto scrub_continue;
-   }
-
/*
 * make sure none of the code above managed to slip in a
 * delayed item
@@ -2151,6 +2144,22 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
btrfs_free_log_root_tree(trans, root->fs_info);
 
/*
+* commit_fs_roots() can call btrfs_save_ino_cache(), which generates
+* new delayed refs. Must handle them or qgroup can be wrong.
+*/
+   ret = btrfs_run_delayed_refs(trans, root, (unsigned long)-1);
+   if (ret) {
+   mutex_unlock(&root->fs_info->reloc_mutex);
+   goto scrub_continue;
+   }
+
+   ret = btrfs_qgroup_prepare_account_extents(trans, root->fs_info);
+   if (ret) {
+   mutex_unlock(&root->fs_info->reloc_mutex);
+   goto scrub_continue;
+   }
+
+   /*
 * Since fs roots are all committed, we can get a quite accurate
 * new_roots. So let's do quota accounting.
 */
-- 
2.10.2



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] btrfs: qgroup: Add quick exit for non-fs extents

2016-11-28 Thread Qu Wenruo
Modify btrfs_qgroup_account_extent() to make it exit quicker for
non-fs extents.

This will also reduce the noise in trace_btrfs_qgroup_account_extent
event.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c | 41 +++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index e97f304..b9a2fd1 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1904,6 +1904,33 @@ static int qgroup_update_counters(struct btrfs_fs_info *fs_info,
return 0;
 }
 
+/*
+ * Helper to check if the @roots is a list of fs tree roots
+ * Return 0 for definitely not a fs/subvol tree roots ulist
+ * Return 1 for possible fs/subvol tree roots ulist(including empty)
+ */
+static int maybe_fs_roots(struct ulist *roots)
+{
+   struct ulist_node *unode;
+   struct ulist_iterator uiter;
+
+   /* Empty one, still possible for fs roots */
+   if (!roots || roots->nnodes == 0)
+   return 1;
+
+   ULIST_ITER_INIT(&uiter);
+   unode = ulist_next(roots, &uiter);
+   if (!unode)
+   return 1;
+
+   /*
+* If it contains fs tree roots, then it must belongs to fs/subvol
+* trees.
+* If it contains non-fs tree, it won't be shared to fs/subvol trees.
+*/
+   return is_fstree(unode->val);
+}
+
 int
 btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info,
@@ -1917,10 +1944,20 @@ btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
u64 nr_old_roots = 0;
int ret = 0;
 
-   if (new_roots)
+   if (new_roots) {
+   if (!maybe_fs_roots(new_roots))
+   goto out_free;
nr_new_roots = new_roots->nnodes;
-   if (old_roots)
+   }
+   if (old_roots) {
+   if (!maybe_fs_roots(old_roots))
+   goto out_free;
nr_old_roots = old_roots->nnodes;
+   }
+
+   /* Quick exit, either not fs tree roots, or won't affect any qgroup */
+   if (nr_old_roots == 0 && nr_new_roots == 0)
+   goto out_free;
 
if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
goto out_free;
-- 
2.10.2



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: raid with a variable stripe size

2016-11-28 Thread Zygo Blaxell
On Tue, Nov 29, 2016 at 12:12:03PM +0800, Qu Wenruo wrote:
> 
> 
> At 11/29/2016 11:53 AM, Zygo Blaxell wrote:
> >On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:
> >>At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
> >>>Hello,
> >>>
> >>>these are only my thoughts; no code here, but I would like to share it 
> >>>hoping that it could be useful.
> >>>
> >>>As reported several times by Zygo (and others), one of the problem
> >>>of raid5/6 is the write hole. Today BTRFS is not capable to address it.
> >>
> >>I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> >>yet.
> >>
> >>Personally speaking, Btrfs should implementing RAID56 support just like
> >>Btrfs on mdadm.
> >
> >Even mdadm doesn't implement it the way btrfs does (assuming all bugs
> >are fixed) any more.
> >
> >>See how badly the current RAID56 works?
> >
> >>The marginally benefit of btrfs RAID56 to scrub data better than tradition
> >>RAID56 is just a joke in current code base.
> >
> >>>The problem is that the stripe size is bigger than the "sector size"
> >>>(ok sector is not the correct word, but I am referring to the basic
> >>>unit of writing on the disk, which is 4k or 16K in btrfs).
> >>>So when btrfs writes less data than the stripe, the stripe is not
> >>>filled; when it is filled by a subsequent write, a RMW of the parity
> >>>is required.
> >>>
> >>>On the best of my understanding (which could be very wrong) ZFS try
> >>>to solve this issue using a variable length stripe.
> >>
> >>Did you mean ZFS record size?
> >>IIRC that's file extent minimum size, and I didn't see how that can handle
> >>the write hole problem.
> >>
> >>Or did ZFS handle the problem?
> >
> >ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
> >parity blocks within extents, so it behaves more like btrfs compression
> >in the sense that the data in a RAID-Z extent is encoded differently
> >from the data in the file, and the kernel has to transform it on reads
> >and writes.
> >
> >No ZFS stripe can contain blocks from multiple different
> >transactions because the RAID-Z stripes begin and end on extent
> >(single-transaction-write) boundaries, so there is no write hole on ZFS.
> >
> >There is some space waste in ZFS because the minimum allocation unit
> >is two blocks (one data one parity) so any free space that is less
> >than two blocks long is unusable.  Also the maximum usable stripe width
> >(number of disks) is the size of the data in the extent plus one parity
> >block.  It means if you write a lot of discontiguous 4K blocks, you
> >effectively get 2-disk RAID1 and that may result in disappointing
> >storage efficiency.
> >
> >(the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
> >for additional parity blocks).
> >
> >One could implement RAID-Z on btrfs, but it's by far the most invasive
> >proposal for fixing btrfs's write hole so far (and doesn't actually fix
> >anything, since the existing raid56 format would still be required to
> >read old data, and it would still be broken).
> >
> >>Anyway, it should be a low priority thing, and personally speaking,
> >>any large behavior modification involving  both extent allocator and bg
> >>allocator will be bug prone.
> >
> >My proposal requires only a modification to the extent allocator.
> >The behavior at the block group layer and scrub remains exactly the same.
> >We just need to adjust the allocator slightly to take the RAID5 CoW
> >constraints into account.
> 
> Then, you'd need to allow btrfs to split large buffered/direct write into
> small extents(not 128M anymore).
> Not sure if we need to do extra work for DirectIO.

Nope, that's not my proposal.  My proposal is to simply ignore free
space whenever it's inside a partially filled raid stripe (optimization:
...which was empty at the start of the current transaction).

That avoids modifying a stripe with committed data and therefore plugs the
write hole.
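
Concretely, the allocator-side check could look roughly like this (a
minimal sketch with hypothetical structures and helper data, not
actual btrfs code):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block-group state; not a real btrfs structure. */
struct stripe_state {
	uint64_t stripe_len;        /* bytes of data per full stripe */
	const bool *had_committed;  /* per stripe: held data at txn start? */
};

/*
 * Treat free space as unusable if any stripe it touches already held
 * committed data when the current transaction began: writing there
 * would require an RMW of parity and reopen the write hole.  Stripes
 * that were empty at transaction start are CoW-safe to fill.
 */
static bool range_safe_to_allocate(const struct stripe_state *st,
				   uint64_t start, uint64_t len)
{
	uint64_t first = start / st->stripe_len;
	uint64_t last = (start + len - 1) / st->stripe_len;
	uint64_t i;

	for (i = first; i <= last; i++)
		if (st->had_committed[i])
			return false;
	return true;
}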

For nodatacow, prealloc (and maybe directio?) extents the behavior
wouldn't change (you'd have write hole, but only on data blocks not
metadata, and only on files that were already marked as explicitly not
requiring data integrity).

> And in fact, you're going to support variant max file extent size.

The existing extent sizing behavior is not changed *at all* in my proposal,
only the allocator's notion of what space is 'free'.

We can write an extent across multiple RAID5 stripes so long as we
finish writing the entire extent before pointing committed metadata to
it.  btrfs does that already otherwise checksums wouldn't work.

> This makes delalloc more complex (Wang enhanced dealloc support for variant
> file extent size, to fix ENOSPC problem for dedupe and compression).
> 
> This is already much more complex than you expected.

The complexity I anticipate is having to deal with two implementations
of the free space search, one for free space cache and one for free
space tree.

It could be as simple as calling the existing allocation fun

Re: mount option nodatacow for VMs on SSD?

2016-11-28 Thread Duncan
Niccolò Belli posted on Mon, 28 Nov 2016 12:11:49 +0100 as excerpted:

> On lunedì 28 novembre 2016 09:20:15 CET, Kai Krakow wrote:
>> You can, however, use chattr to make the subvolume root directory (that
>> one where it is mounted) nodatacow (chattr +C) _before_ placing any
>> files or directories in there. That way, newly created files and
>> directories will inherit the flag. Take note that this flag can only
>> applied to directories and empty (zero-sized) files.
> 
> Do I keep checksumming for this directory such a way?

No.  Keeping checksums current on NOCOW files is racy, and compression 
would be complex as well, because rewritten data may compress better or 
worse than the original, so the on-filesystem size could change.  Both 
features are therefore disabled in the presence of NOCOW, regardless of 
how it is set.

Put another way, btrfs assumes COW by default and many of its features 
depend on COW -- that's why these features don't tend to be implemented 
on conventional rewrite-in-place filesystems in the first place.  Both 
checksumming and compression are among these COW-dependent features.
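
For reference, chattr +C is just the FS_NOCOW_FL inode flag; a minimal
sketch of setting it directly (plain C, error handling trimmed), which
as discussed only makes sense on directories and empty files:

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Equivalent of "chattr +C path": set FS_NOCOW_FL so files created in
 * the directory inherit NOCOW.  Returns 0 on success, -1 on error. */
int set_nocow(const char *path)
{
	int fd = open(path, O_RDONLY);
	int flags;

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		close(fd);
		return -1;
	}
	flags |= FS_NOCOW_FL;
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}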

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Not TLS] Re: mount option nodatacow for VMs on SSD?

2016-11-28 Thread Duncan
Graham Cobb posted on Mon, 28 Nov 2016 09:49:33 + as excerpted:

> On 28/11/16 02:56, Duncan wrote:
>> It should still be worth turning on autodefrag on an existing somewhat
>> fragmented filesystem.  It just might take some time to defrag files
>> you do modify, and won't touch those you don't, which in some cases
>> might make it worth defragging those manually.  Or simply create new
>> filesystems, mount them with autodefrag, and copy everything over so
>> you're starting fresh, as I do.
> 
> Could that "copy" be (a series of) send/receive, so that snapshots and
> reflinks are preserved?  Does autodefrag work in that case or does the
> send/receive somehow override that and end up preserving the original
> (fragmented) extent structure?

Very good question that I don't know the answer to as I've not seen it 
discussed previously.  (I'm not a dev, just a list regular and user of 
btrfs myself, and my personal use-case involves neither snapshots nor 
send/receive, so on those topics if I've not seen it covered previously 
either here or on the wiki, I won't know.)

Someone else know?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: raid with a variable stripe size

2016-11-28 Thread Qu Wenruo



At 11/29/2016 12:55 PM, Zygo Blaxell wrote:

On Tue, Nov 29, 2016 at 12:12:03PM +0800, Qu Wenruo wrote:



At 11/29/2016 11:53 AM, Zygo Blaxell wrote:

On Tue, Nov 29, 2016 at 08:48:19AM +0800, Qu Wenruo wrote:

At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:

Hello,

these are only my thoughts; no code here, but I would like to share it hoping 
that it could be useful.

As reported several times by Zygo (and others), one of the problem

of raid5/6 is the write hole. Today BTRFS is not capable to address it.

I'd say, no need to address yet, since current soft RAID5/6 can't handle it
yet.

Personally speaking, Btrfs should implementing RAID56 support just like
Btrfs on mdadm.


Even mdadm doesn't implement it the way btrfs does (assuming all bugs
are fixed) any more.


See how badly the current RAID56 works?



The marginally benefit of btrfs RAID56 to scrub data better than tradition
RAID56 is just a joke in current code base.



The problem is that the stripe size is bigger than the "sector size"

(ok sector is not the correct word, but I am referring to the basic
unit of writing on the disk, which is 4k or 16K in btrfs).  >So when
btrfs writes less data than the stripe, the stripe is not filled; when
it is filled by a subsequent write, a RMW of the parity is required.


On the best of my understanding (which could be very wrong) ZFS try

to solve this issue using a variable length stripe.

Did you mean ZFS record size?
IIRC that's file extent minimum size, and I didn't see how that can handle
the write hole problem.

Or did ZFS handle the problem?


ZFS's strategy does solve the write hole.  In btrfs terms, ZFS embeds the
parity blocks within extents, so it behaves more like btrfs compression
in the sense that the data in a RAID-Z extent is encoded differently

>from the data in the file, and the kernel has to transform it on reads

and writes.

No ZFS stripe can contain blocks from multiple different
transactions because the RAID-Z stripes begin and end on extent
(single-transaction-write) boundaries, so there is no write hole on ZFS.

There is some space waste in ZFS because the minimum allocation unit
is two blocks (one data one parity) so any free space that is less
than two blocks long is unusable.  Also the maximum usable stripe width
(number of disks) is the size of the data in the extent plus one parity
block.  It means if you write a lot of discontiguous 4K blocks, you
effectively get 2-disk RAID1 and that may result in disappointing
storage efficiency.

(the above is for RAID-Z1.  For Z2 and Z3 add an extra block or two
for additional parity blocks).

One could implement RAID-Z on btrfs, but it's by far the most invasive
proposal for fixing btrfs's write hole so far (and doesn't actually fix
anything, since the existing raid56 format would still be required to
read old data, and it would still be broken).


Anyway, it should be a low priority thing, and personally speaking,
any large behavior modification involving  both extent allocator and bg
allocator will be bug prone.


My proposal requires only a modification to the extent allocator.
The behavior at the block group layer and scrub remains exactly the same.
We just need to adjust the allocator slightly to take the RAID5 CoW
constraints into account.


Then, you'd need to allow btrfs to split large buffered/direct write into
small extents(not 128M anymore).
Not sure if we need to do extra work for DirectIO.


Nope, that's not my proposal.  My proposal is to simply ignore free
space whenever it's inside a partially filled raid stripe (optimization:
...which was empty at the start of the current transaction).


Still, there are problems.

The allocator must correctly handle a fs under device remove or profile 
convert (e.g. from 4-disk raid5 to 5-disk raid5/6).

That already seems complex to me.


And furthermore, for a fs with more devices, for example a 9-device
RAID5, it would be a disaster for a single 4K write to take up the
whole 8 * 64K stripe.

It will definitely cause huge ENOSPC problems.

If you really think it's easy, make an RFC patch, which should be easy 
if it is, then run the fstests auto group on it.


Easy words won't turn emails into a real patch.



That avoids modifying a stripe with committed data and therefore plugs the
write hole.

For nodatacow, prealloc (and maybe directio?) extents the behavior
wouldn't change (you'd have write hole, but only on data blocks not
metadata, and only on files that were already marked as explicitly not
requiring data integrity).


And in fact, you're going to support variant max file extent size.


The existing extent sizing behavior is not changed *at all* in my proposal,
only the allocator's notion of what space is 'free'.

We can write an extent across multiple RAID5 stripes so long as we
finish writing the entire extent before pointing committed metadata to
it.  btrfs does that already otherwise checksums wouldn't work.


This makes delalloc more complex (Wang enhanced dealloc support for vari

Re: RFC: raid with a variable stripe size

2016-11-28 Thread Chris Murphy
On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruo  wrote:
>
>
> At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:
>>
>> Hello,
>>
>> these are only my thoughts; no code here, but I would like to share it
>> hoping that it could be useful.
>>
>> As reported several times by Zygo (and others), one of the problem of
>> raid5/6 is the write hole. Today BTRFS is not capable to address it.
>
>
> I'd say, no need to address yet, since current soft RAID5/6 can't handle it
> yet.
>
> Personally speaking, Btrfs should implementing RAID56 support just like
> Btrfs on mdadm.
> See how badly the current RAID56 works?
>
> The marginally benefit of btrfs RAID56 to scrub data better than tradition
> RAID56 is just a joke in current code base.

Btrfs is subject to the write hole problem on disk, but any read or
scrub that needs to reconstruct from parity that is corrupt results in
a checksum error and EIO. So corruption is not passed up to user
space. Recent versions of md/mdadm support a write journal to avoid
the write hole problem on disk in case of a crash.
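
In other words the read path behaves like this toy model (the checksum
and I/O helpers below are stand-ins, not actual btrfs functions):
repair from parity is attempted when the data block fails its checksum,
but if the rebuilt block fails too -- e.g. because the parity itself is
stale or corrupt -- the caller gets EIO, never garbage:

#include <errno.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Stand-in helpers, assumed to exist for the sketch. */
int read_disk_block(int disk, uint64_t nr, uint8_t *buf);
int csum_matches(const uint8_t *buf, uint32_t expected);

int raid5_read(int ndisks, int data_disk, uint64_t nr,
	       uint32_t expected_csum, uint8_t *out)
{
	uint8_t tmp[BLOCK_SIZE];
	int d, i;

	if (read_disk_block(data_disk, nr, out) == 0 &&
	    csum_matches(out, expected_csum))
		return 0;

	/* Rebuild from the remaining members (data XOR parity). */
	memset(out, 0, BLOCK_SIZE);
	for (d = 0; d < ndisks; d++) {
		if (d == data_disk)
			continue;
		if (read_disk_block(d, nr, tmp))
			return -EIO;
		for (i = 0; i < BLOCK_SIZE; i++)
			out[i] ^= tmp[i];
	}
	return csum_matches(out, expected_csum) ? 0 : -EIO;
}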

>> The problem is that the stripe size is bigger than the "sector size" (ok
>> sector is not the correct word, but I am referring to the basic unit of
>> writing on the disk, which is 4k or 16K in btrfs).
>> So when btrfs writes less data than the stripe, the stripe is not filled;
>> when it is filled by a subsequent write, a RMW of the parity is required.
>>
>> On the best of my understanding (which could be very wrong) ZFS try to
>> solve this issue using a variable length stripe.
>
>
> Did you mean ZFS record size?
> IIRC that's file extent minimum size, and I didn't see how that can handle
> the write hole problem.
>
> Or did ZFS handle the problem?

ZFS isn't subject to the write hole. My understanding is they get
around this because all writes are COW, there is no RMW. But the
variable stripe size means they don't have to do the usual (fixed)
full stripe write for just, for example, a 4KiB change in data for a
single file. Conversely Btrfs does do RMW in such a case.
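
The RMW in question is the classic single-parity update, P' = P xor
D_old xor D_new; the write hole is exactly that D_new and P' then go
to two different devices in two non-atomic writes.  A minimal
illustration (plain C, not btrfs code):

#include <stddef.h>
#include <stdint.h>

/*
 * Sub-stripe RMW as done by md/btrfs RAID5: read old data and old
 * parity, fold the change into the parity, then write the new data
 * block and the new parity block back.  A crash between those two
 * writes leaves parity inconsistent with data -- the write hole.
 * ZFS's variable-width stripes never modify a committed stripe, so
 * this step does not exist there.
 */
static void rmw_update_parity(uint8_t *parity, const uint8_t *old_data,
			      const uint8_t *new_data, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		parity[i] ^= old_data[i] ^ new_data[i];
}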


> Anyway, it should be a low priority thing, and personally speaking,
> any large behavior modification involving  both extent allocator and bg
> allocator will be bug prone.

I tend to agree. I think the non-scalability of Btrfs raid10, which
makes it behave more like raid 0+1, is a higher priority because right
now it's misleading to say the least; and then the longer term goal
for scalable huge file systems is how Btrfs can shed irreparably
damaged parts of the file system (tree pruning) rather than
reconstruction.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: raid with a variable stripe size

2016-11-28 Thread Qu Wenruo



At 11/29/2016 01:51 PM, Chris Murphy wrote:

On Mon, Nov 28, 2016 at 5:48 PM, Qu Wenruo  wrote:



At 11/19/2016 02:15 AM, Goffredo Baroncelli wrote:


Hello,

these are only my thoughts; no code here, but I would like to share it
hoping that it could be useful.

As reported several times by Zygo (and others), one of the problem of
raid5/6 is the write hole. Today BTRFS is not capable to address it.



I'd say, no need to address yet, since current soft RAID5/6 can't handle it
yet.

Personally speaking, Btrfs should implementing RAID56 support just like
Btrfs on mdadm.
See how badly the current RAID56 works?

The marginally benefit of btrfs RAID56 to scrub data better than tradition
RAID56 is just a joke in current code base.


Btrfs is subject to the write hole problem on disk, but any read or
scrub that needs to reconstruct from parity that is corrupt results in
a checksum error and EIO. So corruption is not passed up to user
space. Recent versions of md/mdadm support a write journal to avoid
the write hole problem on disk in case of a crash.


That's interesting.

So I think it's less worthwhile to support RAID56 in btrfs, especially 
considering the stability.


My wildest dream is that btrfs calls device mapper to build a micro 
RAID1/5/6/10 device for each chunk.

That should save us tons of code and bugs.

And for better recovery, enhance device mapper to provide an interface 
to judge which block is correct.


Although that's just a dream anyway.

Thanks,
Qu



The problem is that the stripe size is bigger than the "sector size" (ok
sector is not the correct word, but I am referring to the basic unit of
writing on the disk, which is 4k or 16K in btrfs).
So when btrfs writes less data than the stripe, the stripe is not filled;
when it is filled by a subsequent write, a RMW of the parity is required.

On the best of my understanding (which could be very wrong) ZFS try to
solve this issue using a variable length stripe.



Did you mean ZFS record size?
IIRC that's file extent minimum size, and I didn't see how that can handle
the write hole problem.

Or did ZFS handle the problem?


ZFS isn't subject to the write hole. My understanding is they get
around this because all writes are COW, there is no RMW.
But the
variable stripe size means they don't have to do the usual (fixed)
full stripe write for just, for example a 4KiB change in data for a
single file. Conversely Btrfs does do RMW in such a case.



Anyway, it should be a low priority thing, and personally speaking,
any large behavior modification involving  both extent allocator and bg
allocator will be bug prone.


I tend to agree. I think the non-scalability of Btrfs raid10, which
makes it behave more like raid 0+1, is a higher priority because right
now it's misleading to say the least; and then the longer term goal
for scaleable huge file systems is how Btrfs can shed irreparably
damaged parts of the file system (tree pruning) rather than
reconstruction.






--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Qgroup accounting issue on kdave/for-next branch

2016-11-28 Thread Chandan Rajendra
When executing btrfs/126 test on kdave/for-next branch on a ppc64 guest, I
noticed the following call trace.

[   77.335887] [ cut here ]
[   77.336115] WARNING: CPU: 0 PID: 8325 at 
/root/repos/linux/fs/btrfs/qgroup.c:2443 .btrfs_qgroup_free_refroot+0x188/0x220
[   77.336303] Modules linked in:
[   77.336393] CPU: 0 PID: 8325 Comm: umount Not tainted 
4.9.0-rc5-00062-g6b74e43 #22
[   77.336526] task: c0062b8d4880 task.stack: c0062b9a4000
[   77.336638] NIP: c05cf018 LR: c05ceff8 CTR: c05390c0
[   77.336771] REGS: c0062b9a7450 TRAP: 0700   Not tainted  
(4.9.0-rc5-00062-g6b74e43)
[   77.336908] MSR: 8282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
[   77.337330]   CR: 88000842  XER: 
[   77.337392] CFAR: c05c9b5c
[   77.337443] SOFTE: 1
[   77.337517] GPR00: c05cefcc c0062b9a76d0 c1103a00 c0063e058d40
[   77.337817] GPR04: c0062b9a7740 0008 0005f000 00063eed
[   77.338080] GPR08: c0062f9629c8 0001 f000 c0063e058e80
[   77.338384] GPR12:  cfe0 2000 
[   77.338645] GPR16: 0008 4d5906fc 4d5a6c08 4d54b3d0
[   77.338921] GPR20: 01000bcf8440   c0062523b078
[   77.339178] GPR24: c0062523b080 c0da2b58  c0062e539600
[   77.339485] GPR28: 0006 c0062523b000 c0062e539600 c0062e539688

[   77.339740] NIP [c05cf018] .btrfs_qgroup_free_refroot+0x188/0x220
[   77.339852] LR [c05ceff8] .btrfs_qgroup_free_refroot+0x168/0x220
[   77.339959] Call Trace:
[   77.339998] [c0062b9a76d0] [c05cefcc] 
.btrfs_qgroup_free_refroot+0x13c/0x220
[   77.340123]  (unreliable)
[   77.340193] [c0062b9a7790] [c054210c] 
.commit_fs_roots+0x19c/0x240
[   77.340355] [c0062b9a78a0] [c05451a0] 
.btrfs_commit_transaction.part.5+0x480/0xbe0
[   77.340554] [c0062b9a7970] [c0503bd4] .btrfs_sync_fs+0x74/0x1c0
[   77.340725] [c0062b9a7a10] [c02d6ce0] .sync_filesystem+0xd0/0x100
[   77.340891] [c0062b9a7a90] [c0294f88] 
.generic_shutdown_super+0x38/0x1a0
[   77.341052] [c0062b9a7b10] [c0295508] .kill_anon_super+0x18/0x30
[   77.342243] [c0062b9a7b90] [c0503db8] .btrfs_kill_super+0x18/0xd0
[   77.342411] [c0062b9a7c10] [c02958c8] 
.deactivate_locked_super+0x98/0xe0
[   77.342573] [c0062b9a7c90] [c02be114] .cleanup_mnt+0x54/0xa0
[   77.342723] [c0062b9a7d10] [c00d7e74] .task_work_run+0x124/0x180
[   77.342862] [c0062b9a7db0] [c001c354] .do_notify_resume+0xa4/0xb0
[   77.343030] [c0062b9a7e30] [c000c344] 
.ret_from_except_lite+0x70/0x74
[   77.343187] Instruction dump:
[   77.343276] 3b40 e87d0c78 38810070 4bffab1d 6000 2c23 4182ff40 
e8dfffc8
[   77.343522] eb630008 7d5c3010 7d294910 7d2900d0 <0b09> 7fbc3040 41dd0070 
f95fffc8
[   77.343771] ---[ end trace 8130fb71377d4ff8 ]---
[   77.343857] BTRFS warning (device loop1): qgroup 5 reserved space underflow, 
have: 389120, to free: 393216
[   77.735193] BTRFS info (device loop1): disabling disk space caching
[   77.736629] BTRFS info (device loop1): has skinny extents


The above call trace happens because of the following:

 |+---|
 | Task A | Task B|
 |+---|
 | Write 4k bytes to a file   |   |
 | at offset range [0, 4095]  |   |
 | - __btrfs_buffered_write   |   |
 | -- btrfs_check_data_free_space |   |
 | --- Set EXTENT_QGROUP_RESERVED |   |
 | bit for range [0, 64k-1]   |   |
 | -- Copy 4k data from userspace |   |
 |to page cache   |   |
 | 

Re: [PATCH 1/3] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option (fwd)

2016-11-28 Thread Julia Lawall
It looks like the tree_log_mutex needs to be unlocked at lines 2153 and
2159.

julia

-- Forwarded message --
Date: Tue, 29 Nov 2016 14:16:30 +0800
From: kbuild test robot 
To: kbu...@01.org
Cc: Julia Lawall 
Subject: Re: [PATCH 1/3] btrfs: qgroup: Fix qgroup corruption caused by
inode_cache mount option

Hi Qu,

[auto build test WARNING on next-20161128]
[also build test WARNING on v4.9-rc7]
[cannot apply to btrfs/next v4.9-rc7 v4.9-rc6 v4.9-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-qgroup-Fix-qgroup-corruption-caused-by-inode_cache-mount-option/20161129-133729
:: branch date: 39 minutes ago
:: commit date: 39 minutes ago

>> fs/btrfs/transaction.c:2307:1-7: preceding lock on line 2126

git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout a10797b831381f092c853c616f2eddd856ace99d
vim +2307 fs/btrfs/transaction.c

e02119d5a Chris Mason2008-09-05  2120* At this point in the commit, 
there can't be any tree-log
e02119d5a Chris Mason2008-09-05  2121* writers, but a little lower 
down we drop the trans mutex
e02119d5a Chris Mason2008-09-05  2122* and let new people in.  By 
holding the tree_log_mutex
e02119d5a Chris Mason2008-09-05  2123* from now until after the 
super is written, we avoid races
e02119d5a Chris Mason2008-09-05  2124* with the tree-log code.
e02119d5a Chris Mason2008-09-05  2125*/
e02119d5a Chris Mason2008-09-05 @2126   
mutex_lock(&root->fs_info->tree_log_mutex);
1a40e23b9 Zheng Yan  2008-09-26  2127
5d4f98a28 Yan Zheng  2009-06-10  2128   ret = commit_fs_roots(trans, 
root);
49b25e054 Jeff Mahoney   2012-03-01  2129   if (ret) {
49b25e054 Jeff Mahoney   2012-03-01  2130   
mutex_unlock(&root->fs_info->tree_log_mutex);
871383be5 David Sterba   2012-04-02  2131   
mutex_unlock(&root->fs_info->reloc_mutex);
6cf7f77e6 Wang Shilong   2014-02-19  2132   goto scrub_continue;
49b25e054 Jeff Mahoney   2012-03-01  2133   }
54aa1f4df Chris Mason2007-06-22  2134
3818aea27 Qu Wenruo  2014-01-13  2135   /*
7e1876aca David Sterba   2014-02-05  2136* Since the transaction is 
done, we can apply the pending changes
7e1876aca David Sterba   2014-02-05  2137* before the next transaction.
3818aea27 Qu Wenruo  2014-01-13  2138*/
572d9ab78 David Sterba   2014-02-05  2139   
btrfs_apply_pending_changes(root->fs_info);
3818aea27 Qu Wenruo  2014-01-13  2140
5d4f98a28 Yan Zheng  2009-06-10  2141   /* commit_fs_roots gets rid of 
all the tree log roots, it is now
e02119d5a Chris Mason2008-09-05  2142* safe to free the root of 
tree log roots
e02119d5a Chris Mason2008-09-05  2143*/
e02119d5a Chris Mason2008-09-05  2144   btrfs_free_log_root_tree(trans, 
root->fs_info);
e02119d5a Chris Mason2008-09-05  2145
0ed4792af Qu Wenruo  2015-04-16  2146   /*
a10797b83 Qu Wenruo  2016-11-29  2147* commit_fs_roots() can call 
btrfs_save_ino_cache(), which generates
a10797b83 Qu Wenruo  2016-11-29  2148* new delayed refs. Must 
handle them or qgroup can be wrong.
a10797b83 Qu Wenruo  2016-11-29  2149*/
a10797b83 Qu Wenruo  2016-11-29  2150   ret = 
btrfs_run_delayed_refs(trans, root, (unsigned long)-1);
a10797b83 Qu Wenruo  2016-11-29  2151   if (ret) {
a10797b83 Qu Wenruo  2016-11-29  2152   
mutex_unlock(&root->fs_info->reloc_mutex);
a10797b83 Qu Wenruo  2016-11-29  2153   goto scrub_continue;
a10797b83 Qu Wenruo  2016-11-29  2154   }
a10797b83 Qu Wenruo  2016-11-29  2155
a10797b83 Qu Wenruo  2016-11-29  2156   ret = 
btrfs_qgroup_prepare_account_extents(trans, root->fs_info);
a10797b83 Qu Wenruo  2016-11-29  2157   if (ret) {
a10797b83 Qu Wenruo  2016-11-29  2158   
mutex_unlock(&root->fs_info->reloc_mutex);
a10797b83 Qu Wenruo  2016-11-29  2159   goto scrub_continue;
a10797b83 Qu Wenruo  2016-11-29  2160   }
a10797b83 Qu Wenruo  2016-11-29  2161
a10797b83 Qu Wenruo  2016-11-29  2162   /*
0ed4792af Qu Wenruo  2015-04-16  2163* Since fs roots are all 
committed, we can get a quite accurate
0ed4792af Qu Wenruo  2015-04-16  2164* new_roots. So let's do quota 
accounting.
0ed4792af Qu Wenruo  2015-04-16  2165*/
0ed4792af Qu Wenruo  2015-04-16  2166   ret = 
btrfs_qgroup_account_extents(trans, root->fs_info);
0ed4792af Qu Wenruo  2015-04-16  2167   if (ret < 0) {
0ed4792af Qu Wenruo  2015-04-16  2168   
mutex_unlock(&root->fs_info->tree_log_mutex);
0ed4792af Qu Wenruo  201

Re: [PATCH 1/3] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option (fwd)

2016-11-28 Thread Qu Wenruo
Oh, thanks. I had forgotten that, since the 3rd patch just removes the 
related lines anyway.


Anyway, I should fix it.

Thanks,
Qu

At 11/29/2016 02:46 PM, Julia Lawall wrote:

It looks like the tree_log_mutex needs to be unlocked at lines 2153 and
2159.

julia

-- Forwarded message --
Date: Tue, 29 Nov 2016 14:16:30 +0800
From: kbuild test robot 
To: kbu...@01.org
Cc: Julia Lawall 
Subject: Re: [PATCH 1/3] btrfs: qgroup: Fix qgroup corruption caused by
inode_cache mount option

Hi Qu,

[auto build test WARNING on next-20161128]
[also build test WARNING on v4.9-rc7]
[cannot apply to btrfs/next v4.9-rc7 v4.9-rc6 v4.9-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Qu-Wenruo/btrfs-qgroup-Fix-qgroup-corruption-caused-by-inode_cache-mount-option/20161129-133729
:: branch date: 39 minutes ago
:: commit date: 39 minutes ago


fs/btrfs/transaction.c:2307:1-7: preceding lock on line 2126


git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout a10797b831381f092c853c616f2eddd856ace99d
vim +2307 fs/btrfs/transaction.c

e02119d5a Chris Mason2008-09-05  2120* At this point in the commit, 
there can't be any tree-log
e02119d5a Chris Mason2008-09-05  2121* writers, but a little lower 
down we drop the trans mutex
e02119d5a Chris Mason2008-09-05  2122* and let new people in.  By 
holding the tree_log_mutex
e02119d5a Chris Mason2008-09-05  2123* from now until after the 
super is written, we avoid races
e02119d5a Chris Mason2008-09-05  2124* with the tree-log code.
e02119d5a Chris Mason2008-09-05  2125*/
e02119d5a Chris Mason2008-09-05 @2126   
mutex_lock(&root->fs_info->tree_log_mutex);
1a40e23b9 Zheng Yan  2008-09-26  2127
5d4f98a28 Yan Zheng  2009-06-10  2128   ret = commit_fs_roots(trans, 
root);
49b25e054 Jeff Mahoney   2012-03-01  2129   if (ret) {
49b25e054 Jeff Mahoney   2012-03-01  2130   
mutex_unlock(&root->fs_info->tree_log_mutex);
871383be5 David Sterba   2012-04-02  2131   
mutex_unlock(&root->fs_info->reloc_mutex);
6cf7f77e6 Wang Shilong   2014-02-19  2132   goto scrub_continue;
49b25e054 Jeff Mahoney   2012-03-01  2133   }
54aa1f4df Chris Mason2007-06-22  2134
3818aea27 Qu Wenruo  2014-01-13  2135   /*
7e1876aca David Sterba   2014-02-05  2136* Since the transaction is 
done, we can apply the pending changes
7e1876aca David Sterba   2014-02-05  2137* before the next transaction.
3818aea27 Qu Wenruo  2014-01-13  2138*/
572d9ab78 David Sterba   2014-02-05  2139   
btrfs_apply_pending_changes(root->fs_info);
3818aea27 Qu Wenruo  2014-01-13  2140
5d4f98a28 Yan Zheng  2009-06-10  2141   /* commit_fs_roots gets rid of 
all the tree log roots, it is now
e02119d5a Chris Mason2008-09-05  2142* safe to free the root of 
tree log roots
e02119d5a Chris Mason2008-09-05  2143*/
e02119d5a Chris Mason2008-09-05  2144   btrfs_free_log_root_tree(trans, 
root->fs_info);
e02119d5a Chris Mason2008-09-05  2145
0ed4792af Qu Wenruo  2015-04-16  2146   /*
a10797b83 Qu Wenruo  2016-11-29  2147* commit_fs_roots() can call 
btrfs_save_ino_cache(), which generates
a10797b83 Qu Wenruo  2016-11-29  2148* new delayed refs. Must 
handle them or qgroup can be wrong.
a10797b83 Qu Wenruo  2016-11-29  2149*/
a10797b83 Qu Wenruo  2016-11-29  2150   ret = 
btrfs_run_delayed_refs(trans, root, (unsigned long)-1);
a10797b83 Qu Wenruo  2016-11-29  2151   if (ret) {
a10797b83 Qu Wenruo  2016-11-29  2152   
mutex_unlock(&root->fs_info->reloc_mutex);
a10797b83 Qu Wenruo  2016-11-29  2153   goto scrub_continue;
a10797b83 Qu Wenruo  2016-11-29  2154   }
a10797b83 Qu Wenruo  2016-11-29  2155
a10797b83 Qu Wenruo  2016-11-29  2156   ret = 
btrfs_qgroup_prepare_account_extents(trans, root->fs_info);
a10797b83 Qu Wenruo  2016-11-29  2157   if (ret) {
a10797b83 Qu Wenruo  2016-11-29  2158   
mutex_unlock(&root->fs_info->reloc_mutex);
a10797b83 Qu Wenruo  2016-11-29  2159   goto scrub_continue;
a10797b83 Qu Wenruo  2016-11-29  2160   }
a10797b83 Qu Wenruo  2016-11-29  2161
a10797b83 Qu Wenruo  2016-11-29  2162   /*
0ed4792af Qu Wenruo  2015-04-16  2163* Since fs roots are all 
committed, we can get a quite accurate
0ed4792af Qu Wenruo  2015-04-16  2164* new_roots. So let's do quota 
accounting.
0ed4792af Qu Wenruo  2015-04-16  2165*/
0ed4792af Qu Wenruo  2015-04-16  2166   ret = 
btrfs_qgroup_account_extents(trans, root->fs_info);
0ed4792af Qu Wenruo  2015-04

Re: Qgroup accounting issue on kdave/for-next branch

2016-11-28 Thread Qu Wenruo

Thanks for the detailed reports and reproducer.

I'll investigate it soon.

Thanks,
Qu

At 11/29/2016 02:36 PM, Chandan Rajendra wrote:

When executing btrfs/126 test on kdave/for-next branch on a ppc64 guest, I
noticed the following call trace.

[   77.335887] [ cut here ]
[   77.336115] WARNING: CPU: 0 PID: 8325 at 
/root/repos/linux/fs/btrfs/qgroup.c:2443 .btrfs_qgroup_free_refroot+0x188/0x220
[   77.336303] Modules linked in:
[   77.336393] CPU: 0 PID: 8325 Comm: umount Not tainted 
4.9.0-rc5-00062-g6b74e43 #22
[   77.336526] task: c0062b8d4880 task.stack: c0062b9a4000
[   77.336638] NIP: c05cf018 LR: c05ceff8 CTR: c05390c0
[   77.336771] REGS: c0062b9a7450 TRAP: 0700   Not tainted  
(4.9.0-rc5-00062-g6b74e43)
[   77.336908] MSR: 8282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>
[   77.337330]   CR: 88000842  XER: 
[   77.337392] CFAR: c05c9b5c
[   77.337443] SOFTE: 1
[   77.337517] GPR00: c05cefcc c0062b9a76d0 c1103a00 c0063e058d40
[   77.337817] GPR04: c0062b9a7740 0008 0005f000 00063eed
[   77.338080] GPR08: c0062f9629c8 0001 f000 c0063e058e80
[   77.338384] GPR12:  cfe0 2000 
[   77.338645] GPR16: 0008 4d5906fc 4d5a6c08 4d54b3d0
[   77.338921] GPR20: 01000bcf8440   c0062523b078
[   77.339178] GPR24: c0062523b080 c0da2b58  c0062e539600
[   77.339485] GPR28: 0006 c0062523b000 c0062e539600 c0062e539688

[   77.339740] NIP [c05cf018] .btrfs_qgroup_free_refroot+0x188/0x220
[   77.339852] LR [c05ceff8] .btrfs_qgroup_free_refroot+0x168/0x220
[   77.339959] Call Trace:
[   77.339998] [c0062b9a76d0] [c05cefcc] 
.btrfs_qgroup_free_refroot+0x13c/0x220
[   77.340123]  (unreliable)
[   77.340193] [c0062b9a7790] [c054210c] 
.commit_fs_roots+0x19c/0x240
[   77.340355] [c0062b9a78a0] [c05451a0] 
.btrfs_commit_transaction.part.5+0x480/0xbe0
[   77.340554] [c0062b9a7970] [c0503bd4] .btrfs_sync_fs+0x74/0x1c0
[   77.340725] [c0062b9a7a10] [c02d6ce0] .sync_filesystem+0xd0/0x100
[   77.340891] [c0062b9a7a90] [c0294f88] 
.generic_shutdown_super+0x38/0x1a0
[   77.341052] [c0062b9a7b10] [c0295508] .kill_anon_super+0x18/0x30
[   77.342243] [c0062b9a7b90] [c0503db8] .btrfs_kill_super+0x18/0xd0
[   77.342411] [c0062b9a7c10] [c02958c8] 
.deactivate_locked_super+0x98/0xe0
[   77.342573] [c0062b9a7c90] [c02be114] .cleanup_mnt+0x54/0xa0
[   77.342723] [c0062b9a7d10] [c00d7e74] .task_work_run+0x124/0x180
[   77.342862] [c0062b9a7db0] [c001c354] .do_notify_resume+0xa4/0xb0
[   77.343030] [c0062b9a7e30] [c000c344] 
.ret_from_except_lite+0x70/0x74
[   77.343187] Instruction dump:
[   77.343276] 3b40 e87d0c78 38810070 4bffab1d 6000 2c23 4182ff40 
e8dfffc8
[   77.343522] eb630008 7d5c3010 7d294910 7d2900d0 <0b09> 7fbc3040 41dd0070 
f95fffc8
[   77.343771] ---[ end trace 8130fb71377d4ff8 ]---
[   77.343857] BTRFS warning (device loop1): qgroup 5 reserved space underflow, 
have: 389120, to free: 393216
[   77.735193] BTRFS info (device loop1): disabling disk space caching
[   77.736629] BTRFS info (device loop1): has skinny extents


The above call trace happens because of the following:

 |+---|
 | Task A | Task B|
 |+---|
 | Write 4k bytes to a file   |   |
 | at offset range [0, 4095]  |   |
 | - __btrfs_buffered_write   |   |
 | -- btrfs_check_data_free_space |   |
 | --- Set EXTENT_QGROUP_RESERVED |   |
 | bit for range [0, 64k-1]   |   |
 | -- Copy 4k data from u

[PATCH 00/10] Enhance btrfs qgroup test group coverage to support more mount options

2016-11-28 Thread Qu Wenruo
Old test cases in the btrfs qgroup test group use fixed golden output.
This limits the coverage, since mount options like compress and
inode_cache can pollute the output easily.

On the other hand, "btrfs check" has supported checking qgroup
correctness since at least 3 kernel releases ago, and that function has
proved to be stable and has helped to expose quite a lot of qgroup
bugs.

So this patchset will use "btrfs check --qgroup-report" to replace fixed
golden output, and add extra check for some existing test cases.

This will reduce false alerts and improve the coverage.
In fact, with these patches and "-o inode_cache" mount option, we have
already exposed a new qgroup bug.

Qu Wenruo (10):
  fstests: common: Introduce function to check qgroup correctness
  fstests: btrfs/017: Use new _btrfs_check_scratch_qgroup function
  fstests: btrfs/022: Add extra qgroup verification after each work
  fstests: btrfs/028: Use new wrapped _btrfs_check_scratch_qgroup
function
  fstests: btrfs/042: Add extra qgroup verification
  fstests: btrfs/091: Use _btrfs_check_scratch_qgroup other than fixed  
  golden output
  fstests: btrfs/099: Add extra verification for qgroup
  fstests: btrfs/104: Use _btrfs_check_scratch_qgroup to replace open   
 codes
  fstests: btrfs/122: Use _btrfs_check_scratch_qgroup to replace open   
 code
  fstests: btrfs/123: Use _btrfs_check_scratch_qgroup to replace open   
 code

 common/rc   | 19 +++
 tests/btrfs/017 |  4 ++--
 tests/btrfs/017.out |  2 --
 tests/btrfs/022 |  5 +
 tests/btrfs/028 |  4 ++--
 tests/btrfs/042 |  3 +++
 tests/btrfs/091 |  5 ++---
 tests/btrfs/091.out |  4 
 tests/btrfs/099 |  3 +++
 tests/btrfs/104 | 11 ++-
 tests/btrfs/122 |  9 ++---
 tests/btrfs/123 |  4 +---
 12 files changed, 41 insertions(+), 32 deletions(-)

-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/10] fstests: btrfs/042: Add extra qgroup verification

2016-11-28 Thread Qu Wenruo
Use the newly introduced _btrfs_check_scratch_qgroup() to double-check
that the qgroup numbers are correct.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/042 | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/btrfs/042 b/tests/btrfs/042
index 498ccc9..003b7af 100755
--- a/tests/btrfs/042
+++ b/tests/btrfs/042
@@ -89,6 +89,9 @@ if [ $total_written -gt $LIMIT_SIZE ];then
_fail "total written should be less than $LIMIT_SIZE"
 fi
 
+_scratch_unmount
+_btrfs_check_scratch_qgroup
+
 # success, all done
 echo "Silence is golden"
 status=0
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/10] fstests: btrfs/028: Use new wrapped _btrfs_check_scratch_qgroup function

2016-11-28 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 tests/btrfs/028 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/btrfs/028 b/tests/btrfs/028
index 1425609..c4e99c6 100755
--- a/tests/btrfs/028
+++ b/tests/btrfs/028
@@ -87,8 +87,8 @@ _run_btrfs_util_prog filesystem sync $SCRATCH_MNT
 _scratch_unmount
 
 # generate a qgroup report and look for inconsistent groups
-$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | \
-   grep -E "Counts for qgroup.*are different"
+_btrfs_check_scratch_qgroup
+
 echo "Silence is golden"
 status=0
 
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/10] fstests: btrfs/017: Use new _btrfs_check_scratch_qgroup function

2016-11-28 Thread Qu Wenruo
Now this test case can handle the inode_cache mount option and expose
errors for that mount option.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/017 | 4 ++--
 tests/btrfs/017.out | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/tests/btrfs/017 b/tests/btrfs/017
index 3f409d3..356cef5 100755
--- a/tests/btrfs/017
+++ b/tests/btrfs/017
@@ -86,8 +86,8 @@ rm -fr $SCRATCH_MNT/snap/foo*
 
 sync
 
-units=`_btrfs_qgroup_units`
-$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $AWK_PROG '/[0-9]/ {print $2" "$3}'
+_scratch_unmount
+_btrfs_check_scratch_qgroup
 
 # success, all done
 status=0
diff --git a/tests/btrfs/017.out b/tests/btrfs/017.out
index 503eb88..bbc8eb4 100644
--- a/tests/btrfs/017.out
+++ b/tests/btrfs/017.out
@@ -1,4 +1,2 @@
 QA output created by 017
 Blocks modified: [0 - 1]
-65536 65536
-65536 65536
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/10] fstests: btrfs/104: Use _btrfs_check_scratch_qgroup to replace open codes

2016-11-28 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 tests/btrfs/104 | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/tests/btrfs/104 b/tests/btrfs/104
index 6afaa02..a5d2070 100755
--- a/tests/btrfs/104
+++ b/tests/btrfs/104
@@ -152,14 +152,7 @@ _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
 sleep 45
 
 _scratch_unmount
-
-# generate a qgroup report and look for inconsistent groups
-#  - don't use _run_btrfs_util_prog here as it captures the output and
-#we need to grep it.
-$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | \
-   grep -E -q "Counts for qgroup.*are different"
-if [ $? -ne 0 ]; then
-   status=0
-fi
+_btrfs_check_scratch_qgroup
+status=0
 
 exit
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/10] fstests: btrfs/099: Add extra verification for qgroup

2016-11-28 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 tests/btrfs/099 | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tests/btrfs/099 b/tests/btrfs/099
index 70f07b5..9cc9a3d 100755
--- a/tests/btrfs/099
+++ b/tests/btrfs/099
@@ -82,6 +82,9 @@ sync
 $XFS_IO_PROG -f -c "pwrite -b $BLOCKSIZE 0 $(($FILESIZE - $BLOCKSIZE))" \
$SCRATCH_MNT/foo | _filter_xfs_io
 
+_scratch_unmount
+_btrfs_check_scratch_qgroup
+
 # success, all done
 status=0
 exit
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 03/10] fstests: btrfs/022: Add extra qgroup verification after each work

2016-11-28 Thread Qu Wenruo
The old code uses rescan to verify the numbers.

It never hurts to add extra qgroup verification using btrfsck.

Signed-off-by: Qu Wenruo 
---
 tests/btrfs/022 | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tests/btrfs/022 b/tests/btrfs/022
index 56d4f3d..d1ac921 100755
--- a/tests/btrfs/022
+++ b/tests/btrfs/022
@@ -125,20 +125,25 @@ _scratch_mkfs > /dev/null 2>&1
 _scratch_mount
 _basic_test
 _scratch_unmount
+_btrfs_check_scratch_qgroup
 
 _scratch_mkfs > /dev/null 2>&1
 _scratch_mount
 _rescan_test
 _scratch_unmount
+_btrfs_check_scratch_qgroup
 
 _scratch_mkfs > /dev/null 2>&1
 _scratch_mount
 _limit_test_exceed
 _scratch_unmount
+_btrfs_check_scratch_qgroup
 
 _scratch_mkfs > /dev/null 2>&1
 _scratch_mount
 _limit_test_noexceed
+_scratch_unmount
+_btrfs_check_scratch_qgroup
 
 # success, all done
 echo "Silence is golden"
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/10] fstests: btrfs/122: Use _btrfs_check_scratch_qgroup to replace open code

2016-11-28 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 tests/btrfs/122 | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/tests/btrfs/122 b/tests/btrfs/122
index 82252ab..899ede5 100755
--- a/tests/btrfs/122
+++ b/tests/btrfs/122
@@ -77,12 +77,7 @@ _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT "$SCRATCH_MNT/snaps/snap1"
 _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT "$SCRATCH_MNT/snaps/snap2"
 
 _scratch_unmount
-
-# generate a qgroup report and look for inconsistent groups
-$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | \
-   grep -q -E "Counts for qgroup.*are different"
-if [ $? -ne 0 ]; then
-   status=0
-fi
+_btrfs_check_scratch_qgroup
+status=0
 
 exit
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/10] fstests: common: Introduce function to check qgroup correctness

2016-11-28 Thread Qu Wenruo
Old btrfs qgroup test cases use fixed golden output numbers, which
limits the coverage, since they can't handle mount options like
compress or inode_cache, and causes false alerts.

Introduce the _btrfs_check_scratch_qgroup() function to check qgroup
correctness using the "btrfs check --qgroup-report" function, which
follows the way the kernel handles qgroups and has proved very
reliable.

Signed-off-by: Qu Wenruo 
---
 common/rc | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/common/rc b/common/rc
index 8c99306..35d2d56 100644
--- a/common/rc
+++ b/common/rc
@@ -3018,6 +3018,25 @@ _require_deletable_scratch_dev_pool()
done
 }
 
+# We check if "btrfs check" supports checking qgroup correctness.
+# Old fixed golden output can't cover cases like the compress and
+# inode_cache mount options, which limits the coverage.
+_require_btrfs_check_qgroup()
+{
+   _require_command "$BTRFS_UTIL_PROG" btrfs
+   output=$($BTRFS_UTIL_PROG check --help | grep "qgroup-report")
+   if [ -z "$output" ]; then
+   _notrun "$BTRFS_UTIL_PROG too old (must support 'check --qgroup-report')"
+   fi
+}
+
+_btrfs_check_scratch_qgroup()
+{
+   _require_btrfs_check_qgroup
+   $BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 |\
+   grep -E "Counts for qgroup.*are different"
+}
+
 # We check for btrfs and (optionally) features of the btrfs command
 _require_btrfs()
 {
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 10/10] fstests: btrfs/123: Use _btrfs_check_scratch_qgroup to replace open code

2016-11-28 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 tests/btrfs/123 | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/tests/btrfs/123 b/tests/btrfs/123
index e89d541..36d24c4 100755
--- a/tests/btrfs/123
+++ b/tests/btrfs/123
@@ -76,9 +76,7 @@ _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
 _run_btrfs_util_prog balance start -d $SCRATCH_MNT
 
 _scratch_unmount
-# generate a qgroup report and look for inconsistent groups
-$BTRFS_UTIL_PROG check --qgroup-report $SCRATCH_DEV 2>&1 | \
-   grep -E "Counts for qgroup.*are different"
+_btrfs_check_scratch_qgroup
 
 # success, all done
 status=0
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 06/10] fstests: btrfs/091: Use _btrfs_check_scratch_qgroup other than fixed golden output

2016-11-28 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 tests/btrfs/091 | 5 ++---
 tests/btrfs/091.out | 4 
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/tests/btrfs/091 b/tests/btrfs/091
index e3c43c7..ab4ff4e 100755
--- a/tests/btrfs/091
+++ b/tests/btrfs/091
@@ -94,9 +94,8 @@ cp --reflink $SCRATCH_MNT/subv1/file1 $SCRATCH_MNT/subv3/file1
 # in commit tree. So need to sync to update the qgroup commit tree.
 sync
 
-units=`_btrfs_qgroup_units`
-$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' | \
-   $AWK_PROG '{print $2" "$3}'
+_scratch_unmount
+_btrfs_check_scratch_qgroup
 
 # success, all done
 status=0
diff --git a/tests/btrfs/091.out b/tests/btrfs/091.out
index 68fe2df..f2981b5 100644
--- a/tests/btrfs/091.out
+++ b/tests/btrfs/091.out
@@ -1,7 +1,3 @@
 QA output created by 091
 wrote 262144/262144 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-65536 65536
-327680 65536
-327680 65536
-327680 65536
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: raid56: Use correct stolen pages to calculate P/Q

2016-11-28 Thread Adam Borowski
On Tue, Nov 29, 2016 at 02:52:47AM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2016-11-28 at 16:48 -0500, Zygo Blaxell wrote:
> > If a drive's embedded controller RAM fails, you get corruption on the
> > majority of reads from a single disk, and most writes will be corrupted
> > (even if they were not before).
> 
> Administrating a multi-PiB Tier-2 for the LHC Computing Grid with quite
> a number of disks for nearly 10 years now, I'd have never stumbled on
> such a case of breakage so far...
> 
> Actually most cases are as simple as HDD fails to work and this is
> properly signalled to the controller.

I administer no real storage at this time, and got only 16 disks (plus a few
disk-likes) to my name right now.  Yet in a ~2 months span I've seen three
cases of silent data corruption:

* a RasPi I used for DNS recursor/DHCP/aiccu started mangling some writes,
  with no notification that something is amiss.  With ext4 being a
  silentdatalossfs, there was no clue it was a disk (ok, SD) problem at all,
  making it really "fun" to debug.  Happens on multiple SD cards, thus it's
  the machine that's at fault.

* a HDD had some link resets and silent data corruption, diagnosed to a bad
  SATA cable, the disk works fine since (obviously after extensive tests).

* a HDD that has link resets and silent data corruption (apparently
  write-time only(?)), Marduk knows why.  Happens with multiple cables and
  two machines, putting the blame somewhere on the disk.

Thus, assumption that the controller will be notified about read errors is
quite invalid.  In the above cases, if recovery was possible it'd be
beneficial to rewrite a good copy of the data.


Meow!
-- 
The bill declaring Jesus as the King of Poland fails to specify whether
the addition is at the top or end of the list of kings.  What should the
historians do?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/3] btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount option

2016-11-28 Thread Qu Wenruo
[BUG]
The easiest way to reproduce the bug is:
--
 # mkfs.btrfs -f $dev -n 16K
 # mount $dev $mnt -o inode_cache
 # btrfs quota enable $mnt
 # btrfs quota rescan -w $mnt
 # btrfs qgroup show $mnt
qgroupid         rfer         excl
--------         ----         ----
0/5          32.00KiB     32.00KiB
                 ^^ Twice the correct value
--

The fstests btrfs qgroup test group can easily detect this, although
some failures are false alerts: old test cases use a fixed golden
output, while new test cases use "btrfs check" to detect qgroup
mismatches.

[CAUSE]
The inode_cache mount option makes commit_fs_roots() call
btrfs_save_ino_cache() to update the fs/subvol trees, which generates
new delayed refs.

However, btrfs_qgroup_prepare_account_extents() is called too early,
before commit_fs_roots(), so the "old_roots" for these newly generated
extents are always NULL.
In the extent-freeing case this leaves both new_roots and old_roots
empty, while the correct old_roots should not be empty.
This causes the qgroup numbers to not be decreased correctly.
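
For illustration only, here is a userspace toy model of how the rfer
counter moves with the old_roots/new_roots node counts (an editor's
sketch, not the kernel code); when both are empty, the free path never
decreases anything:

--
 #include <stdint.h>
 #include <stdio.h>

 /* nr_old/nr_new stand in for old_roots->nnodes / new_roots->nnodes;
  * only the rfer movement for a single 16KiB extent is modelled. */
 static void account(uint64_t *rfer, uint64_t bytes,
                     unsigned int nr_old, unsigned int nr_new)
 {
     if (nr_old == 0 && nr_new > 0)
         *rfer += bytes;         /* extent became referenced */
     else if (nr_old > 0 && nr_new == 0)
         *rfer -= bytes;         /* extent was freed */
     /* nr_old == nr_new == 0: no change -- the buggy free case */
 }

 int main(void)
 {
     uint64_t rfer;

     /* Correct free: old_roots resolved after commit_fs_roots(). */
     rfer = 16384;
     account(&rfer, 16384, 1, 0);
     printf("correct free: rfer = %lu\n", (unsigned long)rfer);

     /* Buggy free: old_roots captured too early, so it is empty
      * and the counter is never decreased. */
     rfer = 16384;
     account(&rfer, 16384, 0, 0);
     printf("buggy free:   rfer = %lu\n", (unsigned long)rfer);
     return 0;
 }
--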

[FIX]
Modify the timing so that btrfs_qgroup_prepare_account_extents() is
called just before btrfs_qgroup_account_extents(), and add the needed
delayed refs handling.
With this, qgroup handles the inode_cache mount option correctly.

Signed-off-by: Qu Wenruo 
---
v2:
  Fix a missing mutex_unlock() at error routine.
---
 fs/btrfs/transaction.c | 25 ++---
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 9517de0..11e8f62 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2102,13 +2102,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
goto scrub_continue;
}
 
-   /* Reocrd old roots for later qgroup accounting */
-   ret = btrfs_qgroup_prepare_account_extents(trans, root->fs_info);
-   if (ret) {
-   mutex_unlock(&root->fs_info->reloc_mutex);
-   goto scrub_continue;
-   }
-
/*
 * make sure none of the code above managed to slip in a
 * delayed item
@@ -2151,6 +2144,24 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
btrfs_free_log_root_tree(trans, root->fs_info);
 
/*
+* commit_fs_roots() can call btrfs_save_ino_cache(), which generates
+* new delayed refs. Must handle them or qgroup can be wrong.
+*/
+   ret = btrfs_run_delayed_refs(trans, root, (unsigned long)-1);
+   if (ret) {
+   mutex_unlock(&root->fs_info->tree_log_mutex);
+   mutex_unlock(&root->fs_info->reloc_mutex);
+   goto scrub_continue;
+   }
+
+   ret = btrfs_qgroup_prepare_account_extents(trans, root->fs_info);
+   if (ret) {
+   mutex_unlock(&root->fs_info->tree_log_mutex);
+   mutex_unlock(&root->fs_info->reloc_mutex);
+   goto scrub_continue;
+   }
+
+   /*
 * Since fs roots are all committed, we can get a quite accurate
 * new_roots. So let's do quota accounting.
 */
-- 
2.7.4





[PATCH v2 2/3] btrfs: qgroup: Add quick exit for non-fs extents

2016-11-28 Thread Qu Wenruo
Modify btrfs_qgroup_account_extent() to exit more quickly for
non-fs extents.

This will also reduce the noise in trace_btrfs_qgroup_account_extent
event.
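
For reference, a minimal userspace rendering of the is_fstree() check
that maybe_fs_roots() builds on; the objectid constants follow the btrfs
on-disk format, but the snippet itself is an editor's sketch, not a
quote of the kernel header:

--
 #include <stdbool.h>
 #include <stdint.h>
 #include <stdio.h>

 /* Objectid values from the btrfs on-disk format. */
 #define BTRFS_EXTENT_TREE_OBJECTID 2ULL
 #define BTRFS_FS_TREE_OBJECTID     5ULL
 #define BTRFS_FIRST_FREE_OBJECTID  256ULL

 /* Only the top-level fs tree and subvolume trees can own file
  * extents; the signed compare rejects special high objectids
  * such as the tree log. */
 static bool is_fstree(uint64_t rootid)
 {
     return rootid == BTRFS_FS_TREE_OBJECTID ||
            (int64_t)rootid >= (int64_t)BTRFS_FIRST_FREE_OBJECTID;
 }

 int main(void)
 {
     printf("fs tree (5):     %d\n", is_fstree(BTRFS_FS_TREE_OBJECTID));
     printf("extent tree (2): %d\n", is_fstree(BTRFS_EXTENT_TREE_OBJECTID));
     printf("subvol (257):    %d\n", is_fstree(257));
     return 0;
 }
--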

Signed-off-by: Qu Wenruo 
---
v2:
  None
---
 fs/btrfs/qgroup.c | 41 +++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index e97f304..b9a2fd1 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1904,6 +1904,33 @@ static int qgroup_update_counters(struct btrfs_fs_info *fs_info,
return 0;
 }
 
+/*
+ * Helper to check if the @roots is a list of fs tree roots
+ * Return 0 for definitely not a fs/subvol tree roots ulist
+ * Return 1 for possible fs/subvol tree roots ulist(including empty)
+ */
+static int maybe_fs_roots(struct ulist *roots)
+{
+   struct ulist_node *unode;
+   struct ulist_iterator uiter;
+
+   /* Empty one, still possible for fs roots */
+   if (!roots || roots->nnodes == 0)
+   return 1;
+
+   ULIST_ITER_INIT(&uiter);
+   unode = ulist_next(roots, &uiter);
+   if (!unode)
+   return 1;
+
+   /*
+* If it contains fs tree roots, then it must belong to fs/subvol
+* trees.
+* If it contains a non-fs tree, it won't be shared with fs/subvol trees.
+*/
+   return is_fstree(unode->val);
+}
+
 int
 btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info,
@@ -1917,10 +1944,20 @@ btrfs_qgroup_account_extent(struct btrfs_trans_handle *trans,
u64 nr_old_roots = 0;
int ret = 0;
 
-   if (new_roots)
+   if (new_roots) {
+   if (!maybe_fs_roots(new_roots))
+   goto out_free;
nr_new_roots = new_roots->nnodes;
-   if (old_roots)
+   }
+   if (old_roots) {
+   if (!maybe_fs_roots(old_roots))
+   goto out_free;
nr_old_roots = old_roots->nnodes;
+   }
+
+   /* Quick exit, either not fs tree roots, or won't affect any qgroup */
+   if (nr_old_roots == 0 && nr_new_roots == 0)
+   goto out_free;
 
if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
goto out_free;
-- 
2.7.4





[PATCH v2 0/3] Qgroup and inode_cache fix, with small cleanups

2016-11-28 Thread Qu Wenruo
This patchset fixes a qgroup bug when using with inode_cache mount option.

The bug reminds us that the design of a separate
btrfs_qgroup_prepare_account_extents() is bad practice, especially since
the safest timing is to call it just before btrfs_qgroup_account_extents().

So the patchset also cleans up btrfs_qgroup_prepare_account_extents() to
avoid further similar bugs.

It also adds a quick exit for non-fs extents in qgroup, mainly to reduce
the noise of qgroup trace events.

v2:
  Fix the missing mutex_unlock() at error path of 1st patch.

Qu Wenruo (3):
  btrfs: qgroup: Fix qgroup corruption caused by inode_cache mount
option
  btrfs: qgroup: Add quick exit for non-fs extents
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function

 fs/btrfs/qgroup.c  | 82 +-
 fs/btrfs/qgroup.h  |  2 --
 fs/btrfs/transaction.c | 21 +++--
 3 files changed, 59 insertions(+), 46 deletions(-)

-- 
2.7.4





[PATCH v2 3/3] btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function

2016-11-28 Thread Qu Wenruo
Quite a lot of qgroup corruption happens due to the wrong timing of
calling btrfs_qgroup_prepare_account_extents().

Since the safest timing is to call it just before
btrfs_qgroup_account_extents(), there is no need to keep these two
functions separate.

Merging them makes the code cleaner and less bug-prone.

Signed-off-by: Qu Wenruo 
---
v2:
  None
---
 fs/btrfs/qgroup.c  | 41 +
 fs/btrfs/qgroup.h  |  2 --
 fs/btrfs/transaction.c | 10 --
 3 files changed, 9 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b9a2fd1..00e3c16 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1419,37 +1419,6 @@ int btrfs_limit_qgroup(struct btrfs_trans_handle *trans,
return ret;
 }
 
-int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info)
-{
-   struct btrfs_qgroup_extent_record *record;
-   struct btrfs_delayed_ref_root *delayed_refs;
-   struct rb_node *node;
-   u64 qgroup_to_skip;
-   int ret = 0;
-
-   delayed_refs = &trans->transaction->delayed_refs;
-   qgroup_to_skip = delayed_refs->qgroup_to_skip;
-
-   /*
-* No need to do lock, since this function will only be called in
-* btrfs_commit_transaction().
-*/
-   node = rb_first(&delayed_refs->dirty_extent_root);
-   while (node) {
-   record = rb_entry(node, struct btrfs_qgroup_extent_record,
- node);
-   ret = btrfs_find_all_roots(NULL, fs_info, record->bytenr, 0,
-  &record->old_roots);
-   if (ret < 0)
-   break;
-   if (qgroup_to_skip)
-   ulist_del(record->old_roots, qgroup_to_skip, 0);
-   node = rb_next(node);
-   }
-   return ret;
-}
-
 int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
struct btrfs_delayed_ref_root *delayed_refs,
struct btrfs_qgroup_extent_record *record)
@@ -2038,6 +2007,11 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans,
trace_btrfs_qgroup_account_extents(fs_info, record);
 
if (!ret) {
+   /* Search commit root to find old_roots */
+   ret = btrfs_find_all_roots(NULL, fs_info,
+   record->bytenr, 0, &record->old_roots);
+   if (ret < 0)
+   goto cleanup;
/*
 * Use (u64)-1 as time_seq to do special search, which
 * doesn't lock tree or delayed_refs and search current
@@ -2047,8 +2021,11 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans,
record->bytenr, (u64)-1, &new_roots);
if (ret < 0)
goto cleanup;
-   if (qgroup_to_skip)
+   if (qgroup_to_skip) {
ulist_del(new_roots, qgroup_to_skip, 0);
+   ulist_del(record->old_roots, qgroup_to_skip,
+ 0);
+   }
ret = btrfs_qgroup_account_extent(trans, fs_info,
record->bytenr, record->num_bytes,
record->old_roots, new_roots);
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 99c879d..2c7f701 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -90,8 +90,6 @@ int btrfs_limit_qgroup(struct btrfs_trans_handle *trans,
 int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
 void btrfs_free_qgroup_config(struct btrfs_fs_info *fs_info);
 struct btrfs_delayed_extent_op;
-int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans,
-struct btrfs_fs_info *fs_info);
 /*
  * Inform qgroup to trace one dirty extent, its info is recorded in @record.
  * So qgroup can account it at commit trans time.
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 11e8f62..37adbf8 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1346,9 +1346,6 @@ static int qgroup_account_snapshot(struct btrfs_trans_handle *trans,
ret = commit_fs_roots(trans, src);
if (ret)
goto out;
-   ret = btrfs_qgroup_prepare_account_extents(trans, fs_info);
-   if (ret < 0)
-   goto out;
ret = btrfs_qgroup_account_extents(trans, fs_info);
if (ret < 0)
goto out;
@@ -2154,13 +2151,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
goto scrub_continue;
}
 
-   ret = btrfs_qgroup_prepare_account_extents(

Re: Qgroup accounting issue on kdave/for-next branch

2016-11-28 Thread Qu Wenruo

At 11/29/2016 02:36 PM, Chandan Rajendra wrote:

When executing the btrfs/126 test on the kdave/for-next branch on a ppc64
guest, I noticed the following call trace.

[   77.335887] ------------[ cut here ]------------
[   77.336115] WARNING: CPU: 0 PID: 8325 at /root/repos/linux/fs/btrfs/qgroup.c:2443 .btrfs_qgroup_free_refroot+0x188/0x220
[   77.336303] Modules linked in:
[   77.336393] CPU: 0 PID: 8325 Comm: umount Not tainted 4.9.0-rc5-00062-g6b74e43 #22
[   77.336526] task: c0062b8d4880 task.stack: c0062b9a4000
[   77.336638] NIP: c05cf018 LR: c05ceff8 CTR: c05390c0
[   77.336771] REGS: c0062b9a7450 TRAP: 0700   Not tainted  (4.9.0-rc5-00062-g6b74e43)
[   77.336908] MSR: 8282b032 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI>  CR: 88000842  XER: 
[   77.337392] CFAR: c05c9b5c SOFTE: 1
[   77.337477] GPR00: c05cefcc c0062b9a76d0 c1103a00 c0063e058d40
[   77.337765] GPR04: c0062b9a7740 0008 0005f000 00063eed
[   77.338037] GPR08: c0062f9629c8 0001 f000 c0063e058e80
[   77.338300] GPR12:  cfe0 2000 
[   77.338605] GPR16: 0008 4d5906fc 4d5a6c08 4d54b3d0
[   77.338866] GPR20: 01000bcf8440   c0062523b078
[   77.339138] GPR24: c0062523b080 c0da2b58  c0062e539600
[   77.339398] GPR28: 0006 c0062523b000 c0062e539600 c0062e539688

[   77.339740] NIP [c05cf018] .btrfs_qgroup_free_refroot+0x188/0x220
[   77.339852] LR [c05ceff8] .btrfs_qgroup_free_refroot+0x168/0x220
[   77.339959] Call Trace:
[   77.339998] [c0062b9a76d0] [c05cefcc] .btrfs_qgroup_free_refroot+0x13c/0x220 (unreliable)
[   77.340193] [c0062b9a7790] [c054210c] .commit_fs_roots+0x19c/0x240
[   77.340355] [c0062b9a78a0] [c05451a0] .btrfs_commit_transaction.part.5+0x480/0xbe0
[   77.340554] [c0062b9a7970] [c0503bd4] .btrfs_sync_fs+0x74/0x1c0
[   77.340725] [c0062b9a7a10] [c02d6ce0] .sync_filesystem+0xd0/0x100
[   77.340891] [c0062b9a7a90] [c0294f88] .generic_shutdown_super+0x38/0x1a0
[   77.341052] [c0062b9a7b10] [c0295508] .kill_anon_super+0x18/0x30
[   77.342243] [c0062b9a7b90] [c0503db8] .btrfs_kill_super+0x18/0xd0
[   77.342411] [c0062b9a7c10] [c02958c8] .deactivate_locked_super+0x98/0xe0
[   77.342573] [c0062b9a7c90] [c02be114] .cleanup_mnt+0x54/0xa0
[   77.342723] [c0062b9a7d10] [c00d7e74] .task_work_run+0x124/0x180
[   77.342862] [c0062b9a7db0] [c001c354] .do_notify_resume+0xa4/0xb0
[   77.343030] [c0062b9a7e30] [c000c344] .ret_from_except_lite+0x70/0x74
[   77.343187] Instruction dump:
[   77.343276] 3b40 e87d0c78 38810070 4bffab1d 6000 2c23 4182ff40 e8dfffc8
[   77.343522] eb630008 7d5c3010 7d294910 7d2900d0 <0b09> 7fbc3040 41dd0070 f95fffc8
[   77.343771] ---[ end trace 8130fb71377d4ff8 ]---
[   77.343857] BTRFS warning (device loop1): qgroup 5 reserved space underflow, have: 389120, to free: 393216
[   77.735193] BTRFS info (device loop1): disabling disk space caching
[   77.736629] BTRFS info (device loop1): has skinny extents


The above call trace happens because of the following:

 |--------------------------------+---------------------------|
 | Task A                         | Task B                    |
 |--------------------------------+---------------------------|
 | Write 4k bytes to a file       |                           |
 | at offset range [0, 4095]      |                           |
 | - __btrfs_buffered_write       |                           |
 | -- btrfs_check_data_free_space |                           |
 | --- Set EXTENT_QGROUP_RESERVED |                           |
 |     bit for range [0, 64k-1]   |                           |
 | -- Copy 4k data from userspace |                           |
 |    to page cache               |