[PATCH] btrfs-progs: Fix cross-compile error for mktables

2017-08-14 Thread Qu Wenruo
When cross compiling btrfs-progs, the following error prevents
btrfs-progs from being compiled:

[CC] mktables
[TABLE]  kernel-lib/tables.c
/bin/sh: ./mktables: cannot execute binary file: Exec format error
make: *** No rule to make target 'kernel-lib/tables.c', needed by 
'kernel-lib/tables.o'.  Stop.

"mktables" should only be executed in the host environment, while @CC
set by autoconf follows the host/build/target settings, causing mktables
to be cross-compiled.

The fix is to introduce a new @HOSTCC for mktables, which is not
affected by the host/build/target settings.

Reported-by: Hallo32 
Suggested-by: David Sterba 
Signed-off-by: Qu Wenruo 
---
Tested with AArch64 cross-toolchain created by buildroot.
---
 Makefile| 2 +-
 Makefile.inc.in | 1 +
 configure.ac| 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index b3e2b636..0395e37f 100644
--- a/Makefile
+++ b/Makefile
@@ -323,7 +323,7 @@ version.h: version.sh version.h.in configure.ac
 
 mktables: kernel-lib/mktables.c
@echo "[CC] $@"
-   $(Q)$(CC) $(CFLAGS) $< -o $@
+   $(Q)$(HOSTCC) $(CFLAGS) $< -o $@
 
 kernel-lib/tables.c: mktables
@echo "[TABLE]  $@"
diff --git a/Makefile.inc.in b/Makefile.inc.in
index 4e1b68cb..308acca3 100644
--- a/Makefile.inc.in
+++ b/Makefile.inc.in
@@ -4,6 +4,7 @@
 export
 
 CC = @CC@
+HOSTCC = @HOSTCC@
 LN_S = @LN_S@
 AR = @AR@
 RM = @RM@
diff --git a/configure.ac b/configure.ac
index 30055f85..f6051ebd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -26,6 +26,7 @@ AC_CONFIG_SRCDIR([btrfs.c])
 AC_PREFIX_DEFAULT([/usr/local])
 
 AC_PROG_CC
+AC_PATH_PROGS([HOSTCC], [gcc clang])
 AC_CANONICAL_HOST
 AC_C_CONST
 AC_C_VOLATILE
-- 
2.14.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RESEND PATCH] btrfs-progs: Specify C standard to gnu90 explicitly

2017-08-14 Thread Qu Wenruo
Different C compilers have different default language standards.
This sometimes causes problems on different systems.

For distributions like CentOS/RHEL 7, gcc is still 4.8 and will report
an error for declarations that are not valid c90, while most developers
are using a newer gcc which will just accept them.
This makes such language standard problems hard to detect.

This patch specifies the standard as gnu90 explicitly to avoid such
problems. gnu90 is a good mix of c90 while providing a lot of useful GNU
extensions, and is supported by all modern gcc and clang.

Reported-by: Marco Lorenzo Crociani 
Signed-off-by: Qu Wenruo 
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index 81598df1..cb1b3c5f 100644
--- a/Makefile
+++ b/Makefile
@@ -58,6 +58,7 @@ TOPDIR := $(shell pwd)
 
 # Common build flags
 CFLAGS = $(SUBST_CFLAGS) \
+-std=gnu90 \
 -include config.h \
 -DBTRFS_FLAT_INCLUDES \
 -D_XOPEN_SOURCE=700  \
-- 
2.13.2



Re: btrfs fi du -s gives Inappropriate ioctl for device

2017-08-14 Thread Chris Murphy
On Mon, Aug 14, 2017 at 4:57 PM, Piotr Szymaniak  wrote:

>
> and... some issues:
> ~ # btrfs fi du -s /mnt/red/\@backup/
>  Total   Exclusive  Set shared  Filename
> ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for device


It's a bug, but I don't know if any devs are working on a fix yet.

The problem is that the subvolume being snapshotted contains subvolumes.
The resulting snapshot contains an empty directory in place of the
nested subvolume(s), and that is the cause of the error.


Other messages on this:
https://www.spinics.net/lists/linux-btrfs/msg67330.html
https://www.spinics.net/lists/linux-btrfs/msg61883.html



-- 
Chris Murphy


Re: btrfs-progs-v4.12: cross compiling

2017-08-14 Thread Qu Wenruo



On 2017-08-15 02:57, Goffredo Baroncelli wrote:
> On 08/14/2017 05:10 PM, David Sterba wrote:
>> On Mon, Aug 14, 2017 at 10:14:42PM +0800, Qu Wenruo wrote:
> [...]
>> mktables.c is synced from kernel sources, taking updates from there is
>> easier than porting any changes to the proposed scripted implementation.
>>
>> The workflow is simple:
>> - copy kernel mktables.c changes to btrfs-progs mktables.c
>
> How does the kernel deal with this kind of problem?
> Looking at the source of the btrfs Makefile, it is simpler to replace
>
>    mktables: kernel-lib/mktables.c
>          @echo "[CC] $@"
>          $(Q)$(CC) $(CFLAGS) $< -o $@
>
> with
>
>    mktables: kernel-lib/mktables.c
>          @echo "[HOSTCC] $@"
>          $(Q)$(HOSTCC) $(CFLAGS) $< -o $@
>
> where HOSTCC is defined as
>
> HOSTCC=gcc
>
> (maybe the same applies also to CFLAGS <-> HOSTCFLAGS?)

If using HOSTCC then I'm fine with it.

Thanks,
Qu

>> - compile mktables
>> - run 'make kernel-lib/tables.c'
>> - commit the changes to git
>>
>> All of that happens very rarely, if ever; the raid56 tables and
>> correction algorithms are set in stone. Any extensions need to be done
>> on both sides, kernel/userspace.
>>
>>>>> What about using a script to generate it?
>>>>
>>>> We do have the mktables utility to generate it and I'll regenerate it
>>>> should there be a change to kernel-lib/mktables.c
>>>
>>> I mean to replace mktables.c with a script.
>>> So no cross compiler problems at all, and an even easier Makefile.
>>> No dependence on the "mktables" program.
>>
>> Somebody has to implement the script and verify that the output is the
>> same, eventually sync changes. The cross-compilation should be fixed
>> with the pregenerated tables.c. Is Makefile size really a concern? The
>> number of related lines is like 7. I don't see any benefit in what you
>> propose and hopefully explained my viewpoint enough so I don't have to
>> continue.


btrfs fi du -s gives Inappropriate ioctl for device

2017-08-14 Thread Piotr Szymaniak
Hi list,

I have a weird issue.

So, I have a btrfs fs and some subvols, like this:
~ # btrfs sub list /mnt/red/
ID 260 gen 827956 top level 5 path @home
ID 261 gen 827926 top level 5 path @backup
ID 645 gen 827911 top level 5 path @svn
*snip*

and some snapshots on those subvols:
*snip*
ID 2501 gen 827894 top level 260 path @home/.snapshot/monthly_2017-05-01_05:30:01
ID 2603 gen 827894 top level 260 path @home/.snapshot/monthly_2017-06-01_05:30:01
ID 2604 gen 827927 top level 261 path @backup/.snapshot/monthly_2017-06-01_05:30:02
ID 2680 gen 827895 top level 260 path @home/.snapshot/monthly_2017-07-01_05:30:01
ID 2681 gen 827927 top level 261 path @backup/.snapshot/monthly_2017-07-01_05:30:01
ID 2734 gen 827895 top level 260 path @home/.snapshot/weekly_2017-07-22_04:20:01
ID 2735 gen 827905 top level 645 path @svn/.snapshot/weekly_2017-07-22_04:20:02
ID 2754 gen 827895 top level 260 path @home/.snapshot/weekly_2017-07-29_04:20:01
ID 2755 gen 827906 top level 645 path @svn/.snapshot/weekly_2017-07-29_04:20:01
ID 2763 gen 827896 top level 260 path @home/.snapshot/monthly_2017-08-01_05:30:01
*snip*

and... some issues:
~ # btrfs fi du -s /mnt/red/\@backup/
 Total   Exclusive  Set shared  Filename
ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for device

and it works for other subvols:
~ # btrfs fi du -s /mnt/red/\@svn/
 Total   Exclusive  Set shared  Filename
  52.22GiB10.59MiB 4.13GiB  /mnt/red/@svn/

~ # btrfs fi du -s /mnt/red/\@home/
 Total   Exclusive  Set shared  Filename
   2.08TiB 4.54GiB   124.51GiB  /mnt/red/@home/

on the other hand, those values also look wrong (2.08TiB?):
~ # btrfs fi df /mnt/red/\@home/
Data, RAID1: total=448.00GiB, used=447.39GiB
System, RAID1: total=32.00MiB, used=80.00KiB
Metadata, RAID1: total=3.00GiB, used=2.02GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

~ # btrfs fi usage /mnt/red/\@home/
Overall:
Device size:   1.36TiB
Device allocated:902.06GiB
Device unallocated:  493.21GiB
Device missing:  0.00B
Used:898.81GiB
Free (estimated):247.22GiB  (min: 247.22GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,RAID1: Size:448.00GiB, Used:447.39GiB
   /dev/mapper/wd0   448.00GiB
   /dev/mapper/wd1   448.00GiB

Metadata,RAID1: Size:3.00GiB, Used:2.02GiB
   /dev/mapper/wd0 3.00GiB
   /dev/mapper/wd1 3.00GiB

System,RAID1: Size:32.00MiB, Used:80.00KiB
   /dev/mapper/wd032.00MiB
   /dev/mapper/wd132.00MiB

Unallocated:
   /dev/mapper/wd0   246.60GiB
   /dev/mapper/wd1   246.60GiB

~ # uname -sr
Linux 4.4.76

~ # btrfs --version
btrfs-progs v4.12

What's wrong with one of my subvols? What's wrong with the space reported by fi du?


Best regards,
Piotr Szymaniak.




Re: [RFC] Checksum of the parity

2017-08-14 Thread Chris Murphy
On Mon, Aug 14, 2017 at 2:18 PM, Goffredo Baroncelli  wrote:

>> Anyway, I do wish I read the code better, so I knew exactly where, if
>> at all, the RMW code was happening on disk rather than just in memory.
>> There very clearly is RMW in memory code as a performance optimizer,
>> before a stripe gets written out it's possible to RMW it to add in
>> more changes or new files, that way raid56 isn't dog slow CoW'ing
>> literally a handful of 16KiB leaves each time, that then translate
>> into a minimum of 384K of writes.
>
> In case of a full stripe write, there is no RMW cycle, so no "write hole".

That is conflating disk writes and in-memory RMW. They have to be
separated. For sure there's in-memory RMW + CoW of the entire stripe
to disk for a tiny (1 byte) change to a file, because I've seen it.
What I don't know, and can't tell from the code, is if there's ever
such a thing as partial stripe RMW (an overwrite of just a data
strip and a corresponding overwrite for parity). Any overwrite is
where the write hole comes into play.

But inferring from the work Liu is apparently doing on a journal, it
must be true that there is such a thing as overwrites with Btrfs
raid56.



>
> Just out of curiosity, what is "minimum of 384k"? In a 3-disk raid5 case,
> the minimum data is 64k * 2 (+ 64k of parity).

Bad addition on my part.


-- 
Chris Murphy


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Goffredo Baroncelli
On 08/14/2017 09:08 PM, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 8:23 AM, Goffredo Baroncelli  
> wrote:
> 
>> From a theoretical point of view, if you have a "PURE" COW file-system,
>> you don't need a journal. Unfortunately a RAID5/6 stripe update is a RMW
>> cycle, so you need a journal to keep it in sync. The same is true for
>> NOCOW files (and their checksums)
>>
> 
> I'm pretty sure the raid56 rmw is in memory only, I don't think we
> have a case where a stripe is getting partial writes (a block in a
> stripe is being overwritten). Partial stripe updates with rmw *on
> disk* would mean Btrfs raid56 is not CoW.
> 

I am not sure about that. Consider the following cases:
- what if we have to write less than a stripe?
- suppose we remove a file with a length of 4k. If you don't allow a RMW
cycle, this means the space would be lost forever.

Pay attention that the size of a stripe (theoretically) could be quite
big: suppose you have an (insanely large) raid composed of 20 disks; the
stripe would be about 20*64k = ~1.2MB.

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC] Checksum of the parity

2017-08-14 Thread Goffredo Baroncelli
On 08/14/2017 09:28 PM, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli  
> wrote:
>> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>>> [2]
>>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>>> not sure, sadly I don't read the code well enough. But if all Btrfs
>>> raid56 writes are full stripe CoW writes, and if the prescribed order
>>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>>> superblock update, then I don't see how the write hole happens. Write
>>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>>> and a crash during the modification of the stripe making that stripe
>>> inconsistent as well as still pointed to by metadata.
>>
>>
>> RAID5 is *single* failure proof. And in order to have the write hole
>> bug we need two failures:
>> 1) a transaction is aborted (e.g. due to a power failure) and the
>> result is that data and parity are mis-aligned
>> 2) a disk disappears
>>
>> These two events may happen even at different moments.
>>
>> The key is that when a disk disappears, all remaining ones are used to
>> rebuild the missing one. So if data and parity are mis-aligned, the
>> rebuilt disk is wrong.
>>
>> Let me show an example
>>
>> Disk 1    Disk 2    Disk 3 (parity)
>> AA        BB        CC
>>
>> where CC = A ^ B
>>
>> Note1: A is a valid data
>>
>> Suppose we update B and, due to a power failure, you can't update the
>> parity; you have:
>>
>>
>> Disk 1    Disk 2    Disk 3 (parity)
>> AA        DD        CC
>>
>> Of course CC != A ^ D  (data and parity are misaligned).
>>
>>
>> Pay attention that AA is still valid data.
>>
>> Now suppose we lose disk1. If you want to read from it, you have to
>> perform a read of disk2 and disk3 to compute disk1.
>>
>> However disk2 and disk3 are misaligned, so doing D ^ C you don't
>> get A anymore.
>>
>>
>> Note that it is not important if DD or B are valid or invalid data.
> 
> 
> Doesn't matter on Btrfs. Bad reconstruction due to wrong parity
> results in csum mismatch. This I've tested.

I never argued about that. The write hole is related to the *loss* of
valid data due to a misalignment between data and parity.
The fact that BTRFS is capable of detecting the problem and returning
-EIO doesn't mitigate the loss of valid data. Pay attention that in my
example A reached the disk before the "failure events".

> 
> I vaguely remember a while ago doing a dd conv=notrunc modification of
> a file that's raid5, and there was no RMW, what happened is the whole
> stripe was CoW'd and had the modification. So that would, hardware
> behaving correctly, mean that the raid5 data CoW succeeds, then there
> is a metadata CoW to point to it, then the super block is updated to
> point to the new tree.
> 
> At any point, if there's an interruption, we have the old super
> pointing to the old tree which points to premodified data.
> 
> Anyway, I do wish I read the code better, so I knew exactly where, if
> at all, the RMW code was happening on disk rather than just in memory.
> There very clearly is RMW in memory code as a performance optimizer;
> before a stripe gets written out it's possible to RMW it to add in
> more changes or new files, that way raid56 isn't dog slow CoW'ing
> literally a handful of 16KiB leaves each time, that then translate
> into a minimum of 384K of writes.

In case of a full stripe write, there is no RMW cycle, so no "write
hole". Unfortunately not all writes are full stripe size. I never
checked the code, but I hope that during a commit of the transaction all
the writes are grouped into full stripe writes as much as possible.

Just out of curiosity, what is "minimum of 384k"? In a 3-disk raid5
case, the minimum data is 64k * 2 (+ 64k of parity).

> But yeah, Qu just said in another thread that Liu is working on a
> journal for the raid56 write hole problem. Thing is I don't see when
> it happens in the code or in practice (so far, it's really tedious to
> poke a file system with a stick).
> 





-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Christoph Anton Mitterer
On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote:
> Quite a few applications actually _do_ have some degree of secondary
> verification or protection from a crash.  Go look at almost any
> database software.
Then please give proper references for this!

This is from 2015, where you already claimed this; I looked up all
the bigger DBs and they either couldn't do it at all, didn't do it by
default, or it required application support (i.e. from the programs
using the DB):
https://www.spinics.net/lists/linux-btrfs/msg50258.html


> It usually will not have checksumming, but it will almost
> always have support for a journal, which is enough to cover the
> particular data loss scenario we're talking about (unexpected
> unclean shutdown).

I don't think we're talking about this:
we're talking about people wanting checksumming to notice e.g. silent
data corruption.

The crash case is only the corner case of what happens if data
is written correctly but the csums are not.


> In my own experience, the things that use nodatacow fall into one of
> 4 categories:
> 1. Cases where the data is non-critical, and data loss will be
> inconvenient but not fatal.  Systemd journal files are a good example
> of this, as are web browser profiles when you're using profile sync.

I'd guess many people would want to have their log files valid and
complete. Same for their profiles (especially since people concerned
about their integrity might not want to have these synced to
Mozilla/Google etc.)


> 2. Cases where the upper level can reasonably be expected to have
> some degree of handling, even if it's not correction.  VM disk images
> and most database applications fall into this category.

No. Wrong. Or prove to me that I'm wrong ;-)
And these two (VMs, DBs) are actually *the* main cases for nodatacow.


> And I and most other sysadmins I know would prefer the opposite with
> the addition of a secondary notification method.  You can still hook
> the notification to stop the application, but you don't have to if
> you don't want to (and in cases 1 and 2 I listed above, you probably
> don't want to).

Then I guess btrfs is generally not the right thing for such people, as
in the CoW case it will also give them EIO on any corruption and their
programs will fail.



Cheers,
Chris.



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Christoph Anton Mitterer
On Mon, 2017-08-14 at 10:23 -0400, Austin S. Hemmelgarn wrote:
> Assume you have higher level verification.  Would you rather not be
> able to read the data regardless of if it's correct or not, or be
> able to read it and determine yourself if it's correct or not?

What would be the difference here, then, to the CoW + checksumming +
some-data-corruption case?!
btrfs would also give EIO, and all these applications you mention would
fail then.

As I've said previously, one could provide end users with the means to
still access the faulty data. Or they could simply mount with
nochecksum.




> For almost 
> anybody, the answer is going to be the second case, because the 
> application knows better than the OS if the data is correct (and 
> 'correct' may be a threshold, not some binary determination).
You've made that claim already once with VMs and DBs, and your claim
simply proved wrong.

Most applications don't do this kind of verification.

And those that do probably rather just check whether the data is valid
and, if not, give an error or at best fall back to some automatic
backups (e.g. what package managers do).

I know of only a few programs that would really be capable of using
data they know is bogus and recovering from that automagically... the
only examples I know are some archive formats which include
error-correcting codes. And I really mean using the blocks for recovery
for which the csum wouldn't verify (i.e. the ones that give an EIO)...
without ECCs, how would a program know what to do with such data?

I cannot imagine that many people would choose the second option, to be
honest.
Working with bogus data?! What would be the benefit of this?



>   At that point, you need to make the checksum error a warning
> instead of returning -EIO.  How do you intend to communicate that
> warning back to the application?  The kernel log won't work, because
> on any reasonably secure system it's not visible to anyone but root.

Still same problem with CoW + any data corruption...

>   There's also no side channel for the read() system calls that you
> can utilize.  That then means that the checksums end up just being a
> means for the administrator to know some data wasn't written
> correctly, but they should know that anyway because the system
> crashed.

No, they'd have no idea if any / which data was written during the
crash.



> Looking at this from a different angle: Without background, what
> would you assume the behavior to be for this?  For most people, the
> assumption would be that this provides the same degree of data safety
> that the checksums do when the data is CoW.

I don't think the average user would have any such assumption. Most
people likely don't even know that there is implicitly no checksumming
if nodatacow is enabled.


What people may, however, have heard is that btrfs does checksumming,
and they'd assume that their filesystem always gives them just valid
data (or an error)... and IMO that's actually what every modern fs
should do by default.
Relying on higher levels to provide such means is simply not realistic.



Cheers,
Chris.



Re: [RFC] Checksum of the parity

2017-08-14 Thread Chris Murphy
On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli  wrote:
> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>> [2]
>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>> not sure, sadly I don't read the code well enough. But if all Btrfs
>> raid56 writes are full stripe CoW writes, and if the prescribed order
>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>> superblock update, then I don't see how the write hole happens. Write
>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>> and a crash during the modification of the stripe making that stripe
>> inconsistent as well as still pointed to by metadata.
>
>
> RAID5 is *single* failure proof. And in order to have the write hole
> bug we need two failures:
> 1) a transaction is aborted (e.g. due to a power failure) and the
> result is that data and parity are mis-aligned
> 2) a disk disappears
>
> These two events may happen even at different moments.
>
> The key is that when a disk disappears, all remaining ones are used to
> rebuild the missing one. So if data and parity are mis-aligned, the
> rebuilt disk is wrong.
>
> Let me show an example
>
> Disk 1    Disk 2    Disk 3 (parity)
> AA        BB        CC
>
> where CC = A ^ B
>
> Note1: A is a valid data
>
> Suppose we update B and, due to a power failure, you can't update the
> parity; you have:
>
>
> Disk 1    Disk 2    Disk 3 (parity)
> AA        DD        CC
>
> Of course CC != A ^ D  (data and parity are misaligned).
>
>
> Pay attention that AA is still valid data.
>
> Now suppose we lose disk1. If you want to read from it, you have to
> perform a read of disk2 and disk3 to compute disk1.
>
> However disk2 and disk3 are misaligned, so doing D ^ C you don't
> get A anymore.
>
>
> Note that it is not important if DD or B are valid or invalid data.


Doesn't matter on Btrfs. Bad reconstruction due to wrong parity
results in csum mismatch. This I've tested.

I vaguely remember a while ago doing a dd conv=notrunc modification of
a file that's raid5, and there was no RMW, what happened is the whole
stripe was CoW'd and had the modification. So that would, hardware
behaving correctly, mean that the raid5 data CoW succeeds, then there
is a metadata CoW to point to it, then the super block is updated to
point to the new tree.

At any point, if there's an interruption, we have the old super
pointing to the old tree which points to premodified data.

Anyway, I do wish I read the code better, so I knew exactly where, if
at all, the RMW code was happening on disk rather than just in memory.
There very clearly is RMW in memory code as a performance optimizer;
before a stripe gets written out it's possible to RMW it to add in
more changes or new files, that way raid56 isn't dog slow CoW'ing
literally a handful of 16KiB leaves each time, that then translate
into a minimum of 384K of writes.

But yeah, Qu just said in another thread that Liu is working on a
journal for the raid56 write hole problem. Thing is I don't see when
it happens in the code or in practice (so far, it's really tedious to
poke a file system with a stick).




-- 
Chris Murphy


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Chris Murphy
On Mon, Aug 14, 2017 at 8:23 AM, Goffredo Baroncelli  wrote:

> From a theoretical point of view, if you have a "PURE" COW file-system,
> you don't need a journal. Unfortunately a RAID5/6 stripe update is a RMW
> cycle, so you need a journal to keep it in sync. The same is true for
> NOCOW files (and their checksums)
>

I'm pretty sure the raid56 rmw is in memory only, I don't think we
have a case where a stripe is getting partial writes (a block in a
stripe is being overwritten). Partial stripe updates with rmw *on
disk* would mean Btrfs raid56 is not CoW.

-- 
Chris Murphy


Re: btrfs-progs-v4.12: cross compiling

2017-08-14 Thread Goffredo Baroncelli
On 08/14/2017 05:10 PM, David Sterba wrote:
> On Mon, Aug 14, 2017 at 10:14:42PM +0800, Qu Wenruo wrote:
[...]
> mktables.c is synced from kernel sources, taking updates from there is
> easier than porting any changes to the proposed scripted implementation.
> 
> The workflow is simple:
> - copy kernel mktables.c changes to btrfs-progs mktables.c

How does the kernel deal with this kind of problem?
Looking at the source of the btrfs Makefile, it is simpler to replace

  mktables: kernel-lib/mktables.c
@echo "[CC] $@"
$(Q)$(CC) $(CFLAGS) $< -o $@

with


  mktables: kernel-lib/mktables.c
@echo "[HOSTCC] $@"
$(Q)$(HOSTCC) $(CFLAGS) $< -o $@

where HOSTCC is defined as

HOSTCC=gcc


(maybe the same applies also to CFLAGS <-> HOSTCFLAGS?)

> - compile mktables
> - run 'make kernel-lib/tables.c'
> - commit the changes to git
> 
> All of that happens very rarely, if ever; the raid56 tables and
> correction algorithms are set in stone. Any extensions need to be done
> on both sides, kernel/userspace.
> 
>>>> What about using a script to generate it?
>>>
>>> We do have the mktables utility to generate it and I'll regenerate it
>>> should there be a change to kernel-lib/mktables.c
>>
>> I mean to replace mktables.c with a script.
>> So no cross compiler problems at all, and an even easier Makefile.
>> No dependence on the "mktables" program.
> 
> Somebody has to implement the script and verify that the output is the
> same, eventually sync changes. The cross-compilation should be fixed
> with the pregenerated tables.c . Is Makefile size really a concern? The
> number of related lines is like 7. I don't see any benefit in what you
> propose and hopefully explained my viewpoint enough so I don't have to
> continue.


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


[PATCH] btrfs: test incremental send with compression and extent cloning

2017-08-14 Thread fdmanana
From: Filipe Manana 

Test that an incremental send/receive operation will not fail when the
destination filesystem has compression enabled and the source filesystem
has a 4K extent at a file offset 0 that is not compressed and that is
shared.

This currently fails on btrfs and is fixed by the following patch for the
linux kernel:

  "Btrfs: incremental send, fix emission of invalid clone operations"

Signed-off-by: Filipe Manana 
---
 tests/btrfs/149 | 113 
 tests/btrfs/149.out |  14 +++
 tests/btrfs/group   |   1 +
 3 files changed, 128 insertions(+)
 create mode 100755 tests/btrfs/149
 create mode 100644 tests/btrfs/149.out

diff --git a/tests/btrfs/149 b/tests/btrfs/149
new file mode 100755
index ..432ec707
--- /dev/null
+++ b/tests/btrfs/149
@@ -0,0 +1,113 @@
+#! /bin/bash
+# FS QA Test No. btrfs/149
+#
+# Test that an incremental send/receive operation will not fail when the
+# destination filesystem has compression enabled and the source filesystem
+# has a 4K extent at a file offset 0 that is not compressed and that is
+# shared.
+#
+#---
+#
+# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -fr $send_files_dir
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_scratch_reflink
+_require_odirect
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount "-o compress"
+
+# Write to our file using direct IO, so that this way the write ends up not
+# getting compressed, that is, we get a regular extent which is neither
+# inlined nor compressed.
+# Alternatively, we could have mounted the fs without compression enabled,
+# which would result as well in an uncompressed regular extent.
+$XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 4K" $SCRATCH_MNT/foobar | _filter_xfs_io
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+   $SCRATCH_MNT/mysnap1 > /dev/null
+
+# Clone the regular (not inlined) extent.
+$XFS_IO_PROG -c "reflink $SCRATCH_MNT/foobar 0 8K 4K" $SCRATCH_MNT/foobar \
+   | _filter_xfs_io
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+   $SCRATCH_MNT/mysnap2 > /dev/null
+
+$BTRFS_UTIL_PROG send -f $send_files_dir/1.snap \
+$SCRATCH_MNT/mysnap1 2>&1 >/dev/null | _filter_scratch
+
+# Now do an incremental send of the second snapshot. The send stream can have
+# a clone operation to clone the extent at offset 0 to offset 8K. This
+# operation would fail on the receiver if it has compression enabled, since
+# the write operation of the extent at offset 0 was compressed because it was
+# a buffered write operation, and btrfs' clone implementation does not allow
+# cloning inline extents to offsets different from 0.
+$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 -f $send_files_dir/2.snap \
+$SCRATCH_MNT/mysnap2 2>&1 >/dev/null | _filter_scratch
+
+echo "File digests in the original filesystem:"
+md5sum $SCRATCH_MNT/mysnap1/foobar | _filter_scratch
+md5sum $SCRATCH_MNT/mysnap2/foobar | _filter_scratch
+
+# Now recreate the filesystem by receiving both send streams and verify we get
+# the same file content that the original filesystem had.
+_scratch_unmount
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount "-o compress"
+
+$BTRFS_UTIL_PROG receive -f $send_files_dir/1.snap $SCRATCH_MNT > /dev/null
+$BTRFS_UTIL_PROG receive -f $send_files_dir/2.snap $SCRATCH_MNT > /dev/null
+
+echo "File digests in the new filesystem:"
+md5sum $SCRATCH_MNT/mysnap1/foobar | _filter_scratch
+md5sum $SCRATCH_MNT/mysnap2/foobar | _filter_scratch
+
+status=0
+exit
diff --git a/tests/btrfs/149.out b/tests/btrfs/149.out

[PATCH] Btrfs: incremental send, fix emission of invalid clone operations

2017-08-14 Thread fdmanana
From: Filipe Manana 

When doing an incremental send it's possible that the computed send stream
contains clone operations that will fail on the receiver if the receiver
has compression enabled and the clone operations target a sector sized
extent that starts at a zero file offset, is not compressed on the source
filesystem but ends up being compressed and inlined at the destination
filesystem.

Example scenario:

  $ mkfs.btrfs -f /dev/sdb
  $ mount -o compress /dev/sdb /mnt

  # By doing a direct IO write, the data is not compressed.
  $ xfs_io -f -d -c "pwrite -S 0xab 0 4K" /mnt/foobar
  $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1

  $ xfs_io -c "reflink /mnt/foobar 0 8K 4K" /mnt/foobar
  $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2

  $ btrfs send -f /tmp/1.snap /mnt/mysnap1
  $ btrfs send -f /tmp/2.snap -p /mnt/mysnap1 /mnt/mysnap2
  $ umount /mnt

  $ mkfs.btrfs -f /dev/sdc
  $ mount -o compress /dev/sdc /mnt
  $ btrfs receive -f /tmp/1.snap /mnt
  $ btrfs receive -f /tmp/2.snap /mnt
  ERROR: failed to clone extents to foobar
  Operation not supported

The same could be achieved by mounting the source filesystem without
compression and doing a buffered IO write instead of a direct IO one,
and mounting the destination filesystem with compression enabled.

So fix this by issuing regular write operations in the send stream
instead of clone operations when the source offset is zero and the
range has a length matching the sector size.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/send.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index b082210df9c8..460be72ab78b 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -4992,6 +4992,25 @@ static int clone_range(struct send_ctx *sctx,
struct btrfs_key key;
int ret;
 
+   /*
+* Prevent cloning from a zero offset with a length matching the sector
+* size because in some scenarios this will make the receiver fail.
+*
+* For example, if in the source filesystem the extent at offset 0
+* has a length of sectorsize and it was written using direct IO, then
+* it can never be an inline extent (even if compression is enabled).
+* Then this extent can be cloned in the original filesystem to a non
+* zero file offset, but it may not be possible to clone in the
+* destination filesystem because it can be inlined due to compression
+* on the destination filesystem (as the receiver's write operations are
+* always done using buffered IO). The same happens when the original
+* filesystem does not have compression enabled but the destination
+* filesystem has.
+*/
+   if (clone_root->offset == 0 &&
+   len == sctx->send_root->fs_info->sectorsize)
+   return send_extent_data(sctx, offset, len);
+
path = alloc_path_for_send();
if (!path)
return -ENOMEM;
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Graham Cobb
On 14/08/17 16:53, Austin S. Hemmelgarn wrote:
> Quite a few applications actually _do_ have some degree of secondary
> verification or protection from a crash.  

I am glad your applications do and you have no need of this feature.
You are welcome not to use it. I, on the other hand, definitely want
this feature and would have it enabled by default on all my systems
despite the need for manual actions after some unclean shutdowns.

> Go look at almost any database
> software.  It usually will not have checksumming, but it will almost
> always have support for a journal, which is enough to cover the
> particular data loss scenario we're talking about (unexpected unclean
> shutdown).

No, the problem we are talking about is the data-at-rest corruption that
checksumming is designed to deal with. That is why I want it. The
unclean shutdown is a side issue that means there is a trade-off to
using it.

No one is suggesting that checksums are any significant help with the
unclean shutdown case, just that the existence of that atomicity issue
does not **prevent** them from being very useful for the function for which
they were designed. The degree to which any particular sysadmin will
choose to enable or disable checksums on nodatacow files will depend on
how much they value the checksum protection vs. the impact of manually
fixing problems after some unclean shutdowns.

In my particular case, many of these nodatacow files are large, very
long-lived and only in use intermittently. I would like my monthly
"btrfs scrub" to know they haven't gone bad but they are extremely
unlikely to be in the middle of a write during an unclean shutdown so I
am likely to have very few false errors. They are all backed up, but
without checksumming I don't know that the backup needs to be restored
(or even that I am not backing up now-bad data).

Graham


Re: [PATCH v3] btrfs: preserve i_mode if __btrfs_set_acl() fails

2017-08-14 Thread David Sterba
On Wed, Aug 02, 2017 at 03:18:27AM -0300, Ernesto A. Fernández wrote:
> When changing a file's acl mask, btrfs_set_acl() will first set the
> group bits of i_mode to the value of the mask, and only then set the
> actual extended attribute representing the new acl.
> 
> If the second part fails (due to lack of space, for example) and the
> file had no acl attribute to begin with, the system will from now on
> assume that the mask permission bits are actual group permission bits,
> potentially granting access to the wrong users.
> 
> Prevent this by restoring the original mode bits if __btrfs_set_acl
> fails.
> 
> Signed-off-by: Ernesto A. Fernández 
> ---
> Please ignore the two previous versions, this is far simpler and has the
> same effect. To Josef Bacik: thank you for your review, I'm sorry I
> wasted your time.

This version is much better, starting the transaction would add some
overhead and would need to be measured so we have an idea about the
impact.

Reviewed-by: David Sterba 


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Austin S. Hemmelgarn

On 2017-08-14 11:13, Graham Cobb wrote:

On 14/08/17 15:23, Austin S. Hemmelgarn wrote:

Assume you have higher level verification.


But almost no applications do. In real life, the decision
making/correction process will be manual and labour-intensive (for
example, running fsck on a virtual disk or restoring a file from backup).
Quite a few applications actually _do_ have some degree of secondary 
verification or protection from a crash.  Go look at almost any database 
software.  It usually will not have checksumming, but it will almost 
always have support for a journal, which is enough to cover the 
particular data loss scenario we're talking about (unexpected unclean 
shutdown).



Would you rather not be able
to read the data regardless of if it's correct or not, or be able to
read it and determine yourself if it's correct or not?


It must be controllable on a per-file basis, of course. For the tiny
number of files where the app can both spot the problem and correct it
(for example if it has a journal) the current behaviour could be used.
In my own experience, the things that use nodatacow fall into one of 4 
categories:
1. Cases where the data is non-critical, and data loss will be 
inconvenient but not fatal.  Systemd journal files are a good example of 
this, as are web browser profiles when you're using profile sync.
2. Cases where the upper level can reasonably be expected to have some 
degree of handling, even if it's not correction.  VM disk images and 
most database applications fall into this category.
3. Cases where data corruption will take out the application anyway. 
Poorly written database software is the primary example of this.
4. Things that shouldn't be using nodatacow because data safety is the 
most important aspect of the system.


The first two cases work perfectly fine with the current behavior and 
are arguably no better off either way.  The third is functionally fine 
with the current behavior provided that the crash doesn't change state 
(which isn't a guarantee), but could theoretically benefit from the 
determinism of knowing the app will die if the data is bad.  The fourth 
is what most people seem to want this for, and don't realize that even 
if this is implemented, they will be no better off on average.


But, on MY system, I absolutely would **always** select the first option
(-EIO). I need to know that a potential problem may have occurred and
will take manual action to decide what to do. Of course, this also needs
a special utility (as Christoph proposed) to be able to force the read
(to allow me to examine the data) and to be able to reset the checksum
(although that is presumably as simple as rewriting the data).
And I and most other sysadmins I know would prefer the opposite with the 
addition of a secondary notification method.  You can still hook the 
notification to stop the application, but you don't have to if you don't 
want to (and in cases 1 and 2 I listed above, you probably don't want to).


This is what happens normally with any filesystem when a disk block goes
bad, but with the additional benefit of being able to examine a
"possibly valid" version of the data block before overwriting it.


Looking at this from a different angle: Without background, what would
you assume the behavior to be for this?  For most people, the assumption
would be that this provides the same degree of data safety that the
checksums do when the data is CoW.


Exactly. The naive expectation is that turning off datacow does not
prevent the bitrot checking from working. Also, the naive expectation
(for any filesystem operation) is that if there is any doubt about the
reliability of the data, the error is reported for the user to deal with.
The problem is that the naive expectation about data safety appears to 
be that adding checksumming support for nodatacow will improve safety, 
which it WILL NOT do.  All it will do is add some reporting that will 
have a 50%+ rate of false positives (there is the very real possibility 
that the unexpected power loss will corrupt the checksum or the data if 
you're on anything but a traditional hard drive).  If you have something 
that you need data safety for and can't be arsed to pay attention to 
whether or not your system had an unclean shutdown, then you have two 
practical options:

1. Don't use nodatacow.
2. Do some form of higher level verification.
Nothing about that is going to magically change because you suddenly 
have checksums telling you the data might be bad.


Now, you _might_ be better off in a situation where the data got 
corrupted for some other reason (say, a media error for example), but 
even then you should have higher level verification, and it won't 
provide much benefit unless you're using replication or parity in BTRFS.


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Graham Cobb
On 14/08/17 15:23, Austin S. Hemmelgarn wrote:
> Assume you have higher level verification.  

But almost no applications do. In real life, the decision
making/correction process will be manual and labour-intensive (for
example, running fsck on a virtual disk or restoring a file from backup).

> Would you rather not be able
> to read the data regardless of if it's correct or not, or be able to
> read it and determine yourself if it's correct or not?  

It must be controllable on a per-file basis, of course. For the tiny
number of files where the app can both spot the problem and correct it
(for example if it has a journal) the current behaviour could be used.

But, on MY system, I absolutely would **always** select the first option
(-EIO). I need to know that a potential problem may have occurred and
will take manual action to decide what to do. Of course, this also needs
a special utility (as Christoph proposed) to be able to force the read
(to allow me to examine the data) and to be able to reset the checksum
(although that is presumably as simple as rewriting the data).

This is what happens normally with any filesystem when a disk block goes
bad, but with the additional benefit of being able to examine a
"possibly valid" version of the data block before overwriting it.

> Looking at this from a different angle: Without background, what would
> you assume the behavior to be for this?  For most people, the assumption
> would be that this provides the same degree of data safety that the
> checksums do when the data is CoW.  

Exactly. The naive expectation is that turning off datacow does not
prevent the bitrot checking from working. Also, the naive expectation
(for any filesystem operation) is that if there is any doubt about the
reliability of the data, the error is reported for the user to deal with.


Re: btrfs-progs-v4.12: cross compiling

2017-08-14 Thread David Sterba
On Mon, Aug 14, 2017 at 10:14:42PM +0800, Qu Wenruo wrote:
> On 2017年08月14日 22:03, David Sterba wrote:
> > On Mon, Aug 14, 2017 at 09:17:08PM +0800, Qu Wenruo wrote:
> >> On 2017年08月14日 21:06, David Sterba wrote:
> >>> On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:
>  Since version 4.12 btrfs-progs has been complicated to cross compile for
>  other systems.
>  The problem is that this version includes mktables, which needs to be
>  compiled for the host system and executed there for the creation of
>  tables.c.
> 
>  Are there any changes planned for the next version of btrfs-progs to make
>  the cross compiling as simple as in the past? A included tables.c for
>  example?
> >>>
> >>> Yes, keeping the generated tables.c around is fine. There's no reason it
> >>> needs to be generated each time during build. I'll fix that in 4.12.1.
> >>
> >> But the number of lines and the impossibility of reviewing it make it not
> >> suitable to be managed by git.
> > 
> > I don't understand your concern. The file is generated from a set of
> > formulas, not intended to be updated directly.
> 
> Yes, it should never be updated directly, so it's generated by a program
> of less than 400 lines, instead of a whole 10K+ line file managed by git.

mktables.c is synced from kernel sources; taking updates from there is
easier than porting any changes to the proposed scripted implementation.

The workflow is simple:
- copy kernel mktables.c changes to btrfs-progs mktables.c
- compile mktables
- run 'make kernel-lib/tables.c'
- commit the changes to git

All of that happens very rarely, if ever; the raid56 tables and
correction algorithms are set in stone. Any extensions need to be done
on both sides, kernel and userspace.

> >> What about using a script to generate it?
> > 
> > We do have the mktables utility to generate it and I'll regenerate it
> > should there be a change to kernel-lib/mktables.c
> 
> I mean to replace mktables.c with a script.
> So no cross compiler problems at all, and an even simpler Makefile.
> No dependence on the "mktables" program.

Somebody has to implement the script and verify that the output is the
same, eventually sync changes. The cross-compilation should be fixed
with the pregenerated tables.c. Is Makefile size really a concern? The
number of related lines is like 7. I don't see any benefit in what you
propose and hopefully explained my viewpoint enough so I don't have to
continue.


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Austin S. Hemmelgarn

On 2017-08-14 08:24, Christoph Anton Mitterer wrote:

On Mon, 2017-08-14 at 14:36 +0800, Qu Wenruo wrote:

And how are you going to write your data and checksum atomically when
doing in-place updates?


Exactly, that's the main reason I can think of for why btrfs disables
checksums for nodatacow.


Still, I don't get the problem here...

Yes it cannot be done atomically (without workarounds like a journal or
so), but this should be only an issue in case of a crash or similar.

And in this case nodatacow+nochecksum is anyway already bad, it's also
not atomic, so data may be completely garbage (e.g. half written)...
just that no one will ever notice.

The only problem that nodatacow + checksumming + non-atomicity should give
is when the data was actually correctly written at a crash, but the
checksum was not, in which case the bogus checksum would invalidate the
good data on the next read.

Or do I miss something?


To me that sounds still much better than having no protection at all.
Assume you have higher level verification.  Would you rather not be able 
to read the data regardless of if it's correct or not, or be able to 
read it and determine yourself if it's correct or not?  For almost 
anybody, the answer is going to be the second case, because the 
application knows better than the OS if the data is correct (and 
'correct' may be a threshold, not some binary determination).  At that 
point, you need to make the checksum error a warning instead of 
returning -EIO.  How do you intend to communicate that warning back to 
the application?  The kernel log won't work, because on any reasonably 
secure system it's not visible to anyone but root.  There's also no side 
channel for the read() system calls that you can utilize.  That then 
means that the checksums end up just being a means for the administrator 
to know some data wasn't written correctly, but they should know that 
anyway because the system crashed.  As a result, the whole thing ends up 
reduced to some extra work for a pointless notification that some people 
may not even see.


Looking at this from a different angle: Without background, what would 
you assume the behavior to be for this?  For most people, the assumption 
would be that this provides the same degree of data safety that the 
checksums do when the data is CoW.  We already have enough issues with 
people misunderstanding how things work and losing data as a result 
(keep in mind that the average user doesn't read documentation and will 
often blindly follow any random advice they see online), and we don't 
need more that are liable to cause data loss.



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Goffredo Baroncelli
On 08/14/2017 09:08 AM, Qu Wenruo wrote:
> 
>>
>> Suppose we log, for each transaction, which "data NOCOW blocks" will be 
>> updated, together with their checksums: in case a transaction is interrupted 
>> you know which blocks have to be checked, and are able to verify if the 
>> checksum matches and correct the mismatch. Logging also the checksum could 
>> help to identify if:
>> - the data is old
>> - the data is updated
>> - the updated data is correct
>>
>> The same approach could also be used to solve the issue related to the 
>> infamous RAID5/6 hole: logging which blocks are updated, in case of an 
>> aborted transaction you can check the parity which has to be rebuilt.
> Indeed Liu is using journal to solve RAID5/6 write hole.
> 
> But to address the lack-of-journal nature of btrfs, he introduced a journal 
> device to handle it. Since btrfs metadata is either written or trashed, we 
> can't rely on existing btrfs metadata to handle the journal.

Liu's solution is a lot heavier: with it, you need to write both the data and 
the parity two times. I am only suggesting to track the blocks to update, and 
it would only be needed for the stripes involved in a RMW cycle. This is a lot 
less data to write (8 bytes vs 4 KiB).

> 
> PS: This reminds me why ZFS still uses a journal (called the ZFS intent log) 
> rather than the mandatory metadata CoW of btrfs.

From a theoretical point of view, if you have a "PURE" COW filesystem, you 
don't need a journal. Unfortunately a RAID5/6 stripe update is a RMW cycle, so 
you need a journal to keep it in sync. The same is true for NOCOW files (and 
their checksums).


> 
> Thanks,
> Qu


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: btrfs-progs-v4.12: cross compiling

2017-08-14 Thread Qu Wenruo



On 2017年08月14日 22:03, David Sterba wrote:

On Mon, Aug 14, 2017 at 09:17:08PM +0800, Qu Wenruo wrote:



On 2017年08月14日 21:06, David Sterba wrote:

On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:

Since version 4.12 btrfs-progs has been complicated to cross compile for
other systems.
The problem is that this version includes mktables, which needs to be
compiled for the host system and executed there for the creation of
tables.c.

Are there any changes planned for the next version of btrfs-progs to make
the cross compiling as simple as in the past? A included tables.c for
example?


Yes, keeping the generated tables.c around is fine. There's no reason it
needs to be generated each time during build. I'll fix that in 4.12.1.


But the number of lines and the impossibility of reviewing it make it not
suitable to be managed by git.


I don't understand your concern. The file is generated from a set of
formulas, not intended to be updated directly.


Yes, it should never be updated directly, so it's generated by a program 
of less than 400 lines, instead of a whole 10K+ line file managed by git.




What about using a script to generate it?


We do have the mktables utility to generate it and I'll regenerate it
should there be a change to kernel-lib/mktables.c


I mean to replace mktables.c with a script.
So no cross compiler problems at all, and an even simpler Makefile.
No dependence on the "mktables" program.

Thanks,
Qu





Re: [RFC] Checksum of the parity

2017-08-14 Thread Goffredo Baroncelli
On 08/13/2017 08:45 PM, Chris Murphy wrote:
> [2]
> Is Btrfs subject to the write hole problem manifesting on disk? I'm
> not sure, sadly I don't read the code well enough. But if all Btrfs
> raid56 writes are full stripe CoW writes, and if the prescribed order
> guarantees still happen: data CoW to disk > metadata CoW to disk >
> superblock update, then I don't see how the write hole happens. Write
> hole requires: RMW of a stripe, which is a partial stripe overwrite,
> and a crash during the modification of the stripe making that stripe
> inconsistent as well as still pointed to by metadata.


RAID5 is *single* failure proof. In order to hit the write hole bug we need 
two failures:
1) a transaction is aborted (e.g. due to a power failure) and the result is 
that data and parity are misaligned
2) a disk disappears

These two events may even happen at different moments.

The key point is that when a disk disappears, all remaining ones are used to 
rebuild the missing one. So if data and parity are misaligned, the rebuilt 
disk is wrong.

Let me show an example:

Disk 1    Disk 2    Disk 3 (parity)
AA        BB        CC

where CC = AA ^ BB

Note 1: AA is valid data.

Suppose you update BB and, due to a power failure, you can't update the
parity; you then have:

Disk 1    Disk 2    Disk 3 (parity)
AA        DD        CC

Of course CC != AA ^ DD (data and parity are misaligned).


Note that AA is still valid data.

Now suppose you lose disk 1. If you want to read from it, you have to read 
disk 2 and disk 3 to recompute disk 1.

However disk 2 and disk 3 are misaligned, so computing DD ^ CC you don't get 
AA anymore.

Note that it does not matter whether DD or BB is valid data.

Moreover, I have to point out that a simple scrub run between events 1) and 
2) is able to rebuild a correct parity. This would reduce the likelihood of 
the "write hole" bug.
The only case which would still exist is when 1) and 2) happen at the same 
time (which is not impossible: e.g. if a disk dies, it is not infrequent that 
the user shuts down the machine without waiting for a clean shutdown).

BR
G.Baroncelli



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: btrfs-progs-v4.12: cross compiling

2017-08-14 Thread David Sterba
On Mon, Aug 14, 2017 at 09:17:08PM +0800, Qu Wenruo wrote:
> 
> 
> On 2017年08月14日 21:06, David Sterba wrote:
> > On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:
> >> Since version 4.12 btrfs-progs has been complicated to cross compile for
> >> other systems.
> >> The problem is that this version includes mktables, which needs to be
> >> compiled for the host system and executed there for the creation of
> >> tables.c.
> >>
> >> Are there any changes planned for the next version of btrfs-progs to make
> >> the cross compiling as simple as in the past? A included tables.c for
> >> example?
> > 
> > Yes, keeping the generated tables.c around is fine. There's no reason it
> > needs to be generated each time during build. I'll fix that in 4.12.1.
> 
> But the number of lines and the impossibility of reviewing it make it not 
> suitable to be managed by git.

I don't understand your concern. The file is generated from a set of
formulas, not intended to be updated directly.
> 
> What about using a script to generate it?

We do have the mktables utility to generate it and I'll regenerate it
should there be a change to kernel-lib/mktables.c


Re: Building a BTRFS test machine

2017-08-14 Thread Austin S. Hemmelgarn

On 2017-08-13 21:01, Cerem Cem ASLAN wrote:

Would it be useful to build a BTRFS test machine, which would perform
both software tests (btrfs send | btrfs receive, read/write random
data, etc.) and hardware tests, such as abrupt power-off tests, or
abruptly removing a RAID-X disk physically?
In general, yes.  There are already a couple of people (at least myself 
and Adam Borowski) who pick out patches we have interest in from the 
mailing list and test them in VM's, but having more people involved in 
testing is never a bad thing (cross verification of the testing is very 
helpful, because it can help identify when one of the test systems is 
suspect).


Based on my own experience, if you do go with a VM, I've found that QEMU 
with LVM as the backing storage is probably one of the simplest setups 
to automate.  You can easily script things like adding and removing 
disks, and using LVM for storage means you can add and remove (and 
snapshot) backend devices as needed.


If it would be useful, what tests should it cover?

Qu covered this well, so there's not much for me to add here.

My own testing is pretty consistent with what Qu mentioned, plus a few 
special cases I've set up myself that I still need to get pushed 
upstream somewhere.  For reference, the big ones I test that aren't 
(AFAIK at least) in any of the standard test sets are:


* Large scale bulk parallel creation and removal of subvolumes.  I've 
got a script that creates 16 subvolumes, and then in parallel 
snapshots them 65536 times each, and then calls `btrfs subvolume delete` 
on all 1048576 subvolumes simultaneously.  This is _really_ good at 
showing performance differences in handling of snapshots and subvolumes.


* Large scale bulk reflink creation and deletion.  Similar to the above, 
but using a 1GB file and the clone ioctl to create a similar number of 
reflinked files.


* Scaling performance of directories with very large numbers of entries. 
 In essence, I create directories with power of 2 numbers of files 
starting at 512 and ending with 1048576, with random names and random 
metadata, and see how long `ls -als` takes on the directory.  This came 
about because of performance issues seen on a file server where I work 
with a directory that has well over four thousand files in it that got 
noticeably worse performance with BTRFS than with ext4.


* Kernel handling of mixed profile filesystems.  I've got a script that 
generates a BTRFS filesystem with a data, metadata, and system chunk of 
each possible profile (single, dup, raid0, raid1, raid10, raid5, and 
raid6), and then makes sure the kernel can mount this and that balances 
to each profile work correctly.



Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-14 Thread David Sterba
On Fri, Aug 11, 2017 at 09:20:10AM -0400, Chris Mason wrote:
> 
> 
> On 08/10/2017 03:25 PM, Hugo Mills wrote:
> > On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
> >> On 08/10/2017 04:30 AM, Eric Biggers wrote:
> >>>
> >>> These benchmarks are misleading because they compress the whole file as a
> >>> single stream without resetting the dictionary, which isn't how data will
> >>> typically be compressed in kernel mode.  With filesystem compression the
> >>> data has to be divided into small chunks that can each be decompressed
> >>> independently.  That eliminates one of the primary advantages of
> >>> Zstandard (support for large dictionary sizes).
> >>
> >> I did btrfs benchmarks of kernel trees and other normal data sets as
> >> well.  The numbers were in line with what Nick is posting here.
> >> zstd is a big win over both lzo and zlib from a btrfs point of view.
> >>
> >> It's true Nick's patches only support a single compression level in
> >> btrfs, but that's because btrfs doesn't have a way to pass in the
> >> compression ratio.  It could easily be a mount option, it was just
> >> outside the scope of Nick's initial work.
> > 
> > Could we please not add more mount options? I get that they're easy
> > to implement, but it's a very blunt instrument. What we tend to see
> > (with both nodatacow and compress) is people using the mount options,
> > then asking for exceptions, discovering that they can't do that, and
> > then falling back to doing it with attributes or btrfs properties.
> > Could we just start with btrfs properties this time round, and cut out
> > the mount option part of this cycle.
> > 
> > In the long run, it'd be great to see most of the btrfs-specific
> > mount options get deprecated and ultimately removed entirely, in
> > favour of attributes/properties, where feasible.
> > 
> 
> It's a good point, and as was commented later down I'd just do mount -o 
> compress=zstd:3 or something.

We've solved that already, here's a proposed patch that extends current
mount options,

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg66248.html

and a patch that could be backported to older kernels so the new mount
options do not break mounts on older kernels with the new syntax

https://patchwork.kernel.org/patch/9845697/



Re: lazytime mount option—no support in Btrfs

2017-08-14 Thread Austin S. Hemmelgarn

On 2017-08-13 07:50, Adam Hunt wrote:

Back in 2014 Ted Tso introduced the lazytime mount option for ext4 and
shortly thereafter a more generic VFS implementation which was then
merged into mainline. His early patches included support for Btrfs but
those changes were removed prior to the feature being merged. His
changelog includes the following note about the removal:

   - Per Christoph's suggestion, drop support for btrfs and xfs for now,
 issues with how btrfs and xfs handle dirty inode tracking.  We can add
 btrfs and xfs support back later or at the end of this series if we
 want to revisit this decision.

My reading of the current mainline shows that Btrfs still lacks any
support for lazytime. Has any thought been given to adding support for
lazytime to Btrfs?
It has been at least lightly discussed (I forget the thread, but I gave a 
reasonably specific explanation of the interaction of the *atime and 
lazytime options in a thread a while back when trying to explain to 
someone why I wanted to be able to run with noatime and lazytime), but I 
don't think the discussion got anywhere.


I would personally love to see support for it myself (or you know, at 
least have some warning that it isn't supported instead of just silently 
accepting and ignoring it like we do currently), but I unfortunately 
don't have the time or expertise to work on implementing it.  If someone 
does post a patch though, I'll be more than happy to throw a few dozen 
VM's at testing it.



Re: btrfs-progs-v4.12: cross compiling

2017-08-14 Thread Qu Wenruo



On 2017年08月14日 21:06, David Sterba wrote:

On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:

Since version 4.12, btrfs-progs is complicated to cross-compile for
other systems.
The problem is that this version includes mktables, which needs to be
compiled for the host system and executed there to create tables.c.

Are there any changes planned for the next version of btrfs-progs to make
cross-compiling as simple as in the past? An included tables.c, for
example?


Yes, keeping the generated tables.c around is fine. There's no reason it
needs to be generated each time during build. I'll fix that in 4.12.1.


But the number of lines and the impossibility of reviewing it make it 
unsuitable to be managed by git.


What about using a script to generate it?

Thanks,
Qu


Thanks for the report, cross-compilation should stay supported.




Re: btrfs-progs-v4.12: cross compiling

2017-08-14 Thread Qu Wenruo



On 2017年08月14日 20:17, Hallo32 wrote:

Hello at all,

I'm new at this list. If the mail is not in line with your standards 
please inform me.


Since version 4.12, btrfs-progs is complicated to cross-compile for 
other systems.
The problem is that this version includes mktables, which needs to be 
compiled for the host system and executed there to create tables.c.


Are there any changes planned for the next version of btrfs-progs to make 
cross-compiling as simple as in the past? An included tables.c, for 
example?


Including tables.c seems quite crazy to me.
Just check how many lines tables.c has.

If we screw up any number in it, RAID6 is screwed up and we will never 
find the bug by reviewing such meaningless numbers.


We could enhance the Makefile to use a host CC to compile and execute mktables.
But it may make the Makefile more complicated just for one file.
(Well, I just don't want to mess up the traditional Makefile/configure any more.)
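
A sketch of what that host-cc build could look like from the user's side (the HOSTCC variable name is an assumption about how the build system would expose it):

```shell
#!/bin/sh
# Hypothetical cross-compile of btrfs-progs: the cross compiler builds the
# tools while a plain host compiler builds and runs mktables locally.
CROSS=${CROSS:-aarch64-linux-gnu-}
if [ -x ./configure ]; then
    ./configure --host=aarch64-linux-gnu CC="${CROSS}gcc"
    # variables exported from Makefile.inc can be overridden on the command line
    make HOSTCC=cc
else
    echo "run this from a btrfs-progs source tree (cross prefix: $CROSS)"
fi
```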

Another solution is to generate tables.c using a script instead of a 
binary executable.
I'll try to port mktables to bash or python, which should solve the 
problem without further complicating the Makefile.
(And maybe I can update the kernel too, but the kernel already 
distinguishes its host-cc and cross-cc well, so it may not help much.)


Thanks,
Qu



Best regards



Re: btrfs-progs-v4.12: cross compiling

2017-08-14 Thread David Sterba
On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:
> Since version 4.12, btrfs-progs is complicated to cross-compile for 
> other systems.
> The problem is that this version includes mktables, which needs to be 
> compiled for the host system and executed there to create tables.c.
> 
> Are there any changes planned for the next version of btrfs-progs to make 
> cross-compiling as simple as in the past? An included tables.c, for 
> example?

Yes, keeping the generated tables.c around is fine. There's no reason it
needs to be generated each time during build. I'll fix that in 4.12.1.
Thanks for the report, cross-compilation should stay supported.


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Qu Wenruo



On 2017年08月14日 20:32, Christoph Anton Mitterer wrote:

On Mon, 2017-08-14 at 15:46 +0800, Qu Wenruo wrote:

The problem here is, if you enable csum and even data is updated
correctly, only metadata is trashed, then you can't even read out the
correct data.


So what?
This problem occurs anyway *only* in case of a crash... and *only* if
nodatacow+checksumming were used.
A case in which currently, the user can either only hope that his data
is fine (unless higher levels provide some checksumming means[0]), or
anyway needs to recover from a backup.


Let me make the combinations and their results in the power-loss case clear:

Datacow + Datasum: Good old data

Datacow + nodatasum: Good old data

Nodatacow + datasum: Good old data (data not committed yet) or -EIO 
(data updated)
Not supported yet, so I just assume it would use the current csum checking 
behavior.


Nodatacow + nodatasum: Good old data (data not committed yet) or 
uncertain data.


The uncertain part is how it should behave when data is updated.

If we really need to implement nodatacow + datasum, I prefer to make it 
consistent with the nodatacow + nodatasum behavior: at least read out the 
data and give a csum warning instead of refusing to read and returning -EIO.




Intuitively I'd also say it's much less likely that the data (which is
more in terms of space) is written correctly while the checksum is not.
Or is it?


Checksums are protected by mandatory metadata CoW, so metadata update is 
always atomic.

The checksum will either be updated correctly or trashed entirely, unlike data.

And it's quite likely to happen: when synchronising a filesystem, we 
write data first, then metadata (data and metadata may be cached by the disk 
controller, but at least we submit such requests to the disk), then flush all 
data and metadata to disk, and finally update the superblock.


Since metadata is updated with CoW, unless the superblock is written to disk, 
we are always reading the old metadata trees (including the csum tree).


So if power loss happens between the data being written to disk and the final 
superblock update, it's quite likely to hit the problem.
And considering the data/metadata ratio, we spend more time flushing 
data than metadata, which increases the likelihood even further.




[0] And when I investigated, back when the discussion first came up
and some list member claimed that most typical cases (DBs, VM
images) would do their own checksumming anyway,... I came to the
conclusion that most did not even support it, and even if they did,
it's not enabled by default and not really *full* checksumming in
most cases.




As btrfs csum checker will just prevent you from reading out any
data
which doesn't match with csum.

As I've said before, a tool could be provided that re-computes the
checksums (making the data accessible again)... or one could
simply mount the fs with nochecksum or some other special option which
allows bypassing any checks.


Just as you pointed out, such csum bypassing should be a prerequisite 
for nodatacow+datasum.

And unfortunately, we don't have such a facility yet.




Now it's not just data corruption, but data loss then.

I think the former is worse than the latter. The latter gives you a
chance of noticing it, and either recovering from a backup, regenerating the
data (if possible), or manually marking the data as "good" (though
corrupted) again.


This depends.
If the upper layer has its own error detection mechanism, like keeping a 
special file fsynced before each write (or just call it a journal), then 
allowing the corrupted data to be read out gives it a chance to find it 
good and continue.

While just returning -EIO kills that chance entirely.

BTW, normal user-space programs can handle a csum mismatch better than -EIO.
Zip files, for example, have their own checksum, but zip tools can't handle -EIO at all.

Thanks,
Qu




Cheers,
Chris.




Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Christoph Anton Mitterer
On Mon, 2017-08-14 at 15:46 +0800, Qu Wenruo wrote:
> The problem here is, if you enable csum and even data is updated
> correctly, only metadata is trashed, then you can't even read out the
> correct data.

So what?
This problem occurs anyway *only* in case of a crash... and *only* if
nodatacow+checksumming were used.
A case in which currently, the user can either only hope that his data
is fine (unless higher levels provide some checksumming means[0]), or
anyway needs to recover from a backup.

Intuitively I'd also say it's much less likely that the data (which is
more in terms of space) is written correctly while the checksum is not.
Or is it?



[0] And when I investigated, back when the discussion first came up
and some list member claimed that most typical cases (DBs, VM
images) would do their own checksumming anyway,... I came to the
conclusion that most did not even support it, and even if they did,
it's not enabled by default and not really *full* checksumming in
most cases.



> As btrfs csum checker will just prevent you from reading out any
> data 
> which doesn't match with csum.
As I've said before, a tool could be provided that re-computes the
checksums (making the data accessible again)... or one could
simply mount the fs with nochecksum or some other special option which
allows bypassing any checks.

> Now it's not just data corruption, but data loss then.
I think the former is worse than the latter. The latter gives you a
chance of noticing it, and either recovering from a backup, regenerating the
data (if possible), or manually marking the data as "good" (though
corrupted) again.


Cheers,
Chris.



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Christoph Anton Mitterer
On Mon, 2017-08-14 at 14:36 +0800, Qu Wenruo wrote:
> > And how are you going to write your data and checksum atomically
> > when
> > doing in-place updates?
> 
> Exactly, that's the main reason I can figure out why btrfs disables 
> checksum for nodatacow.

Still, I don't get the problem here...

Yes it cannot be done atomically (without workarounds like a journal or
so), but this should be only an issue in case of a crash or similar.

And in this case nodatacow+nochecksum is anyway already bad, it's also
not atomic, so data may be completely garbage (e.g. half written)...
just that no one will ever notice.

The only problem that nodatacow + checksumming + non-atomicity should give
is when the data was actually correctly written at a crash but the
checksum was not, in which case the bogus checksum would invalidate the
good data on the next read.

Or do I miss something?


To me that sounds still much better than having no protection at all.


Cheers,
Chris.



btrfs-progs-v4.12: cross compiling

2017-08-14 Thread Hallo32

Hello at all,

I'm new at this list. If the mail is not in line with your standards 
please inform me.


Since version 4.12, btrfs-progs is complicated to cross-compile for 
other systems.
The problem is that this version includes mktables, which needs to be 
compiled for the host system and executed there to create tables.c.


Are there any changes planned for the next version of btrfs-progs to make 
cross-compiling as simple as in the past? An included tables.c, for 
example?


Best regards


Re: how to benchmark schedulers

2017-08-14 Thread Qu Wenruo



On 2017年08月08日 21:19, Janos Toth F. wrote:

I think you should consider using Linux 4.12 which has bfq (bfq-mq)
for blk-mq. So, you don't need out-of-tree BFQ patches if you switch
to blk-mq (but now you are free to do so even if you have HDDs or SSDs
which benefit from software schedulers since you have some multi-queue
schedulers for them). Just make sure to enable blk-mq (has to be a
boot parameter or build-time choice) in order to gain access to
bfq-mq. And remember that bfq-mq has to be activated manually (the
build-time choice for a default scheduler is not valid for multi-queue
schedulers, you will default to "none" which is effectively the new
"no-op").
Note: there is only one BFQ in 4.12 and it's bfq-mq which runs under
the name of simply BFQ (not bfq-mq, I only used that name to make it
clear that BFQ in 4.12 is a multi-queue version of BFQ).
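
The manual activation step looks something like this (device name is an assumption; needs root and a kernel with blk-mq enabled for the device):

```shell
#!/bin/sh
# Sketch: show the available schedulers for a block device and switch the
# active one to bfq. The active scheduler is printed in [brackets].
DEV=${DEV:-sda}
SCHED="/sys/block/$DEV/queue/scheduler"
if [ -w "$SCHED" ]; then
    cat "$SCHED"            # e.g. "[none] mq-deadline kyber bfq"
    echo bfq > "$SCHED"
    cat "$SCHED"            # bfq should now be bracketed
else
    echo "no writable scheduler knob for $DEV (need root / existing device)"
fi
```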

I always wondered if Btrfs makes any use of FUA if it's enabled for
compatible SATA devices (it's disabled by default because there are
some drives with faulty firmware).


At least, if you use blktrace, you can see that btrfs's superblock write 
submits its bio with FWFSM flags (Flush, Write, FUA, Sync, Meta).


So for a compatible SATA device, it is possible to use FUA.
It's the block driver that emulates FUA by issuing a Write and then a Flush 
for incompatible disks.
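
A sketch of how to check this on a given machine (the sysfs `fua` attribute is an assumption about the running kernel; blktrace needs root, so it is left as a comment):

```shell
#!/bin/sh
# Sketch: look for the drive's write-cache/FUA capability and, optionally,
# trace the FWFSM-flagged superblock writes.
DEV=${DEV:-sda}
# the kernel logs the capability at probe time, e.g. "Write cache: enabled"
dmesg 2>/dev/null | grep -i "write cache" || echo "no 'Write cache' line visible"
# newer kernels expose FUA support directly (1 = supported)
[ -r "/sys/block/$DEV/queue/fua" ] && cat "/sys/block/$DEV/queue/fua"
# to watch the superblock bios while running `sync` on a btrfs mount:
#   blktrace -d /dev/$DEV -o - | blkparse -i -
true
```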


However, in my previous test with a FUA-compatible HDD (you can check the 
kernel's "Write cache:" line), it turned out that, at least for a DB workload, 
the performance improvement is hard to notice, and sometimes FUA even reduces 
performance.


Maybe it's related to the workload (PostgreSQL), but FUA support may or 
may not affect performance as you expect.


Thanks,
Qu


I also wonder if RAID10 is any better (or actually worse?) for
metadata (and system) chunks than RAID1.

On Tue, Aug 8, 2017 at 1:59 PM, Bernhard Landauer  wrote:

Hello everyone

I am looking for a way to test different available schedulers with Manjaro's
bfq-patched kernels on systems with both SSD and spinning drives. Since
phoronix-test-suite apparently is currently useless for this task due to its
bad config for bfq I am looking for alternatives. Do you have any
suggestions for me?
Thank you.

kind regards
Bernhard Landauer





Re: Building a BTRFS test machine

2017-08-14 Thread Lakshmipathi.G
Definitely it will be useful.  For quite some time, I was thinking about:
1. Creating test cases for each BTRFS features. (ex:
https://btrfs.wiki.kernel.org/index.php/Scrub_corruption_cases)
2. Automate these  feature specific test scripts using bash/python
while maintaining them in public git repos.
3. Launching BTRFS test machines in the cloud (AWS spot instances are very cheap).
4. Run those feature specific tests and publish the results into wiki
or/and bugs into bugzilla.
5. Destroy the test machines.

Such a setup will require quite a bit of effort. As Qu mentioned, the
challenging part will be integrating the existing different tests and
making them automatic.

Another thought on BTRFS testing:
At this moment, when someone sends a patch, I assume it is automatically
pushed into patchwork.kernel.org (based on some back-end configuration,
I guess).

On the Btrfs mailing list we have a lot of bug reports, so using something
like [1] we could write a program which parses the mailing list for a
specific keyword like "BUGREPORT:" (similar to "PATCH:") and creates a new
bug or updates the details on bugzilla.kernel.org. This would ensure we
can keep track of bugs on bugzilla.

I think updating bugs based on mailing-list responses will be a little
trickier than creating new bugs, but something like that could be useful
as well.

I can help a bit on such testbed setup  :-) Thanks!


Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org


On Mon, Aug 14, 2017 at 6:31 AM, Cerem Cem ASLAN  wrote:
> Would it be useful to build a BTRFS test machine, which would perform
> both software tests (btrfs send | btrfs receive, read/write random
> data, etc.) and hardware tests, such as abrupt power-off tests, abruptly
> removing a RAID-X disk physically, etc.?
>
> If it would be useful, what tests should it cover?


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Qu Wenruo



On 2017年08月14日 15:43, Paul Jones wrote:

-Original Message-
From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
ow...@vger.kernel.org] On Behalf Of Qu Wenruo
Sent: Monday, 14 August 2017 4:37 PM
To: Christoph Hellwig ; Christoph Anton Mitterer

Cc: Btrfs BTRFS 
Subject: Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?



On 2017年08月12日 15:42, Christoph Hellwig wrote:

On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote:

Qu Wenruo wrote:

Although Btrfs can disable data CoW, nodatacow also disables data
checksum, which is another main feature for btrfs.


Then the two should probably be decoupled and support for
nodatacow+checksumming be implemented?!


And how are you going to write your data and checksum atomically when
doing in-place updates?


Exactly, that's the main reason I can figure out why btrfs disables checksum
for nodatacow.


But does it matter if it's not strictly atomic? By turning off COW it implies 
you accept the risk of an ill-timed failure.


The problem here is: if you enable csum, and even if the data is updated 
correctly but only the metadata is trashed, then you can't even read out 
the correct data.


As the btrfs csum checker will just prevent you from reading out any data 
which doesn't match its csum.


Now it's not just data corruption, but data loss then.

Thanks,
Qu


Although from my point of view any reason that would require COW to be disabled 
implies you're using the wrong filesystem anyway.

Paul.










RE: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Qu Wenruo
> Sent: Monday, 14 August 2017 4:37 PM
> To: Christoph Hellwig ; Christoph Anton Mitterer
> 
> Cc: Btrfs BTRFS 
> Subject: Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
> 
> 
> 
> On 2017年08月12日 15:42, Christoph Hellwig wrote:
> > On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote:
> >> Qu Wenruo wrote:
> >>> Although Btrfs can disable data CoW, nodatacow also disables data
> >>> checksum, which is another main feature for btrfs.
> >>
> >> Then the two should probably be decoupled and support for
> >> nodatacow+checksumming be implemented?!
> >
> > And how are you going to write your data and checksum atomically when
> > doing in-place updates?
> 
> Exactly, that's the main reason I can figure out why btrfs disables checksum
> for nodatacow.

But does it matter if it's not strictly atomic? By turning off COW it implies 
you accept the risk of an ill-timed failure. 
Although from my point of view any reason that would require COW to be disabled 
implies you're using the wrong filesystem anyway.

Paul.









Re: Building a BTRFS test machine

2017-08-14 Thread Qu Wenruo



On 2017年08月14日 09:01, Cerem Cem ASLAN wrote:

Would it be useful to build a BTRFS test machine, which would perform
both software tests (btrfs send | btrfs receive, read/write random
data, etc.) and hardware tests, such as abrupt power-off tests, abruptly
removing a RAID-X disk physically, etc.?


If you want to contribute to btrfs development or bug reports, then it's 
definitely helpful.




If it would be useful, what tests should it cover?


At least fstests (the old xfstests; some tests are known to fail, so you 
need some knowledge to judge which failures are real problems), the selftests 
from btrfs-progs, and fs-related tests from the Linux Test Project.
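
For reference, a minimal sketch of pointing fstests at btrfs (device paths are assumptions; the config is written to a temp file here so nothing gets clobbered):

```shell
#!/bin/sh
# fstests reads local.config from its top-level directory; these are the
# btrfs-relevant variables. SCRATCH_DEV_POOL (rather than SCRATCH_DEV)
# lets the multi-device btrfs tests run.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
export FSTYP=btrfs
export TEST_DEV=/dev/vdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV_POOL="/dev/vdc /dev/vdd /dev/vde /dev/vdf"
export SCRATCH_MNT=/mnt/scratch
EOF
# then, from an fstests checkout as root:
#   cp "$cfg" local.config && ./check -g quick
echo "sample config written to $cfg"
```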


The hard part is to integrate all these different tests and make them 
automatic.


I'm not familiar with hardware crash testing, but it would definitely help, 
as currently we're only using dm-flakey to *emulate* power loss, which is 
still not a true power-off.


Other tests, like performance tests from various other projects such as 
postgresql/apache, would also help.


Thanks,
Qu






Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Qu Wenruo



On 2017年08月13日 22:08, Goffredo Baroncelli wrote:

On 08/12/2017 02:12 PM, Hugo Mills wrote:

On Sat, Aug 12, 2017 at 01:51:46PM +0200, Christoph Anton Mitterer wrote:

On Sat, 2017-08-12 at 00:42 -0700, Christoph Hellwig wrote:

[...]

   good, but csum is not


I don't think this is a particularly good description of the
problem. I'd say it's more like this:

If you write data and metadata separately (which you have to do in
the nodatacow case), and the system halts between the two writes, then
you either have the new data with the old csum, or the old csum with
the new data. Both data and csum are "good", but good from different
states of the FS. In both cases (data first or metadata first), the
csum doesn't match the data, and so you now have an I/O error reported
when trying to read that data.

You can't easily fix this, because when the data and csum don't
match, you need to know the _reason_ they don't match -- is it because
the machine was interrupted during write (in which case you can fix
it), or is it because the hard disk has had someone write data to it
directly, and the data is now toast (in which case you shouldn't fix
the I/O error)?


I am still inclined to think that this kind of problem could be solved using a 
journal: if you track which blocks are updated in the transaction and their 
checksums, then if the transaction is interrupted you can always rebuild the 
data/checksum pair.
In case of interruption of a transaction:
- all COW data is trashed
- some NOCOW data might be written
- all metadata (which is COW) is trashed


The idea itself sounds good; however, btrfs doesn't use a journal (yet), 
which means we would need to introduce one, while btrfs currently uses 
metadata CoW to handle most of the work a journal would do.




Supposing BTRFS logged, for each transaction, which "data NOCOW blocks" will be 
updated and their checksums, then in case a transaction is interrupted you know 
which blocks have to be checked and can verify whether the checksum matches and 
correct the mismatch. Logging the checksum could also help to identify whether:
- the data is old
- the data is updated
- the updated data is correct

The same approach could also be used to solve the issue related to the 
infamous RAID5/6 hole: by logging which blocks are updated, in case of an 
aborted transaction you can check which parity has to be rebuilt.

Indeed, Liu is using a journal to solve the RAID5/6 write hole.

But to address the lack-of-journal nature of btrfs, he introduced a 
journal device to handle it; since btrfs metadata is either written or 
trashed, we can't rely on existing btrfs metadata to handle the journal.


PS: This reminds me why ZFS still uses a journal (called the ZFS intent 
log) rather than the mandatory metadata CoW of btrfs.


Thanks,
Qu





Basically, nodatacow bypasses the very mechanisms that are meant to
provide consistency in the filesystem.

Hugo.







Re: [PATCH] btrfs: use BTRFS_FSID_SIZE for fsid

2017-08-14 Thread Nikolay Borisov


On 13.08.2017 06:58, Anand Jain wrote:
> We have a define for the FSID size, so use it.
> 
> Signed-off-by: Anand Jain 

I like consistency!

Reviewed-by: Nikolay Borisov 


[PATCH] fstests: btrfs: enhance regression test for nocsum dio read's repair

2017-08-14 Thread Lu Fengqi
I caught the following error from dmesg when this test case fails.

[17446.661127] Buffer I/O error on dev sdb1, logical block 64, async page read

We expect to inject disk IO errors on the device when xfs_io reads the
specific file, but other processes may trigger the IO error earlier. So we
can use the task-filter to solve this problem.

Signed-off-by: Lu Fengqi 
---
 tests/btrfs/142 | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/tests/btrfs/142 b/tests/btrfs/142
index 414af1b2..5bd8b728 100755
--- a/tests/btrfs/142
+++ b/tests/btrfs/142
@@ -75,6 +75,7 @@ start_fail()
 {
echo 100 > $DEBUGFS_MNT/fail_make_request/probability
echo 2 > $DEBUGFS_MNT/fail_make_request/times
+   echo 1 > $DEBUGFS_MNT/fail_make_request/task-filter
echo 0 > $DEBUGFS_MNT/fail_make_request/verbose
echo 1 > $SYSFS_BDEV/make-it-fail
 }
@@ -83,6 +84,7 @@ stop_fail()
 {
echo 0 > $DEBUGFS_MNT/fail_make_request/probability
echo 0 > $DEBUGFS_MNT/fail_make_request/times
+   echo 0 > $DEBUGFS_MNT/fail_make_request/task-filter
echo 0 > $SYSFS_BDEV/make-it-fail
 }
 
@@ -118,16 +120,17 @@ echo "step 3..repair the bad copy" >>$seqres.full
 # since raid1 consists of two copies, and the bad copy was put on stripe #1
 # while the good copy lies on stripe #0, the bad copy only gets access when the
 # reader's pid % 2 == 1 is true
-while true; do
-   # start_fail only fails the following dio read so the repair is
+start_fail
+while [[ -z ${result} ]]; do
+   # enable task-filter only fails the following dio read so the repair is
# supposed to work.
-   start_fail
-   $XFS_IO_PROG -d -c "pread -b 128K 0 128K" "$SCRATCH_MNT/foobar" > /dev/null &
-   pid=$!
-   wait
-   stop_fail
-   [ $((pid % 2)) == 1 ] && break
+   result=$(bash -c "
+   if [[ \$((\$\$ % 2)) -eq 1 ]]; then
+   echo 1 > /proc/\$\$/make-it-fail
+   exec $XFS_IO_PROG -d -c \"pread -b 128K 0 128K\" \"$SCRATCH_MNT/foobar\"
+   fi");
 done
+stop_fail
 
 _scratch_unmount
 
-- 
2.14.1





Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-14 Thread Qu Wenruo



On 2017年08月12日 15:42, Christoph Hellwig wrote:

On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote:

Qu Wenruo wrote:

Although Btrfs can disable data CoW, nodatacow also disables data
checksum, which is another main feature for btrfs.


Then the two should probably be decoupled and support for
nodatacow+checksumming be implemented?!


And how are you going to write your data and checksum atomically when
doing in-place updates?


Exactly, that's the main reason I can think of for why btrfs disables 
checksums for nodatacow.


Thanks,
Qu



