[PATCH 1/3] btrfs-progs: test: umount if confirmation failed

2015-12-04 Thread Naohiro Aota
When a check in check_inode() fails, the test should unmount the test target
file system. This commit adds a cleanup umount line to each failure path.

Signed-off-by: Naohiro Aota 
---
 tests/fsck-tests/012-leaf-corruption/test.sh | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tests/fsck-tests/012-leaf-corruption/test.sh b/tests/fsck-tests/012-leaf-corruption/test.sh
index 6e23145..bfdd0ea 100755
--- a/tests/fsck-tests/012-leaf-corruption/test.sh
+++ b/tests/fsck-tests/012-leaf-corruption/test.sh
@@ -57,6 +57,7 @@ check_inode()
# Check whether the inode exists
exists=$($SUDO_HELPER find $path -inum $ino)
if [ -z "$exists" ]; then
+   $SUDO_HELPER umount $TEST_MNT
_fail "inode $ino not recovered correctly"
fi
 
@@ -64,17 +65,20 @@ check_inode()
found_mode=$(printf "%o" 0x$($SUDO_HELPER stat $exists -c %f))
if [ $found_mode -ne $mode ]; then
echo "$found_mode"
+   $SUDO_HELPER umount $TEST_MNT
_fail "inode $ino modes not recovered"
fi
 
# Check inode size
found_size=$($SUDO_HELPER stat $exists -c %s)
if [ $mode -ne 41700 -a $found_size -ne $size ]; then
+   $SUDO_HELPER umount $TEST_MNT
_fail "inode $ino size not recovered correctly"
fi
 
# Check inode name
if [ "$(basename $exists)" != "$name" ]; then
+   $SUDO_HELPER umount $TEST_MNT
_fail "inode $ino name not recovered correctly"
else
return 0
-- 
2.6.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] btrfs-progs: properly reset nlink of multi-linked file

2015-12-04 Thread Naohiro Aota
If a file is linked from more than one directory and only one
of the links is corrupted, btrfs check does not reset the nlink
properly. In fact, it can go into an infinite loop linking the broken file
into lost+found.

This patch fixes two parts of the code. The first fix delays the freeing of
valid (no error, found inode ref, directory index, and directory
item) backrefs. Freeing valid backrefs earlier prevents reset_nlink() from
adding back all valid links.

The second fix is obvious: passing `ref_type' to btrfs_add_link() is just
wrong. It should be `filetype' instead. The current code can break all valid
file links.

Signed-off-by: Naohiro Aota 
---
 cmds-check.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 6a0b50a..11ff3fe 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -810,7 +810,8 @@ static void maybe_free_inode_rec(struct cache_tree *inode_cache,
if (backref->found_dir_item && backref->found_dir_index) {
if (backref->filetype != filetype)
backref->errors |= REF_ERR_FILETYPE_UNMATCH;
-   if (!backref->errors && backref->found_inode_ref) {
+   if (!backref->errors && backref->found_inode_ref &&
+   rec->nlink == rec->found_link) {
list_del(&backref->list);
free(backref);
}
@@ -2392,7 +2393,7 @@ static int reset_nlink(struct btrfs_trans_handle *trans,
list_for_each_entry(backref, &rec->backrefs, list) {
ret = btrfs_add_link(trans, root, rec->ino, backref->dir,
 backref->name, backref->namelen,
-backref->ref_type, &backref->index, 1);
+backref->filetype, &backref->index, 1);
if (ret < 0)
goto out;
}
-- 
2.6.3



Linux 4.3 call traces for defective disk

2015-12-04 Thread Wolfgang Rohdewald
I have a defective disk which produced kernel backtraces
(see below).

Are you interested in them, what else do you need to know, do you
prefer things inline or as attachments?

unmodified Linux 4.3 tainted with nvidia driver

disk: WDC WD2002FYPS-02W3B0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

I mounted the disk normally (no RAID) and copied files from it.
I know I should have mounted readonly ...
Meanwhile the disk data is really corrupt, even after having it cool down 
overnight.
btrfs check -sX fails for X in 0..5. So since mounting is no longer possible, 
I cannot produce new call traces.

smartctl still says PASSED.

The data loss is no problem.

Dec  4 08:48:08 s5 kernel: [  114.814022] ata5.00: irq_stat 0x4008
Dec  4 08:48:08 s5 kernel: [  114.814024] ata5.00: failed command: READ FPDMA QUEUED
Dec  4 08:48:08 s5 kernel: [  114.814028] ata5.00: cmd 60/08:60:07:8e:03/00:00:00:00:00/40 tag 12 ncq 4096 in
Dec  4 08:48:08 s5 kernel: [  114.814028]  res 41/40:00:0e:8e:03/00:00:00:00:00/40 Emask 0x409 (media error)
Dec  4 08:48:08 s5 kernel: [  114.814029] ata5.00: status: { DRDY ERR }
Dec  4 08:48:08 s5 kernel: [  114.814030] ata5.00: error: { UNC }
Dec  4 08:48:08 s5 kernel: [  114.822313] ata5.00: configured for UDMA/133
Dec  4 08:48:08 s5 kernel: [  114.822322] sd 4:0:0:0: [sde] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Dec  4 08:48:08 s5 kernel: [  114.822324] sd 4:0:0:0: [sde] tag#12 Sense Key : 0x3 [current] [descriptor]
Dec  4 08:48:08 s5 kernel: [  114.822326] sd 4:0:0:0: [sde] tag#12 ASC=0x11 ASCQ=0x4
Dec  4 08:48:08 s5 kernel: [  114.822328] sd 4:0:0:0: [sde] tag#12 CDB: opcode=0x28 28 00 00 03 8e 07 00 00 08 00
Dec  4 08:48:08 s5 kernel: [  114.822329] blk_update_request: I/O error, dev sde, sector 232974
Dec  4 08:48:08 s5 kernel: [  114.822340] ata5: EH complete
Dec  4 08:48:08 s5 kernel: [  114.822360] BTRFS: failed to read tree root on sde1


And this is one of the six backtraces I got (BTW, all six are different):

Dec  3 11:39:45 s5 kernel: [ 8393.928639] ata5: link is slow to respond, please be patient (ready=0)
Dec  3 11:39:46 s5 kernel: [ 8395.160246] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  3 11:39:46 s5 kernel: [ 8395.164216] ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Dec  3 11:39:46 s5 kernel: [ 8395.164219] ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Dec  3 11:39:46 s5 kernel: [ 8395.164220] ata5.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Dec  3 11:39:46 s5 kernel: [ 8395.185378] ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Dec  3 11:39:46 s5 kernel: [ 8395.185381] ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Dec  3 11:39:46 s5 kernel: [ 8395.185383] ata5.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Dec  3 11:39:46 s5 kernel: [ 8395.190195] ata5.00: configured for UDMA/133
Dec  3 11:39:46 s5 kernel: [ 8395.204218] ata5: EH complete
Dec  3 11:39:57 s5 kernel: [ 8406.044742] ata5.00: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Dec  3 11:39:57 s5 kernel: [ 8406.044746] ata5.00: irq_stat 0x00400040, connection status changed
Dec  3 11:39:57 s5 kernel: [ 8406.044747] ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
Dec  3 11:39:57 s5 kernel: [ 8406.044749] ata5.00: failed command: FLUSH CACHE EXT
Dec  3 11:39:57 s5 kernel: [ 8406.044752] ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2
Dec  3 11:39:57 s5 kernel: [ 8406.044752]  res 40/00:0c:bf:8f:03/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Dec  3 11:39:57 s5 kernel: [ 8406.044753] ata5.00: status: { DRDY }
Dec  3 11:39:57 s5 kernel: [ 8406.044756] ata5: hard resetting link
Dec  3 11:40:03 s5 kernel: [ 8411.806856] ata5: link is slow to respond, please be patient (ready=0)
Dec  3 11:40:04 s5 kernel: [ 8413.038465] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Dec  3 11:40:04 s5 kernel: [ 8413.043051] ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Dec  3 11:40:04 s5 kernel: [ 8413.043054] ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Dec  3 11:40:04 s5 kernel: [ 8413.043056] ata5.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION OVERLAY) filtered out
Dec  3 11:40:04 s5 kernel: [ 8413.064667] ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
Dec  3 11:40:04 s5 kernel: [ 8413.064670] ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK) filtered out
Dec  3 11:40:04 s5 

[PATCH 0/3] btrfs-progs: fix file restore to lost+found bug

2015-12-04 Thread Naohiro Aota
This series addresses an issue where btrfsck restores an infinite number of
copies of the same file into the `lost+found' directory. The issue occurs on a
file which is linked from two different directories, A and B. If the links from
dir A are corrupted and the links from dir B are still valid, btrfsck won't stop
creating files in lost+found, like this:

-
Moving file 'file.del.51' to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1876
Trying to rebuild inode:1877
Moving file 'del' to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1877
Can't get file name for inode 1876, using '1876' as fallback
Moving file '1876' to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1876
Can't get file name for inode 1876, using '1876' as fallback
Moving file '1876.1876' to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1876
(snip)
Moving file 
'1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876'
 to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1876
Can't get file name for inode 1876, using '1876' as fallback
Can't get file name for inode 1876, using '1876' as fallback
Can't get file name for inode 1876, using '1876' as fallback
-

The problem is the early release of inode backrefs. The release prevents
`reset_nlink()' from adding valid backrefs back to an inode. As a result,
the following sequence occurs:

0. btrfsck scans an FS tree
1. It finds valid links and invalid links (some links are lost)
2. All valid links are released
3. btrfsck detects found_links != nlink
4. reset_nlink() resets nlink to 0
5. No valid links are restored (thus still nlink = 0)
6. The file is restored to lost+found since nlink == 0 (now, nlink = 1)
7. btrfsck rescans the FS tree
8. It finds `found_links' = #valid_links+1 (in lost+found) and nlink = 1
9. Again all valid links are released, and the file is restored to lost+found once more

The first patch adds cleanup code to the test: it unmounts the test
directory on the failure path. The second patch fixes the above problem.
The last patch extends the test to check the case of multiple-linked file
corruption.
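
The loop described in steps 0-9 can be illustrated with a small toy model. This is only a sketch under stated assumptions, not btrfs-progs code: the function `repair_rounds`, its parameters, and the `lost+found/N` names are hypothetical, and a single inode with a list of directory entries stands in for the real backref bookkeeping.

```python
def repair_rounds(valid_links, nlink, release_early, max_rounds=5):
    """Toy model of repeated check passes over one inode (hypothetical).

    valid_links   -- directory entries that still point at the inode
    nlink         -- link count stored in the inode item
    release_early -- True models the old behaviour, where valid backrefs
                     are freed before the nlink repair runs
    Returns the number of repair rounds needed, or None if the link
    count still disagrees after max_rounds.
    """
    links = list(valid_links)
    for rounds_done in range(max_rounds):
        if len(links) == nlink:
            return rounds_done  # found_links == nlink, nothing to repair

        # Backrefs still remembered at the time the repair runs.
        backrefs = [] if release_early else list(links)

        # Model of the nlink reset: start from 0, add back remembered links.
        nlink = len(backrefs)

        if nlink == 0:
            # No link was restored, so the file is moved to lost+found.
            links.append("lost+found/%d" % rounds_done)
            nlink = 1

    return max_rounds if len(links) == nlink else None


# One link (from dir A) was lost, the link from dir B is still valid.
print(repair_rounds(["dirB/file"], nlink=2, release_early=True))
print(repair_rounds(["dirB/file"], nlink=2, release_early=False))
```

With `release_early=True` every round adds another lost+found entry and the counts never agree; with `release_early=False` the valid link is added back and the next pass finds a consistent inode.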



Re: Bug/regression: Read-only mount not read-only

2015-12-04 Thread Austin S Hemmelgarn

On 2015-12-02 18:40, Qu Wenruo wrote:



On 12/03/2015 06:48 AM, Eric Sandeen wrote:

On 12/2/15 11:48 AM, Austin S Hemmelgarn wrote:


On a side note, do either XFS or ext4 support removing the norecovery
option from the mount flags through mount -o remount?  Even if they
don't, that might be a nice feature to have in BTRFS if we can safely
support it.


It's not remountable today on xfs:

 /* ro -> rw */
 if ((mp->m_flags & XFS_MOUNT_RDONLY) && !(*flags & MS_RDONLY)) {
 if (mp->m_flags & XFS_MOUNT_NORECOVERY) {
 xfs_warn(mp,
 "ro->rw transition prohibited on norecovery mount");
 return -EINVAL;
 }

not sure about ext4.

-Eric


Not being remountable is very convenient for implementing it;
it makes things super easy to do.

Otherwise we would need to add log replay at remount time.

I'd like to implement it first for the non-remountable case as a trial.
As for the option name, I prefer something like "notreereplay", but I
don't consider it the best one yet.

I entirely understand wanting a simple implementation first, my only 
point is that it would be a potentially useful feature to have if we 
could sanely implement it.







3.16.0 Debian kernel hang

2015-12-04 Thread Russell Coker
One of my test laptops started hanging on mounting the root filesystem.  I 
think that it had experienced an unexpected power outage prior to that which 
may have caused corruption.

When I tried to mount the root filesystem the mount process would stick in D 
state, there would be no disk IO, and the computer would get hot - presumably 
due to kernel CPU use even though "top" didn't seem to indicate that.

When I mounted the filesystem with a 4.2.0 kernel it said "The free space cache 
file (1103101952) is invalid, skip it" and then things worked.  Now that the 
machine is running 4.2.0 everything is fine.

I know that there are no plans to backport things to 3.16 and I don't think 
the Debian people are going to be very interested in this.  So this message is 
a FYI for users, maybe consider not using the Debian/Jessie kernel for BTRFS 
systems.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Subvolume UUID, data corruption?

2015-12-04 Thread S.J
Hello

As we know, two file systems with the same UUID (as reported by e.g. "blkid")
are problematic; especially if both are mounted at the same time, it leads to
data corruption. So copying a BTRFS partition with e.g. dd to another and using
it immediately is bad. To prevent this, "btrfstune -u /dev/sdaX" changes the
UUID of the given partition.

However, BTRFS subvolumes have their own UUIDs, which can be viewed e.g. with
"btrfs sub list -u /mountpoint". These UUIDs are not changed by the command
above, and apparently there is no other way to do this.

My question is: Is this a problem similar to the main UUID? Can mounting two
BTRFS partitions with equal subvolume UUIDs (but different main UUIDs) cause
data corruption?

(...well, and maybe someone could explain to me what these subvol UUIDs are for
in the first place. Subvolumes already have a unique number, and from the user's
p.o.v. there isn't anything where the subvol UUIDs can be used at all (?))

Thank you

PS: Apologies for sending a second mail, somehow my first try didn't contain 
any text


Re: btrfs crashing the kernel with Seagate 8TB SMR drives.

2015-12-04 Thread Robert Krig
As Chris mentioned, check out the Bug report here:
https://bugzilla.kernel.org/show_bug.cgi?id=93581


I have an 8TB SMR drive and the kernel was reporting drive errors.
Switching to kernel 3.16 (standard Debian Jessie kernel) fixed it for me
(for the moment).

From what I read in that kernel bug report, the patch has been submitted
for kernel 4.4.

On 03.12.2015 19:07, Codebird wrote:
> I've got a nice bug for you - because I can offer you what everyone
> likes to see, a precise error message.
>
> I've got a btrfs filesystem spread over six devices, RAID1 mode. Four
> of these are Seagate 8TB archive drives - those SMR ones that a few
> others have reported failing when used with btrfs. I've had that issue
> too, and I just can't explain why, other than to say that it only
> occurs when using them on my mainboard SATA ports, not via USB dock.
> But that's not what I'm reporting - that's just the source of the
> problem that causes the crash I am reporting.
>
> The crash occurs when scrubbing, after some time and some terabytes -
> or possibly just when reading a certain place, I'm not sure - and it
> gives this helpful error left on the screen along with a system so
> unresponsive numlock won't flash:
>
> BTRFS: Error (device sdg1) in  __btrfs_free_extent:6360: errno=-5 IO
> failure
> BTRFS: Error (device sdg1) in  __btrfs_free_extent:6360: errno=-5 IO
> failure
> BTRFS: Error (device sdg1) in  btrfs_run_delayed_refs:2851: errno=-5
> IO failure
> BTRFS: Error (device sdg1) in  btrfs_run_delayed_refs:2851: errno=-5
> IO failure
> BTRFS: Error (device sdg1) in  btrfs_run_delayed_refs:2851: errno=-5
> IO failure
>  BTRFS: assertion failed:
> f(fs_info->sb->s_flags & MS  
> ---[ cut here ]
> kernel BUG at ../fs/btrfs/ctree.h:4057!
>
> Not sure if some of those 5 might be 6, as I was in a hurry to get it
> back up both times and just got a blurry photo. But it looks to me
> like there might be a chunk of code that doesn't handle a hardware
> fault - rather than cleanly return an error it's causing the kernel to
> hang entirely. I've managed to get this to happen twice now, so it's
> certainly something worth looking into. This is on SUSE tumbleweed,
> with kernel 4.3.0-2-default.



Re: compression disk space saving - what are your results?

2015-12-04 Thread Austin S Hemmelgarn

On 2015-12-03 01:29, Duncan wrote:

Austin S Hemmelgarn posted on Wed, 02 Dec 2015 09:39:08 -0500 as
excerpted:


On 2015-12-02 09:03, Imran Geriskovan wrote:

What are your disk space savings when using btrfs with compression?



[Some] posters have reported that for mostly text, compress didn't
give them expected compression results and they needed to use
compress-force.


"compress-force" option compresses regardless of the "compressibility"
of the file.

"compress" option makes some inference about the "compressibility" and
decides to compress or not.

I wonder how that inference is done?
Can anyone provide some pseudo code for it?



I'm not certain how BTRFS does it, but my guess would be trying to
compress the block, then storing the uncompressed version if the
compressed one is bigger.


No pseudocode as I'm not a dev and wouldn't want to give the wrong
impression, but as I believe I replied recently in another thread, based
on comments the devs have made...

With compress, btrfs does a(n intended to be fast) trial compression of
the first 128 KiB block or two and uses the result of that to decide
whether to compress the entire file.

Compress-force simply bypasses that first decision point, processing the
file as if the test always succeeded and compression was chosen.

If the decision to compress is made, the file is (evidently, again, not a
dev, but filefrag results support) compressed a 128 KiB block at a time
with the resulting size compared against the uncompressed version, with
the smaller version stored.

(Filefrag doesn't understand btrfs compression and reports individual
extents for each 128 KiB compression block, if compressed.  However, for
many files processed with compress-force, filefrag doesn't report the
expected size/128-KiB extents, but rather something lower.  If
filefrag -v is used, details of each "extent" are listed, and some show
up as multiples of 128 KiB, indicating runs of incompressible blocks that,
unlike actually compressed blocks, filefrag can and does report correctly
as single extents.  The conclusion is thus as above, that btrfs is
testing the compression result of each block, and not compressing if the
"compression" ends up being negative, that is, if the "compressed" size
is larger than the uncompressed size.)
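
The behaviour described above can be sketched roughly as follows. This is an illustration of the description only, not the kernel's actual heuristic: `zlib` stands in for the btrfs compressors, the trial looks at just the first 128 KiB block, and the names `trial_says_compress` and `store_file` are hypothetical.

```python
import os
import zlib

BLOCK = 128 * 1024  # compression block size mentioned above


def trial_says_compress(data):
    """Fast trial compression of the first block (models 'compress')."""
    sample = data[:BLOCK]
    return len(zlib.compress(sample, 1)) < len(sample)


def store_file(data, force=False):
    """Return a list of (kind, payload) extents for one file.

    force=True models 'compress-force': the up-front trial is skipped,
    but each block is still stored raw if compression does not shrink it.
    """
    if not force and not trial_says_compress(data):
        return [("raw", data)]  # whole file left uncompressed

    extents = []
    for off in range(0, len(data), BLOCK):
        block = data[off:off + BLOCK]
        packed = zlib.compress(block)
        if len(packed) < len(block):
            extents.append(("zlib", packed))
        else:
            extents.append(("raw", block))  # "negative" compression
    return extents


# A file whose first block is incompressible but whose rest is plain text.
noise = os.urandom(BLOCK)
text = b"hello world\n" * 50000
print([kind for kind, _ in store_file(noise + text)])
print([kind for kind, _ in store_file(noise + text, force=True)])
```

In this model the mostly-text file is left entirely uncompressed without `force`, because the trial only sees the incompressible start, which matches the reports of users needing compress-force for such data.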


On a side note, I really wish BTRFS would just add LZ4 support.  It's a
lot more deterministic WRT decompression time than LZO, gets a similar
compression ratio, and runs faster on most processors for both
compression and decompression.


There were patches (at least RFC level, IIRC) floating around years ago
to add lz4... I wonder what happened to them?  My impression was that a
large deployment somewhere may actually be running them as well, making
them well tested (and obviously well beyond preliminary RFC level) by
now, although that impression could well be wrong.

Hmm, I'll have to see if I can find those and rebase them.  IIRC, the 
argument against adding it was 'but we already have a fast compression 
algorithm!', which in turn says to me they didn't try to sell it on the 
most significant parts, namely that it's faster at decompression than 
LZO (even when you use the lz4hc variant, which takes longer to compress 
to give a (usually) better compression ratio, but decompresses just as 
fast as regular lz4), and the timings are a lot more deterministic 
(which is really important if you're doing real-time stuff).






Re: 3.16.0 Debian kernel hang

2015-12-04 Thread Austin S Hemmelgarn

On 2015-12-04 05:00, Russell Coker wrote:

One of my test laptops started hanging on mounting the root filesystem.  I
think that it had experienced an unexpected power outage prior to that which
may have caused corruption.

When I tried to mount the root filesystem the mount process would stick in D
state, there would be no disk IO, and the computer would get hot - presumably
due to kernel CPU use even though "top" didn't seem to indicate that.

When I mounted the filesystem with a 4.2.0 kernel it said "The free space cache
file (1103101952) is invalid, skip it" and then things worked.  Now that the
machine is running 4.2.0 everything is fine.

I know that there are no plans to backport things to 3.16 and I don't think
the Debian people are going to be very interested in this.  So this message is
a FYI for users, maybe consider not using the Debian/Jessie kernel for BTRFS
systems.


I'd suggest extending that suggestion to:
If you're not using an Enterprise distro (RHEL, SLES, CentOS, OEL), then 
you should probably be building your own kernel, ideally using upstream 
sources.


Ubuntu is notorious for picking 'stable' kernels that then fail to be 
marked by kernel.org as LTS, Debian picks kernels that are multiple 
versions old by the time they make a release, and I've heard similar 
from other non-enterprise distros that don't inherently make you build 
your own kernel.  Even among ones that you have to build the kernel 
yourself anyway, there are issues (Gentoo for example doesn't often mark 
new kernels as stable, even when they are perfectly usable for pretty 
much everyone).






Re: 3.16.0 Debian kernel hang

2015-12-04 Thread Russell Coker
On Sat, 5 Dec 2015 12:53:07 AM Austin S Hemmelgarn wrote:
> > The only reason I'm not running Unstable kernels on my Debian systems is
> > because I run some Xen servers and upgrading Xen is problematic.  Linode
> > is moving from Xen to KVM so I guess I should consider doing the
> > same.  If I migrate my Xen servers to KVM I can use newer kernels with
> > less risk.
> 
> That's interesting, that must be something with how they do kernel 
> development in Debian, because I've never had any issues upgrading 
> either Xen or Linux on any of the systems I've run Xen on, and I 
> directly track mainline (with a small number of patches) for Linux, and 
> stay relatively close to mainline with Xen (Gentoo doesn't have all that 
> many patches on top of the regular release for Xen, aside from XSA
> patches).

I don't think that Debian does anything wrong in this regard.  It's just that 
my experience of Xen is that it is fragile at the best of times.  The fact 
that Red Hat packaged the Xen kernel in the Linux kernel package is a major 
indication of Xen problems IMHO, the concept of Xen is that it shouldn't be 
tied to a Linux kernel.

If you haven't had Xen issues then I think you have been lucky.

-- 
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/


Re: compression disk space saving - what are your results?

2015-12-04 Thread Austin S Hemmelgarn

On 2015-12-03 07:09, Imran Geriskovan wrote:

On a side note, I really wish BTRFS would just add LZ4 support.  It's a
lot more deterministic WRT decompression time than LZO, gets a similar
compression ratio, and runs faster on most processors for both
compression and decompression.


Relative ratios according to
http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

Compressed size
gzip (1) - lzo (1.4) - lz4 (1.4)

Compression Time
gzip (5) - lzo (1) - lz4 (0.8)

Decompression Time
gzip (9) - lzo (4) - lz4 (1)

Compression Memory
gzip (1) - lzo (2) - lz4 (20)

Decompression Memory
gzip (1) - lzo (2) - lz4 (130). Yes 130! not a typo.

But there is a note:
Note: lz4 it's the program using this size, the
code for internal lz4 use very less memory.

However, I could not find any better apples to apples
comparison.

If lz4's real memory consumption is on the order of lzo's,
then it looks good.
AFAICT, it's similar memory consumption.  I did some tests a while back 
comparing the options for kernel image compression using a VM, and one 
of the things I tested (although I can't for the life of me remember how 
exactly except that it involved using QEMU hooked up to GDB) was 
run-time decompressor footprint.  LZO really should have a smaller 
memory footprint too, it's just that lzop needs to handle almost a dozen 
different LZO compression formats.






Re: Bug/regression: Read-only mount not read-only

2015-12-04 Thread Austin S Hemmelgarn

On 2015-12-02 18:51, Hugo Mills wrote:

On Thu, Dec 03, 2015 at 07:40:08AM +0800, Qu Wenruo wrote:



On 12/03/2015 06:48 AM, Eric Sandeen wrote:

On 12/2/15 11:48 AM, Austin S Hemmelgarn wrote:


On a side note, do either XFS or ext4 support removing the norecovery
option from the mount flags through mount -o remount?  Even if they
don't, that might be a nice feature to have in BTRFS if we can safely
support it.


It's not remountable today on xfs:

 /* ro -> rw */
 if ((mp->m_flags & XFS_MOUNT_RDONLY) && !(*flags & MS_RDONLY)) {
 if (mp->m_flags & XFS_MOUNT_NORECOVERY) {
 xfs_warn(mp,
 "ro->rw transition prohibited on norecovery mount");
 return -EINVAL;
 }

not sure about ext4.

-Eric


Not being remountable is very convenient for implementing it;
it makes things super easy to do.

Otherwise we would need to add log replay at remount time.

I'd like to implement it first for the non-remountable case as a trial.
As for the option name, I prefer something like "notreereplay", but
I don't consider it the best one yet.


Thinking out loud...

no-log-replay, no-log, hard-ro, ro-log,
really-read-only-i-mean-it-this-time-honest-guvnor

Delete hyphens at your pleasure.

Personally, I think no-log-replay (with or without hyphens) is the most 
concise option name.  With something like this, it should be as clear as 
possible what is being done.







Re: BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!

2015-12-04 Thread David Sterba
On Fri, Dec 04, 2015 at 09:21:59AM +0800, Qu Wenruo wrote:
> > We do have the alignment check in kernel, but it's in the early phase
> > where we don't know if nodesize is reliable and print only a warning.
> >
> This can be enhanced by the following method:

At minimum, we can promote the 4k alignment checks in
btrfs_check_super_valid from a warning to an error. The blocks must be
4k aligned, regardless of sectorsize or nodesize.

> 1) Check sectorsize first
> Only a few sector sizes are valid for current btrfs:
> 4K, 8K, 16K, 32K, 64K
> Just five numbers, quite easy to check.

The sectorsize must be PAGE_SIZE at the moment. This will change with
Chandan's patchset though.

> Or if anyone is going to extend the supported sectorsizes, we can change
> the check to whether the number is a power of 2 starting from 4K.
> 
> 2) Check nodesize/leafsize then
> It should be aligned to sectorsize.

This particular check is missing but is implicit because of the
sectorsize == PAGE_SIZE restriction.

> And nodesize must match with leafsize.
> Currently, it's done out of check_super_valid(), we can integrate it.

Yeah, it's done, so I don't see why we should add it again.

> 3) Check all super root bytenr against *sectorsize*
> Yeah, not nodesize.
> As some old bad converts will cause metadata extents unaligned to
> nodesize (just before my convert rework patch), but only aligned to
> sectorsize.
> So only check alignment against sectorsize.

While the real check should be against the sectorsize, at the moment I
think it's covered by the 4k checks anyway. I understand why we can't
use the nodesize.

So, if we do the warning -> error, we're fine for now. Some of the
checks you suggest would be good to merge when the subpage blocksize
patchset is merged.
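
The checks discussed in this message can be sketched as follows. This is a hedged illustration only, not the kernel's btrfs_check_super_valid: the function `check_super`, its arguments, and the error strings are hypothetical, and the allowed sectorsize range (powers of 2 from 4K to 64K) follows the suggestion quoted above rather than the current sectorsize == PAGE_SIZE restriction.

```python
FOUR_K = 4096
MAX_SECTORSIZE = 65536


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def check_super(sectorsize, nodesize, root_bytenrs):
    """Return a list of error strings; an empty list means all checks pass.

    root_bytenrs maps a (hypothetical) tree name to its block address.
    """
    errors = []

    # 1) sectorsize: power of 2 between 4K and 64K.
    if not (is_power_of_two(sectorsize)
            and FOUR_K <= sectorsize <= MAX_SECTORSIZE):
        errors.append("invalid sectorsize %d" % sectorsize)
    # 2) nodesize must be aligned to sectorsize.
    elif nodesize % sectorsize != 0:
        errors.append("nodesize %d not aligned to sectorsize %d"
                      % (nodesize, sectorsize))

    # 3) every root block must be 4K aligned; this is an error,
    #    not just a warning.
    for name, bytenr in sorted(root_bytenrs.items()):
        if bytenr % FOUR_K != 0:
            errors.append("%s bytenr %d not 4K aligned" % (name, bytenr))

    return errors


print(check_super(4096, 16384, {"root": 29360128, "chunk_root": 20971520}))
print(check_super(4096, 16384, {"root": 4097}))
```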


Re: [PATCH] Btrfs: disable online scrub repair on ro cases

2015-12-04 Thread kbuild test robot
Hi Liu,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.4-rc3 next-20151203]

url:    https://github.com/0day-ci/linux/commits/Liu-Bo/Btrfs-disable-online-scrub-repair-on-ro-cases/20151204-205115
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: powerpc-defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   fs/btrfs/scrub.c: In function 'scrub_fixup_readpage':
>> fs/btrfs/scrub.c:703:10: error: invalid type argument of '->' (have 'u64 {aka long long unsigned int}')
 if (root->fs_info->sb->s_flags & MS_RDONLY)
 ^

vim +703 fs/btrfs/scrub.c

   697  struct inode *inode = NULL;
   698  struct btrfs_fs_info *fs_info;
   699  u64 end = offset + PAGE_SIZE - 1;
   700  struct btrfs_root *local_root;
   701  int srcu_index;
   702  
 > 703  if (root->fs_info->sb->s_flags & MS_RDONLY)
   704  return -EROFS;
   705  
   706  key.objectid = root;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: Binary data


Re: 3.16.0 Debian kernel hang

2015-12-04 Thread Russell Coker
On Sat, 5 Dec 2015 12:08:58 AM Austin S Hemmelgarn wrote:
> > I know that there are no plans to backport things to 3.16 and I don't
> > think the Debian people are going to be very interested in this.  So
> > this message is a FYI for users, maybe consider not using the
> > Debian/Jessie kernel for BTRFS systems.
> 
> I'd suggest extending that suggestion to:
> If you're not using an Enterprise distro (RHEL, SLES, CentOS, OEL), then 
> you should probably be building your own kernel, ideally using upstream 
> sources.

There are lots of ways of dealing with this.

Debian development doesn't stop.  Anyone who is running a Jessie system can 
easily run a kernel from Testing or Unstable (which really isn't particularly 
unstable).  It's generally expected that Debian user-space will work with a 
kernel from +- one release of Debian.  Also every time I've tried it Debian 
has worked well with a CentOS kernel of a similar version.

The only reason I'm not running Unstable kernels on my Debian systems is 
because I run some Xen servers and upgrading Xen is problematic.  Linode is 
moving from Xen to KVM so I guess I should consider doing the same.  If I 
migrate my Xen servers to KVM I can use newer kernels with less risk.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


Re: [PATCH v2 0/5] Make btrfs-progs really compatible with any kernel version

2015-12-04 Thread David Sterba
On Fri, Dec 04, 2015 at 10:08:35AM +0800, Qu Wenruo wrote:
> Liu Bo wrote on 2015/12/03 17:44 -0800:
> > On Mon, Nov 23, 2015 at 06:56:09PM +0100, David Sterba wrote:
> >> On Mon, Nov 23, 2015 at 08:56:13PM +0800, Anand Jain wrote:
> >>> Btrfs-progs is a tool for the btrfs kernel and we hope the latest btrfs-progs
> >>> will be compatible with any set of older/newer kernels.
> >>>
> >>> So far mkfs.btrfs and btrfs-convert set the default features, e.g.
> >>> skinny-metadata, even if the running kernel does not support them, and
> >>> so the mount fails on the running kernel.
> >>
> >> So the default behaviour of mkfs will try to best guess the feature set
> >> of the currently running kernel. I think this is the most common scenario
> >> and justifies the change in default behaviours.
> >>
> >> For the other cases I'd like to introduce some human-readable shortcuts
> >> to the --features option. Eg. 'mkfs.btrfs -O compat-3.2' will pick all
> >> options supported by the unpatched mainline kernel of version 3.2. This
> >> would be present for all versions, regardless if there was a change in the
> >> options or not.
> >>
> >> Similarly for convenience, add 'running' that would pick the options
> >> from running kernel but will be explicit.
> >>
> >> A remaining option should override the 'running' behaviour and pick the
> >> latest mkfs options. Naming it 'defaults' sounds a bit ambiguous so the
> >> name is yet to be determined.
> >>
> >>> This set of patches makes sure that progs understands the
> >>> kernel-supported features.
> >>>
> >>> So this patch checks whether sysfs reports the feature as
> >>> supported; if not, it will rely on the static kernel version that
> >>> introduced the feature (skinny-metadata in this example); next,
> >>> if for some reason the running kernel does not provide the kernel
> >>> version, it will fall back to the original method and enable
> >>> the feature in the hope that the kernel will support it.
> >>>
> >>> Also the last patch adds a warning when we fail to read either
> >>> sysfs features or the running kernel version.
> >>
> >> Your patchset is a good start, the additional options I've described can
> >> be added on top of that. We might need to switch the version
> >> representation from string to KERNEL_VERSION but that's an
> >> implementation detail.
> >
> > Depending on sysfs is stable but depending on kernel version may be not,
> > we may have a distro kernel which backports some incompat features from
> > upstream, then we have to decide based on sysfs interface.
> 
> +1.
> 
> Although sysfs does not always show up even for a supported kernel, e.g. 
> when the btrfs module is not loaded after boot.
> So we need to think twice before choosing a fallback method.

There are several factors that we have to take into account for the
default behaviour and fallback. I'm close to a final proposal yet missed
the possibility of unloaded module that would remove the access to
sysfs, as you point out.
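
The fallback order discussed in this thread (sysfs first, then the kernel version, then "unknown") can be sketched roughly as below. This is only an illustration under stated assumptions: `probe_feature()`, `parse_kernel_release()` and `KVER()` are hypothetical names, not btrfs-progs functions, and the sysfs path and minimum version are supplied by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Pack a version in the style of the kernel's KERNEL_VERSION() macro. */
#define KVER(a, b, c) (((long)(a) << 16) + ((long)(b) << 8) + (long)(c))

enum feature_state { FEATURE_NO, FEATURE_YES, FEATURE_UNKNOWN };

/* Parse a release string such as "4.2.6-301.fc23.x86_64"; -1 on failure. */
static long parse_kernel_release(const char *release)
{
	int major = 0, minor = 0, patch = 0;

	if (sscanf(release, "%d.%d.%d", &major, &minor, &patch) < 2)
		return -1;
	if (patch > 255)	/* keep the patch level inside its 8-bit field */
		patch = 255;
	return KVER(major, minor, patch);
}

/*
 * Hypothetical helper: decide whether a feature is usable.
 *   sysfs_path  - feature file to look for (may be NULL)
 *   release     - running kernel release string (may be NULL)
 *   min_version - first mainline version assumed to carry the feature
 */
static enum feature_state probe_feature(const char *sysfs_path,
					const char *release, long min_version)
{
	long running;

	assert(min_version >= 0);

	/* 1. sysfs reflects the real kernel, including distro backports. */
	if (sysfs_path && access(sysfs_path, F_OK) == 0)
		return FEATURE_YES;

	/* 2. Otherwise fall back to comparing version numbers. */
	running = release ? parse_kernel_release(release) : -1;
	if (running >= 0)
		return running >= min_version ? FEATURE_YES : FEATURE_NO;

	/* 3. Nothing to go on: the caller has to decide, and should warn. */
	return FEATURE_UNKNOWN;
}
```

A caller would pass something like the release string from uname(2) together with a feature file under /sys/fs/btrfs/features/; the exact file name and the version that introduced a given feature are assumptions to be checked against the kernel in question. Note that step 1 cannot tell "feature missing" from "module not loaded", which is exactly the case raised above.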


Re: [PATCH 04/12] btrfs: change how delay_iput is tracked in btrfs_delalloc_work

2015-12-04 Thread David Sterba
On Thu, Dec 03, 2015 at 06:25:37PM -0800, Liu Bo wrote:
> > struct inode *inode;
> > -   int delay_iput;
> > struct completion completion;
> > struct list_head list;
> > struct btrfs_work work;
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 15b29e879ffc..529a53b80ca0 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -9436,16 +9436,21 @@ static void btrfs_run_delalloc_work(struct 
> > btrfs_work *work)
> >  {
> > struct btrfs_delalloc_work *delalloc_work;
> > struct inode *inode;
> > +   int delay_iput;
> >  
> > delalloc_work = container_of(work, struct btrfs_delalloc_work,
> >  work);
> > inode = delalloc_work->inode;
> > +   /* Lowest bit of inode pointer tracks the delayed status */
> > +   delay_iput = ((unsigned long)inode & 1UL);
> > +   inode = (struct inode *)((unsigned long)inode & ~1UL);
> > +
> 
> To be quite frank, I don't like this; it's a pointer anyway, and it's
> error-prone from a debugging point of view. Would a 'u8 delayed_iput' help instead?

The point was to shrink the structure. Adding the u8 will grow it by
another 8 bytes; besides, the slab objects are aligned to 8 bytes by
default, so the overall cost of storing the delayed information is 8
bytes:

struct btrfs_delalloc_work {
        struct inode *             inode;                /*     0     8 */
        struct completion          completion;           /*     8    32 */
        struct list_head           list;                 /*    40    16 */
        struct btrfs_work          work;                 /*    56    88 */
        /* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
        u8                         delay;                /*   144     1 */

        /* size: 152, cachelines: 3, members: 5 */
        /* padding: 7 */
        /* last cacheline: 24 bytes */
};

As the use of the inode pointer is limited, I don't think this would
cause surprises. And it's commented where used which should help during
debugging.

Abusing the low bits of pointers is nothing new; the page cache tags are
implemented that way. This kind of low-level optimization is IMO acceptable.
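
The padding arithmetic in the layout dump can be reproduced in userspace with a stand-in structure. The types below are placeholders chosen only to give the same pointer-aligned layout (an 8-byte pointer plus 136 bytes of other members on a 64-bit build); they are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the structure without the extra flag. */
struct work_without_flag {
	void *inode;
	unsigned char payload[136];	/* placeholder for completion, list, work */
};

/* Same layout plus a one-byte flag at the end. */
struct work_with_flag {
	void *inode;
	unsigned char payload[136];
	uint8_t delay;			/* 1 byte of data, then tail padding */
};

/* Bytes the one-byte flag really costs once the struct is padded. */
static size_t flag_cost(void)
{
	assert(sizeof(struct work_with_flag) > sizeof(struct work_without_flag));
	return sizeof(struct work_with_flag) - sizeof(struct work_without_flag);
}
```

On a typical 64-bit build `flag_cost()` comes out as the pointer alignment, so the single byte is paid for with a full 8 bytes and the total grows from 144 to 152, matching the dump.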


Re: Very various speed of grep operation on btrfs partition

2015-12-04 Thread Austin S Hemmelgarn

On 2015-12-03 14:36, Михаил Гаврилов wrote:

Today at work I needed to search for some strings in a repository. Only a
machine with Windows was available. I used grep from Cygwin
for this task and was surprised by the speed of the NTFS partition. I
decided to repeat this task on my home Linux workstation.


[...snip...]

From the results we see that the search sometimes finishes instantly, in less
than a second, and sometimes lasts 4 minutes. The /home partition is formatted
with BTRFS. I would be interested in investigating what affects the
search speed, and in making the search always take less than a second.

Here are my mount options:
UUID=82df2d84-bf54-46cb-84ba-c88e93677948 /home btrfs
subvolid=5,autodefrag,noatime,space_cache,inode_cache,nodatacow 0 0

# uname -a
Linux localhost.localdomain 4.2.6-301.fc23.x86_64+debug #1 SMP Fri Nov
20 22:07:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

How do I start investigating?

Well, what other things are accessing the filesystem at the same time? 
If you've got something like KDE running with the 'semantic desktop' 
stuff turned on, then that will seriously impact the performance of 
other things using that filesystem.


The other thing to keep in mind, is that caching may be impacting things 
somewhat.  To really get a good idea of performance for something like 
this, you should run 'sync' followed by 'echo 3 > 
/proc/sys/vm/drop_caches' (you'll need to be root for the second one) 
prior to each run, and ideally have nothing else running on that filesystem.
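
The measurement procedure described above (sync, drop caches, then time the run) could be wrapped in a small helper along these lines. This is a sketch only: `drop_caches()` and `elapsed_seconds()` are hypothetical names, and the control file is passed in so the helper can be tried against an ordinary file without root.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/*
 * Flush dirty data, then write "3" to the given control file.
 * With control_path set to /proc/sys/vm/drop_caches (root required)
 * this asks the kernel to drop the page cache, dentries and inodes.
 * Returns 0 on success, -1 on failure.
 */
static int drop_caches(const char *control_path)
{
	FILE *f;
	int ok;

	assert(control_path != NULL);

	sync();
	f = fopen(control_path, "w");
	if (!f)
		return -1;
	ok = fputs("3\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Seconds between two clock samples, e.g. from CLOCK_MONOTONIC. */
static double elapsed_seconds(const struct timespec *start,
			      const struct timespec *end)
{
	return (double)(end->tv_sec - start->tv_sec) +
	       (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}
```

Each timed run would then call `drop_caches("/proc/sys/vm/drop_caches")`, take a `clock_gettime(CLOCK_MONOTONIC, ...)` sample before and after the grep, and report `elapsed_seconds()`, so that every run starts from a cold cache.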


On a separate note, if you're either running on a 64-bit system, or have 
less than about 2^31 files on the FS, inode_cache will slow things down. 
It's intended for stuff like mail spools where you have billions of 
files being created and deleted over a few weeks, and quickly use up the 
inode numbers.  On almost all systems, it will make things run slower, 
and possibly result in non-deterministic filesystem performance like 
what you are seeing here.


Additionally, do you have some particular reason that you absolutely 
_need_ nodatacow to be enabled for the FS?  It usually has no impact on 
performance, but it removes any kind of error correction for file data 
(checksums can't be used safely without COW semantics).  It probably has 
no direct impact on what you're seeing here, but it is something that 
really shouldn't be used in most cases at the filesystem level (it can 
be done on given subvolumes or directories, and that's the recommended 
way to do it if you don't want to go down to the per-file level).






Re: [PATCH] Btrfs: disable online scrub repair on ro cases

2015-12-04 Thread kbuild test robot
Hi Liu,

[auto build test WARNING on btrfs/next]
[also build test WARNING on v4.4-rc3 next-20151203]

url:
https://github.com/0day-ci/linux/commits/Liu-Bo/Btrfs-disable-online-scrub-repair-on-ro-cases/20151204-205115
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
next
config: i386-randconfig-c0-12042053 (attached as .config)
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/stddef.h:1:0,
from include/linux/stddef.h:4,
from include/uapi/linux/posix_types.h:4,
from include/uapi/linux/types.h:13,
from include/linux/types.h:5,
from include/uapi/linux/capability.h:16,
from include/linux/capability.h:15,
from include/linux/sched.h:15,
from include/linux/blkdev.h:4,
from fs/btrfs/scrub.c:19:
   fs/btrfs/scrub.c: In function 'scrub_fixup_readpage':
   fs/btrfs/scrub.c:703:10: error: invalid type argument of '->' (have 'u64 
{aka long long unsigned int}')
 if (root->fs_info->sb->s_flags & MS_RDONLY)
 ^
   include/linux/compiler.h:147:28: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   ^
>> fs/btrfs/scrub.c:703:2: note: in expansion of macro 'if'
 if (root->fs_info->sb->s_flags & MS_RDONLY)
 ^
   fs/btrfs/scrub.c:703:10: error: invalid type argument of '->' (have 'u64 
{aka long long unsigned int}')
 if (root->fs_info->sb->s_flags & MS_RDONLY)
 ^
   include/linux/compiler.h:147:40: note: in definition of macro '__trace_if'
 if (__builtin_constant_p((cond)) ? !!(cond) :   \
   ^
>> fs/btrfs/scrub.c:703:2: note: in expansion of macro 'if'
 if (root->fs_info->sb->s_flags & MS_RDONLY)
 ^
   fs/btrfs/scrub.c:703:10: error: invalid type argument of '->' (have 'u64 
{aka long long unsigned int}')
 if (root->fs_info->sb->s_flags & MS_RDONLY)
 ^
   include/linux/compiler.h:158:16: note: in definition of macro '__trace_if'
  __r = !!(cond); \
   ^
>> fs/btrfs/scrub.c:703:2: note: in expansion of macro 'if'
 if (root->fs_info->sb->s_flags & MS_RDONLY)
 ^

vim +/if +703 fs/btrfs/scrub.c

   687  }
   688  
   689  static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
   690  {
   691  struct page *page = NULL;
   692  unsigned long index;
   693  struct scrub_fixup_nodatasum *fixup = fixup_ctx;
   694  int ret;
   695  int corrected = 0;
   696  struct btrfs_key key;
   697  struct inode *inode = NULL;
   698  struct btrfs_fs_info *fs_info;
   699  u64 end = offset + PAGE_SIZE - 1;
   700  struct btrfs_root *local_root;
   701  int srcu_index;
   702  
 > 703  if (root->fs_info->sb->s_flags & MS_RDONLY)
   704  return -EROFS;
   705  
   706  key.objectid = root;
   707  key.type = BTRFS_ROOT_ITEM_KEY;
   708  key.offset = (u64)-1;
   709  
   710  fs_info = fixup->root->fs_info;
   711  srcu_index = srcu_read_lock(&fs_info->subvol_srcu);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH 04/12] btrfs: change how delay_iput is tracked in btrfs_delalloc_work

2015-12-04 Thread Holger Hoffstätte
On 12/04/15 13:36, David Sterba wrote:
[snip]
> As the use of the inode pointer is limited, I don't think this would
> cause surprises. And it's commented where used which should help during
> debugging.

When I read through those bits (mostly pondering portability) I was wondering
whether it might make sense to provide thin wrap/unwrap functions for the tag
bit instead of relying on open code and comments only. Just an idea, not sure
if it's worth the trouble. The code itself is functional and works fine as it
is, I'm running it right now.
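
Thin wrap/unwrap helpers of the kind suggested here might look like the sketch below. The names `inode_tag()`, `inode_tag_delayed()` and `inode_untag()` are made up for illustration and are not part of the patch; the code assumes the pointed-to object is at least 2-byte aligned so that bit 0 of the address is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct inode;	/* opaque here; only the address is manipulated */

#define DELAY_IPUT_TAG ((uintptr_t)1)

/* Store the delay_iput flag in the lowest bit of the pointer. */
static inline struct inode *inode_tag(struct inode *inode, bool delay_iput)
{
	uintptr_t bits = (uintptr_t)inode;

	assert((bits & DELAY_IPUT_TAG) == 0);	/* the low bit must be free */
	return (struct inode *)(bits | (delay_iput ? DELAY_IPUT_TAG : 0));
}

/* Read the flag back from a tagged pointer. */
static inline bool inode_tag_delayed(const struct inode *tagged)
{
	return ((uintptr_t)tagged & DELAY_IPUT_TAG) != 0;
}

/* Recover the real pointer; only this value may be dereferenced. */
static inline struct inode *inode_untag(struct inode *tagged)
{
	return (struct inode *)((uintptr_t)tagged & ~DELAY_IPUT_TAG);
}
```

Keeping the casts in three small functions means a tagged pointer is produced and consumed in one place each, which addresses the debugging concern without growing the structure.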

-h



Re: Subvolume UUID, data corruption?

2015-12-04 Thread Hugo Mills
On Fri, Dec 04, 2015 at 01:05:28PM +0100, S.J wrote:
> Hello
> 
> As we know, two file systems with the same UUID (like reported by eg. 
> "blkid") are problematic, especially if both are mounted at the same time it 
> leads to data corruption. So, copying a BTRFS partition with eg. dd to 
> another and use it immediately is bad. To prevent this, "btrfstune -u 
> /dev/sdaX" changes the UUID of the given partition.
> 
> However, BTRFS subvolumes have their own UUID, which can be viewed eg. with 
> "btrfs sub list -u /mountpoint". These UUIDs are not changed by the command 
> above, and apparently there is no other way to do this.
> 
> My question is: Is this a problem similar to the main UUID? Can mounting two 
> BTRFS partitions with equal subvolume UUIDs (but different main UUID) 
> cause data corruption?

   I don't think it'll cause problems. The UUIDs on subvols are only
really used internally to that filesystem, so the kernel doesn't have
a chance to get confused. The main thing that could be confused is
send/receive, but that's a matter of possibly losing some validation
(thus allowing you to do something that will fail) rather than causing
active damage, as in the duplicate-FS-UUID case.

> (...well, and maybe someone could explain to me what these subvol UUIDs are for 
> in the first place. Subvolumes already have a unique number, and from a user's 
> p.o.v., there isn't anything where the subvol UUIDs can be used at all (?))

   The subvol UUIDs are used to identify them through send/receive
operations. There are three main UUID fields on a subvol: the actual
UUID (u), the Received_UUID (r) and the Parent_UUID (p), and these are
used to identify whether an incremental send could function correctly
when received. (I can give you chapter and verse on how they're used
if you like, but that's a bit excessive just for answering your
question here).
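
As a rough illustration of how the three UUID fields could be used when receiving an incremental stream, the sketch below looks for a base subvolume on the receiving side. It is a simplified model and an assumption on my part: the thread does not spell out the matching rules, the real tools check more than this, and `struct subvol_ids` and `find_incremental_base()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define UUID_SIZE 16

/* The three UUID fields carried by a subvolume. */
struct subvol_ids {
	unsigned char uuid[UUID_SIZE];		/* (u) its own UUID */
	unsigned char received_uuid[UUID_SIZE];	/* (r) UUID it was received from */
	unsigned char parent_uuid[UUID_SIZE];	/* (p) UUID it was snapshotted from */
};

static bool uuid_equal(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, UUID_SIZE) == 0;
}

/*
 * Find a subvolume on the receiving side that can act as the base for
 * an incremental stream whose parent on the sender has sent_parent_uuid.
 * Returns NULL when no base exists, i.e. the receive cannot proceed.
 */
static const struct subvol_ids *
find_incremental_base(const struct subvol_ids *candidates, size_t count,
		      const unsigned char *sent_parent_uuid)
{
	size_t i;

	assert(sent_parent_uuid != NULL);

	/* Prefer a subvolume that was itself received from that parent. */
	for (i = 0; i < count; i++)
		if (uuid_equal(candidates[i].received_uuid, sent_parent_uuid))
			return &candidates[i];

	/* Otherwise accept the parent itself (send and receive on one FS). */
	for (i = 0; i < count; i++)
		if (uuid_equal(candidates[i].uuid, sent_parent_uuid))
			return &candidates[i];

	return NULL;
}
```

In this model a duplicated subvolume UUID on another filesystem does no harm to the kernel; at worst the lookup picks a base that does not really match, and the receive fails later.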

   Hugo.

> Thank you
> 
> PS: Apologies for sending a second mail, somehow my first try didn't contain 
> any text

-- 
Hugo Mills | Do not meddle in the affairs of system
hugo@... carfax.org.uk | administrators, for they are subtle, and quick to
http://carfax.org.uk/  | anger.
PGP: E2AB1DE4  |




Re: 3.16.0 Debian kernel hang

2015-12-04 Thread Austin S Hemmelgarn

On 2015-12-04 08:42, Russell Coker wrote:

On Sat, 5 Dec 2015 12:08:58 AM Austin S Hemmelgarn wrote:

I know that there are no plans to backport things to 3.16 and I don't
think the Debian people are going to be very interested in this.  So
this message is a FYI for users, maybe consider not using the
Debian/Jessie kernel for BTRFS systems.


I'd suggest extending that suggestion to:
If you're not using an Enterprise distro (RHEL, SLES, CentOS, OEL), then
you should probably be building your own kernel, ideally using upstream
sources.


There are lots of ways of dealing with this.

Debian development doesn't stop.  Anyone who is running a Jessie system can
easily run a kernel from Testing or Unstable (which really isn't particularly
unstable).  It's generally expected that Debian user-space will work with a
kernel from +- one release of Debian.  Also every time I've tried it Debian
has worked well with a CentOS kernel of a similar version.
Well yes, that does usually work, but that doesn't mean that it keeps up 
with mainline very well.  Back when I used Debian on a regular basis, I 
ran the 'unstable' kernels, and they still lagged behind mainline by at 
least a minor version, and often more than that.  And there have been 
cases where things got horribly broken in mainline due to lack of proper 
vetting of code (Most recent example being the insanity with the 
clustered MD code, which broke non-clustered soft raid for at least two 
major releases), which prevents them from safely keeping up-to-date with 
mainline.


The only reason I'm not running Unstable kernels on my Debian systems is
because I run some Xen servers and upgrading Xen is problematic.  Linode is
moving from Xen to KVM so I guess I should consider doing the same.  If I
migrate my Xen servers to KVM I can use newer kernels with less risk.
That's interesting, that must be something with how they do kernel 
development in Debian, because I've never had any issues upgrading 
either Xen or Linux on any of the systems I've run Xen on, and I 
directly track mainline (with a small number of patches) for Linux, and 
stay relatively close to mainline with Xen (Gentoo doesn't have all that 
many patches on top of the regular release for Xen, aside from XSA patches).







Re: btrfs crashing the kernel with Seagate 8TB SMR drives.

2015-12-04 Thread Birdsarenice
I did suspect that NCQ may be involved, but I had no clear evidence - 
until I noticed that my drives had also incremented the 'end to end 
error' count in SMART, which does match accounts of the NCQ issue. That 
suggests there are two interlinked issues: The issue with those Seagate 
drives and NCQ, combined with btrfs causing a kernel lock under certain 
error circumstances when it would be more appropriate to remount ro. 
Looks like the NCQ issue is already being addressed, but I did uncover a 
new and unusual error condition that btrfs needs to handle - and looking 
at the patch, it's a trivial thing to fix, so bothering the mailing list 
with it has made btrfs better in a tiny way. I don't usually report 
errors, assuming that people far more capable than I are already on top 
of them, but when I saw one that gave a description right down to the 
line number I thought it might be something that could be looked into 
very easily.


I'm still impressed with the resilience of btrfs though - after all this 
abuse of crashing during rebalancing, corrupted filesystem structures 
and out-of-order commands, all my data is still undamaged. No 
conventional RAID could have endured that.


Thanks for the patch, but I'd rather not fiddle with the kernel and have 
to repeat every time a new version comes out. I'll just disable NCQ 
until the fix is mainlined and SUSE incorporates it.


On 04/12/15 15:21, Robert Krig wrote:

As Chris mentioned, check out the Bug report here:
https://bugzilla.kernel.org/show_bug.cgi?id=93581


I have an 8TB SMR drive and the kernel was reporting drive errors.
Switching to Kernel 3.16 (Standard Debian Jessie kernel) fixed it for me
( for the moment).

From what I read in that kernel bug report, the patch has been submitted
for kernel 4.4.

On 03.12.2015 19:07, Codebird wrote:

I've got a nice bug for you - because I can offer you what everyone
likes to see, a precise error message.

I've got a btrfs filesystem spread over six devices, RAID1 mode. Four
of these are Seagate 8TB archive drives - those SMR ones that a few
others have reported failing when used with btrfs. I've had that issue
too, and I just can't explain why, other than to say that it only
occurs when using them on my mainboard SATA ports, not via USB dock.
But that's not what I'm reporting - that's just the source of the
problem that causes the crash I am reporting.

The crash occurs when scrubbing, after some time and some terabytes -
or possibly just when reading a certain place, I'm not sure - and it
gives this helpful error left on the screen along with a system so
unresponsive numlock won't flash:

BTRFS: Error (device sdg1) in  __btrfs_free_extent:6360: errno=-5 IO
failure
BTRFS: Error (device sdg1) in  __btrfs_free_extent:6360: errno=-5 IO
failure
BTRFS: Error (device sdg1) in  btrfs_run_delayed_refs:2851: errno=-5
IO failure
BTRFS: Error (device sdg1) in  btrfs_run_delayed_refs:2851: errno=-5
IO failure
BTRFS: Error (device sdg1) in  btrfs_run_delayed_refs:2851: errno=-5
IO failure
 BTRFS: assertion failed:
f(fs_info->sb->s_flags & MS  
---[ cut here ]
kernel BUG at ../fs/btrfs/ctree.h:4057!

Not sure if some of those 5 might be 6, as I was in a hurry to get it
back up both times and just got a blurry photo. But it looks to me
like there might be a chunk of code that doesn't handle a hardware
fault - rather than cleanly return an error it's causing the kernel to
hang entirely. I've managed to get this to happen twice now, so it's
certainly something worth looking into. This is on SUSE tumbleweed,
with kernel 4.3.0-2-default.


Re: 3.16.0 Debian kernel hang

2015-12-04 Thread Austin S Hemmelgarn

On 2015-12-04 09:26, Russell Coker wrote:

On Sat, 5 Dec 2015 12:53:07 AM Austin S Hemmelgarn wrote:

The only reason I'm not running Unstable kernels on my Debian systems is
because I run some Xen servers and upgrading Xen is problematic.  Linode
is moving from Xen to KVM so I guess I should consider doing the
same.  If I migrate my Xen servers to KVM I can use newer kernels with
less risk.


That's interesting, that must be something with how they do kernel
development in Debian, because I've never had any issues upgrading
either Xen or Linux on any of the systems I've run Xen on, and I
directly track mainline (with a small number of patches) for Linux, and
stay relatively close to mainline with Xen (Gentoo doesn't have all that
many patches on top of the regular release for Xen, aside from XSA
patches).


I don't think that Debian does anything wrong in this regard.  It's just that
my experience of Xen is that it is fragile at the best of times.  The fact
that Red Hat packaged the Xen kernel in the Linux kernel package is a major
indication of Xen problems IMHO, the concept of Xen is that it shouldn't be
tied to a Linux kernel.
In the case of Red Hat, that's probably the way it's done because that's 
originally what was needed to make things work.  Early versions of Xen 
very much did need a special version of Linux running as Domain 0. 
Coupling things like that also simplifies testing for the developers at 
Red Hat, as they then only need to test one combination, instead of a 
big matrix of features.  Less to test means they can test more 
thoroughly, which means they can provide a better guarantee that things 
will work without intervention right out of the box, which is important 
for enterprise distros.


Xen is supposed to be decoupled from the version of the Domain 0 kernel, 
and in most of my experience with it, they do a pretty good job.  90% of 
the issues I've heard of personally have been with patched versions put 
together by Linux distros, not with an upstream release.


If you haven't had Xen issues then I think you have been lucky.

I have personally had issues using Debian as Domain 0 and keeping Xen up 
to date myself, but all of those issues vanished when I switched to 
Gentoo for that purpose (well, they vanished when I switched to NetBSD, 
but haven't resurfaced since I switched from that to Gentoo Linux after 
about a week of pulling my hair out from fighting with BSD). I'm 
admittedly not doing anything other than small purpose built PV domains 
for service isolation in most cases (although I do use a dedicated PV 
domain for testing kernel patches from time to time), but that really 
shouldn't have any impact.






Re: [PATCH 0/3] btrfs-progs: fix file restore to lost+found bug

2015-12-04 Thread Qu Wenruo



On 12/04/2015 01:37 PM, Naohiro Aota wrote:

This series addresses an issue where btrfsck restores an infinite number of
copies of the same file into the `lost+found' directory. The issue occurs on a
file that is linked from two different directories, A and B. If the links from
dir A are corrupted and the links from dir B are kept valid, btrfsck won't stop
creating files in lost+found, like this:

-
Moving file 'file.del.51' to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1876
Trying to rebuild inode:1877
Moving file 'del' to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1877
Can't get file name for inode 1876, using '1876' as fallback
Moving file '1876' to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1876
Can't get file name for inode 1876, using '1876' as fallback
Moving file '1876.1876' to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1876
(snip)
Moving file 
'1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876.1876'
 to 'lost+found' dir since it has no valid backref
Fixed the nlink of inode 1876
Can't get file name for inode 1876, using '1876' as fallback
Can't get file name for inode 1876, using '1876' as fallback
Can't get file name for inode 1876, using '1876' as fallback
-

The problem is the early release of inode backrefs. The release prevents
`reset_nlink()' from adding valid backrefs back to an inode. As a result,
the following sequence occurs:

0. btrfsck scans an FS tree
1. It finds valid links and invalid links (some links are lost)
2. All valid links are released
3. btrfsck detects found_links != nlink
4. reset_nlink() resets nlink to 0
5. No valid links are restored (thus still nlink = 0)
6. The file is restored to lost+found since nlink == 0 (now, nlink = 1)
7. btrfsck rescans the FS tree
8. It finds `found_links' = #valid_links+1 (in lost+found) and nlink = 1
9. Again all valid links are lost, and the file is restored to lost+found
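
The loop in steps 0 to 9 can be reproduced with a toy model. The sketch below is not btrfsck code: `struct toy_inode`, `repair_pass()` and `passes_until_stable()` are hypothetical, and "keeping the valid backrefs" is reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_inode {
	int valid_links;	/* intact links still present in the FS tree */
	int lost_found_links;	/* links created in lost+found so far */
	int nlink;		/* link count stored in the inode */
};

/* One scan-and-repair pass; returns true if the inode had to be changed. */
static bool repair_pass(struct toy_inode *ino, bool keep_valid_backrefs)
{
	int found_links = ino->valid_links + ino->lost_found_links;

	if (found_links == ino->nlink)
		return false;			/* consistent, nothing to do */

	ino->nlink = 0;				/* step 4: reset nlink */
	if (keep_valid_backrefs)
		ino->nlink = found_links;	/* backrefs still known: re-add */
	if (ino->nlink == 0) {			/* step 6: move to lost+found */
		ino->lost_found_links++;
		ino->nlink = 1;
	}
	return true;
}

/* Number of repairing passes until stable, or -1 if it never settles. */
static int passes_until_stable(int valid_links, int nlink,
			       bool keep_valid_backrefs, int max_passes)
{
	struct toy_inode ino = { valid_links, 0, nlink };
	int pass;

	assert(max_passes > 0);
	for (pass = 0; pass < max_passes; pass++)
		if (!repair_pass(&ino, keep_valid_backrefs))
			return pass;
	return -1;
}
```

With one surviving link and a stored nlink of 2, the model settles after a single repair when the valid backrefs are kept, and never settles when they are released early, because every pass adds one more lost+found entry while nlink is forced back to 1.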


Right, that's one case I missed in the repair code.

Thanks for the fix.



The first patch adds clean-up code to the test: it umounts the test
directory on the failure path. The second patch fixes the above problem. And
the last patch extends the test to check a case of multiple-linked file
corruption.


But I only see the first 2 patches on the mailing list...
The last test case seems missing?

Thanks,
Qu





Re: Subvolume UUID, data corruption?

2015-12-04 Thread Christoph Anton Mitterer
On Fri, 2015-12-04 at 13:07 +, Hugo Mills wrote:
> I don't think it'll cause problems.
Is there any guaranteed behaviour when btrfs encounters two filesystems
(i.e. not talking about the subvols now) with the same UUID?

Given that it's long standing behaviour that people could clone
filesystems (dd, etc.) and this just worked™, btrfs should at least
handle such case gracefully.
For example, when already more than one block device with a btrfs of
the same UUID are known, then it should refuse to mount any of them.

And if one is already known and another device pops up it should refuse
to mount that and continue to normally use the already mounted one.
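
The policy proposed here could be modelled as a check run whenever a new device is seen. The sketch below is an illustration of that proposal, not of what btrfs does. One caveat it has to respect: the member devices of a multi-device btrfs share the filesystem UUID by design, so the sketch only treats a device as a clone when both the filesystem UUID and the device id match; that rule, `struct seen_device` and `find_clone()` are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FSID_SIZE 16

struct seen_device {
	const char *path;			/* e.g. a block device node */
	unsigned char fsid[FSID_SIZE];		/* filesystem UUID */
	unsigned long long devid;		/* device id within the FS */
};

/*
 * Return the index of an already-known device that the candidate
 * duplicates (same filesystem UUID and same device id), or -1.
 * Under the proposed policy a caller would refuse to mount on a hit.
 */
static int find_clone(const struct seen_device *known, size_t count,
		      const struct seen_device *candidate)
{
	size_t i;

	assert(candidate != NULL);

	for (i = 0; i < count; i++)
		if (known[i].devid == candidate->devid &&
		    memcmp(known[i].fsid, candidate->fsid, FSID_SIZE) == 0)
			return (int)i;
	return -1;
}
```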



Cheers,
Chris.



Re: [PATCH v2 0/5] Make btrfs-progs really compatible with any kernel version

2015-12-04 Thread Anand Jain



David,


the possibility of unloaded module that would remove the access to
sysfs, as you point out.


 Kindly note, the patch below made /dev/btrfs-control a static node,

-
commit 578454ff7eab61d13a26b568f99a89a2c9edc881
Author: Kay Sievers 
Date: Thu May 20 18:07:20 2010 +0200

driver core: add devname module aliases to allow module on-demand auto-loading

--

 And here the function, check_or_load_btrfs_ko(), in the PATCH v2 2/5,
 will take care of this problem.


+
+int check_or_load_btrfs_ko()
+{
+   int fd;
+
+   /*
+* open will load btrfs kernel module if its not loaded,
+* and if the kernel has CONFIG auto load set?
+*/
+   fd = open("/dev/btrfs-control", O_RDONLY);
+   if (fd < 0)
+   return -errno;
+
+   close(fd);
+   return 0;
+}
+


 Since the static minor number for /dev/btrfs-control is now mapped to
 the btrfs kernel module, the module is loaded whenever
 /dev/btrfs-control is accessed.

 Further, the /dev/btrfs-control node is created by udevd, which reads
 modules.devname; that file is either supplied/updated by the distro
 or generated at compilation.

 Systems without udev should, IMO, run mknod ..btrfs-control
 in their install script, which I guess is a must.


# ls -li /dev/btrfs-control
7338 crw-rw 1 root disk 10, 234 Dec  5 10:45 /dev/btrfs-control

# cat modules.devname | egrep btrfs
btrfs btrfs-control c10:234

# cat ./include/linux/miscdevice.h | egrep BTRFS
#define BTRFS_MINOR 234


 So IMO this is not a real problem.

Thanks, Anand



Re: attacking btrfs filesystems via UUID collisions? (was: Subvolume UUID, data corruption?)

2015-12-04 Thread Christoph Anton Mitterer
Thinking a bit more about that, I came to the conclusion that it's actually 
security relevant that btrfs deals gracefully with filesystems having the same 
UUID:

Getting to know someone else's filesystem's UUID may be more easily possible 
than one may think.
It's usually not considered secret and for example included in debug reports 
(e.g. several Debian packages do this).

The only thing an attacker then needs to do is somehow make another 
filesystem with the UUID available in his victim's system.
The simplest way is via a USB stick when he has local access.
Thanks to some stupid desktop environments, chances aren't too bad that the 
system will even auto mount the stick.

If btrfs doesn't handle this gracefully the attacker may damage or destroy the 
original filesystem, or if things get awkwardly corrupted (and data is written 
to the fake btrfs) even get data out of such a system (despite any screen locks 
or dm-crypt).

Cheers
Chris.


Re: [PATCH 06/15] btrfs: Cleanup num_tolerated_disk_barrier_failures

2015-12-04 Thread Qu Wenruo

Hi Anand,

Would you please push patch 1~6 in your hot spare patchset to Chris first?

In my opinion, it will need some time before some details like whether 
to do hot-spare in kernel or in user-space are settled.


And all these 6 patches are quite independent from the hot spare patchset.
So it would be OK to push them into mainline in this or the next merge window.

Thanks,
Qu

On 11/09/2015 06:56 PM, Anand Jain wrote:

From: Qu Wenruo 

As we use per-chunk degradable check, now the global
num_tolerated_disk_barrier_failures is of no use. So cleanup it.

Signed-off-by: Qu Wenruo 

[Btrfs: resolve conflict to apply 'btrfs: Cleanup 
num_tolerated_disk_barrier_failures']
Signed-off-by: Anand Jain 
---
  fs/btrfs/ctree.h   |  2 --
  fs/btrfs/disk-io.c | 56 --
  fs/btrfs/disk-io.h |  2 --
  fs/btrfs/volumes.c | 17 -
  4 files changed, 77 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a86051e..dedd3e0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1753,8 +1753,6 @@ struct btrfs_fs_info {
/* next backup root to be overwritten */
int backup_root_index;

-   int num_tolerated_disk_barrier_failures;
-
/* device replace state */
struct btrfs_dev_replace dev_replace;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d3303f9..d10ef2e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2965,8 +2965,6 @@ retry_root_backup:
printk(KERN_ERR "BTRFS: Failed to read block groups: %d\n", ret);
goto fail_sysfs;
}
-   fs_info->num_tolerated_disk_barrier_failures =
-   btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);

fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
   "btrfs-cleaner");
@@ -3498,60 +3496,6 @@ int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags)
return 0;
  }

-int btrfs_calc_num_tolerated_disk_barrier_failures(
-   struct btrfs_fs_info *fs_info)
-{
-   struct btrfs_ioctl_space_info space;
-   struct btrfs_space_info *sinfo;
-   u64 types[] = {BTRFS_BLOCK_GROUP_DATA,
-  BTRFS_BLOCK_GROUP_SYSTEM,
-  BTRFS_BLOCK_GROUP_METADATA,
-  BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA};
-   int i;
-   int c;
-   int num_tolerated_disk_barrier_failures =
-   (int)fs_info->fs_devices->num_devices;
-
-   for (i = 0; i < ARRAY_SIZE(types); i++) {
-   struct btrfs_space_info *tmp;
-
-   sinfo = NULL;
-   rcu_read_lock();
-   list_for_each_entry_rcu(tmp, &fs_info->space_info, list) {
-   if (tmp->flags == types[i]) {
-   sinfo = tmp;
-   break;
-   }
-   }
-   rcu_read_unlock();
-
-   if (!sinfo)
-   continue;
-
-   down_read(&sinfo->groups_sem);
-   for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
-   u64 flags;
-
-   if (list_empty(&sinfo->block_groups[c]))
-   continue;
-
-   btrfs_get_block_group_info(&sinfo->block_groups[c],
-  &space);
-   if (space.total_bytes == 0 || space.used_bytes == 0)
-   continue;
-   flags = space.flags;
-
-   num_tolerated_disk_barrier_failures = min(
-   num_tolerated_disk_barrier_failures,
-   btrfs_get_num_tolerated_disk_barrier_failures(
-   flags));
-   }
-   up_read(&sinfo->groups_sem);
-   }
-
-   return num_tolerated_disk_barrier_failures;
-}
-
  static int write_all_supers(struct btrfs_root *root, int max_mirrors)
  {
struct list_head *head;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index adeb318..6dc5fd3 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -142,8 +142,6 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
  int btree_lock_page_hook(struct page *page, void *data,
void (*flush_fn)(void *));
  int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-   struct btrfs_fs_info *fs_info);
  int __init btrfs_end_io_wq_init(void);
  void btrfs_end_io_wq_exit(void);

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a5262bf..33ad42e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1782,9 +1782,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path, u64 devid)
free_fs_devices(cur_devices);
  

Re: BUG: failure at fs/btrfs/ctree.h:337/btrfs_chunk_item_size()!

2015-12-04 Thread Qu Wenruo



On 12/04/2015 09:12 PM, David Sterba wrote:

On Fri, Dec 04, 2015 at 09:21:59AM +0800, Qu Wenruo wrote:

We do have the alignment check in kernel, but it's in the early phase
where we don't know if nodesize is reliable and print only a warning.


This can be enhanced by the following method:


At minimum, we can promote the 4k alignment checks in
btrfs_check_super_valid from a warning to an error. The blocks must be
4k aligned, regardless of sectorsize or nodesize.


1) Check sectorsize first
 Only a few sector sizes are valid for current btrfs:
 4K, 8K, 16K, 32K, 64K
 Just five numbers, quite easy to check.


The sectorsize must be PAGE_SIZE at the moment. This will change with
Chandan's patchset though.


PAGE_SIZE would be good enough.




 Or, if anyone is going to extend the supported sector sizes, we can change
 the check to whether the number is a power of 2 starting from 4K.

2) Check nodesize/leafsize then
 It should be aligned to sectorsize.


This particular check is missing but is implicit because of the
sectorsize == PAGE_SIZE restriction.


But we still need to validate nodesize/leafsize against sectorsize.
Current btrfs already uses a large nodesize by default.

For example, a 20K nodesize can pass the 4K page size check but is still wrong.

(And I was also wrong in my previous mail: it must not only be aligned to
sectorsize, but also needs to be a power of 2.)





 And nodesize must match with leafsize.
 Currently, it's done outside of check_super_valid(); we can integrate it.


Yeah it's done, so I don't see why we should add it again.


I just want to move it to check_super_valid(), as it's better to keep the
validation checks together, and that's why we have check_super_valid().





3) Check all super root bytenr against *sectorsize*
 Yeah, not nodesize.
 Some old bad converts (from just before my convert rework patch) leave
 metadata extents unaligned to nodesize and only aligned to
 sectorsize.
 So only check alignment to sectorsize.


While the real check should be against the sectorsize, at the moment I
think it's covered by the 4k checks anyway. I understand why we can't
use the nodesize.



4K is good enough for the x86 family but can't catch all problems on 64K page
size platforms like PPC64 or AArch64.


So it's still better to change the check to at least the page size, even if we
don't have subpage size support yet.
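
The checks discussed in this thread can be sketched in plain C as follows.
This is only an illustrative userspace sketch, not the kernel's
btrfs_check_super_valid(): the struct sb_sizes layout, the helper names, and
the 4K/64K bounds are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bounds for this sketch: 4K minimum, 64K maximum block size. */
#define MIN_SECTORSIZE 4096u
#define MAX_BLOCKSIZE  65536u

/* Hypothetical subset of the superblock fields that the checks look at. */
struct sb_sizes {
	uint32_t sectorsize;
	uint32_t nodesize;
	uint32_t leafsize;
	uint64_t root_bytenr;
	uint64_t chunk_root_bytenr;
	uint64_t log_root_bytenr;
};

static bool is_power_of_2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Mask-based alignment test; only valid when align is a power of 2. */
static bool is_aligned(uint64_t value, uint64_t align)
{
	assert(is_power_of_2(align));
	return (value & (align - 1)) == 0;
}

/* Returns 0 when all checks pass and -1 on the first failure. */
static int check_sb_sizes(const struct sb_sizes *sb)
{
	/* 1) sectorsize: a power of 2 between 4K and 64K */
	if (!is_power_of_2(sb->sectorsize) ||
	    sb->sectorsize < MIN_SECTORSIZE || sb->sectorsize > MAX_BLOCKSIZE)
		return -1;

	/*
	 * 2) nodesize: a power of 2 that is at least sectorsize, which also
	 * implies alignment to sectorsize; leafsize must match nodesize.
	 */
	if (!is_power_of_2(sb->nodesize) ||
	    sb->nodesize < sb->sectorsize || sb->nodesize > MAX_BLOCKSIZE)
		return -1;
	if (sb->leafsize != sb->nodesize)
		return -1;

	/* 3) root bytenrs: aligned to sectorsize, deliberately not nodesize */
	if (!is_aligned(sb->root_bytenr, sb->sectorsize) ||
	    !is_aligned(sb->chunk_root_bytenr, sb->sectorsize) ||
	    !is_aligned(sb->log_root_bytenr, sb->sectorsize))
		return -1;

	return 0;
}
```

With these rules a 20K nodesize is rejected even though it is 4K aligned,
while a root bytenr that is only sectorsize aligned (as left behind by an old
convert) still passes.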



So, if we do the warning -> error, we're fine for now. Some of the
checks you suggest would be good to merge when the subpage blocksize
patchset is merged.


Right, a more accurate check is only needed after the subpage patchset.

Thanks,
Qu




[PATCH V2] Btrfs: disable online scrub repair on ro cases

2015-12-04 Thread Liu Bo
This disables the repair process in read-only cases, as it can cause the system
to become unresponsive on the ASSERT() in repair_io_failure().

This can happen when scrub is running and a hardware error pops up;
we should fall back to ro mounts gracefully instead of becoming unresponsive.

Reported-by: Codebird 
Signed-off-by: Liu Bo 
---
v2: Get @fs_info from a real pointer instead of a confusing-name u64 root.

 fs/btrfs/scrub.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 2907a77..cb8a4e0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -682,11 +682,14 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
struct btrfs_root *local_root;
int srcu_index;
 
+   fs_info = fixup->root->fs_info;
+   if (fs_info->sb->s_flags & MS_RDONLY)
+   return -EROFS;
+
key.objectid = root;
key.type = BTRFS_ROOT_ITEM_KEY;
key.offset = (u64)-1;
 
-   fs_info = fixup->root->fs_info;
srcu_index = srcu_read_lock(&fs_info->subvol_srcu);
 
local_root = btrfs_read_fs_root_no_name(fs_info, &key);
-- 
2.5.0



Re: [PATCH v2 0/5] Make btrfs-progs really compatible with any kernel version

2015-12-04 Thread Liu Bo
On Fri, Dec 04, 2015 at 11:57:55AM +0800, Qu Wenruo wrote:
> 
> 
> Liu Bo wrote on 2015/12/03 18:53 -0800:
> >On Fri, Dec 04, 2015 at 10:08:35AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>Liu Bo wrote on 2015/12/03 17:44 -0800:
> >>>On Mon, Nov 23, 2015 at 06:56:09PM +0100, David Sterba wrote:
> On Mon, Nov 23, 2015 at 08:56:13PM +0800, Anand Jain wrote:
> >Btrfs-progs is a tool for the btrfs kernel and we hope the latest btrfs-progs
> >will be compatible with any set of older/newer kernels.
> >
> >So far mkfs.btrfs and btrfs-convert set the default features, e.g.
> >skinny-metadata, even if the running kernel does not support them, and
> >so the mount fails on the running kernel.
> 
> So the default behaviour of mkfs will try to best guess the feature set
> of the currently running kernel. I think this is the most common scenario
> and justifies the change in default behaviours.
> 
> For the other cases I'd like to introduce some human-readable shortcuts
> to the --features option. Eg. 'mkfs.btrfs -O compat-3.2' will pick all
> options supported by the unpatched mainline kernel of version 3.2. This
> would be present for all version, regardless if there was a change in the
> options or not.
> 
> Similarly for convenience, add 'running' that would pick the options
> from running kernel but will be explicit.
> 
> A remaining option should override the 'running' behaviour and pick the
> latest mkfs options. Naming it 'defaults' sounds a bit ambiguous so the
> name is yet to be determined.
> 
> >Here in this set of patches will make sure the progs understands the
> >kernel supported features.
> >
> >So this patch checks whether sysfs says the feature is
> >supported; if not, then it will rely on the static kernel version which
> >provided that feature (skinny-metadata here in this example), next
> >if for some reason the running kernel does not provide the kernel
> >version, then it will fall back to the original method to enable
> >the feature with a hope that kernel will support it.
> >
> >Also the last patch adds a warning when we fail to read either
> >sysfs features or the running kernel version.
> 
> Your patchset is a good start, the additional options I've described can
> be added on top of that. We might need to switch the version
> representation from string to KERNEL_VERSION but that's an
> implementation detail.
> >>>
> >>>Depending on sysfs is stable but depending on kernel version may be not,
> >>>we may have a distro kernel which backports some incompat features from
> >>>upstream, then we have to decide based on sysfs interface.
> >>
> >>+1.
> >>
> >>Although sysfs does not always show up even for supported kernel, e.g btrfs
> >>modules is not loaded after boot.
> >>So we need to consider twice before choosing a fallback method.
> >>
> >>>
> >>>However, this brings another problem: very old kernels don't
> >>>have sysfs. Do you have any suggestions for that?
> >>
> >>Other fs, like xfs/ext* doesn't even have sysfs feature interface, only
> >>release announcement mentioning default behavior change.
> >>And I don't see many users complaining about it.
> >>
> >>Here is an example of xfsprogs changing its default features recently:
> >>On 10 June 2015, xfsprogs v3.2.3 was released, with the new default feature
> >>of enabling CRC for the fs.
> >>The first supported kernel is 3.15, which was released on 8 June 2014,
> >>almost one year earlier.
> >
> >It's the same thing, if you use a earlier version(before v5) xfs and a
> >v5 xfsprogs, you are not going to mount it.
> >
> >>
> >>On the other hand, the sysfs feature is introduced at the end of year 2013.
> >>It's already over 2 years.
> >>
> >>So just forgetting the extra minor case of super old kernels would be good
> >>enough.
> >
> >Sorry, we're not able to do that since most users won't keep upgrading
> >their kernels to the latest one; instead they use the one they think is
> >stable.
> >
> >The fact is that btrfs has way more incompatible features than either ext4
> >or xfs, and the lack of complaints about ext4/xfs from them won't solve our
> >btrfs issue anyway.
> >
> >The problem is much more serious for enterprise users which are sort of
> >conservative, they would backport what they need, if they use
> >btrfs they will experience the painful things.
> 
> Only if enterprise really think btrfs is stable enough.
> For this point, xfs is considered more stable than btrfs, but v5 xfs recent
> change doesn't introduce such facility to do that compatibility check in
> xfsprogs.
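
The fallback order discussed in the quoted thread (ask sysfs first, then
compare the running kernel version, otherwise report "unknown" so the caller
can warn and apply its default) can be sketched in C as follows. This is a
hedged illustration rather than the code from the patchset: the function
names, the enum, and the VERSION_CODE() macro are assumptions for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Pack a version into one comparable number, in the style of KERNEL_VERSION(). */
#define VERSION_CODE(a, b, c) \
	(((long)(a) << 16) + ((long)(b) << 8) + (long)(c))

enum feature_state {
	FEATURE_UNKNOWN = -1,		/* neither sysfs nor the version could tell */
	FEATURE_UNSUPPORTED = 0,
	FEATURE_SUPPORTED = 1,
};

/* Parse a release string such as "3.10.0-327.el7"; returns -1 on failure. */
static long parse_release(const char *release)
{
	unsigned int major = 0, minor = 0, patch = 0;

	assert(release != NULL);
	if (sscanf(release, "%u.%u.%u", &major, &minor, &patch) < 2)
		return -1;
	/* Clamp so that a large minor or patch level cannot overflow its byte. */
	if (minor > 255)
		minor = 255;
	if (patch > 255)
		patch = 255;
	return VERSION_CODE(major, minor, patch);
}

/*
 * sysfs_path:  feature file to probe, or NULL to skip the sysfs step
 * release:     running kernel release string, or NULL if unavailable
 * min_version: first mainline version assumed to carry the feature
 */
static enum feature_state feature_supported(const char *sysfs_path,
					    const char *release,
					    long min_version)
{
	long running;

	/* Step 1: sysfs is authoritative when present (covers backports). */
	if (sysfs_path && access(sysfs_path, F_OK) == 0)
		return FEATURE_SUPPORTED;

	/* Step 2: fall back to the static version comparison. */
	running = release ? parse_release(release) : -1;
	if (running < 0)
		return FEATURE_UNKNOWN;	/* Step 3: caller warns and decides. */

	return running >= min_version ? FEATURE_SUPPORTED : FEATURE_UNSUPPORTED;
}
```

A caller would typically pass the release field filled in by uname(2). For
skinny-metadata, a path like /sys/fs/btrfs/features/skinny_metadata and a
minimum version of 3.10 are plausible inputs, but both are assumptions here and
should be checked against the kernel sources. Note that a missing sysfs file
does not prove the feature is absent (the module may simply not be loaded),
which is why the sketch continues to the version comparison instead of
returning "unsupported" right away.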

XFS on the kernel side obviously refuses to mount if you create a filesystem
with an incompatible feature using a recent xfsprogs but try to mount it with
an older kernel.

STATIC int
xfs_mount_validate_sb()
{
...
if (xfs_sb_has_incompat_feature(sbp, XFS_SB_FEAT_INCOMPAT_UNKNOWN)) {   
  
xfs_warn(mp,