possible bug in balance

2015-01-02 Thread luvar
Hi,
Today I added a second device and converted the metadata to raid1. Then I 
wanted to convert some of the data to raid1 as well (with a balance filter) 
and see whether reading files (starting programs) gets any faster. I issued 
this command:

luvar@blackdawn:~$ sudo time btrfs balance start -dconvert=raid1 -dusage=20 
/home/luvar/programs/
[sudo] password for luvar: 
ERROR: error during balancing '/home/luvar/programs/' - Input/output error
There may be more info in syslog - try dmesg | tail
Command exited with non-zero status 19
0.00user 0.08system 0:08.29elapsed 1%CPU (0avgtext+0avgdata 768maxresident)k
14696inputs+6584outputs (2major+253minor)pagefaults 0swaps


Part of the df output:
luvar@blackdawn:~$ df -h
Filesystem   Size  Used Avail Use% Mounted on
/dev/sdb2       458G  177G   59G  75% /
/dev/sdb2       458G  177G   59G  75% /home
/dev/sdb1       226M   96M  114M  46% /boot
/dev/sdb2       458G  177G   59G  75% /home/luvar/eclipseWorkspaceAndroid
/dev/sdb2       458G  177G   59G  75% /home/luvar/eclipseWorkspaceErlang
/dev/sdb2       458G  177G   59G  75% /home/luvar/programs

root@blackdawn:/home/luvar# dmesg|tail -n 50
[ 8107.693414] attempt to access beyond end of device
[ 8107.693425] sdb2: rw=32, want=480102272, limit=473956352
[ 8107.711854] attempt to access beyond end of device
[ 8107.711863] sdb2: rw=1041, want=480102272, limit=473956352
[ 8107.771103] attempt to access beyond end of device
[ 8107.771114] sdb2: rw=32, want=482410504, limit=473956352
[ 8107.784037] attempt to access beyond end of device
[ 8107.784045] sdb2: rw=1041, want=482410504, limit=473956352
[ 8107.804923] attempt to access beyond end of device
[ 8107.804933] sdb2: rw=32, want=478657496, limit=473956352
[ 8107.817134] attempt to access beyond end of device
[ 8107.817142] sdb2: rw=1041, want=478657496, limit=473956352
[ 8107.835377] attempt to access beyond end of device
[ 8107.835384] sdb2: rw=32, want=480795752, limit=473956352
[ 8107.842977] attempt to access beyond end of device
[ 8107.842985] sdb2: rw=1041, want=480795752, limit=473956352
[ 8107.887768] attempt to access beyond end of device
[ 8107.887778] sdb2: rw=32, want=478931480, limit=473956352
[ 8107.898939] attempt to access beyond end of device
[ 8107.898946] sdb2: rw=1041, want=478931480, limit=473956352
[ 8107.958691] attempt to access beyond end of device
[ 8107.958699] sdb2: rw=32, want=479426840, limit=473956352
[ 8107.966368] attempt to access beyond end of device
[ 8107.966375] sdb2: rw=1041, want=479426840, limit=473956352
[ 8116.097908] attempt to access beyond end of device
[ 8116.097919] sdb2: rw=32, want=478334096, limit=473956352
[ 8116.097923] btrfs_dev_stat_print_on_error: 12 callbacks suppressed
[ 8116.097926] btrfs: bdev /dev/sdb2 errs: wr 638625, rd 65863, flush 0, corrupt 0, gen 0
[ 8116.133108] attempt to access beyond end of device
[ 8116.133118] sdb2: rw=1041, want=478334096, limit=473956352
[ 8116.133124] btrfs: bdev /dev/sdb2 errs: wr 638626, rd 65863, flush 0, corrupt 0, gen 0
[ 8125.065061] attempt to access beyond end of device
[ 8125.065073] sdb2: rw=32, want=481418928, limit=473956352
[ 8125.065077] btrfs: bdev /dev/sdb2 errs: wr 638626, rd 65864, flush 0, corrupt 0, gen 0
[ 8125.084522] attempt to access beyond end of device
[ 8125.084533] sdb2: rw=1041, want=481418928, limit=473956352
[ 8125.084539] btrfs: bdev /dev/sdb2 errs: wr 638627, rd 65864, flush 0, corrupt 0, gen 0
[ 8131.848768] btrfs: relocating block group 472710643712 flags 1
[ 8133.866427] attempt to access beyond end of device
[ 8133.866436] sdb2: rw=0, want=476739152, limit=473956352
[ 8133.866441] btrfs: bdev /dev/sdb2 errs: wr 638627, rd 65865, flush 0, corrupt 0, gen 0
[ 8133.866516] attempt to access beyond end of device
[ 8133.866520] sdb2: rw=0, want=476739152, limit=473956352
[ 8133.866523] btrfs: bdev /dev/sdb2 errs: wr 638627, rd 65866, flush 0, corrupt 0, gen 0
[ 8159.272179] attempt to access beyond end of device
[ 8159.272191] sdb2: rw=32, want=480110048, limit=473956352
[ 8159.272196] btrfs: bdev /dev/sdb2 errs: wr 638627, rd 65867, flush 0, corrupt 0, gen 0
[ 8159.300427] attempt to access beyond end of device
[ 8159.300434] sdb2: rw=1041, want=480110048, limit=473956352
[ 8159.300440] btrfs: bdev /dev/sdb2 errs: wr 638628, rd 65867, flush 0, corrupt 0, gen 0

root@blackdawn:/home/luvar# uname -a
Linux blackdawn 3.13.0-30-generic #55-Ubuntu SMP Fri Jul 4 21:40:53 UTC 2014 
x86_64 x86_64 x86_64 GNU/Linux

root@blackdawn:/home/luvar# btrfs --version
Btrfs v0.20-rc1-189-g704a08c


Am I doing something forbidden (I have not seen any structure where the raid 
type is stored per file/subvolume item), or have I just hit a bug? What should I try?
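For reference, the want/limit pairs in the dmesg output are block-layer offsets, so the size mismatch can be worked out directly; a quick sketch (values copied from the log above, sector size assumed to be the kernel's usual 512 bytes):

```python
SECTOR = 512  # the Linux block layer addresses devices in 512-byte sectors

limit = 473956352  # device size the kernel enforces for sdb2 (from dmesg)
want = 482410504   # largest offset btrfs tried to access (from dmesg)

size_gib = limit * SECTOR / 2**30
overshoot_gib = (want - limit) * SECTOR / 2**30

print(f"sdb2 size:  {size_gib:.2f} GiB")   # the partition is exactly 226 GiB
print(f"overshoot: {overshoot_gib:.2f} GiB")
```

In other words, the filesystem is trying to read and write roughly 3-4 GiB past the end of a 226 GiB partition, which is exactly what the "attempt to access beyond end of device" messages are complaining about.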

PS: I will convert all the data to raid1 eventually, but I want to play around first :-)
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: I need to P. are we almost there yet?

2015-01-02 Thread Austin S Hemmelgarn

On 2014-12-31 12:27, ashf...@whisperpc.com wrote:

Phillip


I had a similar question a year or two ago (specifically about raid10),
so I both experimented and read the code
myself to find out.  I was disappointed to find that it won't do
raid10 on 3 disks, since the chunk metadata describes raid10 as a
stripe layered on top of a mirror.

Jose's point was also a good one though; one chunk may decide to
mirror disks A and B, so it could recover from a failure of A and C,
but a different chunk could choose to mirror on disks A and C, so that
chunk would be lost if A and C fail.  It would probably be nice if the
chunk allocator tried to be more deterministic about that.


I see this as a CRITICAL design flaw.  The reason for calling it CRITICAL
is that System Administrators have been trained for 20 years that RAID-10
can usually handle a dual-disk failure, but the BTRFS implementation has
effectively ZERO chance of doing so.
No, some rather simple math will tell you that a 4 disk BTRFS filesystem 
in raid10 mode has exactly a 50% chance of surviving a dual disk 
failure, and that as the number of disks goes up, the chance of survival 
will asymptotically approach 100% (but never reach it).
This is the case for _every_ RAID-10 implementation that I have ever 
seen, including hardware raid controllers; the only real difference is 
in the stripe length (usually 512 bytes * half the number of disks for 
hardware raid, 4k * half the number of disks for software raid, and the 
filesystem block size (default is 16k in current versions) * half the 
number of disks for BTRFS).







Re: [PATCH] xfstests: btrfs: fix up 001.out

2015-01-02 Thread Filipe David Manana
On Wed, Dec 31, 2014 at 7:48 PM, Anand Jain anand.j...@oracle.com wrote:
 The subvol delete output has changed with btrfs-progs
 -Delete subvolume 'SCRATCH_MNT/snap'
 +Delete subvolume (no-commit): 'SCRATCH_MNT/snap'

 make the matching changes in the xfstests btrfs 001.out

Hi Anand,

This is the wrong approach to fixing it.
With this change the test will now fail with any btrfs-progs
release older than v3.18...

The test should just ignore the output and check that the subvolume
deletion command succeeds.
See how more recent tests do it - they are calling
_run_btrfs_util_prog (which calls run_check).
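The pattern Filipe describes (assert only the exit status, keep version-dependent output out of the golden file) can be sketched as follows; `run_check_sketch` here is a simplified stand-in for the real `run_check` helper in xfstests' common/rc, not its actual implementation:

```shell
#!/bin/sh
# Simplified stand-in for xfstests' run_check: log the command's output
# instead of comparing it against a golden file, and fail only on a
# non-zero exit status.
LOGFILE="${LOGFILE:-/tmp/full.log}"
: > "$LOGFILE"

run_check_sketch() {
    echo "+ $*" >> "$LOGFILE"
    if ! "$@" >> "$LOGFILE" 2>&1; then
        echo "failed: $*"
        exit 1
    fi
}

# Output such as "Delete subvolume 'SCRATCH_MNT/snap'" vs
# "Delete subvolume (no-commit): 'SCRATCH_MNT/snap'" ends up in the log,
# so the test result no longer depends on the btrfs-progs version.
run_check_sketch echo "Delete subvolume 'SCRATCH_MNT/snap'"
echo "ok"
```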

thanks


 Signed-off-by: Anand Jain anand.j...@oracle.com
 ---
  tests/btrfs/001.out | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/tests/btrfs/001.out b/tests/btrfs/001.out
 index c782bde..8dc6eac 100644
 --- a/tests/btrfs/001.out
 +++ b/tests/btrfs/001.out
 @@ -33,7 +33,7 @@ subvol
  Listing subvolumes
  snap
  subvol
 -Delete subvolume 'SCRATCH_MNT/snap'
 +Delete subvolume (no-commit): 'SCRATCH_MNT/snap'
  List root dir
  subvol
  List root dir
 --
 2.0.0.153.g79d




-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: [PATCH 2/2] E2fsprogs: add compress and cow support in chattr, lsattr

2015-01-02 Thread Lutz Vieweg

On 04/18/2011 09:37 AM, liubo wrote:

Modify command 'chattr' and 'lsattr' to support compress and cow.
- use 'C' to indicate NOCOW attribute.


It's kind of confusing for new users that when one sets
 chattr +C someexistingfile
on btrfs, a subsequent
 lsattr someexistingfile
will show the C flag as not set. It takes some
reading to realize that btrfs cannot change the non-COW
flag on files bigger than 0 bytes.

Maybe chattr +C could print a warning if a file
to change attributes for is > 0 bytes long?

Regards,

Lutz Vieweg



Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
  lib/e2p/pf.c |1 +
  lib/ext2fs/ext2_fs.h |1 +
  misc/chattr.1.in |   15 +++
  misc/chattr.c|   15 ++-
  4 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/lib/e2p/pf.c b/lib/e2p/pf.c
index cc50896..c9385dd 100644
--- a/lib/e2p/pf.c
+++ b/lib/e2p/pf.c
@@ -48,6 +48,7 @@ static struct flags_name flags_array[] = {
 	{ FS_TOPDIR_FL, "T", "Top_of_Directory_Hierarchies" },
 	{ EXT4_EXTENTS_FL, "e", "Extents" },
 	{ EXT4_HUGE_FILE_FL, "h", "Huge_file" },
+	{ FS_NOCOW_FL, "C", "NOCOW" },
 	{ 0, NULL, NULL }
  };

diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index 858c103..776be92 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -276,6 +276,7 @@ struct ext2_dx_countlimit {
 #define EXT4_EXTENTS_FL		0x00080000 /* Inode uses extents */
 #define EXT4_EA_INODE_FL		0x00200000 /* Inode used for large EA */
 #define EXT4_EOFBLOCKS_FL		0x00400000 /* Blocks allocated beyond EOF */
+#define FS_NOCOW_FL			0x00800000 /* Do not cow file */
 #define EXT4_SNAPFILE_FL		0x01000000 /* Inode is a snapshot */
 #define EXT4_SNAPFILE_DELETED_FL	0x04000000 /* Snapshot is being deleted */
 #define EXT4_SNAPFILE_SHRUNK_FL		0x08000000 /* Snapshot shrink has completed */
diff --git a/misc/chattr.1.in b/misc/chattr.1.in
index 92f6d70..434eb04 100644
--- a/misc/chattr.1.in
+++ b/misc/chattr.1.in
@@ -19,17 +19,18 @@ chattr \- change file attributes on a Linux file system
  .B chattr
  changes the file attributes on a Linux file system.
  .PP
-The format of a symbolic mode is +-=[acdeijstuADST].
+The format of a symbolic mode is +-=[acdeijstuACDST].
  .PP
  The operator `+' causes the selected attributes to be added to the
  existing attributes of the files; `-' causes them to be removed; and
  `=' causes them to be the only attributes that the files have.
  .PP
-The letters `acdeijstuADST' select the new attributes for the files:
+The letters `acdeijstuACDST' select the new attributes for the files:
append only (a), compressed (c), no dump (d), extent format (e), immutable (i),
  data journalling (j), secure deletion (s), no tail-merging (t),
-undeletable (u), no atime updates (A), synchronous directory updates (D),
-synchronous updates (S), and top of directory hierarchy (T).
+undeletable (u), no atime updates (A), no copy on write (C),
+synchronous directory updates (D), synchronous updates (S),
+and top of directory hierarchy (T).
  .PP
  The following attributes are read-only, and may be listed by
  .BR lsattr (1)
@@ -64,6 +65,10 @@ this file compresses data before storing them on the disk.  Note: please
  make sure to read the bugs and limitations section at the end of this
  document.
  .PP
+A file with the `C' attribute set is marked without COW (copy on write).  Note:
+please make sure to read the bugs and limitations section at the end of this
+document.
+.PP
  When a directory with the `D' attribute set is modified,
  the changes are written synchronously on the disk; this is equivalent to
  the `dirsync' mount option applied to a subset of the files.
@@ -161,6 +166,8 @@ The `c', 's',  and `u' attributes are not honored
  by the ext2 and ext3 filesystems as implemented in the current mainline
Linux kernels.  These attributes may be implemented
  in future versions of the ext2 and ext3 filesystems.
+The `C' attribute is only used in btrfs filesystem in the current mainline
+Linux kernels.
  .PP
  The `j' option is only useful if the filesystem is mounted as ext3.
  .PP
diff --git a/misc/chattr.c b/misc/chattr.c
index 78e3736..8c8231e 100644
--- a/misc/chattr.c
+++ b/misc/chattr.c
@@ -82,7 +82,7 @@ static unsigned long sf;
  static void usage(void)
  {
 	fprintf(stderr,
-		_("Usage: %s [-RVf] [-+=AacDdeijsSu] [-v version] files...\n"),
+		_("Usage: %s [-RVf] [-+=AacDdeijsSuC] [-v version] files...\n"),
 		program_name);
exit(1);
  }
@@ -106,6 +106,7 @@ static const struct flags_char flags_array[] = {
{ FS_UNRM_FL, 'u' },
{ FS_NOTAIL_FL, 't' },
{ FS_TOPDIR_FL, 'T' },
+   { FS_NOCOW_FL, 'C' },
{ 0, 0 }
  };

@@ -159,6 +160,12 @@ static int decode_arg (int * i, int argc, char ** argv)
}
if ((fl = get_flag(*p)) == 0)
 

Re: I need to P. are we almost there yet?

2015-01-02 Thread Austin S Hemmelgarn

On 2015-01-02 12:45, Brendan Hide wrote:

On 2015/01/02 15:42, Austin S Hemmelgarn wrote:

On 2014-12-31 12:27, ashf...@whisperpc.com wrote:

I see this as a CRITICAL design flaw.  The reason for calling it
CRITICAL
is that System Administrators have been trained for 20 years that
RAID-10
can usually handle a dual-disk failure, but the BTRFS implementation has
effectively ZERO chance of doing so.

No, some rather simple math

That's the problem. The math isn't as simple as you'd expect:

The example below is probably a pathological case - but here goes. Let's
say in this 4-disk example that chunks are striped as d1,d2,d1,d2 where
d1 is the first bit of data and d2 is the second:
Chunk 1 might be striped across disks A,B,C,D d1,d2,d1,d2
Chunk 2 might be striped across disks B,C,A,D d3,d4,d3,d4
Chunk 3 might be striped across disks D,A,C,B d5,d6,d5,d6
Chunk 4 might be striped across disks A,C,B,D d7,d8,d7,d8
Chunk 5 might be striped across disks A,C,D,B d9,d10,d9,d10

Lose any two disks and you have a 50% chance on *each* chunk to have
lost that chunk. With traditional RAID10 you have a 50% chance of losing
the array entirely. With btrfs, the more data you have stored, the
chances get closer to 100% of losing *some* data in a 2-disk failure.

In the above example, losing A and B means you lose d3, d6, and d7
(which ends up being 60% of all chunks).
Losing A and C means you lose d1 (20% of all chunks).
Losing A and D means you lose d9 (20% of all chunks).
Losing B and C means you lose d10 (20% of all chunks).
Losing B and D means you lose d2 (20% of all chunks).
Losing C and D means you lose d4,d5, AND d8 (60% of all chunks)

The above skewed example has an average of about a third of all chunks
failed. As you add more data and randomise the allocation, this will
approach 50% - BUT, the chance of losing *some* data is already clearly
shown to be very close to 100%.

OK, I forgot about the randomization effect that the chunk allocation 
and freeing has.  We really should slap a *BIG* warning label on that 
(and ideally find some better way to do it so it's more reliable).


As an aside, I've found that a BTRFS raid1 set on top of 2 LVM/MD RAID0 
sets is actually faster than using a BTRFS raid10 set with the same 
number of disks (how much faster is workload dependent), and provides 
better guarantees than a BTRFS raid10 set.






Re: I need to P. are we almost there yet?

2015-01-02 Thread Brendan Hide

On 2015/01/02 15:42, Austin S Hemmelgarn wrote:

On 2014-12-31 12:27, ashf...@whisperpc.com wrote:
I see this as a CRITICAL design flaw.  The reason for calling it 
CRITICAL
is that System Administrators have been trained for 20 years that 
RAID-10

can usually handle a dual-disk failure, but the BTRFS implementation has
effectively ZERO chance of doing so.

No, some rather simple math

That's the problem. The math isn't as simple as you'd expect:

The example below is probably a pathological case - but here goes. Let's 
say in this 4-disk example that chunks are striped as d1,d2,d1,d2 where 
d1 is the first bit of data and d2 is the second:

Chunk 1 might be striped across disks A,B,C,D d1,d2,d1,d2
Chunk 2 might be striped across disks B,C,A,D d3,d4,d3,d4
Chunk 3 might be striped across disks D,A,C,B d5,d6,d5,d6
Chunk 4 might be striped across disks A,C,B,D d7,d8,d7,d8
Chunk 5 might be striped across disks A,C,D,B d9,d10,d9,d10

Lose any two disks and you have a 50% chance on *each* chunk to have 
lost that chunk. With traditional RAID10 you have a 50% chance of losing 
the array entirely. With btrfs, the more data you have stored, the 
chances get closer to 100% of losing *some* data in a 2-disk failure.


In the above example, losing A and B means you lose d3, d6, and d7 
(which ends up being 60% of all chunks).

Losing A and C means you lose d1 (20% of all chunks).
Losing A and D means you lose d9 (20% of all chunks).
Losing B and C means you lose d10 (20% of all chunks).
Losing B and D means you lose d2 (20% of all chunks).
Losing C and D means you lose d4,d5, AND d8 (60% of all chunks)

The above skewed example has an average of about a third of all chunks 
failed. As you add more data and randomise the allocation, this will 
approach 50% - BUT, the chance of losing *some* data is already clearly 
shown to be very close to 100%.
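The enumeration above can be checked mechanically; a small sketch using the exact placements from the example (a stripe element is lost only when both disks holding its two copies fail). Note that the average over the six failure pairs works out to one third here:

```python
from itertools import combinations

# Stripe placements from the example above: chunk -> list of
# (stripe element, set of the two disks holding its copies).
chunks = {
    1: [("d1", {"A", "C"}), ("d2", {"B", "D"})],   # A,B,C,D  d1,d2,d1,d2
    2: [("d3", {"B", "A"}), ("d4", {"C", "D"})],   # B,C,A,D  d3,d4,d3,d4
    3: [("d5", {"D", "C"}), ("d6", {"A", "B"})],   # D,A,C,B  d5,d6,d5,d6
    4: [("d7", {"A", "B"}), ("d8", {"C", "D"})],   # A,C,B,D  d7,d8,d7,d8
    5: [("d9", {"A", "D"}), ("d10", {"C", "B"})],  # A,C,D,B  d9,d10,d9,d10
}

fractions = []
for failed in combinations("ABCD", 2):
    lost = [c for c, stripes in chunks.items()
            if any(pair == set(failed) for _, pair in stripes)]
    fractions.append(len(lost) / len(chunks))
    print(f"lose {failed}: {len(lost)}/5 chunks gone")

print(f"average: {sum(fractions) / len(fractions):.0%} of chunks lost")
print(f"some data lost in {sum(f > 0 for f in fractions)}/6 failure cases")
```

The per-pair counts (3, 1, 1, 1, 1, 3 of 5 chunks) match the enumeration above, and *every* one of the six two-disk failures loses at least one chunk, which is the real point of the example.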


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: [PATCH v3] btrfs-progs: Documentation: add T/P/E description for resize cmd

2015-01-02 Thread David Sterba
On Thu, Jan 01, 2015 at 08:27:55PM -0700, Chris Murphy wrote:
 Small problem with the rendering of this commit
 d4ef1a06f8be623ae94e4d498c306e8dd1605bef, when I use 'man btrfs
 filesystem' the above portion looks like this:
 
  'K', 'M', 'G', 'T', 'P', or 'E\',
 
 I'm not sure why there's a trailing slash after the E.

Me neither, but it looks like a bug in the asciidoc processing, ends up
in the intermediate xml output. I'll probably drop/change the quoting.

 Separately, for -t option, it reads:
 For start, len, size it is possible to append a suffix like k
 for 1 KBytes, m for 1 MBytes...
 
 So there's a reference of small k and m there, but then later
 references for capitalized KMGTPE, so maybe the reference could be
 more like e.g. LVM where it's described as [bBsSkKmMgGtTpPeE] and just
 omit the sS for sectors since this isn't supported.

Yep, this should be unified.
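The unified suffix handling being suggested (accepting either case for k/K through e/E, with binary multipliers) is easy to state precisely; a minimal sketch, where `parse_size` is a hypothetical helper and not actual btrfs-progs code:

```python
def parse_size(s: str) -> int:
    """Parse a size like '10G' or '10g' into bytes, using binary multipliers."""
    exponents = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5, "e": 6}
    s = s.strip()
    if s and s[-1].lower() in exponents:
        # Case-insensitive suffix: '1K' and '1k' both mean 1024 bytes.
        return int(s[:-1]) * 1024 ** exponents[s[-1].lower()]
    return int(s)  # no suffix: plain bytes

print(parse_size("1K"))                         # 1024
print(parse_size("2g"))                         # 2147483648
print(parse_size("1T") == parse_size("1024G"))  # True
```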


Re: [PATCH] Fixing quota error when removing files from a limit exceeded subvols

2015-01-02 Thread Dongsheng Yang
Hi Khaled,

Could you give us more of a description of the problem this patch
is trying to solve? Maybe an example would help a lot in understanding it.

Thanx

On Fri, Jan 2, 2015 at 7:48 AM, Khaled Ahmed khaled@gmail.com wrote:
 Signed-off-by: Khaled Ahmed khaled@gmail.com
 ---
  fs/btrfs/qgroup.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

 diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
 index 48b60db..b85200d 100644
 --- a/fs/btrfs/qgroup.c
 +++ b/fs/btrfs/qgroup.c
 @@ -2408,14 +2408,14 @@ int btrfs_qgroup_reserve(struct btrfs_root *root, u64 num_bytes)

 		if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
 		    qg->reserved + (s64)qg->rfer + num_bytes >
-		    qg->max_rfer) {
+		    qg->max_rfer - 1) {
 			ret = -EDQUOT;
 			goto out;
 		}

 		if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) &&
 		    qg->reserved + (s64)qg->excl + num_bytes >
-		    qg->max_excl) {
+		    qg->max_excl - 1) {
 			ret = -EDQUOT;
 			goto out;
 		}
 --
 2.1.0
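The one-character change moves the boundary of the limit check: with the original comparison a reservation that lands exactly on the limit is still allowed, while with the patched one it is rejected. A sketch of the two comparisons (illustrative values, not kernel code):

```python
EDQUOT = "EDQUOT"

def reserve(reserved: int, rfer: int, num_bytes: int, max_rfer: int,
            patched: bool) -> str:
    """Mimic the qgroup max_rfer comparison before and after the patch."""
    # patched: total > max_rfer - 1  (i.e. total >= max_rfer fails)
    # original: total > max_rfer     (total == max_rfer still succeeds)
    limit = max_rfer - 1 if patched else max_rfer
    if reserved + rfer + num_bytes > limit:
        return EDQUOT
    return "ok"

# A write that fills the qgroup exactly to its 1000-byte limit:
print(reserve(0, 900, 100, 1000, patched=False))  # ok: 1000 > 1000 is false
print(reserve(0, 900, 100, 1000, patched=True))   # EDQUOT: 1000 > 999 is true
```

This is presumably why Dongsheng asks for an example: whether "exactly at the limit" should succeed or fail is the entire behavioral difference.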



scrub wedged (both running and not running at the same time)

2015-01-02 Thread Roger Binns

I can't start a scrub because it is running, and can't cancel it
because it isn't running!  How do I get out of this state?  OS is
Ubuntu 14.10.

$ uname -r
3.16.0-28-generic

# btrfs scrub start .
ERROR: scrub is already running.
To cancel use 'btrfs scrub cancel .'.
To see the status use 'btrfs scrub status [-d] .'.
# btrfs scrub cancel .
ERROR: scrub cancel failed on .: not running
# btrfs scrub status .
scrub status for b02cc605-dd78-40bc-98a5-8f5543d83b66
scrub started at Mon Nov 17 20:27:17 2014, running for 64491 seconds
total bytes scrubbed: 3.43GiB with 1 errors
error details: read=1
corrected errors: 1, uncorrectable errors: 0, unverified errors: 0

Even a reboot doesn't make this go away.

Roger



Re: [PATCH v3] btrfs-progs: Documentation: add T/P/E description for resize cmd

2015-01-02 Thread David Sterba
On Fri, Jan 02, 2015 at 05:12:04PM +0100, David Sterba wrote:
 On Thu, Jan 01, 2015 at 08:27:55PM -0700, Chris Murphy wrote:
  Small problem with the rendering of this commit
  d4ef1a06f8be623ae94e4d498c306e8dd1605bef, when I use 'man btrfs
  filesystem' the above portion looks like this:
  
   'K', 'M', 'G', 'T', 'P', or 'E\',
  
  I'm not sure why there's a trailing slash after the E.
 
 Me neither, but it looks like a bug in the asciidoc processing.

Seems that only the first ' has to be quoted, and consumes the next
unquoted ' as a pair, so with the last \' the next one is missing and
is printed verbatim:

Fixed by:

-units designators: \'K\', \'M\', \'G\', \'T\', \'P\', or \'E\', which represent
+units designators: \'K', \'M', \'G', \'T', \'P', or \'E', which represent



Re: scrub wedged (both running and not running at the same time)

2015-01-02 Thread David Sterba
On Fri, Jan 02, 2015 at 03:54:55PM -0800, Roger Binns wrote:
 I can't start a scrub because it is running, and can't cancel it
 because it isn't running!  How do I get out of this state?  OS is
 Ubuntu 14.10.

This has been fixed in btrfs-progs 3.16.2 by commit
d5fd05a773e2b19455be7e1208e9003a607483c6


Re: Uncorrectable errors on RAID-1?

2015-01-02 Thread Chris Murphy
On Tue, Dec 30, 2014 at 8:16 PM, Phillip Susi ps...@ubuntu.com wrote:
 Just because I want a raid doesn't mean I need it to operate reliably
 24x7.  For that matter, it has long been established that power
 cycling drives puts more wear and tear on them and as a general rule,
 leaving them on 24x7 results in them lasting longer.

It's not a made to order hard drive industry. Maybe one day you'll be
able to 3D print your own with its own specs.


 And of course you completely ignored, and deleted, my point about
 the difference in warranties.

 Because I don't care?

Sticking fingers in your ears doesn't change the fact there's a
measurable difference in support requirements.


 It's nice and all that they warranty the more
 expensive drive more, and it may possibly even mean that they are
 actually more reliable ( but not likely ), but that doesn't mean that
 the system should have an unnecessarily terrible response to the
 behavior of the cheaper drives.  Is it worth recommending the more
 expensive drives?  Sure... but the system should also handle the
 cheaper drives with grace.

This is architecture astronaut territory.

The system only has a terrible response for two reasons: 1. The user
spec'd the wrong hardware for the use case; 2. The distro isn't
automatically leveraging existing ways to mitigate that user mistake
by changing either SCT ERC on the drives, or the SCSI command timer
for each block device.
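Both mitigations mentioned here are one-liners; a configuration sketch (the device name and the timeout values are illustrative, and not every drive supports SCT ERC):

```shell
# Cap the drive's internal error recovery at 7.0 seconds (smartctl's
# scterc values are in units of 100 ms), so the drive reports a read
# error before the kernel's command timer expires and resets the link:
smartctl -l scterc,70,70 /dev/sdX

# Or, for drives without SCT ERC support, raise the kernel's SCSI
# command timeout (default 30 s) above the drive's worst-case recovery:
echo 180 > /sys/block/sdX/device/timeout
```

Either way the goal is the same: make sure a slow sector surfaces as a read error that md/dm/Btrfs can repair from redundancy, instead of a link reset.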

Now, even though that solution *might* mean long recoveries on
occasion, it's still better than link reset behavior which is what we
have today because it causes the underlying problem to be fixed by
md/dm/Btrfs once the read error is reported. But no distro has
implemented this $500 man hour solution. Instead you're suggesting a
$500,000 fix that will take hundreds of man hours and end user testing
to find all the edge cases. It's like, seriously, WTF?


 Does the SATA specification require configurable SCT ERC? Does it
 require even supporting SCT ERC? I think your argument is flawed
 by mis-distributing the economic burden while simultaneously
 denying one even exists or that these companies should just eat the
 cost differential if it does. In any case the argument is asinine.

 There didn't used to be any such thing; drives simply did not *ever*
 go into absurdly long internal retries so there was no need.  The fact
 that they do these days I consider a misfeature, and one that *can* be
 worked around in software, which is the point here.

Ok well I think that's hubris unless you're a hard drive engineer.
You're referring to how drives behaved over a decade ago, when bad
sectors were persistent rather than remapped, and we had to scan the
drive at format time to build a map so the bad ones wouldn't be used
by the filesystem.


When the encoded data signal weakens, they effectively become
 fuzzy bits. Each read produces different results. Obviously this is
 a very rare condition or there'd be widespread panic. However, it's
 common and expected enough that the drive manufacturers are all, to
 very little varying degree, dealing with this problem in a similar
 way, which is multiple reads.

 Sure, but the noise introduced by the read ( as opposed to the noise
 in the actual signal on the platter ) isn't that large, and so
 retrying 10,000 times isn't going to give any better results than
 retrying say, 100 times, and if the user really desires that many
 retries, they have always been able to do so in the software level
 rather than depending on the drive to try that much.  There is no
 reason for the drives to have increased their internal retries that
 much, and then deliberately withed the essentially zero cost ability
 to limit those internal retries, other than to drive customers to pay
 for the more expensive models.

http://www.seagate.com/files/www-content/support-content/documentation/product-manuals/en-us/Enterprise/Savvio/Savvio%2015K.3/100629381e.pdf

That's a high-end SAS drive. Its default is to retry up to 20 times,
which takes ~1.4 seconds, per sector. But also note how it says
lowering the default increases the unrecoverable error rate. That
makes sense. So even if the probability is low that retrying up to 120
seconds could work, statistically it affects the unrecoverable error
rate positively to increase the default.

If I'm going to be a conspiracy theorist, I'd say the recoveries are
getting longer by default in order to keep the specifications
reporting sane unrecoverable error rates.

Maybe you'd prefer seeing these big, cheap, green drives have
shorter ERC times, with a commensurate reality check in their
unrecoverable error rate, which right now is already two orders of
magnitude higher than enterprise SAS drives. So what if this means
that rate is 3 or 4 orders of magnitude higher?

Now I'm just going to wait for you to suggest that sucks donkey tail
and that the manufacturers should produce drives with the same UER as
drives 10 years ago *and* with the same error