possible bug in balance
Hi, I have today added one device and I have converted metadata to raid1. Then I wanted to convert some data to raid1 as well (with a balance filter) and try whether there is some speedup when reading files (starting programs)... I issued this command:

luvar@blackdawn:~$ sudo time btrfs balance start -dconvert=raid1 -dusage=20 /home/luvar/programs/
[sudo] password for luvar:
ERROR: error during balancing '/home/luvar/programs/' - Input/output error
There may be more info in syslog - try dmesg | tail
Command exited with non-zero status 19
0.00user 0.08system 0:08.29elapsed 1%CPU (0avgtext+0avgdata 768maxresident)k
14696inputs+6584outputs (2major+253minor)pagefaults 0swaps

Part of df command:

luvar@blackdawn:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb2       458G  177G   59G  75% /
/dev/sdb2       458G  177G   59G  75% /home
/dev/sdb1       226M   96M  114M  46% /boot
/dev/sdb2       458G  177G   59G  75% /home/luvar/eclipseWorkspaceAndroid
/dev/sdb2       458G  177G   59G  75% /home/luvar/eclipseWorkspaceErlang
/dev/sdb2       458G  177G   59G  75% /home/luvar/programs

root@blackdawn:/home/luvar# dmesg|tail -n 50
[ 8107.693414] attempt to access beyond end of device
[ 8107.693425] sdb2: rw=32, want=480102272, limit=473956352
[ 8107.711854] attempt to access beyond end of device
[ 8107.711863] sdb2: rw=1041, want=480102272, limit=473956352
[ 8107.771103] attempt to access beyond end of device
[ 8107.771114] sdb2: rw=32, want=482410504, limit=473956352
[ 8107.784037] attempt to access beyond end of device
[ 8107.784045] sdb2: rw=1041, want=482410504, limit=473956352
[ 8107.804923] attempt to access beyond end of device
[ 8107.804933] sdb2: rw=32, want=478657496, limit=473956352
[ 8107.817134] attempt to access beyond end of device
[ 8107.817142] sdb2: rw=1041, want=478657496, limit=473956352
[ 8107.835377] attempt to access beyond end of device
[ 8107.835384] sdb2: rw=32, want=480795752, limit=473956352
[ 8107.842977] attempt to access beyond end of device
[ 8107.842985] sdb2: rw=1041, want=480795752, limit=473956352
[ 8107.887768] attempt to access beyond end of device
[ 8107.887778] sdb2: rw=32, want=478931480, limit=473956352
[ 8107.898939] attempt to access beyond end of device
[ 8107.898946] sdb2: rw=1041, want=478931480, limit=473956352
[ 8107.958691] attempt to access beyond end of device
[ 8107.958699] sdb2: rw=32, want=479426840, limit=473956352
[ 8107.966368] attempt to access beyond end of device
[ 8107.966375] sdb2: rw=1041, want=479426840, limit=473956352
[ 8116.097908] attempt to access beyond end of device
[ 8116.097919] sdb2: rw=32, want=478334096, limit=473956352
[ 8116.097923] btrfs_dev_stat_print_on_error: 12 callbacks suppressed
[ 8116.097926] btrfs: bdev /dev/sdb2 errs: wr 638625, rd 65863, flush 0, corrupt 0, gen 0
[ 8116.133108] attempt to access beyond end of device
[ 8116.133118] sdb2: rw=1041, want=478334096, limit=473956352
[ 8116.133124] btrfs: bdev /dev/sdb2 errs: wr 638626, rd 65863, flush 0, corrupt 0, gen 0
[ 8125.065061] attempt to access beyond end of device
[ 8125.065073] sdb2: rw=32, want=481418928, limit=473956352
[ 8125.065077] btrfs: bdev /dev/sdb2 errs: wr 638626, rd 65864, flush 0, corrupt 0, gen 0
[ 8125.084522] attempt to access beyond end of device
[ 8125.084533] sdb2: rw=1041, want=481418928, limit=473956352
[ 8125.084539] btrfs: bdev /dev/sdb2 errs: wr 638627, rd 65864, flush 0, corrupt 0, gen 0
[ 8131.848768] btrfs: relocating block group 472710643712 flags 1
[ 8133.866427] attempt to access beyond end of device
[ 8133.866436] sdb2: rw=0, want=476739152, limit=473956352
[ 8133.866441] btrfs: bdev /dev/sdb2 errs: wr 638627, rd 65865, flush 0, corrupt 0, gen 0
[ 8133.866516] attempt to access beyond end of device
[ 8133.866520] sdb2: rw=0, want=476739152, limit=473956352
[ 8133.866523] btrfs: bdev /dev/sdb2 errs: wr 638627, rd 65866, flush 0, corrupt 0, gen 0
[ 8159.272179] attempt to access beyond end of device
[ 8159.272191] sdb2: rw=32, want=480110048, limit=473956352
[ 8159.272196] btrfs: bdev /dev/sdb2 errs: wr 638627, rd 65867, flush 0, corrupt 0, gen 0
[ 8159.300427] attempt to access beyond end of device
[ 8159.300434] sdb2: rw=1041, want=480110048, limit=473956352
[ 8159.300440] btrfs: bdev /dev/sdb2 errs: wr 638628, rd 65867, flush 0, corrupt 0, gen 0

root@blackdawn:/home/luvar# uname -a
Linux blackdawn 3.13.0-30-generic #55-Ubuntu SMP Fri Jul 4 21:40:53 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@blackdawn:/home/luvar# btrfs v
Btrfs v0.20-rc1-189-g704a08c

Am I doing something forbidden (I have not seen any structure where the raid type is stored per file/subvolume item), or did I just hit some problem? What should I try?

PS: After all this I will convert all data to raid1, but I want to play first :-)
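For scale, the dmesg numbers above let you compute how far past the end of the device btrfs is trying to go, assuming (as is usual for these "attempt to access beyond end of device" messages) that `want` and `limit` are 512-byte sector counts:

```shell
#!/bin/sh
# Overshoot from the first dmesg pair above, assuming `want` and
# `limit` are 512-byte sector numbers (the usual unit in these
# kernel messages).
want=480102272
limit=473956352
overshoot_sectors=$(( want - limit ))
overshoot_bytes=$(( overshoot_sectors * 512 ))
echo "$overshoot_sectors sectors (~$(( overshoot_bytes / 1024 / 1024 )) MiB) past the end of sdb2"
```

One plausible cause, just a guess from the numbers: the filesystem believes /dev/sdb2 is larger than the block device actually is (for example a partition that shrank without a matching btrfs filesystem resize), so comparing `btrfs filesystem show` output with the partition table would be a sensible first step.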
Re: I need to P. are we almost there yet?
On 2014-12-31 12:27, ashf...@whisperpc.com wrote:
> Phillip
>
>> I had a similar question a year or two ago (specifically about raid10), so I
>> both experimented and read the code myself to find out. I was disappointed
>> to find that it won't do raid10 on 3 disks, since the chunk metadata
>> describes raid10 as a stripe layered on top of a mirror.
>>
>> Jose's point was also a good one though; one chunk may decide to mirror
>> disks A and B, so a failure of A and C it could recover from, but a
>> different chunk could choose to mirror on disks A and C, so that chunk would
>> be lost if A and C fail. It would probably be nice if the chunk allocator
>> tried to be more deterministic about that.
>
> I see this as a CRITICAL design flaw. The reason for calling it CRITICAL is
> that System Administrators have been trained for 20 years that RAID-10 can
> usually handle a dual-disk failure, but the BTRFS implementation has
> effectively ZERO chance of doing so.

No, some rather simple math will tell you that a 4 disk BTRFS filesystem in raid10 mode has exactly a 50% chance of surviving a dual disk failure, and that as the number of disks goes up, the chance of survival will asymptotically approach 100% (but never reach it). This is the case for _every_ RAID-10 implementation that I have ever seen, including hardware raid controllers; the only real difference is in the stripe length (usually 512 bytes * half the number of disks for hardware raid, 4k * half the number of disks for software raid, and the filesystem block size (default is 16k in current versions) * half the number of disks for BTRFS).
Re: [PATCH] xfstests: btrfs: fix up 001.out
On Wed, Dec 31, 2014 at 7:48 PM, Anand Jain anand.j...@oracle.com wrote:
> The subvol delete output has changed with btrfs-progs
>
>  -Delete subvolume 'SCRATCH_MNT/snap'
>  +Delete subvolume (no-commit): 'SCRATCH_MNT/snap'
>
> make the matching changes in the xfstests btrfs 001.out

Hi Anand,

This is a wrong approach to fix it. With this change it means the test will now fail with a btrfs-progs release older than v3.18...

The test should just ignore the output and check if the snapshot creation command succeeds. See how more recent tests do it - they are calling _run_btrfs_util_prog (which calls run_check).

thanks

> Signed-off-by: Anand Jain anand.j...@oracle.com
> ---
>  tests/btrfs/001.out | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tests/btrfs/001.out b/tests/btrfs/001.out
> index c782bde..8dc6eac 100644
> --- a/tests/btrfs/001.out
> +++ b/tests/btrfs/001.out
> @@ -33,7 +33,7 @@ subvol
>  Listing subvolumes
>  snap
>  subvol
> -Delete subvolume 'SCRATCH_MNT/snap'
> +Delete subvolume (no-commit): 'SCRATCH_MNT/snap'
>  List root dir
>  subvol
>  List root dir
> --
> 2.0.0.153.g79d

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
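Filipe's suggestion can be sketched outside the harness. run_check and _run_btrfs_util_prog are real xfstests helpers; the stub below only mimics the relevant behaviour (exit status decides pass/fail, the command's output never reaches the golden .out file):

```shell
#!/bin/sh
# Minimal stand-in for xfstests' run_check: fail on a non-zero exit
# status, but discard the command's output, so wording changes in
# btrfs-progs ("Delete subvolume ..." vs "Delete subvolume
# (no-commit): ...") cannot break the golden-output comparison.
run_check() {
    "$@" >/dev/null 2>&1 || { echo "run_check failed: $*"; exit 1; }
}

# In the real test this would be something like:
#   _run_btrfs_util_prog subvolume delete "$SCRATCH_MNT/snap"
run_check true
echo "silence is golden"
```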
Re: [PATCH 2/2] E2fsprogs: add compress and cow support in chattr, lsattr
On 04/18/2011 09:37 AM, liubo wrote:
> Modify command 'chattr' and 'lsattr' to support compress and cow.
> - use 'C' to indicate NOCOW attribute.

It's kind of confusing for new users that when one sets "chattr +C someexistingfile" on btrfs, a subsequent "lsattr someexistingfile" will show the C flag as not set. It takes some reading to realize that btrfs cannot change the no-COW flag on files bigger than 0 bytes.

Maybe chattr +C could print a warning if a file to change attributes for is not 0 bytes long?

Regards,

Lutz Vieweg

> Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
> ---
>  lib/e2p/pf.c         |  1 +
>  lib/ext2fs/ext2_fs.h |  1 +
>  misc/chattr.1.in     | 15 +++
>  misc/chattr.c        | 15 ++-
>  4 files changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/lib/e2p/pf.c b/lib/e2p/pf.c
> index cc50896..c9385dd 100644
> --- a/lib/e2p/pf.c
> +++ b/lib/e2p/pf.c
> @@ -48,6 +48,7 @@ static struct flags_name flags_array[] = {
>  	{ FS_TOPDIR_FL, "T", "Top_of_Directory_Hierarchies" },
>  	{ EXT4_EXTENTS_FL, "e", "Extents" },
>  	{ EXT4_HUGE_FILE_FL, "h", "Huge_file" },
> +	{ FS_NOCOW_FL, "C", "NOCOW" },
>  	{ 0, NULL, NULL }
>  };
>
> diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
> index 858c103..776be92 100644
> --- a/lib/ext2fs/ext2_fs.h
> +++ b/lib/ext2fs/ext2_fs.h
> @@ -276,6 +276,7 @@ struct ext2_dx_countlimit {
>  #define EXT4_EXTENTS_FL			0x00080000 /* Inode uses extents */
>  #define EXT4_EA_INODE_FL		0x00200000 /* Inode used for large EA */
>  #define EXT4_EOFBLOCKS_FL		0x00400000 /* Blocks allocated beyond EOF */
> +#define FS_NOCOW_FL			0x00800000 /* Do not cow file */
>  #define EXT4_SNAPFILE_FL		0x01000000 /* Inode is a snapshot */
>  #define EXT4_SNAPFILE_DELETED_FL	0x04000000 /* Snapshot is being deleted */
>  #define EXT4_SNAPFILE_SHRUNK_FL		0x08000000 /* Snapshot shrink has completed */
>
> diff --git a/misc/chattr.1.in b/misc/chattr.1.in
> index 92f6d70..434eb04 100644
> --- a/misc/chattr.1.in
> +++ b/misc/chattr.1.in
> @@ -19,17 +19,18 @@ chattr \- change file attributes on a Linux file system
>  .B chattr
>  changes the file attributes on a Linux file system.
>  .PP
> -The format of a symbolic mode is +-=[acdeijstuADST].
> +The format of a symbolic mode is +-=[acdeijstuACDST].
>  .PP
>  The operator `+' causes the selected attributes to be added to the
>  existing attributes of the files; `-' causes them to be removed; and
>  `=' causes them to be the only attributes that the files have.
>  .PP
> -The letters `acdeijstuADST' select the new attributes for the files:
> +The letters `acdeijstuACDST' select the new attributes for the files:
>  append only (a), compressed (c), no dump (d), extent format (e), immutable (i),
>  data journalling (j), secure deletion (s), no tail-merging (t),
> -undeletable (u), no atime updates (A), synchronous directory updates (D),
> -synchronous updates (S), and top of directory hierarchy (T).
> +undeletable (u), no atime updates (A), no copy on write (C),
> +synchronous directory updates (D), synchronous updates (S),
> +and top of directory hierarchy (T).
>  .PP
>  The following attributes are read-only, and may be listed by
>  .BR lsattr (1)
> @@ -64,6 +65,10 @@ this file compresses data before storing them on the disk. Note: please make
>  sure to read the bugs and limitations section at the end of this document.
>  .PP
> +A file with the `C' attribute set is marked without COW (copy on write). Note:
> +please make sure to read the bugs and limitations section at the end of this
> +document.
> +.PP
>  When a directory with the `D' attribute set is modified,
>  the changes are written synchronously on the disk; this is equivalent to the
>  `dirsync' mount option applied to a subset of the files.
> @@ -161,6 +166,8 @@ The `c', `s', and `u' attributes are not honored by the ext2 and ext3
>  filesystems as implemented in the current mainline Linux kernels. These
>  attributes may be implemented in future versions of the ext2 and ext3
>  filesystems.
> +The `C' attribute is only used in btrfs filesystem in the current mainline
> +Linux kernels.
>  .PP
>  The `j' option is only useful if the filesystem is mounted as ext3.
>  .PP
>
> diff --git a/misc/chattr.c b/misc/chattr.c
> index 78e3736..8c8231e 100644
> --- a/misc/chattr.c
> +++ b/misc/chattr.c
> @@ -82,7 +82,7 @@ static unsigned long sf;
>  static void usage(void)
>  {
>  	fprintf(stderr,
> -		_("Usage: %s [-RVf] [-+=AacDdeijsSu] [-v version] files...\n"),
> +		_("Usage: %s [-RVf] [-+=AacDdeijsSuC] [-v version] files...\n"),
>  		program_name);
>  	exit(1);
>  }
> @@ -106,6 +106,7 @@ static const struct flags_char flags_array[] = {
>  	{ FS_UNRM_FL, 'u' },
>  	{ FS_NOTAIL_FL, 't' },
>  	{ FS_TOPDIR_FL, 'T' },
> +	{ FS_NOCOW_FL, 'C' },
>  	{ 0, 0 }
>  };
> @@ -159,6 +160,12 @@ static int decode_arg (int * i, int argc, char ** argv)
>  	}
>  	if ((fl = get_flag(*p)) == 0)
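The warning Lutz proposes is easy to prototype in shell. This is a hypothetical sketch, not part of the patch: the only condition to detect is a non-empty target file, since btrfs honors +C only on zero-length files.

```shell
#!/bin/sh
# Hypothetical wrapper illustrating the proposed warning: before
# setting the NOCOW ('C') flag, warn when a target file already has
# data, because btrfs silently ignores +C on non-empty files.
warn_if_nocow_ineffective() {
    for f in "$@"; do
        if [ -s "$f" ]; then  # -s: file exists and has size greater than zero
            echo "warning: $f is non-empty; btrfs honors +C only on empty files" >&2
        fi
    done
}

# Intended usage:  warn_if_nocow_ineffective FILE...; chattr +C FILE...
```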
Re: I need to P. are we almost there yet?
On 2015-01-02 12:45, Brendan Hide wrote:
> On 2015/01/02 15:42, Austin S Hemmelgarn wrote:
>> On 2014-12-31 12:27, ashf...@whisperpc.com wrote:
>>> I see this as a CRITICAL design flaw. The reason for calling it CRITICAL
>>> is that System Administrators have been trained for 20 years that RAID-10
>>> can usually handle a dual-disk failure, but the BTRFS implementation has
>>> effectively ZERO chance of doing so.
>> No, some rather simple math
> That's the problem. The math isn't as simple as you'd expect: the example
> below is probably a pathological case - but here goes.
>
> Let's say in this 4-disk example that chunks are striped as d1,d2,d1,d2,
> where d1 is the first bit of data and d2 is the second:
> Chunk 1 might be striped across disks A,B,C,D  d1,d2,d1,d2
> Chunk 2 might be striped across disks B,C,A,D  d3,d4,d3,d4
> Chunk 3 might be striped across disks D,A,C,B  d5,d6,d5,d6
> Chunk 4 might be striped across disks A,C,B,D  d7,d8,d7,d8
> Chunk 5 might be striped across disks A,C,D,B  d9,d10,d9,d10
>
> Lose any two disks and you have a 50% chance on *each* chunk to have lost
> that chunk. With traditional RAID10 you have a 50% chance of losing the
> array entirely. With btrfs, the more data you have stored, the chances get
> closer to 100% of losing *some* data in a 2-disk failure.
>
> In the above example,
> losing A and B means you lose d3, d6, and d7 (which ends up being 60% of all chunks).
> Losing A and C means you lose d1 (20% of all chunks).
> Losing A and D means you lose d9 (20% of all chunks).
> Losing B and C means you lose d10 (20% of all chunks).
> Losing B and D means you lose d2 (20% of all chunks).
> Losing C and D means you lose d4, d5, AND d8 (60% of all chunks).
>
> The above skewed example has an average of 40% of all chunks failed. As you
> add more data and randomise the allocation, this will approach 50% - BUT,
> the chances of losing *some* data is already clearly shown to be very close
> to 100%.

OK, I forgot about the randomization effect that the chunk allocation and freeing has.

We really should slap a *BIG* warning label on that (and ideally find some better way to do it so it's more reliable).

As an aside, I've found that a BTRFS raid1 set on top of 2 LVM/MD RAID0 sets is actually faster than using a BTRFS raid10 set with the same number of disks (how much faster is workload dependent), and provides better guarantees than a BTRFS raid10 set.
Re: I need to P. are we almost there yet?
On 2015/01/02 15:42, Austin S Hemmelgarn wrote:
> On 2014-12-31 12:27, ashf...@whisperpc.com wrote:
>> I see this as a CRITICAL design flaw. The reason for calling it CRITICAL is
>> that System Administrators have been trained for 20 years that RAID-10 can
>> usually handle a dual-disk failure, but the BTRFS implementation has
>> effectively ZERO chance of doing so.
> No, some rather simple math

That's the problem. The math isn't as simple as you'd expect: the example below is probably a pathological case - but here goes.

Let's say in this 4-disk example that chunks are striped as d1,d2,d1,d2, where d1 is the first bit of data and d2 is the second:

Chunk 1 might be striped across disks A,B,C,D  d1,d2,d1,d2
Chunk 2 might be striped across disks B,C,A,D  d3,d4,d3,d4
Chunk 3 might be striped across disks D,A,C,B  d5,d6,d5,d6
Chunk 4 might be striped across disks A,C,B,D  d7,d8,d7,d8
Chunk 5 might be striped across disks A,C,D,B  d9,d10,d9,d10

Lose any two disks and you have a 50% chance on *each* chunk to have lost that chunk. With traditional RAID10 you have a 50% chance of losing the array entirely. With btrfs, the more data you have stored, the chances get closer to 100% of losing *some* data in a 2-disk failure.

In the above example,
losing A and B means you lose d3, d6, and d7 (which ends up being 60% of all chunks).
Losing A and C means you lose d1 (20% of all chunks).
Losing A and D means you lose d9 (20% of all chunks).
Losing B and C means you lose d10 (20% of all chunks).
Losing B and D means you lose d2 (20% of all chunks).
Losing C and D means you lose d4, d5, AND d8 (60% of all chunks).

The above skewed example has an average of 40% of all chunks failed. As you add more data and randomise the allocation, this will approach 50% - BUT, the chances of losing *some* data is already clearly shown to be very close to 100%.
--
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
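Brendan's per-pair numbers check out mechanically. In his convention, stripe positions 1 and 3 mirror the first data unit and positions 2 and 4 the second, so each chunk reduces to two mirror pairs (e.g. chunk 1, disks A,B,C,D, keeps d1 on A+C and d2 on B+D). A small sketch enumerating all six possible 2-disk failures:

```shell
#!/bin/sh
# Each chunk below is written as its two mirror pairs (letters sorted):
#   chunk 1  A,B,C,D -> d1 on A+C, d2 on B+D   => AC:BD
#   chunk 2  B,C,A,D -> d3 on B+A, d4 on C+D   => AB:CD
#   chunk 3  D,A,C,B -> d5 on D+C, d6 on A+B   => CD:AB
#   chunk 4  A,C,B,D -> d7 on A+B, d8 on C+D   => AB:CD
#   chunk 5  A,C,D,B -> d9 on A+D, d10 on C+B  => AD:BC
chunks="AC:BD AB:CD CD:AB AB:CD AD:BC"
for failed in AB AC AD BC BD CD; do
    lost=0
    for c in $chunks; do
        # a chunk is lost when both copies of one unit sit on the failed pair
        [ "${c%:*}" = "$failed" ] || [ "${c#*:}" = "$failed" ] && lost=$((lost + 1))
    done
    echo "losing $failed kills $lost/5 chunks"
done
```

Every one of the six failures loses at least one chunk here, which is exactly the point: with randomised per-chunk pairing, the chance that some 2-disk failure loses *some* data climbs towards certainty as the chunk count grows.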
Re: [PATCH v3] btrfs-progs: Documentation: add T/P/E description for resize cmd
On Thu, Jan 01, 2015 at 08:27:55PM -0700, Chris Murphy wrote:
> Small problem with the rendering of this commit
> d4ef1a06f8be623ae94e4d498c306e8dd1605bef, when I use 'man btrfs filesystem'
> the above portion looks like this:
>
>   'K', 'M', 'G', 'T', 'P', or 'E\',
>
> I'm not sure why there's a trailing slash after the E.

Me neither, but it looks like a bug in the asciidoc processing, ends up in the intermediate xml output. I'll probably drop/change the quoting.

> Separately, for -t option, it reads: "For start, len, size it is possible to
> append a suffix like k for 1 KBytes, m for 1 MBytes..."
>
> So there's a reference of small k and m there, but then later references for
> capitalized KMGTPE, so maybe the reference could be more like e.g. LVM where
> it's described as [bBsSkKmMgGtTpPeE] and just omit the sS for sectors since
> this isn't supported.

Yep, this should be unified.
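The unified scheme being discussed (case-insensitive KMGTPE suffixes, binary units) is small enough to sketch. This is illustrative only, not the btrfs-progs parser:

```shell
#!/bin/sh
# Illustrative size-suffix parser for the [kKmMgGtTpPeE] scheme
# discussed above: binary units, case-insensitive, plain numbers pass
# through unchanged. Not the actual btrfs-progs implementation.
parse_size() {
    n=${1%?}   # the value without its final character
    case $1 in
        *[kK]) echo $(( n * 1024 )) ;;
        *[mM]) echo $(( n * 1024 * 1024 )) ;;
        *[gG]) echo $(( n * 1024 * 1024 * 1024 )) ;;
        *[tT]) echo $(( n * 1024 * 1024 * 1024 * 1024 )) ;;
        *[pP]) echo $(( n * 1024 * 1024 * 1024 * 1024 * 1024 )) ;;
        *[eE]) echo $(( n * 1024 * 1024 * 1024 * 1024 * 1024 * 1024 )) ;;
        *)     echo $(( $1 )) ;;
    esac
}

parse_size 10G   # 10737418240
```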
Re: [PATCH] Fixing quota error when removing files from a limit exceeded subvols
Hi Khaled,

Could you give us more description about the problem this patch is trying to solve? Maybe an example will help a lot to understand it.

Thanx

On Fri, Jan 2, 2015 at 7:48 AM, Khaled Ahmed khaled@gmail.com wrote:
> Signed-off-by: Khaled Ahmed khaled@gmail.com
> ---
>  fs/btrfs/qgroup.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 48b60db..b85200d 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2408,14 +2408,14 @@ int btrfs_qgroup_reserve(struct btrfs_root *root, u64 num_bytes)
>  		if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_RFER) &&
>  		    qg->reserved + (s64)qg->rfer + num_bytes >
> -		    qg->max_rfer) {
> +		    qg->max_rfer - 1 ) {
>  			ret = -EDQUOT;
>  			goto out;
>  		}
>
>  		if ((qg->lim_flags & BTRFS_QGROUP_LIMIT_MAX_EXCL) &&
>  		    qg->reserved + (s64)qg->excl + num_bytes >
> -		    qg->max_excl) {
> +		    qg->max_excl - 1) {
>  			ret = -EDQUOT;
>  			goto out;
>  		}
> --
> 2.1.0
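For what it's worth, the mechanical effect of the patch is a one-byte tightening of the limit check: for integers, `x > max - 1` is the same as `x >= max`, so a reservation that lands exactly on the limit now fails with EDQUOT where it previously passed. Whether that actually addresses the delete-on-full-qgroup problem is exactly what Qu is asking. A sketch of the boundary change with made-up numbers (illustrative shell arithmetic, not kernel code):

```shell
#!/bin/sh
# Boundary behaviour before and after the patch: the reservation lands
# exactly on max_rfer (made-up numbers, not real qgroup values).
reserved=0; rfer=900; num_bytes=100; max_rfer=1000
total=$(( reserved + rfer + num_bytes ))

# old check:  total > max_rfer      -> not over, reservation allowed
[ "$total" -gt "$max_rfer" ]         && echo "old: EDQUOT" || echo "old: allowed"
# new check:  total > max_rfer - 1  -> i.e. total >= max_rfer, rejected
[ "$total" -gt $(( max_rfer - 1 )) ] && echo "new: EDQUOT" || echo "new: allowed"
```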
scrub wedged (both running and not running at the same time)
I can't start a scrub because it is running, and can't cancel it because it isn't running! How do I get out of this state? OS is Ubuntu 14.10.

$ uname -r
3.16.0-28-generic

# btrfs scrub start .
ERROR: scrub is already running.
To cancel use 'btrfs scrub cancel .'.
To see the status use 'btrfs scrub status [-d] .'.

# btrfs scrub cancel .
ERROR: scrub cancel failed on .: not running

# btrfs scrub status .
scrub status for b02cc605-dd78-40bc-98a5-8f5543d83b66
	scrub started at Mon Nov 17 20:27:17 2014, running for 64491 seconds
	total bytes scrubbed: 3.43GiB with 1 errors
	error details: read=1
	corrected errors: 1, uncorrectable errors: 0, unverified errors: 0

Even a reboot doesn't make this go away.

Roger
Re: [PATCH v3] btrfs-progs: Documentation: add T/P/E description for resize cmd
On Fri, Jan 02, 2015 at 05:12:04PM +0100, David Sterba wrote:
> On Thu, Jan 01, 2015 at 08:27:55PM -0700, Chris Murphy wrote:
>> Small problem with the rendering of this commit
>> d4ef1a06f8be623ae94e4d498c306e8dd1605bef, when I use 'man btrfs filesystem'
>> the above portion looks like this:
>>
>>   'K', 'M', 'G', 'T', 'P', or 'E\',
>>
>> I'm not sure why there's a trailing slash after the E.
>
> Me neither, but it looks like a bug in the asciidoc processing.

Seems that only the first ' has to be quoted, and it consumes the next unquoted ' as a pair, so with the last \' the next one is missing and is printed verbatim. Fixed by:

-units designators: \'K\', \'M\', \'G\', \'T\', \'P\', or \'E\', which represent
+units designators: \'K', \'M', \'G', \'T', \'P', or \'E', which represent
Re: scrub wedged (both running and not running at the same time)
On Fri, Jan 02, 2015 at 03:54:55PM -0800, Roger Binns wrote:
> I can't start a scrub because it is running, and can't cancel it because it
> isn't running! How do I get out of this state? OS is Ubuntu 14.10.

This has been fixed in btrfs-progs 3.16.2 by commit d5fd05a773e2b19455be7e1208e9003a607483c6
Re: Uncorrectable errors on RAID-1?
On Tue, Dec 30, 2014 at 8:16 PM, Phillip Susi ps...@ubuntu.com wrote:
> Just because I want a raid doesn't mean I need it to operate reliably 24x7.
> For that matter, it has long been established that power cycling drives puts
> more wear and tear on them and as a general rule, leaving them on 24x7
> results in them lasting longer.

It's not a made-to-order hard drive industry. Maybe one day you'll be able to 3D print your own with its own specs.

>> And of course you completely ignored, and deleted, my point about the
>> difference in warranties.
>
> Because I don't care?

Sticking fingers in your ears doesn't change the fact there's a measurable difference in support requirements.

> It's nice and all that they warranty the more expensive drive more, and it
> may possibly even mean that they are actually more reliable (but not
> likely), but that doesn't mean that the system should have an unnecessarily
> terrible response to the behavior of the cheaper drives. Is it worth
> recommending the more expensive drives? Sure... but the system should also
> handle the cheaper drives with grace.

This is architecture astronaut territory. The system only has a terrible response for two reasons:

1. The user spec'd the wrong hardware for the use case;
2. The distro isn't automatically leveraging existing ways to mitigate that user mistake by changing either SCT ERC on the drives, or the SCSI command timer for each block device.

Now, even though that solution *might* mean long recoveries on occasion, it's still better than the link reset behavior, which is what we have today, because it causes the underlying problem to be fixed by md/dm/Btrfs once the read error is reported. But no distro has implemented this $500 man hour solution. Instead you're suggesting a $500,000 fix that will take hundreds of man hours and end user testing to find all the edge cases. It's like, seriously, WTF? Does the SATA specification require configurable SCT ERC? Does it require even supporting SCT ERC?

I think your argument is flawed by mis-distributing the economic burden while simultaneously denying one even exists, or holding that these companies should just eat the cost differential if it does. In any case the argument is asinine.

> There didn't used to be any such thing; drives simply did not *ever* go into
> absurdly long internal retries so there was no need. The fact that they do
> these days I consider a misfeature, and one that *can* be worked around in
> software, which is the point here.

Ok well I think that's hubris unless you're a hard drive engineer. You're referring to how drives behaved over a decade ago, when bad sectors were persistent rather than remapped, and we had to scan the drive at format time to build a map so the bad ones wouldn't be used by the filesystem. When the encoded data signal weakens, they effectively become fuzzy bits. Each read produces different results. Obviously this is a very rare condition or there'd be widespread panic. However, it's common and expected enough that the drive manufacturers are all, to very little varying degree, dealing with this problem in a similar way, which is multiple reads.

> Sure, but the noise introduced by the read (as opposed to the noise in the
> actual signal on the platter) isn't that large, and so retrying 10,000 times
> isn't going to give any better results than retrying, say, 100 times, and if
> the user really desires that many retries, they have always been able to do
> so at the software level rather than depending on the drive to try that
> much. There is no reason for the drives to have increased their internal
> retries that much, and then deliberately withheld the essentially zero cost
> ability to limit those internal retries, other than to drive customers to
> pay for the more expensive models.

http://www.seagate.com/files/www-content/support-content/documentation/product-manuals/en-us/Enterprise/Savvio/Savvio%2015K.3/100629381e.pdf

That's a high end SAS drive. Its default is to retry up to 20 times, which takes ~1.4 seconds, per sector. But also note how it says lowering the default increases the unrecoverable error rate. That makes sense. So even if the probability is low that retrying up to 120 seconds could work, statistically it affects the unrecoverable error rate positively to increase the default.

If I'm going to be a conspiracy theorist, I'd say the recoveries are getting longer by default in order to keep the specifications reporting sane unrecoverable error rates. Maybe you'd prefer seeing these big, cheap, green drives have shorter ERC times, with a commensurate reality check of their unrecoverable error rate, which right now is already two orders of magnitude higher than enterprise SAS drives. So what if this means that rate is 3 or 4 orders of magnitude higher? Now I'm just going to wait for you to suggest that sucks donkey tail and how the manufacturers should produce drives with the same UER as drives 10 years ago *and* with the same error
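The cheap mitigation Chris refers to, adjusting SCT ERC and the kernel's SCSI command timer so the drive gives up before the link gets reset, can be sketched as below. Assumptions: smartmontools is installed, the drive actually supports SCT ERC, and /dev/sda is the right device. By default the script only prints what it would do; set APPLY=1 to run the commands for real (as root).

```shell
#!/bin/sh
# Sketch of the mitigation discussed above. The drive's internal error
# recovery must time out *before* the kernel's SCSI command timer, so
# the drive reports the unreadable sector and md/dm/btrfs can rewrite
# it from the good copy, instead of the whole link being reset.
# Dry run by default; APPLY=1 executes (needs root and SCT ERC support).
DEV=${DEV:-/dev/sda}
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}

# Cap read/write error recovery at 7.0 seconds (scterc units are 100 ms).
run smartctl -l scterc,70,70 "$DEV"
# Keep the kernel's command timer comfortably above that: 30 s > 7 s.
run sh -c "echo 30 > /sys/block/${DEV#/dev/}/device/timeout"
```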