Re: Is it necessary to balance a btrfs raid1 array?

2014-09-11 Thread Duncan
Bob Williams posted on Thu, 11 Sep 2014 10:56:14 +0100 as excerpted:

> So if a RAID1/two disk system uses the disks symmetrically, why did my
> balance command take 22 hours? That's what puzzles me, as my
> understanding of RAID1 is that the disk use *is* symmetrical.

What you're missing is what balance actually /does/.

Balance will take every chunk it sees, data or metadata (with metadata 
including system as well), and rewrite it to a new location.  In its 
simplest conception, that's /all/ it does.

So your 22 hours was the time it took to rewrite-shift, effectively, the 
entire filesystem, one chunk at a time, from one location to another.

Now it so happens that in the process balance does a bunch of other stuff 
too, like combining partially empty chunks of the same type during the 
rewrite and filling them up, so that the rewritten version likely takes 
fewer chunks than the original.  That has the effect of freeing the extra 
chunks back to unallocated space, which is again free to be used for 
either data or metadata, instead of being tied up in chunks that are one 
or the other and can't be switched.

And after adding/deleting devices, that rewrite process balances out 
usage between devices.

And with the convert option (used with -d or -m, below), that rewrite can 
be used to convert the rewritten chunks to a raid layout other than the 
original one.

And with the -d and -m options (plus -s for system), you can limit the 
chunks balance looks at to data or metadata (the latter including system 
as well) instead of all chunks.

And with the usage option (along with -d or -m, above), you can limit the 
chunks looked at to those under a particular percentage fill, thus 
allowing you to do the chunk consolidation more efficiently, without 
taking the time to rewrite ALL chunks of that type as it would otherwise do.
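
For illustration only (the mount point is a placeholder, not a command 
taken from this thread), the filtered and converting forms look like:

  # btrfs balance start -dusage=25 -musage=25 /mnt
  # btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

The first rewrites only data and metadata chunks that are at most 25% 
full; the second rewrites everything, converting both to the raid1 
profile as it goes.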

But bottom line, balance is a chunk rewriter, and you told it to rewrite 
everything on the filesystem, so that's exactly what it did.  And with 
nearly a TB of data on spinning rust, that takes a while -- about 22 hours 
of "a while", in your case!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Is it necessary to balance a btrfs raid1 array?

2014-09-11 Thread Bob Williams
On 10/09/14 19:43, Goffredo Baroncelli wrote:
> On 09/10/2014 02:27 PM, Bob Williams wrote:
>> I have two 2TB disks formatted as a btrfs raid1 array, mirroring both
>> data and metadata. Last night I started
>>
>> # btrfs filesystem balance 
> 
> 
> Maybe I am missing something obvious, but I have to ask what the purpose
> of balancing a two-disk RAID1 system would be.
> The balance command should move data between the disks in order to avoid
> one disk being full while another is empty; but that assumes asymmetrical
> use of the disks, which is not the case for a two-disk RAID1 system.
> 
> If there were more than two disks the situation would be completely
> different, but Bob reports that the system is composed of only two disks.
> 
>>
>> and it is still running 18 hours later. This suggests that most stuff
>> only gets written to one physical device, which in turn suggests that
>> there is a risk of lost data if one physical device fails. Or is
>> there something clever about btrfs raid that I've missed? I've used
>> linux software raid (mdraid) before, and it appeared to write to both
>> devices simultaneously.
>>

So if a RAID1/two disk system uses the disks symmetrically, why did my
balance command take 22 hours? That's what puzzles me, as my
understanding of RAID1 is that the disk use *is* symmetrical.

Bob






Re: Is it necessary to balance a btrfs raid1 array?

2014-09-11 Thread Bob Williams

On 11/09/14 05:30, Zygo Blaxell wrote:
> On Wed, Sep 10, 2014 at 01:27:36PM +0100, Bob Williams wrote:
>> I have two 2TB disks formatted as a btrfs raid1 array, mirroring
>> both data and metadata. Last night I started
>> 
>> # btrfs filesystem balance 
>> 
[...]
> 
>> As a rough guide, how often should one perform
>> 
>> a) balance
> 
> I have a cron job that runs 'btrfs balance resume' or 'btrfs
> balance start' (depending on whether a balance is already in
> progress) nightly at 1AM.  Another cron job comes along at 6AM to
> run 'btrfs balance pause' on my headless servers.  On my desktops
> and laptops I have a daemon that detects keyboard/mouse input and
> does 'btrfs balance pause' when any is detected (the balance
> remains paused until the next day at 1AM, as it is really heavy and
> takes a long time to come to a stop).
> 
[...]

Many thanks to everyone who has contributed to this thread. I have
learnt a lot, and now have weekly cronjobs to balance and scrub.

Bob
-- 
Bob Williams
System:  Linux 3.11.10-21-desktop
Distro:  openSUSE 13.1 (x86_64) with KDE Development Platform: 4.14.0
Uptime:  06:00am up 6 days 15:04, 4 users, load average: 0.02, 0.02, 0.05


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Duncan
Zygo Blaxell posted on Wed, 10 Sep 2014 23:51:19 -0400 as excerpted:

> Spinning disks stop being able to position their heads properly around
> -10C or so, a fact that will be familiar to anyone who's tried to use a
> laptop outside in winter.

Depends on where that winter is.  Here in Phoenix, snow makes news (and 
whatever you do, don't ask Phoenicians to drive in it, they're bad enough 
in rain!) and with the exception of outlying areas, there's now seldom 
even frost in the morning.  -10C or so?  YIKES!

So using a laptop outside in winter here isn't likely to trigger the 
behavior in question.

OTOH, I've personally had the opposite issue: go away for some hours in 
the summer with the computer left running, the AC fails when it's already 
45C in the shade outside, and I come back to a house baking at 55-60C 
inside, a computer still on but of course crashed, and a head-crashed 
disk that likely reached well over 70C...

Still, once I shut down and cooled everything down, the system but for 
the disk was fine, and the disk was fine too, outside the zones where the 
heads happened to be at the time.  I had (unmounted at the time) backup 
partitions on the same disk that I was able to boot and run from for 
several months, until I was able to buy and install a replacement.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Zygo Blaxell
On Wed, Sep 10, 2014 at 01:27:36PM +0100, Bob Williams wrote:
> I have two 2TB disks formatted as a btrfs raid1 array, mirroring both
> data and metadata. Last night I started
> 
> # btrfs filesystem balance 
> 
> and it is still running 18 hours later. This suggests that most stuff
> only gets written to one physical device, which in turn suggests that
> there is a risk of lost data if one physical device fails. Or is there
> something clever about btrfs raid that I've missed? I've used linux
> software raid (mdraid) before, and it appeared to write to both
> devices simultaneously.
> 
> Is it safe to interrupt [^Z] the btrfs balancing process?

The ioctl isn't interruptible, so ^Z won't do much.  You'll need a
separate window to run 'btrfs balance pause' or 'btrfs balance cancel'.
It seems to wait until it's reached the end of a block group before it
actually stops, so it may take a few minutes or a few hours depending
on how much other load you have on the filesystem.
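
For reference (the mount point is a placeholder), the relevant commands are:

  # btrfs balance status /mnt    (show progress of a running balance)
  # btrfs balance pause /mnt     (stop after the current block group; resumable)
  # btrfs balance cancel /mnt    (stop after the current block group; not resumable)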

> As a rough guide, how often should one perform
> 
> a) balance

I have a cron job that runs 'btrfs balance resume' or 'btrfs balance
start' (depending on whether a balance is already in progress) nightly
at 1AM.  Another cron job comes along at 6AM to run 'btrfs balance pause'
on my headless servers.  On my desktops and laptops I have a daemon that
detects keyboard/mouse input and does 'btrfs balance pause' when some
is detected (the balance remains paused until the next day at 1AM,
as it is really heavy and takes a long time to come to a stop).

Using that schedule, a full balance can take weeks to run to completion on
a busy server.  Enough progress is made each day to have some benefit.
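
A minimal sketch of that kind of schedule (/mnt/pool is a placeholder and 
the resume-or-start check is simplified; the actual jobs may differ) as 
/etc/crontab entries:

  0 1 * * * root btrfs balance resume /mnt/pool || btrfs balance start /mnt/pool
  0 6 * * * root btrfs balance pause /mnt/pool

This assumes 'balance resume' exits non-zero when there is nothing paused, 
so the || falls through to starting a fresh balance.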

I keep my disks 90-99% full.  Free space fragmentation can be a huge
performance problem, since it causes severe file fragmentation when
large files are created or modified.  Balancing rearranges the allocated
space so that there are contiguous free spaces big enough that new
data isn't scattered sparsely across the entire surface of the disk.

> b) defragment

I run this once daily on a continuously active 4GB PostgreSQL DB.
I generally don't bother with defragment otherwise.
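
A minimal sketch of that kind of targeted defrag (the path is a 
placeholder, not necessarily the real layout):

  # btrfs filesystem defragment -r /var/lib/postgresql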

I have millions of tiny files that are already unfragmented, and a few
huge files that can't be defragmented without moving around hundreds
of GB of data to make a free contiguous extent.  For me defragment is
usually either pointless or too expensive to be worthwhile, with a
few specific exceptions like highly active large database files.

Defragmentation also does not seem to play well with snapshots and
deduplication.

> c) scrub

Every 14 days.  The ZFS guideline is one scrub every 4 weeks for
enterprise drives, and every week for consumer drives.  I split the
difference.  scrub is relatively fast so it's not too painful to run
it often.  On the opposite weeks I run SMART self-tests on the drives too.

On desktops and laptops I have a daemon that listens for keyboard/mouse
input and pause/resumes scrubs, so I don't have to wait around while
scrub competes for my disk bandwidth.  The servers just get a little
slower for a couple of hours a month.
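
A sketch of that alternating schedule (device, mount point and dates are 
placeholders) as /etc/crontab entries:

  0 2 1,15 * * root btrfs scrub start -B /mnt/pool
  0 2 8,22 * * root smartctl -t long /dev/sda

-B keeps the scrub in the foreground, so cron mail captures the summary 
when it finishes.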

> on a btrfs raid setup?

I use roughly the same policies on all my btrfs filesystems from SSDs
and single spinning disks to six-disk RAID1 arrays.  I don't rebalance
SD cards--they are too fragile for the write load, and chances are good
that they'll wear out before balancing becomes necessary anyway.  On
busy server machines I'll rearrange the balancing hours to avoid high
load times.

> Bob
> - -- 
> Bob Williams
> System:  Linux 3.11.10-21-desktop
> Distro:  openSUSE 13.1 (x86_64) with KDE Development Platform: 4.14.0
> Uptime:  06:00am up 5 days 15:04, 4 users, load average: 1.94, 2.21, 2.36


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Sean Greenslade
On Wed, Sep 10, 2014 at 11:51:19PM -0400, Zygo Blaxell wrote:
> This is a complex topic. 

I agree, and I make no claim to be an expert in any of this.

> Some disks have bugs in their firmware, and some of those bugs make the
> data sheets and most of this discussion entirely moot.  The firmware is
> gonna do what the firmware's gonna do.

Agreed. That's why I like the fact that btrfs provides another layer of
error checking / correction.

> It's a bad idea to try to rewrite a fading sector in some cases.
> If the drive is located in a climate-controlled data center then it
> should be OK; however, there are multiple causes of read failure and
> some of them will also cause writes to damage adjacent data on the disk.
> Spinning disks stop being able to position their heads properly around
> -10C or so, a fact that will be familiar to anyone who's tried to use a
> laptop outside in winter.  Maybe someone dropped the computer, and the
> read errors are due to the heads vibrating with the shock--a read retry
> a few milliseconds later would be OK, but a rewrite (without a delay,
> so the heads are still vibrating from the shock) would just wipe out
> some nearby data with no possibility of recovery.

Of course, the drive can't always know what's going on outside. It just
tries its best (we hope). 

> Most of the reallocations I've observed in the field happen when a
> sector is written, not read.

Very true. I believe what happens is that a sector is marked for
re-allocation when the read fails, and a write to that sector will
trigger the actual reallocation. Hence the "pending reallocations" SMART
attribute.
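
Those counters are visible with smartmontools (the device path is a 
placeholder):

  # smartctl -A /dev/sda | grep -Ei 'reallocat|pending'

which shows Reallocated_Sector_Ct and Current_Pending_Sector, among others.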

> Most disks can search for defects on their own, but the host has to issue a
> SMART command to initiate such a search.  They will also track defect
> rates and log recent error details (with varying degrees of bugginess).

And again, it's up to the questionable firmware's discretion as to how
that search is done / how thorough it is. And it has to be triggered by
the user / script. I don't consider that to really be "on its own," as
btrfs scrub requires the same level of input/scripting.
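
For example, a full surface scan can be requested and checked later with 
(device path is a placeholder):

  # smartctl -t long /dev/sda
  # smartctl -l selftest /dev/sda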

> smartmontools is your friend.  It's not a replacement for btrfs scrub, but
> it collects occasionally useful complementary information about the
> health of the drive.

I can't find the link, but there was a study showing that an alarmingly
high percentage of failed disks reported no SMART errors before failing.

> There used to be a firmware feature for drives to test themselves
> whenever they are spinning and idle for four continuous hours, but most
> modern disks will power themselves down if they are idle for much less
> time...and who has a disk that's idle for four hours at a time anyway?  ;)

My backup destination is touched once a day. It averages about 20 hours
a day idle. Though it probably doesn't need to be testing itself 80% of
the time. That would be a mite excessive =P

> > Scrub your disks, folks. A scrubbed disk is a happy disk.
> 
> Seconded.  Also remember that not all storage errors are due to disk
> failure.  There's a lot of RAM, high-speed signalling, and wire between
> the host CPU and a disk platter.  SMART self-tests won't detect failures
> in those, but scrubs will.

But we'll save the ECC RAM discussion for another day, perhaps.

--Sean


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Zygo Blaxell
On Wed, Sep 10, 2014 at 09:25:17PM -0400, Sean Greenslade wrote:
> On Thu, Sep 11, 2014 at 12:28:56AM +0200, Goffredo Baroncelli wrote:
> > The WD datasheet says something different. It reports "Non-recoverable 
> > read errors per bits read" as less than 1 in 10^14; they express the 
> > number of errors in terms of the number of bits read.
> > 
> > You, instead, are saying that the error rate depends on the disk's age.
> > 
> > These two statements are very different.
> > 
> > (And of course all these values also depend on product quality.)
> 
> I'm not certain how those specs are determined. I was basing my
> statements on knowledge of how read errors occur in rotating media.

This is a complex topic.  Different drives built by the same vendor have
different behavior coded in their firmware (this is why WD drives come in
half a dozen colors).  A consumer drive will keep retrying to read
data and hide errors from the host as long as possible, while a drive
intended for deployment in a RAID array will fail out quickly based on
the assumption that another drive somewhere in the system has a redundant
copy that the host can use to recover the lost data.  Some disks even
support configurable error responses in their firmware.

Some disks have bugs in their firmware, and some of those bugs make the
data sheets and most of this discussion entirely moot.  The firmware is
gonna do what the firmware's gonna do.

It's a bad idea to try to rewrite a fading sector in some cases.
If the drive is located in a climate-controlled data center then it
should be OK; however, there are multiple causes of read failure and
some of them will also cause writes to damage adjacent data on the disk.
Spinning disks stop being able to position their heads properly around
-10C or so, a fact that will be familiar to anyone who's tried to use a
laptop outside in winter.  Maybe someone dropped the computer, and the
read errors are due to the heads vibrating with the shock--a read retry
a few milliseconds later would be OK, but a rewrite (without a delay,
so the heads are still vibrating from the shock) would just wipe out
some nearby data with no possibility of recovery.

> They are both the same, generally. If the sector is damaged (e.g.
> manufacturing fault), then it can do several things. It can always
> return bad data, which will result in a reallocation. It can also
> partially fail. For example, accept the data, but slowly lose it over
> some period of time. It's still due to bad media, but if you were to
> read it quickly enough, you may be able to catch it before it goes bad.
> If the drive catches (and re-writes) it, then it may have staved off
> losing that data that time around. 

Most of the reallocations I've observed in the field happen when a
sector is written, not read.  If bad sectors were reallocated on reads
then repeatedly attempting to read a marginal bad sector would make it go
away as soon as one of the reads is successful.  Also this theory (that
reads correct bad sectors) doesn't match the behavior of SMART statistics
for disks with bad sector counters when they do have read errors.

> Yes, the error rate is almost entirely determined by the manufacturing
> of the physical media. Controllers can attempt to work around that, but
> they won't go searching for media defects on their own (at least, I've
> never seen a drive that does.)

Most disks can search for defects on their own, but the host has to issue a
SMART command to initiate such a search.  They will also track defect
rates and log recent error details (with varying degrees of bugginess).

smartmontools is your friend.  It's not a replacement for btrfs scrub, but
it collects occasionally useful complementary information about the
health of the drive.

There used to be a firmware feature for drives to test themselves
whenever they are spinning and idle for four continuous hours, but most
modern disks will power themselves down if they are idle for much less
time...and who has a disk that's idle for four hours at a time anyway?  ;)

> Disks have latent errors. Nothing you can do will change this, and the
> number of reads you do will not affect the error rate of the media. It
> _will_ affect how often those errors are detected, however. And with
btrfs, this is a Good Thing(TM). If errors are found, they can be
> corrected by either the disk controller itself (on the block level) or
> the filesystem on its level. 
> 
> Scrub your disks, folks. A scrubbed disk is a happy disk.

Seconded.  Also remember that not all storage errors are due to disk
failure.  There's a lot of RAM, high-speed signalling, and wire between
the host CPU and a disk platter.  SMART self-tests won't detect failures
in those, but scrubs will.


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Sean Greenslade
On Thu, Sep 11, 2014 at 12:28:56AM +0200, Goffredo Baroncelli wrote:
> The WD datasheet says something different. It reports "Non-recoverable 
> read errors per bits read" as less than 1 in 10^14; they express the 
> number of errors in terms of the number of bits read.
> 
> You, instead, are saying that the error rate depends on the disk's age.
> 
> These two statements are very different.
> 
> (And of course all these values also depend on product quality.)

I'm not certain how those specs are determined. I was basing my
statements on knowledge of how read errors occur in rotating media.

> I think there are two sources of error:
> - platter/disk degradation (due to ageing, wear...), which may require a 
> sector relocation
> - other sources of error which are not permanent and may be corrected by 
> a second read
> 
> I have no idea which one is bigger (though I suspect the second).

They are both the same, generally. If the sector is damaged (e.g.
manufacturing fault), then it can do several things. It can always
return bad data, which will result in a reallocation. It can also
partially fail. For example, accept the data, but slowly lose it over
some period of time. It's still due to bad media, but if you were to
read it quickly enough, you may be able to catch it before it goes bad.
If the drive catches (and re-writes) it, then it may have staved off
losing that data that time around. 

> > So doing reads, especially across the entire media surface, is a great
> > way to make the disk perform these sector checks. But sometimes the disk
> > cannot correct the error. 
> 
> I read this as: the raw error rate is greater than 1 in 10^14, but the CRC
> plus re-reads and sector remapping lower the error rate to below 1 in 10^14.
> 
> Whether behind this there is a "dumb" drive which returns an error as soon
> as the CRC doesn't match, or a smart drive which retries several times until
> it gets a good value, doesn't matter: the error rate is still 1 in 10^14.

Yes, the error rate is almost entirely determined by the manufacturing
of the physical media. Controllers can attempt to work around that, but
they won't go searching for media defects on their own (at least, I've
never seen a drive that does.)

> > Long story short, reads don't cause media errors, and scrubs help detect
> > errors early.
> 
> Nobody said that reading "causes" a media "error"; however, assuming a
> constant error rate (this is how I read the WD datasheet), if you increase
> the number of reads then you see more errors.
> 
> Maybe I was not clear, but I didn't want to say that "scrubbing reduces
> the life of the disk"; I wanted to point out that the size of the disk
> and the error rate are becoming comparable.

I know that wasn't your implication, but I wanted to be sure that things
weren't misinterpreted. I'll clarify:

Disks have latent errors. Nothing you can do will change this, and the
number of reads you do will not affect the error rate of the media. It
_will_ affect how often those errors are detected, however. And with
btrfs, this is a Good Thing(TM). If errors are found, they can be
corrected by either the disk controller itself (on the block level) or
the filesystem on its level. 

Scrub your disks, folks. A scrubbed disk is a happy disk.

--Sean


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Goffredo Baroncelli
On 09/10/2014 09:32 PM, Sean Greenslade wrote:
> On Wed, Sep 10, 2014 at 08:43:25PM +0200, Goffredo Baroncelli wrote:
>> Maybe I am missing something obvious, but I have to ask what the purpose
>> of balancing a two-disk RAID1 system would be.
>> The balance command should move data between the disks in order to avoid
>> one disk being full while another is empty; but that assumes asymmetrical
>> use of the disks, which is not the case for a two-disk RAID1 system.
> 
> Balancing is not necessarily about data distribution between two disks.
> You can balance a single disk BTRFS partition. It's more about balancing
> how the data / metadata chunks are allocated and used. It also (during a
> re-write of a chunk) honors the RAID rules of that chunk type.

True, I forgot that you can balance at the chunk level.
> 
>> *scrub
>> Regarding scrub, be aware that some (consumer) disks are only guaranteed 
>> for a non-recoverable error rate of less than 1 per 10^14 [1] bits read. 
>> 10^14 bits is something like 10TB. This means that if you read your whole 
>> system 5 times, you may get a bad bit. I suppose these are very 
>> conservative numbers, so the likelihood of an undetected error is (I hope) 
>> lower. But I am also inclined to think these numbers are evaluated under 
>> ideal conditions (in terms of temperature, voltage, vibration); this means 
>> that the truth might be worse.
>>
>> So if you compare these numbers with your average throughput, 
>> you can estimate the likelihood of an error. Bear in mind that a scrub
>> job means reading all your data: if you have 1TB of data and you perform
>> a scrub each week, in three months you reach 10^14 bits read.
>>
>> This explains the interest in higher redundancy levels (raid6 or more).
>>  
>> G.Baroncelli
> 
> I think there is a bit of misunderstanding here. Those disk error rates
> are latent media errors. They're a function of production quality of the
> platters and the amount of time the data rests on the drive. Reads do
> not affect this, and in fact, can actually help reduce the error rate. 

The WD datasheet says something different. It reports "Non-recoverable 
read errors per bits read" as less than 1 in 10^14; they express the 
number of errors in terms of the number of bits read.

You, instead, are saying that the error rate depends on the disk's age.

These two statements are very different.

(And of course all these values also depend on product quality.)
 
> When a hard drive does a read, it also reads the CRC values for the
> sector that it just read. If it matches, the drive passes it on as good
> data. If not, it attempts error correction on it. If it can correct the
> error, it will return the corrected data and (hopefully) re-write the
> data on the disk to fix the error "permanently." I use quotes because
> this could mean that that zone of media is damaged, and it will probably
> error again. The disk will eventually re-allocate a sector that
> repeatedly returns bad data. This is what you want to happen.

I think there are two sources of error:
- platter/disk degradation (due to ageing, wear...), which may require a 
sector relocation
- other sources of error which are not permanent and may be corrected by 
a second read

I have no idea which one is bigger (though I suspect the second).

> So doing reads, especially across the entire media surface, is a great
> way to make the disk perform these sector checks. But sometimes the disk
> cannot correct the error. 

I read this as: the raw error rate is greater than 1 in 10^14, but the CRC
plus re-reads and sector remapping lower the error rate to below 1 in 10^14.

Whether behind this there is a "dumb" drive which returns an error as soon 
as the CRC doesn't match, or a smart drive which retries several times until
it gets a good value, doesn't matter: the error rate is still 1 in 10^14.

> Then the controller (if it is well-behaved)
> will return a read error, or sometimes just bunk data. If the BTRFS
> scrub sees bad data, it will detect it with its checksums, and if in a
> RAID configuration, be able to locate a good copy of the data to
> restore. 
 
> Long story short, reads don't cause media errors, and scrubs help detect
> errors early.

Nobody said that reading "causes" a media "error"; however, assuming a 
constant error rate (this is how I read the WD datasheet), if you increase 
the number of reads then you see more errors.

Maybe I was not clear, but I didn't want to say that "scrubbing reduces 
the life of the disk"; I wanted to point out that the size of the disk 
and the error rate are becoming comparable.

> 
> --Sean
> 
Goffredo

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Sean Greenslade
On Wed, Sep 10, 2014 at 08:43:25PM +0200, Goffredo Baroncelli wrote:
> Maybe I am missing something obvious, but I have to ask what the purpose
> of balancing a two-disk RAID1 system would be.
> The balance command should move data between the disks in order to avoid
> one disk being full while another is empty; but that assumes asymmetrical
> use of the disks, which is not the case for a two-disk RAID1 system.

Balancing is not necessarily about data distribution between two disks.
You can balance a single disk BTRFS partition. It's more about balancing
how the data / metadata chunks are allocated and used. It also (during a
re-write of a chunk) honors the RAID rules of that chunk type.
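
The chunk-level view is what btrfs filesystem df reports; illustrative 
output for a two-disk raid1 filesystem (the numbers are made up):

  # btrfs filesystem df /mnt
  Data, RAID1: total=890.00GiB, used=885.21GiB
  System, RAID1: total=32.00MiB, used=128.00KiB
  Metadata, RAID1: total=4.00GiB, used=2.87GiB

Balance works on those chunk allocations, not on individual files.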

> *scrub
> Regarding scrub, be aware that some (consumer) disks are only guaranteed 
> for a non-recoverable error rate of less than 1 per 10^14 [1] bits read. 
> 10^14 bits is something like 10TB. This means that if you read your whole 
> system 5 times, you may get a bad bit. I suppose these are very 
> conservative numbers, so the likelihood of an undetected error is (I hope) 
> lower. But I am also inclined to think these numbers are evaluated under 
> ideal conditions (in terms of temperature, voltage, vibration); this means 
> that the truth might be worse.
> 
> So if you compare these numbers with your average throughput, 
> you can estimate the likelihood of an error. Bear in mind that a scrub
> job means reading all your data: if you have 1TB of data and you perform
> a scrub each week, in three months you reach 10^14 bits read.
> 
> This explains the interest in higher redundancy levels (raid6 or more).
>  
> G.Baroncelli

I think there is a bit of misunderstanding here. Those disk error rates
are latent media errors. They're a function of production quality of the
platters and the amount of time the data rests on the drive. Reads do
not affect this, and in fact, can actually help reduce the error rate. 

When a hard drive does a read, it also reads the CRC values for the
sector that it just read. If it matches, the drive passes it on as good
data. If not, it attempts error correction on it. If it can correct the
error, it will return the corrected data and (hopefully) re-write the
data on the disk to fix the error "permanently." I use quotes because
this could mean that that zone of media is damaged, and it will probably
error again. The disk will eventually re-allocate a sector that
repeatedly returns bad data. This is what you want to happen.

So doing reads, especially across the entire media surface, is a great
way to make the disk perform these sector checks. But sometimes the disk
cannot correct the error. Then the controller (if it is well-behaved)
will return a read error, or sometimes just bunk data. If the BTRFS
scrub sees bad data, it will detect it with its checksums, and if in a
RAID configuration, be able to locate a good copy of the data to
restore. 

Long story short, reads don't cause media errors, and scrubs help detect
errors early.

--Sean


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Goffredo Baroncelli
On 09/10/2014 02:27 PM, Bob Williams wrote:
> I have two 2TB disks formatted as a btrfs raid1 array, mirroring both
> data and metadata. Last night I started
> 
> # btrfs filesystem balance 


Maybe I am missing something obvious, but I have to ask what the purpose 
of balancing a two-disk RAID1 system would be.
The balance command should move data between the disks in order to avoid 
one disk being full while another is empty; but that assumes asymmetrical 
use of the disks, which is not the case for a two-disk RAID1 system.

If there were more than two disks the situation would be completely 
different, but Bob reports that the system is composed of only two disks.

> 
> and it is still running 18 hours later. This suggests that most stuff
> only gets written to one physical device, which in turn suggests that
> there is a risk of lost data if one physical device fails. Or is
> there something clever about btrfs raid that I've missed? I've used
> linux software raid (mdraid) before, and it appeared to write to both
> devices simultaneously.
> 
> Is it safe to interrupt [^Z] the btrfs balancing process?
> 
> As a rough guide, how often should one perform
> 
> a) balance b) defragment c) scrub
> on a btrfs raid setup?

*defrag
I don't have any hard rule for that. However, I made a systemd unit
which defrags /var each day (for files bigger than 5M). It helps a 
lot with some critical files like the systemd journal and/or the 
apt-get/deb databases.
From time to time I defrag /usr, but without any rule.
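
One way to wire that up (the path, the size threshold implementation and 
the schedule are assumptions; the actual unit may differ) is a daily 
systemd timer or cron job running something like:

  # find /var -xdev -type f -size +5M -exec btrfs filesystem defragment {} +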

*scrub
Regarding scrub, be aware that some (consumer) disks are only guaranteed 
for a non-recoverable error rate of less than 1 per 10^14 [1] bits read. 
10^14 bits is something like 10TB. This means that if you read your whole 
system 5 times, you may get a bad bit. I suppose these are very 
conservative numbers, so the likelihood of an undetected error is (I hope) 
lower. But I am also inclined to think these numbers are evaluated under 
ideal conditions (in terms of temperature, voltage, vibration); this means 
that the truth might be worse.

So if you compare these numbers with your average throughput, 
you can estimate the likelihood of an error. Bear in mind that a scrub 
job means reading all your data: if you have 1TB of data and you perform 
a scrub each week, in three months you reach 10^14 bits read.

This explains the interest in higher redundancy levels (raid6 or more).
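
The arithmetic behind those figures, as a quick sanity check (taking the 
1-in-10^14-bits figure at face value):

  $ echo '10^14 / 8 / 10^12' | bc -l      # ~12.5 TB read per expected error
  $ echo '10^14 / (8 * 10^12)' | bc -l    # weekly 1TB scrubs: ~12.5 weeks, i.e. about 3 months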
 
G.Baroncelli

[1] 
- http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771442.pdf
- 
http://forums.storagereview.com/index.php/topic/31688-western-digital-red-nas-hard-drive-review-discussion/
> 
> Bob

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Bob Williams

On 10/09/14 14:06, Austin S Hemmelgarn wrote:
> On 2014-09-10 08:27, Bob Williams wrote:
>> I have two 2TB disks formatted as a btrfs raid1 array, mirroring
>> both data and metadata. Last night I started
>> 
>> # btrfs filesystem balance 
>> 
> In general, unless things are really bad, you don't ever want to
> use balance on such a big filesystem without some filters to
> control what gets balanced (especially if the filesystem is more
> than about 50% full most of the time).
> 
Thank you. These disks are in an external SATA II enclosure, and I
use them for backups. They contain about 6 subvolumes, each containing
rsynced data and about 50 snapshots. btrfs fi show says that I've used
890GiB out of 1.82TiB, so approaching 50%.

> My suggestion in this case would be to use: # btrfs balance start
> -dusage=25 -musage=25  on a roughly weekly basis.  This will
> only balance chunks that are less than 25% full, and therefore run
> much faster.  If you are particular about high storage efficiency,
> then try 50 instead of 25.
>> and it is still running 18 hours later. This suggests that most
>> stuff only gets written to one physical device, which in turn
>> suggests that there is a risk of lost data if one physical device
>> fails. Or is there something clever about btrfs raid that I've
>> missed? I've used linux software raid (mdraid) before, and it
>> appeared to write to both devices simultaneously.
> The reason that a full balance takes so long on a big (and I'm
> assuming based on the 18 hours it's taken, very full) filesystem is
> that it reads all of the data, and writes it out to both disks, but
> it doesn't do very good load-balancing like mdraid or LVM do.  I've
> got a 4x 500GiB BTRFS RAID10 filesystem that I use for my home
> directory on my desktop system, and a full balance on that takes
> about 6 hours.

See above re how full the filesystem is. The process finished after
about 22 hours, with the message:

"Done, had to relocate 1230 out of 1230 chunks"
>> 
>> Is it safe to interrupt [^Z] the btrfs balancing process?
> ^Z sends a SIGSTOP, which is a really bad idea with something that
> is doing low-level stuff to a filesystem.  If you need to stop the
> balance process (and are using a recent enough kernel and
> btrfs-progs), the preferred way to do so is to use the following
> from another terminal: # btrfs balance cancel  Depending on
> what the balance operation is working on when you do this, it may take
> a few minutes before it actually stops (the longest that I've seen
> it take is ~200 seconds).
>> 
>> As a rough guide, how often should one perform
>> 
>> a) balance b) defragment c) scrub
>> 
>> on a btrfs raid setup?
> In general, you should be running scrub regularly, and balance and 
> defragment as needed.  On the BTRFS RAID filesystems that I have, I
> use the following policy: 1) Run a 25% balance (the command I
> mentioned above) on a weekly basis. 2) If the filesystem has less
> than 50% of either the data or metadata chunks full at the end of
> the month, run a full balance on it. 3) Run a scrub on a daily
> basis. 4) Defragment files only as needed (which isn't often for me
> because I use the autodefrag mount option). 5) Make sure that only
> one of balance, scrub or defrag is running at a given time.

Useful advice, thanks. I'm already doing a weekly scrub on my / and
/home partitions. I'll try adding the 25% balance routine as well.

Bob

-- 
Bob Williams
System:  Linux 3.11.10-21-desktop
Distro:  openSUSE 13.1 (x86_64) with KDE Development Platform: 4.14.0
Uptime:  06:00am up 5 days 15:04, 4 users, load average: 1.94, 2.21, 2.36


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Austin S Hemmelgarn
On 2014-09-10 09:48, Rich Freeman wrote:
> On Wed, Sep 10, 2014 at 9:06 AM, Austin S Hemmelgarn
>  wrote:
>> Normally, you shouldn't need to run balance at all on most BTRFS
>> filesystems, unless your usage patterns vary widely over time (I'm
>> actually a good example of this, most of the files in my home directory
>> are relatively small, except for when I am building a system with
>> buildroot or compiling a kernel, and on occasion I have VM images that
>> I'm working with).
> 
> Tend to agree, but I do keep a close eye on free space.  If I get to
> the point where over 90% of the space is allocated to chunks while lots
> of it sits unused inside them, I run a balance.  I tend to have the most problems
> with my root/OS filesystem running on a 64GB SSD, likely because it is
> so small.
> 
> Is there a big performance penalty running mixed chunks on an SSD?  I
> believe this would get rid of the risk of ENOSPC issues if everything
> gets allocated to chunks.  There are obviously no issues with random
> access on an SSD, but there could be other problems (cache
> utilization, etc).
There shouldn't be any more of a performance penalty than there normally
is for running mixed chunks.  Also, a 64GB SSD is not small; I use a pair of
64GB SSDs in a BTRFS RAID1 configuration for root on my desktop, and
consistently use less than a quarter (12G on average) of the available
space, and that's with stuff like LibreOffice and the entire OpenClipart
distribution (although I'm not running an 'enterprise' distribution, and
keep /tmp and /var/tmp on tmpfs).
> 
> I tend to watch btrfs fi show and if the total space used starts
> getting high then I run a balance.  Usually I run with -dusage=30 or
> -dusage=50, but sometimes I get to the point where I just need to do a
> full balance.  Often it is helpful to run a series of balance commands
> starting at -dusage=10 and moving up in increments.  This at least
> prevents killing IO continuously for hours.  If we can get to a point
> where balancing can operate at low IO priority that would be helpful.
> 
> IO priority is a problem in btrfs in general.  Even tasks run at idle
> scheduling priority can really block up a disk.  I've seen a lot of
> hurry-and-wait behavior in btrfs.  It seems like the initial commit to
> the log/etc is willing to accept a very large volume of data, and then
> when all the trees get updated the system grinds to a crawl trying to
> deal with all the data that was committed.  The problem is that you
> have two queues, with the second queue being rate-limiting but the
> first queue being the one that applies priority control.  What we
> really need is for the log to have controls on how much it accepts so
> that the updating of the trees/etc never is rate-limiting.   That will
> limit the ability to have short IO write bursts, but it would prevent
> low-priority writes from blocking high-priority read/writes.

You know, you can pretty easily control bandwidth utilization just using
cgroups.  This is what I do, and I get much better results with cgroups
and the deadline IO scheduler than I ever did with CFQ.  Abstract
priorities are not bad for controlling relative CPU utilization, but
they really suck for IO scheduling.
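
A rough sketch of the cgroup-v1 approach (the device major:minor, the 
20MB/s limit, the paths and the balance filter are all assumptions, not 
details taken from this thread):

  # mkdir /sys/fs/cgroup/blkio/btrfs-maint
  # echo "8:0 20971520" > /sys/fs/cgroup/blkio/btrfs-maint/blkio.throttle.read_bps_device
  # echo "8:0 20971520" > /sys/fs/cgroup/blkio/btrfs-maint/blkio.throttle.write_bps_device
  # echo $$ > /sys/fs/cgroup/blkio/btrfs-maint/tasks
  # btrfs balance start -dusage=50 /mnt

Note that part of the balance work happens in kernel worker threads that 
are not in the shell's cgroup, so the throttle is only a partial brake.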





Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Rich Freeman
On Wed, Sep 10, 2014 at 9:06 AM, Austin S Hemmelgarn
 wrote:
> Normally, you shouldn't need to run balance at all on most BTRFS
> filesystems, unless your usage patterns vary widely over time (I'm
> actually a good example of this, most of the files in my home directory
> are relatively small, except for when I am building a system with
> buildroot or compiling a kernel, and on occasion I have VM images that
> I'm working with).

Tend to agree, but I do keep a close eye on free space.  If I get to
the point where over 90% of the space is allocated to chunks while lots
of it sits unused inside them, I run a balance.  I tend to have the most
problems with my root/OS filesystem running on a 64GB SSD, likely because
it is so small.

Is there a big performance penalty running mixed chunks on an SSD?  I
believe this would get rid of the risk of ENOSPC issues if everything
gets allocated to chunks.  There are obviously no issues with random
access on an SSD, but there could be other problems (cache
utilization, etc).

I tend to watch btrfs fi show and if the total space used starts
getting high then I run a balance.  Usually I run with -dusage=30 or
-dusage=50, but sometimes I get to the point where I just need to do a
full balance.  Often it is helpful to run a series of balance commands
starting at -dusage=10 and moving up in increments.  This at least
prevents killing IO continuously for hours.  If we can get to a point
where balancing can operate at low IO priority that would be helpful.
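
As a sketch (the mount point is a placeholder), the incremental approach 
is just:

  # for u in 10 25 50 75; do btrfs balance start -dusage=$u /mnt; done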

IO priority is a problem in btrfs in general.  Even tasks run at idle
scheduling priority can really block up a disk.  I've seen a lot of
hurry-and-wait behavior in btrfs.  It seems like the initial commit to
the log/etc is willing to accept a very large volume of data, and then
when all the trees get updated the system grinds to a crawl trying to
deal with all the data that was committed.  The problem is that you
have two queues, with the second queue being rate-limiting but the
first queue being the one that applies priority control.  What we
really need is for the log to have controls on how much it accepts so
that the updating of the trees/etc never is rate-limiting.   That will
limit the ability to have short IO write bursts, but it would prevent
low-priority writes from blocking high-priority read/writes.

--
Rich


Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Austin S Hemmelgarn
On 2014-09-10 08:27, Bob Williams wrote:
> I have two 2TB disks formatted as a btrfs raid1 array, mirroring both
> data and metadata. Last night I started
> 
> # btrfs filesystem balance 
> 
In general, unless things are really bad, you don't ever want to use
balance on such a big filesystem without some filters to control what
gets balanced (especially if the filesystem is more than about 50% full
most of the time).

My suggestion in this case would be to use:
# btrfs balance start -dusage=25 -musage=25 
on a roughly weekly basis.  This will only balance chunks that are less
than 25% full, and therefore run much faster.  If you are particular
about high storage efficiency, then try 50 instead of 25.
> and it is still running 18 hours later. This suggests that most stuff
> only gets written to one physical device, which in turn suggests that
> there is a risk of lost data if one physical device fails. Or is there
> something clever about btrfs raid that I've missed? I've used linux
> software raid (mdraid) before, and it appeared to write to both
> devices simultaneously.
The reason that a full balance takes so long on a big (and I'm assuming
based on the 18 hours it's taken, very full) filesystem is that it reads
all of the data, and writes it out to both disks, but it doesn't do very
good load-balancing like mdraid or LVM do.  I've got a 4x 500GiB BTRFS
RAID10 filesystem that I use for my home directory on my desktop system,
and a full balance on that takes about 6 hours.
> 
> Is it safe to interrupt [^Z] the btrfs balancing process?
^Z sends a SIGSTOP, which is a really bad idea with something that is
doing low-level stuff to a filesystem.  If you need to stop the balance
process (and are using a recent enough kernel and btrfs-progs), the
preferred way to do so is to use the following from another terminal:
# btrfs balance cancel 
Depending on what the balance operation is working on when you do this, it
may take a few minutes before it actually stops (the longest that I've
seen it take is ~200 seconds).
> 
> As a rough guide, how often should one perform
> 
> a) balance
> b) defragment
> c) scrub
> 
> on a btrfs raid setup?
In general, you should be running scrub regularly, and balance and
defragment as needed.  On the BTRFS RAID filesystems that I have, I use
the following policy:
1) Run a 25% balance (the command I mentioned above) on a weekly basis.
2) If the filesystem has less than 50% of either the data or metadata
chunks full at the end of the month, run a full balance on it.
3) Run a scrub on a daily basis.
4) Defragment files only as needed (which isn't often for me because I
use the autodefrag mount option).
5) Make sure that only one of balance, scrub or defrag is running at a
given time.
Normally, you shouldn't need to run balance at all on most BTRFS
filesystems, unless your usage patterns vary widely over time (I'm
actually a good example of this, most of the files in my home directory
are relatively small, except for when I am building a system with
buildroot or compiling a kernel, and on occasion I have VM images that
I'm working with).





Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Bob Williams

I have two 2TB disks formatted as a btrfs raid1 array, mirroring both
data and metadata. Last night I started

# btrfs filesystem balance 

and it is still running 18 hours later. This suggests that most stuff
only gets written to one physical device, which in turn suggests that
there is a risk of lost data if one physical device fails. Or is there
something clever about btrfs raid that I've missed? I've used linux
software raid (mdraid) before, and it appeared to write to both
devices simultaneously.

Is it safe to interrupt [^Z] the btrfs balancing process?

As a rough guide, how often should one perform

a) balance
b) defragment
c) scrub

on a btrfs raid setup?

Bob
-- 
Bob Williams
System:  Linux 3.11.10-21-desktop
Distro:  openSUSE 13.1 (x86_64) with KDE Development Platform: 4.14.0
Uptime:  06:00am up 5 days 15:04, 4 users, load average: 1.94, 2.21, 2.36