Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread David Holland
On Sat, Dec 01, 2012 at 11:38:55PM -0500, Mouse wrote:
   things.  What I care about is the largest size sector that will (in
   ^^^
   the ordinary course of things anyway) be written atomically.
   Then those are 512-byte-sector drives [...]
   No; because I can do 4K atomic writes, I want to know about that.
  
  And, can't you do that with traditional drives, drives which really do
  have 512-byte sectors?  Do a 4K transfer and you write 8 physical
  sectors with no opportunity for any other operation to see the write
  partially done.  Is that wrong, or am I missing something else?

Insert a kernel panic (or power failure(*)) after five sectors and
it's not atomic. One sector, at least in theory(*), is.

(*) let's ignore for now the various daft things that disks sometimes
do in practice.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread David Holland
On Mon, Dec 03, 2012 at 12:19:58AM +0000, Julian Yon wrote:
  You appear to have just agreed with me, which makes me wonder what I'm
  missing, given you continue as though you disagree.

You asked why 4096-byte-sector disks accept 512-byte writes. I was
trying to explain.

   However, we're talking about hardware here, so you have to also
   consider the possibility that the drive firmware reports 512 because
   that's what someone coded up back in 1992 and nobody got around to
   fixing it.
  
  If that doesn't count as broken, what does? (Also, gosh, when did 1992
  become so long ago?)

By this standard, most hardware is broken.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread Thor Lancelot Simon
On Tue, Dec 04, 2012 at 02:14:27PM +0000, David Holland wrote:
 On Sat, Dec 01, 2012 at 11:38:55PM -0500, Mouse wrote:
things.  What I care about is the largest size sector that will (in
^^^
the ordinary course of things anyway) be written atomically.
Then those are 512-byte-sector drives [...]
No; because I can do 4K atomic writes, I want to know about that.
   
   And, can't you do that with traditional drives, drives which really do
   have 512-byte sectors?  Do a 4K transfer and you write 8 physical
   sectors with no opportunity for any other operation to see the write
   partially done.  Is that wrong, or am I missing something else?
 
 Insert a kernel panic (or power failure(*)) after five sectors and

What's a kernel panic got to do with it?  If you hand the controller
and thus the drive a 4K write, the kernel panicking won't suddenly cause
you to reverse time and have issued 8 512-byte writes instead.

Given how drives actually write data, I would not be so sanguine
that any sector, of whatever size, in-flight when the power fails,
is actually written with the values you expect, or not written
at all.



Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread David Holland
On Tue, Dec 04, 2012 at 09:26:17AM -0500, Thor Lancelot Simon wrote:
 And, can't you do that with traditional drives, drives which really do
 have 512-byte sectors?  Do a 4K transfer and you write 8 physical
 sectors with no opportunity for any other operation to see the write
 partially done.  Is that wrong, or am I missing something else?
   
   Insert a kernel panic (or power failure(*)) after five sectors and
  
  What's a kernel panic got to do with it?  If you hand the controller
  and thus the drive a 4K write, the kernel panicking won't suddenly cause
  you to reverse time and have issued 8 512-byte writes instead.

That depends on additional properties of the pathway from the FS to
the drive firmware. It might have sent 1 of 2 2048-byte writes before
the panic, for example. Or it might be a vintage controller incapable
of handling more than one sector at a time.

Also, if there's a panic while the kernel is in the middle of talking
to the drive, such that the drive receives only part of the data you
intended to send, one can be reasonably certain it will reject a
partial sector... but if it's received 5 of 8 physical sectors and the
6th is partial, it may well write out those 5, which isn't what was
intended.

  Given how drives actually write data, I would not be so sanguine
  that any sector, of whatever size, in-flight when the power fails,
  is actually written with the values you expect, or not written
  at all.

Yes, I'm aware of that. It remains a useful approximation, especially
for already-existing FS code.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-04 Thread David Laight
On Tue, Dec 04, 2012 at 02:57:52PM +0000, David Holland wrote:
   
   What's a kernel panic got to do with it?  If you hand the controller
   and thus the drive a 4K write, the kernel panicking won't suddenly cause
   you to reverse time and have issued 8 512-byte writes instead.
 
 That depends on additional properties of the pathway from the FS to
 the drive firmware. It might have sent 1 of 2 2048-byte writes before
 the panic, for example. Or it might be a vintage controller incapable
 of handling more than one sector at a time.

The ATA command set supports writes of multiple sectors and multi-sector
writes (probably not using those terms though!).

In the first case, although a single command is issued, the drive
will (effectively) loop through the sectors, writing them one by one.
All drives support this mode.

For multi-sector writes, the data transfer for each group of sectors
is done as a single burst. So if the drive supports 8-sector multi-sector
writes, and you are doing PIO transfers, you take a single 'data'
interrupt and then write all 4k bytes at once (assuming 512-byte sectors).
The drive identify response indicates whether multi-sector writes are
supported, and if so how many sectors can be written at once.
If the data transfer is DMA, it probably makes little difference to the
driver.
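
For concreteness, roughly what that decode looks like -- a sketch, not the
actual NetBSD wd/ata driver code; it assumes the 256 IDENTIFY words are
already in idw[] in host byte order, word numbers are the ones the ATA spec
uses, and the values in main() are invented examples:

#include <stdint.h>
#include <stdio.h>

/* Decode READ/WRITE MULTIPLE support from ATA IDENTIFY data. */
static void
show_multi(const uint16_t idw[256])
{
	unsigned max_multi = idw[47] & 0xff;  /* max sectors per DRQ block */
	unsigned cur_multi = idw[59] & 0xff;  /* current SET MULTIPLE value */
	int cur_valid = (idw[59] >> 8) & 1;   /* bit 8: current value valid */

	if (max_multi == 0)
		printf("READ/WRITE MULTIPLE not supported\n");
	else
		printf("multi-sector: up to %u sectors/interrupt, current %u%s\n",
		    max_multi, cur_multi, cur_valid ? "" : " (invalid)");
}

int
main(void)
{
	uint16_t idw[256] = { 0 };

	idw[47] = 0x8010;  /* example: up to 16 sectors per DRQ block */
	idw[59] = 0x0108;  /* example: currently set to 8, valid */
	show_multi(idw);
	return 0;
}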

For quite a long time the NetBSD ata driver mixed the two up - and would
only request writes of multiple sectors if the drive supported multi-sector
writes.

Multi-sector writes are probably quite difficult to kill part way through
since there is only one DMA transfer block.

   Given how drives actually write data, I would not be so sanguine
   that any sector, of whatever size, in-flight when the power fails,
   is actually written with the values you expect, or not written
   at all.
 
 Yes, I'm aware of that. It remains a useful approximation, especially
 for already-existing FS code.

Given that (AFAIK) a physical sector is not dissimilar from an HDLC frame,
once the write has started the old data is gone; if the write is actually
interrupted you'll get a (correctable) bad sector.
If you are really unlucky the write will run long - and trash the
following sector (I once managed to power off a floppy controller before it
wrecked the rest of a track, after I'd reset the writer with write enabled).
If you are really, really unlucky I think it is possible to destroy
adjacent tracks.

David

-- 
David Laight: da...@l8s.co.uk


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-03 Thread Michael van Elst
mo...@rodents-montreal.org (Mouse) writes:

 things.  What I care about is the largest size sector that will (in
 ^^^
 the ordinary course of things anyway) be written atomically.
 Then those are 512-byte-sector drives [...]
 No; because I can do 4K atomic writes, I want to know about that.

And, can't you do that with traditional drives, drives which really do
have 512-byte sectors?  Do a 4K transfer and you write 8 physical
sectors with no opportunity for any other operation to see the write
partially done.  Is that wrong, or am I missing something else?

The drive could partially complete the write, i.e. if one of the
later sectors has a write error or if the drive is powered down
in the middle of the operation.

Sure, you would know about it. But in case of a crash you can't rely
on data consistency.

-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-02 Thread Julian Yon
On Sun, 2 Dec 2012 04:04:23 +0000
David Holland dholland-t...@netbsd.org wrote:

 On Sun, Dec 02, 2012 at 03:22:24AM +0000, Julian Yon wrote:
It's not weird, and there is a gain; it's for compatibility with
large amounts of deployed code that assumes all devices have
512-byte blocks.
   
   If code makes that assumption, how does the reported block size
   affect that? Lying is illogical. Code either assumes a specific
   size (and ignores what you tell it), or it believes what it's
   told. Either way, dishonesty gains nothing.
 
 If code just blindly makes that assumption, it's ignoring what's being
 reported.

You appear to have just agreed with me, which makes me wonder what I'm
missing, given you continue as though you disagree.

 I assume there is or was code in Windows (like we used to have code in
 NetBSD) that would check the sector size and refuse to run if it
 wasn't 512.

IMHO any time you do the same thing as Windows, you're almost certainly
doing it wrong.

 However, we're talking about hardware here, so you have to also
 consider the possibility that the drive firmware reports 512 because
 that's what someone coded up back in 1992 and nobody got around to
 fixing it.

If that doesn't count as broken, what does? (Also, gosh, when did 1992
become so long ago?)


Julian

-- 
3072D/F3A66B3A Julian Yon (2012 General Use) pgp.2...@jry.me




Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-02 Thread Michael van Elst
mo...@rodents-montreal.org (Mouse) writes:

 These disks lie about their actual sector size.
 These disks just follow their specification.

That's as meaningless as...oh, to pick an unreasonably extreme example,
a hitman saying "I was just following orders".

Apparently as meaningless as saying "lies about".

 They also report the true sector size.

Not according to the documentation, at least not in the one case I
investigated.  The documentation flat-out says the sector size is 4K,
but the disk claims to have half-K sectors.

The problem is that there are two sizes here,

That's why the disk has multiple attributes that it can report.

Neither.  The sector size claimed to the host should equal both the
sector size on the media and the granularity of the interface.

Apparently that doesn't work out :)

Anything else is IMO a bug in the drive and should be treated as such,
which in NetBSD's case I would say means a quirk entry, documented as
being a workaround for broken hardware, for it.

Believing the drive when it says it has standard sector sizes works fine.

-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-02 Thread Michael van Elst
jul...@yon.org.uk (Julian Yon) writes:

If it's smaller than the atomic write size that's equally weird.
Because that implies that the designers have made the explicit decision
to sacrifice performance for no gain.

The gain of course is that people can use the drive and will buy it.

-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Michael van Elst
On Fri, Nov 30, 2012 at 12:00:52PM +0000, David Laight wrote:
 On Fri, Nov 30, 2012 at 08:00:51AM +0000, Michael van Elst wrote:
  da...@l8s.co.uk (David Laight) writes:
  
  I must look at how to determine that disks have 4k sectors and to
  ensure filesystems have 4k fragments - regardless of the fs size.
  
  newfs should already ensure that fragment >= sector.
 
 These disks lie about their actual sector size.
 The disk's own software does RMW cycles for 512 byte writes.

These disks just follow their specification. They also report
the true sector size. The problem is how to interpret it: obviously
you can access the disk in 512-byte units, and the real size and
alignment just affect performance. So should the disk lie about
the blocks you can address, or lie about some recommended block size
for accesses?

The rest of the world just ignores such problems by using some
values that are sufficiently sized/aligned for old and new disks.
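
(For ATA disks both numbers really are in the IDENTIFY data: word 106
encodes how many logical sectors make up a physical one, and words 117-118
carry the logical sector size when it isn't 512 bytes. A sketch of the
decode, assuming idw[] already holds the 256 IDENTIFY words; word numbers
per ATA8-ACS, example values in main() invented:)

#include <stdint.h>
#include <stdio.h>

struct secsize { uint32_t logical; uint32_t physical; };

/* Decode logical/physical sector sizes from ATA IDENTIFY data. */
static struct secsize
decode_sector_size(const uint16_t idw[256])
{
	struct secsize s = { 512, 512 };
	uint16_t w106 = idw[106];

	if ((w106 & 0xc000) == 0x4000) {  /* bit 14 set, bit 15 clear: valid */
		if (w106 & (1 << 12))     /* logical sector > 256 words */
			s.logical = (((uint32_t)idw[118] << 16) | idw[117]) * 2;
		if (w106 & (1 << 13))     /* >1 logical sector per physical */
			s.physical = s.logical << (w106 & 0x0f);
	}
	return s;
}

int
main(void)
{
	uint16_t idw[256] = { 0 };
	struct secsize s;

	idw[106] = 0x4000 | (1 << 13) | 3;  /* example: 2^3 logical/physical */
	s = decode_sector_size(idw);
	printf("logical %u, physical %u\n", s.logical, s.physical);  /* 512/4096 */
	return 0;
}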


Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Mouse
 These disks lie about their actual sector size.
 These disks just follow their specification.

That's as meaningless as...oh, to pick an unreasonably extreme example,
a hitman saying "I was just following orders".

 They also report the true sector size.

Not according to the documentation, at least not in the one case I
investigated.  The documentation flat-out says the sector size is 4K,
but the disk claims to have half-K sectors.

The problem is that there are two sizes here, which have historically
been identical: the sector size on the media and the granularity of the
interface.  Trouble is, they were identical for good reason.  I
consider decoupling them slightly broken.  I consider decoupling them
without updating the interface to report both sizes cripplingly broken.

 So should the disk lie about the blocks you can address or lie about
 some recommended block size for accesses?

Neither.  The sector size claimed to the host should equal both the
sector size on the media and the granularity of the interface.

Either that or a new interface should be defined which reports both the
media sector size and the interface grain size.

Anything else is IMO a bug in the drive and should be treated as such,
which in NetBSD's case I would say means a quirk entry, documented as
being a workaround for broken hardware, for it.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML   mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread David Holland
On Sat, Dec 01, 2012 at 04:27:14PM -0500, Mouse wrote:
  Neither.  The sector size claimed to the host should equal both the
  sector size on the media and the granularity of the interface.

As a consumer of block devices, I don't care about either of these
things. What I care about is the largest size sector that will (in
the ordinary course of things anyway) be written atomically.

I might also care about larger sizes that the drive considers
significant for alignment purposes; but probably not very much.

I don't care about the block granularity of the interface. (Unless I
suppose it's larger than the atomic write size; but that would be
weird.)

I care even less about how the media is organized internally; if it
announces that the atomic write size is 1024 bytes, it's 1024 bytes,
even if it really means that it is writing one bit each to 8192 steel
drum spindles.

Now, we have legacy code that contains additional assumptions, such as
the belief that the atomic write size is the same from device to
device, or that it can be set at newfs time rather than being a
dynamic/run-time property of the block device. And we have a lot of
code that uses DEV_BSIZE as a convenient unit of measurement and mixes
it indiscriminately with other device size properties. However, all
this stuff should be cleaned up in the long term.
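
A tiny illustration of that unit-mixing hazard (btodb() and DEV_BSIZE are
the real sys/param.h definitions; the block size and block number are
invented example values):

#include <sys/param.h>  /* DEV_BSIZE, btodb() */
#include <stdio.h>

int
main(void)
{
	unsigned fs_bsize = 16384;  /* fs block size (example) */
	daddr_t fsblk = 100;        /* an fs block number (example) */

	/* right: convert via bytes, then to DEV_BSIZE units */
	daddr_t db = btodb((off_t)fsblk * fs_bsize);

	/* wrong: an fs block number used as if it were already in
	 * DEV_BSIZE units - silently off by fs_bsize/DEV_BSIZE */
	daddr_t bogus = fsblk;

	printf("right=%lld bogus=%lld (factor %u)\n", (long long)db,
	    (long long)bogus, fs_bsize / DEV_BSIZE);
	return 0;
}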

It may also be necessary for lower-level code (e.g. the scsi layer) to
know more than this, but any of that can be isolated underneath the
block device interface.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Mouse
 Neither.  The sector size claimed to the host should equal both the
 sector size on the media and the granularity of the interface.
 As a consumer of block devices, I don't care about either of these
 things.  What I care about is the largest size sector that will (in
 the ordinary course of things anyway) be written atomically.

Then those are 512-byte-sector drives as far as you're concerned; you
can ignore the 4K reality.  At least, absent bugs in the drives, but
that's always a valid caveat.

This is because the RMW cycle that goes on internally for sub-4K writes
is invisible: a 512-byte write always either has completed in full or
has not yet started at all as far as all other interactions with the
drive go.  That is, such writes (and reads) are atomic.

It's a coherent point of view.  But it's one I don't share; I care more
about performance than that.  This is why I care about visibility into
internal organization.

 I might also care about larger sizes that the drive considers
 significant for alignment purposes; but probably not very much.

That depends on whether you care about performance.

 I don't care about the block granularity of the interface.

Don't you pretty much have to care about it, since that's the unit in
which data addresses are presented to its interface?  Or is that
something you believe should be hidden by...something else?  (It's not
clear to me exactly what the `you' that doesn't care about interface
granularity includes - hardware driver authors? filesystem authors?
midlayer (eg scsipi) authors?)

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML   mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Julian Yon

On Sat, 1 Dec 2012 23:46:07 +0000
David Holland dholland-t...@netbsd.org wrote:

 I don't care about the block granularity of the interface. (Unless I
 suppose it's larger than the atomic write size; but that would be
 weird.)

If it's smaller than the atomic write size that's equally weird.
Because that implies that the designers have made the explicit decision
to sacrifice performance for no gain. But there is a cost: they had to
write firmware code to emulate that block size.


Julian

-- 
3072D/F3A66B3A Julian Yon (2012 General Use) pgp.2...@jry.me




Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread David Holland
On Sat, Dec 01, 2012 at 07:07:36PM -0500, Mouse wrote:
   Neither.  The sector size claimed to the host should equal both the
   sector size on the media and the granularity of the interface.
   As a consumer of block devices, I don't care about either of these
   things.  What I care about is the largest size sector that will (in
   ^^^
   the ordinary course of things anyway) be written atomically.
  
  Then those are 512-byte-sector drives as far as you're concerned; you
  can ignore the 4K reality.  At least, absent bugs in the drives, but
  that's always a valid caveat.

No; because I can do 4K atomic writes, I want to know about that.
(Quite apart from any performance issues.)

Physical realities pretty much guarantee that the largest atomic write
is not going to cause a RMW cycle... at least on items that are
actually block-based. 

RAIDs where you have to RMW a whole stripe or something but it isn't
atomic might be a somewhat different story. I'm not sure how one would
build a journaling FS on one of those without having it suck. (I
guess by stuffing the journal into NVRAM.)

   I don't care about the block granularity of the interface.
  
  Don't you pretty much have to care about it, since that's the unit in
  which data addresses are presented to its interface?  Or is that
  something you believe should be hidden by...something else?

That is something only the device driver should have to be aware of.

(There's an implicit assumption here that block devices should be
addressed with byte offsets, as they are from userland, even though
this typically wastes a dozen or so bits; the minor overhead is far
preferable to the confusion that arises when you have multiple size
units floating around, and the consequences of just one bug that mixes
block offsets measured in different block sizes can be catastrophic.)
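
(A sketch of what such a boundary might look like - hypothetical names, not
NetBSD's actual bdev API - just to make the single-unit point concrete:)

#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical byte-addressed block device: all offsets and lengths
 * in bytes, so consumers never handle a block-size unit at all. */
struct bdev {
	uint32_t secsize;  /* known only to the driver */
};

/* The one place a sector size appears: inside the driver. */
static daddr_t
byteoff_to_sector(const struct bdev *bd, off_t byteoff)
{
	/* real code would reject offsets not sector-aligned */
	return (daddr_t)(byteoff / bd->secsize);
}

int
main(void)
{
	struct bdev d = { 4096 };

	printf("byte offset 1048576 -> sector %lld\n",
	    (long long)byteoff_to_sector(&d, 1048576));
	return 0;
}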

  (It's not clear to me exactly what the `you' that doesn't care
  about interface granularity includes - hardware driver authors?
  filesystem authors?  midlayer (eg scsipi) authors?)

I'm speaking from a filesystem point of view; but, more specifically,
I'm talking about the abstraction we call a block device, which sits
above stuff like scsipi.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread David Holland
On Sun, Dec 02, 2012 at 01:32:17AM +0000, Julian Yon wrote:
   I don't care about the block granularity of the interface. (Unless I
   suppose it's larger than the atomic write size; but that would be
   weird.)
  
  If it's smaller than the atomic write size that's equally weird.
  Because that implies that the designers have made the explicit decision
  to sacrifice performance for no gain. But there is a cost: they had to
  write firmware code to emulate that block size.

It's not weird, and there is a gain; it's for compatibility with large
amounts of deployed code that assumes all devices have 512-byte blocks.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Mouse
 things.  What I care about is the largest size sector that will (in
 ^^^
 the ordinary course of things anyway) be written atomically.
 Then those are 512-byte-sector drives [...]
 No; because I can do 4K atomic writes, I want to know about that.

And, can't you do that with traditional drives, drives which really do
have 512-byte sectors?  Do a 4K transfer and you write 8 physical
sectors with no opportunity for any other operation to see the write
partially done.  Is that wrong, or am I missing something else?

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML   mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-30 Thread Michael van Elst
da...@l8s.co.uk (David Laight) writes:

I must look at how to determine that disks have 4k sectors and to
ensure filesystems have 4k fragments - regardless of the fs size.

newfs should already ensure that fragment >= sector.

By the sound of it the log ought to be written in fs frag (or block)
sized chunks - even if that means that 'pad' entries get written
in order to flush it to disk after a period of inactivity.

WAPBL?

-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-30 Thread David Laight
On Fri, Nov 30, 2012 at 08:00:51AM +0000, Michael van Elst wrote:
 da...@l8s.co.uk (David Laight) writes:
 
 I must look at how to determine that disks have 4k sectors and to
 ensure filesystems have 4k fragments - regardless of the fs size.
 
 newfs should already ensure that fragment >= sector.

These disks lie about their actual sector size.
The disk's own software does RMW cycles for 512 byte writes.

David

-- 
David Laight: da...@l8s.co.uk


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-30 Thread David Holland
On Fri, Nov 30, 2012 at 12:00:52PM +0000, David Laight wrote:
   I must look at how to determine that disks have 4k sectors and to
   ensure filesystems have 4k fragments - regardless of the fs size.
   
   newfs should already ensure that fragment >= sector.
  
  These disks lie about their actual sector size.
  The disk's own software does RMW cycles for 512 byte writes.

Right, and it's important for FS code to be able to figure out what
the right atomic write size is... on the disk it's using, which might
not be the same as the disk that was newfs'd.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-29 Thread Mouse
 I must look at how to determine that disks have 4k sectors and to
 ensure filesystems have 4k fragments - regardless of the fs size.

Seems to me the right thing is to believe what the disk tells you.  If
you really want to be friendly to broken hardware, add a quirk for
disks known to lie about their sector size.  (Yes, I consider it broken
for a disk to lie about its sector size.)

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML   mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Greg Troxel

Edgar Fuß e...@math.uni-bonn.de writes:

 I seem to be facing two problems:
 
 1. A certain svn update command is ridiculously slow on my to-be file server.
 2. During the svn update, the machine partially locks up and fails to
    respond to NFS requests.
 Thanks to very kind help by hannken@, I now at least know what the problem is.

 Short form: WAPBL is currently completely unusable on RAIDframe (I always
 suspected something like that), at least on non-Level 0 sets.

 The problem turned out to be wapbl_flush() writing non-fsbsize chunks on
 non-fsbsize boundaries. So RAIDframe is nearly sure to RMW.
 That makes the log get written to disc at about 1MB/s with the write lock
 on the log held. So everything else on that fs tstiles on the log's
 read lock.

Do you see this on RAID-1 too?

I wonder if it's possible (easily) to make the log only use fsbsize
boundaries (maybe forcing it to be bigger as a side effect).




Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Edgar Fuß
 Do you see this on RAID-1 too?
Well, I see a performance degradation, albeit not as much as on Level 5.

 I wonder if it's possible (easily) to make the log only use fsbsize
 boundaries (maybe forcing it to be bigger as a side effect).
Volunteers welcome.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread J. Hannken-Illjes
On Nov 28, 2012, at 6:02 PM, Greg Troxel g...@ir.bbn.com wrote:

 
 Edgar Fuß e...@math.uni-bonn.de writes:
 
 I seem to be facing two problems:
 
 1. A certain svn update command is ridiculously slow on my to-be file server.
 2. During the svn update, the machine partially locks up and fails to
    respond to NFS requests.
 Thanks to very kind help by hannken@, I now at least know what the problem
 is.
 
 Short form: WAPBL is currently completely unusable on RAIDframe (I always
 suspected something like that), at least on non-Level 0 sets.
 
 The problem turned out to be wapbl_flush() writing non-fsbsize chunks on
 non-fsbsize boundaries. So RAIDframe is nearly sure to RMW.
 That makes the log get written to disc at about 1MB/s with the write lock
 on the log held. So everything else on that fs tstiles on the log's
 read lock.
 
 Do you see this on RAID-1 too?
 
 I wonder if it's possible (easily) to make the log only use fsbsize
 boundaries (maybe forcing it to be bigger as a side effect).


Sure -- add a fsbsize-sized buffer to struct wapbl and teach wapbl_write()
to collect data until the buffer's start or end touches a fsbsize boundary.

As long as the writes don't cross the log's end they already come ordered.
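
For concreteness, a minimal sketch of that buffering (hypothetical names and
types, not the real struct wapbl; it assumes log writes arrive strictly in
order, and pads the tail at commit - the 'pad entries' idea mentioned
earlier in the thread):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct logbuf {
	uint8_t  *buf;      /* fsbsize bytes of staging space */
	uint32_t  fsbsize;  /* file system block size */
	uint64_t  off;      /* device byte offset of buf[0], block-aligned */
	uint32_t  fill;     /* bytes collected so far */
};

/* Stub standing in for the real device write path. */
static void
issue_write(uint64_t devoff, const void *data, uint32_t len)
{
	(void)data;
	printf("write %u bytes at device offset %llu\n", len,
	    (unsigned long long)devoff);
}

/* Collect sequential log data; only ever write whole, aligned blocks. */
static void
log_write(struct logbuf *lb, const uint8_t *data, uint32_t len)
{
	while (len > 0) {
		uint32_t n = lb->fsbsize - lb->fill;

		if (n > len)
			n = len;
		memcpy(lb->buf + lb->fill, data, n);
		lb->fill += n;
		data += n;
		len -= n;
		if (lb->fill == lb->fsbsize) {  /* boundary touched: flush */
			issue_write(lb->off, lb->buf, lb->fsbsize);
			lb->off += lb->fsbsize;
			lb->fill = 0;
		}
	}
}

/* At commit, pad the tail so the device never sees a sub-block write. */
static void
log_commit(struct logbuf *lb)
{
	if (lb->fill > 0) {
		memset(lb->buf + lb->fill, 0, lb->fsbsize - lb->fill);
		issue_write(lb->off, lb->buf, lb->fsbsize);
		lb->off += lb->fsbsize;
		lb->fill = 0;
	}
}

int
main(void)
{
	struct logbuf lb = { malloc(4096), 4096, 0, 0 };
	uint8_t rec[3000] = { 0 };

	log_write(&lb, rec, sizeof(rec));  /* buffered; nothing written yet */
	log_write(&lb, rec, sizeof(rec));  /* crosses 4K: one aligned write */
	log_commit(&lb);                   /* pads and writes the tail */
	free(lb.buf);
	return 0;
}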

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Brian Buhrow
Hello.  If running 5.1 or 5.2 is acceptable for you, you could run
ffs+softdep since it has all the namei fixes in it.
-Brian

On Nov 28,  5:15pm, Edgar Fuß wrote:
} Subject: Problem identified: WAPL/RAIDframe performance problems
}  I seem to be facing two problems:
}  
}  1. A certain svn update command is ridiculously slow on my to-be file server.
}  2. During the svn update, the machine partially locks up and fails to
}     respond to NFS requests.
} Thanks to very kind help by hannken@, I now at least know what the problem is.
} 
} Short form: WAPBL is currently completely unusable on RAIDframe (I always
} suspected something like that), at least on non-Level 0 sets.
} 
} The problem turned out to be wapbl_flush() writing non-fsbsize chunks on
} non-fsbsize boundaries. So RAIDframe is nearly sure to RMW.
} That makes the log get written to disc at about 1MB/s with the write lock
} on the log held. So everything else on that fs tstiles on the log's
} read lock.
} 
} Anyone in a position to improve that? I could simply turn off logging, but
} then any non-clean shutdown is sure to take ages.
-- End of excerpt from Edgar Fuß




Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Thor Lancelot Simon
On Wed, Nov 28, 2012 at 06:41:28PM +0100, J. Hannken-Illjes wrote:
 On Nov 28, 2012, at 6:02 PM, Greg Troxel g...@ir.bbn.com wrote:
  Do you see this on RAID-1 too?
  
   I wonder if it's possible (easily) to make the log only use fsbsize
   boundaries (maybe forcing it to be bigger as a side effect).
 
 Sure -- add a fsbsize-sized buffer to struct wapbl and teach wapbl_write()
 to collect data until the buffer's start or end touches a fsbsize boundary.

It is worth looking at the extensive work they did on this in XFS.



Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Michael van Elst
g...@ir.bbn.com (Greg Troxel) writes:

I wonder if it's possible (easily) to make the log only use fsbsize
boundaries (maybe forcing it to be bigger as a side effect).

Writing filesystem blocks won't help. RAIDframe needs writes as large
as a stripe.

The log itself could write much larger chunks but flushing is done
in a series of writes as small as a single physical block. I think
the only way to improve that is to copy everything first into a
large buffer. Not very efficient.


-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread J. Hannken-Illjes
On Nov 28, 2012, at 9:20 PM, Brian Buhrow buh...@nfbcal.org wrote:

   Hello.  If running 5.1 or 5.2 is acceptable for you, you could run
 ffs+softdep since it has all the namei fixes in it.

I suppose running fsck on a 6 TByte file system will take hours, and
softdep needs one after a crash.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread J. Hannken-Illjes
On Nov 28, 2012, at 10:13 PM, Michael van Elst mlel...@serpens.de wrote:

 g...@ir.bbn.com (Greg Troxel) writes:
 
 I wonder if it's possible (easily) to make the log only use fsbsize
 boundaries (maybe forcing it to be bigger as a side effect).
 
 Writing filesystem blocks won't help. RAIDframe needs writes as large
 as a stripe.

The file system block size should match the raid stripe size or you
have many more problems than flushing the log.

 The log itself could write much larger chunks but flushing is done
 in a series of writes as small as a single physical block. I think
 the only way to improve that is to copy everything first into a
 large buffer. Not very efficient.

Copying, say, 8 MBytes of data and writing it in big chunks
will be much faster than writing it in many smaller unaligned segments.

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Edgar Fuß
 Writing filesystem blocks won't help.
 RAIDframe needs writes as large as a stripe.
Nothing prevents one from making both quantities the same value.

 The log itself could write much larger chunks but flushing is done
 in a series of writes as small as a single physical block. I think
 the only way to improve that is to copy everything first into a
 large buffer. Not very efficient.
As far as I understood hannken@, I'm bitten by writing to the log, not by
flushing it.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Mouse
 I suppose running fsck on a 6 TByte file system will take hours

Based on my own experience with a 7T filesystem, I would suggest you
try it rather than making assumptions.

Depending on your use case, you may be able to speed fsck up
dramatically by choosing the parameters for your filesystem suitably.
I find that fsck on a filesystem built with -f 8192 -b 65536 -n 1, for
example, is a great deal faster than on a filesystem built on the same
amount of disk space with the defaults.  (I have a few filesystems for
which that combination of parameters is appropriate: a small number of
large files with little churn.)

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML   mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread J. Hannken-Illjes

On Nov 28, 2012, at 10:19 PM, Edgar Fuß e...@math.uni-bonn.de wrote:

 Writing filesystem blocks won't help.
 RAIDframe needs writes as large as a stripe.
 Nothing prevents one from making both quantities the same value
 
 The log itself could write much larger chunks but flushing is done
 in a series of writes as small as a single physical block. I think
 the only way to improve that is to copy everything first into a
 large buffer. Not very efficient.
 As far as I understood hannken@, I'm bitten by writing to the log, not by
 flushing it.


Flushing is just writing to the log.  These writes have sizes between
512 bytes and the file system block size.  Problem is, these writes
are neither multiples of nor aligned to the file system block size.

Collecting the data and writing MAXPHYS bytes aligned to MAXPHYS
should improve wapbl on raid.
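
The alignment arithmetic itself is simple; a sketch, using the kernel's
MAXPHYS, roundup() and rounddown() definitions from sys/param.h (the byte
offsets are invented example values, and a real log would need to pad the
rounded-up tail rather than write stale bytes):

#include <sys/param.h>  /* MAXPHYS, roundup(), rounddown() */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t start = 123456, end = 987654;  /* dirty byte range (example) */
	uint64_t wstart = rounddown(start, (uint64_t)MAXPHYS);
	uint64_t wend = roundup(end, (uint64_t)MAXPHYS);

	printf("aligned range [%llu, %llu): %llu writes of %d bytes\n",
	    (unsigned long long)wstart, (unsigned long long)wend,
	    (unsigned long long)((wend - wstart) / MAXPHYS), MAXPHYS);
	return 0;
}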

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)



Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Mouse
 Writing filesystem blocks won't help.
 RAIDframe needs writes as large as a stripe.
 Nothing prevents one from making both quantities the same value

That's not always true.  For example, I think filesystem block sizes
must be powers of two, but a RAID 5 with four members will necessarily
have a stripe size that's a multiple of three.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML   mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Brian Buhrow
Hello.  Well, to each his own, but for comparison, I have a system
running 5.1 with the latest namei changes and a 13TB filesystem which,
if fsck needs to run, takes less than an hour to complete.  I've found 5.1
to be very stable, and so haven't had to worry about the penalty of running
fsck after a crash very often.  I've found raidframe to be invaluable in my
installations, and having WAPBL broken in 6.x in conjunction with
raidframe seems like a pretty big deterrent to me.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Edgar Fuß
 That's not always true.
OK. Nothing prevents me from making these two values equal (I have five discs).


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Manuel Bouyer
On Wed, Nov 28, 2012 at 10:14:58PM +0100, J. Hannken-Illjes wrote:
 On Nov 28, 2012, at 9:20 PM, Brian Buhrow buh...@nfbcal.org wrote:
 
  Hello.  If running 5.1 or 5.2 is acceptable for you, you could run
  ffs+softdep since it has all the namei fixes in it.
 
 I suppose running fsck on a 6 TByte file system will take hours, and
 softdep needs one after a crash.

Well, the journal doesn't always avoid the fsck; it depends on the kind
of crash (if it's a panic in filesystem code I know I want to run
fsck anyway :)

Also, the fsck time depends a lot on the filesystem parameters.
A 9Tb filesystem formatted -O2 -b 32k -f4k -i100 can be checked
in less than one hour.

-- 
Manuel Bouyer bou...@antioche.eu.org
 NetBSD: 26 years of experience will always make the difference
--


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Michael van Elst
On Wed, Nov 28, 2012 at 10:18:04PM +0100, J. Hannken-Illjes wrote:
 On Nov 28, 2012, at 10:13 PM, Michael van Elst mlel...@serpens.de wrote:
 
  g...@ir.bbn.com (Greg Troxel) writes:
  
  I wonder if it's possible (easily) to make the log only use fsbsize
  boundaries (maybe forcing it to be bigger as a side effect).
  
  Writing filesystem blocks won't help. RAIDframe needs writes as large
  as a stripe.
 
 The file system block size should match the raid stripe size or you
 have many more problems than flushing the log.

True. Still difficult to do, in particular for metadata, which is written
in frag-sized blocks. Best for speed is probably to use fragsize=blocksize=64k.


  The log itself could write much larger chunks but flushing is done
  in a series of writes as small as a single physical block. I think
  the only way to improve that is to copy everything first into a
  large buffer. Not very efficient.
 
 Copying, say, 8 MBytes of data and writing it in big chunks
 will be much faster than writing it in many smaller unaligned segments.

One or two MB is probably good enough. A quick test of unpacking
base.tgz produces transactions of ~3MB and 1.5MB and a few smaller
ones of 30-50kB.


-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-11-28 Thread Thor Lancelot Simon
On Wed, Nov 28, 2012 at 04:28:57PM -0500, Mouse wrote:
  Writing filesystem blocks won't help.
  RAIDframe needs writes as large as a stripe.
  Nothing prevents one from making both quantities the same value
 
 That's not always true.  For example, I think filesystem block sizes
 must be powers of two, but a RAID 5 with four members will necessarily
 have a stripe size that's a multiple of three.

True.  But the size of the writes generated by the filesystems, as it
turns out, does not relate in the way you might expect to the filesystem
block size.

For example, in tls-maxphys Manuel and I have eliminated the code that
chose readahead and writebehind (clustering) I/O sizes by shifting the
filesystem blocksize (which always gave power-of-two sizes) and replaced
it with the more relaxed constraint that it must simply write full pages.
So you can have a filesystem with a 4K blocksize but, if you're on a
RAIDframe RAID5 volume with 4 disks and an underlying MAXPHYS of 64K,
find yourself sending 192K transactions to RAIDframe and thus the
desired 64K to each data disk.
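
(The 192K figure is just the full-stripe arithmetic; example numbers,
nothing RAIDframe-specific:)

#include <stdio.h>

int
main(void)
{
	unsigned ndisks = 4;           /* RAID5 set: 3 data + 1 parity */
	unsigned maxphys = 64 * 1024;  /* per-component transfer limit */
	unsigned xfer = (ndisks - 1) * maxphys;

	/* one full stripe: 64K to each of the 3 data disks */
	printf("full-stripe transaction: %uK\n", xfer / 1024);
	return 0;
}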

I don't see why -- in theory -- the log code couldn't do the analogous
thing.  Though at some point, you end up with the LFS problem -- the
need to flush partial clusters of transactions because you don't want
to let them linger uncommitted for too much time.