re(4) MAC address

2012-12-01 Thread Frank Wille
Hi,

I was testing a NH-230/231 NAS (running sandpoint) and wondered why
the PPCBoot firmware and the previously installed Linux used a different
MAC address than NetBSD does.

I found out that NetBSD's re(4) driver is reading the MAC from EEPROM
while PPCBoot and Linux are reading it from the chip's ID-registers
(RTK_IDRn).

What is correct? This is a Realtek 8169S:

# pcictl pci0 dump -d 15
PCI configuration registers:
  Common header:
0x00: 0x816910ec 0x02b00107 0x0210 0x8008

Vendor Name: Realtek Semiconductor (0x10ec)
Device Name: 8169/8110 10/100/1000 Ethernet (0x8169)
[...]


Sorry for cross-posting, but I couldn't decide whether this belongs to
tech-kern or tech-net.

-- 
Frank Wille


Re: re(4) MAC address

2012-12-01 Thread Izumi Tsutsui
 I found out that NetBSD's re(4) driver is reading the MAC from EEPROM
 while PPCBoot and Linux are reading it from the chip's ID-registers
 (RTK_IDRn).
 
 What is correct? This is a Realtek 8169S:

Probably it's defined by hardware vendors, not chip.

The old RTL8139 (RTL8169 has compat mode) seems to read MAC address
from EEPROM and those values are stored into RTK_IDRn registers.
I guess some NAS vendors overwrite RTK_IDRn registers by firmware
to avoid extra EEPROM configurations during production.

We can change values per hardware by adding device properties
(prop_dictionary(3)) calls (like sys/dev/pci/if_wm.c etc).

---
Izumi Tsutsui


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Michael van Elst
On Fri, Nov 30, 2012 at 12:00:52PM +0000, David Laight wrote:
 On Fri, Nov 30, 2012 at 08:00:51AM +0000, Michael van Elst wrote:
  da...@l8s.co.uk (David Laight) writes:
  
  I must look at how to determine that disks have 4k sectors and to
  ensure filesystems have 4k fragments - regardless of the fs size.
  
  newfs should already ensure that fragment = sector.
 
 These disks lie about their actual sector size.
 The disk's own software does RMW cycles for 512 byte writes.

These disks just follow their specification. They also report
the true sector size. The problem is how to interpret it: obviously
you can access the disk in 512 byte units, and the real size and
alignment just affect performance. So should the disk lie about
the blocks you can address or lie about some recommended block size
for accesses?

The rest of the world just ignores such problems by using some
values that are sufficiently sized/aligned for old and new disks.


Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
A potential Snark may lurk in every tree.


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Mouse
 These disks lie about their actual sector size.
 These disks just follow their specification.

That's as meaningless as...oh, to pick an unreasonably extreme example,
a hitman saying "I was just following orders".

 They also report the true sector size.

Not according to the documentation, at least not in the one case I
investigated.  The documentation flat-out says the sector size is 4K,
but the disk claims to have half-K sectors.

The problem is that there are two sizes here, which have historically
been identical: the sector size on the media and the granularity of the
interface.  Trouble is, they were identical for good reason.  I
consider decoupling them slightly broken.  I consider decoupling them
without updating the interface to report both sizes cripplingly broken.

 So should the disk lie about the blocks you can address or lie about
 some recommended block size for accesses?

Neither.  The sector size claimed to the host should equal both the
sector size on the media and the granularity of the interface.

Either that or a new interface should be defined which reports both the
media sector size and the interface grain size.

Anything else is IMO a bug in the drive and should be treated as such,
which in NetBSD's case I would say means a quirk entry, documented as
being a workaround for broken hardware, for it.
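
A rough sketch of what reporting both sizes, plus a quirk entry for a
drive that misreports, might look like.  This is a userland model only:
the structure names, disk_sizes_lookup() and the model string in the
table are all hypothetical and are not taken from any existing NetBSD
quirk table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The two sizes that have historically been identical. */
struct disk_sizes {
	uint32_t interface_grain;  /* smallest addressable unit, in bytes */
	uint32_t media_sector;     /* sector size on the media, in bytes */
};

/* Quirk entry for a drive that misreports its media sector size. */
struct disk_quirk {
	const char *model;         /* hypothetical model string */
	uint32_t    media_sector;  /* value taken from the documentation */
};

static const struct disk_quirk quirks[] = {
	{ "EXAMPLE 4K-512E", 4096 },  /* made-up entry, illustration only */
};

/* Start from what the drive claims, then apply any matching quirk. */
struct disk_sizes
disk_sizes_lookup(const char *model, uint32_t reported_sector)
{
	struct disk_sizes ds = { reported_sector, reported_sector };

	assert(model != NULL);
	for (size_t i = 0; i < sizeof(quirks) / sizeof(quirks[0]); i++) {
		if (strcmp(model, quirks[i].model) == 0)
			ds.media_sector = quirks[i].media_sector;
	}
	return ds;
}
```

A drive without a quirk entry keeps both sizes equal to whatever it
reported, so nothing changes for hardware that tells the truth.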

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: re(4) MAC address

2012-12-01 Thread Frank Wille
Izumi Tsutsui wrote:

On 02.12.12 00:40:44 you wrote:

 I found out that NetBSD's re(4) driver is reading the MAC from EEPROM
 while PPCBoot and Linux are reading it from the chip's ID-registers
 (RTK_IDRn).
 
 What is correct? This is a Realtek 8169S:

 Probably it's defined by hardware vendors, not chip.

 The old RTL8139 (RTL8169 has compat mode) seems to read MAC address
 from EEPROM and those values are stored into RTK_IDRn registers.

Who writes it into the IDRn registers? The firmware? The driver? Or the chip
itself? When the chip does that automatically, then re(4) should depend on
RTK_IDRn and not on the EEPROM.


 I guess some NAS vendors overwrite RTK_IDRn registers by firmware
 to avoid extra EEPROM configurations during production.

You may be right. I found a modification in the PPCBoot source, which reads
the environment variable ethaddr and copies it to RTK_IDRn.

But the EEPROM seems to have valid contents (only the last three bytes
differ) and I wonder why it is not used.
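
The comparison described here can be modelled in userland as below.
This is only a sketch: it assumes the ID registers are six byte-wide
registers starting at offset 0x00 of the chip's register window
(RTK_IDR0 through RTK_IDR5), it models that window as a plain byte
array, and read_idr_mac() and same_oui() are hypothetical helper names
rather than functions of the re(4) driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ETHER_ADDR_LEN 6
#define IDR0_OFFSET    0x00  /* assumed offset of RTK_IDR0; IDR1..5 follow */

/* Copy the six ID-register bytes out of a (simulated) register window. */
void
read_idr_mac(const uint8_t *regs, uint8_t mac[ETHER_ADDR_LEN])
{
	assert(regs != NULL);
	for (int i = 0; i < ETHER_ADDR_LEN; i++)
		mac[i] = regs[IDR0_OFFSET + i];
}

/* True if both addresses share the first three bytes (the vendor OUI). */
bool
same_oui(const uint8_t a[ETHER_ADDR_LEN], const uint8_t b[ETHER_ADDR_LEN])
{
	return memcmp(a, b, 3) == 0;
}
```

Two addresses that differ only in their last three bytes make
same_oui() return true, which matches the situation on this NAS.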


 We can change values per hardware by adding device properties
 (prop_dictionary(3)) calls (like sys/dev/pci/if_wm.c etc).

Yes. I added a mac-address property to sk(4) myself, some time ago. But
re(4) doesn't support it yet.


-- 
Frank Wille



Re: re(4) MAC address

2012-12-01 Thread Izumi Tsutsui
Frank Wille wrote:

  Probably it's defined by hardware vendors, not chip.
 
  The old RTL8139 (RTL8169 has compat mode) seems to read MAC address
  from EEPROM and those values are stored into RTK_IDRn registers.
 
 Who writes it into the IDRn registers? The firmware? The driver? Or the chip
 itself? When the chip does that automatically, then re(4) should depend on
 RTK_IDRn and not on the EEPROM.

IIRC RTL8139 doc says the chip reads the values from EEPROM automatically.
We should follow what 8169 doc specifies, but I don't have 8169 docs.

  I guess some NAS vendors overwrite RTK_IDRn registers by firmware
  to avoid extra EEPROM configurations during production.
 
 You may be right. I found a modification in the PPCBoot source, which reads
 the environment variable ethaddr and copies it to RTK_IDRn.
 
 But the EEPROM seems to have valid contents (only the last three bytes
 differ) and I wonder why it is not used.

Probably all NASes have the same values in EEPROM?
(i.e. no re's EEPROM write operations during manufacture)

  We can change values per hardware by adding device properties
  (prop_dictionary(3)) calls (like sys/dev/pci/if_wm.c etc).
 
 Yes. I added a mac-address property to sk(4) myself, some time ago. But
 re(4) doesn't support it yet.

You can add it if necessary, to avoid unexpected changes on other NICs.

---
Izumi Tsutsui
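
The per-hardware override discussed above could look roughly like the
userland model below.  It assumes the machine-dependent code hands the
driver an optional 6-byte "mac-address" value (in the kernel this would
come from the device's property dictionary) and that the EEPROM value
stays the default, so other NICs are unaffected.  re_pick_enaddr() is a
hypothetical name, not an existing re(4) function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHER_ADDR_LEN 6

/*
 * Choose the station address: a well-formed "mac-address" property wins,
 * otherwise fall back to the EEPROM contents (the current default).
 * Returns 1 if the property was used, 0 if the EEPROM value was used.
 */
int
re_pick_enaddr(const uint8_t *prop, size_t prop_len,
    const uint8_t eeprom[ETHER_ADDR_LEN], uint8_t enaddr[ETHER_ADDR_LEN])
{
	assert(eeprom != NULL && enaddr != NULL);
	if (prop != NULL && prop_len == ETHER_ADDR_LEN) {
		memcpy(enaddr, prop, ETHER_ADDR_LEN);
		return 1;
	}
	memcpy(enaddr, eeprom, ETHER_ADDR_LEN);
	return 0;
}
```

A property of the wrong length is ignored rather than trusted, so a
misconfigured port falls back to the EEPROM address instead of ending
up with a truncated one.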


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread David Holland
On Sat, Dec 01, 2012 at 04:27:14PM -0500, Mouse wrote:
  Neither.  The sector size claimed to the host should equal both the
  sector size on the media and the granularity of the interface.

As a consumer of block devices, I don't care about either of these
things. What I care about is the largest size sector that will (in
the ordinary course of things anyway) be written atomically.

I might also care about larger sizes that the drive considers
significant for alignment purposes; but probably not very much.

I don't care about the block granularity of the interface. (Unless I
suppose it's larger than the atomic write size; but that would be
weird.)

I care even less about how the media is organized internally; if it
announces that the atomic write size is 1024 bytes, it's 1024 bytes,
even if it really means that it is writing one bit each to 8192 steel
drum spindles.

Now, we have legacy code that contains additional assumptions, such as
the belief that the atomic write size is the same from device to
device, or that it can be set at newfs time rather than being a
dynamic/run-time property of the block device. And we have a lot of
code that uses DEV_BSIZE as a convenient unit of measurement and mixes
it indiscriminately with other device size properties. However, all
this stuff should be cleaned up in the long term.

It may also be necessary for lower-level code (e.g. the scsi layer) to
know more than this, but any of that can be isolated underneath the
block device interface.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Mouse
 Neither.  The sector size claimed to the host should equal both the
 sector size on the media and the granularity of the interface.
 As a consumer of block devices, I don't care about either of these
 things.  What I care about is the largest size sector that will (in
 the ordinary course of things anyway) be written atomically.

Then those are 512-byte-sector drives as far as you're concerned; you
can ignore the 4K reality.  At least, absent bugs in the drives, but
that's always a valid caveat.

This is because the RMW cycle that goes on internally for sub-4K writes
is invisible: a 512-byte write always either has completed in full or
has not yet started at all as far as all other interactions with the
drive go.  That is, such writes (and reads) are atomic.
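
The hidden cost of that internal RMW cycle can be estimated with a
small model.  The sketch below counts the media-level reads and writes
behind a single host write; rmw_cost() and struct phys_io are
hypothetical names, and the model assumes the drive reads a media
sector first only when the host write covers part of it.

```c
#include <assert.h>
#include <stdint.h>

struct phys_io {
	uint64_t reads;   /* media sectors that must be read first */
	uint64_t writes;  /* media sectors written */
};

/*
 * Count the media-level I/O behind one host write of `len` bytes at byte
 * offset `off`, for a drive whose media sector is `sector` bytes.
 */
struct phys_io
rmw_cost(uint64_t off, uint64_t len, uint64_t sector)
{
	struct phys_io io = { 0, 0 };

	assert(sector > 0 && len > 0);

	uint64_t first = off / sector;              /* first sector touched */
	uint64_t last = (off + len - 1) / sector;   /* last sector touched */
	int head_partial = (off % sector) != 0;
	int tail_partial = ((off + len) % sector) != 0;

	io.writes = last - first + 1;
	if (first == last)
		io.reads = (head_partial || tail_partial) ? 1 : 0;
	else
		io.reads = (uint64_t)head_partial + (uint64_t)tail_partial;
	return io;
}
```

Under these assumptions a 512-byte write to a 4K-sector drive costs one
media read plus one media write, while an aligned 4K write costs a
single media write and no read.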

It's a coherent point of view.  But it's one I don't share; I care more
about performance than that.  This is why I care about visibility into
internal organization.

 I might also care about larger sizes that the drive considers
 significant for alignment purposes; but probably not very much.

That depends on whether you care about performance.

 I don't care about the block granularity of the interface.

Don't you pretty much have to care about it, since that's the unit in
which data addresses are presented to its interface?  Or is that
something you believe should be hidden by...something else?  (It's not
clear to me exactly what the `you' that doesn't care about interface
granularity includes - hardware driver authors? filesystem authors?
midlayer (eg scsipi) authors?)

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Julian Yon

On Sat, 1 Dec 2012 23:46:07 +0000
David Holland dholland-t...@netbsd.org wrote:

 I don't care about the block granularity of the interface. (Unless I
 suppose it's larger than the atomic write size; but that would be
 weird.)

If it's smaller than the atomic write size that's equally weird.
Because that implies that the designers have made the explicit decision
to sacrifice performance for no gain. But there is a cost: they had to
write firmware code to emulate that block size.


Julian

-- 
3072D/F3A66B3A Julian Yon (2012 General Use) pgp.2...@jry.me



Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread David Holland
On Sat, Dec 01, 2012 at 07:07:36PM -0500, Mouse wrote:
   Neither.  The sector size claimed to the host should equal both the
   sector size on the media and the granularity of the interface.
   As a consumer of block devices, I don't care about either of these
   things.  What I care about is the largest size sector that will (in
                                     ^^^^^^^
   the ordinary course of things anyway) be written atomically.
  
  Then those are 512-byte-sector drives as far as you're concerned; you
  can ignore the 4K reality.  At least, absent bugs in the drives, but
  that's always a valid caveat.

No; because I can do 4K atomic writes, I want to know about that.
(Quite apart from any performance issues.)

Physical realities pretty much guarantee that the largest atomic write
is not going to cause a RMW cycle... at least on items that are
actually block-based. 

RAIDs where you have to RMW a whole stripe or something but it isn't
atomic might be a somewhat different story. I'm not sure how one would
build a journaling FS on one of those without having it suck. (I
guess by stuffing the journal into NVRAM.)

   I don't care about the block granularity of the interface.
  
  Don't you pretty much have to care about it, since that's the unit in
  which data addresses are presented to its interface?  Or is that
  something you believe should be hidden by...something else?

That is something only the device driver should have to be aware of.

(There's an implicit assumption here that block devices should be
addressed with byte offsets, as they are from userland, even though
this typically wastes a dozen or so bits; the minor overhead is far
preferable to the confusion that arises when you have multiple size
units floating around, and the consequences of just one bug that mixes
block offsets measured in different block sizes can be catastrophic.)
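
The kind of unit mix-up meant here is easy to show with a pair of
conversion helpers.  This is a minimal sketch with hypothetical names
(blk_to_byte(), byte_to_blk()); the point is only that a block number
has no meaning without the block size it was counted in, while a byte
offset does.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a block number in a given block size to a byte offset. */
uint64_t
blk_to_byte(uint64_t blkno, uint32_t blksize)
{
	assert(blksize > 0);
	return blkno * (uint64_t)blksize;
}

/* Convert a byte offset to a block number; the offset must be aligned. */
uint64_t
byte_to_blk(uint64_t off, uint32_t blksize)
{
	assert(blksize > 0 && off % blksize == 0);
	return off / blksize;
}
```

Block 8 counted in 512-byte units is byte offset 4096, which is block 1
in 4096-byte units; treating that same 8 as a 4096-byte block number
would land at byte 32768 instead.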

  (It's not clear to me exactly what the `you' that doesn't care
  about interface granularity includes - hardware driver authors?
  filesystem authors?  midlayer (eg scsipi) authors?)

I'm speaking from a filesystem point of view; but, more specifically,
I'm talking about the abstraction we call a block device, which sits
above stuff like scsipi.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread David Holland
On Sun, Dec 02, 2012 at 01:32:17AM +0000, Julian Yon wrote:
   I don't care about the block granularity of the interface. (Unless I
   suppose it's larger than the atomic write size; but that would be
   weird.)
  
  If it's smaller than the atomic write size that's equally weird.
  Because that implies that the designers have made the explicit decision
  to sacrifice performance for no gain. But there is a cost: they had to
  write firmware code to emulate that block size.

It's not weird, and there is a gain; it's for compatibility with large
amounts of deployed code that assumes all devices have 512-byte blocks.

-- 
David A. Holland
dholl...@netbsd.org


Re: Making forced unmounts work

2012-12-01 Thread David Holland
On Thu, Nov 29, 2012 at 06:19:37PM +0100, J. Hannken-Illjes wrote:
   In short the attached diff:
   
   - Adds a new kernel-internal errno ERESTARTVOP and changes VCALL() to
     restart a vnode operation once it returns ERESTARTVOP.
   
   - Changes fstrans_start() to take an optional `hint vnode' and return
     ERESTARTVOP if the vnode becomes dead.
   
   Is there any major reason we can't just use ERESTART and rerun the
   whole syscall?
  
  Not all vnode operations come from a syscall and to me it looks cleaner
  to use one private errno for exactly this purpose.

Could be. All those places are supposed to be capable of coping with
ERESTART though (otherwise, they break if it happens) so it shouldn't
make much difference. And if it does make a difference somewhere, that
should be fixed... regardless of ERESTART for signals, we want FS
operations to be able to bail and restart for transaction abort.
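
For reference, the restart-at-dispatch idea from the quoted diff can be
modelled in a few lines of userland C.  This is a sketch, not the
actual patch: the errno values are made up, each vnode has a single
operation for brevity, and in this single-threaded model the live
operation switches the vnode to the dead operation itself, whereas in
the kernel that switch would be done by whoever kills the vnode.

```c
#include <assert.h>
#include <stddef.h>

#define ERESTARTVOP (-100)  /* hypothetical value for the private errno */
#define EDEADVNODE  5       /* stand-in for whatever the dead ops return */

struct vnode;
typedef int (*vop_fn)(struct vnode *);

struct vnode {
	vop_fn op;      /* current operation vector (one op, for brevity) */
	int    revoked; /* set when the fs is forcibly unmounted */
	int    calls;   /* how many times an op body ran */
};

/* Operation of a dead vnode: always fails. */
int
dead_op(struct vnode *vp)
{
	vp->calls++;
	return EDEADVNODE;
}

/* Operation of a live fs: notices the vnode died and asks for a restart. */
int
live_op(struct vnode *vp)
{
	vp->calls++;
	if (vp->revoked) {
		vp->op = dead_op;  /* the vnode now belongs to the dead fs */
		return ERESTARTVOP;
	}
	return 0;
}

/* Dispatch in the spirit of the proposed VCALL(): rerun on ERESTARTVOP. */
int
vcall(struct vnode *vp)
{
	int error;

	assert(vp != NULL && vp->op != NULL);
	do {
		error = vp->op(vp);
	} while (error == ERESTARTVOP);
	return error;
}
```

The caller of vcall() never sees ERESTARTVOP; it sees either the result
of the live operation or the error from the dead one.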

   I see there are two references to ERESTARTVOP in genfs_io.c, and I
   don't see what they're for without digging deeper, but given that they
   appear to make locking behavior depend on the error condition maybe it
   would be better not to do that too. :-/
  
  This is the wonderful world of VOP_GETPAGES() and VOP_PUTPAGES().  Both
  are called with vnode interlock held and when it is needed and possible
  to check the vnode the interlock has been released.  When these operations
  return ERESTARTVOP we have to lock the interlock because dead_getpages()
  and dead_putpages() need it on entry (just to release it).
  
  It is possible to directly return the error from genfs_XXXpages() though.
  To me it looks clearer to always go the ERESTARTVOP route.

Ugh.

I don't like having the locking behavior be different for different
error cases; it's asking for trouble in the long run. I think this
ends up being a reason to use ERESTART instead.

   Also I wonder if there's any way to accomplish this that doesn't
   require adding fstrans calls to every operation in every fs.
  
  Not in a clean way. We would need some kind of reference counting for
  vnode operations and that is quite impossible as vnode operations on
  devices or fifos sometimes wait forever and are called from other fs
  like ufsspec_read() for example.  How could we protect UFS updating
  access times here?

I'm not entirely convinced of that. There are basically three
problems: (a) new incoming threads, (b) threads that are already in
the fs and running, and (c) threads that are already in the fs and
that are stuck more or less permanently because something broke.

Admittedly I don't really understand how fstrans suspending works.
Does it keep track of all the threads that are in the fs, so the (b)
ones can be interrupted somehow, or so we at least can wait until all
of them either leave the fs or enter fstrans somewhere and stall?

If we're going to track that information we should really do it from
vnode_if.c, both to avoid having to modify every fs and to make sure
all fses support it correctly. (We also need to be careful about how
it's done to avoid causing massive lock contention; that's why such
logic doesn't already exist.)
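
A deliberately naive model of tracking this at the dispatch layer is
sketched below.  All names are hypothetical, and the plain shared
counter is exactly the kind of thing that would cause contention in a
real kernel; the sketch only shows where the accounting would live, not
how to make it cheap.

```c
#include <assert.h>
#include <stddef.h>

struct mount_model {
	int in_flight;  /* threads currently inside a vnode operation */
};

typedef int (*fs_op)(struct mount_model *, void *);

/*
 * Generic dispatch wrapper: account for the caller before entering the
 * fs-specific code and after leaving it, so no fs has to do it itself.
 */
int
vop_dispatch(struct mount_model *mp, fs_op op, void *arg)
{
	int error;

	assert(mp != NULL && op != NULL);
	mp->in_flight++;
	error = op(mp, arg);
	mp->in_flight--;
	return error;
}

/* A suspend can only complete once nobody is left inside the fs. */
int
fs_is_quiescent(const struct mount_model *mp)
{
	return mp->in_flight == 0;
}

/* Sample fs operation: records the in-flight count it observes. */
int
sample_op(struct mount_model *mp, void *arg)
{
	*(int *)arg = mp->in_flight;
	return 0;
}
```

Because the counting happens in the wrapper, sample_op() itself
contains no tracking code, which is the property wanted for real file
systems.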

If, however, fstrans isn't tracking that information, I don't see how
suspending the fs helps deal with the (b) threads, because if they're
currently running they can continue to chew on fs-specific data for
arbitrarily long before they run into anything that stalls them, and
there's no way to know when they're done or how many of them there
are.

I don't really see what the issue with ufsspec_read() is, however, so
we may be talking past each other.

-- 
David A. Holland
dholl...@netbsd.org


Re: Problem identified: WAPL/RAIDframe performance problems

2012-12-01 Thread Mouse
 things.  What I care about is the largest size sector that will (in
                                   ^^^^^^^
 the ordinary course of things anyway) be written atomically.
 Then those are 512-byte-sector drives [...]
 No; because I can do 4K atomic writes, I want to know about that.

And, can't you do that with traditional drives, drives which really do
have 512-byte sectors?  Do a 4K transfer and you write 8 physical
sectors with no opportunity for any other operation to see the write
partially done.  Is that wrong, or am I missing something else?

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B