re(4) MAC address
Hi,

I was testing a NH-230/231 NAS (running sandpoint) and wondered why the
PPCBoot firmware and the previously installed Linux used a different
MAC address than NetBSD does.

I found out that NetBSD's re(4) driver is reading the MAC from EEPROM
while PPCBoot and Linux are reading it from the chip's ID-registers
(RTK_IDRn). What is correct?

This is a Realtek 8169S:

# pcictl pci0 dump -d 15
PCI configuration registers:
  Common header:
    0x00: 0x816910ec 0x02b00107 0x0210 0x8008

  Vendor Name: Realtek Semiconductor (0x10ec)
  Device Name: 8169/8110 10/100/1000 Ethernet (0x8169)
[...]

Sorry for cross-posting, but I couldn't decide whether this belongs to
tech-kern or tech-net.

--
Frank Wille
Re: re(4) MAC address
> I found out that NetBSD's re(4) driver is reading the MAC from EEPROM
> while PPCBoot and Linux are reading it from the chip's ID-registers
> (RTK_IDRn). What is correct? This is a Realtek 8169S:

Probably it's defined by the hardware vendors, not the chip. The old
RTL8139 (the RTL8169 has a compat mode) seems to read the MAC address
from EEPROM, and those values are stored into the RTK_IDRn registers.
I guess some NAS vendors overwrite the RTK_IDRn registers from firmware
to avoid an extra EEPROM configuration step during production.

We can change values per hardware by adding device property
(prop_dictionary(3)) calls (like sys/dev/pci/if_wm.c etc.).

---
Izumi Tsutsui
Re: Problem identified: WAPL/RAIDframe performance problems
On Fri, Nov 30, 2012 at 12:00:52PM +, David Laight wrote:
> On Fri, Nov 30, 2012 at 08:00:51AM +, Michael van Elst wrote:
> > da...@l8s.co.uk (David Laight) writes:
> > > I must look at how to determine that disks have 4k sectors and to
> > > ensure filesystems have 4k fragments - regardless of the fs size.
> >
> > newfs should already ensure that fragment >= sector.
>
> These disks lie about their actual sector size.
> The disk's own software does RMW cycles for 512 byte writes.

These disks just follow their specification. They also report the true
sector size. The problem is how to interpret it; obviously you can
access the disk in 512 byte units, and the real size and alignment just
affect performance.

So should the disk lie about the blocks you can address, or lie about
some recommended block size for accesses?

The rest of the world just ignores such problems by using values that
are sufficiently sized/aligned for both old and new disks.

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
                "A potential Snark may lurk in every tree."
Re: Problem identified: WAPL/RAIDframe performance problems
> > These disks lie about their actual sector size.
>
> These disks just follow their specification.

That's as meaningless as... oh, to pick an unreasonably extreme
example, a hitman saying "I was just following orders".

> They also report the true sector size.

Not according to the documentation, at least not in the one case I
investigated. The documentation flat-out says the sector size is 4K,
but the disk claims to have half-K sectors.

The problem is that there are two sizes here, which have historically
been identical: the sector size on the media and the granularity of the
interface. Trouble is, they were identical for good reason. I consider
decoupling them slightly broken; I consider decoupling them without
updating the interface to report both sizes cripplingly broken.

> So should the disk lie about the blocks you can address or lie about
> some recommended block size for accesses?

Neither. The sector size claimed to the host should equal both the
sector size on the media and the granularity of the interface. Either
that, or a new interface should be defined which reports both the media
sector size and the interface grain size.

Anything else is IMO a bug in the drive and should be treated as such,
which in NetBSD's case I would say means a quirk entry for it,
documented as being a workaround for broken hardware.

/~\ The ASCII                     Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!      7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: re(4) MAC address
Izumi Tsutsui wrote:
> On 02.12.12 00:40:44 you wrote:
> > I found out that NetBSD's re(4) driver is reading the MAC from
> > EEPROM while PPCBoot and Linux are reading it from the chip's
> > ID-registers (RTK_IDRn). What is correct? This is a Realtek 8169S:
>
> Probably it's defined by hardware vendors, not chip. The old RTL8139
> (RTL8169 has compat mode) seems to read MAC address from EEPROM and
> those values are stored into RTK_IDRn registers.

Who writes it into the IDRn registers? The firmware? The driver? Or the
chip itself?

If the chip does that automatically, then re(4) should depend on
RTK_IDRn and not on the EEPROM.

> I guess some NAS vendors overwrite RTK_IDRn registers by firmware to
> avoid extra EEPROM configurations during production.

You may be right. I found a modification in the PPCBoot source which
reads the environment variable "ethaddr" and copies it to RTK_IDRn.

But the EEPROM seems to have valid contents (only the last three bytes
differ) and I wonder why it is not used.

> We can change values per hardware by adding device properties
> (prop_dictionary(3)) calls (like sys/dev/pci/if_wm.c etc).

Yes. I added a mac-address property to sk(4) myself, some time ago. But
re(4) doesn't support it yet.

--
Frank Wille
Re: re(4) MAC address
Frank Wille wrote:
> > Probably it's defined by hardware vendors, not chip. The old
> > RTL8139 (RTL8169 has compat mode) seems to read MAC address from
> > EEPROM and those values are stored into RTK_IDRn registers.
>
> Who writes it into the IDRn registers? The firmware? The driver? Or
> the chip itself?
>
> If the chip does that automatically, then re(4) should depend on
> RTK_IDRn and not on the EEPROM.

IIRC the RTL8139 doc says the chip reads the values from EEPROM
automatically. We should follow what the 8169 doc specifies, but I
don't have 8169 docs.

> > I guess some NAS vendors overwrite RTK_IDRn registers by firmware
> > to avoid extra EEPROM configurations during production.
>
> You may be right. I found a modification in the PPCBoot source, which
> reads the environment variable "ethaddr" and copies it to RTK_IDRn.
>
> But the EEPROM seems to have valid contents (only the last three
> bytes differ) and I wonder why it is not used.

Probably all these NASes have the same values in EEPROM? (i.e. no
re(4) EEPROM write operations during manufacture)

> > We can change values per hardware by adding device properties
> > (prop_dictionary(3)) calls (like sys/dev/pci/if_wm.c etc).
>
> Yes. I added a mac-address property to sk(4) myself, some time ago.
> But re(4) doesn't support it yet.

You can add it if necessary, to avoid unexpected changes on other NICs.

---
Izumi Tsutsui
Re: Problem identified: WAPL/RAIDframe performance problems
On Sat, Dec 01, 2012 at 04:27:14PM -0500, Mouse wrote:
> Neither. The sector size claimed to the host should equal both the
> sector size on the media and the granularity of the interface.

As a consumer of block devices, I don't care about either of these
things. What I care about is the largest size sector that will (in the
ordinary course of things anyway) be written atomically. I might also
care about larger sizes that the drive considers significant for
alignment purposes; but probably not very much.

I don't care about the block granularity of the interface. (Unless, I
suppose, it's larger than the atomic write size; but that would be
weird.) I care even less about how the media is organized internally;
if it announces that the atomic write size is 1024 bytes, it's 1024
bytes, even if it really means that it is writing one bit each to 8192
steel drum spindles.

Now, we have legacy code that contains additional assumptions, such as
the belief that the atomic write size is the same from device to
device, or that it can be set at newfs time rather than being a
dynamic/run-time property of the block device. And we have a lot of
code that uses DEV_BSIZE as a convenient unit of measurement and mixes
it indiscriminately with other device size properties. However, all
this stuff should be cleaned up in the long term.

It may also be necessary for lower-level code (e.g. the scsi layer) to
know more than this, but any of that can be isolated underneath the
block device interface.

--
David A. Holland
dholl...@netbsd.org
Re: Problem identified: WAPL/RAIDframe performance problems
> > Neither. The sector size claimed to the host should equal both the
> > sector size on the media and the granularity of the interface.
>
> As a consumer of block devices, I don't care about either of these
> things. What I care about is the largest size sector that will (in
> the ordinary course of things anyway) be written atomically.

Then those are 512-byte-sector drives as far as you're concerned; you
can ignore the 4K reality. At least, absent bugs in the drives, but
that's always a valid caveat.

This is because the RMW cycle that goes on internally for sub-4K writes
is invisible: a 512-byte write always either has completed in full or
has not yet started at all, as far as all other interactions with the
drive go. That is, such writes (and reads) are atomic.

It's a coherent point of view. But it's one I don't share; I care more
about performance than that. This is why I care about visibility into
internal organization.

> I might also care about larger sizes that the drive considers
> significant for alignment purposes; but probably not very much.

That depends on whether you care about performance.

> I don't care about the block granularity of the interface.

Don't you pretty much have to care about it, since that's the unit in
which data addresses are presented to its interface? Or is that
something you believe should be hidden by... something else?

(It's not clear to me exactly what the `you' that doesn't care about
interface granularity includes - hardware driver authors? filesystem
authors? midlayer (eg scsipi) authors?)

/~\ The ASCII                     Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!      7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: Problem identified: WAPL/RAIDframe performance problems
On Sat, 1 Dec 2012 23:46:07 + David Holland
<dholland-t...@netbsd.org> wrote:
> I don't care about the block granularity of the interface. (Unless I
> suppose it's larger than the atomic write size; but that would be
> weird.)

If it's smaller than the atomic write size, that's equally weird,
because it implies that the designers have made the explicit decision
to sacrifice performance for no gain. And there is a cost: they had to
write firmware code to emulate that block size.

Julian

--
3072D/F3A66B3A Julian Yon (2012 General Use) <pgp.2...@jry.me>
Re: Problem identified: WAPL/RAIDframe performance problems
On Sat, Dec 01, 2012 at 07:07:36PM -0500, Mouse wrote:
> > > things. What I care about is the largest size sector that will
> > >                                             ^^^^^^
> > > (in the ordinary course of things anyway) be written atomically.
> >
> > Then those are 512-byte-sector drives as far as you're concerned;
> > you can ignore the 4K reality. At least, absent bugs in the drives,
> > but that's always a valid caveat.

No; because I can do 4K atomic writes, I want to know about that.
(Quite apart from any performance issues.) Physical realities pretty
much guarantee that the largest atomic write is not going to cause a
RMW cycle... at least on devices that are actually block-based.

RAIDs where you have to RMW a whole stripe, but where the stripe update
isn't atomic, might be a somewhat different story. I'm not sure how one
would build a journaling FS on one of those without having it suck. (I
guess by stuffing the journal into NVRAM.)

> > I don't care about the block granularity of the interface.
>
> Don't you pretty much have to care about it, since that's the unit in
> which data addresses are presented to its interface? Or is that
> something you believe should be hidden by... something else?

That is something only the device driver should have to be aware of.
(There's an implicit assumption here that block devices should be
addressed with byte offsets, as they are from userland, even though
this typically wastes a dozen or so bits; the minor overhead is far
preferable to the confusion that arises when you have multiple size
units floating around, and the consequences of just one bug that mixes
block offsets measured in different block sizes can be catastrophic.)

> (It's not clear to me exactly what the `you' that doesn't care about
> interface granularity includes - hardware driver authors? filesystem
> authors? midlayer (eg scsipi) authors?)

I'm speaking from a filesystem point of view; but, more specifically,
I'm talking about the abstraction we call a block device, which sits
above stuff like scsipi.

--
David A. Holland
dholl...@netbsd.org
Re: Problem identified: WAPL/RAIDframe performance problems
On Sun, Dec 02, 2012 at 01:32:17AM +, Julian Yon wrote:
> > I don't care about the block granularity of the interface. (Unless
> > I suppose it's larger than the atomic write size; but that would be
> > weird.)
>
> If it's smaller than the atomic write size, that's equally weird,
> because it implies that the designers have made the explicit decision
> to sacrifice performance for no gain. And there is a cost: they had
> to write firmware code to emulate that block size.

It's not weird, and there is a gain; it's for compatibility with the
large amount of deployed code that assumes all devices have 512-byte
blocks.

--
David A. Holland
dholl...@netbsd.org
Re: Making forced unmounts work
On Thu, Nov 29, 2012 at 06:19:37PM +0100, J. Hannken-Illjes wrote:
> > > In short the attached diff:
> > >
> > > - Adds a new kernel-internal errno ERESTARTVOP and changes
> > >   VCALL() to restart a vnode operation once it returns
> > >   ERESTARTVOP.
> > >
> > > - Changes fstrans_start() to take an optional `hint vnode' and
> > >   return ERESTARTVOP if the vnode becomes dead.
> >
> > Is there any major reason we can't just use ERESTART and rerun the
> > whole syscall?
>
> Not all vnode operations come from a syscall, and to me it looks
> cleaner to use one private errno for exactly this purpose.

Could be. All those places are supposed to be capable of coping with
ERESTART though (otherwise, they break if it happens) so it shouldn't
make much difference. And if it does make a difference somewhere, that
should be fixed... regardless of ERESTART for signals, we want FS
operations to be able to bail and restart for transaction abort.

> > I see there are two references to ERESTARTVOP in genfs_io.c, and I
> > don't see what they're for without digging deeper, but given that
> > they appear to make locking behavior depend on the error condition
> > maybe it would be better not to do that too. :-/
>
> This is the wonderful world of VOP_GETPAGES() and VOP_PUTPAGES().
> Both are called with the vnode interlock held, and by the time it is
> needed and possible to check the vnode, the interlock has been
> released. When these operations return ERESTARTVOP we have to take
> the interlock because dead_getpages() and dead_putpages() need it on
> entry (just to release it). It is possible to directly return the
> error from genfs_XXXpages() though. To me it looks clearer to always
> go the ERESTARTVOP route.

Ugh. I don't like having the locking behavior be different for
different error cases; it's asking for trouble in the long run. I think
this ends up being a reason to use ERESTART instead.

> > Also I wonder if there's any way to accomplish this that doesn't
> > require adding fstrans calls to every operation in every fs.
>
> Not in a clean way.
>
> We would need some kind of reference counting for vnode operations,
> and that is quite impossible as vnode operations on devices or fifos
> sometimes wait forever and are called from other file systems, like
> ufsspec_read() for example. How could we protect UFS updating access
> times here?

I'm not entirely convinced of that. There are basically three problems:
(a) new incoming threads, (b) threads that are already in the fs and
running, and (c) threads that are already in the fs and that are stuck
more or less permanently because something broke.

Admittedly I don't really understand how fstrans suspending works. Does
it keep track of all the threads that are in the fs, so the (b) ones
can be interrupted somehow, or so we at least can wait until all of
them either leave the fs or enter fstrans somewhere and stall?

If we're going to track that information we should really do it from
vnode_if.c, both to avoid having to modify every fs and to make sure
all fses support it correctly. (We also need to be careful about how
it's done to avoid causing massive lock contention; that's why such
logic doesn't already exist.)

If, however, fstrans isn't tracking that information, I don't see how
suspending the fs helps deal with the (b) threads, because if they're
currently running they can continue to chew on fs-specific data for
arbitrarily long before they run into anything that stalls them, and
there's no way to know when they're done or how many of them there are.

I don't really see what the issue with ufsspec_read() is, however, so
we may be talking past each other.

--
David A. Holland
dholl...@netbsd.org
Re: Problem identified: WAPL/RAIDframe performance problems
> > things. What I care about is the largest size sector that will (in
> >                                              ^^^^^^
> > the ordinary course of things anyway) be written atomically.
>
> Then those are 512-byte-sector drives [...]

No; because I can do 4K atomic writes, I want to know about that.

And can't you do that with traditional drives, drives which really do
have 512-byte sectors? Do a 4K transfer and you write 8 physical
sectors with no opportunity for any other operation to see the write
partially done.

Is that wrong, or am I missing something else?

/~\ The ASCII                     Mouse
\ / Ribbon Campaign
 X  Against HTML        mo...@rodents-montreal.org
/ \ Email!      7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B