[SMP] 2.4.5-ac13 through ac18 dcache/NFS conflict
Just to follow up on this situation: I think I've tracked it down to a problem arising from a combination of SMP, the directory entry cache, and the NFS client code. After several 24-hour runs of 10 simultaneous copies of 'find /nfs-mounted-directory -print > /dev/null', the kernel stops or dies in fs/dcache.c (in dput() or d_lookup(); once it triggered the BUG() on line 129). Performing the same 10 finds on a locally mounted ext2 filesystem produces no lockups or hangs.

-Bob

> I've got a strange situation, and I'm looking for a little direction.
> Quick summary: I get sporadic lockups running 2.4.5-ac13 on a
> ServerWorks HE-SL board (SuperMicro 370DE6), 2 800MHz Coppermine CPUs,
> 512M RAM, 512M+ swap. The machine has 8 active disks: two as RAID 1,
> six as RAID 5. Swap is on RAID 1. The machine also has a 100Mbit Netgear
> FA310TX and an Intel 82559-based 100Mbit card. The SCSI controllers
> are AIC-7899 (2) and AIC-7895 (1). RAM is PC-133 ECC; two
> identical machines display these problems.
>
> I've seen three variations of symptoms:
>
> 1) Almost complete lockout - the machine responds to interrupts (indeed,
>    it can even complete a TCP connection) but no userspace code gets
>    executed. Alt-SysRq-* still works; console scrollback does not.
> 2) Partial lockout - lock_kernel() seems to be getting called without
>    a corresponding unlock_kernel(). This manifests as programs such
>    as 'ps' and 'top' getting stuck in kernel space.
> 3) Unkillable programs - a test program allocates 512M of memory
>    and touches every page; running two copies of it simultaneously
>    repeatedly results in at least one copy getting stuck
>    in 'raid1_alloc_r1bh'.
>
> Symptom 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3
> were observed under 2.4.5-ac13 only. I never get any panics, only
> this variety of deadlocks. A reboot is the only way to resolve the
> problem.
>
> There seem to be two ways to manifest the problem. As alluded to in
> (3), running two copies of the memory eater simultaneously along with
> calls to 'ps' and 'top' triggers the bug fairly quickly (within a
> minute or two). Another method is to run multiple copies of this
> script (I run 10 simultaneous copies):
>
> #!/bin/sh
>
> while /bin/true; do
>     ssh remote-machine 'sleep 1'
> done
>
> This script causes (1) in about a day or two.
>
> If anyone has any suggestions about how to proceed to figure out what
> the problem is (or if there is already a fix), please let me know.
> I would be more than willing to cooperate however I can on this
> problem. I don't have a feel for where to go from here, and I'm
> hoping that someone with more experience can give me some assistance.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [SMP] 2.4.5-ac13 memory corruption/deadlock?
> > I've got a strange situation, and I'm looking for a little direction.
> > Quick summary: I get sporadic lockups running 2.4.5-ac13 on a
> > ServerWorks HE-SL board (SuperMicro 370DE6), 2 800MHz Coppermine CPUs,
> > 512M RAM, 512M+ swap. Machine has 8 active disks, two as RAID 1,
> > 6 as RAID 5. Swap is on RAID 1. Machine also has a 100Mbit Netgear
> > FA310TX and an Intel 82559-based 100Mbit card. SCSI controllers
> > are AIC-7899 (2) and AIC-7895 (1). RAM is PC-133 ECC RAM; two
> > identical machines display these problems.
>
> Please add nmi_watchdog=1 as a kernel parameter (append=... in lilo)
> and try again.

Done, and I managed to get it to lock solid in under three hours. Two oopses
appeared in the syslog (below). It looks like memory corruption: the BUG()
called from spin_lock() and spin_unlock() tests whether the spinlock at the
given address has the proper magic value, and apparently it has reached the
point where it doesn't. In this case the lock that has been mangled is
dcache_lock. Unfortunately, I don't think this particular lockup is
repeatable, but I'm going to try again anyway to see whether the same
pattern of memory corruption occurs.

-Bob

kernel BUG at /usr/src/linux-2.4.5-ac13/include/asm/spinlock.h:113!
invalid operand:
CPU:    0
EIP:    0010:[d_alloc+413/504]
EFLAGS: 00010286
eax: 0044       ebx: de9f811c   ecx: c027c088   edx: 869b
esi: c6e3fed1   edi: c190a14c   ebp: c6e3fee8   esp: c6e3fe94
ds: 0018   es: 0018   ss: 0018
Process sshd (pid: 565, stackpage=c6e3f000)
Stack: c0238840 0071 de38bd04 c7f5ee94 c6e3fed2 0004 c01dae97 c190a11c
       c6e3fee8 de38bd04 c7975f14 c6e3ff14 bca8 3532325b 35373138 c01d005d
       bca8 c6e3ff14 0010 de38bd04 c7975f14 c6e3fec8 0009 002274ff
Call Trace: [sock_map_fd+211/532] [mark_rdev_faulty+17/60]
       [sys_accept+197/252] [__free_pages+27/28] [free_pages+33/36]
       [poll_freewait+58/68] [do_select+523/548] [select_bits_free+10/16]
       [sys_select+1135/1148] [sys_socketcall+180/512] [system_call+51/56]
Code: 0f 0b 83 c4 08 8d b6 00 00 00 00 a0 c0 e2 27 c0 84 c0 7e 17
eip: c0152f37 (d_lookup)

kernel BUG at /usr/src/linux-2.4.5-ac13/include/asm/spinlock.h:101!
invalid operand:
CPU:    1
EIP:    0010:[d_lookup+121/476]
EFLAGS: 00010282
eax: 0044   ebx: dffe9f68   ecx: c027c088   edx: 8a07
esi:        edi: c1933824   ebp: b818       esp: dffe9f04
ds: 0018   es: 0018   ss: 0018
Process init (pid: 1, stackpage=dffe9000)
Stack: c0238840 0065 dffe9f68 c1933824 b818 dff40a20 d228d001 0023ee05
       0003 c014850c c1932de4 dffe9f68 dffe9f68 c0148d09 c1932de4 dffe9f68
       0004 d228d000 dffe9fa4 b818 c01480ca 0009
Call Trace: [cached_lookup+16/84] [path_walk+889/3104] [getname+90/152]
       [__user_walk+60/88] [sys_stat64+22/120] [system_call+51/56]
eip: c021f2f4 (atomic_dec_and_lock)
kernel BUG at /usr/src/linux-2.4.5-ac13/include/asm/spinlock.h:101!
Code: 0f 0b 83 c4 08 f0 fe 0d c0 e2 27 c0 0f 88 22 17 0d 00 8b 54
invalid operand:
Kernel panic: Attempted to kill init!

> > I've seen three variations of symptoms:
> >
> > 1) Almost complete lockout - machine responds to interrupts (indeed,
> >    it can even complete a TCP connection) but no userspace code gets
> >    executed. Alt-SysRq-* still works; console scrollback does not.
> > 2) Partial lockout - lock_kernel() seems to be getting called without
> >    a corresponding unlock_kernel(). This manifested as programs such
> >    as 'ps' and 'top' getting stuck in kernel space.
> > 3) Unkillable programs - a test program that allocates 512M of memory
> >    and touches every page; running two copies of this simultaneously
> >    repeatedly results in at least one of the copies getting stuck
> >    in 'raid1_alloc_r1bh'.
> >
> > Symptom number 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3
> > were observed under 2.4.5-ac13 only. I never get any panics, only
> > this variety of deadlocks. A reboot is the only way to resolve the
> > problem.
> >
> > There seem to be two ways to manifest the problem. As alluded to in
> > (3), running two copies of the memory eater simultaneously along with
> > calls to 'ps' and 'top' triggers the bug fairly quickly (within a
> > minute or two). Another method to manifest the problem is to run
> > multiple copies of this script (I run 10 simultaneous copies):
> >
> > #!/bin/sh
> >
> > while /bin/true; do
> >     ssh remote-machine 'sleep 1'
> > done
> >
> > This script causes (1) in about a day or two.
> >
> > If anyone has any suggestions about how to proceed to figure out what
> > the problem is (or if there is already a fix), please let me know.
> > I would be more than willing to provide a wide range of cooperation on
> > this problem. I don't have a feel for where to go from here, and I'm
> > hoping that someone with more experience can give me some assistance.
> >
> > -Bob
[SMP] 2.4.5-ac13 deadlocked?
I've got a strange situation, and I'm looking for a little direction.
Quick summary: I get sporadic lockups running 2.4.5-ac13 on a ServerWorks
HE-SL board (SuperMicro 370DE6), 2 800MHz Coppermine CPUs, 512M RAM, 512M+
swap. The machine has 8 active disks: two as RAID 1, six as RAID 5. Swap is
on RAID 1. The machine also has a 100Mbit Netgear FA310TX and an Intel
82559-based 100Mbit card. The SCSI controllers are AIC-7899 (2) and
AIC-7895 (1). RAM is PC-133 ECC; two identical machines display these
problems.

I've seen three variations of symptoms:

1) Almost complete lockout - the machine responds to interrupts (indeed,
   it can even complete a TCP connection) but no userspace code gets
   executed. Alt-SysRq-* still works; console scrollback does not.
2) Partial lockout - lock_kernel() seems to be getting called without a
   corresponding unlock_kernel(). This manifests as programs such as 'ps'
   and 'top' getting stuck in kernel space.
3) Unkillable programs - a test program allocates 512M of memory and
   touches every page; running two copies of it simultaneously repeatedly
   results in at least one copy getting stuck in 'raid1_alloc_r1bh'.

Symptom 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3 were observed
under 2.4.5-ac13 only. I never get any panics, only this variety of
deadlocks. A reboot is the only way to resolve the problem.

There seem to be two ways to manifest the problem. As alluded to in (3),
running two copies of the memory eater simultaneously along with calls to
'ps' and 'top' triggers the bug fairly quickly (within a minute or two).
Another method is to run multiple copies of this script (I run 10
simultaneous copies):

#!/bin/sh

while /bin/true; do
    ssh remote-machine 'sleep 1'
done

This script causes (1) in about a day or two.

If anyone has any suggestions about how to proceed to figure out what the
problem is (or if there is already a fix), please let me know. I would be
more than willing to cooperate however I can on this problem. I don't have
a feel for where to go from here, and I'm hoping that someone with more
experience can give me some assistance.

-Bob
Re: LANANA: To Pending Device Number Registrants
> > > Keep it informational. And NEVER EVER make it part of the design.
> >
> > What about:
> >
> > 1 (network domain). I have two network interfaces that I connect to
> > two different network segments, eth0 & eth1;
>
> So?
>
> Informational. You can always ask what "eth0" and "eth1" are. [...]
> The location of the device is _meaningless_. [...]

Roast me if I'm wrong or if this has been beaten to death, but there seem
to be two sides to the issue here:

1) Device numbering/naming. It is immaterial to the kernel how the devices
   are enumerated or named. In fact, I would argue that the naming could
   be non-deterministic across reboots.

2) Device identification. The end user or user-space software needs to be
   able to configure certain devices a certain way. They too don't
   (shouldn't) care what numbers or names are given to the devices, as
   long as they can configure the proper device correctly.

I don't disagree that a move toward dynamic device enumeration/naming is
the right way to go; in fact, I don't disagree that a 32-bit dev_t would be
more than adequate (and sparse) for most configurations - even the largest
configured machines wouldn't have more than several million device
names/nodes. However, I *do* see that there is a LOT of end-user software
that still depends on static numbering to partially identify devices. Yes,
it is half-baked that major 8 gets you SCSI devices, and then after you
open all the minor devices you STILL get to do all the device-specific
ioctl() calls to identify the capabilities of the controller or of each
target on the controller. But I don't think that arbitrarily slamming the
door on static naming/numbering to force people to change arguably broken
code or semantics is the right move either.

Instead, what about doing the transformation gradually? Static and dynamic
enumeration shouldn't have to be mutually exclusive. E.g., in the interim,
devices could be accessed via dynamically enumerated/named nodes as well as
the old statically enumerated/named nodes. The current device enumeration
space seems to be sparse enough to take care of this in most cases. During
this transition, end-user software would have the chance to be rewritten to
use the new dynamically enumerated/named device scheme, perhaps with a
somewhat standardized interface to make identification and capability
detection of devices easier from software. At some scheduled point in
future kernel development, support for the old static enumeration/naming
scheme would be dropped.

Finally, there has to be an *easy* way of identifying devices from
software. You're right, I don't care if my network cards are numbered
0-1-2, 2-0-1, or in any other permutation, *as long as I can write
something like this*:

    # start up networking
    for i in eth0 eth1 eth2; do
        identify device $i
        get configuration/config procedure for device $i identity
        configure $i
    done

Ideally, the identity of device $i would remain the same across reboots.
Note that the device isn't named by its identity; rather, the identity is
acquired from the device. This gets difficult in certain situations, but I
think those situations are rare. Most modern hardware I've seen has some
intrinsic identification built on board.

> Linux gets this right. We don't give 100Mbps cards different names from
> 10Mbps cards - and pcmcia cards show up in the same namespace as
> cardbus, which is the same namespace as ISA. And it doesn't matter what
> _driver_ we use.
>
> The "eth0..N" naming is done RIGHT!
>
> > 2 (disk domain). I have multiple spindles on multiple SCSI adapters.
>
> So? Same deal. You don't have eth0..N, you have disk0..N. [...]
> Linux gets this _somewhat_ right. The /dev/sdxxx naming is correct (or,
> if you look at only IDE devices, /dev/hdxxx). The problem is that we
> don't have a unified namespace, so unlike eth0..N we do _not_ have a
> unified namespace for disks.

This numbering seems to be a kernel categorization policy. E.g., I have k
eth devices, numbered eth0..k-1; I have m disks, numbered disc0..m-1; I
have n video adapters, numbered fb0..n-1; etc. This implies that at some
point someone will have to maintain a list of device categories.

IMHO the example isn't consistent, though. ethXX devices are a different
level of classification from diskYY. I would argue that *all* network
devices should be named net0, net1, etc., be they Ethernet, Token Ring,
Fibre Channel, or ATM - just as different disks would be named disk0,
disk1, etc., whether they are IDE, SCSI, ESDI, or some other controller
format.

-Bob