[SMP] 2.4.5-ac13 through ac18 dcache/NFS conflict

2001-06-29 Thread Bob Glamm

Just to follow up on this situation, I think I've tracked it down
to a problem arising from a combination of SMP, the directory entry
cache, and NFS client code.  After several 24-hour runs of 10 copies
of

  'find /nfs-mounted-directory -print > /dev/null' 

running simultaneously, the kernel stops or dies in fs/dcache.c
(in dput() or d_lookup(); it triggered the BUG() on line 129 once).

Performing the same 10 finds on a locally mounted ext2 filesystem
produces no lockups or hangs.
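
For reference, the reproduction load amounts to a launcher along these
lines.  This is a sketch: the target directory and copy count are
parameterized with illustrative defaults, where the actual runs used the
NFS mount point and 10 copies:

```shell
#!/bin/sh
# Run N parallel copies of the find workload against TARGET.
# The defaults below are illustrative; the reported runs used an
# NFS-mounted directory and 10 simultaneous copies for ~24 hours.
TARGET=${TARGET:-/tmp}
N=${N:-10}
i=0
while [ "$i" -lt "$N" ]; do
    find "$TARGET" -print > /dev/null 2>&1 &
    i=$((i + 1))
done
wait   # returns once every background find has exited
```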

-Bob

> I've got a strange situation, and I'm looking for a little direction.
> Quick summary: I get sporadic lockups running 2.4.5-ac13 on a
> ServerWorks HE-SL board (SuperMicro 370DE6), 2 800MHz Coppermine CPUs,
> 512M RAM, 512M+ swap.  Machine has 8 active disks, two as RAID 1,
> 6 as RAID 5.  Swap is on RAID 1.  Machine also has a 100Mbit Netgear
> FA310TX and an Intel 82559-based 100Mbit card.  SCSI controllers
> are AIC-7899 (2) and AIC-7895 (1).  RAM is PC-133 ECC RAM; two
> identical machines display these problems.
> 
> I've seen three variations of symptoms:
> 
>   1) Almost complete lockout - machine responds to interrupts (indeed,
>  it can even complete a TCP connection) but no userspace code gets
>  executed.  Alt-SysRq-* still works, console scrollback does not;
>   2) Partial lockout - lock_kernel() seems to be getting called without
>  a corresponding unlock_kernel().  This manifested as programs such
>  as 'ps' and 'top' getting stuck in kernel space;
>   3) Unkillable programs - a test program that allocates 512M of memory
>  and touches every page; running two copies of this simultaneously
>  repeatedly results in at least one of the copies getting stuck
>  in 'raid1_alloc_r1bh'.
> 
> Symptom number 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3
> were observed under 2.4.5-ac13 only.  I never get any PANICs, only
> this variety of deadlocks.  A reboot is the only way to resolve the
> problem.
> 
> There seem to be two ways to manifest the problem.  As alluded to in
> (3), running two copies of the memory eater simultaneously along with
> calls to 'ps' and 'top' triggers the bug fairly quickly (within a minute
> or two).  Another method to manifest the problem is to run multiple
> copies of this script (I run 10 simultaneous copies):
> 
>   #!/bin/sh
> 
>   while /bin/true; do
>   ssh remote-machine 'sleep 1'
>   done
> 
> This script causes (1) in about a day or two.
> 
> If anyone has any suggestions about how to proceed to figure out what
> the problem is (or if there is already a fix), please let me know.
> I would be more than willing to provide a wide range of cooperation on
> this problem.  I don't have a feel for where to go from here, and I'm
> hoping that someone with more experience can give me some
> assistance.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [SMP] 2.4.5-ac13 memory corruption/deadlock?

2001-06-19 Thread Bob Glamm

> > I've got a strange situation, and I'm looking for a little direction.
> > Quick summary: I get sporadic lockups running 2.4.5-ac13 on a
> > ServerWorks HE-SL board (SuperMicro 370DE6), 2 800MHz Coppermine CPUs,
> > 512M RAM, 512M+ swap.  Machine has 8 active disks, two as RAID 1,
> > 6 as RAID 5.  Swap is on RAID 1.  Machine also has a 100Mbit Netgear
> > FA310TX and an Intel 82559-based 100Mbit card.  SCSI controllers
> > are AIC-7899 (2) and AIC-7895 (1).  RAM is PC-133 ECC RAM; two
> > identical machines display these problems.
> 
> please add nmi_watchdog=1 as a kernel parameter (append=... in lilo) and try
> again.
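
(For reference, with LILO that parameter goes in the append= line of the
kernel stanza in /etc/lilo.conf, and /sbin/lilo must be re-run afterwards.
A minimal fragment; the image path and label here are illustrative:)

```
# /etc/lilo.conf (fragment)
image=/boot/vmlinuz-2.4.5-ac13
    label=linux
    append="nmi_watchdog=1"
```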

Done, and I managed to get it to lock solid in under three hours.  Two
oopses in the syslog (follow).  It looks like memory corruption: the
BUG() checks called from spin_lock() and spin_unlock() test whether
the spinlock at the given address has the proper magic value; apparently
it no longer does.  In this case the lock that has been mangled is
dcache_lock.

Unfortunately, I don't think that this particular lockup is repeatable,
but I'm going to try again anyway to see if the same pattern of memory
corruption occurs.

-Bob

kernel BUG at /usr/src/linux-2.4.5-ac13/include/asm/spinlock.h:113!
invalid operand: 
CPU:0
EIP:0010:[d_alloc+413/504]
EFLAGS: 00010286
eax: 0044   ebx: de9f811c   ecx: c027c088   edx: 869b
esi: c6e3fed1   edi: c190a14c   ebp: c6e3fee8   esp: c6e3fe94
ds: 0018   es: 0018   ss: 0018
Process sshd (pid: 565, stackpage=c6e3f000)
Stack: c0238840 0071 de38bd04 c7f5ee94 c6e3fed2 0004 c01dae97 c190a11c
   c6e3fee8 de38bd04 c7975f14 c6e3ff14 bca8 3532325b 35373138 c01d005d
   bca8 c6e3ff14 0010 de38bd04 c7975f14 c6e3fec8 0009 002274ff
Call Trace: [sock_map_fd+211/532] [mark_rdev_faulty+17/60] [sys_accept+197/252] 
[__free_pages+27/28] [free_pages+33/36]
   [poll_freewait+58/68] [do_select+523/548] [select_bits_free+10/16] 
[sys_select+1135/1148] [sys_socketcall+180/512] [system_call+51/56]

Code: 0f 0b 83 c4 08 8d b6 00 00 00 00 a0 c0 e2 27 c0 84 c0 7e 17
 eip: c0152f37 (d_lookup)
kernel BUG at /usr/src/linux-2.4.5-ac13/include/asm/spinlock.h:101!
invalid operand: 
CPU:1
EIP:0010:[d_lookup+121/476]
EFLAGS: 00010282
eax: 0044   ebx: dffe9f68   ecx: c027c088   edx: 8a07
esi:    edi: c1933824   ebp: b818   esp: dffe9f04
ds: 0018   es: 0018   ss: 0018
Process init (pid: 1, stackpage=dffe9000)
Stack: c0238840 0065 dffe9f68  c1933824 b818 dff40a20 d228d001
   0023ee05 0003 c014850c c1932de4 dffe9f68 dffe9f68 c0148d09 c1932de4
   dffe9f68 0004 d228d000  dffe9fa4 b818 c01480ca 0009
Call Trace: [cached_lookup+16/84] [path_walk+889/3104] [getname+90/152] 
[__user_walk+60/88] [sys_stat64+22/120]
   [system_call+51/56]
eip: c021f2f4 (atomic_dec_and_lock)
kernel BUG at /usr/src/linux-2.4.5-ac13/include/asm/spinlock.h:101!

Code: 0f 0b 83 c4 08 f0 fe 0d c0 e2 27 c0 0f 88 22 17 0d 00 8b 54
 invalid operand: 
Kernel panic: Attempted to kill init!

> > I've seen three variations of symptoms:
> >
> >   1) Almost complete lockout - machine responds to interrupts (indeed,
> >  it can even complete a TCP connection) but no userspace code gets
> >  executed.  Alt-SysRq-* still works, console scrollback does not;
> >   2) Partial lockout - lock_kernel() seems to be getting called without
> >  a corresponding unlock_kernel().  This manifested as programs such
> >  as 'ps' and 'top' getting stuck in kernel space;
> >   3) Unkillable programs - a test program that allocates 512M of memory
> >  and touches every page; running two copies of this simultaneously
> >  repeatedly results in at least one of the copies getting stuck
> >  in 'raid1_alloc_r1bh'.
> >
> > Symptom number 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3
> > were observed under 2.4.5-ac13 only.  I never get any PANICs, only
> > this variety of deadlocks.  A reboot is the only way to resolve the
> > problem.
> >
> > There seem to be two ways to manifest the problem.  As alluded to in
> > (3), running two copies of the memory eater simultaneously along with
> > calls to 'ps' and 'top' triggers the bug fairly quickly (within a minute
> > or two).  Another method to manifest the problem is to run multiple
> > copies of this script (I run 10 simultaneous copies):
> >
> >   #!/bin/sh
> >
> >   while /bin/true; do
> >   ssh remote-machine 'sleep 1'
> >   done
> >
> > This script causes (1) in about a day or two.
> >
> > If anyone has any suggestions about how to proceed to figure out what
> > the problem is (or if there is already a fix), please let me know.
> > I would be more than willing to provide a wide range of cooperation on
> > this problem.  I don't have a feel for where to go from here, and I'm
> > hoping that someone with more experience can give me some
> > assistance.
> >
> > -Bob
> > -


[SMP] 2.4.5-ac13 deadlocked?

2001-06-18 Thread Bob Glamm

I've got a strange situation, and I'm looking for a little direction.
Quick summary: I get sporadic lockups running 2.4.5-ac13 on a
ServerWorks HE-SL board (SuperMicro 370DE6), 2 800MHz Coppermine CPUs,
512M RAM, 512M+ swap.  Machine has 8 active disks, two as RAID 1,
6 as RAID 5.  Swap is on RAID 1.  Machine also has a 100Mbit Netgear
FA310TX and an Intel 82559-based 100Mbit card.  SCSI controllers
are AIC-7899 (2) and AIC-7895 (1).  RAM is PC-133 ECC RAM; two
identical machines display these problems.

I've seen three variations of symptoms:

  1) Almost complete lockout - machine responds to interrupts (indeed,
 it can even complete a TCP connection) but no userspace code gets
 executed.  Alt-SysRq-* still works, console scrollback does not;
  2) Partial lockout - lock_kernel() seems to be getting called without
 a corresponding unlock_kernel().  This manifested as programs such
 as 'ps' and 'top' getting stuck in kernel space;
  3) Unkillable programs - a test program that allocates 512M of memory
 and touches every page; running two copies of this simultaneously
 repeatedly results in at least one of the copies getting stuck
 in 'raid1_alloc_r1bh'.

Symptom number 1 was present in 2.4.2-ac20 as well; symptoms 2 and 3
were observed under 2.4.5-ac13 only.  I never get any PANICs, only
this variety of deadlocks.  A reboot is the only way to resolve the
problem.

There seem to be two ways to manifest the problem.  As alluded to in
(3), running two copies of the memory eater simultaneously along with
calls to 'ps' and 'top' triggers the bug fairly quickly (within a minute
or two).  Another method to manifest the problem is to run multiple
copies of this script (I run 10 simultaneous copies):

  #!/bin/sh

  while /bin/true; do
    ssh remote-machine 'sleep 1'
  done

This script causes (1) in about a day or two.

If anyone has any suggestions about how to proceed to figure out what
the problem is (or if there is already a fix), please let me know.
I would be more than willing to provide a wide range of cooperation on
this problem.  I don't have a feel for where to go from here, and I'm
hoping that someone with more experience can give me some
assistance.

-Bob
-




Re: LANANA: To Pending Device Number Registrants

2001-05-15 Thread Bob Glamm

> > >Keep it informational. And NEVER EVER make it part of the design.
> > 
> > What about:
> > 
> > 1 (network domain). I have two network interfaces that I connect to 
> > two different network segments, eth0 & eth1;
> 
> So?
> 
> Informational. You can always ask what "eth0" and "eth1" are.
[...] 
> The location of the device is _meaningless_. 
[...]

Roast me if I'm wrong or if this has been beat to death, but there
seem to be two sides of the issue here:

  1) Device numbering/naming.  It is immaterial to the kernel how the
 devices are enumerated or named.  In fact, I would argue that the
 naming could be non-deterministic across reboots.

  2) Device identification.  The end-user or user-space software needs
 to be able to configure certain devices a certain way.  They too
 don't (shouldn't) care what numbers or names are given to the
 devices, as long as they can configure the proper device correctly.

I don't disagree that moving toward dynamic device enumeration/naming
is the right way to go; in fact, I don't disagree
that a 32-bit dev_t would be more than adequate (and sparse) for most
configurations - even the largest configured machines wouldn't have more
than several million device names/nodes.

However, I *do* see that there is a LOT of end-user software that still
depends on static numbering to partially identify devices.  Yes, it is
half-baked that major 8 gets you SCSI devices, and then after you open
all the minor devices you STILL get to do all the device-specific ioctl()
calls to identify the device capabilities of the controller or each target
on the controller.  But I don't think that arbitrarily slamming the door
on static naming/numbering to force people to change arguably broken
code or semantics is the right move to make either.

Instead, what about doing the transformation gradually?  Static and
dynamic enumeration shouldn't have to be mutually exclusive.  E.g.,
in the interim devices could be accessed via dynamically enumerated/named
nodes as well as the old statically enumerated/named nodes.  The
current device enumeration space seems to be sparse enough to take
care of this for most cases.

During this transition, end-user software would have the chance to be
re-written to use the new dynamically enumerated/named device scheme,
perhaps with a somewhat standardized interface to make identification 
and capability detection of devices easier from software.  At some
scheduled point in future kernel development support for the old
static enumeration/naming scheme would be dropped.

Finally, there has to be an *easy* way of identifying devices from software.
You're right, I don't care if my network cards are numbered 0-1-2, 2-0-1,
or in any other permutation, *as long as I can write something like this*:

  # start up networking
  for i in eth0 eth1 eth2; do
  identify device $i
  get configuration/config procedure for device $i identity
  configure $i
  done

Ideally, the identity of device $i would remain the same across reboots.
Note that the device isn't named by its identity, rather, the identity is
acquired from the device.
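
A concrete (if simplified) sketch of that loop, keyed on the MAC address
as the reboot-stable identity.  The MAC prefixes, the addresses they map
to, and the sysfs path used to read the hardware address are all
assumptions for illustration:

```shell
#!/bin/sh
# Sketch: configure NICs by a stable identity (the MAC address) instead
# of by probe-order name.  The MAC-to-address table is hypothetical.
config_for_mac() {
    case "$1" in
        00:02:b3:*) echo "10.0.0.1" ;;   # hypothetical card -> address
        00:a0:cc:*) echo "10.0.1.1" ;;   # hypothetical card -> address
        *)          echo "" ;;
    esac
}

for dev in eth0 eth1 eth2; do
    # Acquire the identity from the device itself; skip absent interfaces.
    mac=$(cat "/sys/class/net/$dev/address" 2>/dev/null) || continue
    addr=$(config_for_mac "$mac")
    if [ -n "$addr" ]; then
        echo "would configure $dev ($mac) as $addr"
    fi
done
```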

This gets difficult for certain situations but I think those situations
are rare.  Most modern hardware I've seen has some intrinsic identification
built on board.

> Linux gets this right. We don't give 100Mbps cards different names from
> 10Mbps cards - and pcmcia cards show up in the same namespace as cardbus,
> which is the same namespace as ISA. And it doesn't matter what _driver_ we
> use.
> 
> The "eth0..N" naming is done RIGHT!
> 
> > 2 (disk domain). I have multiple spindles on multiple SCSI adapters. 
> 
> So? Same deal. You don't have eth0..N, you have disk0..N. 
[...]
> Linux gets this _somewhat_ right. The /dev/sdxxx naming is correct (or, if
> you look at only IDE devices, /dev/hdxxx). The problem is that we don't
> have a unified namespace, so unlike eth0..N we do _not_ have a unified
> namespace for disks.

This numbering seems to be a kernel categorization policy.  E.g.,
I have k eth devices, numbered eth0..k-1.  I have m disks, numbered
disk0..m-1, I have n video adapters, numbered fb0..n-1, etc.  This
implies that at some point someone will have to maintain a list of 
device categories.

IMHO the example isn't consistent though.  ethXX devices are a different
level of classification from diskYY.  I would argue that *all* network
devices should be named net0, net1, etc., be they Ethernet, Token Ring, Fibre
Channel, ATM.  Just as different disks should be named disk0, disk1, etc., whether
they are IDE, SCSI, ESDI, or some other controller format.

-Bob
-


