Tom,

On Thu, Jul 22, 2010 at 4:49 PM, Tom Ammon <tom.am...@utah.edu> wrote:
> Hal,
>
> Thanks for looking at all of this with me. ifconfig output is below.
>
> On 7/22/2010 12:08 PM, Hal Rosenstock wrote:
>>
>> Tom,
>>
>> On Thu, Jul 22, 2010 at 1:19 PM, Tom Ammon<tom.am...@utah.edu> wrote:
>>>
>>> Hal,
>>>
>>> On 7/21/2010 2:45 PM, Hal Rosenstock wrote:
>>>>
>>>> Hi Tom,
>>>>
>>>> On 7/19/10, Tom Ammon<tom.am...@utah.edu> wrote:
>>>>>
>>>>> I'm trying to set up partitions in a little test environment, and I'm
>>>>> having trouble.
>>>>>
>>>>> I have opensm running on a machine attached to the fabric, and sminfo
>>>>> on the other machines confirm that this is indeed the master SM.
>>>>> Here's my /etc/opensm/partitions.conf:
>>>>>
>>>>> Default=0xffff , ipoib : ALL, SELF=full ;
>>>>> PartitionBlue=0x8004, ipoib : 0x0002c9030009cb3f=full,
>>>>> 0x0002c90200252841=full, 0x0002c90200243471=full ;
>>>>> PartitionRed=0x8005, ipoib : 0x0002c90200252841=full,
>>>>> 0x0002c90200243591=full, 0x0002c9030009cb2b=full ;
>>>>
>>>> You don't really need the 0x8000 bit on in the pkeys but I don't think
>>>> it does any harm.
>>>>
>>>>> But when I go to the machine with port GUID 0x0002c90200243471, it
>>>>> doesn't appear that it's getting the pkey I wanted:
>>>>>
>>>>> [r...@stagnate ~]# ibstat
>>>>> CA 'mthca0'
>>>>>     CA type: MT23108
>>>>>     Number of ports: 2
>>>>>     Firmware version: 3.3.5
>>>>>     Hardware version: a1
>>>>>     Node GUID: 0x0002c90200243470
>>>>>     System image GUID: 0x0002c90200243473
>>>>>     Port 1:
>>>>>         State: Active
>>>>>         Physical state: LinkUp
>>>>>         Rate: 10
>>>>>         Base lid: 10
>>>>>         LMC: 0
>>>>>         SM lid: 4
>>>>>         Capability mask: 0x02510a68
>>>>>         Port GUID: 0x0002c90200243471
>>>>>     Port 2:
>>>>>         State: Down
>>>>>         Physical state: Polling
>>>>>         Rate: 2
>>>>>         Base lid: 0
>>>>>         LMC: 0
>>>>>         SM lid: 0
>>>>>         Capability mask: 0x02510a68
>>>>>         Port GUID: 0x0002c90200243472
>>>>>
>>>>> [r...@stagnate ~]# cat /sys/class/net/ib0/pkey
>>>>> 0xffff
>>>>
>>>> What does:
>>>>
>>>> smpquery pkeys 10 1
>>>>
>>>> say ? Do you see the other pkey(s) on that port ?
>>>
>>> [r...@stagnate ~]# smpquery pkeys 10 1
>>>    0: 0x7fff 0x8004 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>    8: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   16: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   24: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   32: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   40: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   48: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   56: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>> 64 pkeys capacity for this port
>>>
>>> So I see that both 7fff and 8004 are being assigned to this port. Is that
>>> okay?
>>
>> Yes.
>>
>>> Is there any problem with the machine also being in the default
>>> partition?
>>
>> No.
>>
>>> As I look around at all of the machines with smpquery, it appears that
>>> they are all being assigned 7fff and the pkey that I assigned in
>>> partitions.conf.
>>
>> Good.
>>
>>> But the machine that I want to run 2 child interfaces on is having
>>> issues. It's at LID 7 and here's what smpquery says:
>>>
>>> [r...@stagnate ~]# smpquery pkeys 7 1
>>>    0: 0x7fff 0x8004 0x8005 0x0000 0x0000 0x0000 0x0000 0x0000
>>>    8: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   16: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   24: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   32: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   40: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   48: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>>   56: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
>>> 64 pkeys capacity for this port
>>>
>>> So that's fine, but when I try to create a child interface I get this:
>>>
>>> [r...@labdisk01 ~]# echo 0x8004> /sys/class/net/ib0/create_child
>>> -bash: echo: write error: Name not unique on network
>>
>> I don't know what causes that error. Maybe someone else can help here.
>>
>> Are you sure the ib0 interface is OK ? What does ifconfig ib0 say ?
>
> Here's ifconfig ib0:
>
> ib0       Link encap:InfiniBand  HWaddr 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet6 addr: fe80::202:c902:25:2841/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>           RX packets:1 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:17 errors:0 dropped:7 overruns:0 carrier:0
>           collisions:0 txqueuelen:256
>           RX bytes:56 (56.0 b)  TX bytes:3529 (3.4 KiB)
>
> Then I brought up the "sub"interfaces with "ifup ib0.8004" "ifup ib0.8005".
> Still get the "Name not unique on network" message if I switch the order and
> do ifup followed by echo 0x8004....etc.
>
> ib0.8004  Link encap:InfiniBand  HWaddr 80:00:04:06:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet addr:10.0.0.2  Bcast:10.0.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::202:c902:25:2841/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:78 errors:0 dropped:17 overruns:0 carrier:0
>           collisions:0 txqueuelen:256
>           RX bytes:0 (0.0 b)  TX bytes:14620 (14.2 KiB)
>
> ib0.8005  Link encap:InfiniBand  HWaddr 80:00:04:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet addr:192.168.10.2  Bcast:192.168.10.255  Mask:255.255.255.0
>           inet6 addr: fe80::202:c902:25:2841/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:72 errors:0 dropped:18 overruns:0 carrier:0
>           collisions:0 txqueuelen:256
>           RX bytes:0 (0.0 b)  TX bytes:14269 (13.9 KiB)
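
The RX/TX counters in the three ifconfig dumps can be compared side by side with a few lines of Python. This is only a rough sketch: it assumes the classic net-tools ifconfig layout quoted in this thread, the sample text is trimmed to the counter lines, and the function names (packet_counters, silent_receivers) are made up for illustration.

```python
import re

# Counter lines copied from the ifconfig output quoted in this thread.
IFCONFIG_TEXT = """\
ib0       Link encap:InfiniBand
          RX packets:1 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17 errors:0 dropped:7 overruns:0 carrier:0
ib0.8004  Link encap:InfiniBand
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:78 errors:0 dropped:17 overruns:0 carrier:0
ib0.8005  Link encap:InfiniBand
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:72 errors:0 dropped:18 overruns:0 carrier:0
"""


def packet_counters(text):
    """Map interface name -> {'rx', 'rx_dropped', 'tx', 'tx_dropped'}."""
    counters = {}
    current = None
    for line in text.splitlines():
        # A line starting in column 0 opens a new interface stanza.
        if line and not line[0].isspace():
            current = line.split()[0]
            counters[current] = {}
        match = re.search(r"\b(RX|TX) packets:(\d+).*?dropped:(\d+)", line)
        if match and current is not None:
            direction = match.group(1).lower()
            counters[current][direction] = int(match.group(2))
            counters[current][direction + "_dropped"] = int(match.group(3))
    return counters


def silent_receivers(counters):
    """Interfaces that have transmitted packets but never received one."""
    return sorted(
        name
        for name, c in counters.items()
        if c.get("tx", 0) > 0 and c.get("rx", 0) == 0
    )


if __name__ == "__main__":
    print(silent_receivers(packet_counters(IFCONFIG_TEXT)))
```

With the numbers above, ib0.8004 and ib0.8005 are the two interfaces reported as having sent packets without receiving any.
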
Looks like none of the subinterfaces are receiving, and the primary
interface has only received 1 packet.

What does saquery -g show? And then what does saquery -m <mlid> show for
each MLID listed in the multicast groups dump?

-- Hal

> Also, here's some junk from /var/log/messages, seemed like it might be
> relevant, but maybe this is just IP stuff:
>
> Jul 22 14:38:37 labdisk01 kernel: ADDRCONF(NETDEV_UP): ib0.8004: link is not ready
> Jul 22 14:38:37 labdisk01 kernel: ADDRCONF(NETDEV_CHANGE): ib0.8004: link becomes ready
> Jul 22 14:38:39 labdisk01 avahi-daemon[4056]: New relevant interface ib0.8004.IPv6 for mDNS.
> Jul 22 14:38:39 labdisk01 avahi-daemon[4056]: Joining mDNS multicast group on interface ib0.8004.IPv6 with address fe80::202:c902:25:2841.
> Jul 22 14:38:39 labdisk01 avahi-daemon[4056]: Registering new address record for fe80::202:c902:25:2841 on ib0.8004.
> Jul 22 14:38:41 labdisk01 avahi-daemon[4056]: New relevant interface ib0.8004.IPv4 for mDNS.
> Jul 22 14:38:41 labdisk01 avahi-daemon[4056]: Joining mDNS multicast group on interface ib0.8004.IPv4 with address 10.0.0.2.
> Jul 22 14:38:41 labdisk01 avahi-daemon[4056]: Registering new address record for 10.0.0.2 on ib0.8004.
> Jul 22 14:39:22 labdisk01 kernel: ADDRCONF(NETDEV_UP): ib0.8005: link is not ready
> Jul 22 14:39:22 labdisk01 kernel: ADDRCONF(NETDEV_CHANGE): ib0.8005: link becomes ready
> Jul 22 14:39:24 labdisk01 avahi-daemon[4056]: New relevant interface ib0.8005.IPv6 for mDNS.
> Jul 22 14:39:24 labdisk01 avahi-daemon[4056]: Joining mDNS multicast group on interface ib0.8005.IPv6 with address fe80::202:c902:25:2841.
> Jul 22 14:39:24 labdisk01 avahi-daemon[4056]: Registering new address record for fe80::202:c902:25:2841 on ib0.8005.
> Jul 22 14:39:26 labdisk01 avahi-daemon[4056]: New relevant interface ib0.8005.IPv4 for mDNS.
> Jul 22 14:39:26 labdisk01 avahi-daemon[4056]: Joining mDNS multicast group on interface ib0.8005.IPv4 with address 192.168.10.2.
> Jul 22 14:39:26 labdisk01 avahi-daemon[4056]: Registering new address record for 192.168.10.2 on ib0.8005.
>
>>
>>> My plan was to create two child interfaces (0x8004 and 0x8005) and then
>>> ifconfig ib0.8004 and ifconfig ib0.8005 to assign them to separate
>>> subnets.
>>
>> That should be fine.
>>
>> -- Hal
>>
>>> Tom
>>>
>>>>
>>>> The pkey you are seeing is the only one for ib0 interface.
>>>>
>>>> If you want to have IPoIB interfaces on the other partitions too, you
>>>> need to set this up by creating a child interface on those nodes; you
>>>> had asked about that in a previous email
>>>> (http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg04728.html).
>>>>
>>>> -- Hal
>>>>
>>>>>
>>>>> I'm trying to run one ipoib subnet in each partition, and then
>>>>> eventually the goal is to have a different server that has 2 child
>>>>> interfaces, one on each subnet. But it doesn't appear that my partition
>>>>> configuration is even correct. Is there a syntax error, or something
>>>>> else I am missing?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tom
>>>>>
>>>>> --
>>>>> Tom Ammon
>>>>> Network Engineer
>>>>> Office: 801.587.0976
>>>>> Mobile: 801.674.9273
>>>>>
>>>>> Center for High Performance Computing
>>>>> University of Utah
>>>>> http://www.chpc.utah.edu
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
>>>>> in the body of a message to majord...@vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
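
For anyone reading the smpquery pkeys tables quoted in this thread: the top bit (0x8000) of each 16-bit P_Key is the membership bit, set for a full member and clear for a limited member, and the low 15 bits identify the partition. So 0x7fff is the default partition with limited membership, which is consistent with ALL being listed without "=full" in the partitions.conf above. The following is a minimal Python sketch that decodes such a table. It assumes the row layout shown in the thread, uses a shortened copy of the LID 7 output as sample input, and the name decode_pkey_table is made up for illustration.

```python
# Rows copied from the "smpquery pkeys 7 1" output in this thread
# (only the first two rows are kept; the rest are all 0x0000).
SMPQUERY_TEXT = """\
   0: 0x7fff 0x8004 0x8005 0x0000 0x0000 0x0000 0x0000 0x0000
   8: 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000 0x0000
64 pkeys capacity for this port
"""

FULL_MEMBER_BIT = 0x8000  # top bit of a P_Key: 1 = full member, 0 = limited
BASE_MASK = 0x7FFF        # low 15 bits identify the partition


def decode_pkey_table(text):
    """Return (table index, partition base value, membership) per used slot."""
    entries = []
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        # Skip anything that is not an "<index>: <pkey> <pkey> ..." row.
        if not sep or not head.strip().isdigit():
            continue
        first_index = int(head)
        for offset, token in enumerate(rest.split()):
            value = int(token, 16)
            if value == 0:
                continue  # 0x0000 marks an unused slot
            membership = "full" if value & FULL_MEMBER_BIT else "limited"
            entries.append((first_index + offset, value & BASE_MASK, membership))
    return entries


if __name__ == "__main__":
    for index, base, membership in decode_pkey_table(SMPQUERY_TEXT):
        print(f"index {index}: partition 0x{base:04x}, {membership} member")
```

For the LID 7 table this reports index 0 as a limited member of partition 0x7fff, and indices 1 and 2 as full members of partitions 0x0004 and 0x0005.
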