[openib-general] bug in libmthca/src/verbs.c?
struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe,
                               struct ibv_comp_channel *channel,
                               int comp_vector)
{
        struct mthca_create_cq cmd;

        ---> snip <---

        ret = ibv_cmd_create_cq(context, cqe - 1, channel, comp_vector,
                                &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd,
                                                           ^^^^^^^^^^
                                &resp.ibv_resp, sizeof resp);

The command size passed to ibv_cmd_create_cq is the size of the mthca command wrapper, which is larger than what is most likely expected.

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
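To make the size question concrete, here is a minimal sketch using made-up stand-in structs. `base_cmd` and `vendor_cmd` are hypothetical and are not the real libibverbs or libmthca definitions. It only illustrates that `sizeof` applied to a wrapper that embeds a generic command as its first member also counts the driver-private fields, so a callee given the wrapper size sees more bytes than the generic command alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a generic command (NOT the real ibv_create_cq). */
struct base_cmd {
    uint32_t command;
    uint16_t in_words;
    uint16_t out_words;
    uint64_t response;
};

/* Hypothetical stand-in for a driver wrapper (NOT the real mthca_create_cq). */
struct vendor_cmd {
    struct base_cmd base;   /* generic part, placed first */
    uint32_t lkey;          /* made-up driver-private fields */
    uint32_t pdn;
};

/* Bytes the wrapper adds on top of the generic command. */
static size_t vendor_extra_bytes(void)
{
    /* The generic part must sit at offset 0 for the pointer to be reusable. */
    assert(offsetof(struct vendor_cmd, base) == 0);
    return sizeof(struct vendor_cmd) - sizeof(struct base_cmd);
}
```

Whether passing the larger size is a bug depends on what the callee does with it: if it forwards the whole buffer to the kernel so the driver can read its private fields, the wrapper size is the one it needs.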
[openib-general] dropped packets
Roland,

I am trying to write a user-level application that receives multicast UD packets. I am seeing about 1-2% packet loss between the send side and the receive side, apparently independent of the packet rate at low rates. (Heavily traced sends and receives at very low rates still drop packets, even though more packets are posted on the receive side than are sent.)

I have a couple of questions:

1. Are there any race issues with ibv_get_cq_event? The example code (ud_pingpong) seems to imply that the correct sequence is

   Start:
       Call ibv_get_cq_event
       Call ibv_ack_cq_event    <- anywhere, so long as it happens before destroy_cq
       Call ibv_req_notify_cq
       Call ibv_poll_cq         <- just once, not until empty as usual, according to the example
       Goto start

   In the old days we called request notify and then polled until the poll came back empty, on a notify thread, in order to prevent a race.

2. When I post, say, 500 receive buffers and send, say, 200 send buffers, and tag the sends with a sequence number, I often see one or two missing sequence numbers on the receive side at the poll_cq interface. I have checked at the post and poll interfaces of the send side that all the correct sequence numbers went out. I am not sure how this can be possible regardless of the notification scheme used.

I would love for this to be a programming error in my code, but I can’t figure out how I can mess it up between post_send and poll_cq on the receive side. I see the same behavior between systems and with a loopback between two ports on the same HCA.

Please let me know if this rings any bells.

Bob Pearson
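The difference between polling once and polling until empty after re-arming can be sketched with a toy model. Everything below (`toy_cq`, `toy_complete`, `toy_handle_event` and friends) is hypothetical and is not the libibverbs API; it assumes only that an armed queue raises one event for the next completion and then stays silent until it is armed again. It is not offered as an explanation for the lost packets, only as an illustration of why the older "poll until empty" habit exists.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a completion queue; NOT the libibverbs API. */
struct toy_cq {
    int  pending;   /* completions waiting to be polled */
    bool armed;     /* set by a notify request, cleared when an event fires */
    int  events;    /* events delivered but not yet consumed */
};

/* A completion arrives: queue it, and raise an event only if armed. */
static void toy_complete(struct toy_cq *cq)
{
    cq->pending++;
    if (cq->armed) {
        cq->armed = false;
        cq->events++;
    }
}

/* Re-arm notification (plays the role of ibv_req_notify_cq). */
static void toy_req_notify(struct toy_cq *cq)
{
    cq->armed = true;
}

/* Poll a single completion (plays the role of a one-entry ibv_poll_cq). */
static int toy_poll_one(struct toy_cq *cq)
{
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}

/*
 * Handle one event: consume it, re-arm, then poll either once
 * (drain == false) or until the queue is empty (drain == true).
 * Returns the number of completions reaped.
 */
static int toy_handle_event(struct toy_cq *cq, bool drain)
{
    int reaped = 0;

    assert(cq->events > 0);
    cq->events--;            /* get and ack the event */
    toy_req_notify(cq);      /* re-arm before polling */
    do {
        int n = toy_poll_one(cq);
        if (n == 0)
            break;
        reaped += n;
    } while (drain);
    return reaped;
}
```

In this model, if three completions arrive back to back on an armed queue, only one event is raised. Handling it with a single poll leaves two completions queued with no event outstanding, so they wait until some later completion raises a new event; draining reaps all three.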
[openib-general] SVN problem
I’m having problems checking out the svn repository. Has anyone seen this?

svn: In directory 'gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1'
svn: Can't copy 'gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/.svn/tmp/text-bas nk/src/userspace/mpi/mvapich-gen2/www/www1/mpicc.html.tmp': No such file or directory
[openib-general] Linux real time scheduler
We have a customer who is attempting to use the Linux real-time scheduler options available in 2.6.x. Does anyone have any experience with either gen1 or gen2 running with real-time thread scheduling?
RE: [openib-general] Re: Disabling IRQ #201 message
> If you decide its a board/BIOS issue, try with my patch as a work-around.

Sorry, which patch was that?
RE: [openib-general] Re: Disabling IRQ #201 message
Michael,

I ran an experiment where I rebooted the machine, waited for an hour and then looked at /var/log/messages which I copy below. OpenSM is not running and IPoverIB is not loaded. The dropped interrupt occurs about 30 minutes after the reboot. I do not know what if anything is causing activity in the HCA. Port 1 is cabled to an unmanaged switch but stuck in INIT state since there is no running SM.

I was mistaken about the machine. We have two dual Opteron machines, but this one is an AMD engineering machine, not the NewIsys machine.

There is a message below about some device that does not call pci_enable_device() which may be interesting.

Feb 2 11:20:48 localhost syslogd 1.4.1: restart.
Feb 2 11:20:48 localhost syslog: syslogd startup succeeded
Feb 2 11:20:48 localhost kernel: klogd 1.4.1, log source = /proc/kmsg started.
Feb 2 11:20:48 localhost kernel: Bootdata ok (command line is ro root=/dev/VolGroup00/LogVol00 rhgb quiet)
Feb 2 11:20:48 localhost kernel: Linux version 2.6.11-rc1rbp ([EMAIL PROTECTED]) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #1 SMP Fri Jan 21 11:23:08 CST 2005
Feb 2 11:20:48 localhost kernel: BIOS-provided physical RAM map:
Feb 2 11:20:48 localhost kernel: BIOS-e820: - 0009fc00 (usable)
Feb 2 11:20:48 localhost kernel: BIOS-e820: 0009fc00 - 000a (reserved)
Feb 2 11:20:48 localhost kernel: BIOS-e820: 000e - 0010 (reserved)
Feb 2 11:20:48 localhost kernel: BIOS-e820: 0010 - e7ff (usable)
Feb 2 11:20:48 localhost kernel: BIOS-e820: e7ff - e7fff000 (ACPI data)
Feb 2 11:20:48 localhost kernel: BIOS-e820: e7fff000 - e800 (ACPI NVS)
Feb 2 11:20:48 localhost kernel: BIOS-e820: ff7c - 0001 (reserved)
Feb 2 11:20:48 localhost kernel: BIOS-e820: 0001 - 0004 (usable)
Feb 2 11:20:48 localhost syslog: klogd startup succeeded
Feb 2 11:20:48 localhost portmap: portmap startup succeeded
Feb 2 11:20:48 localhost rpc.statd[2680]: Version 1.0.6 Starting
Feb 2 11:20:48 localhost kernel: Scanning NUMA topology in Northbridge 24
Feb 2 11:20:48 localhost nfslock: rpc.statd startup succeeded
Feb 2 11:20:48 localhost kernel: Number of nodes 2 (10010)
Feb 2 11:20:48 localhost kernel: Node 0 already present. Skipping
Feb 2 11:20:48 localhost kernel: Node 1 already present. Skipping
Feb 2 11:20:48 localhost kernel: No NUMA configuration found
Feb 2 11:20:48 localhost kernel: Faking a node at -0004
Feb 2 11:20:48 localhost kernel: Bootmem setup node 0 -0004
Feb 2 11:20:48 localhost kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Feb 2 11:20:48 localhost kernel: Processor #0 15:5 APIC version 16
Feb 2 11:20:48 localhost kernel: ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Feb 2 11:20:48 localhost rpcidmapd: rpc.idmapd startup succeeded
Feb 2 11:20:48 localhost kernel: Processor #1 15:5 APIC version 16
Feb 2 11:20:49 localhost kernel: ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
Feb 2 11:20:49 localhost kernel: IOAPIC[0]: apic_id 2, version 17, address 0xfec0, GSI 0-23
Feb 2 11:20:49 localhost kernel: ACPI: IOAPIC (id[0x03] address[0xfebfe000] gsi_base[24])
Feb 2 11:20:49 localhost kernel: IOAPIC[1]: apic_id 3, version 17, address 0xfebfe000, GSI 24-27
Feb 2 11:20:49 localhost netfs: Mounting other filesystems: succeeded
Feb 2 11:20:49 localhost kernel: ACPI: IOAPIC (id[0x04] address[0xfebff000] gsi_base[28])
Feb 2 11:20:49 localhost rc: Starting lm_sensors: succeeded
Feb 2 11:20:49 localhost kernel: IOAPIC[2]: apic_id 4, version 17, address 0xfebff000, GSI 28-31
Feb 2 11:20:49 localhost kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Feb 2 11:20:49 localhost kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Feb 2 11:20:49 localhost autofs: automount startup succeeded
Feb 2 11:20:49 localhost kernel: Setting APIC routing to flat
Feb 2 11:20:49 localhost kernel: Using ACPI (MADT) for SMP configuration information
Feb 2 11:20:49 localhost kernel: Checking aperture...
Feb 2 11:20:49 localhost kernel: CPU 0: aperture @ 1000 size 32 MB
Feb 2 11:20:49 localhost kernel: Aperture from northbridge cpu 0 too small (32 MB)
Feb 2 11:20:49 localhost kernel: No AGP bridge found
Feb 2 11:20:49 localhost kernel: Your BIOS doesn't leave a aperture memory hole
Feb 2 11:20:49 localhost kernel: Please enable the IOMMU option in the BIOS setup
Feb 2 11:20:49 localhost kernel: This costs you 64 MB of RAM
Feb 2 11:20:49 localhost kernel: Mapping aperture over 65536 KB of RAM @ 1000
Feb 2 11:20:49 localhost kernel: Built 1 zonelists
Feb 2 11:20:49 localhost kernel: Kernel command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet console=tty0
Feb 2 11:20:49 localhost kernel: Initializing CPU#0
Feb 2 11:20:49 localhost mDNSResponder: startup succeeded
Fe
RE: [openib-general] Disabling IRQ #201 message
I'm debugging some code that is reading files in /sys/class/infiniband/. Other than that, the HCA isn't doing anything at all. The dropped interrupt occurs whether or not I am doing anything. I can reboot the machine and just let it sit there, and the message will occur after a while.

After the message, files which require interacting with the HCA, e.g. /sys/class/infiniband/mthca0/ports/1/state, become unreadable. Read calls block for a long time and finally time out with an EOF indication.

> -----Original Message-----
> From: Roland Dreier [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 01, 2005 6:31 PM
> To: Robert Pearson
> Cc: openib-general@openib.org
> Subject: Re: [openib-general] Disabling IRQ #201 message
>
>     Robert> Am running current version of openib on a 2.6.11-rc1
>     Robert> kernel on a NewIsis dual Opteron system. Every 15-20
>     Robert> minutes the following occurs. Have others seen this
>     Robert> behavior? Is the system misconfigured?
>
> Do the drivers work other than this messsage?
>
> It seems occasionally an interrupt occurs but the driver is not
> finding an events in any of the event queues. I've never seen this
> but on the other hand I've not done much testing on the
> Opteron/AMD-8131 platform.
>
> - R.
[openib-general] Disabling IRQ #201 message
I am running the current version of openib on a 2.6.11-rc1 kernel on a NewIsis dual Opteron system. Every 15-20 minutes the following occurs. Have others seen this behavior? Is the system misconfigured?

Feb 1 16:13:25 localhost kernel: irq 201: nobody cared!
Feb 1 16:13:25 localhost kernel:
Feb 1 16:13:25 localhost kernel: Call Trace: {__report_bad_irq+48} {note_interrupt+89}
Feb 1 16:13:25 localhost kernel: {__do_IRQ+281} {do_IRQ+66}
Feb 1 16:13:25 localhost kernel: {ret_from_intr+0} {default_idle+0}
Feb 1 16:13:25 localhost kernel: {default_idle+32} {cpu_idle+63}
Feb 1 16:13:25 localhost kernel:
Feb 1 16:13:25 localhost kernel: handlers:
Feb 1 16:13:25 localhost kernel: [] (mthca_interrupt+0x0/0xa0 [ib_mthca])
Feb 1 16:13:25 localhost kernel: Disabling IRQ #201

Bob Pearson
[openib-general] VPD capability address flag
On a gen2 system, I am using /proc/bus/pci/nn/mm.m to access the PCI configuration space of an HCA in order to read the adapter's VPD. After a new address is written, the flag bit in the VPD address register fails to set, even after a reasonable delay, and the VPD data are invalid. Other adapters (e.g. Broadcom NICs) that also have a VPD do set the flag bit and return valid VPD data.

Any ideas?
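For reference, the read sequence described above can be sketched as follows. This is a minimal sketch, not a tested tool: it assumes the standard PCI VPD capability layout (capability ID 0x03, a 16-bit address register at offset 2 whose top bit is the flag, a 32-bit data register at offset 4), a config-space file opened read/write by root, and an arbitrary retry budget of 100 polls at 1 ms. The function names `find_vpd_cap` and `vpd_read_dword` are made up for illustration.

```c
#define _XOPEN_SOURCE 500
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

#define PCI_STATUS          0x06  /* status register */
#define PCI_STATUS_CAP_LIST 0x10  /* capability list present */
#define PCI_CAP_PTR         0x34  /* offset of the first capability */
#define PCI_CAP_ID_VPD      0x03  /* Vital Product Data capability ID */
#define VPD_ADDR            2     /* 16-bit address register, from cap start */
#define VPD_DATA            4     /* 32-bit data register, from cap start */

/* Walk the capability list in a copy of the first 256 config bytes.
 * Returns the offset of the VPD capability, or 0 if there is none. */
unsigned find_vpd_cap(const uint8_t cfg[256])
{
    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;
    unsigned pos = cfg[PCI_CAP_PTR] & 0xfc;
    for (int hops = 0; pos >= 0x40 && hops < 48; hops++) {
        if (cfg[pos] == PCI_CAP_ID_VPD)
            return pos;
        pos = cfg[pos + 1] & 0xfc;   /* next-capability pointer */
    }
    return 0;
}

/* Read one 32-bit VPD word through an open config-space fd.
 * Returns 0 on success, -1 on I/O error or if the flag never sets. */
int vpd_read_dword(int fd, unsigned cap, uint16_t vpd_addr, uint32_t *out)
{
    /* Little-endian address with the flag bit (bit 15) cleared: starts a read. */
    uint8_t a[2] = { vpd_addr & 0xff, (vpd_addr >> 8) & 0x7f };
    uint8_t d[4];

    assert((vpd_addr & 3) == 0);     /* VPD addresses are dword aligned */
    if (pwrite(fd, a, 2, cap + VPD_ADDR) != 2)
        return -1;
    for (int tries = 0; tries < 100; tries++) {   /* assumed retry budget */
        if (pread(fd, a, 2, cap + VPD_ADDR) != 2)
            return -1;
        if (a[1] & 0x80) {           /* flag set: data register is valid */
            if (pread(fd, d, 4, cap + VPD_DATA) != 4)
                return -1;
            *out = d[0] | d[1] << 8 | d[2] << 16 | (uint32_t)d[3] << 24;
            return 0;
        }
        usleep(1000);
    }
    return -1;
}
```

Running the same routine against an adapter that is known to work (such as the NICs mentioned above) and against the HCA would help separate an access-sequence problem from a device or firmware one.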