[openib-general] bug in libmthca/src/verbs.c?

2006-08-25 Thread Robert Pearson








struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe,
                               struct ibv_comp_channel *channel,
                               int comp_vector)
{
    struct mthca_create_cq  cmd;

---> snip <---

    ret = ibv_cmd_create_cq(context, cqe - 1, channel, comp_vector,
                            &cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd,
                                                       ^^^^^^^^^^
                            &resp.ibv_resp, sizeof resp);

 

The command size passed to ibv_cmd_create_cq is the size of the mthca
command wrapper (sizeof cmd), which is larger than what is most likely
expected.
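
To make the size difference concrete, here is a minimal sketch. The two
struct layouts are hypothetical stand-ins, not the real libibverbs or
libmthca definitions; they only model a generic command embedded as the
first member of a larger driver-specific wrapper, so that "sizeof cmd" and
"sizeof cmd.ibv_cmd" differ. Whether the larger value is actually wrong
depends on what ibv_cmd_create_cq expects for that argument, which this
sketch does not settle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the generic command (NOT the real struct ibv_create_cq). */
struct generic_create_cq {
    uint32_t command;
    uint16_t in_words;
    uint16_t out_words;
    uint64_t response;
    uint64_t user_handle;
    uint32_t cqe;
    uint32_t comp_vector;
};

/* Hypothetical stand-in for the driver wrapper (NOT the real struct mthca_create_cq). */
struct driver_create_cq {
    struct generic_create_cq ibv_cmd;  /* generic part comes first */
    uint32_t lkey;                     /* illustrative driver-private fields */
    uint32_t pdn;
    uint64_t arm_db_page;
};

/* Size of the embedded generic command, i.e. sizeof cmd.ibv_cmd. */
size_t generic_size(void) { return sizeof(struct generic_create_cq); }

/* Size of the whole wrapper, i.e. sizeof cmd in the snippet above. */
size_t wrapper_size(void) { return sizeof(struct driver_create_cq); }

/* Extra bytes that the wrapper adds on top of the generic command. */
size_t driver_tail_size(void)
{
    assert(wrapper_size() >= generic_size());
    return wrapper_size() - generic_size();
}
```

Printing generic_size() and wrapper_size() side by side for the real
structures would show exactly how many extra bytes are being passed.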






___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

[openib-general] dropped packets

2006-08-24 Thread Robert Pearson








Roland,

 

I am trying to write a user-level application that receives multicast UD
packets. I am seeing about 1-2% packet loss between the send side and the
receive side, apparently independent of the packet rate at low rates.
(Heavily traced sends and receives at very low rates still drop packets
even though more packets are posted on the receive side than are sent.)
I have a couple of questions:

 

1. Are there any race issues with ibv_get_cq_event? The example code
(ud_pingpong) seems to imply that the correct sequence is

Start:
    Call ibv_get_cq_event
    Call ibv_ack_cq_event   <- anywhere so long as it happens before destroy_cq
    Call ibv_req_notify_cq
    Call ibv_poll_cq        <- just once, not as usual until empty, according to the example
    Goto start

In the old days we called request notify and poll until poll was empty on
a notify thread in order to prevent a race.

 

2. When I post, say, 500 receive buffers and send, say, 200 send buffers,
and tag the sends with a sequence number, I often see one or two missing
sequence numbers at the receive side at the poll_cq interface. I have
checked at the post and poll interfaces of the send side that all the
correct sequence numbers went out. I am not sure how this can be possible
regardless of the notification scheme used.
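
The ordering concern in question 1 can be illustrated with a toy model.
This is a sketch only: struct mock_cq and the mock_* functions are
hypothetical stand-ins for the completion queue, ibv_req_notify_cq and
ibv_poll_cq, not the libibverbs API. The model assumes that an event fires
only for a completion that arrives while the CQ is armed, so completions
that were already queued when the CQ is re-armed generate no new event.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mock of a completion queue; NOT the libibverbs API. */
struct mock_cq {
    int  pending;  /* completions waiting to be polled */
    bool armed;    /* true after a notify request, until an event fires */
    int  events;   /* events delivered to the completion channel */
};

/* Stand-in for ibv_req_notify_cq: arm the CQ for the next completion. */
void mock_req_notify(struct mock_cq *cq) { cq->armed = true; }

/* Stand-in for the HCA adding a completion; fires an event only if armed. */
void mock_complete(struct mock_cq *cq)
{
    cq->pending++;
    if (cq->armed) {
        cq->armed = false;
        cq->events++;
    }
}

/* Stand-in for ibv_poll_cq with one entry: returns 1 if a completion was taken. */
int mock_poll_one(struct mock_cq *cq)
{
    assert(cq->pending >= 0);
    if (cq->pending == 0)
        return 0;
    cq->pending--;
    return 1;
}

/* Handle one event the "old days" way: re-arm, then poll until empty. */
int handle_event_drain(struct mock_cq *cq)
{
    int n = 0;
    mock_req_notify(cq);
    while (mock_poll_one(cq))
        n++;
    return n;
}

/* Handle one event by re-arming and polling exactly once. */
int handle_event_poll_once(struct mock_cq *cq)
{
    mock_req_notify(cq);
    return mock_poll_one(cq);
}
```

In this model, if two completions arrive before the handler runs, only one
event fires. handle_event_poll_once then leaves one completion queued with
no event outstanding until another completion happens to arrive, while
handle_event_drain empties the queue. That delays a completion rather than
losing it, so by itself it would not explain missing sequence numbers.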

 

I would love for this to be a programming error in my code but I can’t
figure out how I can mess it up between post_send and poll_cq on the
receive side. I see the same behavior between systems and with a loopback
between two ports on the same HCA.
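
The sequence-number check described in question 2 can be sketched as
follows. This is an illustration under stated assumptions, not the code
from the application above: find_missing_seq is a hypothetical helper, the
sequence numbers are assumed to run from 0 to sent - 1, and a seen-map is
used so that out-of-order arrival is not reported as loss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Count the sequence numbers in [0, sent) that never appear in received[].
 * The first max_missing of them are stored in missing[].
 * Returns the total number missing, or (size_t)-1 if allocation fails.
 */
size_t find_missing_seq(const uint32_t *received, size_t n_received,
                        uint32_t sent, uint32_t *missing, size_t max_missing)
{
    unsigned char *seen;
    size_t count = 0;

    assert(received != NULL || n_received == 0);

    seen = calloc(sent ? sent : 1, 1);
    if (!seen)
        return (size_t)-1;

    /* Mark every sequence number that showed up at the receive side. */
    for (size_t i = 0; i < n_received; i++)
        if (received[i] < sent)
            seen[received[i]] = 1;

    /* Anything still unmarked was sent but never polled from the CQ. */
    for (uint32_t s = 0; s < sent; s++) {
        if (!seen[s]) {
            if (count < max_missing)
                missing[count] = s;
            count++;
        }
    }

    free(seen);
    return count;
}
```

Feeding it the sequence numbers taken from the receive-side poll_cq results
after each run gives the exact gaps, which can then be compared against the
send-side trace.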

 

Please let me know if this rings any bells.

Bob Pearson

 







[openib-general] SVN problem

2006-05-27 Thread Robert Pearson








I’m having problems checking out the svn repository. Has anyone seen
this?

 

svn: In directory 'gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1'
svn: Can't copy 'gen2/trunk/src/userspace/mpi/mvapich-gen2/www/www1/.svn/tmp/text-bas
nk/src/userspace/mpi/mvapich-gen2/www/www1/mpicc.html.tmp': No such file or directory







[openib-general] Linux real time scheduler

2006-03-10 Thread Robert Pearson








We have a customer who is attempting to use the Linux real-time scheduler
options available in 2.6.x. Does anyone have any experience with either
gen1 or gen2 running with real-time thread scheduling?







RE: [openib-general] Re: Disabling IRQ #201 message

2005-02-02 Thread Robert Pearson


> If you decide its a board/BIOS issue, try with my patch as a
> work-around.

Sorry, which patch was that?






RE: [openib-general] Re: Disabling IRQ #201 message

2005-02-02 Thread Robert Pearson
Michael,

I ran an experiment where I rebooted the machine, waited for an hour, and
then looked at /var/log/messages, which I copy below. OpenSM is not
running and IPoverIB is not loaded. The dropped interrupt occurs about
30 minutes after the reboot. I do not know what, if anything, is causing
activity in the HCA. Port 1 is cabled to an unmanaged switch but stuck
in INIT state since there is no running SM.

I was mistaken about the machine. We have two dual Opteron machines, but
this one is an AMD engineering machine, not the NewIsys machine.

There is a message below about some device that does not call
pci_enable_device() which may be interesting. 

Feb  2 11:20:48 localhost syslogd 1.4.1: restart.
Feb  2 11:20:48 localhost syslog: syslogd startup succeeded
Feb  2 11:20:48 localhost kernel: klogd 1.4.1, log source = /proc/kmsg
started.
Feb  2 11:20:48 localhost kernel: Bootdata ok (command line is ro
root=/dev/VolGroup00/LogVol00 rhgb quiet)
Feb  2 11:20:48 localhost kernel: Linux version 2.6.11-rc1rbp
([EMAIL PROTECTED]) (gcc version 3.4.2 20041017 (Red Hat
3.4.2-6.fc3)) #1 SMP Fri Jan 21 11:23:08 CST 2005
Feb  2 11:20:48 localhost kernel: BIOS-provided physical RAM map:
Feb  2 11:20:48 localhost kernel:  BIOS-e820:  -
0009fc00 (usable)
Feb  2 11:20:48 localhost kernel:  BIOS-e820: 0009fc00 -
000a (reserved)
Feb  2 11:20:48 localhost kernel:  BIOS-e820: 000e -
0010 (reserved)
Feb  2 11:20:48 localhost kernel:  BIOS-e820: 0010 -
e7ff (usable)
Feb  2 11:20:48 localhost kernel:  BIOS-e820: e7ff -
e7fff000 (ACPI data)
Feb  2 11:20:48 localhost kernel:  BIOS-e820: e7fff000 -
e800 (ACPI NVS)
Feb  2 11:20:48 localhost kernel:  BIOS-e820: ff7c -
0001 (reserved)
Feb  2 11:20:48 localhost kernel:  BIOS-e820: 0001 -
0004 (usable)
Feb  2 11:20:48 localhost syslog: klogd startup succeeded
Feb  2 11:20:48 localhost portmap: portmap startup succeeded
Feb  2 11:20:48 localhost rpc.statd[2680]: Version 1.0.6 Starting
Feb  2 11:20:48 localhost kernel: Scanning NUMA topology in Northbridge
24
Feb  2 11:20:48 localhost nfslock: rpc.statd startup succeeded
Feb  2 11:20:48 localhost kernel: Number of nodes 2 (10010)
Feb  2 11:20:48 localhost kernel: Node 0 already present. Skipping
Feb  2 11:20:48 localhost kernel: Node 1 already present. Skipping
Feb  2 11:20:48 localhost kernel: No NUMA configuration found
Feb  2 11:20:48 localhost kernel: Faking a node at
-0004
Feb  2 11:20:48 localhost kernel: Bootmem setup node 0
-0004
Feb  2 11:20:48 localhost kernel: ACPI: LAPIC (acpi_id[0x01]
lapic_id[0x00] enabled)
Feb  2 11:20:48 localhost kernel: Processor #0 15:5 APIC version 16
Feb  2 11:20:48 localhost kernel: ACPI: LAPIC (acpi_id[0x02]
lapic_id[0x01] enabled)
Feb  2 11:20:48 localhost rpcidmapd: rpc.idmapd startup succeeded
Feb  2 11:20:48 localhost kernel: Processor #1 15:5 APIC version 16
Feb  2 11:20:49 localhost kernel: ACPI: IOAPIC (id[0x02]
address[0xfec0] gsi_base[0])
Feb  2 11:20:49 localhost kernel: IOAPIC[0]: apic_id 2, version 17,
address 0xfec0, GSI 0-23
Feb  2 11:20:49 localhost kernel: ACPI: IOAPIC (id[0x03]
address[0xfebfe000] gsi_base[24])
Feb  2 11:20:49 localhost kernel: IOAPIC[1]: apic_id 3, version 17,
address 0xfebfe000, GSI 24-27
Feb  2 11:20:49 localhost netfs: Mounting other filesystems:  succeeded
Feb  2 11:20:49 localhost kernel: ACPI: IOAPIC (id[0x04]
address[0xfebff000] gsi_base[28])
Feb  2 11:20:49 localhost rc: Starting lm_sensors:  succeeded
Feb  2 11:20:49 localhost kernel: IOAPIC[2]: apic_id 4, version 17,
address 0xfebff000, GSI 28-31
Feb  2 11:20:49 localhost kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0
global_irq 2 dfl dfl)
Feb  2 11:20:49 localhost kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0
global_irq 2 dfl dfl)
Feb  2 11:20:49 localhost autofs: automount startup succeeded
Feb  2 11:20:49 localhost kernel: Setting APIC routing to flat
Feb  2 11:20:49 localhost kernel: Using ACPI (MADT) for SMP
configuration information
Feb  2 11:20:49 localhost kernel: Checking aperture...
Feb  2 11:20:49 localhost kernel: CPU 0: aperture @ 1000 size 32 MB
Feb  2 11:20:49 localhost kernel: Aperture from northbridge cpu 0 too
small (32 MB)
Feb  2 11:20:49 localhost kernel: No AGP bridge found
Feb  2 11:20:49 localhost kernel: Your BIOS doesn't leave a aperture
memory hole
Feb  2 11:20:49 localhost kernel: Please enable the IOMMU option in the
BIOS setup
Feb  2 11:20:49 localhost kernel: This costs you 64 MB of RAM
Feb  2 11:20:49 localhost kernel: Mapping aperture over 65536 KB of RAM
@ 1000
Feb  2 11:20:49 localhost kernel: Built 1 zonelists
Feb  2 11:20:49 localhost kernel: Kernel command line: ro
root=/dev/VolGroup00/LogVol00 rhgb quiet console=tty0
Feb  2 11:20:49 localhost kernel: Initializing CPU#0
Feb  2 11:20:49 localhost mDNSResponder:  startup succeeded
Fe

RE: [openib-general] Disabling IRQ #201 message

2005-02-02 Thread Robert Pearson
I'm debugging some code that is reading files in /sys/class/infiniband/.
Other than that, the HCA isn't doing anything at all. The dropped
interrupt occurs whether or not I am doing anything. I can reboot the
machine and just let it sit there and the message will occur after a
while. After the message, files which require interacting with the HCA,
e.g. /sys/class/infiniband/mthca0/ports/1/state, become unreadable. Read
calls block for a long time and finally time out with an EOF indication.

> -----Original Message-----
> From: Roland Dreier [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 01, 2005 6:31 PM
> To: Robert Pearson
> Cc: openib-general@openib.org
> Subject: Re: [openib-general] Disabling IRQ #201 message
> 
> Robert> Am running current version of openib on a 2.6.11-rc1
> Robert> kernel on a NewIsis dual Opteron system. Every 15-20
> Robert> minutes the following occurs. Have others seen this
> Robert> behavior? Is the system misconfigured?
> 
> Do the drivers work other than this messsage?
> 
> It seems occasionally an interrupt occurs but the driver is not
> finding an events in any of the event queues.  I've never seen this
> but on the other hand I've not done much testing on the
> Opteron/AMD-8131 platform.
> 
>  - R.






[openib-general] Disabling IRQ #201 message

2005-02-01 Thread Robert Pearson








Am running current version of openib on a 2.6.11-rc1
kernel on a NewIsis dual Opteron system. Every 15-20 minutes the following
occurs. Have others seen this behavior? Is the system misconfigured?

 

Feb  1 16:13:25 localhost kernel: irq 201: nobody cared!
Feb  1 16:13:25 localhost kernel:
Feb  1 16:13:25 localhost kernel: Call Trace: {__report_bad_irq+48} {note_interrupt+89}
Feb  1 16:13:25 localhost kernel:    {__do_IRQ+281} {do_IRQ+66}
Feb  1 16:13:25 localhost kernel:    {ret_from_intr+0} {default_idle+0}
Feb  1 16:13:25 localhost kernel:    {default_idle+32} {cpu_idle+63}
Feb  1 16:13:25 localhost kernel:
Feb  1 16:13:25 localhost kernel: handlers:
Feb  1 16:13:25 localhost kernel: [] (mthca_interrupt+0x0/0xa0 [ib_mthca])
Feb  1 16:13:25 localhost kernel: Disabling IRQ #201

 

Bob Pearson







[openib-general] VPD capability address flag

2005-01-28 Thread Robert Pearson








On a gen2 system, when using /proc/bus/pci/nn/mm.m to access the PCI
configuration space of an HCA in an attempt to read the VPD for the
adapter, the flag bit in the VPD address fails to set after a reasonable
delay after writing a new address and the VPD data are invalid.

Other adapters (e.g. Broadcom NICs) that also have a VPD do set the flag
bit and return valid VPD data. Any ideas?
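
For reference, the access sequence in question can be sketched as below.
This is a minimal illustration, not the code that was run: the helper names
are hypothetical, the register layout is the standard PCI VPD capability as
I understand it (capability ID 0x03, a 16-bit VPD Address register at
offset 2 whose top bit is the flag, and a 32-bit VPD Data register at
offset 4), and the functions work on an in-memory copy of the first 256
bytes of config space rather than on /proc/bus/pci directly.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_STATUS          0x06    /* low byte of the Status register */
#define PCI_STATUS_CAP_LIST 0x10    /* "capabilities list present" bit */
#define PCI_CAP_PTR         0x34    /* pointer to the first capability */
#define PCI_CAP_ID_VPD      0x03    /* Vital Product Data capability ID */
#define VPD_ADDR_OFF        2       /* VPD Address, relative to the capability */
#define VPD_DATA_OFF        4       /* VPD Data, relative to the capability */
#define VPD_FLAG            0x8000  /* flag bit in the VPD Address register */

/* Walk the capability list in a config-space image; return the offset of cap_id, or 0. */
uint8_t find_capability(const uint8_t cfg[256], uint8_t cap_id)
{
    uint8_t pos;
    int guard;

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;

    pos = cfg[PCI_CAP_PTR] & 0xFC;
    /* The guard bounds the walk in case the list is corrupt or circular. */
    for (guard = 0; pos >= 0x40 && guard < 48; guard++) {
        if (cfg[pos] == cap_id)
            return pos;
        pos = cfg[pos + 1] & 0xFC;
    }
    return 0;
}

/* Value to write to the VPD Address register to start a read: flag clear. */
uint16_t vpd_read_request(uint16_t addr)
{
    assert((addr & 3) == 0);  /* reads are expected to be DWORD aligned */
    return addr & 0x7FFF;
}

/* A read has completed once the device has set the flag bit. */
int vpd_read_done(uint16_t addr_reg)
{
    return (addr_reg & VPD_FLAG) != 0;
}
```

A reader would write vpd_read_request(addr) at the capability offset plus
VPD_ADDR_OFF, re-read that register until vpd_read_done() is true or a
timeout expires, and only then read the four bytes at VPD_DATA_OFF. The
behavior described above corresponds to vpd_read_done() never becoming true
on the HCA.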





