Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Roland Dreier
Thanks, I committed just the packet->msg => packet->msg->mad fix as
one changeset, and the rest of this patch (along with some
kmalloc()+memset() => kzalloc() cleanups now that 2.6.14 is out) as a
second changeset.
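As a userspace analogy only, the kmalloc()+memset() => kzalloc() cleanup is the same idea as replacing malloc()+memset() with calloc(). The sketch below builds a zero-filled, header-only "timeout" record the way the send handler in Sean's patch does. struct mock_packet and MOCK_MAD_HDR_LEN (24 bytes, the common MAD header size) are hypothetical stand-ins, not the real ib_umad structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the 24-byte common MAD header length. */
#define MOCK_MAD_HDR_LEN 24

/* Hypothetical, simplified version of struct ib_umad_packet. */
struct mock_packet {
	int length;           /* bytes of MAD data that follow */
	int id;               /* agent id */
	int status;           /* 0 or an errno value */
	unsigned char data[]; /* MAD data (flexible array member) */
};

/*
 * Build a header-only "timeout" packet from the packet that timed out.
 * sent->data must hold at least MOCK_MAD_HDR_LEN bytes. calloc()
 * returns zeroed memory, so no separate memset() is needed.
 */
static struct mock_packet *make_timeout(const struct mock_packet *sent)
{
	struct mock_packet *timeout =
		calloc(1, sizeof *timeout + MOCK_MAD_HDR_LEN);

	if (!timeout)
		return NULL;

	timeout->length = MOCK_MAD_HDR_LEN;
	timeout->id = sent->id;
	timeout->status = ETIMEDOUT;
	memcpy(timeout->data, sent->data, MOCK_MAD_HDR_LEN);
	return timeout;
}
```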

 - R.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] OpenSM crash with today's trunk

2005-10-27 Thread Roland Dreier
I believe that this is in r3889.

 - R.


[openib-general] OpenSM crash with today's trunk

2005-10-27 Thread Aniruddha Bohra
Hello,
I updated the OpenIB stack today and I get the following error
on starting OpenSM. The verbose log is available at
http://www.cs.rutgers.edu/~bohra/osm-v.log


# opensm -V -d10 -r
-
OpenSM Rev:openib-1.1.0
Command Line Arguments:
 Big V selected
 d level = 0xa
 Reassign LIDs
 Log File: /var/log/osm.log
-
OpenSM Rev:openib-1.1.0

Using default guid 0x2c901081e7471

Error from osm_opensm_bind (0x2A)
Exiting SM

Segmentation fault


Please let me know what I can do to debug this.

Thanks
Aniruddha




Re: [openib-general] SRQ limit reached async event.

2005-10-27 Thread Roland Dreier
Galen> Does anyone know if openib supports the SRQ limit
Galen> asynchronous event?

Yes, openib verbs and the mthca driver support this.  However, with
current firmware, you will only receive this event for mem-free HCAs
(firmware versions 5.x and 1.x).  For mem-ful HCAs (firmware versions
3.x and 4.x), you will need to use as-yet-unreleased firmware for the
event to be generated.
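A minimal plain-C model of the "arm once, fire once" behaviour usually described for this event, assuming the IB semantics in which the event fires when the number of posted receive WQEs drops below the armed limit, after which the limit is disarmed until software sets it again. All names are hypothetical; this is a simulation, not mthca or verbs code. With libibverbs the arming step would be ibv_modify_srq() with IBV_SRQ_LIMIT, and the event arrives as IBV_EVENT_SRQ_LIMIT_REACHED.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an SRQ's receive-queue depth and limit. */
struct srq_model {
	unsigned int posted;  /* receive WQEs currently posted */
	unsigned int limit;   /* 0 means disarmed */
};

/* Arm the limit (the role ibv_modify_srq() with IBV_SRQ_LIMIT plays). */
static void srq_arm(struct srq_model *srq, unsigned int limit)
{
	srq->limit = limit;
}

/*
 * Consume one posted WQE (an incoming message used a receive buffer).
 * Returns true when this consumption should raise the "SRQ limit
 * reached" async event; the limit is then disarmed (set to 0).
 */
static bool srq_consume(struct srq_model *srq)
{
	if (srq->posted > 0)
		srq->posted--;

	if (srq->limit != 0 && srq->posted < srq->limit) {
		srq->limit = 0;
		return true;
	}
	return false;
}
```

Because the event is one-shot in this model, a consumer that wants repeated notifications re-arms the limit after refilling the SRQ.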

 - R.


RE: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Sean Hefty
>OK, I think I found it.  The problem was that ib_umad_write() wrote
>through packet->msg in a few places where it should have used
>packet->msg->mad, and therefore corrupted the address of the buffer.

Yep - that appears to be the issue.

I've attached another patch that includes your fixes, plus adds some
additional code cleanup.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>


Index: user_mad.c
===
--- user_mad.c  (revision 3861)
+++ user_mad.c  (working copy)
@@ -99,7 +99,6 @@
struct ib_mad_send_buf *msg;
struct list_head   list;
intlength;
-   DECLARE_PCI_UNMAP_ADDR(mapping)
struct ib_user_mad mad;
 };
 
@@ -138,24 +137,23 @@
 struct ib_mad_send_wc *send_wc)
 {
struct ib_umad_file *file = agent->context;
-   struct ib_umad_packet *timeout, *packet = send_wc->send_buf->context[0];
+   struct ib_umad_packet *timeout;
+   struct ib_umad_packet *packet = send_wc->send_buf->context[0];
 
ib_destroy_ah(packet->msg->ah);
ib_free_send_mad(packet->msg);
 
if (send_wc->status == IB_WC_RESP_TIMEOUT_ERR) {
-   timeout = kmalloc(sizeof *timeout + sizeof (struct ib_mad_hdr),
- GFP_KERNEL);
+   timeout = kmalloc(sizeof *timeout + IB_MGMT_MAD_HDR, GFP_KERNEL);
if (!timeout)
goto out;
 
-   memset(timeout, 0, sizeof *timeout + sizeof (struct ib_mad_hdr));
+   memset(timeout, 0, sizeof *timeout + IB_MGMT_MAD_HDR);
 
-   timeout->length = sizeof (struct ib_mad_hdr);
+   timeout->length = IB_MGMT_MAD_HDR;
timeout->mad.hdr.id = packet->mad.hdr.id;
timeout->mad.hdr.status = ETIMEDOUT;
-   memcpy(timeout->mad.data, packet->mad.data,
-  sizeof (struct ib_mad_hdr));
+   memcpy(timeout->mad.data, packet->mad.data, IB_MGMT_MAD_HDR);
 
if (!queue_packet(file, agent, timeout))
return;
@@ -245,7 +243,7 @@
else
ret = -ENOSPC;
} else if (copy_to_user(buf, &packet->mad,
- packet->length + sizeof (struct ib_user_mad)))
+   packet->length + sizeof (struct ib_user_mad)))
ret = -EFAULT;
else
ret = packet->length + sizeof (struct ib_user_mad);
@@ -270,22 +268,19 @@
struct ib_rmpp_mad *rmpp_mad;
u8 method;
__be64 *tid;
-   int ret, length, hdr_len, rmpp_hdr_size;
+   int ret, length, hdr_len, copy_offset;
int rmpp_active = 0;
 
if (count < sizeof (struct ib_user_mad))
return -EINVAL;
 
length = count - sizeof (struct ib_user_mad);
-   packet = kmalloc(sizeof *packet + sizeof(struct ib_mad_hdr) +
-sizeof (struct ib_rmpp_hdr), GFP_KERNEL);
+   packet = kmalloc(sizeof *packet + IB_MGMT_RMPP_HDR, GFP_KERNEL);
if (!packet)
return -ENOMEM;
 
if (copy_from_user(&packet->mad, buf,
-   sizeof (struct ib_user_mad) +
-   sizeof (struct ib_mad_hdr) +
-   sizeof (struct ib_rmpp_hdr))) {
+   sizeof (struct ib_user_mad) + IB_MGMT_RMPP_HDR)) {
ret = -EFAULT;
goto err;
}
@@ -296,8 +291,6 @@
goto err;
}
 
-   packet->length = length;
-
down_read(&file->agent_mutex);
 
agent = file->agent[packet->mad.hdr.id];
@@ -344,12 +337,10 @@
goto err_ah;
}
rmpp_active = 1;
+   copy_offset = IB_MGMT_RMPP_HDR;
} else {
-   if (length > sizeof (struct ib_mad)) {
-   ret = -EINVAL;
-   goto err_ah;
-   }
hdr_len = IB_MGMT_MAD_HDR;
+   copy_offset = IB_MGMT_MAD_HDR;
}
 
packet->msg = ib_create_send_mad(agent,
@@ -363,32 +354,18 @@
}
 
packet->msg->ah = ah;
-   packet->msg->timeout_ms  = packet->mad.hdr.timeout_ms;
+   packet->msg->timeout_ms = packet->mad.hdr.timeout_ms;
packet->msg->retries = packet->mad.hdr.retries;
packet->msg->context[0] = packet;
 
-   if (!rmpp_active) {
-   /* Copy message from user into send buffer */
-   if (copy_from_user(packet->msg->mad,
-  buf + sizeof (struct ib_user_mad), length)) {
-   ret = -EFAULT;
-   goto err_msg;
-   }
-   } else {
-   rmpp_hdr_size = sizeof (struct ib_mad_hdr) +
-   sizeof (struct ib_rmpp_hdr);
-
-   /* Only copy MAD headers (RMPP header 

[openib-general] SRQ limit reached async event.

2005-10-27 Thread Galen M. Shipman

Hello,

Does anyone know if openib supports the SRQ limit asynchronous event?
I am working with Mellanox verbs right now, and they don't seem to
support this. I say this because I have to set the srq_limit
attribute via VAPI_modify_srq in order to get the event;
unfortunately, when I call VAPI_modify_srq I get:

  error in VAPI_modify_srq: Not implemented


Any insight is appreciated.

Thanks,

Galen



[openib-general] Boot over IB - support in Bproc status?

2005-10-27 Thread Hb Chen

Hi,
Can anyone point out the current status of Boot over IB support in Bproc?
Also, what other solutions are there for "mass boot over IB" now (OpenSM,
SRP...)?

Thanks.

HB
LANL
CCN-9



Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Roland Dreier
BTW, Jay, can you confirm that this patch fixes your problem too?

Thanks,
  Roland


Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Roland Dreier
OK, I think I found it.  The problem was that ib_umad_write() wrote
through packet->msg in a few places where it should have used
packet->msg->mad, and therefore corrupted the address of the buffer.

I'll commit the patch below in a little while, which fixes this issue
and the packet->length race that Sean spotted, unless someone sees a
problem with it:

--- infiniband/core/user_mad.c  (revision 3867)
+++ infiniband/core/user_mad.c  (working copy)
@@ -297,8 +297,6 @@ static ssize_t ib_umad_write(struct file
goto err;
}
 
-   packet->length = length;
-
down_read(&file->agent_mutex);
 
agent = file->agent[packet->mad.hdr.id];
@@ -398,12 +396,12 @@ static ssize_t ib_umad_write(struct file
 * transaction ID matches the agent being used to send the
 * MAD.
 */
-   method = ((struct ib_mad_hdr *) packet->msg)->method;
+   method = ((struct ib_mad_hdr *) packet->msg->mad)->method;
 
if (!(method & IB_MGMT_METHOD_RESP)   &&
method != IB_MGMT_METHOD_TRAP_REPRESS &&
method != IB_MGMT_METHOD_SEND) {
-   tid = &((struct ib_mad_hdr *) packet->msg)->tid;
+   tid = &((struct ib_mad_hdr *) packet->msg->mad)->tid;
*tid = cpu_to_be64(((u64) agent->hi_tid) << 32 |
   (be64_to_cpup(tid) & 0x));
}
@@ -414,7 +412,7 @@ static ssize_t ib_umad_write(struct file
 
up_read(&file->agent_mutex);
 
-   return sizeof (struct ib_user_mad_hdr) + packet->length;
+   return count;
 
 err_msg:
ib_free_send_mad(packet->msg);
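For illustration only, here is the access pattern the fix restores, as a standalone sketch with hypothetical, simplified structs (the real ib_mad_hdr and ib_mad_send_buf have more fields). The MAD header lives in the buffer that msg->mad points to; casting msg itself, as the buggy lines did, makes the TID write land on the send buffer's own fields, including the pointer to the MAD. Byte-order conversion is left out of this host-order sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified layouts; the real structs have more fields. */
struct mock_mad_hdr {
	uint8_t  base_version;
	uint8_t  mgmt_class;
	uint8_t  class_version;
	uint8_t  method;
	uint16_t status;
	uint16_t class_specific;
	uint64_t tid;
};

struct mock_send_buf {
	void *mad;   /* pointer to the MAD buffer */
	void *ah;    /* address handle */
};

/* Correct: read the header through msg->mad, not through msg. */
static uint8_t get_method(const struct mock_send_buf *msg)
{
	return ((const struct mock_mad_hdr *) msg->mad)->method;
}

/*
 * Put the agent's hi_tid in the upper 32 bits of the TID and keep the
 * caller's lower 32 bits. The kernel code also converts with
 * cpu_to_be64()/be64_to_cpup(); that is omitted here.
 */
static void set_hi_tid(struct mock_send_buf *msg, uint32_t hi_tid)
{
	uint64_t *tid = &((struct mock_mad_hdr *) msg->mad)->tid;

	*tid = ((uint64_t) hi_tid << 32) | (*tid & 0xffffffffULL);
}
```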


[openib-general] ping over IPoIB does not work between 2 cards on the same host

2005-10-27 Thread Kanevsky, Arkady
I have a host with 2 HCAs (dual port each, but I only connected one port
per machine) connected to a switch.

With IPoIB configured, pinging a card's own IP address works.
I can ping other machines whose HCAs are configured with IPoIB
fine.
And I can ping both local IP addresses from remote machine(s).

Details:

ifconfig ib1 192.168.0.1 netmask 255.255.0.0
ifconfig ib3 192.168.0.3 netmask 255.255.0.0

On remote machine:

ifconfig ib0 192.168.1.0 netmask 255.255.0.0

Locally:

ping -I ib3 192.168.0.3
PING 192.168.0.3 (192.168.97.3) from 192.168.0.3 ib3: 56(84) bytes of data.
64 bytes from 192.168.0.3: icmp_seq=0 ttl=64 time=0.028 ms

ping -I ib1 192.168.0.1
PING 192.168.0.1 (192.168.97.1) from 192.168.0.1 ib1: 56(84) bytes of data.
64 bytes from 192.168.0.1: icmp_seq=0 ttl=64 time=0.028 ms

# ping -I ib3 192.168.1.0
PING 192.168.1.0 (192.168.1.0) from 192.168.0.3 ib3: 56(84) bytes of data.
64 bytes from 192.168.1.0: icmp_seq=0 ttl=64 time=1.81 ms

From remote host:

# ping -I ib0 192.168.0.1
PING 192.168.0.1 (192.168.0.1) from 192.168.1.0 ib0: 56(84) bytes of data.
64 bytes from 192.168.0.1: icmp_seq=0 ttl=64 time=0.086 ms

# ping -I ib0 192.168.0.3
PING 192.168.0.3 (192.168.0.3) from 192.168.1.0 ib0: 56(84) bytes of data.
64 bytes from 192.168.0.1: icmp_seq=0 ttl=64 time=0.086 ms

Locally between 2 cards:

# ping -I ib3 192.168.0.1
PING 192.168.0.1 (192.168.0.1) from 192.168.0.3 ib3: 56(84) bytes of data.
From 192.168.0.3 icmp_seq=1 Destination Host Unreachable
From 192.168.0.3 icmp_seq=2 Destination Host Unreachable
From 192.168.0.3 icmp_seq=3 Destination Host Unreachable

Arkady

 

Arkady Kanevsky   email: [EMAIL PROTECTED]

Network Appliance Inc.   phone: 781-768-5395

275 Totten Pond Rd.  Fax: 781-895-1195

Waltham, MA 02451-2010  central phone: 781-768-5300


Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Sean Hefty

Roland Dreier wrote:

> Good catch.  Seems like the below patch is the right fix:
> we start out with

Fix looks right to me.

> packet->length = length;

I don't think that this assignment is needed.  Once the packet is sent, it is
simply freed.


- Sean


[openib-general] Re: [PATCH] [SRP] srp_cm_handler expanded response handling

2005-10-27 Thread John Kingman
On Thu, 27 Oct 2005, Roland Dreier wrote:

>Looks good, except:
>
>> +if (reason == 0x00010002)  
>
>can you add enums for all these SRP_LOGIN_REJ reason codes rather than
>open-coding this magic number here?

OK.

Thanks,
John

Signed-off-by: John Kingman <[EMAIL PROTECTED]>

Index: ib_srp.h
===
--- ib_srp.h(revision 3884)
+++ ib_srp.h(working copy)
@@ -76,6 +76,16 @@ enum srp_target_state {
SRP_TARGET_REMOVED
 };
 
+enum srp_login_rej_reason {
+   SRP_UNABLE_ESTABLISH_CHANNEL= 0x0001,
+   SRP_INSUFFICIENT_RESOURCES  = 0x00010001,
+   SRP_REQ_IT_IU_LENGTH_TOO_LARGE  = 0x00010002,
+   SRP_UNABLE_ASSOCIATE_CHANNEL= 0x00010003,
+   SRP_UNSUPPORTED_DESCRIPTOR_FMT  = 0x00010004,
+   SRP_MULTI_CHANNEL_UNSUPPORTED   = 0x00010005,
+   SRP_CHANNEL_LIMIT_REACHED   = 0x00010006
+};
+
 struct srp_host {
u8  initiator_port_id[16];
struct ib_device   *dev;

Index: ib_srp.c
===
--- ib_srp.c(revision 3883)
+++ ib_srp.c(working copy)
@@ -975,6 +975,7 @@ static int srp_cm_handler(struct ib_cm_i
struct ib_qp_attr *qp_attr = NULL;
int attr_mask = 0;
int comp = 0;
+   int rsp_opcode = 0;
 
switch (event->event) {
case IB_CM_REQ_ERROR:
@@ -985,17 +986,20 @@ static int srp_cm_handler(struct ib_cm_i
 
case IB_CM_REP_RECEIVED:
comp = 1;
+   rsp_opcode = *(u8 *) event->private_data;
 
-   {
+   if (rsp_opcode == SRP_LOGIN_RSP) {
struct srp_login_rsp *rsp = event->private_data;
 
-   /* XXX check that opcode is SRP RSP */
-
target->max_ti_iu_len = be32_to_cpu(rsp->max_ti_iu_len);
target->req_lim   = be32_to_cpu(rsp->req_lim_delta);
 
target->scsi_host->can_queue = min(target->req_lim,
   target->scsi_host->can_queue);
+   } else {
+   printk(KERN_WARNING PFX "Unhandled RSP opcode %#x\n", rsp_opcode);
+   target->status = -ECONNRESET;
+   break;
}
 
target->status = srp_alloc_iu_bufs(target);
@@ -1043,7 +1047,8 @@ static int srp_cm_handler(struct ib_cm_i
printk(KERN_DEBUG PFX "REJ received\n");
comp = 1;
 
-   if (event->param.rej_rcvd.reason == IB_CM_REJ_PORT_CM_REDIRECT) {
+   switch (event->param.rej_rcvd.reason) {
+   case IB_CM_REJ_PORT_CM_REDIRECT:
cpi = event->param.rej_rcvd.ari;
target->path.dlid = cpi->redirect_lid;
target->path.pkey = cpi->redirect_pkey;
@@ -1052,23 +1057,52 @@ static int srp_cm_handler(struct ib_cm_i
 
target->status = target->path.dlid ?
SRP_DLID_REDIRECT : SRP_PORT_REDIRECT;
-   } else if (topspin_workarounds &&
-  !memcmp(&target->ioc_guid, topspin_oui, 3) &&
-  event->param.rej_rcvd.reason == IB_CM_REJ_PORT_REDIRECT) {
-   /*
-* Topspin/Cisco SRP gateways incorrectly send
-* reject reason code 25 when they mean 24
-* (port redirect).
-*/
-   memcpy(target->path.dgid.raw,
-  event->param.rej_rcvd.ari, 16);
-
-   printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n",
-  (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix),
-  (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id));
+   break;
 
-   target->status = SRP_PORT_REDIRECT;
-   } else {
+   case IB_CM_REJ_PORT_REDIRECT:
+   if (topspin_workarounds &&
+  !memcmp(&target->ioc_guid, topspin_oui, 3)) {
+   /*
+* Topspin/Cisco SRP gateways incorrectly send
+* reject reason code 25 when they mean 24
+* (port redirect).
+*/
+   memcpy(target->path.dgid.raw,
+  event->param.rej_rcvd.ari, 16);
+
+   printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n",
+  (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix),
+  

[openib-general] Re: [PATCH] [SRP] srp_cm_handler expanded response handling

2005-10-27 Thread Roland Dreier
Looks good, except:

> + if (reason == 0x00010002)  

can you add enums for all these SRP_LOGIN_REJ reason codes rather than
open-coding this magic number here?

Thanks,
  Roland


[openib-general] [PATCH] [SRP] srp_cm_handler expanded response handling

2005-10-27 Thread John Kingman
This patch expands the srp_cm_handler code to recognize more response
cases and provides a placeholder for future code to handle SRP target
exceptions such as IB_CM_REJ_CONSUMER_DEFINED with reason code
0x00010002 (requested max_it_iu_len too large).
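A hedged sketch of decoding that consumer-defined REJ without open-coding magic numbers. It assumes an SRP_LOGIN_REJ layout of an opcode byte (0xc2), three reserved bytes, then a big-endian 32-bit reason; those layout details are assumptions to check against the SRP spec. The reason values are the ones named in this thread (0x00010001, 0x00010002, 0x00010006), and the MOCK_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SRP_LOGIN_REJ opcode, assumed from the SRP specification. */
#define MOCK_SRP_LOGIN_REJ 0xc2

/* A few SRP_LOGIN_REJ reason codes (hypothetical names). */
enum {
	MOCK_SRP_INSUFFICIENT_RESOURCES     = 0x00010001,
	MOCK_SRP_REQ_IT_IU_LENGTH_TOO_LARGE = 0x00010002,
	MOCK_SRP_CHANNEL_LIMIT_REACHED      = 0x00010006,
};

/* Read a big-endian 32-bit value without assuming alignment. */
static uint32_t get_be32(const uint8_t *p)
{
	return (uint32_t) p[0] << 24 | (uint32_t) p[1] << 16 |
	       (uint32_t) p[2] << 8 | (uint32_t) p[3];
}

/*
 * Return the SRP_LOGIN_REJ reason carried in CM REJ private data, or 0
 * if the data is too short or is not an SRP_LOGIN_REJ.
 */
static uint32_t srp_rej_reason(const void *private_data, size_t len)
{
	const uint8_t *p = private_data;

	if (len < 8 || p[0] != MOCK_SRP_LOGIN_REJ)
		return 0;
	return get_be32(p + 4);
}

/* Map a reason code to a message for logging. */
static const char *srp_rej_reason_str(uint32_t reason)
{
	switch (reason) {
	case MOCK_SRP_INSUFFICIENT_RESOURCES:
		return "insufficient resources";
	case MOCK_SRP_REQ_IT_IU_LENGTH_TOO_LARGE:
		return "requested max_it_iu_len too large";
	case MOCK_SRP_CHANNEL_LIMIT_REACHED:
		return "channel limit reached";
	default:
		return "unrecognized reason";
	}
}
```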

Patch has been tested with our target.

Signed-off-by: John Kingman <[EMAIL PROTECTED]>

Index: ib_srp.c
===
--- ib_srp.c(revision 3883)
+++ ib_srp.c(working copy)
@@ -975,6 +975,7 @@ static int srp_cm_handler(struct ib_cm_i
struct ib_qp_attr *qp_attr = NULL;
int attr_mask = 0;
int comp = 0;
+   int rsp_opcode = 0;
 
switch (event->event) {
case IB_CM_REQ_ERROR:
@@ -985,17 +986,20 @@ static int srp_cm_handler(struct ib_cm_i
 
case IB_CM_REP_RECEIVED:
comp = 1;
+   rsp_opcode = *(u8 *) event->private_data;
 
-   {
+   if (rsp_opcode == SRP_LOGIN_RSP) {
struct srp_login_rsp *rsp = event->private_data;
 
-   /* XXX check that opcode is SRP RSP */
-
target->max_ti_iu_len = be32_to_cpu(rsp->max_ti_iu_len);
target->req_lim   = be32_to_cpu(rsp->req_lim_delta);
 
target->scsi_host->can_queue = min(target->req_lim,
   target->scsi_host->can_queue);
+   } else {
+   printk(KERN_WARNING PFX "Unhandled RSP opcode %#x\n", rsp_opcode);
+   target->status = -ECONNRESET;
+   break;
}
 
target->status = srp_alloc_iu_bufs(target);
@@ -1043,7 +1047,8 @@ static int srp_cm_handler(struct ib_cm_i
printk(KERN_DEBUG PFX "REJ received\n");
comp = 1;
 
-   if (event->param.rej_rcvd.reason == IB_CM_REJ_PORT_CM_REDIRECT) {
+   switch (event->param.rej_rcvd.reason) {
+   case IB_CM_REJ_PORT_CM_REDIRECT:
cpi = event->param.rej_rcvd.ari;
target->path.dlid = cpi->redirect_lid;
target->path.pkey = cpi->redirect_pkey;
@@ -1052,23 +1057,52 @@ static int srp_cm_handler(struct ib_cm_i
 
target->status = target->path.dlid ?
SRP_DLID_REDIRECT : SRP_PORT_REDIRECT;
-   } else if (topspin_workarounds &&
-  !memcmp(&target->ioc_guid, topspin_oui, 3) &&
-  event->param.rej_rcvd.reason == IB_CM_REJ_PORT_REDIRECT) {
-   /*
-* Topspin/Cisco SRP gateways incorrectly send
-* reject reason code 25 when they mean 24
-* (port redirect).
-*/
-   memcpy(target->path.dgid.raw,
-  event->param.rej_rcvd.ari, 16);
-
-   printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n",
-  (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix),
-  (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id));
+   break;
 
-   target->status = SRP_PORT_REDIRECT;
-   } else {
+   case IB_CM_REJ_PORT_REDIRECT:
+   if (topspin_workarounds &&
+  !memcmp(&target->ioc_guid, topspin_oui, 3)) {
+   /*
+* Topspin/Cisco SRP gateways incorrectly send
+* reject reason code 25 when they mean 24
+* (port redirect).
+*/
+   memcpy(target->path.dgid.raw,
+  event->param.rej_rcvd.ari, 16);
+
+   printk(KERN_DEBUG PFX "Topspin/Cisco redirect to target port GID %016llx%016llx\n",
+  (unsigned long long) be64_to_cpu(target->path.dgid.global.subnet_prefix),
+  (unsigned long long) be64_to_cpu(target->path.dgid.global.interface_id));
+
+   target->status = SRP_PORT_REDIRECT;
+   } else {
+   printk(KERN_WARNING "  REJ reason: IB_CM_REJ_PORT_REDIRECT\n");
+   target->status = -ECONNRESET;
+   }
+   break;
+
+   case IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID:
+   printk(KERN_WARNING "  REJ reason: IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID\n");
+   target->status = -ECONNRESET;
+   break;
+
+   case IB_CM_REJ_CONSUMER_DEFIN

Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Roland Dreier
Sean> the only bug I saw was accessing packet->length after
Sean> calling ib_post_send_mad().  The send_handler() will free
Sean> the packet, so there's a race there.

Good catch.  Seems like the below patch is the right fix:
we start out with

length = count - sizeof (struct ib_user_mad);

and then do

packet->length = length;

so in

return sizeof (struct ib_user_mad_hdr) + packet->length;

we're really just returning count -- in ib_user_mad.h, the definition
of struct ib_user_mad is:

struct ib_user_mad {
	struct ib_user_mad_hdr hdr;
	__u8 data[0];
};

so sizeof (struct ib_user_mad) == sizeof (struct ib_user_mad_hdr).
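The argument can be checked with a small standalone program. The header fields below are hypothetical (the real struct ib_user_mad_hdr differs), and the sketch uses a C99 flexible array member where the kernel header uses the GNU zero-length array __u8 data[0]; both add no size to the struct here. old_return_value() recomputes what the old return statement produced from count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header; the real struct ib_user_mad_hdr differs. */
struct mock_user_mad_hdr {
	uint32_t id;
	uint32_t status;
	uint32_t timeout_ms;
	uint32_t retries;
};

/* Same shape as struct ib_user_mad: a header plus trailing bytes. */
struct mock_user_mad {
	struct mock_user_mad_hdr hdr;
	uint8_t data[];   /* flexible array member; contributes no size */
};

/* What the old "sizeof hdr + packet->length" return evaluated to. */
static size_t old_return_value(size_t count)
{
	size_t length = count - sizeof (struct mock_user_mad);

	return sizeof (struct mock_user_mad_hdr) + length;
}
```

Since the two sizeofs are equal, the old expression reduces to count, so returning count directly gives the same value without touching the packet.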

Hal, am I missing something?  Was there any reason to write the return
statement like that, or is it OK to just return count directly?

 - R.


--- infiniband/core/user_mad.c  (revision 3867)
+++ infiniband/core/user_mad.c  (working copy)
@@ -414,7 +414,7 @@ static ssize_t ib_umad_write(struct file
 
up_read(&file->agent_mutex);
 
-   return sizeof (struct ib_user_mad_hdr) + packet->length;
+   return count;
 
 err_msg:
ib_free_send_mad(packet->msg);


Re: [openib-general] [RFC] OpenSM Interactive Console

2005-10-27 Thread Greg Lindahl
On Thu, Oct 27, 2005 at 09:29:57AM -0500, Troy Benjegerdes wrote:

> I guess the point of all this is to find an end-user use case for the SM
> MIB, and work back from there to decide if having a MIB actually helps
> solve the problem.

The end-use case is likely to be something like "an enterprise which
insists on managing as much as possible through HP OpenView." Which
isn't anyone in HPC, hence the current lack of interest.

Now, the things you'd actually want to monitor on a cluster aren't
really the normal stuff that's in MIBs. I'd want to know if a cable
was unexpectedly unplugged, or if a node was up but its IB connection
wasn't. I'd like to know if a link had an unusual error rate.

-- greg



RE: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Hal Rosenstock
I think that is likely a different issue.
 
-- Hal



From: [EMAIL PROTECTED] on behalf of James Lentini
Sent: Thu 10/27/2005 1:54 PM
To: Roland Dreier
Cc: openib-general@openib.org
Subject: Re: [openib-general] OpenSM causes kernel trap




On Thu, 27 Oct 2005, Roland Dreier wrote:

> Sean, looks like your MAD send buf stuff may have broken send
> timeouts.  Any quick ideas before I dig into this?

Itamar also had a problem with the MAD layer on x86_64:

http://openib.org/pipermail/openib-general/2005-October/013029.html





[openib-general] Re: Automated userspace build error

2005-10-27 Thread Nishanth Aravamudan
On 27.10.2005 [14:31:31 +0200], Michael S. Tsirkin wrote:
> Quoting r. Nishanth Aravamudan <[EMAIL PROTECTED]>:
> > Subject: Re: Automated userspace build error
> > 
> > On 25.10.2005 [15:22:56 -0700], Roland Dreier wrote:
> > > Nishanth> Hrm, well, I'm testing the latest svn (3865), did the
> > > Nishanth> patch just get checked in?
> > > 
> > > Yeah, I only noticed it and fixed it after your original email.  I
> > > just meant that I had already checked it in before sending my reply.
> > > Sorry for the confusion...
> > 
> > No worries, I figured that's what happened.
> > 
> > On a related note, do you (or anyone else) have any suggestions for
> > build-testing all of the userspace components? There isn't a top-level
> > Makefile of any kind to make it easy :/
> > 
> > Thanks,
> > Nish
> 
> Yes, look at scripts in
> https://openib.org/svn/trunk/contrib/mellanox/scripts
> 
> You can also, basically, cut and paste stuff from the FAQ page,
> but that relies on performing make as root.

Which luckily I can do (or is that unluckily -- if I screw up, the
machine tends to fall over ;). Thanks for the pointer!

-Nish


Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Sean Hefty

Sean Hefty wrote:
> I don't see anything off there either.  Timeouts seem to work fine with
> CM testing, so I'm guessing that the issue is somewhere in user_mad.c.
> I'm trying to see if there's anything wrong in ib_umad_write() that
> might cause it to crash on the completion.


Re-testing with grmpp, I didn't hit any issues running with or without RMPP. 
ib_umad_write() can be cleaned up a little, but the only bug I saw was accessing 
packet->length after calling ib_post_send_mad().  The send_handler() will free 
the packet, so there's a race there.  This doesn't seem related to this crash 
though.
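A userspace sketch of the race described here, assuming the worst case in which the completion handler runs (and frees the packet) before the posting call returns. All names are hypothetical, and ib_post_send_mad() and the real send handler are modelled only by that ownership rule: once posted, the packet belongs to the handler, so the write path computes its return value without reading the packet.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical stand-in for struct ib_umad_packet. */
struct mock_packet {
	size_t length;
};

/* Hypothetical completion handler: it owns the packet and frees it. */
static void mock_send_handler(struct mock_packet *packet)
{
	free(packet);
}

/*
 * Hypothetical post call. Here the completion runs before the call
 * returns, the worst case the write path has to tolerate.
 */
static void mock_post_send(struct mock_packet *packet)
{
	mock_send_handler(packet);
}

/* Write path that never touches the packet after posting it. */
static ssize_t mock_write(size_t count, size_t hdr_size)
{
	struct mock_packet *packet;

	if (count < hdr_size)
		return -1;

	packet = malloc(sizeof *packet);
	if (!packet)
		return -1;
	packet->length = count - hdr_size;

	mock_post_send(packet);

	/* packet may already be freed, so return count, not packet->length. */
	return (ssize_t) count;
}
```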


- Sean


Re: [openib-general] Automated userspace build error

2005-10-27 Thread Nishanth Aravamudan
On 26.10.2005 [17:15:05 -0700], Woodruff, Robert J wrote:
> Nish wrote,
> >On a related note, do you (or anyone else) have any suggestions for
> >build-testing all of the userspace components? There isn't a top-level
> >Makefile of any kind to make it easy :/
> 
> >Thanks,
> >Nish
> 
> If you look at the openib download page, Makia posted a userspace
> source RPM, although it is a bit out of date. 

RPMs aren't necessarily useful, but the means to get there might be.

> I also have a similar build procedure that I use
> internally, basically building all of the usermode components
> and then building an RPM to allow easy installation on other
> nodes for testing. There are also .spec files for most of the individual
> libraries, if you prefer to build RPMs for individual libraries.
> I find it easier just to lump it all into one big usermode component RPM
> and one kernel-mode component RPM.

Yes, that's my goal. But I don't necessarily want to install the
libraries. Just build them. I will take a look at the SRPM you mentioned
above.

Thanks,
Nish


Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Sean Hefty

Roland Dreier wrote:

> Sean> I think that the send_handler in user_mad.c is broken.
>
> I don't see anything obviously wrong -- in Jay's log, the call to
> ib_free_send_mad() is crashing.  When can it be wrong to do that from
> the send handler?


I don't see anything off there either.  Timeouts seem to work fine with CM 
testing, so I'm guessing that the issue is somewhere in user_mad.c.  I'm trying 
to see if there's anything wrong in ib_umad_write() that might cause it to crash 
on the completion.


- Sean


Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Roland Dreier
Sean> I think that the send_handler in user_mad.c is broken.

I don't see anything obviously wrong -- in Jay's log, the call to
ib_free_send_mad() is crashing.  When can it be wrong to do that from
the send handler?

 - R.


Re: [openib-general] ifup/ifdown scripts don't work with IPoIB

2005-10-27 Thread Grant Grundler
On Thu, Oct 27, 2005 at 09:37:29AM -0700, Bob Woodruff wrote:
> Grant wrote,
> >What does it say when you use *192* for the first byte?
> 
> Same thing, I had a typo in first email,
> 
> arping -c 2 -w 3  -D -I ib0 192.168.0.1
> ARPING 192.168.0.1 from 0.0.0.0 ib0
> Sent 2 probes (2 broadcast(s))
> Received -1 response(s)

Hrm...wouldn't that be a bug in arping program?
How can one get "-1" responses?

And I can't reproduce that here (ia64-linux):

gsyprf3:~# ifconfig ib0
ib0   Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
  inet addr:10.0.0.51  Bcast:10.0.0.255  Mask:255.255.255.0
  UP BROADCAST MULTICAST  MTU:2044  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:128 
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

gsyprf3:~# arping -c 2 -w 3  -D -I ib0 10.0.0.51
ARPING 10.0.0.51 from 0.0.0.0 ib0
Sent 2 probes (2 broadcast(s))
Received 0 response(s)
gsyprf3:~# arping -c 2 -w 3  -D -I ib0 10.0.0.55
ARPING 10.0.0.55 from 0.0.0.0 ib0
Sent 2 probes (2 broadcast(s))
Received 0 response(s)

There is no 10.0.0.55 IP in use on this network.
I can't tell whether the above result is correct or not,
and I did RTFM.

BTW, I'm using Debian "iputils-arping 20020927-2".
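For reference, an IPoIB link-layer address is 20 bytes (per RFC 4391), not the 6 bytes of an Ethernet MAC, which is why tools like arping may need IB-specific handling. The HWaddr printed above appears to show only the first 16 of those bytes: 00-00-04-04 would be a flags byte plus a 24-bit QPN, and FE-80-... begins the port GID. The helper below is a hypothetical sketch of that split, not code from any of these tools.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MOCK_IPOIB_ALEN 20  /* 4-byte flags/QPN field + 16-byte GID */

/* Hypothetical decomposition of an IPoIB hardware address. */
struct ipoib_hwaddr_parts {
	uint8_t  flags;     /* first byte: reserved/flag bits */
	uint32_t qpn;       /* 24-bit queue pair number */
	uint8_t  gid[16];   /* port GID */
};

static void ipoib_split_hwaddr(const uint8_t addr[MOCK_IPOIB_ALEN],
			       struct ipoib_hwaddr_parts *out)
{
	out->flags = addr[0];
	out->qpn = (uint32_t) addr[1] << 16 | (uint32_t) addr[2] << 8 |
		   (uint32_t) addr[3];
	memcpy(out->gid, addr + 4, 16);
}
```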

grant


Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread James Lentini

On Thu, 27 Oct 2005, Roland Dreier wrote:

> Sean, looks like your MAD send buf stuff may have broken send
> timeouts.  Any quick ideas before I dig into this?

Itamar also had a problem with the MAD layer on x86_64:

http://openib.org/pipermail/openib-general/2005-October/013029.html



Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Sean Hefty

Roland Dreier wrote:

> Sean, looks like your MAD send buf stuff may have broken send
> timeouts.  Any quick ideas before I dig into this?


I think that the send_handler in user_mad.c is broken.

- Sean


Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Sean Hefty

Roland Dreier wrote:

> Sean, looks like your MAD send buf stuff may have broken send
> timeouts.  Any quick ideas before I dig into this?


No quick ideas why.  I'll start looking into this as well.

- Sean


RE: [openib-general] ifup/ifdown scripts don't work with IPoIB

2005-10-27 Thread Bob Woodruff
Hal wrote,
>I think arping needs a minor change to work for IB due to the difference in
>the HW addresses for IPoIB and other LAN MACs.
 
>-- Hal

Yep. That is the conclusion that we came to also. 
As a workaround for now, one can just remove the arping check
in ifup if the device is an IB device. Not perfect, but it
allows the normal ifcfg- scripts to work for IB devices.

Something like,

if [ "x$(echo "${REALDEVICE}" | sed -e 's/^ib[0-9]*//')" != "x" ]; then
    if ! arping -q -c 2 -w 3 -D -I ${REALDEVICE} ${IPADDR} ; then
        echo $"Error, some other host already uses address ${IPADDR}."
        exit 1
    fi
fi
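The intent of the sed test in the script, skipping the duplicate-address check for IPoIB interfaces, can be sketched in C as a hypothetical stricter helper that accepts only "ib" followed by one or more digits. Alias or child names such as "ib0:1" or "ib0.8001" are deliberately not matched, so the check would still run for them.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: true for names such as "ib0" or "ib12", that is
 * "ib" followed by one or more digits and nothing else.
 */
static bool is_ipoib_ifname(const char *name)
{
	const char *p;

	if (strncmp(name, "ib", 2) != 0 || name[2] == '\0')
		return false;
	for (p = name + 2; *p; p++)
		if (!isdigit((unsigned char) *p))
			return false;
	return true;
}
```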

woody



Re: [openib-general] OpenSM causes kernel trap

2005-10-27 Thread Roland Dreier
Sean, looks like your MAD send buf stuff may have broken send
timeouts.  Any quick ideas before I dig into this?

 - R.


[openib-general] Re: ib_mthca panic on PPC64

2005-10-27 Thread Roland Dreier
OK, the latest svn should work again.

 - R.


[openib-general] OpenSM causes kernel trap

2005-10-27 Thread Jay Higley
I am trying to start up opensm on a Dell PowerEdge 2850 with a Mellanox-based
InfiniBand card, on the x86-64 architecture.  The kernel is recompiled
with the latest stack from Subversion, and all of the modules load OK.
However, when I try to start opensm I get the following error.  After
this, the modules cannot be removed from the kernel and opensm is not
running.  I can send the output from opensm's log file if anyone is
interested.  Thanks.


-Jay Higley

Oct 27 12:07:17 riba OpenSM[3321]: OpenSM Rev:openib-1.1.0  
Oct 27 12:07:17 riba kernel: Unable to handle kernel paging request at  RIP:
Oct 27 12:07:17 riba kernel: {kfree+107}
Oct 27 12:07:17 riba kernel: PGD 103027 PUD 5619067 PMD 0
Oct 27 12:07:17 riba kernel: Oops:  [1] SMP
Oct 27 12:07:17 riba kernel: CPU 3
Oct 27 12:07:17 riba kernel: Modules linked in: nfsd exportfs lockd nfs_acl ipv6 sunrpc ib_uverbs ib_at ib_sdp ib_ucm ib_cm ib_ping ib_mthca ib_umad binfmt_misc dm_mod video thermal processor fan container button battery ac ehci_hcd uhci_hcd pcspkr floppy parport_pc parport ib_ipoib ib_sa ib_mad ib_core e1000 snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd soundcore ext3 jbd megaraid_mbox megaraid_mm sd_mod scsi_mod
Oct 27 12:07:17 riba kernel: Pid: 1783, comm: ib_mad1 Not tainted 2.6.13.4-86.caos.smp
Oct 27 12:07:17 riba kernel: RIP: 0010:[] {kfree+107}
Oct 27 12:07:17 riba kernel: RSP: 0018:81013df97db8  EFLAGS: 00010006
Oct 27 12:07:17 riba kernel: RAX: 0003 RBX:  RCX: 81013fd93518
Oct 27 12:07:17 riba kernel: RDX: 00762000 RSI: 0292 RDI: 810004b02028
Oct 27 12:07:17 riba kernel: RBP: 81010e00 R08: 81013df96000 R09:
Oct 27 12:07:17 riba kernel: R10: 0001 R11:  R12: 81013e600e10
Oct 27 12:07:17 riba kernel: R13: 810037deb000 R14: 81013e600e78 R15: 880e5190
Oct 27 12:07:17 riba kernel: FS:  () GS:804f3980() knlGS:
Oct 27 12:07:17 riba kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
Oct 27 12:07:17 riba kernel: CR2:  CR3: 00013907a000 CR4: 06e0
Oct 27 12:07:17 riba kernel: Process ib_mad1 (pid: 1783, threadinfo 81013df96000, task 81013e40a1b0)
Oct 27 12:07:17 riba kernel: Stack: 0286 81013e600e10 81013f3db180 880e272e
Oct 27 12:07:17 riba kernel:81013df97e28 8817113f 81013e40a3c8 81013fd93500
Oct 27 12:07:17 riba kernel:81013e600e00 0292
Oct 27 12:07:17 riba kernel: Call Trace:{:ib_mad:ib_free_send_mad+14} {:ib_umad:send_handler+63}
Oct 27 12:07:17 riba kernel:{:ib_mad:timeout_sends+404} {__wake_up+67}
Oct 27 12:07:17 riba kernel:{worker_thread+498} {default_wake_function+0}
Oct 27 12:07:17 riba kernel:{__wake_up_common+64} {default_wake_function+0}
Oct 27 12:07:17 riba kernel:{keventd_create_kthread+0} {worker_thread+0}
Oct 27 12:07:17 riba kernel:{keventd_create_kthread+0} {kthread+217}
Oct 27 12:07:17 riba kernel:{child_rip+8} {keventd_create_kthread+0}
Oct 27 12:07:17 riba kernel:{kthread+0} {child_rip+0}
Oct 27 12:07:17 riba kernel:
Oct 27 12:07:17 riba kernel:
Oct 27 12:07:17 riba kernel: Code: 8b 03 3b 43 04 73 04 89 c0 eb 0a 48 89 de e8 a2 03 00 00 8b
Oct 27 12:07:17 riba kernel: RIP {kfree+107} RSP
Oct 27 12:07:17 riba kernel: CR2:



RE: [openib-general] ifup/ifdown scripts don't work with IPoIB

2005-10-27 Thread Hal Rosenstock
I think arping needs a minor change to work for IB due to the difference in the 
HW addresses for IPoIB and other LAN MACs.
 
-- Hal
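Hal's point about hardware address sizes can be made concrete. An IPoIB link-layer address is 20 bytes (4 bytes of flags and QP number followed by the 16-byte port GID), against a 6-byte Ethernet MAC, so an IPv4 ARP packet on IPoIB is twice as long as its Ethernet counterpart. The sketch below only computes the ARP payload length from the address lengths; it assumes nothing about arping's internals beyond the observation that a tool hard-wired to 6-byte hardware addresses cannot build or parse the IPoIB form.

```c
#include <stddef.h>

/* Hardware and protocol address lengths in bytes: a 6-byte Ethernet MAC,
 * a 20-byte IPoIB address (flags/QPN + 16-byte GID) and a 4-byte IPv4 address. */
enum { ETHER_HW_LEN = 6, IPOIB_HW_LEN = 20, IPV4_LEN = 4 };

/* ARP payload = 8-byte fixed header (htype, ptype, hlen, plen, oper)
 * + sender and target hardware addresses + sender and target protocol addresses. */
size_t arp_payload_len(size_t hw_len, size_t proto_len)
{
    return 8 + 2 * hw_len + 2 * proto_len;
}
```

With these values, arp_payload_len(ETHER_HW_LEN, IPV4_LEN) is 28 bytes and arp_payload_len(IPOIB_HW_LEN, IPV4_LEN) is 56 bytes.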



From: [EMAIL PROTECTED] on behalf of Bob Woodruff
Sent: Thu 10/27/2005 12:37 PM
To: 'Grant Grundler'
Cc: openib-general@openib.org
Subject: RE: [openib-general] ifup/ifdown scripts don't work with IPoIB



Grant wrote,
>What does it say when you use *192* for the first byte?

Same thing; I had a typo in the first email:

arping -c 2 -w 3  -D -I ib0 192.168.0.1
ARPING 192.168.0.1 from 0.0.0.0 ib0
Sent 2 probes (2 broadcast(s))
Received -1 response(s)

woody





[openib-general] [PATCH] add node_guid to struct ib_device

2005-10-27 Thread Sean Hefty
Here's a modified version of Roland's original patch that adds only
the node_guid to struct ib_device.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

I'll rework my other patches based on this change.
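The mthca_kzalloc() wrapper in the patch is a stop-gap for kernels that predate kzalloc(), which first appears in 2.6.14. A userspace sketch of the same allocate-then-zero pattern follows; malloc() stands in for kmalloc(), and compat_zalloc()/zalloc() are hypothetical names. The function-like macro is written without a trailing semicolon so that it can also appear inside an expression.

```c
#include <stdlib.h>
#include <string.h>

/* Userspace analogue of the compatibility wrapper: allocate, then zero
 * the buffer only if the allocation succeeded. */
static inline void *compat_zalloc(size_t size)
{
    void *ret = malloc(size);
    if (ret)
        memset(ret, 0, size);
    return ret;
}

/* No trailing semicolon, so both "p = zalloc(n);" and
 * "if (!(p = zalloc(n)))" expand correctly. */
#define zalloc(s) compat_zalloc(s)
```

In userspace, calloc(1, n) gives the same zeroed result; the explicit memset() only mirrors the kernel wrapper.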

Index: include/rdma/ib_verbs.h
===
--- include/rdma/ib_verbs.h (revision 3861)
+++ include/rdma/ib_verbs.h (working copy)
@@ -951,6 +951,7 @@
u64  uverbs_cmd_mask;
int  uverbs_abi_ver;
 
+   __be64   node_guid;
u8   node_type;
u8   phys_port_cnt;
 };
Index: hw/mthca/mthca_dev.h
===
--- hw/mthca/mthca_dev.h(revision 3830)
+++ hw/mthca/mthca_dev.h(working copy)
@@ -290,7 +290,7 @@
u64  ddr_end;
 
MTHCA_DECLARE_DOORBELL_LOCK(doorbell_lock)
-   struct semaphore cap_mask_mutex;
+   struct semaphore dev_attr_mutex;
 
void __iomem*hcr;
void __iomem*kar;
@@ -528,4 +528,17 @@
return dev->mthca_flags & MTHCA_FLAG_MEMFREE;
 }
 
+/*
+ * XXX remove once 2.6.14 is released.
+ */
+static inline void *mthca_kzalloc(size_t size, unsigned int __nocast flags)
+{
+   void *ret = kmalloc(size, flags);
+   if (ret)
+   memset(ret, 0, size);
+   return ret;
+}
+#undef kzalloc
+#define kzalloc(s, f) mthca_kzalloc(s, f)
+
 #endif /* MTHCA_DEV_H */
Index: hw/mthca/mthca_provider.c
===
--- hw/mthca/mthca_provider.c   (revision 3830)
+++ hw/mthca/mthca_provider.c   (working copy)
@@ -45,6 +45,14 @@
 #include "mthca_user.h"
 #include "mthca_memfree.h"
 
+static void init_query_mad(struct ib_smp *mad)
+{
+   mad->base_version  = 1;
+   mad->mgmt_class= IB_MGMT_CLASS_SUBN_LID_ROUTED;
+   mad->class_version = 1;
+   mad->method= IB_MGMT_METHOD_GET;
+}
+
 static int mthca_query_device(struct ib_device *ibdev,
  struct ib_device_attr *props)
 {
@@ -55,7 +63,7 @@
 
u8 status;
 
-   in_mad  = kmalloc(sizeof *in_mad, GFP_KERNEL);
+   in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
if (!in_mad || !out_mad)
goto out;
@@ -64,12 +72,8 @@
 
props->fw_ver  = mdev->fw_ver;
 
-   memset(in_mad, 0, sizeof *in_mad);
-   in_mad->base_version   = 1;
-   in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED;
-   in_mad->class_version  = 1;
-   in_mad->method = IB_MGMT_METHOD_GET;
-   in_mad->attr_id= IB_SMP_ATTR_NODE_INFO;
+   init_query_mad(in_mad);
+   in_mad->attr_id = IB_SMP_ATTR_NODE_INFO;
 
err = mthca_MAD_IFC(mdev, 1, 1,
1, NULL, NULL, in_mad, out_mad,
@@ -127,20 +131,16 @@
int err = -ENOMEM;
u8 status;
 
-   in_mad  = kmalloc(sizeof *in_mad, GFP_KERNEL);
+   in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
if (!in_mad || !out_mad)
goto out;
 
memset(props, 0, sizeof *props);
 
-   memset(in_mad, 0, sizeof *in_mad);
-   in_mad->base_version   = 1;
-   in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED;
-   in_mad->class_version  = 1;
-   in_mad->method = IB_MGMT_METHOD_GET;
-   in_mad->attr_id= IB_SMP_ATTR_PORT_INFO;
-   in_mad->attr_mod   = cpu_to_be32(port);
+   init_query_mad(in_mad);
+   in_mad->attr_id  = IB_SMP_ATTR_PORT_INFO;
+   in_mad->attr_mod = cpu_to_be32(port);
 
err = mthca_MAD_IFC(to_mdev(ibdev), 1, 1,
port, NULL, NULL, in_mad, out_mad,
@@ -185,7 +185,7 @@
int err;
u8 status;
 
-   if (down_interruptible(&to_mdev(ibdev)->cap_mask_mutex))
+   if (down_interruptible(&to_mdev(ibdev)->dev_attr_mutex))
return -ERESTARTSYS;
 
err = mthca_query_port(ibdev, port, &attr);
@@ -207,7 +207,7 @@
}
 
 out:
-   up(&to_mdev(ibdev)->cap_mask_mutex);
+   up(&to_mdev(ibdev)->dev_attr_mutex);
return err;
 }
 
@@ -219,18 +219,14 @@
int err = -ENOMEM;
u8 status;
 
-   in_mad  = kmalloc(sizeof *in_mad, GFP_KERNEL);
+   in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
if (!in_mad || !out_mad)
goto out;
 
-   memset(in_mad, 0, sizeof *in_mad);
-   in_mad->base_version   = 1;
-   in_mad->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED;
-   in_mad->class_version  = 1;
-   in_mad->method = IB_MGMT_METHOD_GET;
-   in_mad->attr_id= IB_SMP_ATTR_PKEY_TABLE;
-   in_mad->attr_mod

Re: [openib-general] Re: ehca testing

2005-10-27 Thread Roland Dreier
OK, looks like you have two problems.  First of all, you seem to have
two versions of ib_mthca, one of which gets picked up by hotplug on
boot and one of which gets picked up by modprobe.  Notice how you
don't see the

dev->ib_dev.node_type = 1

line when mthca runs on boot?  The only explanation I can come up with
for that would be that you have an old version of it in an initrd or
something that's screwing things up.

As for the crash in poll_catas, I understand what's going on there.
The catastrophic error polling code is ioremap()ing a PCI address
instead of the correct CPU address.  They're different on pSeries but
not on most other architectures, so I didn't see problems in testing.

I'll commit a fix for that problem shortly.

 - R.


RE: [openib-general] ifup/ifdown scripts don't work with IPoIB

2005-10-27 Thread Bob Woodruff
Grant wrote,
>What does it say when you use *192* for the first byte?

Same thing; I had a typo in the first email:

arping -c 2 -w 3  -D -I ib0 192.168.0.1
ARPING 192.168.0.1 from 0.0.0.0 ib0
Sent 2 probes (2 broadcast(s))
Received -1 response(s)

woody



Re: [openib-general] Re: ehca testing

2005-10-27 Thread Troy Benjegerdes
On Thu, Oct 20, 2005 at 03:32:13PM -0700, Roland Dreier wrote:
> Troy> There is some sort of strange initialization error going on here..
> 
> Yes, very strange.  Can you add
> 
>   printk(KERN_ERR "hca->node_type = %d\n", hca->node_type);
> 
> to the beginning of ipoib_add_port(), and
> 
>   printk(KERN_ERR "dev->ib_dev.node_type = %d\n", dev->ib_dev.node_type);
> 
> right before the call to ib_register_device() in
> mthca_register_device() and send the output that you get when hotplug
> loads ib_mthca vs. when you load ib_mthca by hand?

When loaded at boot:

[586811.915831] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
[586811.915849] ib_mthca: Initializing :d9:00.0
[586811.916634] PCI: Enabling device: (:d9:00.0), cmd 142
[586818.501595] openafs: module license 'http://www.openafs.org/dl/license10.html' taints kernel.
[586818.504651] Found system call table at 0xc0013e68 (scan: close+ioctl)
[586818.520240] Starting AFS cache scan...Memory cache: Allocating 12500 dcache entries...found 0 non-empty cache files (0%).
[586875.848354] afs: Lost contact with volume location server 147.155.137.10 in cell scl.ameslab.gov
[586875.848374] afs: Lost contact with volume location server 147.155.137.10 in cell scl.ameslab.gov
[587154.758768] hca->node_type = 236
[587154.760578] hca->node_type = 236
[587154.761511] hca->node_type = 236
[587154.761572] mthca0: ib_query_pkey port 3 failed (ret = -22)
[587154.761584] hca->node_type = 236
[587154.761633] mthca0: ib_query_pkey port 4 failed (ret = -22)
[587154.761644] hca->node_type = 236
[587154.762506] hca->node_type = 236
[587154.763422] hca->node_type = 236
[587154.763480] mthca0: ib_query_pkey port 7 failed (ret = -22)
[587154.763491] hca->node_type = 236
[587154.763542] mthca0: ib_query_pkey port 8 failed (ret = -22)
[587154.763553] hca->node_type = 236
[587154.765698] hca->node_type = 236
[587154.767136] hca->node_type = 236
[587154.767312] mthca0: ib_query_pkey port 11 failed (ret = -22)
[587154.767324] hca->node_type = 236
[587154.767455] mthca0: ib_query_pkey port 12 failed (ret = -22)
[587154.767471] hca->node_type = 236
[587154.769140] hca->node_type = 236
[587154.772116] hca->node_type = 236
[587154.772180] mthca0: ib_query_pkey port 15 failed (ret = -22)
[587154.772192] hca->node_type = 236
[587154.772243] mthca0: ib_query_pkey port 16 failed (ret = -22)
[587154.772255] hca->node_type = 236
[587154.773401] hca->node_type = 236
[587154.776817] hca->node_type = 236
[587154.776974] mthca0: ib_query_pkey port 19 failed (ret = -22)
[587154.776986] hca->node_type = 236
[587154.778179] mthca0: ib_query_pkey port 20 failed (ret = -22)
[587154.778198] hca->node_type = 236
[587154.780159] hca->node_type = 236
[587154.785406] hca->node_type = 236
[587154.785512] mthca0: ib_query_pkey port 23 failed (ret = -22)
[587154.785523] hca->node_type = 236
[587154.785582] mthca0: ib_query_pkey port 24 failed (ret = -22)
[587154.785599] hca->node_type = 236
[587154.789427] hca->node_type = 236
[587154.794314] hca->node_type = 236
[587154.794458] mthca0: ib_query_pkey port 27 failed (ret = -22)
[587154.794474] hca->node_type = 236
[587154.794634] mthca0: ib_query_pkey port 28 failed (ret = -22)
[587154.794646] hca->node_type = 236
[587154.797133] hca->node_type = 236
[587154.803507] hca->node_type = 236
[587154.803597] mthca0: ib_query_pkey port 31 failed (ret = -22)
[587154.803608] hca->node_type = 236
[587154.803667] mthca0: ib_query_pkey port 32 failed (ret = -22)
[587154.803679] hca->node_type = 236
[587154.820947] hca->node_type = 236
[587154.829795] hca->node_type = 236
[587154.831921] mthca0: ib_query_pkey port 35 failed (ret = -22)
[587154.831934] hca->node_type = 236
[587154.834932] mthca0: ib_query_pkey port 36 failed (ret = -22)
[587154.834946] hca->node_type = 236
[587154.844314] hca->node_type = 236
[587154.853591] hca->node_type = 236
[587154.853680] mthca0: ib_query_pkey port 39 failed (ret = -22)
[587154.853692] hca->node_type = 236
[587154.853745] mthca0: ib_query_pkey port 40 failed (ret = -22)
[587154.853761] hca->node_type = 236
[587154.869483] hca->node_type = 236
[587154.874749] hca->node_type = 236
[587154.874952] mthca0: ib_query_pkey port 43 failed (ret = -22)
[587154.874969] hca->node_type = 236
[587154.875609] mthca0: ib_query_pkey port 44 failed (ret = -22)
[587154.875624] hca->node_type = 236
[587154.894612] hca->node_type = 236
[587154.908058] hca->node_type = 236
[587154.909244] mthca0: ib_query_pkey port 47 failed (ret = -22)
[587154.909261] hca->node_type = 236
[587154.909323] mthca0: ib_query_pkey port 48 failed (ret = -22)
[587154.909334] hca->node_type = 236
[587154.918749] hca->node_type = 236
[587154.939629] hca->node_type = 236
[587154.939729] mthca0: ib_query_pkey port 51 failed (ret = -22)
[587154.939745] hca->node_type = 236
[587154.939866] mthca0: ib_query_pkey port 52 failed (ret = -22)
[587154.939883] hca->node_type = 236
[587154.957219] hca->node_type = 236
[587154.971523] hca->node_type = 236

[openib-general] ib_mthca panic on PPC64

2005-10-27 Thread Troy Benjegerdes
I got this the other day (before I had a chance to add the debug code)

p5l0:~# [443954.161068] mthca0: ib_query_pkey port 0 failed (ret = -22)
[443988.334644] mthca0: ib_query_pkey port 0 failed (ret = -22)
[444037.579342] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
[444037.579360] ib_mthca: Initializing :d9:00.0
[444101.503664] ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
[444101.503682] ib_mthca: Initializing :d9:00.0
[444107.815375] Oops: Kernel access of bad area, sig: 7 [#1]
[444107.815389] SMP NR_CPUS=8 NUMA PSERIES LPAR
[444107.815401] Modules linked in: ib_ipoib ib_sa ib_mthca ib_mad ib_core openafs
[444107.815425] NIP: D98BF638 XER: 2018 LR: C0057B2C CTR: D000098BF5D0
[444107.815440] REGS: c001ee79b490 TRAP: 0300   Tainted: P   (2.6.13.3-power5)
[444107.815455] MSR: 80009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 28000084
[444107.815469] DAR: d10082189a04 DSISR: 4000
[444107.815481] TASK: c001ee7950e0[0] 'swapper' THREAD: c001ee798000 CPU: 6
[444107.815494] GPR00: 0010 C001EE79B710 D98D6540 D10082189A04
[444107.815515] GPR04: 0008 0001009D0180  00000800
[444107.815535] GPR08: C003DDA91910  C001EE79B840 D10082189A04
[444107.815556] GPR12: 4882 C04BF400  00C00060
[444107.815576] GPR16: 0006
[444107.815595] GPR20:  C05F7ED8 C05F7F40 C0606500
[444107.815617] GPR24: C001ECEFC498 C001EE79B840 C001EE798000 C003DDA91000
[444107.815639] GPR28: 0100 C003DDA91000 D98D4EC0
[444107.815661] NIP [d98bf638] .poll_catas+0x68/0x2f0 [ib_mthca]
[444107.815699] LR [c0057b2c] .run_timer_softirq+0x15c/0x260
[444107.815717] Call Trace:
[444107.815725] [c001ee79b710] [c001ee79b7c0] 0xc001ee79b7c0 (unreliable)
[444107.815744] [c001ee79b7d0] [c0057b2c] .run_timer_softirq+0x15c/0x260
[444107.815764] [c001ee79b890] [c0051e68] .__do_softirq+0xe8/0x1c0
[444107.815783] [c001ee79b950] [c0051fc4] .do_softirq+0x84/0x90
[444107.815801] [c001ee79b9d0] [c00108f0] .timer_interrupt+0xd0/0x410
[444107.815821] [c001ee79bad0] [c000a2b4] decrementer_common+0xb4/0x100
[444107.815838] --- Exception: 901 at .pseries_dedicated_idle+0x104/0x280
[444107.815857] LR = .pseries_dedicated_idle+0x1e0/0x280
[444107.815868] [c001ee79be90] [c000f460] .cpu_idle+0x40/0x60
[444107.815886] [c001ee79bf00] [c0032fa0] .start_secondary+0x120/0x150
[444107.815905] [c001ee79bf90] [c000ba7c] .enable_64b_mode+0x0/0x28
[444107.815922] Instruction dump:
[444107.815930] 3be0 4820 2fab 381f0001 7c1f07b4 409e0058 801d0908 7f9f0040
[444107.815955] 409c00c8 e97d08f8 7be91764 7c6b4a14 <7c001c2c> 0c00 4c00012c 780b0020
[444107.815983]  <0>Kernel panic - not syncing: Fatal exception in interrupt
[444107.815998]



[openib-general] PGI compiler issue with dat_platform_specific.h

2005-10-27 Thread Sayantan Sur
Hi,

We ran into some trouble compiling the OpenIB DAPL provider with
the PGI compiler. I believe the problem affects both the ibat-cm and the
scm-based providers.

Has anyone compiled DAPL/Gen2 with PGI? Is there a quick workaround for
this?


PGC-W-0221-Redefinition of symbol UINT64_C (/usr/include/stdint.h: 304)
PGC-S-0040-Illegal use of symbol, u_int64_t
(/home/1/surs/projects/Gen2/dapl_scm
_patch/dapl/dat/include/dat/dat_platform_specific.h: 139)
PGC/x86-64 Linux/x86-64 6.0-5: compilation completed with severe errors


Our machine runs SuSE 9.3 with Linux kernel 2.6.13.1 and OpenIB
svn #3882.

Thanks,
Sayantan.

-- 
http://www.cse.ohio-state.edu/~surs
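One possible workaround, offered only as a sketch: the PGC-W-0221 message is reported at /usr/include/stdint.h line 304, which suggests UINT64_C was already defined before <stdint.h> was included, and u_int64_t is the BSD-style type that glibc declares in <sys/types.h>. Assuming line 139 of dat_platform_specific.h is a 64-bit typedef written in terms of u_int64_t (DAT_UINT64 below stands in for that assumed typedef), including <stdint.h> first, guarding the macro, and using the C99 uint64_t type would sidestep both messages.

```c
#include <stdint.h>

/* <stdint.h> normally supplies UINT64_C; only provide a fallback when it
 * is missing, instead of defining it unconditionally. */
#ifndef UINT64_C
#define UINT64_C(c) c ## ULL
#endif

/* Assumed shape of the failing typedef, rewritten with the C99 type
 * rather than u_int64_t, which needs <sys/types.h>. */
typedef uint64_t DAT_UINT64;
```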


Re: [openib-general] ifup/ifdown scripts don't work with IPoIB

2005-10-27 Thread Grant Grundler
On Thu, Oct 27, 2005 at 08:58:50AM -0700, Bob Woodruff wrote:
> If I run the arping command manually, I get
> 
> arping -c 2 -w 3 -D -I ib0 102.168.0.1

What does it say when you use *192* for the first byte?

(This may not be the only problem...but need to get that right too)

grant


[openib-general] osm_console.c - compilation warnings

2005-10-27 Thread Eitan Zahavi
Hi Hal

I think you are missing

#include 

because I get the following warnings:

osm_console.c: In function `loglevel_parse':
osm_console.c:112: warning: implicit declaration of function `strtoul'
osm_console.c:118: warning: implicit declaration of function `strtol'
osm_console.c: In function `osm_console':
osm_console.c:177: warning: implicit declaration of function `free'

EZ

Eitan Zahavi

Design Technology Director

Mellanox Technologies LTD

Tel:+972-4-9097208
Fax:+972-4-9593245

P.O. Box 586 Yokneam 20692 ISRAEL
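All three functions named in these warnings, strtoul(), strtol() and free(), are declared by the standard C header <stdlib.h>, so a file that calls them should include it. A minimal sketch with a hypothetical parse_level() helper (not the actual loglevel_parse() code) shows strtoul() used with its declaration in scope:

```c
#include <stdlib.h> /* declares strtoul(), strtol() and free() */

/* Hypothetical helper: parse a level given in decimal, octal (leading 0)
 * or hex (leading 0x); returns -1 if the string is empty or has trailing junk. */
long parse_level(const char *s)
{
    char *end;
    unsigned long v = strtoul(s, &end, 0);
    if (end == s || *end != '\0')
        return -1;
    return (long)v;
}
```

Base 0 lets strtoul() accept "10" and "0xa" alike; both yield 10.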





[openib-general] ifup/ifdown scripts don't work with IPoIB

2005-10-27 Thread Bob Woodruff
I was trying to set up my system to use the normal
/etc/sysconfig/network-scripts/ifcfg-ib0

and have the interface brought up at startup using
/sbin/ifup, as it does with Ethernet.

I am running on a RedHat EL4.0 U2 distribution.

My config file looks like this:

# OpenIB IPoIB Controller
DEVICE=ib0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.168.0.1
NETMASK=255.255.255.0
BROADCAST=192.168.0.255

When I run /sbin/ifup ib0, I get

[EMAIL PROTECTED] woody]# /sbin/ifup ib0
Error, some other host already uses address 192.168.0.1.

Looking at the ifup script, it does this check:

if ! arping -q -c 2 -w 3 -D -I ${REALDEVICE} ${IPADDR} ; then
    echo $"Error, some other host already uses address ${IPADDR}."
    exit 1
fi

If I run the arping command manually, I get

arping -c 2 -w 3 -D -I ib0 102.168.0.1
ARPING 102.168.0.1 from 0.0.0.0 ib0
Sent 2 probes (2 broadcast(s))
Received -1 response(s)

but when I run it on the eth0 device, I get

arping -c 2 -w 3 -D -I eth0 10.0.0.1
ARPING 10.0.0.1 from 0.0.0.0 eth0
Sent 2 probes (2 broadcast(s))
Received 0 response(s)

So why does arping report -1 responses for IPoIB, rather than 0 as it does
with Ethernet?  Is this a problem with IPoIB or with the ifup script?

woody






RE: [openib-general] [RFC] OpenSM Interactive Console

2005-10-27 Thread Eitan Zahavi
Hi Hal,


I still think that "server"-like behavior is much preferable to having the SM sit there and wait for console input. The SM is a service and thus should run like a daemon.

The MIB is just a standard way to avoid defining our own protocol for that.
In your implementation the SM must be put in console mode from the first invocation and thus needs a dedicated terminal.

Even with osmsh one could implement (using standard Tcl sockets) a simple server that just waits for remote commands (I can provide the code, as I have written zillions of such servers).

The MIB is nicer and I think it is not very complicated to implement, at least not the trivial groups for setting SM parameters.

The more I think about it, the more convinced I am that we need to do it.


Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL



> -Original Message-
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, October 27, 2005 1:45 PM
> To: Eitan Zahavi
> Cc: Troy Benjegerdes; openib-general@openib.org
> Subject: RE: [openib-general] [RFC] OpenSM Interactive Console
> 
> There have been requests for this CLI functionality from at least the labs. It has been
> discussed on the list.
> 
> Also, there was the following comment in OpenSM::main.c:
> 
> /*
> Sit here forever
> In the future, some sort of console interactivity could
> be implemented in this loop.
> */
> 
> -- Hal
> 
> 
> 
> From: Eitan Zahavi [mailto:[EMAIL PROTECTED]]
> Sent: Thu 10/27/2005 2:03 AM
> To: Hal Rosenstock; Eitan Zahavi
> Cc: Troy Benjegerdes; openib-general@openib.org
> Subject: RE: [openib-general] [RFC] OpenSM Interactive Console
> 
> 
> 
> Yes this MIB needs some cleanup.
> I would love to hear from the community some feedback regarding SM MIB
> usefulness.
> 
> In the past we did not get any push for interactive SM or online configurable SM so I
> did not see any reason to work on it.
> 
> I do not think it is a huge task to make SM MIB work with OpenSM. At least not the
> 90% of it that I glanced through.
> 
> 
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
> 
> > -Original Message-
> > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, October 26, 2005 7:44 PM
> > To: Eitan Zahavi
> > Cc: Troy Benjegerdes; openib-general@openib.org
> > Subject: RE: [openib-general] [RFC] OpenSM Interactive Console
> >
> > Hi Eitan,
> >
> > I sit corrected. There are R/W parameters in the SM MIB as you indicate. I was
> > thinking of all the other IPoIB MIBs. It's been a while since I looked at the SM MIB.
> >
> > Also, the SM MIB (draft-ietf-ipoib-subnet-manager-mib-00) expired a while ago. At a
> > minimum, it needs to be dusted off. That would include updating it for IBA 1.2.
> >
> > -- Hal
> >
> > 
> >
> > From: Eitan Zahavi [mailto:[EMAIL PROTECTED]]
> > Sent: Tue 10/25/2005 5:19 AM
> > To: Hal Rosenstock
> > Cc: Troy Benjegerdes; openib-general@openib.org
> > Subject: Re: [openib-general] [RFC] OpenSM Interactive Console
> >
> >
> >
> > Hal Rosenstock wrote:
> > > On Mon, 2005-10-24 at 14:38, Eitan Zahavi wrote:
> > >
> > >>Hal Rosenstock wrote:
> > >>
> > >>>On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote:
> > >>>
> > >>>>I would suggest to use SNMP for the tasks below. IETF IPoIB group has
> > >>>>defined an SNMP MIB that can support the required functionality below.
> > >>>
> > >>>The IETF SNMP MIBs are one way of presenting the information to the
> > >>>outside world. There are other possible management interfaces. The SNMP
> > >>>MIB instrumentation would need to use lower layer APIs to get this
> > >>>information out of the SM.
> > >>
> > >>Yes but the IETF SM MIB is the only one that is close to a standard way.
> > >>It does not require low level interface if it will integrate into the OpenSM code.
> > >>One way to do it is by extending OpenSM with an AgentX interface.
> > >>
> > >>IMO one clear advantage of using SNMP for SM integration is that the code will work with any SM that is IETF compliant.
> > >>Also if you want to write a "client server" type of application on top of an SM you
> > >>can either stick to sending MADs which translate into SA client based application or
> > >>you better stay with some known protocol for management (like SNMP) and not develop yet another protocol for
> > >>doing exactly the same things as SNMP already supports.
> > >
> > >
> > > There are limitations in the SNMP MIBs. One is that they are RO so they
> > > are more for monitoring. Also, many environments do not use S

Re: [openib-general] [RFC] OpenSM Interactive Console

2005-10-27 Thread Troy Benjegerdes
For me, the only purpose for an SNMP MIB would be to get the information
into a network management system. In my case, I'll be using something
that's open-source or has a plugin architecture like Nagios, and I'd
much prefer to have the network management system talk to the subnet
manager (or send SMA packets) directly instead of introducing an extra
translation to SNMP.

SNMP is only useful to me because it is (in theory) an interoperable
cross-vendor standard. In the InfiniBand case, we already have a
cross-vendor standard implementation (OpenIB), and adding SNMP is
another dependency and layer of complexity that can break and be
difficult to set up.

If I knew of an open-source tool that was actually able to use SNMP to
query a random Ethernet vendor's switch and tell me which port a
particular MAC address was plugged into, I might be more positive. But
as far as I know, each vendor's SNMP implementation is broken in subtly
different ways, so this gets to be a nightmare to actually implement.

I guess the point of all this is to find an end-user use case for the SM
MIB, and work back from there to decide whether having a MIB actually
helps solve the problem.

On Thu, Oct 27, 2005 at 08:03:57AM +0200, Eitan Zahavi wrote:
> Yes this MIB needs some cleanup.
> I would love to hear from the community some feedback regarding SM MIB
> usefulness.
> 
> In the past we did not get any push for interactive SM or online
> configurable SM so I did not see any reason to work on it. 
> 
> I do not think it is a huge task to make SM MIB work with OpenSM. At least
> not the 90% of it that I glanced through.
> 
> 
> Eitan Zahavi
> Design Technology Director
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
> 
> > -Original Message-
> > From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, October 26, 2005 7:44 PM
> > To: Eitan Zahavi
> > Cc: Troy Benjegerdes; openib-general@openib.org
> > Subject: RE: [openib-general] [RFC] OpenSM Interactive Console
> > 
> > Hi Eitan,
> > 
> > I sit corrected. There are R/W parameters in the SM MIB as you indicate. I was
> > thinking of all the other IPoIB MIBs. It's been a while since I looked at the SM MIB.
> > 
> > Also, the SM MIB (draft-ietf-ipoib-subnet-manager-mib-00) expired a while ago. At a
> > minimum, it needs to be dusted off. That would include updating it for IBA 1.2.
> > 
> > -- Hal
> > 
> > 
> > 
> > From: Eitan Zahavi [mailto:[EMAIL PROTECTED]
> > Sent: Tue 10/25/2005 5:19 AM
> > To: Hal Rosenstock
> > Cc: Troy Benjegerdes; openib-general@openib.org
> > Subject: Re: [openib-general] [RFC] OpenSM Interactive Console
> > 
> > 
> > 
> > Hal Rosenstock wrote:
> > > On Mon, 2005-10-24 at 14:38, Eitan Zahavi wrote:
> > >
> > >>Hal Rosenstock wrote:
> > >>
> > >>>On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote:
> > >>>
> > >>>>I would suggest to use SNMP for the tasks below. IETF IPoIB group has
> > >>>>defined an SNMP MIB that can support the required functionality below.
> > >>>
> > >>>The IETF SNMP MIBs are one way of presenting the information to the
> > >>>outside world. There are other possible management interfaces. The SNMP
> > >>>MIB instrumentation would need to use lower layer APIs to get this
> > >>>information out of the SM.
> > >>
> > >>Yes but the IETF SM MIB is the only one that is close to a standard way.
> > >>It does not require low level interface if it will integrate into the OpenSM code.
> > >>One way to do it is by extending OpenSM with an AgentX interface.
> > >>
> > >>IMO one clear advantage of using SNMP for SM integration is that the code will work with any SM that is IETF compliant.
> > >>Also if you want to write a "client server" type of application on top of an SM you
> > >>can either stick to sending MADs which translate into SA client based application or
> > >>you better stay with some known protocol for management (like SNMP) and not develop yet another protocol for
> > >>doing exactly the same things as SNMP already supports.
> > >
> > >
> > > There are limitations in the SNMP MIBs. One is that they are RO so they
> > > are more for monitoring. Also, many environments do not use SNMP. It is
> > > unclear how much of a requirement it is to manage any SM or how many
> > > other SMs support the SM MIB. (There are other IB associated MIBs too).
> > 
> > SNMP MIBs are certainly not just RO a simple example from the SM MIB:
> >ibSmPortInfoLMC   OBJECT-TYPE
> >SYNTAX  Unsigned32(0..7)
> >MAX-ACCESS  read-write
> >STATUS  current
> >DESCRIPTION
> >   "LID mask for multipath support.  User should take extra caution
> >   when setting this value, since any ch

[openib-general] [PATCH] Opensm - fix lmc algorithm

2005-10-27 Thread Yael Kalka
Hi Hal,

We noticed a problem in the LMC assignment algorithm.
In the current code, when opensm is run with lmc > 0, it goes into an
infinite loop.
While debugging the problem we found an issue with the LID assignment,
so we changed the algorithm. The change is in the
__osm_lid_mgr_init_sweep function.
We have done some testing of the new code, and the LMC assignment
appears to be correct with the fix.

Thanks,
Yael

Signed-off-by:  Yael Kalka <[EMAIL PROTECTED]>
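The validity rule that the patch below applies to persistent guid2lid entries can be illustrated in isolation. This is a minimal C sketch, not OpenSM code: `lid_range_is_valid()` is a hypothetical helper that mirrors the `num_lids`/`lmc_mask` test added to `__osm_lid_mgr_init_sweep()`. With lmc = 2, a non-switch port needs 4 LIDs, so a stored range must start on a multiple of 4 and span at least 4 LIDs; switch ports always get a single LID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not an OpenSM function): mirrors the validity test
 * the patch applies to a persistent guid2lid entry [min_lid, max_lid]. */
static bool lid_range_is_valid(uint16_t min_lid, uint16_t max_lid,
                               uint8_t lmc, bool is_switch)
{
    uint16_t num_lids;
    uint16_t lmc_mask;

    assert(lmc <= 7);                      /* LMC is a 3-bit field */

    /* Switch ports get one LID; other ports get 2^lmc LIDs. */
    num_lids = is_switch ? 1 : (uint16_t)(1u << lmc);

    /* Clears the low lmc bits; all ones (0xFFFF) when lmc == 0. */
    lmc_mask = (uint16_t)~((1u << lmc) - 1u);

    if (num_lids == 1)
        return true;                       /* no alignment needed */
    if ((min_lid & lmc_mask) != min_lid)
        return false;                      /* base LID not aligned */
    if (max_lid < min_lid || max_lid - min_lid + 1 < num_lids)
        return false;                      /* range too narrow */
    return true;
}
```

For example, with lmc = 2 the range [4:7] passes, while [5:8] (unaligned base) and [4:5] (only 2 LIDs wide) would be cleaned from the database. The patch also changes the persistent-LID check from `lid < max_persistent_lid` to `lid <= max_persistent_lid`, since the maximum is an inclusive bound.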

Index: opensm/osm_lid_mgr.c
===================================================================
--- opensm/osm_lid_mgr.c	(revision 3848)
+++ opensm/osm_lid_mgr.c	(working copy)
@@ -337,7 +337,7 @@ __osm_lid_mgr_init_sweep(
   uint16_t max_defined_lid;
   uint16_t max_persistent_lid;
   uint16_t max_discovered_lid;
-  uint16_t lid, l;
+  uint16_t lid;
   uint16_t disc_min_lid;
   uint16_t disc_max_lid;
   uint16_t db_min_lid;
@@ -349,16 +349,23 @@ __osm_lid_mgr_init_sweep(
   osm_port_t  *p_port;
   cl_qmap_t   *p_port_guid_tbl;
   uint8_t  lmc_num_lids = (uint8_t)(1 << p_mgr->p_subn->opt.lmc);
+  uint16_t lmc_mask;
+  uint16_t req_lid, num_lids;
   
   OSM_LOG_ENTER( p_mgr->p_log, __osm_lid_mgr_init_sweep );
 
+  if (p_mgr->p_subn->opt.lmc)
+lmc_mask = ~((1 << p_mgr->p_subn->opt.lmc) - 1);
+  else
+lmc_mask = 0xFFFF;
+
   /* if we came out of standby we need to discard any previous guid 2 lid
  info we might had */
   if ( p_mgr->p_subn->coming_out_of_standby == TRUE )
   {
 osm_db_clear( p_mgr->p_g2l );
 for (lid = 0; lid < cl_ptr_vector_get_size(&p_mgr->used_lids); lid++)
-  cl_ptr_vector_set(&p_mgr->used_lids, lid, NULL);
+  cl_ptr_vector_set(p_persistent_vec, lid, NULL);
   }
 
   /* we need to cleanup the empty ranges list */
@@ -375,7 +382,7 @@ __osm_lid_mgr_init_sweep(
 
   /* we if are on the first sweep and in re-assign lids mode 
  we should ignore all the available info and simply define one 
- hufe empty range */
+ huge empty range */
   if ((p_mgr->p_subn->first_time_master_sweep == TRUE) &&
   (p_mgr->p_subn->opt.reassign_lids == TRUE ))
   {
@@ -398,6 +405,34 @@ __osm_lid_mgr_init_sweep(
 osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid);
 for (lid = disc_min_lid; lid <= disc_max_lid; lid++)
   cl_ptr_vector_set(p_discovered_vec, lid, p_port );
+/* make sure the guid2lid entry is valid. If not - clean it. */
+if (!osm_db_guid2lid_get( p_mgr->p_g2l,
+  cl_ntoh64(osm_port_get_guid(p_port)),
+  &db_min_lid, &db_max_lid))
+{
+  if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) !=
+   IB_NODE_TYPE_SWITCH)
+num_lids = lmc_num_lids;
+  else
+num_lids = 1;
+
+  if ((num_lids != 1) &&
+  (((db_min_lid & lmc_mask) != db_min_lid) ||
+   (db_max_lid - db_min_lid + 1 < num_lids)) )
+  {
+/* Not aligned, or not wide enough - remove the entry */
+osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
+ "__osm_lid_mgr_init_sweep: "
+ "Cleaning persistent entry for guid:0x%016" PRIx64
+ " illegal range:[0x%x:0x%x] \n",
+ cl_ntoh64(osm_port_get_guid(p_port)), db_min_lid,
+ db_max_lid );
+osm_db_guid2lid_delete( p_mgr->p_g2l,
+cl_ntoh64(osm_port_get_guid(p_port)));
+for ( lid = db_min_lid ; lid <= db_max_lid ; lid++ )
+  cl_ptr_vector_set(p_persistent_vec, lid, NULL);
+  }
+}
   }
 
   /* 
@@ -434,7 +469,7 @@ __osm_lid_mgr_init_sweep(
   {
 is_free = TRUE;
 /* first check to see if the lid is used by a persistent assignment */
-if ((lid < max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, lid))
+if ((lid <= max_persistent_lid) && cl_ptr_vector_get(p_persistent_vec, lid))
 {
   osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
"__osm_lid_mgr_init_sweep: "
@@ -442,62 +477,86 @@ __osm_lid_mgr_init_sweep(
lid);
   is_free = FALSE;
 }
-
-/* check the discovered port if there is one */
-if ((lid < max_discovered_lid) &&
-(p_port = (osm_port_t *)cl_ptr_vector_get(p_discovered_vec, lid)))
+else
 {
-  /* get the lid range of that port  - but we know how many lids we 
- are about to assign to it */
-  osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid);
-  if ( osm_node_get_type( osm_port_get_parent_node( p_port ) ) !=
-   IB_NODE_TYPE_SWITCH)
-disc_max_lid = disc_min_lid + lmc_num_lids - 1;
-
+  /* check this is a discovered port */
+  CL_ASSERT(lid <= max_discovered_lid);
+  if ((p_port = (osm_port_t *)cl_ptr_vector_get(p_discovered_vec, lid)))
+  {
+/* we have a port. Now lets see 

[openib-general] Re: Automated userspace build error

2005-10-27 Thread Michael S. Tsirkin
Quoting r. Nishanth Aravamudan <[EMAIL PROTECTED]>:
> Subject: Re: Automated userspace build error
> 
> On 25.10.2005 [15:22:56 -0700], Roland Dreier wrote:
> > Nishanth> Hrm, well, I'm testing the latest svn (3865), did the
> > Nishanth> patch just get checked in?
> > 
> > Yeah, I only noticed it and fixed it after your original email.  I
> > just meant that I had already checked it in before sending my reply.
> > Sorry for the confusion...
> 
> No worries, I figured that's what happened.
> 
> On a related note, do you (or anyone else) have any suggestions for
> build-testing all of the userspace components? There isn't a top-level
> Makefile of any kind to make it easy :/
> 
> Thanks,
> Nish

Yes, look at scripts in
https://openib.org/svn/trunk/contrib/mellanox/scripts

You can also, basically, cut and paste stuff from the FAQ page,
but that relies on performing make as root.

-- 
MST


RE: [openib-general] [RFC] OpenSM Interactive Console

2005-10-27 Thread Hal Rosenstock
There have been requests for this CLI functionality from at least the labs.
It has been discussed on the list.
 
Also, there was the following comment in OpenSM::main.c:
 
/*
Sit here forever
In the future, some sort of console interactivity could
be implemented in this loop.
*/
 
-- Hal
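As a rough illustration of what "console interactivity" in that loop might look like, here is a minimal C sketch of a line-oriented command loop. Everything in it is hypothetical: `console_run()` and the `help`/`quit` commands are made up for illustration and are not an existing OpenSM interface or the design proposed in this RFC.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical console loop; "help" and "quit" are illustrative commands,
 * not an existing OpenSM interface. Returns the number of lines read,
 * including the terminating "quit" if one was seen. */
static int console_run(FILE *in, FILE *out)
{
    char line[128];
    int lines_read = 0;

    assert(in != NULL && out != NULL);

    fputs("opensm> ", out);
    while (fgets(line, sizeof(line), in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the newline */
        lines_read++;

        if (strcmp(line, "quit") == 0)
            break;                            /* stop reading commands */
        else if (strcmp(line, "help") == 0)
            fputs("commands: help, quit\n", out);
        else if (line[0] != '\0')
            fprintf(out, "unknown command: %s\n", line);

        fputs("opensm> ", out);
    }
    return lines_read;
}
```

Taking `FILE *` arguments rather than reading `stdin` directly keeps the loop testable. A real console would also have to coexist with the SM's sweeping threads, which this sketch ignores.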



From: Eitan Zahavi [mailto:[EMAIL PROTECTED]
Sent: Thu 10/27/2005 2:03 AM
To: Hal Rosenstock; Eitan Zahavi
Cc: Troy Benjegerdes; openib-general@openib.org
Subject: RE: [openib-general] [RFC] OpenSM Interactive Console



Yes, this MIB needs some cleanup. 
I would love to hear some feedback from the community regarding the
usefulness of the SM MIB. 

In the past we did not get any push for an interactive or online-configurable
SM, so I did not see any reason to work on it. 

I do not think it is a huge task to make the SM MIB work with OpenSM, at least
not the 90% of it that I glanced through. 


Eitan Zahavi 
Design Technology Director 
Mellanox Technologies LTD 
Tel:+972-4-9097208 
Fax:+972-4-9593245 
P.O. Box 586 Yokneam 20692 ISRAEL 


> -----Original Message----- 
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, October 26, 2005 7:44 PM 
> To: Eitan Zahavi 
> Cc: Troy Benjegerdes; openib-general@openib.org 
> Subject: RE: [openib-general] [RFC] OpenSM Interactive Console 
> 
> Hi Eitan, 
> 
> I sit corrected. There are R/W parameters in the SM MIB as you indicate. I 
> was 
> thinking of all the other IPoIB MIBs. It's been a while since I looked at the 
> SM MIB. 
> 
> Also, the SM MIB (draft-ietf-ipoib-subnet-manager-mib-00) expired a while ago. 
> At a minimum, it needs to be dusted off. That would include updating it for 
> IBA 1.2. 
> 
> -- Hal 
> 
>  
> 
> From: Eitan Zahavi [mailto:[EMAIL PROTECTED] 
> Sent: Tue 10/25/2005 5:19 AM 
> To: Hal Rosenstock 
> Cc: Troy Benjegerdes; openib-general@openib.org 
> Subject: Re: [openib-general] [RFC] OpenSM Interactive Console 
> 
> 
> 
> Hal Rosenstock wrote: 
> > On Mon, 2005-10-24 at 14:38, Eitan Zahavi wrote: 
> > 
> >>Hal Rosenstock wrote: 
> >> 
> >>>On Mon, 2005-10-24 at 03:08, Eitan Zahavi wrote: 
> >>> 
> >>> 
> I would suggest to use SNMP for the tasks below. IETF IPoIB group has 
> defined an SNMP MIB that can support the required functionality below. 
> > 
> >>> 
> >>>The IETF SNMP MIBs are one way of presenting the information to the 
> >>>outside world. There are other possible management interfaces. The 
> >>>SNMP MIB instrumentation would need to use lower layer APIs to get this 
> >>>information out of the SM. 
> >> 
> >>Yes but the IETF SM MIB is the only one that is close to a standard way. 
> >>It does not require a low-level interface if it is integrated into the OpenSM code. 
> >>One way to do it is by extending OpenSM with an AgentX interface. 
> >> 
> >>IMO one clear advantage of using SNMP for SM integration is that the code will work with any SM that is IETF compliant. 
> >>Also if you want to write a "client server" type of application on top of an SM you 
> >>can either stick to sending MADs, which translates into an SA client-based application, or 
> >>you had better stay with some known protocol for management (like SNMP) and not develop yet another protocol for 
> >>doing exactly the same things as SNMP already supports. 
> > 
> > 
> > There are limitations in the SNMP MIBs. One is that they are RO so they 
> > are more for monitoring. Also, many environments do not use SNMP. It is 
> > unclear how much of a requirement it is to manage any SM or how many 
> > other SMs support the SM MIB. (There are other IB associated MIBs too). 
> 
> SNMP MIBs are certainly not just RO; a simple example from the SM MIB: 
>ibSmPortInfoLMC   OBJECT-TYPE 
>SYNTAX  Unsigned32(0..7) 
>MAX-ACCESS  read-write 
>STATUS  current 
>DESCRIPTION 
>   "LID mask for multipath support.  User should take extra caution 
>   when setting this value, since any change will effect packet 
>   routing." 
>::= { ibSmPortInfoEntry 19 } 
> 
> 
> I agree that it is possible that currently no SM is supporting the SM MIB. 
> But it does make sense to have ALL of them support it, so that they can all 
> be activated/deactivated and configured in the same manner. 
> 
> Most Unix distributions and Windows boxes have a standard SNMP agent and client 
> included in them, so it does not take more than simple bash or C code to interact 
> with the SM if it supports SNMP. 
> 
> > 
> > 
> Everything but the dynamic partitioning (OpenSM does not have a 
> partition manager at this moment) 
> >>> 
> >>> 
> >>>What Troy meant by partitioning is not necessarily IB partitioning. 
> >> 
> >>How are you sure about that? Troy - please comment. 
> > 
> > 
> > I think you missed an email on th