Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-27 Thread Tom Tucker

On Tue, 2007-11-27 at 18:15 +0100, Adrian Bunk wrote:
 On Mon, Nov 26, 2007 at 11:35:42PM -0600, Tom Tucker wrote:
  On Tue, 2007-11-27 at 15:49 +1100, Rusty Russell wrote:
 ...
   No.  That's the wrong question.  What's the real upside?
  
  Explicitly documenting what comprises the kernel API (external,
  supported) and what comprises the kernel implementation (internal, not
  supported).
 ...
 
 There is not, never was, and never will be, any supported external API 
 of the kernel.

Philosophically I understand what you're saying, but in practical terms
there is the issue of managing core API like kmalloc. Although kmalloc
_could_ change, doing so would be extremely painful. In fact anyone who
proposed such a change would have to have a profoundly powerful argument
as to why it was necessary.

I think this patchset is an attempt to make it easier to identify and
review these kinds of interfaces.

  
 cu
 Adrian
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] [1/9] Core module symbol namespaces code and intro.

2007-11-26 Thread Tom Tucker

On Tue, 2007-11-27 at 15:49 +1100, Rusty Russell wrote:
 On Monday 26 November 2007 17:15:44 Roland Dreier wrote:
Except C doesn't have namespaces and this mechanism doesn't create them.
 So this is just complete and utter makework; as I said before, noone's
going to confuse all those udp_* functions if they're not in the udp
namespace.
 
  I don't understand why you're so opposed to organizing the kernel's
  exported symbols in a more self-documenting way.
 
 No, I was the one who moved exports near their declarations.  That's 
 organised.  I just don't see how this new organization will help: oh good, 
 I won't accidentally use the udp functions any more?!?
 
  It seems pretty   
  clear to me that having a mechanism that requires modules to make
  explicit which (semi-)internal APIs makes reviewing easier
 
 Perhaps you've got lots of patches were people are using internal APIs they 
 shouldn't?
 

Maybe the issue is who can tell since what is external and what is
internal is not explicitly defined?

  , makes it 
  easier to communicate please don't use that API to module authors,
 
 Well, introduce an EXPORT_SYMBOL_INTERNAL().  It's a lot less code.  But 
 you'd 
 still need to show that people are having trouble knowing what APIs to use.

  and takes at least a small step towards bringing the kernel's exported
  API under control.
 
 There is no exported API to bring under control.  

Hmm...apparently, there are those that are struggling...

 There are symbols we 
 expose for the kernel's own use which can be used by external modules at 
 their own risk.  
 
  What's the real downside? 
 
 No.  That's the wrong question.  What's the real upside?

Explicitly documenting what comprises the kernel API (external,
supported) and what comprises the kernel implementation (internal, not
supported).

 
 Let's not put code in the core because it doesn't seem to hurt.
 

agreed.

 I'm sure you think there's a real problem, but I'm still waiting for someone 
 to *show* it to me.  Then we can look at solutions.

I think the benefits should include:

- forcing developers to identify their exports as part of the
implementation or as part of the kernel API

- making it easier for reviewers to identify when developers are adding
to the kernel API and thereby focusing the appropriate level of review
to the new function

- making it obvious to developers when they are binding their
implementation to a particular kernel release



 Rusty.
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-16 Thread Tom Tucker
On Wed, 2007-08-15 at 22:26 -0400, Jeff Garzik wrote:

[...snip...]

  I think removing the RDMA stack is the wrong thing to do, and you 
  shouldn't just threaten to yank entire subsystems because you don't like 
  the technology.  Lets keep this constructive, can we?  RDMA should get 
  the respect of any other technology in Linux.  Maybe its a niche in your 
  opinion, but come on, there's more RDMA users than say, the sparc64 
  port.  Eh?
 
 It's not about being a niche.  It's about creating a maintainable 
 software net stack that has predictable behavior.

Isn't RDMA _part_ of the software net stack within Linux? Why isn't
making RDMA stable, supportable and maintainable equally as important as
any other subsystem? 

 
 Needing to reach out of the RDMA sandbox and reserve net stack resources 
 away from itself travels a path we've consistently avoided.
 
 
  I will NACK any patch that opens up sockets to eat up ports or
  anything stupid like that.
  
  Got it.
 
 Ditto for me as well.
 
   Jeff
 
 
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] amso1100: QP init bug in amso driver

2007-07-24 Thread Tom Tucker
Roland:

The guys at UNH found this and fixed it. I'm surprised no
one has hit this before. I guess it only breaks when the 
refcount on the QP is non-zero.

Initialize the wait_queue_head_t in the c2_qp structure.

Signed-off-by: Ethan Burns [EMAIL PROTECTED]
Acked-by: Tom Tucker [EMAIL PROTECTED]

---
 drivers/infiniband/hw/amso1100/c2_qp.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c 
b/drivers/infiniband/hw/amso1100/c2_qp.c
index 420c138..01d0786 100644
--- a/drivers/infiniband/hw/amso1100/c2_qp.c
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -506,6 +506,7 @@ int c2_alloc_qp(struct c2_dev *c2dev,
qp-send_sgl_depth = qp_attrs-cap.max_send_sge;
qp-rdma_write_sgl_depth = qp_attrs-cap.max_send_sge;
qp-recv_sgl_depth = qp_attrs-cap.max_recv_sge;
+   init_waitqueue_head(qp-wait);
 
/* Initialize the SQ MQ */
q_size = be32_to_cpu(reply-sq_depth);

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [openib-general] [PATCH 3/9] NetEffect 10Gb RNIC Driver: openfabrics connection manager c file

2006-10-27 Thread Tom Tucker

[...snip...]
 +extern void set_interface(
 +UINT32ip_addr,

These should probably be the standard linux types u32, or uint32

 +UINT32mask,
 +UINT32bcastaddr,
 +UINT32type
 +   );

[...snip...]

 + struct NES_sockaddr_in  inet_addr;
 + struct sockaddr_in  kinet_addr;

Is there some reason why you need your own sockaddr and sockaddr_in
structures? 

[...snip...]
 +
 +/**
 + * nes_disconnect
 + * 
 + * @param cm_id
 + * @param abrupt
 + * 
 + * @return int
 + */
 +int nes_disconnect(struct iw_cm_id *cm_id, int abrupt)
 +{
 + struct ib_qp_attr attr;
 + struct ib_qp *ibqp;
 + struct nes_qp *nesqp;
 + struct nes_dev *nesdev = to_nesdev(cm_id-device);
 + int err = 0;
 + u8 u8temp;
 +
 + dprintk(%s:%s:%u\n, __FILE__, __FUNCTION__, __LINE__);
 + dprintk(%s: netdev refcnt = %u.\n, __FUNCTION__,
 atomic_read(nesdev-netdev-refcnt));
 +
 + /* If the qp was already destroyed, then there's no QP */
 + if (cm_id-provider_data == 0)
 + return 0;
 +
 + nesqp = (struct nes_qp *)cm_id-provider_data;
 + ibqp = nesqp-ibqp;
 +
 + /* Disassociate the QP from this cm_id */
 + cm_id-provider_data = 0;
 + cm_id-rem_ref(cm_id);
 + nesqp-cm_id = 0;
 +
 + stack_ops_p-decelerate_socket(nesqp-socket, 
 +(struct nes_uploaded_qp_context *)
 +nesqp-nesqp_context);
 +  
 + if (nesqp-active_conn) {
 +   u8temp = 1  (ntohs(cm_id-local_addr.sin_port)7);
 +   nesdev-apbv_table[ntohs(cm_id-local_addr.sin_port)3] =
 ~(u8temp);
 + } else {
 + dev_put(nesdev-netdev);
 +/* Need to free the Last Streaming Mode Message */
 +pci_free_consistent(nesdev-pcidev, 
 +
 nesqp-private_data_len+sizeof(*nesqp-ietf_frame), 
 +nesqp-ietf_frame,
 +nesqp-ietf_frame_pbase);

This is mailer perversion. You need to turn off wrapping in your mailer.
It makes it hard to review the patch never mind apply it.

 +}
 +
 + if (nesqp-ksock) sock_release(nesqp-ksock);
 + stack_ops_p-sock_ops_p-close( nesqp-socket );
 + nesqp-ksock = 0;
 + nesqp-socket = 0;
 + if (nesqp-wq) {
 + destroy_workqueue(nesqp-wq);

This will deadlock if this function is called from a workqueue thread
and CONFIG_HOTPLUG_CPU is enabled. 

 + nesqp-wq = NULL;
 + }
 +
 + memset(attr, 0, sizeof(struct ib_qp_attr));
 + if (abrupt)
 + attr.qp_state = IB_QPS_ERR;
 + else
 + attr.qp_state = IB_QPS_SQD;
 +
 + return err;
 +}
 +
 +
 +/**
 + * nes_accept
 + * 
 + * @param cm_id
 + * @param conn_param
 + * 
 + * @return int
 + */
 +int nes_accept(struct iw_cm_id *cm_id, struct iw_cm_conn_param
 *conn_param)
 +{
 + struct nes_qp *nesqp;
 + struct nes_dev *nesdev;
 + struct nes_adapter *nesadapter;
 + struct ib_qp *ibqp;
 +struct nes_hw_qp_wqe *wqe;
 + struct nes_v4_quad nes_quad;
 + struct ib_qp_attr attr;
 +struct iw_cm_event cm_event;
 +
 + dprintk(%s:%s:%u: data len = %u\n, 
 + __FILE__, __FUNCTION__, __LINE__,
 conn_param-private_data_len);
 +
 + ibqp = nes_get_qp(cm_id-device, conn_param-qpn);
 + if (!ibqp)
 + return -EINVAL;
 + nesqp = to_nesqp(ibqp);
 + nesdev = to_nesdev(nesqp-ibqp.device);
 + nesadapter = nesdev-nesadapter;
 + dprintk(%s: netdev refcnt = %u.\n, __FUNCTION__,
 atomic_read(nesdev-netdev-refcnt));
 +
 +nesqp-ietf_frame = pci_alloc_consistent(nesdev-pcidev, 
 +
 sizeof(*nesqp-ietf_frame)+conn_param-private_data_len,
 + nesqp-ietf_frame_pbase);
 +if (!nesqp-ietf_frame) {
 +dprintk(KERN_ERR PFX %s: Unable to allocate memory for private
 data\n, __FUNCTION__);
 +return -ENOMEM;
 +}
 +dprintk(PFX %s: PCI consistent memory for 
 +private data located @ %p (pa = 0x%08lX.) size = %u.\n, 
 +__FUNCTION__, nesqp-ietf_frame, (unsigned
 long)nesqp-ietf_frame_pbase,
 +conn_param-private_data_len+sizeof(*nesqp-ietf_frame));
 +nesqp-private_data_len = conn_param-private_data_len;
 +
 +strcpy(nesqp-ietf_frame-key[0], IEFT_MPA_KEY_REP);
 +memcpy(nesqp-ietf_frame-private_data, conn_param-private_data,
 conn_param-private_data_len);
 +nesqp-ietf_frame-private_data_size =
 cpu_to_be16(conn_param-private_data_len);
 +nesqp-ietf_frame-rev = mpa_version;
 +nesqp-ietf_frame-flags = IETF_MPA_FLAGS_CRC;
 +
 +wqe = nesqp-hwqp.sq_vbase[0];
 +*((struct nes_qp
 **)wqe-wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX]) = nesqp;
 + *((u64 *)wqe-wqe_words[NES_IWARP_SQ_WQE_COMP_CTX_LOW_IDX]) |=
 NES_SW_CONTEXT_ALIGN1;
 +wqe-wqe_words[NES_IWARP_SQ_WQE_MISC_IDX] =
 

[PATCH] Add spinlocks to serialize ib_post_send/ib_post_recv

2006-10-03 Thread Tom Tucker
From: Tom Tucker [EMAIL PROTECTED]

The AMSO driver was not thread-safe in the post WR code and had
code that would sleep if the WR post FIFO was full. Since these
functions can be called on interrupt level I changed the sleep to a
udelay.

Signed-off-by: Tom Tucker [EMAIL PROTECTED]
---

 drivers/infiniband/hw/amso1100/c2_qp.c |   15 +++
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2_qp.c 
b/drivers/infiniband/hw/amso1100/c2_qp.c
index 1226113..681c130 100644
--- a/drivers/infiniband/hw/amso1100/c2_qp.c
+++ b/drivers/infiniband/hw/amso1100/c2_qp.c
@@ -35,6 +35,7 @@
  *
  */
 
+#include linux/delay.h
 #include c2.h
 #include c2_vq.h
 #include c2_status.h
@@ -705,10 +706,8 @@ static inline void c2_activity(struct c2
 * cannot get on the bus and the card and system hang in a
 * deadlock -- thus the need for this code. [TOT]
 */
-   while (readl(c2dev-regs + PCI_BAR0_ADAPTER_HINT)  0x8000) {
-   set_current_state(TASK_UNINTERRUPTIBLE);
-   schedule_timeout(0);
-   }
+   while (readl(c2dev-regs + PCI_BAR0_ADAPTER_HINT)  0x8000)
+   udelay(10);
 
__raw_writel(C2_HINT_MAKE(mq_index, shared),
 c2dev-regs + PCI_BAR0_ADAPTER_HINT);
@@ -766,6 +765,7 @@ int c2_post_send(struct ib_qp *ibqp, str
struct c2_dev *c2dev = to_c2dev(ibqp-device);
struct c2_qp *qp = to_c2qp(ibqp);
union c2wr wr;
+   unsigned long lock_flags;
int err = 0;
 
u32 flags;
@@ -881,8 +881,10 @@ int c2_post_send(struct ib_qp *ibqp, str
/*
 * Post the puppy!
 */
+   spin_lock_irqsave(qp-lock, lock_flags);
err = qp_wr_post(qp-sq_mq, wr, qp, msg_size);
if (err) {
+   spin_unlock_irqrestore(qp-lock, lock_flags);
break;
}
 
@@ -890,6 +892,7 @@ int c2_post_send(struct ib_qp *ibqp, str
 * Enqueue mq index to activity FIFO.
 */
c2_activity(c2dev, qp-sq_mq.index, qp-sq_mq.hint_count);
+   spin_unlock_irqrestore(qp-lock, lock_flags);
 
ib_wr = ib_wr-next;
}
@@ -905,6 +908,7 @@ int c2_post_receive(struct ib_qp *ibqp, 
struct c2_dev *c2dev = to_c2dev(ibqp-device);
struct c2_qp *qp = to_c2qp(ibqp);
union c2wr wr;
+   unsigned long lock_flags;
int err = 0;
 
if (qp-state  IB_QPS_RTS)
@@ -945,8 +949,10 @@ int c2_post_receive(struct ib_qp *ibqp, 
break;
}
 
+   spin_lock_irqsave(qp-lock, lock_flags);
err = qp_wr_post(qp-rq_mq, wr, qp, qp-rq_mq.msg_size);
if (err) {
+   spin_unlock_irqrestore(qp-lock, lock_flags);
break;
}
 
@@ -954,6 +960,7 @@ int c2_post_receive(struct ib_qp *ibqp, 
 * Enqueue mq index to activity FIFO
 */
c2_activity(c2dev, qp-rq_mq.index, qp-rq_mq.hint_count);
+   spin_unlock_irqrestore(qp-lock, lock_flags);
 
ib_wr = ib_wr-next;
}
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 1/2] iWARP Connection Manager.

2006-08-30 Thread Tom Tucker
On Wed, 2006-08-30 at 10:35 -0700, Roland Dreier wrote:
 OK, getting closer to finishing the merge...
 
 anyway, why is iw_cm_private.h in include/rdma where it is visible
 everywhere?  As far as I can tell drivers/infiniband/core/iwcm.c is
 the only place it's included.  So why not just put this stuff in
 drivers/infiniband/core/iwcm.h and do

The data structures really belong in iwcm.c...but I have a KDB module
that dumps IB data structures. So when I was writing the IWCM, I pulled
them out where I could see them without include gymnastics. It seems
pretty dumb though to have header files called *private.h in a public
directory. Putting them in iwcm.h is fine with me... 

Here's a patch for the KDB code if anyone is interested...

KDB module for dumping OpenFabrics stack data types

From: Tom Tucker [EMAIL PROTECTED]


---

 kdb/kdbmain.c  |2 
 kdb/modules/Makefile   |3 
 kdb/modules/kdbm_openfabrics.c |  372

 3 files changed, 375 insertions(+), 2 deletions(-)

diff --git a/kdb/kdbmain.c b/kdb/kdbmain.c
index 931b643..b35139a 100644
--- a/kdb/kdbmain.c
+++ b/kdb/kdbmain.c
@@ -1154,8 +1154,8 @@ kdb_quiet(int reason)
  * none
  */
 
+void kdba_cpu_up(void) {};
 extern char kdb_prompt_str[];
-
 static int
 kdb_local(kdb_reason_t reason, int error, struct pt_regs *regs,
kdb_dbtrap_t db_result)
 {
diff --git a/kdb/modules/Makefile b/kdb/modules/Makefile
index ae2ac53..fbf05e1 100644
--- a/kdb/modules/Makefile
+++ b/kdb/modules/Makefile
@@ -6,7 +6,8 @@ #
 # Copyright (c) 1999-2006 Silicon Graphics, Inc.  All Rights Reserved.
 #
 
-obj-$(CONFIG_KDB_MODULES) += kdbm_pg.o kdbm_task.o kdbm_vm.o
kdbm_sched.o
+obj-$(CONFIG_KDB_MODULES) += kdbm_pg.o kdbm_task.o kdbm_vm.o
kdbm_sched.o \
+   kdbm_openfabrics.o
 ifdef CONFIG_X86
 ifndef CONFIG_X86_64
 obj-$(CONFIG_KDB_MODULES) += kdbm_x86.o
diff --git a/kdb/modules/kdbm_openfabrics.c
b/kdb/modules/kdbm_openfabrics.c
new file mode 100644
index 000..fdf204b
--- /dev/null
+++ b/kdb/modules/kdbm_openfabrics.c
@@ -0,0 +1,372 @@
+/*
+ * Copyright (c) 2006 Tom Tucker, Open Grid Computing, Inc. 
+ */
+
+#include linux/kdb.h
+#include linux/kdbprivate.h
+#include linux/module.h
+#include linux/init.h
+#include linux/sched.h
+#include rdma/ib_verbs.h
+#include rdma/rdma_cm.h
+#include rdma/iw_cm.h
+#include rdma/iw_cm_private.h
+#include ../drivers/infiniband/hw/amso1100/c2_provider.h
+
+MODULE_AUTHOR(Tom Tucker);
+MODULE_DESCRIPTION(Debug RDMA);
+MODULE_LICENSE(Dual BSD/GPL);
+
+static const char *wc_status_str[] = {
+   SUCCESS,
+   LOC_LEN_ERR,
+   LOC_QP_OP_ERR,
+   LOC_EEC_OP_ERR,
+   LOC_PROT_ERR,
+   WR_FLUSH_ERR,
+   MW_BIND_ERR,
+   BAD_RESP_ERR,
+   LOC_ACCESS_ERR,
+   REM_INV_REQ_ERR,
+   REM_ACCESS_ERR,
+   REM_OP_ERR,
+   RETRY_EXC_ERR,
+   RNR_RETRY_EXC_ERR,
+   LOC_RDD_VIOL_ERR,
+   REM_INV_RD_REQ_ERR,
+   REM_ABORT_ERR,
+   INV_EECN_ERR,
+   INV_EEC_STATE_ERR,
+   FATAL_ERR,
+   RESP_TIMEOUT_ERR,
+   GENERAL_ERR
+};
+
+static inline const char *wc_status_to_str(int status)
+{
+   if (status  (sizeof(wc_status_str) / sizeof(wc_status_str[0])))
+   return bad status;
+
+   return wc_status_str[status];
+}
+
+static const char *wc_opcode_str[] = {
+   SEND,
+   RDMA_WRITE,
+   RDMA_READ,
+   COMP_SWAP,
+   FETCH_ADD,
+   BIND_MW,
+   RECV,
+   RECV_RDMA_WITH_IMM,
+};
+
+static inline const char* wc_opcode_to_str(int op) 
+{
+   
+   if (op  129)
+   return bad opcode;
+   else if (op = 128)
+   op -= 122;
+   return wc_opcode_str[op];
+}
+
+static int
+print_ib_wc(int argc, const char **argv, const char **envp,
+   struct pt_regs *regs)
+{
+   int ret = 0;
+
+   if (argc == 1) {
+   kdb_machreg_t addr;
+   int nextarg = 1;
+   long offset = 0;
+   struct ib_wc wc;
+
+   ret = kdbgetaddrarg(argc, argv, nextarg, addr, 
+   offset, NULL, regs);
+   if (ret) 
+   return ret;
+
+   kdb_printf(struct ib_wc [%p]\n, (void*)addr);
+   ret = kdb_getarea_size((void*)wc, (unsigned long)addr, 
+  sizeof wc);
+   if (ret)
+   return ret;
+
+   kdb_printf(  wr_id  : %llx\n, wc.wr_id);
+   kdb_printf(  status : \%s\\n,
wc_status_to_str(wc.status));
+   kdb_printf(  opcode : %s\n, 
wc_opcode_to_str(wc.opcode));
+   kdb_printf(  vendor_err : %d\n, wc.vendor_err);
+   kdb_printf(  byte_len   : %d\n, wc.byte_len);
+   kdb_printf(  imm_data   : %d\n, wc.imm_data);
+   kdb_printf(  qp_num : %d\n, wc.qp_num);
+   kdb_printf(  src_qp

Re: RDMA will be reverted

2006-07-25 Thread Tom Tucker
On Mon, 2006-07-24 at 15:23 -0700, David Miller wrote: 
 From: Tom Tucker [EMAIL PROTECTED]
 Date: Wed, 05 Jul 2006 12:09:42 -0500
 
  A TOE net stack is closed source firmware. Linux engineers have no way
  to fix security issues that arise. As a result, only non-TOE users will
  receive security updates, leaving random windows of vulnerability for
  each TOE NIC's users.
  
  - A Linux security update may or may not be relevant to a vendors
  implementation. 
  
  - If a vendor's implementation has a security issue then the customer
  must rely on the vendor to fix it. This is no less true for iWARP than
  for any adapter.
 
 This isn't how things actually work.
 
 Users have a computer, and they can rightly expect the community
 to help them solve problems that occur in the upstream kernel.
 
 When a bug is found and the person is using NIC X, we don't
 necessarily forward the bug report to the vendor of NIC X.
 Instead we try to fix the bug.  Many chip drivers are maintained
 by people who do not work for the company that makes the chip,
 and this works just fine.
 
 If only the chip vendor can fix a security problem, this makes Linux
 less agile to fix.  Even aspect of a problem on a Linux system that
 cannot be fixed entirely by the community is a net negative for Linux.
 

All true. What I meant to say was that this is no less true than for
any deep adapter. It is incontrovertible that a deep adapter is less
flexible, and more difficult to support than a shallow adapter.

  - iWARP needs to do protocol processing in order to validate and
  evaluate TCP payload in advance of direct data placement. This
  requirement is independent of CPU speed. 
 
 Yet, RDMA itself is just an optimization meant to deal with
 limitations of cpu and memory speed.  You can rephrase the
 situation in whatever way suits your argument, but it does not
 make the core issue go away :)

Yep.

 
  - I suspect that connection rates for RDMA adapters fall well-below the
  rates attainable with a dumb device. That said, all of the RDMA
  applications that I know of are not connection intensive. Even for TOE,
  the later HTTP versions makes connection rates less of an issue.
 
 This is a very naive evaluation of the situation.  Yes, newer
 versions of protocols such as HTTP make the per-client connection
 burdon lower, but the number of clients will increase in time to
 more than makeup for whatever gains are seen due to this.

Naive is being kind, my HTTP comment is irrelevant :).  

 And then you have protocols which by design are connection heavy,
 and rightly so, such as bittorrent.
 
 Being able to handle enormous numbers of connections, with extreme
 scalability and low latency, is an absolute requirement of any modern
 day serious TCP stack.  And this requirement is not going away.
 Wishing this requirement away due to HTTP persistent connections is
 very unrealistic, at best.
 
  - This is the problem we're trying to solve...incrementally and
  responsibly.
 
 You can't.  See my email to Roland about why even VJ net channels
 are found to be impractical.  To support netfilter properly, you
 must traverse the whole netfilter stack, because NAT can rewrite
 packets, yet still make them destined for the local system, and
 thus they will have a different identity for connection demux
 by the time the TCP stack sees the packet.
 

I'm not claiming that all the problems can be solved, I'm suggesting
that integration could be better and that partial integration is better
than none. 

 All of these tranformations occur between normal packet receive
 and the TCP stack.  You would therefore need to put your card
 between netfilter and TCP in the packet input path, and at that
 point why bother with the stateful card at all?
 
 The fact is that stateless approaches will always be better than
 stateful things because you cannot replicate the functionality we
 have in the Linux stack without replicating 10 years of work into
 your chip's firmware.  At that point you should just run Linux
 on your NIC since that is what you are effectively doing :)
 

I wish...I'd have a better stack. 

 In conversations such as these, it helps us a lot if you can be frank
 and honest about the true absolute limitations of your technology.  

I'm trying ... classifying these limitations as core can't fix and
fixable with integration is where we're getting crosswise. 

 I
 can see that your viewpoint is tainted when I hear things such as HTTP
 persistent connections being used as a reason why high TCP connection
 rates won't matter in the future.  Such assertions are understood to
 be patently false by anyone who understands TCP and how it is used in
 the real world.

Partial Fixable with Integration Summary

- ARP Resolution
- ICMP Redirect
- Path MTU Change
- Route Update
- Colliding TCP Port Spaces

Partial Can't Fix Issues Summary:

- Many devices cannot support more than tens of thousands of concurrent
connections (16-64k would be typical

Re: RDMA will be reverted

2006-07-07 Thread Tom Tucker
On Thu, 2006-07-06 at 23:53 -0700, David Miller wrote:
 From: Tom Tucker [EMAIL PROTECTED]
 Date: Thu, 06 Jul 2006 00:25:03 -0500
 
  This patch is about dotting I's and crossing T's, it's not about
  foundations.
 
 You assume that I've flat out rejected RDMA, in fact I haven't.  I
 really don't have enough information to form a final opinion yet.
 There's about a week of emails on this topic which I need to read
 and digest first.

I realize that there is a tremendous amount of work out there and this
is just one thread.

 
 What I am saying, however, is that we need to understand the
 technology and the hooks you guys want before we put any of it in.

Absolutely.

 
 I don't think that's unreasonable.

Not at all. Let me know if I can help.

Tom



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-06 Thread Tom Tucker
On Fri, 2006-07-07 at 00:08 +1000, Herbert Xu wrote:
 Tom Tucker [EMAIL PROTECTED] wrote:
  
  All that said, the proposed patch helps not only iWARP, but other
  transports (iSCSI, IB) as well. It is not large, invasive,
 
 Care to explain on how it helps those other technologies?

The RDMA CMA uses IP addresses and port numbers to create a uniform
addressing scheme across all transport types. For IB, it is necessary to
resolve IP addresses to IB GIDs. The ARP protocol is used to do this and
a netfilter rule is installed to snoop the incoming ARP replies. This
would not be necessary if ARP events were provided as in the patch. 

Unified wire iSCSI adapters have the same issue as iWARP wrt to managing
IP addresses and ports.


 
 Cheers,

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-06 Thread Tom Tucker
On Fri, 2006-07-07 at 10:03 +1000, Herbert Xu wrote:
 On Thu, Jul 06, 2006 at 12:36:24PM -0500, Tom Tucker wrote:
  
  The RDMA CMA uses IP addresses and port numbers to create a uniform
  addressing scheme across all transport types. For IB, it is necessary to
  resolve IP addresses to IB GIDs. The ARP protocol is used to do this and
  a netfilter rule is installed to snoop the incoming ARP replies. This
  would not be necessary if ARP events were provided as in the patch. 
 
 Well the concerns we have do not apply to just iWARP, but RDMA/IP in
 general so this isn't really another technology.
 
 In fact, it seems that we now have IP-specific knowledge living in
 drivers/infiniband/core/cma.c which is suboptimal.

To be clear the CMA doesn't look in the ARP packet, it just uses the
existence of the packet as an indication that it should check to see if
the ARP request it submitted for an IP address has been resolved yet. I
agree that this is suboptimal and why I think the notifier is a nice
alternative. 

 
  Unified wire iSCSI adapters have the same issue as iWARP wrt to managing
  IP addresses and ports.
 
 If by Unified wire iSCSI you mean something that presents a SCSI interface
 together with an Ethernet interface where the two share the same MAC and
 IP address, 

Yes, this is what I mean. But the notifier doesn't actually support
this, you would need to expose the IP/port space database to solve that
problem.

What I was referring to relative to  iSCSI is if the adapter is relying
on Linux to do ARP via the above suboptimal implementation, then it
would benefit from the notifier patch.

 then we have the same concerns with it as we do with iWARP or
 TOE.

Yes indeed.

 
 Cheers,




-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-07-05 Thread Tom Tucker
On Sat, 2006-07-01 at 16:26 +0200, Andi Kleen wrote:
 On Saturday 01 July 2006 01:01, Tom Tucker wrote:
  On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote:
  
   The TOE folks have tried to submit their hooks and drivers
   on several occaisions, and we've rejected it every time.
  
  iWARP != TOE
 
 Perhaps a good start of that discussion David asked for would 
 be if you could give us an overview of the differences
 and how you avoid the TOE problems.

I think Roland already gave the high-level overview. For those
interested in some of the details, the API for iWARP transports was
originally conceived independently from IB and is documented in the
RDMAC Verbs Specification found here:

http://www.rdmaconsortium.org/home/draft-hilland-iwarp-verbs-v1.0-RDMAC.pdf

The protocols, etc... are available here:
http://www.ietf.org/html.charters/rddp-charter.html

As Roland mentioned, the RDMAC verbs are *very* similar to the IB verbs
and so when we were thinking about how to design an API for iWARP we
concluded it would be best to leverage the tremendous amount of work
already done for IB by OpenFabrics and then work iteratively to extend
this API to include features unique to iWARP. This work has been ongoing
since September of 2005. 

There is an open source svn repository available for the iWARP source at
https://openib.org/svn/gen2/branches/iwarp.

There is also an open source NFS over RDMA implementation for Linux
available here that: http://sourceforge.net/projects/nfs-rdma.


So how do we avoid the TOE pitfalls with iWARP? I think it depends on
the pitfall. At the low level:

- Stale Network/Address Information: Path MTU Change, ICMP Redirect 
and ARP next hop changes need netlink notifier events so that hardware
can be updated when they change. I see this support as an extension (new
events) to an existing service and a relatively low-level of parallel
stack integration. iSCSI and IB could also benefit from these events.

- Port Space Collision, i.e. socket app and rdma/iWARP apps collide on 
a port number: The RDMA CMA needs to be able to allocate and de-allocate
port numbers, however, the services that do this today are not exported
and would need some minor tweaking. iSCSI and IB benefit from these
services as well.

- netfilter rules, syn-flood, conn-rate, etc You pointed out that 
if connection establishment were done in the native stack (SYN,
SYN/ACK), that this would account for the bulk of the netfilter utility,
however, this probably results in falling into many of the TOE traps
people have issue with.

WRT to http://linux-net.osdl.org/index.php/TOE

Security Updates

A TOE net stack is closed source firmware. Linux engineers have no way
to fix security issues that arise. As a result, only non-TOE users will
receive security updates, leaving random windows of vulnerability for
each TOE NIC's users.

- A Linux security update may or may not be relevant to a vendors
implementation. 

- If a vendor's implementation has a security issue then the customer
must rely on the vendor to fix it. This is no less true for iWARP than
for any adapter.

Point-in-time Solution

Each TOE NIC has a limited lifetime of usefulness, because system
hardware rapidly catches up to TOE performance levels, and eventually
exceeds TOE performance levels. We saw this with 10mbit TOE, 100mbit
TOE, gigabit TOE, and soon with 10gig TOE.

- iWARP needs to do protocol processing in order to validate and
evaluate TCP payload in advance of direct data placement. This
requirement is independent of CPU speed. 

Different Network Behavior

System administrators are quite familiar with how the Linux network
stack interoperates with the world at large. TOE is a black box, each
NIC requires re-examination of network behavior. Network scanners and
analysis tools must be updated, or they will provide faulty analysis.

- Native Linux Tools like tcpdump, netstat, etc... will not work as
expected. 

- Network Analyzers such as Finisar, etc... will work just fine.

Performance

Experience has shown that TOE implementations require additional work
(programming the hardware, hardware-specific socket manipulation) to set
up and tear down connections. For connection intensive protocols such as
HTTP, TOE often underperforms.

- I suspect that connection rates for RDMA adapters fall well-below the
rates attainable with a dumb device. That said, all of the RDMA
applications that I know of are not connection intensive. Even for TOE,
the later HTTP versions makes connection rates less of an issue.

Hardware-specific limits 

TOE NICs are more resource limited than your overall computer system.
This is most readily apparent under load, when trying to support
thousands of simultaneous connections. TOE NICs simply do not have the
memory resources to buffer thousands of connections, much less have the
CPU power to handle such loads. Further, each TOE NIC has different
resource limitations (often unpublished, only to be discovered at the
worst

Re: RDMA will be reverted

2006-07-05 Thread Tom Tucker
On Wed, 2006-07-05 at 20:03 -0700, David Miller wrote:
 From: Roland Dreier [EMAIL PROTECTED]
 Date: Wed, 05 Jul 2006 13:29:35 -0700
 
  The way forward seems to be to merge basic iWARP support that lives in
  drivers/infiniband, and then you can accept or reject things for
  better integration, like notifiers for routing changes.
 
 sarcasm
 Let's merge in drivers before the necessary infrastructure.
 /sarcasm
 
 No, I think that's not the way forward.  You build the foundation
 before the house, if the foundation cannot be built then you are
 wasting your time with the house idea.

We have been running NFS and other apps over iWARP 24x7 for the last 6
mos...without the proposed netdev patch. We've run 200+ node MPI
clusters for days and days over iWARP...without the proposed netdev
patch. We ran iWARP interoperability tests across the country between
Boston and San Jose...without ... yes I know ... you get it.

sarcasm
News flash...the foundation is built!
/sarcasm

But! Our stable LAN and the WAN tests didn't often experience MTU
changes, and redirects...but of course we knew these were inevitable. So
our goal was to make iWARP more robust in the face of a more dynamic
network topology. Shutters on the house maybe...I dunno, it's your
analogy ;-)

All that said, the proposed patch helps not only iWARP, but other
transports (iSCSI, IB) as well. It is not large, invasive,
intrusive...hell it's not even new. It leverages an existing event
notifier mechanism. 

This patch is about dotting I's and crossing T's, it's not about
foundations.


 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-06-30 Thread Tom Tucker
On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote:

 The TOE folks have tried to submit their hooks and drivers
 on several occaisions, and we've rejected it every time.

iWARP != TOE

 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-06-29 Thread Tom Tucker
On Thu, 2006-06-29 at 12:46 -0700, David Miller wrote:
 From: Roland Dreier [EMAIL PROTECTED]
 Date: Thu, 29 Jun 2006 09:54:37 -0700
 
  In any case I think we need to find a way for Linux to support iWARP
  hardware, since there are users that want this, and (some of) the
  vendors are working hard to do things the right way (including cc'ing
  netdev on the conversation).  I don't think it's good for Linux for
  the answer to just be, sorry, you're wrong to want to use that hardware.
 
 We give the same response for TOE stuff.

What does the word we represent in this context? Is it the Linux
community at large, Linux and Andrew, you? I'm not trying to be
argumentative, I just want to understand how carefully and by whom iWARP
technology has been considered.

 
 The integration of iWARP with the Linux networking, while much better
 than TOE, is still heavily flawed.
 
 What most people might not realize when using this stuff is that:

Agreed, the patch improves some things, but doesn't address others. But
isn't this position a condemnation of the good to spite the bad? 

 
 1) None of their firewall rules will apply to the iWARP communications.
 2) None of their packet scheduling configurations can be applied to
the iWARP communications.
 3) It is not possible to encapsulate iWARP traffic in IPSEC
 
 And the list goes on and on.

It does, however, this position statement makes things worse, not
better. By this I mean that deep adapters (iSCSI, iWARP) are even more
debilitated by not being able to snoop MTU changes, etc... and are
therefore forced to duplicate sub-systems (e.g. ARP, ICMP, ...) already
ably implemented in host software.

 This is what we don't like about technologies that implement their own
 networking stack in the card firmware.

Doesn't this position force vendors to build deeper adapters, not
shallower adapters? Isn't this exactly the opposite of what is intended?

 

 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-06-29 Thread Tom Tucker
[...snip...]
 community at large, Linux and Andrew, you? I'm not trying to be

Linus sorry... spell checker...



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-06-29 Thread Tom Tucker
On Thu, 2006-06-29 at 13:19 -0700, David Miller wrote:
 From: Tom Tucker [EMAIL PROTECTED]
 Date: Thu, 29 Jun 2006 15:11:06 -0500
 
  Doesn't this position force vendors to build deeper adapters, not
  shallower adapters? Isn't this exactly the opposite of what is intended?
 
 Nope.
 
 Look at what the networking developers give a lot of attention and
 effort to, things like TCP Large Receive Offload, and Van Jacobson net
 channels, both of which are fully stack integrated receive performance
 enhancements.  They do not bypass netfilter, they do not bypass
 packet scheduling, and yet they provide a hardware assist performance
 improvement for receive.

These technologies are integrated because someone chose to and was
allowed to integrate them. I contend that iWARP could be equally well
integrated if the decision was made to do so. It would, however, require
cooperation from both the hardware vendors and the netdev maintainers.

 
 This has been stated over and over again.

For TOE, you are correct, however, for iWARP, you can't do RDMA (direct
placement into application buffers) without state in the adapter. I
personally tried very hard to build an adapter without doing so, but
alas, I failed ;-) 

 
 If companies keep designing undesirable hardware that unnecessarily
 takes features away from the user, that really is not our problem.

I concede that features have been lost, but some applications benefit
greatly from RDMA. For these applications and these customers, the
hardware is not undesirable and the fact that netfilter won't work on
their sub 5us latency adapter is not perceived to be a big issue. The
mention of packet scheduling would cause an apoplectic seizure...unless
it were in the hardware...

All that verbiage aside, I believe that it is not a matter of whether it
is possible to integrate iWARP it is question of whether it is
permissible.




-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-06-29 Thread Tom Tucker
On Thu, 2006-06-29 at 13:53 -0700, David Miller wrote:
 From: Tom Tucker [EMAIL PROTECTED]
 Date: Thu, 29 Jun 2006 15:47:13 -0500
 
  I concede that features have been lost, but some applications benefit
  greatly from RDMA. For these applications and these customers,
 
 TOE folks give the same story... it's a broken record, really.
 
 Let us know when you can say something new about the situation.
 
 Under Linux we get to make better long term architectually sane
 decisions, even if it is to the dismay of the backers of certain
 short-sighted pieces of technology.

Would you indulge me with one final clarification?  

- Are you condemning RDMA over TCP as an ill-conceived technology?

- Are you condemning the implementation of iWARP? 

- Are you condemning both?

Thanks,
Tom



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RDMA will be reverted

2006-06-28 Thread Tom Tucker
On Wed, 2006-06-28 at 00:07 -0700, David Miller wrote:
 Roland, there is no way in the world we would have let support for
 RDMA into the kernel tree had we seen and reviewed it on netdev.  I've
 discussed this with Andrew Morton, and we'd like you to please revert
 all of the RDMA code from Linus's tree immedialtely.
 
 Folks are well aware how against RDMA and TOE type schemes the Linux
 networking developers are.  So the fact that none of these RDMA
 changes went up for review on netdev strikes me as just a little bit
 more than suspicious.
 
 Please do not do this again, thank you.

I believe Roland is on vacation (they just had a baby..). 

It is my belief that everything that Roland submitted went through both
netdev and lklm reviews. 


 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] iWARP Connection Manager.

2006-06-12 Thread Tom Tucker
Andrew, thanks for the review, comments inline...

On Thu, 2006-06-08 at 00:54 -0700, Andrew Morton wrote:
 On Wed, 07 Jun 2006 15:06:05 -0500
 Steve Wise [EMAIL PROTECTED] wrote:
 
  
  This patch provides the new files implementing the iWARP Connection
  Manager.
  
  Review Changes:
  
  - sizeof - sizeof()
  
  - removed printks
  
  - removed TT debug code
  
  - cleaned up lock/unlock around switch statements.
  
  - waitqueue - completion for destroy path.
 
  ...
 
  +/* 
  + * This function is called on interrupt context. Schedule events on
  + * the iwcm_wq thread to allow callback functions to downcall into
  + * the CM and/or block.  Events are queued to a per-CM_ID
  + * work_list. If this is the first event on the work_list, the work
  + * element is also queued on the iwcm_wq thread.
  + *
  + * Each event holds a reference on the cm_id. Until the last posted
  + * event has been delivered and processed, the cm_id cannot be
  + * deleted. 
  + */
  +static void cm_event_handler(struct iw_cm_id *cm_id,
  +struct iw_cm_event *iw_event) 
  +{
  +   struct iwcm_work *work;
  +   struct iwcm_id_private *cm_id_priv;
  +   unsigned long flags;
  +
  +   work = kmalloc(sizeof(*work), GFP_ATOMIC);
  +   if (!work)
  +   return;
 
 This allocation _will_ fail sometimes.  The driver must recover from it. 
 Will it do so?

Er...no. It will lose this event. Depending on the event...the carnage
varies. We'll take a look at this.

 
  +EXPORT_SYMBOL(iw_cm_init_qp_attr);
 
 This file exports a ton of symbols.  It's usual to provide some justifying
 commentary in the changelog when this happens.

This module is a logical instance of the xx_cm where xx is the transport
type. I think there is some discussion warranted on whether or not these
should all be built into and exported by rdma_cm. One rationale would be
that the rdma_cm is the only client for many of these functions (this
being a particularly good example) and doing so would reduce the export
count. Others would be reasonably needed for any application (connect,
etc...)

All that said, we'll be sure to document the exported symbols in a
follow-up patch.

 
  +/*
  + * Copyright (c) 2005 Network Appliance, Inc. All rights reserved.
  + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
  + *
  + * This software is available to you under a choice of one of two
  + * licenses.  You may choose to be licensed under the terms of the GNU
  + * General Public License (GPL) Version 2, available from the file
  + * COPYING in the main directory of this source tree, or the
  + * OpenIB.org BSD license below:
  + *
  + * Redistribution and use in source and binary forms, with or
  + * without modification, are permitted provided that the following
  + * conditions are met:
  + *
  + *  - Redistributions of source code must retain the above
  + *copyright notice, this list of conditions and the following
  + *disclaimer.
  + *
  + *  - Redistributions in binary form must reproduce the above
  + *copyright notice, this list of conditions and the following
  + *disclaimer in the documentation and/or other materials
  + *provided with the distribution.
  + *
  + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND,
  + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
  + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
  + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
  + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
  + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
  + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  + * SOFTWARE.
  + */
  +#if !defined(IW_CM_PRIVATE_H)
  +#define IW_CM_PRIVATE_H
 
 We normally use #ifndef here.

We'll change this..

 
 
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/7] AMSO1100 Memory Management.

2006-06-12 Thread Tom Tucker
On Thu, 2006-06-08 at 01:17 -0700, Andrew Morton wrote:
 On Wed, 07 Jun 2006 15:06:55 -0500
 Steve Wise [EMAIL PROTECTED] wrote:
 
  
  +void c2_free(struct c2_alloc *alloc, u32 obj)
  +{
  +   spin_lock(alloc-lock);
  +   clear_bit(obj, alloc-table);
  +   spin_unlock(alloc-lock);
  +}
 
 The spinlock is unneeded here.

Good point.

 
 
 What does all the code in this file do, anyway?  It looks totally generic
 (and hence inappropriate for drivers/infiniband/hw/amso1100/) and somewhat
 similar to idr trees, perhaps.
 

We mimicked the mthca driver. It may be code that should be replaced
with Linux core services for new drivers. We'll investigate.

  +int c2_array_set(struct c2_array *array, int index, void *value)
  +{
  +   int p = (index * sizeof(void *))  PAGE_SHIFT;
  +
  +   /* Allocate with GFP_ATOMIC because we'll be called with locks held. */
  +   if (!array-page_list[p].page)
  +   array-page_list[p].page =
  +   (void **) get_zeroed_page(GFP_ATOMIC);
  +
  +   if (!array-page_list[p].page)
  +   return -ENOMEM;
 
 This _will_ happen under load.  What will the result of that be, in the
 context of thise driver?

A higher level object allocation will fail. In this case, a kernel
application request will fail and the application must handle the error.
 
 This function is incorrectly designed - it should receive a gfp_t argument.
 Because you don't *know* that the caller will always hold a spinlock.  And
 GFP_KERNEL is far, far stronger than GFP_ATOMIC.

This service is allocating a page that the adapter will DMA 2B message
indices into. 
 
  +static int c2_alloc_mqsp_chunk(gfp_t gfp_mask, struct sp_chunk **head)
  +{
  +   int i;
  +   struct sp_chunk *new_head;
  +
  +   new_head = (struct sp_chunk *) __get_free_page(gfp_mask | GFP_DMA);
 
 Why is __GFP_DMA in there?  Unless you've cornered the ISA bus infiniband
 market, it's likely to be wrong.
 

Flag confusion about what GFP_DMA means. We'll revisit this whole
file ... 

 
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] iWARP Core Changes.

2006-06-07 Thread Tom Tucker

A reference is being taken on an iWARP device that is never getting
released. This prevents a participating iWARP netdev device from being
unloaded after a connection has been established on the passive side.

Search for ip_dev_find below...

On Wed, 2006-06-07 at 15:06 -0500, Steve Wise wrote:
 This patch contains modifications to the existing rdma header files,
 core files, drivers, and ulp files to support iWARP.
 
 Review updates:
 
 - copy_addr() - rdma_copy_addr()
 
 - dst_dev_addr param in rdma_copy_addr to const.
 
 - various spacing nits with recasting
 
 - include linux/inetdevice.h to get ip_dev_find() prototype.
 ---
 
  drivers/infiniband/core/Makefile |4 
  drivers/infiniband/core/addr.c   |   19 +
  drivers/infiniband/core/cache.c  |8 -
  drivers/infiniband/core/cm.c |3 
  drivers/infiniband/core/cma.c|  353 
 +++---
  drivers/infiniband/core/device.c |6 
  drivers/infiniband/core/mad.c|   11 +
  drivers/infiniband/core/sa_query.c   |5 
  drivers/infiniband/core/smi.c|   18 +
  drivers/infiniband/core/sysfs.c  |   18 +
  drivers/infiniband/core/ucm.c|5 
  drivers/infiniband/core/user_mad.c   |9 -
  drivers/infiniband/hw/ipath/ipath_verbs.c|2 
  drivers/infiniband/hw/mthca/mthca_provider.c |2 
  drivers/infiniband/ulp/ipoib/ipoib_main.c|8 +
  drivers/infiniband/ulp/srp/ib_srp.c  |2 
  include/rdma/ib_addr.h   |   15 +
  include/rdma/ib_verbs.h  |   39 +++
  18 files changed, 435 insertions(+), 92 deletions(-)
 
 diff --git a/drivers/infiniband/core/Makefile 
 b/drivers/infiniband/core/Makefile
 index 68e73ec..163d991 100644
 --- a/drivers/infiniband/core/Makefile
 +++ b/drivers/infiniband/core/Makefile
 @@ -1,7 +1,7 @@
  infiniband-$(CONFIG_INFINIBAND_ADDR_TRANS)   := ib_addr.o rdma_cm.o
  
  obj-$(CONFIG_INFINIBAND) +=  ib_core.o ib_mad.o ib_sa.o \
 - ib_cm.o $(infiniband-y)
 + ib_cm.o iw_cm.o $(infiniband-y)
  obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o
  obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=  ib_uverbs.o ib_ucm.o
  
 @@ -14,6 +14,8 @@ ib_sa-y :=  sa_query.o
  
  ib_cm-y :=   cm.o
  
 +iw_cm-y :=   iwcm.o
 +
  rdma_cm-y := cma.o
  
  ib_addr-y := addr.o
 diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
 index d294bbc..83f84ef 100644
 --- a/drivers/infiniband/core/addr.c
 +++ b/drivers/infiniband/core/addr.c
 @@ -32,6 +32,7 @@ #include linux/mutex.h
  #include linux/inetdevice.h
  #include linux/workqueue.h
  #include linux/if_arp.h
 +#include linux/inetdevice.h
  #include net/arp.h
  #include net/neighbour.h
  #include net/route.h
 @@ -60,12 +61,15 @@ static LIST_HEAD(req_list);
  static DECLARE_WORK(work, process_req, NULL);
  static struct workqueue_struct *addr_wq;
  
 -static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
 -  unsigned char *dst_dev_addr)
 +int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
 +  const unsigned char *dst_dev_addr)
  {
   switch (dev-type) {
   case ARPHRD_INFINIBAND:
 - dev_addr-dev_type = IB_NODE_CA;
 + dev_addr-dev_type = RDMA_NODE_IB_CA;
 + break;
 + case ARPHRD_ETHER:
 + dev_addr-dev_type = RDMA_NODE_RNIC;
   break;
   default:
   return -EADDRNOTAVAIL;
 @@ -77,6 +81,7 @@ static int copy_addr(struct rdma_dev_add
   memcpy(dev_addr-dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
   return 0;
  }
 +EXPORT_SYMBOL(rdma_copy_addr);
  
  int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
  {
 @@ -88,7 +93,7 @@ int rdma_translate_ip(struct sockaddr *a
   if (!dev)
   return -EADDRNOTAVAIL;
  
 - ret = copy_addr(dev_addr, dev, NULL);
 + ret = rdma_copy_addr(dev_addr, dev, NULL);
   dev_put(dev);
   return ret;
  }
 @@ -160,7 +165,7 @@ static int addr_resolve_remote(struct so
  
   /* If the device does ARP internally, return 'done' */
   if (rt-idev-dev-flags  IFF_NOARP) {
 - copy_addr(addr, rt-idev-dev, NULL);
 + rdma_copy_addr(addr, rt-idev-dev, NULL);
   goto put;
   }
  
 @@ -180,7 +185,7 @@ static int addr_resolve_remote(struct so
   src_in-sin_addr.s_addr = rt-rt_src;
   }
  
 - ret = copy_addr(addr, neigh-dev, neigh-ha);
 + ret = rdma_copy_addr(addr, neigh-dev, neigh-ha);
  release:
   neigh_release(neigh);
  put:
 @@ -244,7 +249,7 @@ static int addr_resolve_local(struct soc
   if (ZERONET(src_ip)) {
   src_in-sin_family = dst_in-sin_family;
   

Re: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager.

2006-06-01 Thread Tom Tucker
On Wed, 2006-05-31 at 15:22 -0700, Sean Hefty wrote:
 Steve Wise wrote:
  +/* 
  + * Release a reference on cm_id. If the last reference is being removed
  + * and iw_destroy_cm_id is waiting, wake up the waiting thread.
  + */
  +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv)
  +{
  +   int ret = 0;
  +
  +   BUG_ON(atomic_read(cm_id_priv-refcount)==0);
  +   if (atomic_dec_and_test(cm_id_priv-refcount)) {
  +   BUG_ON(!list_empty(cm_id_priv-work_list));
  +   if (waitqueue_active(cm_id_priv-destroy_wait)) {
  +   BUG_ON(cm_id_priv-state != IW_CM_STATE_DESTROYING);
  +   BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY,
  +   cm_id_priv-flags));
  +   ret = 1;
  +   wake_up(cm_id_priv-destroy_wait);
 
 We recently changed the RDMA CM, IB CM, and a couple of other modules from 
 using 
 wait objects to completions.   This avoids a race condition between 
 decrementing 
 the reference count, which allows destruction to proceed, and calling wake_up 
 on 
 a freed cm_id.  My guess is that you may need to do the same.
 
 Can you also explain the use of the return value here?  It's ignored below in 
 rem_ref() and destroy_cm_id().
 
  +static void add_ref(struct iw_cm_id *cm_id)
  +{
  +   struct iwcm_id_private *cm_id_priv;
  +   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
  +   atomic_inc(cm_id_priv-refcount);
  +}
  +
  +static void rem_ref(struct iw_cm_id *cm_id)
  +{
  +   struct iwcm_id_private *cm_id_priv;
  +   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
  +   iwcm_deref_id(cm_id_priv);
  +}
  +
 
  +/* 
  + * CM_ID -- CLOSING
  + *
  + * Block if a passive or active connection is currenlty being processed. 
  Then
  + * process the event as follows:
  + * - If we are ESTABLISHED, move to CLOSING and modify the QP state
  + *   based on the abrupt flag 
  + * - If the connection is already in the CLOSING or IDLE state, the peer is
  + *   disconnecting concurrently with us and we've already seen the 
  + *   DISCONNECT event -- ignore the request and return 0
  + * - Disconnect on a listening endpoint returns -EINVAL
  + */
  +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt)
  +{
  +   struct iwcm_id_private *cm_id_priv;
  +   unsigned long flags;
  +   int ret = 0;
  +
  +   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
  +   /* Wait if we're currently in a connect or accept downcall */
  +   wait_event(cm_id_priv-connect_wait, 
  +  !test_bit(IWCM_F_CONNECT_WAIT, cm_id_priv-flags));
 
 Am I understanding this check correctly?  You're checking to see if the user 
 has 
 called iw_cm_disconnect() at the same time that they called iw_cm_connect() 
 or 
 iw_cm_accept().  Are connect / accept blocking, or are you just waiting for 
 an 
 event?

Yes. The application (or the case I saw was user-mode exit logic after
ctrl-C) cleaning up at random times relative to connection
establishment. 

 
  +
  +   spin_lock_irqsave(cm_id_priv-lock, flags);
  +   switch (cm_id_priv-state) {
  +   case IW_CM_STATE_ESTABLISHED:
  +   cm_id_priv-state = IW_CM_STATE_CLOSING;
  +   spin_unlock_irqrestore(cm_id_priv-lock, flags);
  +   if (cm_id_priv-qp) { /* QP could be nul for user-mode 
  client */
  +   if (abrupt)
  +   ret = iwcm_modify_qp_err(cm_id_priv-qp);
  +   else
  +   ret = iwcm_modify_qp_sqd(cm_id_priv-qp);
  +   /* 
  +* If both sides are disconnecting the QP could
  +* already be in ERR or SQD states
  +*/
  +   ret = 0;
  +   }
  +   else
  +   ret = -EINVAL;
  +   break;
  +   case IW_CM_STATE_LISTEN:
  +   spin_unlock_irqrestore(cm_id_priv-lock, flags);
  +   ret = -EINVAL;
  +   break;
  +   case IW_CM_STATE_CLOSING:
  +   /* remote peer closed first */
  +   case IW_CM_STATE_IDLE:  
  +   /* accept or connect returned !0 */
  +   spin_unlock_irqrestore(cm_id_priv-lock, flags);
  +   break;
  +   case IW_CM_STATE_CONN_RECV:
  +   /* 
  +* App called disconnect before/without calling accept after
  +* connect_request event delivered.
  +*/
  +   spin_unlock_irqrestore(cm_id_priv-lock, flags);
  +   break;
  +   case IW_CM_STATE_CONN_SENT:
  +   /* Can only get here if wait above fails */
  +   default:
  +   BUG_ON(1);
  +   }
  +
  +   return ret;
  +}
  +EXPORT_SYMBOL(iw_cm_disconnect);
  +static void destroy_cm_id(struct iw_cm_id *cm_id)
  +{
  +   struct iwcm_id_private *cm_id_priv;
  +   unsigned long flags;
  +   int ret;
  +
  +   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
  +   /* Wait if we're currently in a 

Re: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager.

2006-06-01 Thread Tom Tucker
On Thu, 2006-06-01 at 14:09 -0700, Sean Hefty wrote:
 Steve Wise wrote:
 +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt)
 +{
 +  struct iwcm_id_private *cm_id_priv;
 +  unsigned long flags;
 +  int ret = 0;
 +
 +  cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
 +  /* Wait if we're currently in a connect or accept downcall */
 +  wait_event(cm_id_priv-connect_wait, 
 + !test_bit(IWCM_F_CONNECT_WAIT, cm_id_priv-flags));
 
 Am I understanding this check correctly?  You're checking to see if the 
 user has 
 called iw_cm_disconnect() at the same time that they called iw_cm_connect() 
 or 
 iw_cm_accept().  Are connect / accept blocking, or are you just waiting for 
 an 
 event?
  
  
  The CM must wait for the low level provider to finish a connect() or
  accept() operation before telling the low level provider to disconnect
  via modifying the iwarp QP.  Regardless of whether they block, this
  disconnect can happen concurrently with the connect/accept so we need to
  hold the disconnect until the connect/accept completes.
  
  
 +EXPORT_SYMBOL(iw_cm_disconnect);
 +static void destroy_cm_id(struct iw_cm_id *cm_id)
 +{
 +  struct iwcm_id_private *cm_id_priv;
 +  unsigned long flags;
 +  int ret;
 +
 +  cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
 +  /* Wait if we're currently in a connect or accept downcall. A
 +   * listening endpoint should never block here. */
 +  wait_event(cm_id_priv-connect_wait, 
 + !test_bit(IWCM_F_CONNECT_WAIT, cm_id_priv-flags));
 
 Same question/comment as above.
 
  
  
  Same answer.  
 
 There's a difference between trying to handle the user calling 
 disconnect/destroy at the same time a call to accept/connect is active, 
 versus 
 the user calling disconnect/destroy after accept/connect have returned.  In 
 the 
 latter case, I think you're fine.  In the first case, this is allowing a user 
 to 
 call destroy at the same time that they're calling accept/connect. 
 Additionally, there's no guarantee that the F_CONNECT_WAIT flag has been set 
 by 
 accept/connect by the time disconnect/destroy tests it.

The problem is that we can't synchronously cancel an outstanding connect
request. Once we've asked the adapter to connect, we can't tell him to
stop, we have to wait for it to fail. During the time period between
when we ask to connect and the adapter says yeah-or-nay, the user hits
ctrl-C. This is the case where disconnect and/or destroy gets called and
we have to block it waiting for the outstanding connect request to
complete.

One alternative to this approach is to do the kfree of the cm_id in the
deref logic. This was the original design and leaves the object around
to handle the completion of the connect and still allows the app to
clean up and go away without all this waitin' around. When the adapter
finally finishes and releases it's reference, the object is kfree'd.

Hope this helps.
 
 
 - Sean
 ___
 openib-general mailing list
 openib-general@openib.org
 http://openib.org/mailman/listinfo/openib-general
 
 To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html