Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-06 Thread Michael S. Tsirkin
 [Felix Marti] In addition, is arming the CQ really in the performance
 path? - Don't apps poll the CQ as long as there are pending CQEs and
 only arm the CQ for notification once there is nothing left to do? If
 this is the case, it would mean that we waste a few 'idle' cycles.

Applications such as IPoIB might queue up packets, then arm the CQ,
and only then are they processed by the upper layers in the stack.
So arming the CQ is on the hot datapath.
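
For readers following along, the poll/arm cycle under discussion looks
roughly like this (a kernel-style sketch, not code from the patch; the
handler name and process_completion() are placeholders):

```c
static void my_cq_handler(struct ib_cq *cq, void *ctx)
{
	struct ib_wc wc;

	/* Drain all pending CQEs first... */
	while (ib_poll_cq(cq, 1, &wc) > 0)
		process_completion(&wc);

	/* ...then rearm.  This call sits on the hot path whenever the
	 * consumer alternates poll and arm on every interrupt. */
	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);

	/* Poll once more to close the race with CQEs that arrived
	 * between the last poll and the arm. */
	while (ib_poll_cq(cq, 1, &wc) > 0)
		process_completion(&wc);
}
```

If arming is done once per interrupt like this, its cost is paid on
every completion batch, which is why it counts as datapath.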

-- 
MST
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-04 Thread Steve Wise
On Thu, 2007-01-04 at 07:07 +0200, Michael S. Tsirkin wrote:
  If you think I should not add the udata parameter to the req_notify_cq()
  provider verb, then I can rework the chelsio driver:
  
  1) at cq creation time, pass the virtual address of the u32 used by the
  library to track the current cq index.  That way the chelsio kernel
  driver can save the address in its kernel cq context for later use.
  
  2) change chelsio's req_notify_cq() to copy in the current cq index
  value directly for rearming.
  
  This puts all the burden on the chelsio driver, which is apparently the
  only one that needs this functionality.  
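
As a rough sketch of that rework (hypothetical names throughout --
`iwch_cq`, `to_iwch_cq`, and `cxio_arm_cq` are illustrative, not the
actual cxgb3 code), the driver would save the userspace address at CQ
creation and copy the index in at rearm time:

```c
struct iwch_cq {
	/* ... */
	u32 __user *user_rptr;	/* saved from the create_cq udata */
};

static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
{
	struct iwch_cq *chp = to_iwch_cq(ibcq);
	u32 rptr = 0;

	/* Copy the current consumer index in from userspace, if this
	 * CQ belongs to a user process. */
	if (chp->user_rptr &&
	    copy_from_user(&rptr, chp->user_rptr, sizeof(rptr)))
		return -EFAULT;

	return cxio_arm_cq(chp, notify, rptr);	/* rearm in hardware */
}
```

This keeps the extra work entirely inside the chelsio driver, with no
change to the req_notify_cq() prototype seen by other drivers.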
 
 Good thinking, I haven't thought of this approach.
 
 This way there won't be any API/core changes and no changes to
 other low level drivers, correct? And for chelsio, there's no overhead
 as compared to code you posted.
 
 Sounds good.
 

I still want to hear from Roland on this before I go to the effort of
reworking all this...


Steve.



Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Steve Wise
  @@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_
   static inline int ib_req_notify_cq(struct ib_cq *cq,
 enum ib_cq_notify cq_notify)
   {
   -   return cq->device->req_notify_cq(cq, cq_notify);
   +   return cq->device->req_notify_cq(cq, cq_notify, NULL);
   }
   
   /**
 
 Can't say I like this adding overhead in data path operations (and note this
 can't be optimized out). And kernel consumers work without passing it in, so it
 hurts kernel code even for Chelsio. Granted, the cost is small here, but these
 things do tend to add up.
 
 It seems all Chelsio needs is to pass in a consumer index - so, how about a new
 entry point? Something like void set_cq_udata(struct ib_cq *cq, struct
 ib_udata *udata)?
 

Adding a new entry point would hurt chelsio's user mode performance if
it then requires 2 kernel transitions to rearm the cq.

Passing in user data is sort of SOP for these sorts of verbs.  

How much does passing one more param cost for kernel users?  





Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Michael S. Tsirkin
   @@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_
static inline int ib_req_notify_cq(struct ib_cq *cq,
enum ib_cq_notify cq_notify)
{
   - return cq->device->req_notify_cq(cq, cq_notify);
   + return cq->device->req_notify_cq(cq, cq_notify, NULL);
}

/**
  
  Can't say I like this adding overhead in data path operations (and note this
  can't be optimized out). And kernel consumers work without passing it in,
  so it hurts kernel code even for Chelsio. Granted, the cost is small here,
  but these things do tend to add up.
  
  It seems all Chelsio needs is to pass in a consumer index - so, how about a
  new entry point? Something like void set_cq_udata(struct ib_cq *cq, struct
  ib_udata *udata)?
  
 
 Adding a new entry point would hurt chelsio's user mode performance if
 it then requires 2 kernel transitions to rearm the cq.

No, it won't need 2 transitions - just an extra function call,
so it won't hurt performance - it would improve performance.

ib_uverbs_req_notify_cq would call

ib_uverbs_req_notify_cq()
{
	ib_set_cq_udata(cq, udata);
	ib_req_notify_cq(cq, cmd.solicited_only ?
			 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
}

This way kernel consumers don't incur any overhead,
and in userspace the extra function call is dwarfed
by system call overhead.

 Passing in user data is sort of SOP for these sorts of verbs.  

I don't see other examples. Where we did pass extra user data
is in non-datapath verbs such as create QP.

This is the innermost tight loop in many ULPs, so we should be very careful
about adding code there - these things do add up.
See the recent IRQ API update in the kernel.

 How much does passing one more param cost for kernel users?  

Dunno. I just reviewed the code.
It really should be up to patch submitter to check the performance
effect of his patch, if there might be any.

-- 
MST


Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Steve Wise
   It seems all Chelsio needs is to pass in a consumer index - so, how about
   a new entry point? Something like void set_cq_udata(struct ib_cq *cq,
   struct ib_udata *udata)?
   
  
  Adding a new entry point would hurt chelsio's user mode performance if
  it then requires 2 kernel transitions to rearm the cq.
 
 No, it won't need 2 transitions - just an extra function call,
 so it won't hurt performance - it would improve performance.
 
 ib_uverbs_req_notify_cq would call
 
 	ib_uverbs_req_notify_cq()
 	{
 		ib_set_cq_udata(cq, udata);
 		ib_req_notify_cq(cq, cmd.solicited_only ?
 				 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
 	}
 

ib_set_cq_udata() would transition into the kernel to pass in the
consumer's index.  In addition, ib_req_notify_cq would also transition
into the kernel since it's not a bypass function for chelsio.

 This way kernel consumers don't incur any overhead,
 and in userspace the extra function call is dwarfed
 by system call overhead.
 
  Passing in user data is sort of SOP for these sorts of verbs.  
 
 I don't see other examples. Where we did pass extra user data
 is in non-datapath verbs such as create QP.
 
 This is the innermost tight loop in many ULPs, so we should be very careful
 about adding code there - these things do add up.
 See the recent IRQ API update in the kernel.

Roland, do you have any comments on this?  You previously indicated
these patches were good to go once chelsio's ethernet driver gets pulled
in. 

  How much does passing one more param cost for kernel users?  
 
 Dunno. I just reviewed the code.
 It really should be up to patch submitter to check the performance
 effect of his patch, if there might be any.

I've run this code with mthca and didn't notice any performance
degradation, but I wasn't specifically measuring cq_poll overhead in a
tight loop...








Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Michael S. Tsirkin
  
  No, it won't need 2 transitions - just an extra function call,
  so it won't hurt performance - it would improve performance.
  
  ib_uverbs_req_notify_cq would call
  
  ib_uverbs_req_notify_cq()
  {
  	ib_set_cq_udata(cq, udata);
  	ib_req_notify_cq(cq, cmd.solicited_only ?
  			 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
  }
  
 
 ib_set_cq_udata() would transition into the kernel to pass in the
 consumer's index.  In addition, ib_req_notify_cq would also transition
 into the kernel since it's not a bypass function for chelsio.

We misunderstand each other.

ib_uverbs_req_notify_cq is in drivers/infiniband/core/uverbs_cmd.c -
all this code runs inside the IB_USER_VERBS_CMD_REQ_NOTIFY_CQ command,
so there is a single user to kernel transition.

-- 
MST


Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Michael S. Tsirkin
 I've run this code with mthca and didn't notice any performance
 degradation, but I wasn't specifically measuring cq_poll overhead in a
 tight loop...

We were speaking about ib_req_notify_cq here, actually, not cq poll.
So what was tested?

-- 
MST


Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Steve Wise
On Wed, 2007-01-03 at 17:02 +0200, Michael S. Tsirkin wrote:
  I've run this code with mthca and didn't notice any performance
  degradation, but I wasn't specifically measuring cq_poll overhead in a
  tight loop...
 
 We were speaking about ib_req_notify_cq here, actually, not cq poll.
 So what was tested?
 

Sorry, I meant req_notify.  I didn't specifically measure the cost of
req_notify before and after this change.

I've been running the user mode perftest programs mainly.  





Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Steve Wise
On Wed, 2007-01-03 at 17:00 +0200, Michael S. Tsirkin wrote:
   
   No, it won't need 2 transitions - just an extra function call,
   so it won't hurt performance - it would improve performance.
   
   ib_uverbs_req_notify_cq would call
   
 	ib_uverbs_req_notify_cq()
 	{
 		ib_set_cq_udata(cq, udata);
 		ib_req_notify_cq(cq, cmd.solicited_only ?
 				 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
 	}
   
  
  ib_set_cq_udata() would transition into the kernel to pass in the
  consumer's index.  In addition, ib_req_notify_cq would also transition
  into the kernel since it's not a bypass function for chelsio.
 
 We misunderstand each other.
 
 ib_uverbs_req_notify_cq is in drivers/infiniband/core/uverbs_cmd.c -
 all this code runs inside the IB_USER_VERBS_CMD_REQ_NOTIFY_CQ command,
 so there is a single user to kernel transition.
 

Oh I see. 

This seems like a lot of extra code to avoid passing one extra arg to
the driver's req_notify_cq verb.  I'd appreciate other folks' input on
how important they think this is.

If you insist, then I'll run some tests specifically in kernel mode and
see how this affects mthca's req_notify performance.

Steve.







Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Michael S. Tsirkin
   I've run this code with mthca and didn't notice any performance
   degradation, but I wasn't specifically measuring cq_poll overhead in a
   tight loop...
  
  We were speaking about ib_req_notify_cq here, actually, not cq poll.
  So what was tested?
  
 
 Sorry, I meant req_notify.  I didn't specifically measure the cost of
 req_notify before and after this change.
 
 I've been running the user mode perftest programs mainly.  

So it's not really exercised much there.
You want something like an IPoIB BW test.

-- 
MST


Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Michael S. Tsirkin
No, it won't need 2 transitions - just an extra function call,
so it won't hurt performance - it would improve performance.

ib_uverbs_req_notify_cq would call

ib_uverbs_req_notify_cq()
{
	ib_set_cq_udata(cq, udata);
	ib_req_notify_cq(cq, cmd.solicited_only ?
			 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
}

   
   ib_set_cq_udata() would transition into the kernel to pass in the
   consumer's index.  In addition, ib_req_notify_cq would also transition
   into the kernel since it's not a bypass function for chelsio.
  
  We misunderstand each other.
  
  ib_uverbs_req_notify_cq is in drivers/infiniband/core/uverbs_cmd.c -
  all this code runs inside the IB_USER_VERBS_CMD_REQ_NOTIFY_CQ command,
  so there is a single user to kernel transition.
  
 
 Oh I see. 
 
 This seems like a lot of extra code to avoid passing one extra arg to
 the driver's req_notify_cq verb.  I'd appreciate other folks' input on
 how important they think this is.

 If you insist, then I'll run some tests specifically in kernel mode and
 see how this affects mthca's req_notify performance.

This might be an interesting datapoint.

-- 
MST


Re: [PATCH v4 01/13] Linux RDMA Core Changes

2007-01-03 Thread Steve Wise
On Wed, 2007-01-03 at 21:33 +0200, Michael S. Tsirkin wrote:
  Without extra param (1000 iterations in cycles): 
  ave 101.283 min 91 max 247
  With extra param (1000 iterations in cycles):
  ave 103.311 min 91 max 221
 
 A 2% hit then. Not huge, but not 0 either.
 
  Convert cycles to ns (3466.727 MHz CPU):
  
  Without: 101.283 / 3466.727 = .02922us == 29.22ns
  With:    103.311 / 3466.727 = .02980us == 29.80ns
  
  So I measure a .58ns average increase for passing in the additional
  parameter.
 
 That depends on CPU speed though. Percentage is likely to be more universal.
 
 Here is a snippet of the test:
  
  spin_lock_irq(&lock);
  do_gettimeofday(&start_tv);
  for (i = 0; i < 1000; i++) {
  	cycles_start[i] = get_cycles();
  	ib_req_notify_cq(cb->cq, IB_CQ_NEXT_COMP);
  	cycles_stop[i] = get_cycles();
  }
  do_gettimeofday(&stop_tv);
  spin_unlock_irq(&lock);
  
  if (stop_tv.tv_usec < start_tv.tv_usec) {
  	stop_tv.tv_usec += 1000000;
  	stop_tv.tv_sec  -= 1;
  }
  
  for (i = 0; i < 1000; i++) {
  	cycles_t v = cycles_stop[i] - cycles_start[i];
  	sum += v;
  	if (v > max)
  		max = v;
  	if (min == 0 || v < min)
  		min = v;
  }
  
  printk(KERN_ERR PFX "delta sec %lu usec %lu sum %llu min %llu max %llu\n",
  	stop_tv.tv_sec - start_tv.tv_sec,
  	stop_tv.tv_usec - start_tv.tv_usec,
  	(unsigned long long)sum, (unsigned long long)min,
  	(unsigned long long)max);
 
 Good job, the test looks good, thanks.
 
 So what does this tell you?
 To me it looks like there's a measurable speed difference, and so we should
 find a way (e.g. what I proposed) to enable chelsio userspace without adding
 overhead to other low level drivers or indeed chelsio kernel level code.
 
 What do you think? Roland?
 

I think having a 2nd function to set the udata seems onerous.




Re: [PATCH v4 01/13] Linux RDMA Core Changes

2006-12-24 Thread Michael S. Tsirkin
 diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
 index 283d50b..15cbd49 100644
 --- a/drivers/infiniband/hw/mthca/mthca_cq.c
 +++ b/drivers/infiniband/hw/mthca/mthca_cq.c
 @@ -722,7 +722,8 @@ repoll:
  	return err == 0 || err == -EAGAIN ? npolled : err;
  }
  
 -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
 +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify,
 +		       struct ib_udata *udata)
  {
  	__be32 doorbell[2];
  
 @@ -739,7 +740,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq,
  	return 0;
  }
  
 -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
 +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
 +		       struct ib_udata *udata)
  {
  	struct mthca_cq *cq = to_mcq(ibcq);
  	__be32 doorbell[2];
 diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
 index fe5cecf..6b9ccf6 100644
 --- a/drivers/infiniband/hw/mthca/mthca_dev.h
 +++ b/drivers/infiniband/hw/mthca/mthca_dev.h
 @@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev
  
  int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
  		  struct ib_wc *entry);
 -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
 -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
 +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
 +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
  int mthca_init_cq(struct mthca_dev *dev, int nent,
  		  struct mthca_ucontext *ctx, u32 pdn,
  		  struct mthca_cq *cq);
 diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
 index 8eacc35..e3e1a2c 100644
 --- a/include/rdma/ib_verbs.h
 +++ b/include/rdma/ib_verbs.h
 @@ -941,7 +941,8 @@ struct ib_device {
  					      struct ib_wc *wc);
  	int	(*peek_cq)(struct ib_cq *cq, int wc_cnt);
  	int	(*req_notify_cq)(struct ib_cq *cq,
 -				 enum ib_cq_notify cq_notify);
 +				 enum ib_cq_notify cq_notify,
 +				 struct ib_udata *udata);
  	int	(*req_ncomp_notif)(struct ib_cq *cq,
  				   int wc_cnt);
  	struct ib_mr * (*get_dma_mr)(struct ib_pd *pd,
 @@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_
  static inline int ib_req_notify_cq(struct ib_cq *cq,
  				   enum ib_cq_notify cq_notify)
  {
 -	return cq->device->req_notify_cq(cq, cq_notify);
 +	return cq->device->req_notify_cq(cq, cq_notify, NULL);
  }
  
  /**

Can't say I like this adding overhead in data path operations (and note this
can't be optimized out). And kernel consumers work without passing it in, so it
hurts kernel code even for Chelsio. Granted, the cost is small here, but these
things do tend to add up.

It seems all Chelsio needs is to pass in a consumer index - so, how about a new
entry point? Something like void set_cq_udata(struct ib_cq *cq, struct ib_udata 
*udata)?

-- 
MST