Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

2015-06-01 Thread Christoph Lameter
On Fri, 29 May 2015, Doug Ledford wrote:

 The use case is clear IMO.  It's for financial trading software.  I
 don't think they really care about details like whether it's the start
 or end packet, or completion, or whatever.  They need a tie breaker
 between when they have two different buy or sell orders on the same lot
 of stock.  Any deterministic timing/ordering method will do as long as
 they consistently apply it I think.  And the faster and lower overhead
 the process, the better.  He doesn't really want a timestamp, he merely
 wants a sequence ordering.  But a timestamp is what they are using to
 get him what he needs.

 Is that a fair guess Christoph?

We want to have a time stamp when the action is complete and the data is
available to the application or the send action is complete and the CQ
entry can be reused. That is a well defined point and that is how the time
stamps function in the existing implementation. This is an obvious
understanding and its pretty clear.

The time stamp needs to be at the end of the action because the timestamp
is used to:

1. Assess the impact of network processing. This can be compared with
   packet timestamps from capture devices off the wire.
2. Delineate the start of packet processing in software.


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

2015-06-01 Thread Christoph Lameter

  This is sufficient since it can be converted to ns or whatever one wants.

 It is sufficient for your use.  It is not, however, a good API.

I hate these foggy statements that you guys come up with. Why is it not a
good API? Having a cycle counter without processiing overhead is a good
thing and the way counters are handled is pretty well established.

  The API provides the abilty to retrieve the clock freq which is
  sufficient for the user to convert the value to meaningful time values.

 I would prefer if the access to the timestamp were implemented via a
 function in libibverbs (I haven't looked at the git repo, too little
 time, I'll get to it).  Something like ibv_get_cqe_timestamp().  That
 function should be general and return a suitable, normalized value (ns
 probably).  If you just want a simple comparator without the overhead of
 normalizing to time, and are willing to accept the consequences of that,
 then I would expect you to use something like

That would introduce additional latencies and would make that feature no
longer useful to us. The advantage of this approach is that the counter is
in the same cacheline that is already used. It is very low overhead.

 ibv_get_raw_cqe_timestamp() to get the unadulterated cycle counter.  For
 the most part, the user space application should not know details like
 we are using a cycle counter in the HCA processor for timestamping,

Why not? A cycle counter is the general way of providing
timestamps in hardware. RDTSC is such a cycle counter as well. There are
numerous examples of counters like that.


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

2015-06-01 Thread Christoph Lameter
On Fri, 29 May 2015, Hefty, Sean wrote:

  What is the issue here? The timestamp is created when the processing is
  finished by the nic and the completion entry becomes available.

 The timestamp is generated when a work completion entry is written.  If the=
 re's a clear use case for this, it hasn't been described.  The use case you=
  mentioned only works if there is a 1:1 relationship between a packet and a=
  work completion.  That is not what is being defined here.

It does make sense to have a timestamp when the work described by a CQ has
been completed. For that you do not need a 1:1 correspondence.

My use case is focused on single packet processing because that is what we
do here.

  It is exactly defined like any other cycle counters in hardware and it is
  exposed using an API that allows multiple vendors to use these cycle
  counters without regard to a particular vendors implementation.

 I disagree.  This is associated with a specific implementation.  It assumes=
  that the counter is part of a CQ entry, and that the counter is written wh=
 en the completion is written.  There's nothing that requires other vendors =
 to follow that model, nor is it clear that this is a generic or useful enou=
 gh operation that other vendors would want to follow this model.  Why not h=
 ave the time stamp record the start of the transaction?  The end?  Have two=
  stamps, once for the first packet, and one for the last?  Limit this to si=
 ngle packet operations only?

Why would you have a timestamp at the beginning of the transaction? That
is useless because you can use packet capture devices to establish that
timepoint. At that point you have not identified the CQ anyways.

Having a timestamp when an action is complete makes sense, is generic
and general.

Adding the timestamp to the data means that the application now has to
separate the metadata (timestamp) from the data stream. Not good.

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-01 Thread Doug Ledford
On Mon, 2015-06-01 at 12:30 +0300, Matan Barak wrote:
 On Sun, May 31, 2015 at 10:00 PM, Doug Ledford dledf...@redhat.com wrote:
  On Sun, 2015-05-31 at 15:14 +0300, Or Gerlitz wrote:
  Hi Doug,
 
  This patchset adds completion timestamping supports for verbs consumers.
 
  Reviewing the weekend threads, we've changed the flag time to reflect
  that this is completion time-stamp and folded the mlx4 actual support
  into one patch.
 
  Regarding the related user-space support, it's possible to add what you
  were suggesting, ibv_get_raw_cqe_timestamp() -- returns ts in cycles and
  ibv_get_cqe_timestamp() -- returns ts in ns, this makes the value returned
  by the poll cq verb an opaque one that must go through one of  the 
  convertors.
 
  We would to go for one helper ibv_get_timestamp(uint64_t raw_time, flag) 
  which
  could get the raw time-stamp and one of the following flags: RAW_TIME, 
  RAW_NS_TIME.
 
  I'm theoretically OK with something similar to the above.  However, the
  NS time should not be raw.  It should be cooked and should be able to be
  valid to compare between different adapters.  Right now, the cycle
  counter that you are exposing is only useful for ordering between
  packets received on a single adapter where the cycle counter is the same
  on all packets.  Throw in a different vendor's card, or two of your own
  cards, and the issue gets much more complex.  The cooked value should be
  an actual, real time that can be used across these more complex
  environments.  Because of that, it really shouldn't be called RAW.
 
 
 Thanks for the feedback Doug.
 We wanted to add RAW_NS in order to free the user from calculating it by 
 himself
 (dividing the cycles value in the core_clock).

What's the point?  If it's raw, it's raw.  It's not coordinated between
adapters.  Whether it's in ns or ps or flipflops doesn't matter, it's a
flat number that has no reference to anything else, so the only thing
that matters is  another version of itself or not.

 In addition to this, it's possible to implement a future NS_TIME
 (without the raw), which
 will convert the opaque time to system wide ns.
 
  So, if you want a single entry point, I would suggest something like
  this:
 
  enum ib_timestamp_flags {
  IB_TIMESTAMP_COMPLETION = (1  0), // specify on create_cq
  IB_TIMESTAMP_WQE_BEGIN =  (1  1), // specify on create qp?
  IB_TIMESTAMP_WQE_END =(1  2), // specify on create qp?
  IB_TIMESTAMP_RAW =(1  31)
  };
 
  enum ib_cq_creation_flags {
  IB_CQ_FLAGS_TIMESTAMP_COMPLETION = (1  0)
  };
 
  /**
   * ibv_get_timestamp - Return the requested timestamp for the given wc
   * @wc - work completion to get timestamp results from
   * @ts - struct timespec to return timestamp in
   * @flags - which timestamp to return and in what form
   *
   * Depending on the flags used to create the queue pair/completion
   * queue, different timestamps might be available.  Callers should
   * specify which timestamp they are interested in using the flags
   * element, and if they wish either a cooked or raw timestamp.  A
   * raw timestamp is implementation defined and will be passed back
   * in the tv_nsec portion of the struct timespec.  A raw timestamp
   * can not be relied upon to have any ordering value between more
   * than one HCA or driver.  A cooked timestamp will return a valid
   * struct timespec normalized as closely as possible to the return
   * value for CLOCK_MONOTONIC of clock_gettime at the time of the
   * timestamp.
   */
  int ibv_get_timestamp(struct ibv_wc *wc, struct timespec *ts, int
  flags);
 
 
 We wanted to divide the flow here:
 In create_cq, the user notifies the kernel/HCA which timestamp he
 would like to get.

Correction, which timestamp*s*.

 It could be a completion timestamp, a start of WQE timestamp or
 whatever he wants.
 The timestamp the user gets in the WQE is opaque. Every vendor could
 implement it
 as it wants - in order to have minimal implication in performance.

Again, timestamp(s).

 The second part is ibv_get_timestamp. It gets an opaque timestamp

No.  As you've already pointed out, how each vendor implements returning
the timestamp(s) could be totally different.  There are no timestamp
entries in the existing wc struct.  Expecting the user to pass the raw
value to the ibv_get_timestamp function makes no sense and violates the
attempted abstraction of ibverbs.  Passing in the wc struct allows the
driver to internally allocate a wc struct with extra private elements
and pass that back to the user, when the user passes it back to
ibv_get_timestamp the elements are there in the private portion of the
struct.

  and
 outputs a converted value in respect to the time the user wanted to get.
 For example, if IB_TIMESTAMP_NS_TIME is given, the function should output
 a system-wide NS value (we would like to implement this only in the future).
 Currently, only RAW and RAW_NS will be supported, while RAW gives the 

RE: Current patch statuses

2015-06-01 Thread Alex Vainman
Hi Doug,

Thanks for the feedback.

The ability to have RSS via hashed receives and multiple flows is a start,
now you just need to add pinning the flows to specific sockets that are
adjacent to the PCI bus the controller is on and populating the number
of threads to use for RSS based upon the number of cores/threads per
socket.
I think that the proposed API and the overall solution allows that.

I'm not sure I like the specific
implementation.  It puts a lot of new, complex configuration on the
application.  
I think that it is important that verbs API, that comes to provide direct 
access to HW will explicitly expose RSS settings and allow verbs applications 
to adjust RSS configuration to their needs in order to get the 100% performance
benefit from the HW capabilities.
The ability to control RSS hash function and the Indirection table properties
can be important for different fast packet processing solutions:
- For example bump in the wire applications can prefer symmetric 
  RSS so exposing interface that allows to configure the hash function is
  valuable.
-The applications can know which traffic type it will handle and adjust 
  which packets fields should participate in the RSS hash to better 
  distribute the traffic.
-Exposing API that allows to set RSS hash properties and RSS Indirection Table
 allows better verbs integration with such frameworks as DPDK. 

Notice that we do think that more high level RSS API that requires less 
changes
in application level should be supported by rdma_cm.

On other words, I would like to see an application
get the benefit of this without having to be recoded.  Maybe they only
get a default benefit and not an optimized benefit, and maybe the user
has to set an environment variable for libibverbs before it gets enabled
by default, but I would still like to see a non-recompiled app benefit
from this.
I am not sure that I understand how does the “no recoding” 
approach can work:
what “default” benefit can get application that works in polling mode? 
In polling mode application threads affinities define which CPU core are 
utilized for completions and packet processing. 
For example if application decides to open a single thread that polls CQ
and process completions I don’t see how is it possible to better distribute
the processing between CPU core without changing the application.
For Interrupt driven applications we can get some partial “default” benefit:
we can load balance Interrupt handler and “bottom half” part of 
packet processing but not the processing that is done in user space. 
Still this requires API changes or maybe adding some new environment
parameter, as you have mentioned, since application on CQ creation provides
comp_vector and running completion event handler on different cores 
(as the result of RSS) actually breaks this API.

To summarize, in my opinion verbs is the right place to expose RSS interface 
that gives verbs application full control over different RSS properties 
and more high level and simple RSS API can be exposed via rdma_cm.
This will allow from one side to answer the needs of solutions that requires
the ability to directly access and configure the HW, 
as user space NICs solutions, and on other side this answer the needs 
of applications that look for more high level API.

Thanks,
AlexV

-Original Message-
From: Doug Ledford [mailto:dledf...@redhat.com] 
Sent: Saturday, May 30, 2015 12:57 AM
To: RDMA mailing list
Cc: Weiny, Ira; Matan Barak; Or Gerlitz; Haggai Eran; Yishai Hadas; Alex Vainman
Subject: Current patch statuses

* Ira's 3 patch const cleanup: generally approved, waiting latest
version
* Ira's 14 patch OPA set: I haven't reviewed yet, waiting for next
version
* Matan's 14 patch RoCE GID set: Changes were requested, awaiting next
version
* Or's 11 patch timestamp set: The general idea is OK, creating an
extension is OK, but I think the actual uAPI needs nailed down a little
more.  Right now it's too vendor/model/driver specific for a general
uAPI.
* Haggai's 12 patch cma namespace set: Serious concerns over the
suitability of creating more than one link per unique guid/pkey.
Requesting that Haggai investigate using alias GUIDs for containers
instead.
* Or's 3 patch SRIOV set: both changes and coordination with
netdev/iproute2 folks requested, awaiting results of that interaction
and update for 8byte GUID management
* Yishai's 3 patch hot removal set: changes in locking requested
* Alex's Verbs RSS RFC: I read through the proposal, but I didn't
research it in enough detail to provide high quality feedback.  But I
will give a little low quality feedback.  First the good: I'm generally
receptive to anything that improves our NUMA operation.  While I didn't
see that the RSS necessarily took NUMA into consideration, the overall
framework looked like a good starting point for doing exactly that.  The
ability to have RSS via hashed receives and multiple flows is a start,
now you just need to add pinning the flows to 

Re: [PATCH for-next V2 6/9] IB/core: Pass hardware specific data in query_device

2015-06-01 Thread Devesh Sharma
ocrdma part Looks good.

Reviewed-By: Devesh Sharma devesh.sha...@avagotech.com

On Sun, May 31, 2015 at 5:44 PM, Or Gerlitz ogerl...@mellanox.com wrote:
 From: Matan Barak mat...@mellanox.com

 Vendors should be able to pass vendor specific data to/from
 user-space via query_device uverb. In order to do this,
 we need to pass the vendors' specific udata.

 Signed-off-by: Matan Barak mat...@mellanox.com
 Signed-off-by: Or Gerlitz ogerl...@mellanox.com
 ---
  drivers/infiniband/core/device.c |4 +++-
  drivers/infiniband/core/uverbs_cmd.c |2 +-
  drivers/infiniband/hw/amso1100/c2_provider.c |7 +--
  drivers/infiniband/hw/cxgb3/iwch_provider.c  |8 ++--
  drivers/infiniband/hw/cxgb4/provider.c   |8 ++--
  drivers/infiniband/hw/ehca/ehca_hca.c|6 +-
  drivers/infiniband/hw/ehca/ehca_iverbs.h |3 ++-
  drivers/infiniband/hw/ipath/ipath_verbs.c|7 +--
  drivers/infiniband/hw/mlx4/main.c|6 +-
  drivers/infiniband/hw/mlx5/main.c|9 +++--
  drivers/infiniband/hw/mthca/mthca_provider.c |7 +--
  drivers/infiniband/hw/nes/nes_verbs.c|6 +-
  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |6 +-
  drivers/infiniband/hw/ocrdma/ocrdma_verbs.h  |3 ++-
  drivers/infiniband/hw/qib/qib_verbs.c|6 --
  drivers/infiniband/hw/usnic/usnic_ib_verbs.c |6 +-
  drivers/infiniband/hw/usnic/usnic_ib_verbs.h |3 ++-
  include/rdma/ib_verbs.h  |3 ++-
  18 files changed, 75 insertions(+), 25 deletions(-)

 diff --git a/drivers/infiniband/core/device.c 
 b/drivers/infiniband/core/device.c
 index 568cb41..694bd66 100644
 --- a/drivers/infiniband/core/device.c
 +++ b/drivers/infiniband/core/device.c
 @@ -539,9 +539,11 @@ EXPORT_SYMBOL(ib_dispatch_event);
  int ib_query_device(struct ib_device *device,
 struct ib_device_attr *device_attr)
  {
 +   struct ib_udata uhw = {.outlen = 0, .inlen = 0};
 +
 memset(device_attr, 0, sizeof(*device_attr));

 -   return device-query_device(device, device_attr);
 +   return device-query_device(device, device_attr, uhw);
  }
  EXPORT_SYMBOL(ib_query_device);

 diff --git a/drivers/infiniband/core/uverbs_cmd.c 
 b/drivers/infiniband/core/uverbs_cmd.c
 index 11ee298..bbb02ff 100644
 --- a/drivers/infiniband/core/uverbs_cmd.c
 +++ b/drivers/infiniband/core/uverbs_cmd.c
 @@ -3428,7 +3428,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file 
 *file,

 memset(attr, 0, sizeof(attr));

 -   err = device-query_device(device, attr);
 +   err = device-query_device(device, attr, uhw);
 if (err)
 return err;

 diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c 
 b/drivers/infiniband/hw/amso1100/c2_provider.c
 index a43e022..382f109 100644
 --- a/drivers/infiniband/hw/amso1100/c2_provider.c
 +++ b/drivers/infiniband/hw/amso1100/c2_provider.c
 @@ -63,13 +63,16 @@
  #include c2_provider.h
  #include c2_user.h

 -static int c2_query_device(struct ib_device *ibdev,
 -  struct ib_device_attr *props)
 +static int c2_query_device(struct ib_device *ibdev, struct ib_device_attr 
 *props,
 +  struct ib_udata *uhw)
  {
 struct c2_dev *c2dev = to_c2dev(ibdev);

 pr_debug(%s:%u\n, __func__, __LINE__);

 +   if (uhw-inlen || uhw-outlen)
 +   return -EINVAL;
 +
 *props = c2dev-props;
 return 0;
  }
 diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
 b/drivers/infiniband/hw/cxgb3/iwch_provider.c
 index 2eaf7e8..c4b5936 100644
 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
 +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
 @@ -1150,13 +1150,17 @@ static u64 fw_vers_string_to_u64(struct iwch_dev 
 *iwch_dev)
(fw_mic  0x);
  }

 -static int iwch_query_device(struct ib_device *ibdev,
 -struct ib_device_attr *props)
 +static int iwch_query_device(struct ib_device *ibdev, struct ib_device_attr 
 *props,
 +struct ib_udata *uhw)
  {

 struct iwch_dev *dev;
 +
 PDBG(%s ibdev %p\n, __func__, ibdev);

 +   if (uhw-inlen || uhw-outlen)
 +   return -EINVAL;
 +
 dev = to_iwch_dev(ibdev);
 memset(props, 0, sizeof *props);
 memcpy(props-sys_image_guid, dev-rdev.t3cdev_p-lldev-dev_addr, 
 6);
 diff --git a/drivers/infiniband/hw/cxgb4/provider.c 
 b/drivers/infiniband/hw/cxgb4/provider.c
 index ef08a9f..05a96a5 100644
 --- a/drivers/infiniband/hw/cxgb4/provider.c
 +++ b/drivers/infiniband/hw/cxgb4/provider.c
 @@ -301,13 +301,17 @@ static int c4iw_query_gid(struct ib_device *ibdev, u8 
 port, int index,
 return 0;
  }

 -static int c4iw_query_device(struct ib_device *ibdev,
 -struct ib_device_attr *props)
 +static int c4iw_query_device(struct ib_device *ibdev, struct ib_device_attr 
 

Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-01 Thread Matan Barak
On Sun, May 31, 2015 at 10:00 PM, Doug Ledford dledf...@redhat.com wrote:
 On Sun, 2015-05-31 at 15:14 +0300, Or Gerlitz wrote:
 Hi Doug,

 This patchset adds completion timestamping supports for verbs consumers.

 Reviewing the weekend threads, we've changed the flag time to reflect
 that this is completion time-stamp and folded the mlx4 actual support
 into one patch.

 Regarding the related user-space support, it's possible to add what you
 were suggesting, ibv_get_raw_cqe_timestamp() -- returns ts in cycles and
 ibv_get_cqe_timestamp() -- returns ts in ns, this makes the value returned
 by the poll cq verb an opaque one that must go through one of  the 
 convertors.

 We would to go for one helper ibv_get_timestamp(uint64_t raw_time, flag) 
 which
 could get the raw time-stamp and one of the following flags: RAW_TIME, 
 RAW_NS_TIME.

 I'm theoretically OK with something similar to the above.  However, the
 NS time should not be raw.  It should be cooked and should be able to be
 valid to compare between different adapters.  Right now, the cycle
 counter that you are exposing is only useful for ordering between
 packets received on a single adapter where the cycle counter is the same
 on all packets.  Throw in a different vendor's card, or two of your own
 cards, and the issue gets much more complex.  The cooked value should be
 an actual, real time that can be used across these more complex
 environments.  Because of that, it really shouldn't be called RAW.


Thanks for the feedback Doug.
We wanted to add RAW_NS in order to free the user from calculating it by himself
(dividing the cycles value in the core_clock).
In addition to this, it's possible to implement a future NS_TIME
(without the raw), which
will convert the opaque time to system wide ns.

 So, if you want a single entry point, I would suggest something like
 this:

 enum ib_timestamp_flags {
 IB_TIMESTAMP_COMPLETION = (1  0), // specify on create_cq
 IB_TIMESTAMP_WQE_BEGIN =  (1  1), // specify on create qp?
 IB_TIMESTAMP_WQE_END =(1  2), // specify on create qp?
 IB_TIMESTAMP_RAW =(1  31)
 };

 enum ib_cq_creation_flags {
 IB_CQ_FLAGS_TIMESTAMP_COMPLETION = (1  0)
 };

 /**
  * ibv_get_timestamp - Return the requested timestamp for the given wc
  * @wc - work completion to get timestamp results from
  * @ts - struct timespec to return timestamp in
  * @flags - which timestamp to return and in what form
  *
  * Depending on the flags used to create the queue pair/completion
  * queue, different timestamps might be available.  Callers should
  * specify which timestamp they are interested in using the flags
  * element, and if they wish either a cooked or raw timestamp.  A
  * raw timestamp is implementation defined and will be passed back
  * in the tv_nsec portion of the struct timespec.  A raw timestamp
  * can not be relied upon to have any ordering value between more
  * than one HCA or driver.  A cooked timestamp will return a valid
  * struct timespec normalized as closely as possible to the return
  * value for CLOCK_MONOTONIC of clock_gettime at the time of the
  * timestamp.
  */
 int ibv_get_timestamp(struct ibv_wc *wc, struct timespec *ts, int
 flags);


We wanted to divide the flow here:
In create_cq, the user notifies the kernel/HCA which timestamp he
would like to get.
It could be a completion timestamp, a start of WQE timestamp or
whatever he wants.
The timestamp the user gets in the WQE is opaque. Every vendor could
implement it
as it wants - in order to have minimal implication in performance.

The second part is ibv_get_timestamp. It gets an opaque timestamp and
outputs a converted value in respect to the time the user wanted to get.
For example, if IB_TIMESTAMP_NS_TIME is given, the function should output
a system-wide NS value (we would like to implement this only in the future).
Currently, only RAW and RAW_NS will be supported, while RAW gives the time
in cycles and RAW_NS gives a NS value with an unknown time reference.

We think ibv_get_timestamp shouldn't get a wqe but a 64bit opaque value.
The reason for this is that it could be used in order to translate query_values
current time to different types of timestamp.
What do you think?

 We think this would address the reviewer comments for the kernel submission.

 The user-space code is in (still uses IB_CQ_FLAGS_TIMESTAMP and miss the
 conversion functions)

  https://github.com/matanb10/libibverbs timestamp-v1
  https://github.com/matanb10/libmlx4 timestamp-v1

 Timestamping is used by applications in order to know when a WQE was
 received/transmitted by the HW. The value is given is HCA hardware cycles,
 but could be easily converted as the hardware's core clock frequecny is
 available through extension of query device.

 Moreover, we add an ability to read the HCA's current clock. This could be
 useful on order to synchronize events to the wall clock.

 This functionality is achieved by adding/extending the 

Re: [PATCH for-next V2 1/9] IB/core: Change provider's API of create_cq to be extendible

2015-06-01 Thread Devesh Sharma
Looks good.

Reviewed-By: Devesh Sharma devesh.sha...@avagotech.com

On Sun, May 31, 2015 at 5:44 PM, Or Gerlitz ogerl...@mellanox.com wrote:
 From: Matan Barak mat...@mellanox.com

 Add a new ib_cq_init_attr structure which contains the
 previous cqe (minimum number of CQ entries) and comp_vector
 (completion vector) in addition to a new flags field.
 All vendors' create_cq callbacks are changed in order
 to work with the new API.

 This commit does not change any functionality.

 Signed-off-by: Matan Barak mat...@mellanox.com
 Signed-off-by: Or Gerlitz ogerl...@mellanox.com
 ---
  drivers/infiniband/core/uverbs_cmd.c |6 --
  drivers/infiniband/core/verbs.c  |3 ++-
  drivers/infiniband/hw/amso1100/c2_provider.c |7 ++-
  drivers/infiniband/hw/cxgb3/iwch_provider.c  |   11 ---
  drivers/infiniband/hw/cxgb4/cq.c |9 +++--
  drivers/infiniband/hw/cxgb4/iw_cxgb4.h   |8 
  drivers/infiniband/hw/ehca/ehca_cq.c |7 ++-
  drivers/infiniband/hw/ehca/ehca_iverbs.h |3 ++-
  drivers/infiniband/hw/ipath/ipath_cq.c   |9 +++--
  drivers/infiniband/hw/ipath/ipath_verbs.h|3 ++-
  drivers/infiniband/hw/mlx4/cq.c  |8 +++-
  drivers/infiniband/hw/mlx4/mlx4_ib.h |3 ++-
  drivers/infiniband/hw/mlx5/cq.c  |   10 --
  drivers/infiniband/hw/mlx5/main.c|3 ++-
  drivers/infiniband/hw/mlx5/mlx5_ib.h |5 +++--
  drivers/infiniband/hw/mthca/mthca_provider.c |8 ++--
  drivers/infiniband/hw/nes/nes_verbs.c|   11 ---
  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |7 ++-
  drivers/infiniband/hw/ocrdma/ocrdma_verbs.h  |6 --
  drivers/infiniband/hw/qib/qib_cq.c   |   11 ---
  drivers/infiniband/hw/qib/qib_verbs.h|5 +++--
  drivers/infiniband/hw/usnic/usnic_ib_verbs.c |   10 +++---
  drivers/infiniband/hw/usnic/usnic_ib_verbs.h |7 ---
  include/rdma/ib_verbs.h  |   10 --
  24 files changed, 124 insertions(+), 46 deletions(-)

 diff --git a/drivers/infiniband/core/uverbs_cmd.c 
 b/drivers/infiniband/core/uverbs_cmd.c
 index a9f0489..1954ebb 100644
 --- a/drivers/infiniband/core/uverbs_cmd.c
 +++ b/drivers/infiniband/core/uverbs_cmd.c
 @@ -1341,6 +1341,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
 struct ib_uverbs_event_file*ev_file = NULL;
 struct ib_cq   *cq;
 int ret;
 +   struct ib_cq_init_attr attr = {};

 if (out_len  sizeof resp)
 return -ENOSPC;
 @@ -1376,8 +1377,9 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file,
 INIT_LIST_HEAD(obj-comp_list);
 INIT_LIST_HEAD(obj-async_list);

 -   cq = file-device-ib_dev-create_cq(file-device-ib_dev, cmd.cqe,
 -cmd.comp_vector,
 +   attr.cqe = cmd.cqe;
 +   attr.comp_vector = cmd.comp_vector;
 +   cq = file-device-ib_dev-create_cq(file-device-ib_dev, attr,
  file-ucontext, udata);
 if (IS_ERR(cq)) {
 ret = PTR_ERR(cq);
 diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
 index 685a362..f7615d4 100644
 --- a/drivers/infiniband/core/verbs.c
 +++ b/drivers/infiniband/core/verbs.c
 @@ -1078,8 +1078,9 @@ struct ib_cq *ib_create_cq(struct ib_device *device,
void *cq_context, int cqe, int comp_vector)
  {
 struct ib_cq *cq;
 +   struct ib_cq_init_attr attr = {.cqe = cqe, .comp_vector = 
 comp_vector};

 -   cq = device-create_cq(device, cqe, comp_vector, NULL, NULL);
 +   cq = device-create_cq(device, attr, NULL, NULL);

 if (!IS_ERR(cq)) {
 cq-device= device;
 diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c 
 b/drivers/infiniband/hw/amso1100/c2_provider.c
 index d396c39..a43e022 100644
 --- a/drivers/infiniband/hw/amso1100/c2_provider.c
 +++ b/drivers/infiniband/hw/amso1100/c2_provider.c
 @@ -286,13 +286,18 @@ static int c2_destroy_qp(struct ib_qp *ib_qp)
 return 0;
  }

 -static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, int 
 vector,
 +static struct ib_cq *c2_create_cq(struct ib_device *ibdev,
 + const struct ib_cq_init_attr *attr,
   struct ib_ucontext *context,
   struct ib_udata *udata)
  {
 +   int entries = attr-cqe;
 struct c2_cq *cq;
 int err;

 +   if (attr-flags)
 +   return ERR_PTR(-EINVAL);
 +
 cq = kmalloc(sizeof(*cq), GFP_KERNEL);
 if (!cq) {
 pr_debug(%s: Unable to allocate CQ\n, __func__);
 diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
 b/drivers/infiniband/hw/cxgb3/iwch_provider.c
 index 

Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-01 Thread Jason Gunthorpe
On Mon, Jun 01, 2015 at 07:25:04AM -0400, Doug Ledford wrote:

 attempted abstraction of ibverbs.  Passing in the wc struct allows the
 driver to internally allocate a wc struct with extra private elements
 and pass that back to the user, when the user passes it back to
 ibv_get_timestamp the elements are there in the private portion of the
 struct.

wc structures are allocated by the caller, there is no option for the
driver to create private elements.

AFAIK, Christoph's use case is essentially the only meaningful use
case for this feature, generalizing too much may destroy the
performance that is valuable here.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-01 Thread Doug Ledford
On Mon, 2015-06-01 at 10:43 -0600, Jason Gunthorpe wrote:
 On Mon, Jun 01, 2015 at 07:25:04AM -0400, Doug Ledford wrote:
 
  attempted abstraction of ibverbs.  Passing in the wc struct allows the
  driver to internally allocate a wc struct with extra private elements
  and pass that back to the user, when the user passes it back to
  ibv_get_timestamp the elements are there in the private portion of the
  struct.
 
 wc structures are allocated by the caller, there is no option for the
 driver to create private elements.

You're right, the data would have to be housed somewhere in the driver
private completion structs (for example, in the CQE the card posts to
memory that the driver then massages into a WC).

 AFAIK, Christoph's use case is essentially the only meaningful use
 case for this feature, generalizing too much may destroy the
 performance that is valuable here.

I'm not convinced of that.  Steve has already spoke up about the
timestamps available in cxgb4.  Those are very different and yet still
highly valuable to someone investigating performance of their RDMA
application.

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD



signature.asc
Description: This is a digitally signed message part


Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-01 Thread Doug Ledford
On Mon, 2015-06-01 at 08:58 -0500, Christoph Lameter wrote:
 On Mon, 1 Jun 2015, Doug Ledford wrote:
 
  What's the point?  If it's raw, it's raw.  It's not coordinated between
  adapters.  Whether it's in ns or ps or flipflops doesn't matter, it's a
  flat number that has no reference to anything else, so the only thing
  that matters is  another version of itself or not.
 
 It can be coordinated between different adapter through the use of time
 software that can work with cycles and frequencies to scale the value of
 the cycles to realtime.

And that is precisely what the cooked values should be.

  Software like that is available in ptpd,
 timekeeper etc. Each NIC basically has its own clock and the timekeeping
 software would have to track the scaling and the aberration factor over
 time in order to come up with accurate absolute time values derived from
 the cycle counters of these NICs. Since we are dealing here with values
 that need to be accurate to within less than 100ns this is not trivial and
 one can easily get a ns value that is absolutely useless.

Agreed.  The cooked value is not going to be a simple thing.  I fail to
see how this is making a case that we should duplicate that code in
every app that uses a timestamp versus getting it right once in
libibverbs.

 Since it is not trivial its better kept out of the timestamp support in
 the RDMA API. If the app developer wants a trivial conversion then they
 can opencode a simple multiplication by the frequency. At that point it
 should be clear though that this raw time value is of limited use given
 its inaccuracy  and the dependence on the NIC clock.

The raw value is just that: raw.  And it *is* of limited use unless you
only have one adapter.

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD



signature.asc
Description: This is a digitally signed message part


Re: [PATCH for-next V2 8/9] IB/mlx4: Support extended create_cq and query_device uverbs

2015-06-01 Thread Jason Gunthorpe
On Sun, May 31, 2015 at 03:14:16PM +0300, Or Gerlitz wrote:
 From: Matan Barak mat...@mellanox.com
 
 Add support for ib_uverbs_ex_create_cq and ib_uverbs_ex_query_device
 by setting the appropriate bit in uverbs_ex_cmd_mask.

Why is this a seperate patch? Surely the bits should be or'd in the patches
that actually include the code to do the new commands?

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 2/9] IB/core: Change ib_create_cq to use struct ib_cq_init_attr

2015-06-01 Thread Jason Gunthorpe
On Sun, May 31, 2015 at 03:14:10PM +0300, Or Gerlitz wrote:

 + struct ib_cq_init_attr cq_attr;
  
   /* Create new device info */
   port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL);
 @@ -2943,9 +2944,11 @@ static int ib_mad_port_open(struct ib_device *device,
   if (has_smi)
   cq_size *= 2;
  
 + memset(cq_attr, 0, sizeof(cq_attr));
 + cq_attr.cqe = cq_size;

Why does this patch switch to using memset when the prior patch used
= {} ?

 @@ -1075,12 +1075,11 @@ EXPORT_SYMBOL(ib_destroy_qp);
  struct ib_cq *ib_create_cq(struct ib_device *device,
  ib_comp_handler comp_handler,
  void (*event_handler)(struct ib_event *, void *),
 -void *cq_context, int cqe, int comp_vector)
 +void *cq_context, struct ib_cq_init_attr *cq_attr)
  {
   struct ib_cq *cq;
 - struct ib_cq_init_attr attr = {.cqe = cqe, .comp_vector = comp_vector};
  
 - cq = device-create_cq(device, attr, NULL, NULL);
 + cq = device-create_cq(device, cq_attr, NULL, NULL);

How does this compile without warnings?

The prior patch did:

-   struct ib_cq * (*create_cq)(struct ib_device *device, int 
cqe,
-   int comp_vector,
+   struct ib_cq * (*create_cq)(struct ib_device *device,
+   const struct ib_cq_init_attr 
*attr,
struct ib_ucontext *context,
struct ib_udata *udata);

Otherwise looks OK.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 1/9] IB/core: Change provider's API of create_cq to be extendible

2015-06-01 Thread Jason Gunthorpe
On Sun, May 31, 2015 at 03:14:09PM +0300, Or Gerlitz wrote:
 From: Matan Barak mat...@mellanox.com
 
 Add a new ib_cq_init_attr structure which contains the
 previous cqe (minimum number of CQ entries) and comp_vector
 (completion vector) in addition to a new flags field.
 All vendors' create_cq callbacks are changed in order
 to work with the new API.
 
 This commit does not change any functionality.
 
 Signed-off-by: Matan Barak mat...@mellanox.com
 Signed-off-by: Or Gerlitz ogerl...@mellanox.com

Reviewed-By: Jason Gunthorpe jguntho...@obsidianresearch.com

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

2015-06-01 Thread Hefty, Sean
 We want to have a time stamp when the action is complete and the data is
 available to the application or the send action is complete and the CQ
 entry can be reused.

This is what polling the completion from the CQ tells you, independent of there 
being a time stamp.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-01 Thread Christoph Lameter
On Mon, 1 Jun 2015, Doug Ledford wrote:

 What's the point?  If it's raw, it's raw.  It's not coordinated between
 adapters.  Whether it's in ns or ps or flipflops doesn't matter, it's a
 flat number that has no reference to anything else, so the only thing
 that matters is  another version of itself or not.

It can be coordinated between different adapter through the use of time
software that can work with cycles and frequencies to scale the value of
the cycles to realtime. Software like that is available in ptpd,
timekeeper etc. Each NIC basically has its own clock and the timekeeping
software would have to track the scaling and the aberration factor over
time in order to come up with accurate absolute time values derived from
the cycle counters of these NICs. Since we are dealing here with values
that need to be accurate to within less than 100ns this is not trivial and
one can easily get a ns value that is absolutely useless.

Since it is not trivial its better kept out of the timestamp support in
the RDMA API. If the app developer wants a trivial conversion then they
can opencode a simple multiplication by the frequency. At that point it
should be clear though that this raw time value is of limited use given
its inaccuracy  and the dependence on the NIC clock.



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Add const to various core functions

2015-06-01 Thread Hal Rosenstock
On 5/31/2015 5:15 PM, ira.we...@intel.com wrote:
 From: Ira Weiny ira.we...@intel.com
 
 In order to support some of Jason's comments to add const to the functions I 
 am
 adding for OPA the following patches clean up the call trees of the currently
 defined functions.
 
 They stand on their own as clean up patches.  Therefore, I am submitting them
 separately from the OPA patches.
 
 ---
 Changes from V1:
   Clean up 0-day build
   Clean up commit messages
 
 Ira Weiny (3):
   IB/core cleanup: Add const to RDMA helpers
   IB/core cleanup: Add const on args - device-process_mad
   IB/core cleanup: Add const to args - agent_send_response
 
  drivers/infiniband/core/addr.c|4 +-
  drivers/infiniband/core/agent.c   |8 ++--
  drivers/infiniband/core/agent.h   |4 +-
  drivers/infiniband/core/cache.c   |8 ++--
  drivers/infiniband/core/verbs.c   |9 +++--
  drivers/infiniband/hw/amso1100/c2_provider.c  |6 ++--
  drivers/infiniband/hw/cxgb3/iwch_provider.c   |6 ++--
  drivers/infiniband/hw/cxgb4/provider.c|5 ++-
  drivers/infiniband/hw/ehca/ehca_iverbs.h  |4 +-
  drivers/infiniband/hw/ehca/ehca_sqp.c |   14 
  drivers/infiniband/hw/ipath/ipath_mad.c   |8 ++--
  drivers/infiniband/hw/ipath/ipath_verbs.h |6 ++--
  drivers/infiniband/hw/mlx4/mad.c  |   21 +++--
  drivers/infiniband/hw/mlx4/mlx4_ib.h  |8 ++--
  drivers/infiniband/hw/mlx5/mad.c  |8 ++--
  drivers/infiniband/hw/mlx5/mlx5_ib.h  |8 ++--
  drivers/infiniband/hw/mthca/mthca_cmd.c   |4 +-
  drivers/infiniband/hw/mthca/mthca_cmd.h   |4 +-
  drivers/infiniband/hw/mthca/mthca_dev.h   |6 ++--
  drivers/infiniband/hw/mthca/mthca_mad.c   |   10 +++---
  drivers/infiniband/hw/nes/nes_verbs.c |4 +-
  drivers/infiniband/hw/ocrdma/ocrdma_ah.c  |6 ++--
  drivers/infiniband/hw/ocrdma/ocrdma_ah.h  |6 ++--
  drivers/infiniband/hw/qib/qib_mad.c   |   10 +++---
  drivers/infiniband/hw/qib/qib_verbs.h |4 +-
  drivers/net/ethernet/mellanox/mlx5/core/mad.c |2 +-
  include/linux/mlx5/driver.h   |2 +-
  include/rdma/ib_addr.h|6 ++--
  include/rdma/ib_cache.h   |8 ++--
  include/rdma/ib_verbs.h   |   39 
 +
  30 files changed, 121 insertions(+), 117 deletions(-)

Reviewed-by: Hal Rosenstock h...@mellanox.com
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Add const to various core functions

2015-06-01 Thread Jason Gunthorpe
On Sun, May 31, 2015 at 05:15:28PM -0400, ira.we...@intel.com wrote:
 From: Ira Weiny ira.we...@intel.com
 
 In order to support some of Jason's comments to add const to the functions I 
 am
 adding for OPA the following patches clean up the call trees of the currently
 defined functions.

This is longer than I imagined, but looks good.

Reviewed-By: Jason Gunthorpe jguntho...@obsidianresearch.com

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 0/9] Add completion timestamping support

2015-06-01 Thread Jason Gunthorpe
On Mon, Jun 01, 2015 at 01:00:57PM -0400, Doug Ledford wrote:
  case for this feature, generalizing too much may destroy the
  performance that is valuable here.
 
 I'm not convinced of that.  Steve has already spoke up about the
 timestamps available in cxgb4.  Those are very different and yet still
 highly valuable to someone investigating performance of their RDMA
 application.

? cxgb4 looks nearly identical to me. There is only one HW time stamp
'cqe_sge_ts', which occurs at some point in the flow, and is written
the CQE. The current cycle counter can be read from SGE_TIMESTAMP_LO
registers. Same as mlx4, really.

The rest is just bookkeeping and logging that doesn't require special
verbs support for an app to implement.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] libibverbs init.c: conditionally emit warning if no userspace driver found

2015-06-01 Thread Jeff Squyres (jsquyres)
On May 22, 2015, at 9:44 AM, Doug Ledford dledf...@redhat.com wrote:
 
 Did that happen yet?
 
 I don't think so.  I didn't file a specific ticket for it at k.o yet
 (the k.o tickets take a while to process, so I didn't want to file it
 until after the comment period here on list).

Ping.

This is just a periodic query to see if there has been any progress on 
accepting this patch into libibverbs.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html