Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
On Fri, 29 May 2015, Doug Ledford wrote: The use case is clear IMO. It's for financial trading software. I don't think they really care about details like whether it's the start or end packet, or completion, or whatever. They need a tie breaker between when they have two different buy or sell orders on the same lot of stock. Any deterministic timing/ordering method will do as long as they consistently apply it I think. And the faster and lower overhead the process, the better. He doesn't really want a timestamp, he merely wants a sequence ordering. But a timestamp is what they are using to get him what he needs. Is that a fair guess Christoph? We want to have a time stamp when the action is complete and the data is available to the application or the send action is complete and the CQ entry can be reused. That is a well defined point and that is how the time stamps function in the existing implementation. This is an obvious understanding and its pretty clear. The time stamp needs to be at the end of the action because the timestamp is used to: 1. Assess the impact of network processing. This can be compared with packet timestamps from capture devices off the wire. 2. Delineate the start of packet processing in software. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
This is sufficient since it can be converted to ns or whatever one wants. It is sufficient for your use. It is not, however, a good API. I hate these foggy statements that you guys come up with. Why is it not a good API? Having a cycle counter without processiing overhead is a good thing and the way counters are handled is pretty well established. The API provides the abilty to retrieve the clock freq which is sufficient for the user to convert the value to meaningful time values. I would prefer if the access to the timestamp were implemented via a function in libibverbs (I haven't looked at the git repo, too little time, I'll get to it). Something like ibv_get_cqe_timestamp(). That function should be general and return a suitable, normalized value (ns probably). If you just want a simple comparator without the overhead of normalizing to time, and are willing to accept the consequences of that, then I would expect you to use something like That would introduce additional latencies and would make that feature no longer useful to us. The advantage of this approach is that the counter is in the same cacheline that is already used. It is very low overhead. ibv_get_raw_cqe_timestamp() to get the unadulterated cycle counter. For the most part, the user space application should not know details like we are using a cycle counter in the HCA processor for timestamping, Why not? A cycle counter is the general way of providing timestamps in hardware. RDTSC is such a cycle counter as well. There are numerous examples of counters like that. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
On Fri, 29 May 2015, Hefty, Sean wrote: What is the issue here? The timestamp is created when the processing is finished by the nic and the completion entry becomes available. The timestamp is generated when a work completion entry is written. If the= re's a clear use case for this, it hasn't been described. The use case you= mentioned only works if there is a 1:1 relationship between a packet and a= work completion. That is not what is being defined here. It does make sense to have a timestamp when the work described by a CQ has been completed. For that you do not need a 1:1 correspondence. My use case is focused on single packet processing because that is what we do here. It is exactly defined like any other cycle counters in hardware and it is exposed using an API that allows multiple vendors to use these cycle counters without regard to a particular vendors implementation. I disagree. This is associated with a specific implementation. It assumes= that the counter is part of a CQ entry, and that the counter is written wh= en the completion is written. There's nothing that requires other vendors = to follow that model, nor is it clear that this is a generic or useful enou= gh operation that other vendors would want to follow this model. Why not h= ave the time stamp record the start of the transaction? The end? Have two= stamps, once for the first packet, and one for the last? Limit this to si= ngle packet operations only? Why would you have a timestamp at the beginning of the transaction? That is useless because you can use packet capture devices to establish that timepoint. At that point you have not identified the CQ anyways. Having a timestamp when an action is complete makes sense, is generic and general. Adding the timestamp to the data means that the application now has to separate the metadata (timestamp) from the data stream. Not good. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next V2 0/9] Add completion timestamping support
On Mon, 2015-06-01 at 12:30 +0300, Matan Barak wrote: On Sun, May 31, 2015 at 10:00 PM, Doug Ledford dledf...@redhat.com wrote: On Sun, 2015-05-31 at 15:14 +0300, Or Gerlitz wrote: Hi Doug, This patchset adds completion timestamping supports for verbs consumers. Reviewing the weekend threads, we've changed the flag time to reflect that this is completion time-stamp and folded the mlx4 actual support into one patch. Regarding the related user-space support, it's possible to add what you were suggesting, ibv_get_raw_cqe_timestamp() -- returns ts in cycles and ibv_get_cqe_timestamp() -- returns ts in ns, this makes the value returned by the poll cq verb an opaque one that must go through one of the convertors. We would to go for one helper ibv_get_timestamp(uint64_t raw_time, flag) which could get the raw time-stamp and one of the following flags: RAW_TIME, RAW_NS_TIME. I'm theoretically OK with something similar to the above. However, the NS time should not be raw. It should be cooked and should be able to be valid to compare between different adapters. Right now, the cycle counter that you are exposing is only useful for ordering between packets received on a single adapter where the cycle counter is the same on all packets. Throw in a different vendor's card, or two of your own cards, and the issue gets much more complex. The cooked value should be an actual, real time that can be used across these more complex environments. Because of that, it really shouldn't be called RAW. Thanks for the feedback Doug. We wanted to add RAW_NS in order to free the user from calculating it by himself (dividing the cycles value in the core_clock). What's the point? If it's raw, it's raw. It's not coordinated between adapters. Whether it's in ns or ps or flipflops doesn't matter, it's a flat number that has no reference to anything else, so the only thing that matters is another version of itself or not. In addition to this, it's possible to implement a future NS_TIME (without the raw), which will convert the opaque time to system wide ns. So, if you want a single entry point, I would suggest something like this: enum ib_timestamp_flags { IB_TIMESTAMP_COMPLETION = (1 0), // specify on create_cq IB_TIMESTAMP_WQE_BEGIN = (1 1), // specify on create qp? IB_TIMESTAMP_WQE_END =(1 2), // specify on create qp? IB_TIMESTAMP_RAW =(1 31) }; enum ib_cq_creation_flags { IB_CQ_FLAGS_TIMESTAMP_COMPLETION = (1 0) }; /** * ibv_get_timestamp - Return the requested timestamp for the given wc * @wc - work completion to get timestamp results from * @ts - struct timespec to return timestamp in * @flags - which timestamp to return and in what form * * Depending on the flags used to create the queue pair/completion * queue, different timestamps might be available. Callers should * specify which timestamp they are interested in using the flags * element, and if they wish either a cooked or raw timestamp. A * raw timestamp is implementation defined and will be passed back * in the tv_nsec portion of the struct timespec. A raw timestamp * can not be relied upon to have any ordering value between more * than one HCA or driver. A cooked timestamp will return a valid * struct timespec normalized as closely as possible to the return * value for CLOCK_MONOTONIC of clock_gettime at the time of the * timestamp. */ int ibv_get_timestamp(struct ibv_wc *wc, struct timespec *ts, int flags); We wanted to divide the flow here: In create_cq, the user notifies the kernel/HCA which timestamp he would like to get. Correction, which timestamp*s*. It could be a completion timestamp, a start of WQE timestamp or whatever he wants. The timestamp the user gets in the WQE is opaque. Every vendor could implement it as it wants - in order to have minimal implication in performance. Again, timestamp(s). The second part is ibv_get_timestamp. It gets an opaque timestamp No. As you've already pointed out, how each vendor implements returning the timestamp(s) could be totally different. There are no timestamp entries in the existing wc struct. Expecting the user to pass the raw value to the ibv_get_timestamp function makes no sense and violates the attempted abstraction of ibverbs. Passing in the wc struct allows the driver to internally allocate a wc struct with extra private elements and pass that back to the user, when the user passes it back to ibv_get_timestamp the elements are there in the private portion of the struct. and outputs a converted value in respect to the time the user wanted to get. For example, if IB_TIMESTAMP_NS_TIME is given, the function should output a system-wide NS value (we would like to implement this only in the future). Currently, only RAW and RAW_NS will be supported, while RAW gives the
RE: Current patch statuses
Hi Doug, Thanks for the feedback. The ability to have RSS via hashed receives and multiple flows is a start, now you just need to add pinning the flows to specific sockets that are adjacent to the PCI bus the controller is on and populating the number of threads to use for RSS based upon the number of cores/threads per socket. I think that the proposed API and the overall solution allows that. I'm not sure I like the specific implementation. It puts a lot of new, complex configuration on the application. I think that it is important that verbs API, that comes to provide direct access to HW will explicitly expose RSS settings and allow verbs applications to adjust RSS configuration to their needs in order to get the 100% performance benefit from the HW capabilities. The ability to control RSS hash function and the Indirection table properties can be important for different fast packet processing solutions: - For example bump in the wire applications can prefer symmetric RSS so exposing interface that allows to configure the hash function is valuable. -The applications can know which traffic type it will handle and adjust which packets fields should participate in the RSS hash to better distribute the traffic. -Exposing API that allows to set RSS hash properties and RSS Indirection Table allows better verbs integration with such frameworks as DPDK. Notice that we do think that more high level RSS API that requires less changes in application level should be supported by rdma_cm. On other words, I would like to see an application get the benefit of this without having to be recoded. Maybe they only get a default benefit and not an optimized benefit, and maybe the user has to set an environment variable for libibverbs before it gets enabled by default, but I would still like to see a non-recompiled app benefit from this. I am not sure that I understand how does the “no recoding” approach can work: what “default” benefit can get application that works in polling mode? In polling mode application threads affinities define which CPU core are utilized for completions and packet processing. For example if application decides to open a single thread that polls CQ and process completions I don’t see how is it possible to better distribute the processing between CPU core without changing the application. For Interrupt driven applications we can get some partial “default” benefit: we can load balance Interrupt handler and “bottom half” part of packet processing but not the processing that is done in user space. Still this requires API changes or maybe adding some new environment parameter, as you have mentioned, since application on CQ creation provides comp_vector and running completion event handler on different cores (as the result of RSS) actually breaks this API. To summarize, in my opinion verbs is the right place to expose RSS interface that gives verbs application full control over different RSS properties and more high level and simple RSS API can be exposed via rdma_cm. This will allow from one side to answer the needs of solutions that requires the ability to directly access and configure the HW, as user space NICs solutions, and on other side this answer the needs of applications that look for more high level API. Thanks, AlexV -Original Message- From: Doug Ledford [mailto:dledf...@redhat.com] Sent: Saturday, May 30, 2015 12:57 AM To: RDMA mailing list Cc: Weiny, Ira; Matan Barak; Or Gerlitz; Haggai Eran; Yishai Hadas; Alex Vainman Subject: Current patch statuses * Ira's 3 patch const cleanup: generally approved, waiting latest version * Ira's 14 patch OPA set: I haven't reviewed yet, waiting for next version * Matan's 14 patch RoCE GID set: Changes were requested, awaiting next version * Or's 11 patch timestamp set: The general idea is OK, creating an extension is OK, but I think the actual uAPI needs nailed down a little more. Right now it's too vendor/model/driver specific for a general uAPI. * Haggai's 12 patch cma namespace set: Serious concerns over the suitability of creating more than one link per unique guid/pkey. Requesting that Haggai investigate using alias GUIDs for containers instead. * Or's 3 patch SRIOV set: both changes and coordination with netdev/iproute2 folks requested, awaiting results of that interaction and update for 8byte GUID management * Yishai's 3 patch hot removal set: changes in locking requested * Alex's Verbs RSS RFC: I read through the proposal, but I didn't research it in enough detail to provide high quality feedback. But I will give a little low quality feedback. First the good: I'm generally receptive to anything that improves our NUMA operation. While I didn't see that the RSS necessarily took NUMA into consideration, the overall framework looked like a good starting point for doing exactly that. The ability to have RSS via hashed receives and multiple flows is a start, now you just need to add pinning the flows to
Re: [PATCH for-next V2 6/9] IB/core: Pass hardware specific data in query_device
ocrdma part Looks good. Reviewed-By: Devesh Sharma devesh.sha...@avagotech.com On Sun, May 31, 2015 at 5:44 PM, Or Gerlitz ogerl...@mellanox.com wrote: From: Matan Barak mat...@mellanox.com Vendors should be able to pass vendor specific data to/from user-space via query_device uverb. In order to do this, we need to pass the vendors' specific udata. Signed-off-by: Matan Barak mat...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/core/device.c |4 +++- drivers/infiniband/core/uverbs_cmd.c |2 +- drivers/infiniband/hw/amso1100/c2_provider.c |7 +-- drivers/infiniband/hw/cxgb3/iwch_provider.c |8 ++-- drivers/infiniband/hw/cxgb4/provider.c |8 ++-- drivers/infiniband/hw/ehca/ehca_hca.c|6 +- drivers/infiniband/hw/ehca/ehca_iverbs.h |3 ++- drivers/infiniband/hw/ipath/ipath_verbs.c|7 +-- drivers/infiniband/hw/mlx4/main.c|6 +- drivers/infiniband/hw/mlx5/main.c|9 +++-- drivers/infiniband/hw/mthca/mthca_provider.c |7 +-- drivers/infiniband/hw/nes/nes_verbs.c|6 +- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |6 +- drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |3 ++- drivers/infiniband/hw/qib/qib_verbs.c|6 -- drivers/infiniband/hw/usnic/usnic_ib_verbs.c |6 +- drivers/infiniband/hw/usnic/usnic_ib_verbs.h |3 ++- include/rdma/ib_verbs.h |3 ++- 18 files changed, 75 insertions(+), 25 deletions(-) diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 568cb41..694bd66 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -539,9 +539,11 @@ EXPORT_SYMBOL(ib_dispatch_event); int ib_query_device(struct ib_device *device, struct ib_device_attr *device_attr) { + struct ib_udata uhw = {.outlen = 0, .inlen = 0}; + memset(device_attr, 0, sizeof(*device_attr)); - return device-query_device(device, device_attr); + return device-query_device(device, device_attr, uhw); } EXPORT_SYMBOL(ib_query_device); diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 11ee298..bbb02ff 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -3428,7 +3428,7 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, memset(attr, 0, sizeof(attr)); - err = device-query_device(device, attr); + err = device-query_device(device, attr, uhw); if (err) return err; diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index a43e022..382f109 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -63,13 +63,16 @@ #include c2_provider.h #include c2_user.h -static int c2_query_device(struct ib_device *ibdev, - struct ib_device_attr *props) +static int c2_query_device(struct ib_device *ibdev, struct ib_device_attr *props, + struct ib_udata *uhw) { struct c2_dev *c2dev = to_c2dev(ibdev); pr_debug(%s:%u\n, __func__, __LINE__); + if (uhw-inlen || uhw-outlen) + return -EINVAL; + *props = c2dev-props; return 0; } diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index 2eaf7e8..c4b5936 100644 --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c @@ -1150,13 +1150,17 @@ static u64 fw_vers_string_to_u64(struct iwch_dev *iwch_dev) (fw_mic 0x); } -static int iwch_query_device(struct ib_device *ibdev, -struct ib_device_attr *props) +static int iwch_query_device(struct ib_device *ibdev, struct ib_device_attr *props, +struct ib_udata *uhw) { struct iwch_dev *dev; + PDBG(%s ibdev %p\n, __func__, ibdev); + if (uhw-inlen || uhw-outlen) + return -EINVAL; + dev = to_iwch_dev(ibdev); memset(props, 0, sizeof *props); memcpy(props-sys_image_guid, dev-rdev.t3cdev_p-lldev-dev_addr, 6); diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c index ef08a9f..05a96a5 100644 --- a/drivers/infiniband/hw/cxgb4/provider.c +++ b/drivers/infiniband/hw/cxgb4/provider.c @@ -301,13 +301,17 @@ static int c4iw_query_gid(struct ib_device *ibdev, u8 port, int index, return 0; } -static int c4iw_query_device(struct ib_device *ibdev, -struct ib_device_attr *props) +static int c4iw_query_device(struct ib_device *ibdev, struct ib_device_attr
Re: [PATCH for-next V2 0/9] Add completion timestamping support
On Sun, May 31, 2015 at 10:00 PM, Doug Ledford dledf...@redhat.com wrote: On Sun, 2015-05-31 at 15:14 +0300, Or Gerlitz wrote: Hi Doug, This patchset adds completion timestamping supports for verbs consumers. Reviewing the weekend threads, we've changed the flag time to reflect that this is completion time-stamp and folded the mlx4 actual support into one patch. Regarding the related user-space support, it's possible to add what you were suggesting, ibv_get_raw_cqe_timestamp() -- returns ts in cycles and ibv_get_cqe_timestamp() -- returns ts in ns, this makes the value returned by the poll cq verb an opaque one that must go through one of the convertors. We would to go for one helper ibv_get_timestamp(uint64_t raw_time, flag) which could get the raw time-stamp and one of the following flags: RAW_TIME, RAW_NS_TIME. I'm theoretically OK with something similar to the above. However, the NS time should not be raw. It should be cooked and should be able to be valid to compare between different adapters. Right now, the cycle counter that you are exposing is only useful for ordering between packets received on a single adapter where the cycle counter is the same on all packets. Throw in a different vendor's card, or two of your own cards, and the issue gets much more complex. The cooked value should be an actual, real time that can be used across these more complex environments. Because of that, it really shouldn't be called RAW. Thanks for the feedback Doug. We wanted to add RAW_NS in order to free the user from calculating it by himself (dividing the cycles value in the core_clock). In addition to this, it's possible to implement a future NS_TIME (without the raw), which will convert the opaque time to system wide ns. So, if you want a single entry point, I would suggest something like this: enum ib_timestamp_flags { IB_TIMESTAMP_COMPLETION = (1 0), // specify on create_cq IB_TIMESTAMP_WQE_BEGIN = (1 1), // specify on create qp? IB_TIMESTAMP_WQE_END =(1 2), // specify on create qp? IB_TIMESTAMP_RAW =(1 31) }; enum ib_cq_creation_flags { IB_CQ_FLAGS_TIMESTAMP_COMPLETION = (1 0) }; /** * ibv_get_timestamp - Return the requested timestamp for the given wc * @wc - work completion to get timestamp results from * @ts - struct timespec to return timestamp in * @flags - which timestamp to return and in what form * * Depending on the flags used to create the queue pair/completion * queue, different timestamps might be available. Callers should * specify which timestamp they are interested in using the flags * element, and if they wish either a cooked or raw timestamp. A * raw timestamp is implementation defined and will be passed back * in the tv_nsec portion of the struct timespec. A raw timestamp * can not be relied upon to have any ordering value between more * than one HCA or driver. A cooked timestamp will return a valid * struct timespec normalized as closely as possible to the return * value for CLOCK_MONOTONIC of clock_gettime at the time of the * timestamp. */ int ibv_get_timestamp(struct ibv_wc *wc, struct timespec *ts, int flags); We wanted to divide the flow here: In create_cq, the user notifies the kernel/HCA which timestamp he would like to get. It could be a completion timestamp, a start of WQE timestamp or whatever he wants. The timestamp the user gets in the WQE is opaque. Every vendor could implement it as it wants - in order to have minimal implication in performance. The second part is ibv_get_timestamp. It gets an opaque timestamp and outputs a converted value in respect to the time the user wanted to get. For example, if IB_TIMESTAMP_NS_TIME is given, the function should output a system-wide NS value (we would like to implement this only in the future). Currently, only RAW and RAW_NS will be supported, while RAW gives the time in cycles and RAW_NS gives a NS value with an unknown time reference. We think ibv_get_timestamp shouldn't get a wqe but a 64bit opaque value. The reason for this is that it could be used in order to translate query_values current time to different types of timestamp. What do you think? We think this would address the reviewer comments for the kernel submission. The user-space code is in (still uses IB_CQ_FLAGS_TIMESTAMP and miss the conversion functions) https://github.com/matanb10/libibverbs timestamp-v1 https://github.com/matanb10/libmlx4 timestamp-v1 Timestamping is used by applications in order to know when a WQE was received/transmitted by the HW. The value is given is HCA hardware cycles, but could be easily converted as the hardware's core clock frequecny is available through extension of query device. Moreover, we add an ability to read the HCA's current clock. This could be useful on order to synchronize events to the wall clock. This functionality is achieved by adding/extending the
Re: [PATCH for-next V2 1/9] IB/core: Change provider's API of create_cq to be extendible
Looks good. Reviewed-By: Devesh Sharma devesh.sha...@avagotech.com On Sun, May 31, 2015 at 5:44 PM, Or Gerlitz ogerl...@mellanox.com wrote: From: Matan Barak mat...@mellanox.com Add a new ib_cq_init_attr structure which contains the previous cqe (minimum number of CQ entries) and comp_vector (completion vector) in addition to a new flags field. All vendors' create_cq callbacks are changed in order to work with the new API. This commit does not change any functionality. Signed-off-by: Matan Barak mat...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com --- drivers/infiniband/core/uverbs_cmd.c |6 -- drivers/infiniband/core/verbs.c |3 ++- drivers/infiniband/hw/amso1100/c2_provider.c |7 ++- drivers/infiniband/hw/cxgb3/iwch_provider.c | 11 --- drivers/infiniband/hw/cxgb4/cq.c |9 +++-- drivers/infiniband/hw/cxgb4/iw_cxgb4.h |8 drivers/infiniband/hw/ehca/ehca_cq.c |7 ++- drivers/infiniband/hw/ehca/ehca_iverbs.h |3 ++- drivers/infiniband/hw/ipath/ipath_cq.c |9 +++-- drivers/infiniband/hw/ipath/ipath_verbs.h|3 ++- drivers/infiniband/hw/mlx4/cq.c |8 +++- drivers/infiniband/hw/mlx4/mlx4_ib.h |3 ++- drivers/infiniband/hw/mlx5/cq.c | 10 -- drivers/infiniband/hw/mlx5/main.c|3 ++- drivers/infiniband/hw/mlx5/mlx5_ib.h |5 +++-- drivers/infiniband/hw/mthca/mthca_provider.c |8 ++-- drivers/infiniband/hw/nes/nes_verbs.c| 11 --- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |7 ++- drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |6 -- drivers/infiniband/hw/qib/qib_cq.c | 11 --- drivers/infiniband/hw/qib/qib_verbs.h|5 +++-- drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 10 +++--- drivers/infiniband/hw/usnic/usnic_ib_verbs.h |7 --- include/rdma/ib_verbs.h | 10 -- 24 files changed, 124 insertions(+), 46 deletions(-) diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index a9f0489..1954ebb 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -1341,6 +1341,7 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file, struct ib_uverbs_event_file*ev_file = NULL; struct ib_cq *cq; int ret; + struct ib_cq_init_attr attr = {}; if (out_len sizeof resp) return -ENOSPC; @@ -1376,8 +1377,9 @@ ssize_t ib_uverbs_create_cq(struct ib_uverbs_file *file, INIT_LIST_HEAD(obj-comp_list); INIT_LIST_HEAD(obj-async_list); - cq = file-device-ib_dev-create_cq(file-device-ib_dev, cmd.cqe, -cmd.comp_vector, + attr.cqe = cmd.cqe; + attr.comp_vector = cmd.comp_vector; + cq = file-device-ib_dev-create_cq(file-device-ib_dev, attr, file-ucontext, udata); if (IS_ERR(cq)) { ret = PTR_ERR(cq); diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 685a362..f7615d4 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -1078,8 +1078,9 @@ struct ib_cq *ib_create_cq(struct ib_device *device, void *cq_context, int cqe, int comp_vector) { struct ib_cq *cq; + struct ib_cq_init_attr attr = {.cqe = cqe, .comp_vector = comp_vector}; - cq = device-create_cq(device, cqe, comp_vector, NULL, NULL); + cq = device-create_cq(device, attr, NULL, NULL); if (!IS_ERR(cq)) { cq-device= device; diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index d396c39..a43e022 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -286,13 +286,18 @@ static int c2_destroy_qp(struct ib_qp *ib_qp) return 0; } -static struct ib_cq *c2_create_cq(struct ib_device *ibdev, int entries, int vector, +static struct ib_cq *c2_create_cq(struct ib_device *ibdev, + const struct ib_cq_init_attr *attr, struct ib_ucontext *context, struct ib_udata *udata) { + int entries = attr-cqe; struct c2_cq *cq; int err; + if (attr-flags) + return ERR_PTR(-EINVAL); + cq = kmalloc(sizeof(*cq), GFP_KERNEL); if (!cq) { pr_debug(%s: Unable to allocate CQ\n, __func__); diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c index
Re: [PATCH for-next V2 0/9] Add completion timestamping support
On Mon, Jun 01, 2015 at 07:25:04AM -0400, Doug Ledford wrote: attempted abstraction of ibverbs. Passing in the wc struct allows the driver to internally allocate a wc struct with extra private elements and pass that back to the user, when the user passes it back to ibv_get_timestamp the elements are there in the private portion of the struct. wc structures are allocated by the caller, there is no option for the driver to create private elements. AFAIK, Christoph's use case is essentially the only meaningful use case for this feature, generalizing too much may destroy the performance that is valuable here. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next V2 0/9] Add completion timestamping support
On Mon, 2015-06-01 at 10:43 -0600, Jason Gunthorpe wrote: On Mon, Jun 01, 2015 at 07:25:04AM -0400, Doug Ledford wrote: attempted abstraction of ibverbs. Passing in the wc struct allows the driver to internally allocate a wc struct with extra private elements and pass that back to the user, when the user passes it back to ibv_get_timestamp the elements are there in the private portion of the struct. wc structures are allocated by the caller, there is no option for the driver to create private elements. You're right, the data would have to be housed somewhere in the driver private completion structs (for example, in the CQE the card posts to memory that the driver then massages into a WC). AFAIK, Christoph's use case is essentially the only meaningful use case for this feature, generalizing too much may destroy the performance that is valuable here. I'm not convinced of that. Steve has already spoke up about the timestamps available in cxgb4. Those are very different and yet still highly valuable to someone investigating performance of their RDMA application. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH for-next V2 0/9] Add completion timestamping support
On Mon, 2015-06-01 at 08:58 -0500, Christoph Lameter wrote: On Mon, 1 Jun 2015, Doug Ledford wrote: What's the point? If it's raw, it's raw. It's not coordinated between adapters. Whether it's in ns or ps or flipflops doesn't matter, it's a flat number that has no reference to anything else, so the only thing that matters is another version of itself or not. It can be coordinated between different adapter through the use of time software that can work with cycles and frequencies to scale the value of the cycles to realtime. And that is precisely what the cooked values should be. Software like that is available in ptpd, timekeeper etc. Each NIC basically has its own clock and the timekeeping software would have to track the scaling and the aberration factor over time in order to come up with accurate absolute time values derived from the cycle counters of these NICs. Since we are dealing here with values that need to be accurate to within less than 100ns this is not trivial and one can easily get a ns value that is absolutely useless. Agreed. The cooked value is not going to be a simple thing. I fail to see how this is making a case that we should duplicate that code in every app that uses a timestamp versus getting it right once in libibverbs. Since it is not trivial its better kept out of the timestamp support in the RDMA API. If the app developer wants a trivial conversion then they can opencode a simple multiplication by the frequency. At that point it should be clear though that this raw time value is of limited use given its inaccuracy and the dependence on the NIC clock. The raw value is just that: raw. And it *is* of limited use unless you only have one adapter. -- Doug Ledford dledf...@redhat.com GPG KeyID: 0E572FDD signature.asc Description: This is a digitally signed message part
Re: [PATCH for-next V2 8/9] IB/mlx4: Support extended create_cq and query_device uverbs
On Sun, May 31, 2015 at 03:14:16PM +0300, Or Gerlitz wrote: From: Matan Barak mat...@mellanox.com Add support for ib_uverbs_ex_create_cq and ib_uverbs_ex_query_device by setting the appropriate bit in uverbs_ex_cmd_mask. Why is this a seperate patch? Surely the bits should be or'd in the patches that actually include the code to do the new commands? Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next V2 2/9] IB/core: Change ib_create_cq to use struct ib_cq_init_attr
On Sun, May 31, 2015 at 03:14:10PM +0300, Or Gerlitz wrote: + struct ib_cq_init_attr cq_attr; /* Create new device info */ port_priv = kzalloc(sizeof *port_priv, GFP_KERNEL); @@ -2943,9 +2944,11 @@ static int ib_mad_port_open(struct ib_device *device, if (has_smi) cq_size *= 2; + memset(cq_attr, 0, sizeof(cq_attr)); + cq_attr.cqe = cq_size; Why does this patch switch to using memset when the prior patch used = {} ? @@ -1075,12 +1075,11 @@ EXPORT_SYMBOL(ib_destroy_qp); struct ib_cq *ib_create_cq(struct ib_device *device, ib_comp_handler comp_handler, void (*event_handler)(struct ib_event *, void *), -void *cq_context, int cqe, int comp_vector) +void *cq_context, struct ib_cq_init_attr *cq_attr) { struct ib_cq *cq; - struct ib_cq_init_attr attr = {.cqe = cqe, .comp_vector = comp_vector}; - cq = device-create_cq(device, attr, NULL, NULL); + cq = device-create_cq(device, cq_attr, NULL, NULL); How does this compile without warnings? The prior patch did: - struct ib_cq * (*create_cq)(struct ib_device *device, int cqe, - int comp_vector, + struct ib_cq * (*create_cq)(struct ib_device *device, + const struct ib_cq_init_attr *attr, struct ib_ucontext *context, struct ib_udata *udata); Otherwise looks OK. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next V2 1/9] IB/core: Change provider's API of create_cq to be extendible
On Sun, May 31, 2015 at 03:14:09PM +0300, Or Gerlitz wrote: From: Matan Barak mat...@mellanox.com Add a new ib_cq_init_attr structure which contains the previous cqe (minimum number of CQ entries) and comp_vector (completion vector) in addition to a new flags field. All vendors' create_cq callbacks are changed in order to work with the new API. This commit does not change any functionality. Signed-off-by: Matan Barak mat...@mellanox.com Signed-off-by: Or Gerlitz ogerl...@mellanox.com Reviewed-By: Jason Gunthorpe jguntho...@obsidianresearch.com Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device
We want to have a time stamp when the action is complete and the data is available to the application or the send action is complete and the CQ entry can be reused. This is what polling the completion from the CQ tells you, independent of there being a time stamp. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next V2 0/9] Add completion timestamping support
On Mon, 1 Jun 2015, Doug Ledford wrote: What's the point? If it's raw, it's raw. It's not coordinated between adapters. Whether it's in ns or ps or flipflops doesn't matter, it's a flat number that has no reference to anything else, so the only thing that matters is another version of itself or not. It can be coordinated between different adapter through the use of time software that can work with cycles and frequencies to scale the value of the cycles to realtime. Software like that is available in ptpd, timekeeper etc. Each NIC basically has its own clock and the timekeeping software would have to track the scaling and the aberration factor over time in order to come up with accurate absolute time values derived from the cycle counters of these NICs. Since we are dealing here with values that need to be accurate to within less than 100ns this is not trivial and one can easily get a ns value that is absolutely useless. Since it is not trivial its better kept out of the timestamp support in the RDMA API. If the app developer wants a trivial conversion then they can opencode a simple multiplication by the frequency. At that point it should be clear though that this raw time value is of limited use given its inaccuracy and the dependence on the NIC clock. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Add const to various core functions
On 5/31/2015 5:15 PM, ira.we...@intel.com wrote: From: Ira Weiny ira.we...@intel.com In order to support some of Jason's comments to add const to the functions I am adding for OPA the following patches clean up the call trees of the currently defined functions. They stand on their own as clean up patches. Therefore, I am submitting them separately from the OPA patches. --- Changes from V1: Clean up 0-day build Clean up commit messages Ira Weiny (3): IB/core cleanup: Add const to RDMA helpers IB/core cleanup: Add const on args - device-process_mad IB/core cleanup: Add const to args - agent_send_response drivers/infiniband/core/addr.c|4 +- drivers/infiniband/core/agent.c |8 ++-- drivers/infiniband/core/agent.h |4 +- drivers/infiniband/core/cache.c |8 ++-- drivers/infiniband/core/verbs.c |9 +++-- drivers/infiniband/hw/amso1100/c2_provider.c |6 ++-- drivers/infiniband/hw/cxgb3/iwch_provider.c |6 ++-- drivers/infiniband/hw/cxgb4/provider.c|5 ++- drivers/infiniband/hw/ehca/ehca_iverbs.h |4 +- drivers/infiniband/hw/ehca/ehca_sqp.c | 14 drivers/infiniband/hw/ipath/ipath_mad.c |8 ++-- drivers/infiniband/hw/ipath/ipath_verbs.h |6 ++-- drivers/infiniband/hw/mlx4/mad.c | 21 +++-- drivers/infiniband/hw/mlx4/mlx4_ib.h |8 ++-- drivers/infiniband/hw/mlx5/mad.c |8 ++-- drivers/infiniband/hw/mlx5/mlx5_ib.h |8 ++-- drivers/infiniband/hw/mthca/mthca_cmd.c |4 +- drivers/infiniband/hw/mthca/mthca_cmd.h |4 +- drivers/infiniband/hw/mthca/mthca_dev.h |6 ++-- drivers/infiniband/hw/mthca/mthca_mad.c | 10 +++--- drivers/infiniband/hw/nes/nes_verbs.c |4 +- drivers/infiniband/hw/ocrdma/ocrdma_ah.c |6 ++-- drivers/infiniband/hw/ocrdma/ocrdma_ah.h |6 ++-- drivers/infiniband/hw/qib/qib_mad.c | 10 +++--- drivers/infiniband/hw/qib/qib_verbs.h |4 +- drivers/net/ethernet/mellanox/mlx5/core/mad.c |2 +- include/linux/mlx5/driver.h |2 +- include/rdma/ib_addr.h|6 ++-- include/rdma/ib_cache.h |8 ++-- include/rdma/ib_verbs.h | 39 + 30 files changed, 121 insertions(+), 117 deletions(-) Reviewed-by: Hal Rosenstock h...@mellanox.com -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Add const to various core functions
On Sun, May 31, 2015 at 05:15:28PM -0400, ira.we...@intel.com wrote: From: Ira Weiny ira.we...@intel.com In order to support some of Jason's comments to add const to the functions I am adding for OPA the following patches clean up the call trees of the currently defined functions. This is longer than I imagined, but looks good. Reviewed-By: Jason Gunthorpe jguntho...@obsidianresearch.com Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next V2 0/9] Add completion timestamping support
On Mon, Jun 01, 2015 at 01:00:57PM -0400, Doug Ledford wrote: case for this feature, generalizing too much may destroy the performance that is valuable here. I'm not convinced of that. Steve has already spoke up about the timestamps available in cxgb4. Those are very different and yet still highly valuable to someone investigating performance of their RDMA application. ? cxgb4 looks nearly identical to me. There is only one HW time stamp 'cqe_sge_ts', which occurs at some point in the flow, and is written the CQE. The current cycle counter can be read from SGE_TIMESTAMP_LO registers. Same as mlx4, really. The rest is just bookkeeping and logging that doesn't require special verbs support for an app to implement. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] libibverbs init.c: conditionally emit warning if no userspace driver found
On May 22, 2015, at 9:44 AM, Doug Ledford dledf...@redhat.com wrote: Did that happen yet? I don't think so. I didn't file a specific ticket for it at k.o yet (the k.o tickets take a while to process, so I didn't want to file it until after the comment period here on list). Ping. This is just a periodic query to see if there has been any progress on accepting this patch into libibverbs. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html